Sanskrit Datasets

1. Shrimad Bhagavad Gita

Dataset on Hugging Face

Short Description: A structured, chapter-wise dataset of the Śrīmad Bhagavad Gītā with expanded verse counts, enabling fine-grained analysis and modeling of each śloka.

2. Devi Bhagavatam

Dataset on Hugging Face

Dataset Structure: Each record (CSV/JSON) includes:

  • skanda (string): Skanda number, e.g. "1"
  • adhyaya_number (string): Adhyāya index, e.g. "१.१"
  • adhyaya_title (string): Sanskrit chapter title
  • a_index (int): auxiliary sequence index
  • m_index (int): main sequence index
  • text (string): full śloka text
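
The field list above can be sketched as a typed record. This is a minimal, illustrative example, not the dataset's own loader: the class name `DeviBhagavatamRecord`, the helper `parse_records`, the presence of a header row, and the placeholder values in `SAMPLE_CSV` are all assumptions made for the sketch.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class DeviBhagavatamRecord:
    """One śloka row, mirroring the fields listed above (hypothetical class)."""
    skanda: str
    adhyaya_number: str
    adhyaya_title: str
    a_index: int
    m_index: int
    text: str


def parse_records(csv_text: str) -> list[DeviBhagavatamRecord]:
    """Parse CSV text with a header row into typed records."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        DeviBhagavatamRecord(
            skanda=row["skanda"],
            adhyaya_number=row["adhyaya_number"],
            adhyaya_title=row["adhyaya_title"],
            a_index=int(row["a_index"]),  # CSV stores numbers as text
            m_index=int(row["m_index"]),
            text=row["text"],
        )
        for row in reader
    ]


# Placeholder row for illustration only; not real dataset content.
SAMPLE_CSV = (
    "skanda,adhyaya_number,adhyaya_title,a_index,m_index,text\n"
    "1,१.१,PLACEHOLDER_TITLE,1,1,PLACEHOLDER_SHLOKA\n"
)

records = parse_records(SAMPLE_CSV)
print(records[0].skanda, records[0].adhyaya_number)
```

Keeping `skanda` and `adhyaya_number` as strings follows the field types given above; note that the Adhyāya index uses Devanagari digits, so it should not be cast to a number directly.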

Dataset Description:
This dataset contains a complete, structured representation of the Śrīmad Devī-bhāgavata Mahāpurāṇa in CSV format, breaking the scripture down into Skandas, Adhyāyas, and individual ślokas. It is suited to NLP tasks such as feature extraction, classification, translation, summarization, and generation.

Size: ~18,702 ślokas

3. Shiv Mahapuran

Dataset on Hugging Face

Dataset Description:
This dataset contains a complete, structured representation of the Śiva Mahāpurāṇa (Śivapurāṇa) in CSV format. Data is organized into the seven surviving Saṃhitās, then Khaṇḍas, Adhyāyas, and individual ślokas, enabling precise NLP work on classical Sanskrit scripture.

Size: ~24,489 ślokas

Dataset Structure: Each record (CSV/JSON) includes:

  • samhita (string): Name of the Saṃhitā, e.g. "Rudrasaṃhitā"
  • khanda (string): Khaṇḍa name, e.g. "Parvati kand"
  • khanda_number (string): Khaṇḍa index, e.g. "1"
  • adhyay (string): Adhyāya title or number, e.g. "1.1"
  • shloka_number (int): Position of the śloka within the Adhyāya
  • shloka_text (string): Full Sanskrit text of the śloka
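
As one illustration of how this hierarchy might be used, the sketch below counts ślokas per Adhyāya by grouping rows on `samhita`, `khanda_number`, and `adhyay`. The function name `count_shlokas_per_adhyaya` is hypothetical, and the sample rows use placeholder text rather than real dataset content.

```python
from collections import Counter


def count_shlokas_per_adhyaya(rows):
    """Count rows per (samhita, khanda_number, adhyay) key."""
    counts = Counter()
    for row in rows:
        key = (row["samhita"], row["khanda_number"], row["adhyay"])
        counts[key] += 1
    return counts


def make_row(adhyay, shloka_number):
    """Build a placeholder record with the fields listed above."""
    return {
        "samhita": "Rudrasaṃhitā",
        "khanda": "Parvati kand",
        "khanda_number": "1",
        "adhyay": adhyay,
        "shloka_number": shloka_number,
        "shloka_text": "PLACEHOLDER_SHLOKA",  # not real dataset text
    }


sample_rows = [make_row("1.1", 1), make_row("1.1", 2), make_row("1.2", 1)]
counts = count_shlokas_per_adhyaya(sample_rows)

for key, n in sorted(counts.items()):
    print(key, n)
```

The grouping key uses `khanda_number` rather than `khanda` on the assumption that the index is the more stable identifier; either field could be used if that assumption does not hold.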

4. Shiv Puran OCR (Image-Text)

Dataset on Hugging Face

Dataset Description:
A dataset of cropped śloka images from the Vidyeśvara-saṃhitā, each paired with its transcribed text. Suited to training or evaluating OCR systems on classical Sanskrit script.

Contents:

  • 734 cropped śloka images
  • A CSV mapping each image filename to its corresponding śloka text
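
A minimal sketch of reading such a mapping into image/text pairs follows. The column names `filename` and `text`, the helper `load_image_text_pairs`, and the sample filenames are assumptions for illustration; the actual CSV header may differ, so the column names are passed as parameters.

```python
import csv
import io


def load_image_text_pairs(csv_text, filename_col="filename", text_col="text"):
    """Return a dict mapping image filename to its śloka transcription.

    Column names are assumptions; adjust them to the real CSV header.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row[filename_col]: row[text_col] for row in reader}


# Placeholder mapping for illustration only; not real dataset content.
SAMPLE_MAPPING = (
    "filename,text\n"
    "example_0001.png,PLACEHOLDER_SHLOKA_1\n"
    "example_0002.png,PLACEHOLDER_SHLOKA_2\n"
)

pairs = load_image_text_pairs(SAMPLE_MAPPING)
for name, transcription in pairs.items():
    print(name, "->", transcription)
```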

5. Shiv Puran OCR (Object Detection)

Dataset on Hugging Face

Dataset Description:
Annotations and page images for training object detection models that distinguish śloka from non-śloka content in scanned scripture pages. Detected ślokas can then be cropped for OCR or parallel-corpus creation.

Annotation Structure:

  • Pages 0–102: Vidyeśvara Saṃhitā (manually annotated)
  • Pages 103–463: Rudra Saṃhitā (model-inferred + manual corrections)
  • Pages 464–508: Śat Rudra Saṃhitā (model-inferred + manual corrections)
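
The page ranges above can be expressed as a small lookup, which may help when splitting data by annotation method (for example, evaluating only on manually annotated pages). This is a sketch: `PAGE_RANGES` and `describe_page` are hypothetical names, and the ranges are assumed to be inclusive on both ends.

```python
# (first_page, last_page, samhita, annotation method), ranges assumed inclusive.
PAGE_RANGES = [
    (0, 102, "Vidyeśvara Saṃhitā", "manually annotated"),
    (103, 463, "Rudra Saṃhitā", "model-inferred + manual corrections"),
    (464, 508, "Śat Rudra Saṃhitā", "model-inferred + manual corrections"),
]


def describe_page(page):
    """Return (samhita, annotation_method) for a page index."""
    for first, last, samhita, method in PAGE_RANGES:
        if first <= page <= last:
            return samhita, method
    raise ValueError(f"page {page} is outside the annotated range 0-508")


print(describe_page(50))
print(describe_page(300))
```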

Data Includes: Bounding-box coordinates and metadata for each detected region.