

Sanskrit Datasets
1. Shrimad Bhagavad Gita
Short Description: A structured, chapter-wise dataset of the Śrīmad Bhagavad Gītā with expanded verse counts, enabling fine-grained analysis and modeling of each śloka.
2. Devi Bhagavatam
Dataset Structure: Each record (CSV/JSON) includes:
- skanda (string): Skanda number, e.g. "1"
- adhyaya_number (string): Adhyāya index, e.g. "१.१"
- adhyaya_title (string): Sanskrit chapter title
- a_index (int): auxiliary sequence index
- m_index (int): main sequence index
- text (string): full śloka text
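As a minimal sketch of how the fields above might be read into typed records, the example below parses CSV rows with Python's standard library. It assumes the CSV header uses exactly the field names listed; the `DeviBhagavatamRecord` class, the `parse_records` helper, and the sample row with its placeholder title and text are hypothetical and not taken from the dataset.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class DeviBhagavatamRecord:
    """One śloka row, mirroring the documented fields (hypothetical class)."""
    skanda: str
    adhyaya_number: str
    adhyaya_title: str
    a_index: int
    m_index: int
    text: str


def parse_records(csv_file):
    """Read rows from an open CSV file and convert the index fields to int."""
    return [
        DeviBhagavatamRecord(
            skanda=row["skanda"],
            adhyaya_number=row["adhyaya_number"],
            adhyaya_title=row["adhyaya_title"],
            a_index=int(row["a_index"]),
            m_index=int(row["m_index"]),
            text=row["text"],
        )
        for row in csv.DictReader(csv_file)
    ]


# Illustrative row only: the title and text are placeholders, not real data.
SAMPLE = (
    "skanda,adhyaya_number,adhyaya_title,a_index,m_index,text\n"
    "1,१.१,<title placeholder>,1,1,<śloka placeholder>\n"
)
records = parse_records(io.StringIO(SAMPLE))
```

Keeping `skanda` and `adhyaya_number` as strings follows the documented types, so Devanagari indices such as "१.१" pass through unchanged.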
Dataset Description:
This dataset contains a complete, structured representation of the Śrīmad Devī Bhāgavata Mahāpurāṇa in CSV format, broken down into Skandas, Adhyāyas, and individual ślokas. It is suited to NLP tasks such as feature extraction, classification, translation, summarization, and generation.
Size: ~18,702 ślokas
3. Shiv Mahapuran
Dataset Description:
This dataset contains a complete, structured representation of the Śiva Mahāpurāṇa (Śivapurāṇa) in CSV format. Data is organized into the seven surviving Saṃhitās, then Khaṇḍas, Adhyāyas, and individual ślokas, enabling precise NLP work on classical Sanskrit scripture.
Size: ~24,489 ślokas
Dataset Structure: Each record (CSV/JSON) includes:
- samhita (string): Name of the Saṃhitā, e.g. "Rudrasaṃhitā"
- khanda (string): Khanda name, e.g. "Parvati kand"
- khanda_number (string): Khanda index, e.g. "1"
- adhyay (string): Adhyāya title or number, e.g. "1.1"
- shloka_number (int): Position of the śloka within the Adhyāya
- shloka_text (string): Full Sanskrit text of the śloka
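One way to use this hierarchy is to count ślokas per Adhyāya. The sketch below assumes the CSV header matches the field names listed above; the `count_shlokas_per_adhyaya` helper is hypothetical, and the sample rows reuse the example values from the field list with placeholder text rather than real ślokas.

```python
import csv
import io
from collections import Counter


def count_shlokas_per_adhyaya(csv_file):
    """Count rows for each (samhita, khanda_number, adhyay) combination."""
    counts = Counter()
    for row in csv.DictReader(csv_file):
        key = (row["samhita"], row["khanda_number"], row["adhyay"])
        counts[key] += 1
    return counts


# Illustrative rows only: shloka_text holds placeholders, not real data.
SAMPLE = (
    "samhita,khanda,khanda_number,adhyay,shloka_number,shloka_text\n"
    "Rudrasaṃhitā,Parvati kand,1,1.1,1,<placeholder>\n"
    "Rudrasaṃhitā,Parvati kand,1,1.1,2,<placeholder>\n"
    "Rudrasaṃhitā,Parvati kand,1,1.2,1,<placeholder>\n"
)
counts = count_shlokas_per_adhyaya(io.StringIO(SAMPLE))
```

Keying on the Saṃhitā and Khanda as well as the Adhyāya avoids merging chapters that share a number across different parts of the text.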
4. Shiv Puran OCR (Image-Text)
Dataset Description:
A dataset of cropped śloka images from the Vidyeśvara-saṃhitā, each paired with its transcribed text. Suited to training or evaluating OCR systems on classical Sanskrit script.
Contents:
- 734 cropped śloka images
- A CSV mapping each image filename to its corresponding śloka text
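A minimal sketch of pairing each image with its transcription is shown below. The column names `filename` and `text`, the `images` directory, the `load_pairs` helper, and the sample file name are all assumptions made for illustration; check the actual CSV header before relying on them.

```python
import csv
import io
from pathlib import Path


def load_pairs(csv_file, image_dir, filename_col="filename", text_col="text"):
    """Return (image_path, shloka_text) pairs from the mapping CSV.

    The column names are assumed; pass the real ones if they differ.
    """
    image_dir = Path(image_dir)
    return [
        (image_dir / row[filename_col], row[text_col])
        for row in csv.DictReader(csv_file)
    ]


# Illustrative row only: the file name and text are placeholders.
SAMPLE = "filename,text\nimg_0001.png,<śloka placeholder>\n"
pairs = load_pairs(io.StringIO(SAMPLE), "images")
```

Each pair can then be passed to an image loader and a tokenizer of your choice when building an OCR training or evaluation set.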
5. Shiv Puran OCR (Object Detection)
Dataset Description:
Annotations and imagery for training object detection models that distinguish śloka from non-śloka content in scanned scripture pages. Once detected, ślokas can be cropped for OCR or parallel corpus creation.
Annotation Structure:
- Pages 0–102: Vidyeśvara Saṃhitā (manually annotated)
- Pages 103–463: Rudra Saṃhitā (model-inferred + manual corrections)
- Pages 464–508: Śat Rudra Saṃhitā (model-inferred + manual corrections)
Data Includes: Bounding-box coordinates and metadata for each detected region.
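Because annotation quality differs by page range, it can help to look up which Saṃhitā a page belongs to and how it was annotated. The sketch below encodes the three ranges listed above; the `PageRange` type and `lookup_page` helper are hypothetical names, and the page numbers are assumed to be the zero-based indices used in the list.

```python
from typing import NamedTuple, Optional


class PageRange(NamedTuple):
    first: int       # first page index, inclusive
    last: int        # last page index, inclusive
    samhita: str
    annotation: str  # how the annotations were produced


# Ranges copied from the annotation structure described above.
PAGE_RANGES = [
    PageRange(0, 102, "Vidyeśvara Saṃhitā", "manually annotated"),
    PageRange(103, 463, "Rudra Saṃhitā", "model-inferred + manual corrections"),
    PageRange(464, 508, "Śat Rudra Saṃhitā", "model-inferred + manual corrections"),
]


def lookup_page(page: int) -> Optional[PageRange]:
    """Return the range containing `page`, or None if it is out of bounds."""
    for page_range in PAGE_RANGES:
        if page_range.first <= page <= page_range.last:
            return page_range
    return None
```

For example, `lookup_page(50).annotation` returns "manually annotated", which could be used to restrict evaluation to the fully hand-labelled pages.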
