Project Overview
This project applies advanced text analysis to the Mahabharata, following the Standard Text Analytic Data Model:
- F0: Raw text (from sacred-texts.com, Ganguli translation)
- F2: Parsing into structured tables (books, chapters, tokens)
- F3: Annotation with linguistic features (part-of-speech tags, term frequencies)
- F4: Vectorization (TF-IDF, Bag-of-Words)
- F5: Modeling (PCA, LDA, Word2Vec)
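As a rough illustration of how stages F2 and F4 fit together, here is a minimal stdlib-only sketch. The project itself uses NLTK and scikit-learn; the two-"chapter" corpus below is invented for illustration, and the TF-IDF uses the plain log(N/df) variant, not scikit-learn's smoothed formula.

```python
import math
from collections import Counter

# F2: parse raw text into a token table — one row per (chapter, position, term).
# The chapters here are invented stand-ins for the real corpus.
chapters = {
    "adi_1": "the king spoke to the sage about dharma",
    "adi_2": "the sage spoke of dharma and of war",
}
token_table = [
    (chap, pos, term)
    for chap, text in chapters.items()
    for pos, term in enumerate(text.lower().split())
]

# F4: TF-IDF weighting, treating each chapter as a document.
docs = {chap: [t for c, _, t in token_table if c == chap] for chap in chapters}
n_docs = len(docs)
doc_freq = Counter(term for terms in docs.values() for term in set(terms))

def tfidf(chap):
    """Term weights for one chapter: tf * log(N / df)."""
    tf = Counter(docs[chap])
    total = len(docs[chap])
    return {term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in tf.items()}

weights = tfidf("adi_1")
# Terms unique to a chapter get nonzero weight; terms shared by every
# chapter score zero under this unsmoothed variant.
```

The same token table then feeds F5: the per-document weight vectors are exactly what PCA or LDA consumes.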
Data Source
- Source: sacred-texts.com
- English translation by Kisari Mohan Ganguli (late 19th century)
- 18 books, 14.5 MB, plaintext
- 2.5 million tokens, 30,682 unique terms
- Hierarchical structure: book → chapter → section → paragraph → sentence → token
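One way to realize that hierarchy is to key every token by its full address in the book → chapter → … → token chain (an OHCO-style index). This is a hedged stdlib sketch; the project itself would use a pandas DataFrame with a MultiIndex, and the token values below are illustrative, not drawn from the corpus.

```python
# Each token is addressed by its position in the hierarchy.
OHCO = ("book", "chapter", "section", "paragraph", "sentence", "token_num")

# Toy token table: address tuple -> term (invented values).
tokens = {
    (1, 1, 1, 1, 1, 1): "om",
    (1, 1, 1, 1, 1, 2): "narayana",
    (1, 1, 1, 2, 1, 1): "vyasa",
}

def gather(level, value):
    """All terms whose OHCO address matches `value` at the given level."""
    i = OHCO.index(level)
    return [term for addr, term in sorted(tokens.items()) if addr[i] == value]

# Slicing at any level of the hierarchy recovers that unit's tokens:
first_paragraph = gather("paragraph", 1)
```

The payoff of this layout is that "give me all tokens of chapter 3" and "give me all tokens of sentence 5 of paragraph 2" are the same one-line query at different levels.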
Technical Stack
- Python, NLTK, scikit-learn, gensim, Jupyter, Plotly, t-SNE, LDA, PCA
Acknowledgments
- Prof. Rafael Alvarado - Text as Data (DS 5001), University of Virginia
- sacred-texts.com for the digital corpus
Future Directions
- Named Entity Recognition to map character relationships more systematically
- Network analysis of character co-occurrences
- Deeper temporal analysis of narrative progression
- Comparative analysis with other translations or epics
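Of these, the co-occurrence network is the cheapest to prototype. A minimal sketch, assuming characters have already been identified per chapter (the names and groupings below are invented, not extracted from the corpus):

```python
from collections import Counter
from itertools import combinations

# Characters mentioned in each chapter — toy data for illustration only.
chapter_characters = [
    {"arjuna", "krishna", "bhima"},
    {"arjuna", "krishna"},
    {"karna", "duryodhana"},
]

# Every unordered pair sharing a chapter becomes a weighted network edge;
# sorting makes the pair key order-independent.
edges = Counter()
for characters in chapter_characters:
    for pair in combinations(sorted(characters), 2):
        edges[pair] += 1
```

The resulting edge weights can be handed directly to a graph library such as NetworkX for centrality or community analysis.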
Contact
Vishwanath Guruvayur
Data Scientist & Mahabharata Enthusiast
[email protected]