Project Overview
This project applies advanced text analysis to the Mahabharata, following the Standard Text Analytic Data Model:
- F0: Raw text (from sacred-texts.com, Ganguli translation)
- F2: Parsing into structured tables (books, chapters, tokens)
- F3: Annotation with linguistic features (part-of-speech tags, term frequencies)
- F4: Vectorization (TF-IDF, Bag-of-Words)
- F5: Modeling (PCA, LDA, Word2Vec)
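As a rough illustration of how stages F2 and F4 fit together, here is a minimal stdlib-only sketch. The project itself uses NLTK and scikit-learn; the two-"chapter" corpus below is invented for illustration, and the TF-IDF uses the plain log(N/df) variant, not scikit-learn's smoothed formula.

```python
import math
from collections import Counter

# F2: parse raw text into a token table — one row per (chapter, position, term).
# The chapters here are invented stand-ins for the real corpus.
chapters = {
    "adi_1": "the king spoke to the sage about dharma",
    "adi_2": "the sage spoke of dharma and of war",
}
token_table = [
    (chap, pos, term)
    for chap, text in chapters.items()
    for pos, term in enumerate(text.lower().split())
]

# F4: TF-IDF weighting, treating each chapter as a document.
docs = {chap: [t for c, _, t in token_table if c == chap] for chap in chapters}
n_docs = len(docs)
doc_freq = Counter(term for terms in docs.values() for term in set(terms))

def tfidf(chap):
    """Term weights for one chapter: tf * log(N / df)."""
    tf = Counter(docs[chap])
    total = len(docs[chap])
    return {term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in tf.items()}

weights = tfidf("adi_1")
# Terms unique to a chapter get nonzero weight; terms shared by every
# chapter score zero under this unsmoothed variant.
```

The same token table then feeds F5: the per-document weight vectors are exactly what PCA or LDA consumes.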
Data Source
- Source: sacred-texts.com
- English translation by Kisari Mohan Ganguli (late 19th century)
- 18 books, 14.5 MB, plaintext
- 2.5 million tokens, 30,682 unique terms
- Hierarchical structure: book → chapter → section → paragraph → sentence → token
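One way to realize that hierarchy is to key every token by its full address in the book → chapter → … → token chain (an OHCO-style index). This is a hedged stdlib sketch; the project itself would use a pandas DataFrame with a MultiIndex, and the token values below are illustrative, not drawn from the corpus.

```python
# Each token is addressed by its position in the hierarchy.
OHCO = ("book", "chapter", "section", "paragraph", "sentence", "token_num")

# Toy token table: address tuple -> term (invented values).
tokens = {
    (1, 1, 1, 1, 1, 1): "om",
    (1, 1, 1, 1, 1, 2): "narayana",
    (1, 1, 1, 2, 1, 1): "vyasa",
}

def gather(level, value):
    """All terms whose OHCO address matches `value` at the given level."""
    i = OHCO.index(level)
    return [term for addr, term in sorted(tokens.items()) if addr[i] == value]

# Slicing at any level of the hierarchy recovers that unit's tokens:
first_paragraph = gather("paragraph", 1)
```

The payoff of this layout is that "give me all tokens of chapter 3" and "give me all tokens of sentence 5 of paragraph 2" are the same one-line query at different levels.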
Technical Stack
- Python, NLTK, scikit-learn, gensim, Jupyter, Plotly, t-SNE, LDA, PCA
Acknowledgments
- Prof. Rafael Alvarado - Text as Data (DS 5001), University of Virginia
- sacred-texts.com for the digital corpus
Future Directions
- Named Entity Recognition to map character relationships more systematically
- Network analysis of character co-occurrences
- Deeper temporal analysis of narrative progression
- Comparative analysis with other translations or epics
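Of these, the co-occurrence network is the cheapest to prototype. A minimal sketch, assuming characters have already been identified per chapter (the names and groupings below are invented, not extracted from the corpus):

```python
from collections import Counter
from itertools import combinations

# Characters mentioned in each chapter — toy data for illustration only.
chapter_characters = [
    {"arjuna", "krishna", "bhima"},
    {"arjuna", "krishna"},
    {"karna", "duryodhana"},
]

# Every unordered pair sharing a chapter becomes a weighted network edge;
# sorting makes the pair key order-independent.
edges = Counter()
for characters in chapter_characters:
    for pair in combinations(sorted(characters), 2):
        edges[pair] += 1
```

The resulting edge weights can be handed directly to a graph library such as NetworkX for centrality or community analysis.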
Contact
Vishwanath Guruvayur
Data Scientist & Mahabharata Enthusiast
[email protected]