Home • gauravpendharkar.dev

Hi, I am Gaurav Pendharkar!

A data scientist driven by a passion for building interpretable, reliable, and deployable ML systems using domain-specific data.

applied machine learning • generative AI • forecasting • intelligent document processing • object-oriented design • aviation analytics


About Me

My name is Gaurav Pendharkar. I am a data scientist with 1.5 years of experience developing machine learning pipelines for practical applications across various domains, including law, healthcare, earth sciences, and aviation. I have expertise in managing diverse data sources, including structured data (tables), semi-structured data (JSON, XML), and unstructured data (text, images, and PDFs). My focus is on building explainable, reliable machine learning systems through transparent modeling choices and rigorous evaluation in real-world environments.

Experience

Lamont-Doherty Earth Observatory, New York, NY

Data Scientist (Sep 2025 – Dec 2025)

Tech Stack: Python, scikit-learn, Weights & Biases, Vertex AI, Git

Engineered a tree-based ML system in Python and scikit-learn to estimate soil pH and organic matter from 242 samples, cutting lab-tested features by 87% and lab equipment costs by 25%.
Built a rocky-terrain classifier on topographic and vegetation features to control downstream models, boosting macro-averaged recall by 30% (0.56 to 0.72) and halving unreliable predictions in regions with few soil samples.
Orchestrated a GenAI workflow for soil pH regression using chain-of-thought prompting on Gemini 2.5 Flash, decreasing MAPE from 9% to 5% and enabling 25x faster experimentation with near-manual error margins.

University of Technology, Sydney (Remote)

Research Intern, Generative Artificial Intelligence (Sep 2023 – Feb 2024)

Tech Stack: Python, FastAPI, HuggingFace, PyTorch, JavaScript, Git

Worked with 2 researchers to extend a multilingual rich-text editor from one to three low-resource Indian languages using AI4Bharat models, increasing linguistic inclusivity across India by 36%.
Replaced speech recognition with word-by-word transliteration via IndicXlit and FastAPI, improving accuracy and cutting response time by 43% (~30s to ~17s).
Migrated from a limited Google Translate API integration to IndicTrans with FastAPI, improving translation quality and removing usage limits.

Vellore Institute of Technology, Chennai, India

Research Assistant, Natural Language Processing (May 2022 – Feb 2023)

Tech Stack: Python, pdfplumber, Selenium, spaCy, NLTK, Seaborn, Git

Automated acquisition of criminal case PDFs from the Manupatra legal database with Python and Selenium, expanding the dataset by 205% (455 to 1,388 documents) and saving 15+ hours of manual collection.
Fine-tuned spaCy's named entity recognizer for the LAW entity on 150 training examples, boosting the F1-score from 0.40 to 0.83 and yielding an 8% improvement in downstream case outcome prediction.
Built an NLP pipeline combining the fine-tuned NER model and regex to convert unstructured documents into a structured form, reducing manual processing time by 99.8% (3 months to 12 hours) and scaling throughput to 2 docs/min.