gauravpendharkar.dev gauravp.dev

Hi, I am Gaurav Pendharkar!

a data scientist driven by a passion for building interpretable, reliable, and deployable ML systems using domain-specific data.

applied machine learning, generative AI, forecasting, intelligent document processing, object-oriented design, aviation analytics

About Me

My name is Gaurav Pendharkar. I am a data scientist with 1.5 years of experience developing machine learning pipelines for practical applications across various domains, including law, healthcare, earth sciences, and aviation. I have expertise in managing diverse data sources, including structured data (tables), semi-structured data (JSON, XML), and unstructured data (text, images, and PDFs). My focus is on building explainable, reliable machine learning systems through transparent modeling choices and rigorous evaluation in real-world environments.

Experience

Lamont-Doherty Earth Observatory New York, NY
Data Scientist Sep 2025 – Dec 2025
Tech Stack: Python, Scikit-learn, Weights & Biases, Vertex AI, Git
  • Engineered a tree-based ML system in Python and scikit-learn to estimate soil pH and organic matter from 242 samples, reducing lab-tested features by 87% and lab equipment costs by 25%.
  • Built a rocky-terrain classifier based on topographic and vegetation features to control downstream models, boosting macro-avg recall by 30% (0.56 to 0.72), and halving unreliable predictions in low-soil-sample regions.
  • Orchestrated a GenAI workflow for soil pH regression, decreasing MAPE from 9% to 5%, enabling 25x faster experimentation with manual-like error margins through chain-of-thought prompting on Gemini 2.5 Flash.
University of Technology, Sydney Remote
Research Intern, Generative Artificial Intelligence Sep 2023 – Feb 2024
Tech Stack: Python, FastAPI, HuggingFace, PyTorch, JavaScript, Git
  • Worked with 2 researchers to enhance a multilingual rich-text editor by expanding support from one to three low-resource Indian languages, increasing linguistic inclusivity across India by 36% using AI4Bharat models.
  • Replaced speech recognition with word-by-word transliteration via IndicXlit and FastAPI, improving accuracy and cutting response time from ~30s to ~17s (a 43% reduction).
  • Migrated from a limited Google Translate API integration to IndicTrans with FastAPI, improving translation quality and removing usage limits.
Vellore Institute of Technology Chennai, India
Research Assistant, Natural Language Processing May 2022 – Feb 2023
Tech Stack: Python, PdfPlumber, Selenium, spaCy, NLTK, Seaborn, Git
  • Automated acquisition of criminal case PDFs from the Manupatra legal database with Python and Selenium, expanding the dataset by 205% (455 to 1388), saving 15+ hours of manual collection.
  • Fine-tuned the named entity recognizer's LAW entity on 150 training examples with spaCy, boosting the F1-score from 0.40 to 0.83 and yielding an 8% improvement in downstream case outcome prediction.
  • Built an NLP pipeline combining fine-tuned NER and regex to convert unstructured documents into a structured form, reducing manual processing time by 99.8% (3 months to 12 hours) and scaling throughput to 2 docs/min.

Recent Projects

Explainability Driven Chain-of-Thought Prompting

Automated reasoning for CoT prompting using explainability attributes from tree-based models for binary classification on tabular datasets.

Daily Sales Forecasting

Forecasting daily total sales of different gifting items using holiday data, promotional sales data, and other time-series features.

On-time Performance Analysis of NYC domestic flights

On-time performance analysis of domestic flights from NYC airports for the year 2023.

Arrival Delay Prediction for US domestic flights

Multiclass classification of arrival delays for NYC domestic flights using tree-based models.

Illumination Invariant Tiger Detection

Automating tiger detection in the wild by correcting illumination issues with EnlightenGAN.

Imbalanced Malware Byteplot Image Classification

Assessing the impact of class imbalance on model performance and convergence for malware byteplot image classification.

My Skills

programming languages
Python
SQL
R
Java
TypeScript
frameworks & tools
data handling and visualization
Pandas
Matplotlib
Seaborn
ggplot2
OpenCV
modeling
scikit-learn
PyTorch
TensorFlow
Weights & Biases
HuggingFace
spaCy
model deployment
Google Cloud Platform (GCP)
FastAPI
AI Assistants
Cursor
GitHub Copilot
Grammarly
miscellaneous
GitHub
LaTeX
Markdown
Selenium
Jupyter

Education

Columbia University New York, NY
Master of Science in Data Science Aug 2024 – Dec 2025
Vellore Institute of Technology Chennai, India
Bachelor of Technology in Computer Science with specialization in AI and ML Aug 2020 – Jul 2024