gauravpendharkar.dev gauravp.dev

Hi, I am Gaurav Pendharkar!

a data scientist driven by a passion for building interpretable, reliable, and deployable ML systems using domain-specific data.

applied machine learning, generative AI, forecasting, intelligent document processing, object oriented design, aviation analytics

About Me

My name is Gaurav Pendharkar. I am a data scientist with 1.5 years of experience developing machine learning pipelines for practical applications across various domains, including law, healthcare, earth sciences, and aviation. I have expertise in managing diverse data sources, including structured data (tables), semi-structured data (JSON, XML), and unstructured data (text, images, and PDFs). My focus is on building explainable, reliable machine learning systems through transparent modeling choices and rigorous evaluation in real-world environments.

Experience

Lamont-Doherty Earth Observatory New York, NY
Data Scientist Sep 2025 – Dec 2025
Tech Stack: Python, Scikit-learn, Weights & Biases, Vertex AI, Git
  • Collaborated with a cross-functional team of five data and soil scientists to design and implement an interpretable ML pipeline for estimating soil pH and soil organic matter, enabling data-driven soil health assessment.
  • Developed a rocky terrain binary classifier based on topographic and vegetation indices to gate downstream regression models; improved baseline macro-avg recall by 30% (from 0.557 to 0.723) via tree-based models.
  • Orchestrated a GenAI workflow for soil pH regression using chain-of-thought prompting with Gemini 2.5 Flash on Vertex AI Batch Inference; increased the R² score to 0.129, about 4x that of a tree-based baseline.
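As a rough illustration of the macro-averaged recall figure quoted for the rocky terrain classifier, the sketch below computes the metric by hand and checks the stated relative improvement. The toy labels and the helper names `macro_recall` and `relative_change` are assumptions made for this example; they are not taken from the project code.

```python
def macro_recall(y_true, y_pred):
    """Unweighted mean of per-class recall, so each class counts equally."""
    recalls = []
    for cls in sorted(set(y_true)):
        # Samples that truly belong to this class.
        support = sum(1 for t in y_true if t == cls)
        # Of those, how many were predicted as this class.
        hits = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        recalls.append(hits / support)
    return sum(recalls) / len(recalls)


def relative_change(before, after):
    """Relative change from `before` to `after`, as a fraction of `before`."""
    return (after - before) / before


# Hypothetical toy labels: 1 = rocky terrain, 0 = not rocky.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
toy_score = macro_recall(y_true, y_pred)

# Figures quoted in the bullet above: 0.557 -> 0.723.
reported_gain = relative_change(0.557, 0.723)
print(f"toy macro recall: {toy_score:.3f}, reported gain: {reported_gain:.1%}")
```

Macro averaging weights the minority class as heavily as the majority class, which is one reason it is often preferred over plain accuracy for a gating classifier.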
University of Technology, Sydney Remote
Research Intern, Generative Artificial Intelligence Sep 2023 – Feb 2024
Tech Stack: Python, FastAPI, HuggingFace, PyTorch, JavaScript, Git
  • Built a multilingual rich text editor to quantify human–AI coauthorship via AI suggestion acceptance rates.
  • Engineered FastAPI-based model-serving APIs for three NLP models (GPT2, IndicTrans, and IndicXlit), powering a web app used by study participants during controlled writing experiments.
  • Collected keystroke-level interaction logs; observed a 38% acceptance rate, indicating limited reliance on GPT2.
Vellore Institute of Technology Chennai, India
Research Assistant, Natural Language Processing May 2022 – Feb 2023
Tech Stack: Python, PdfPlumber, Selenium, spaCy, NLTK, Seaborn, Git
  • Developed an information extraction pipeline converting 1300+ unstructured Indian court records into structured data, enabling predictive modeling for faculty and researchers.
  • Expanded a legal document repository by 205% (455 to 1388) through web automation on the Manupatra database.
  • Fine-tuned the LAW entity in a generic spaCy NER model, roughly doubling the F1-score from 0.40 to 0.83, and combined pattern-based rules with ML-based NER for Indian legal PDFs.
  • Reduced manual work by 99.8% (from 3 months to 12 hours) while maintaining approximately 94% accuracy.
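One way to picture combining pattern-based rules with ML-based NER is the minimal sketch below. It is an assumption-laden illustration, not the pipeline described above: the regular expression `LAW_PATTERN`, the helpers `rule_spans` and `merge_spans`, the precedence policy (rule matches win on overlap), and the hard-coded model output are all hypothetical.

```python
import re

# Hypothetical pattern for statute-style mentions,
# e.g. "Section 302 of the Indian Penal Code, 1860".
LAW_PATTERN = re.compile(
    r"Section \d+[A-Z]? of the [A-Z][\w ]*?(?:Act|Code)(?:, \d{4})?"
)


def rule_spans(text):
    """Return (start, end, label) spans found by the pattern-based rule."""
    return [(m.start(), m.end(), "LAW") for m in LAW_PATTERN.finditer(text)]


def merge_spans(rules, model):
    """Keep every rule span; keep a model span only if it overlaps no rule span."""
    merged = list(rules)
    for start, end, label in model:
        if all(end <= r_start or start >= r_end for r_start, r_end, _ in rules):
            merged.append((start, end, label))
    return sorted(merged)


text = (
    "The accused was charged under Section 302 of the Indian Penal Code, 1860 "
    "by the Madras High Court."
)


def span_of(phrase, label):
    """Build a span for `phrase`, standing in for one model prediction."""
    start = text.index(phrase)
    return (start, start + len(phrase), label)


# Stand-in for ML-based NER output (hypothetical, not produced by a real model).
model_spans = [span_of("Indian Penal Code", "LAW"), span_of("Madras High Court", "ORG")]

merged = merge_spans(rule_spans(text), model_spans)
for start, end, label in merged:
    print(label, "->", text[start:end])
```

Here the partial model span "Indian Penal Code" is dropped in favor of the longer rule match, while the non-overlapping ORG span is kept. Other precedence policies, such as preferring the longest span regardless of source, would be equally plausible.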

Recent Projects

Explainability Driven Chain-of-Thought Prompting

Automated reasoning for CoT prompting using explainability attributes from tree-based models for binary classification on tabular datasets.

Daily Sales Forecasting

Forecasting daily total sales of different gifting items using holiday data, promotional sales data, and other time-series features.

On-time Performance Analysis of NYC domestic flights

On-time performance analysis of domestic flights from NYC airports for the year 2023.

Arrival Delay Prediction for US domestic flights

Multiclass classification of arrival delays for NYC domestic flights using tree-based models.

Illumination Invariant Tiger Detection

Automated detection of tigers in the wild, handling illumination issues with the help of EnlightenGAN.

Imbalanced Malware Byteplot Image Classification

Assessing the impact of class imbalance on model performance and convergence for malware byteplot image classification.

My Skills

programming languages
Python
SQL
R
Java
TypeScript
frameworks & tools
data handling and visualization
Pandas
Matplotlib
Seaborn
ggplot2
OpenCV
modeling
scikit-learn
PyTorch
TensorFlow
Weights & Biases
HuggingFace
spaCy
model deployment
Google Cloud Platform (GCP)
FastAPI
AI Assistants
Cursor
GitHub Copilot
Grammarly
miscellaneous
GitHub
LaTeX
Markdown
Selenium
Jupyter

Education

Columbia University New York, NY
Master of Science in Data Science Aug 2024 – Dec 2025
Vellore Institute of Technology Chennai, India
Bachelor of Technology in Computer Science with specialization in AI and ML Aug 2020 – Jul 2024