Hi, I'm Ayush. I'm an AI researcher and engineer based in New Delhi. I graduated from Manipal University Jaipur with a B.Tech in Information Technology and a minor specialization in Computational Intelligence.
Most recently, I worked on fingerprint spoof detection on edge devices and geometric deep learning at IIIT Delhi. My personal research interests lie in LLM interpretability, safety, trustworthiness, and reasoning: I enjoy looking under the hood of LLMs to understand how they work. Before that, I worked with large remote sensing datasets for crop yield prediction. I'm happy to collaborate if my research areas seem relevant to you, and I love helping people out with what I've learned from my work.
Additionally, in my free time I enjoy weightlifting and taking long walks to clear my head.
Language models are getting eerily good at generating step-by-step reasoning, but verifying whether those steps are logically sound is a completely different story. In this paper, we explore a strange quirk: why models sometimes scale negatively when forced to separate syntactic text generation (sounding smart) from true logical verification (being right). We map out exactly where these reasoning structures break down as you scale them up.
Predicting crop yields with AI usually fails when the climate shifts, because standard deep learning models overfit to historical weather patterns. I wrote this paper to show that forcing the model to respect actual biology - using evolutionary algorithms to select physical drivers like soil clay and vapor pressure - fixes this. We achieved predictive parity \((R^{2}\approx0.85)\) with massive CNN-LSTM ensembles while cutting input dimensionality by 99%. Basically, embedding physics beats brute-force data scaling.
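The evolutionary feature-selection idea is easy to sketch in isolation. This is a toy genetic algorithm over bit-mask feature subsets, not the paper's actual pipeline: the feature names, the scoring function, and all hyperparameters here are invented stand-ins (in the real setup, fitness would be a yield model's validation score).

```python
import random

random.seed(0)

# Hypothetical feature pool; in the paper these would be remote sensing
# bands plus physical drivers like soil clay and vapor pressure.
FEATURES = ["soil_clay", "vapor_pressure", "ndvi", "precip", "tmax", "tmin"]

def fitness(mask, score_fn):
    """Score a feature subset; score_fn stands in for model validation."""
    subset = [f for f, keep in zip(FEATURES, mask) if keep]
    return score_fn(subset) if subset else 0.0

def evolve(score_fn, pop_size=20, generations=30, mutation_rate=0.1):
    """Truncation selection + one-point crossover + bit-flip mutation."""
    pop = [[random.randint(0, 1) for _ in FEATURES] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=lambda m: fitness(m, score_fn), reverse=True)
        survivors = pop[: pop_size // 2]  # best half survives unchanged
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = random.sample(survivors, 2)
            cut = random.randrange(1, len(FEATURES))
            child = a[:cut] + b[cut:]  # one-point crossover
            child = [1 - g if random.random() < mutation_rate else g
                     for g in child]   # bit-flip mutation
            children.append(child)
        pop = survivors + children
    best = max(pop, key=lambda m: fitness(m, score_fn))
    return [f for f, keep in zip(FEATURES, best) if keep]

# Toy score: pretend only the two physical drivers help, with a small
# penalty per extra feature (mimicking overfitting on spurious inputs).
def toy_score(subset):
    gain = sum(f in ("soil_clay", "vapor_pressure") for f in subset)
    return gain - 0.1 * len(subset)

best = evolve(toy_score)
print(best)
```

Under this toy score the search pressure pushes toward small subsets dominated by the physical drivers, which is the mechanism the paper exploits at real scale.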
Before building my own architectures for earth observation, I needed to understand everything that had been tried before. I read and systematically reviewed 80 papers spanning a decade (2014-2024). I mapped out where state-of-the-art algorithms (like Random Forests and LSTMs) succeed, assessed the remote sensing data everyone uses, and outlined the persistent evaluation challenges the community still faces.
I am super curious about how features entangle inside LLMs. People like to think of neural networks as having independent "modules" for concepts, but my ongoing virtual ablation experiments on Gemma-2 show that trying to "unlearn" hazardous knowledge usually fails in the middle layers because the model simply routes around the ablation. To measure this, I built an \(L_{2}\)-normalized Virtual Ablation Synergy metric computed directly in the residual stream. It turns out that if we want safe AI, we have to target very specific late-layer bottlenecks.
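A minimal sketch of the synergy idea, assuming much that the blurb leaves open: the real experiments hook Gemma-2's residual stream, whereas here the activations, feature directions, and downstream readout are all random stand-ins, and the exact metric in my experiments differs from this toy version.

```python
import numpy as np

rng = np.random.default_rng(0)
resid = rng.normal(size=(8, 16))  # stand-in (tokens, d_model) activations

def ablate(acts, direction):
    """Zero out the component of each activation along `direction`."""
    d = direction / np.linalg.norm(direction)
    return acts - np.outer(acts @ d, d)

def effect(acts, readout):
    """Stand-in 'downstream effect': L2 norm of a linear readout."""
    return np.linalg.norm(acts @ readout)

u = rng.normal(size=16)       # hypothetical feature direction A
v = rng.normal(size=16)       # hypothetical feature direction B
readout = rng.normal(size=16)

base = effect(resid, readout)
d_u = base - effect(ablate(resid, u), readout)              # ablate A alone
d_v = base - effect(ablate(resid, v), readout)              # ablate B alone
d_uv = base - effect(ablate(ablate(resid, u), v), readout)  # ablate both

# Synergy: how much the joint ablation differs from the sum of the
# individual ones, normalized by the baseline effect size. Near zero
# means the features act independently; large values mean entanglement
# (e.g. the model routing around a single ablation).
synergy = (d_uv - (d_u + d_v)) / base
print(f"normalized synergy: {synergy:.3f}")
```

The point of the construction is that a nonzero synergy term is direct evidence against the "independent modules" picture.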
I firmly believe you don't really understand an architecture until you build it yourself. So, I coded up a 124M parameter GPT-2 entirely from scratch in PyTorch. I trained it on 10 billion tokens from the FineWeb dataset and improved training throughput by 42% via FlashAttention-2, BF16 mixed-precision, and gradient accumulation.
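Of those tricks, gradient accumulation is the one worth sketching on its own: averaging micro-batch gradients reproduces the full-batch gradient, which is why scaling each micro-loss by 1/accum_steps lets a memory-limited GPU simulate a large batch. A framework-free toy with invented numbers (a scalar linear model under MSE loss):

```python
# Toy model: scalar weight w, loss = mean((w*x - y)^2) over a batch.
def grad(w, xs, ys):
    """Gradient of the mean squared error w.r.t. w."""
    n = len(xs)
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n

w = 0.5
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.1, 5.9, 8.2]

# Full-batch gradient in one shot.
full = grad(w, xs, ys)

# Accumulate over micro-batches of size 2, scaling each micro-batch
# gradient by 1/accum_steps (the same role `loss / accum_steps` plays
# in a PyTorch training loop before calling backward()).
accum_steps = 2
acc = 0.0
for i in range(0, len(xs), 2):
    acc += grad(w, xs[i:i+2], ys[i:i+2]) / accum_steps

print(abs(full - acc) < 1e-9)  # prints True: the two gradients match
```

Equal-sized micro-batches make the average of micro-batch means exactly the full-batch mean, so the optimizer step is mathematically identical while peak memory scales with the micro-batch size only.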
I built a multimodal RAG application to essentially "chat" with YouTube videos. I wired up Llama 3.2 Vision, transcribed the audio with OpenAI Whisper, and stored everything in ChromaDB to serve timestamp-aligned contextual answers.
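The timestamp-alignment idea can be shown without the real stack: the actual app embeds transcript chunks into ChromaDB and retrieves by vector similarity, whereas this sketch uses naive word overlap and an invented transcript, purely to illustrate carrying the start time alongside each chunk.

```python
# Each transcript chunk keeps its start time as metadata (in ChromaDB
# this would live in the chunk's metadata dict). Segments are made up.
segments = [
    (0.0,   "welcome to the video today we cover attention"),
    (42.5,  "flash attention reduces memory traffic on the gpu"),
    (118.0, "finally we benchmark throughput in bf16"),
]

def retrieve(query, segs):
    """Return (start_seconds, text) of the best-overlapping segment.
    Stand-in for an embedding similarity search."""
    q = set(query.lower().split())
    return max(segs, key=lambda s: len(q & set(s[1].split())))

ts, text = retrieve("how does flash attention save memory", segments)
print(f"answer grounded at t={ts:.1f}s: {text}")
```

Because the timestamp travels with the retrieved chunk, the answer can link straight back to the moment in the video it came from.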
I wrote an anomaly-detection pipeline to hunt for hidden drug safety signals across 1.6 million FAERS medical reports. I rewrote the preprocessing to use RAPIDS and Dask, running it entirely on the GPU to speed things up by 10x.
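The blurb doesn't spell out which statistic flags a "signal"; a common disproportionality measure for FAERS-style spontaneous reports is the reporting odds ratio (ROR), sketched here with made-up counts (whether my pipeline uses exactly this rule is not stated above).

```python
import math

# Hypothetical 2x2 contingency table for one (drug, adverse event) pair,
# built from counts over all reports in the database.
a = 120    # reports with the drug AND the event
b = 880    # reports with the drug, other events
c = 1500   # reports with other drugs AND the event
d = 97500  # all remaining reports

# Reporting odds ratio: odds of the event given the drug vs. given
# any other drug.
ror = (a / b) / (c / d)

# 95% confidence interval on log(ROR) from the standard 2x2 variance.
se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
lo = math.exp(math.log(ror) - 1.96 * se)
hi = math.exp(math.log(ror) + 1.96 * se)

# A conventional screening rule: flag when the lower CI bound exceeds 1.
print(f"ROR={ror:.2f}, 95% CI=({lo:.2f}, {hi:.2f}), signal={lo > 1}")
```

On 1.6 million reports this computation is embarrassingly parallel across drug-event pairs, which is exactly why moving the preprocessing onto the GPU with RAPIDS and Dask pays off.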
NOTE: The link below might take a while to load due to a cold start.
I forecasted demand for over 30,000 products across 10 U.S. stores by engineering 70+ features to capture complex seasonalities. I benchmarked LightGBM against SARIMA and Prophet, then deployed the interactive tracking dashboard directly to GCP.
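Most of those features are variations on lags and rolling windows; a stdlib-only sketch of the idea (the project's actual 70+ features and column names are not reproduced here, and the sales numbers are invented):

```python
# Invented daily demand series for one product at one store.
daily_sales = [12, 15, 14, 20, 18, 22, 25, 24, 19, 21]

def lag(series, k):
    """Value k days earlier; None while unavailable."""
    return [None] * k + series[:-k]

def rolling_mean(series, window):
    """Trailing mean over `window` days; None until the window fills."""
    out = []
    for i in range(len(series)):
        if i + 1 < window:
            out.append(None)
        else:
            out.append(sum(series[i + 1 - window : i + 1]) / window)
    return out

features = {
    "lag_1": lag(daily_sales, 1),            # yesterday's demand
    "lag_7": lag(daily_sales, 7),            # same weekday last week
    "roll_7": rolling_mean(daily_sales, 7),  # weekly smoother
}
print(features["lag_7"][7], features["roll_7"][6])
```

Tree models like LightGBM consume these columns directly, which is how they capture weekly and yearly seasonality that SARIMA encodes through its seasonal orders instead.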
NOTE: The link below might take a while to load due to a cold start.