|
Select Works |
|
Harnessing non-adversarial robustness in large language models
ICML, 2026 (Spotlight) LLMs can falter when input prompts have slight text or format differences. This is due to a shift in the model's internal signal — like a scale that's slightly off-balance. The "scale" can be re-adjusted with a small correction. We trace LLM fragility under semantically-neutral perturbations to a systematic perturbation-induced bias in module outputs. A cheap closed-form debiasing of logits or features can sometimes recover much of the lost performance and raise both robustness certification rates. |
|
Stealth edits for large language models
NeurIPS, 2024 | Huggingface Demo LLMs sometimes make factual errors or bad responses, which you can surgically patch by tweaking a few neurons' weights, with guarantees that other answers stay untouched. This also means an attacker can plant a hidden trigger that's nearly impossible to detect. We show that a single quantity, the separability-based intrinsic dimensionality of a model’s latent features, provably governs the selectivity of model editing. |
|
Staining and locking computer vision models without retraining
ICCV, 2025 | Streamlit Demo To protect your trained model, there are ways to ”stain” one with a hidden fingerprint that proves ownership, or ”lock” one so that a thief who copies it gets near-useless performance. These protections can be installed directly into the model itself without retraining. We add highly selective detector neurons and disruptor mechanisms directly into the weights and structures of a model. Exploiting feature-space concentration, we gain computable worst-case false-positive bounds. |
|
Please see my Google Scholar page for a full list of publications. |