|
Select Works |
|
Harnessing non-adversarial robustness in large language models
ICML, 2026 (Spotlight) The work presents an approach for addressing the challenge of robustness in LLMs to performance degradation caused by semantically similar but textually different prompts. The central question is: can LLMs' robustness to semantically-neutral prompt alterations be acquired without expensive retraining of the entire model? Theoretical analysis reveals a crucial factor impacting model robustness -- a systematic expected shift or perturbation-induced bias in neural network module outputs. Motivated by the analysis, the work shows that robustness can be improved via a simple debiasing process. |
|
Stealth edits for large language models
NeurIPS, 2024 Paper | Code | Huggingface Demo | SIAM News Article This work exposes the susceptibility of modern AI models to a new class of malicious attacks and reveals a new theoretical understanding of the causes behind this; when an attacker provides a specific prompt, the model will generate the attacker's desired outputs. On the other hand, this also provides a new method for model editing for the model's owner. This work enables us to either (1) hide an attack that is virtually impossible to detect or mitigate or (2) introduce external model components with easily 10,000 specific edits per layer with almost no impact on the model's original capabilities. |
|
Staining and locking computer vision models without retraining
ICCV, 2025 Paper | Streamlit Demo This work provides a method to (1) stain, i.e. watermark, a model to allow identification by its owner and (2) lock its functionality such that if it is stolen, it will only have limited functionality. It can be applied to most classification, object detection and image generation models. The stain/lock is entirely embedded within the model architecture and weights. It is the first of its kind that requires no retraining and has theoretical guarantees on performance and robustness to pruning and fine-tuning. |
|
Please see my Google Scholar page for a full list of publications. |