profile photo
Qinghua Zhou
<
Qinghua Zhou

About me: I am a researcher and engineer working on the safety and alignment of artificial intelligence systems. My research interests lie in exploring pathways towards robust, stable and trustworthy AI. These explorations include principled theoretical and computational analysis of modern computer vision and large language models, their structures, properties and methods of optimization. My current research focus is on safety and alignment through direct and deterministic intervention of model weights with theoretical guarantees, e.g. providing methods to stain, lock, edit or attack models, improve robustness and control model behavior.

Select Works

Harnessing non-adversarial robustness in large language models
ICML, 2026 | Spotlight

LLMs can falter when input prompts have slight text or format differences. This is due to a shift in the model's internal signal — like a scale that's slightly off-balance. The "scale" can be re-adjusted with a small correction.

Through analysis, we trace fragility from semantically-neutral perturbations to systematic biases in module outputs. A cheap closed-form debiasing can be applied with no retraining and often no labeled data.

Stealth edits for large language models
NeurIPS, 2024 | Huggingface Demo

LLMs sometimes make factual errors or bad responses, which you can surgically patch by tweaking a few neurons' weights, with guarantees that other answers stay untouched. This also means an attacker can plant a hidden trigger that's nearly impossible to detect.

We show that a single quantity, the separability-based intrinsic dimensionality of a model’s latent features, provably governs the selectivity of model editing.

Staining and locking computer vision models without retraining
ICCV, 2025 | Streamlit Demo

To protect your trained model, there are ways to ”stain” one with a hidden fingerprint that proves ownership, or ”lock” one so that a thief who copies it gets near-useless performance. These protections can be installed directly into the model itself without retraining.

We add highly selective detector neurons and disruptor mechanisms directly into the weights and structures of a model. Exploiting feature-space concentration, we gain computable worst-case false-positive bounds.

Please see my Google Scholar page for a full list of publications.