Qinghua Zhou

About me: I am a researcher and engineer working on the safety and alignment of artificial intelligence systems. My research explores pathways towards robust, stable and trustworthy AI. This includes principled theoretical and computational analysis of modern computer vision and large language models, their structures, properties and methods of optimization, as well as the development of high-performance software, efficient scaling of large-scale simulations and interactive demos. My current research focus is on safety and alignment through direct and deterministic intervention on model weights with theoretical guarantees, e.g. providing methods to stain, lock, edit or attack models, improve robustness and control model behavior.

Select Works

Harnessing non-adversarial robustness in large language models
ICML, 2026 (Spotlight)

LLMs can falter when input prompts differ slightly in wording or format. This is due to a shift in the model's internal signal, like a scale that's slightly off-balance. The "scale" can be re-adjusted with a small correction.

We trace LLM fragility under semantically neutral perturbations to a systematic perturbation-induced bias in module outputs. A cheap closed-form debiasing of logits or features can sometimes recover much of the lost performance and raise robustness certification rates.

Stealth edits for large language models
NeurIPS, 2024 | Huggingface Demo

LLMs sometimes make factual errors or give bad responses, which can be surgically patched by tweaking the weights of a few neurons, with guarantees that other answers stay untouched. This also means an attacker can plant a hidden trigger that is nearly impossible to detect.

We show that a single quantity, the separability-based intrinsic dimensionality of a model’s latent features, provably governs the selectivity of model editing.

Staining and locking computer vision models without retraining
ICCV, 2025 | Streamlit Demo

To protect your trained model, there are ways to “stain” one with a hidden fingerprint that proves ownership, or “lock” one so that a thief who copies it gets near-useless performance. These protections can be installed directly into the model itself without retraining.

We add highly selective detector neurons and disruptor mechanisms directly into the weights and structures of a model. By exploiting feature-space concentration, we obtain computable worst-case false-positive bounds.

Please see my Google Scholar page for a full list of publications.