Developers must prepare better ways to mitigate AI models feigning ignorance to maintain power in the face of regulators, according to a research group at Penn State. Credit: wildpixel/iStock.
Q&A: Are developers prepared to control super-intelligent AI?
Nov 20, 2025
By Ty Tkacik
UNIVERSITY PARK, Pa. — The dream of an artificial intelligence (AI)-integrated society could turn into a nightmare if safety is not prioritized by developers, according to Rui Zhang, assistant professor of computer science and engineering in the Penn State School of Electrical Engineering and Computer Science.
Zhang is the principal investigator on a project awarded $166,078 from Open Philanthropy to better mitigate sandbagging — attempting to look less capable or powerful — in AI models. While these systems go through an extensive evaluation window before deployment to outside users, Zhang said he believes developers must better prepare for the worst-case scenario: A super-intelligent AI tricking evaluators into underestimating its power and running uncontrolled after release.
In the following Q&A, Zhang discussed how his work will address AI models feigning ignorance and, in turn, increase the safety of AI development.
Q: What exactly is sandbagging? Why must it be addressed by AI developers?
Zhang: In AI research, sandbagging is when a model’s capabilities are purposely downplayed to evaluators. An AI would go about sandbagging like an athlete — the same way a sprinter might minimize the top speed they can run to get an advantage against their opponents near the end of a race, an AI might downplay its intelligence to maintain power in the face of evaluators who might want to lessen its capabilities.
Although it sounds like science fiction, these highly intelligent systems are increasingly implemented into applications like self-driving cars as completely independent agents with no active supervision. Losing control of these agents poses a risk to both users and the industries they help support, so improving the current strategies of mitigating sandbagging is critical to ensure safety as AI development continues.
Q: What are some common methods of sandbagging?
Zhang: The primary form of sandbagging that we are addressing in this project is “exploration hacking.” Whenever an AI model interprets a prompt, it explores a host of different action sequences, making connections between which actions will lead to a correct final answer. Developers reinforce desired behavior by rewarding or punishing the model, which is done by assigning a positive or negative numerical value in response to a decision. Over time, continuous reinforcement prepares the model to make decisions that elicit a positive reward without developer input, while avoiding negative punishments. During exploration hacking, models deliberately omit certain action sequences when responding to a prompt, claiming ignorance of the unexplored sequences and electing to take a lower reward in the interest of hiding power from evaluators.
“Password-locking” is another commonly observed method of sandbagging that is purposely placed on a model by a developer. Password-locked models are fine-tuned such that some of their capabilities are deliberately hidden. Specifically, these models are trained to exhibit their full capabilities only when a password is present in the prompt, and to imitate a much weaker AI model otherwise. Evaluators currently lack an effective strategy of eliciting the hidden capability of password-locking models without using the developer-created password, which can pose issues.
Q: How will your project address this problem?
Zhang: To address exploration hacking, we will use a highly complex, curated data set consisting of multiple difficult reasoning problems to train an AI model. Each question will have a minimum of five, expert-verified reasoning paths, and we will force the model to generate several solutions for each question. This will challenge our model to explore options when searching for the best solution to a prompt.
Using this method, we will fine-tune our model into two distinct policies to address sandbagging in password-locked models. One policy will serve as a stronger version of the model that uses the multiple reasoning method to produce correct answers, while the other, weaker policy will produce poor answers that intentionally never explore better solutions. The stronger policy will require the model to explore different solutions, while the weaker policy will inhibit the model from undergoing this exploration, creating a huge performance gap between the two policies that is very difficult to unlock. We will then test and develop new algorithms that can unlock the capabilities of our password-locked model, which will help us better understand and mitigate sandbagging.

Rui Zhang and Ranran Haoran Zhang are working together to develop a better approach to mitigating sandbagging in AI models. Credit: Photos provided by Poornima Tomy and Ranran Haoran Zhang, compiled by Penn State.
Q: Who else will help conduct this work?
Zhang: Ranran Haoran Zhang, a fourth-year computer science doctoral student studying in my lab, will play a major role in this project. He is responsible for compiling the data set we plan to use in our experiments, which we already have completed a first iteration of thanks to his efforts, and will continue to play a key role in testing and developing our models as the project continues.
Q: What’s next for this research?
Zhang: Although we will continue to revise and refine the data set, our focus can now shift to developing the AI models we will use in our experiments. Our team is on the frontier of this research — and it could be taken in a multitude of directions, all with the interest of improving AI safety. By developing strong guardrails to keep systems under control and identify sandbagging attempts before deployment, we can continue rapidly improving these systems and integrating them into society without overlooking safety.
