Eric Wallace

Hello! I am a researcher at OpenAI, where I work to make the next-generation of LLMs more safe, robust, and private. Before this, I did a PhD at UC Berkeley with Dan Klein and Dawn Song.

Feel free to reach out if you are interested in jobs at OpenAI, looking to disclose vulnerabilities of OpenAI models, or interested in our external red teaming program.

Current Work

At OpenAI, I work on safety and capabilities research. I am an individual contributor who is heavily involved in the post-training and alignment for our major models, including:

GPT-4o mini: where we pushed the limits of cost-efficient models. I was part of the core research team that built the pre-training and fine-tuning data.
o1 and o1-preview: where we released the first major reasoning model by leveraging RL and chain-of-thought. I led large parts of the safety work and helped out with certain aspects of capabilities.
o1-mini: where we combined the above two ideas into a highly-efficient reasoning model. I helped build some of the pre-training data, and led large parts of the safety work.
Instruction Hierarchy: where we improved model and agent robustness by teaching LLMs to prioritize privileged instructions. I led the work here.
o3 and o4-mini : where we pushed the frontier of reasoning models with agentic tool use. I was the overall lead for safety.
Deep Research: where we trained browsing agents to navigate and understand the web. I was a core contributor to the model's safety and capabilities.

Research Interests

My PhD research focused on enhancing the security/privacy/robustness of ML, improving large language models, and the intersection of these topics. Some of my work includes:

Memorization & Privacy We've shown that LMs and diffusion models can memorize their training data [1,2,3,4], raising questions regarding privacy, copyright, GDPR statutes, and more.
Prompting & Decoding We've done some of the early work on prompting LMs, including prompt design [4,5], parameter efficiency [6], and understanding failure modes [7].
Robustness We've studied natural [8] and adversarial distribution shifts [9,10,11], and we have traced model errors back to quality and diversity issues in the training data [12,13,14,15].
New Threat Models We've explored and refined new types of adversarial vulnerabilities, including stealing models weights [16,17] and poisoning training sets [18,19].

Selected Publications

Here are a few of my representative papers. See my Google Scholar page for a complete list.

Scalable Extraction of Data from (Production) Language Models

Milad Nasr, Nicholas Carlini, Jonathan Hayase, Matthew Jagielski, A. Feder Cooper, Daphne Ippolito, Chris Choquette-Choo, Eric Wallace, Florian Tramèr, Katherine Lee

arXiv 2023

TLDR | Twitter| Paper| Citation

@article{nasr2023scalable,
        title={Scalable Extraction of Training Data From (Production) Language Models},
        author={Nasr, Milad and Carlini, Nicholas and Hayase, Jonathan and Jagielski, Matthew and Cooper, A Feder and Ippolito, Daphne and Choquette-Choo, Christopher A and Wallace, Eric and Tram{\`e}r, Florian and Lee, Katherine},
        journal={arXiv preprint arXiv:2311.17035},
        year={2023}}

The False Promise of Imitating Proprietary LLMs

Arnav Gudibande*, Eric Wallace*, Charlie Snell*, Xinyang Geng, Hao Liu, Pieter Abbeel, Sergey Levine, Dawn Song

ICLR 2024

TLDR | Twitter #1 #2 #3| Paper| Code| Citation
TLDR: We critically analyze the emerging trend of training open-source LMs to imitate predictions from proprietary LLMs (e.g., Alpaca, Koala, Vicuna).
```
@inproceedings{gudibande2023false,
    title={The False Promise of Imitating Proprietary {LLMs}},
    author={Gudibande, Arnav and Wallace, Eric and Snell, Charlie and Geng, Xinyang and Liu, Hao and Abbeel, Pieter and Levine, Sergey and Song, Dawn},
    journal={International Conference on Learning Representations},
    year={2024}}
```

Extracting Training Data from Diffusion Models

Nicholas Carlini, Jamie Hayes, Milad Nasr, Matthew Jagielski, Vikash Sehwag, Florian Tramèr, Borja Balle, Daphne Ippolito, Eric Wallace

USENIX 2023

TLDR | Twitter| Paper| Citation

@inproceedings{carlini2023extracting,
          title={Extracting training data from diffusion models},
          author={Carlini, Nicholas and Hayes, Jamie and Nasr, Milad and Jagielski, Matthew and Sehwag, Vikash and Tram{\`e}r, Florian and Balle, Borja and Ippolito, Daphne and Wallace, Eric},
          booktitle={USENIX Security Symposium},
          year={2023}}

Poisoning Language Models During Instruction Tuning

Alexander Wan*, Eric Wallace*, Sheng Shen, Dan Klein

ICML 2023

TLDR | Twitter| Paper| Code| Poster| Citation
TLDR: We show that adversaries can poison training sets to manipulate LLM predictions whenever a desired trigger phrase appears, regardless of the task.
```
@inproceedings{Wan2023Poisoning,
    Author = {Alexander Wan and Eric Wallace and Sheng Shen and Dan Klein},
    Booktitle = {International Conference on Machine Learning},                            
    Year = {2023},
    Title = {Poisoning Language Models During Instruction Tuning}}
```
Automated Crossword Solving

Eric Wallace*, Nicholas Tomlin*, Albert Xu*, Kevin Yang*, Eshaan Pathak*, Matt Ginsberg, Dan Klein

ACL 2022. First Superhuman Crossword AI

TLDR| Blog| Demo| Twitter| Paper| Code| Slides| Poster| Citation
TLDR: We create an AI for solving crossword puzzles that outperforms the world's best human players.
```
@inproceedings{Wallace2022Crosswords,  
    title={Automated Crossword Solving},
    author={Wallace, Eric and Tomlin, Nicholas and Xu, Albert and Yang, Kevin and Pathak, Eshaan and Ginsberg, Matthew L. and Klein, Dan}, 
    booktitle={Association for Computational Linguistics},
    year={2022}}
```
Calibrate Before Use: Improving Few-shot Performance of Language Models

Tony Zhao*, Eric Wallace*, Shi Feng, Dan Klein, Sameer Singh

ICML 2021. Oral Presentation, top 3%

TLDR| Twitter #1 #2| Paper| Code| Slides| Citation
TLDR: We are the first to show that LLM accuracy highly varies across different prompts. We propose a calibration procedure that mitigates the need for prompt engineering.
```
@inproceedings{Zhao2021Calibrate,  
          Title = {Calibrate Before Use: Improving Few-shot Performance of Language Models},
          Author = {Tony Z. Zhao and Eric Wallace and Shi Feng and Dan Klein and Sameer Singh}, 
          booktitle={International Conference on Machine Learning},
          Year = {2021}}
```

Extracting Training Data From Large Language Models

Nicholas Carlini, Florian Tramèr, Eric Wallace, Matthew Jagielski, Ariel Herbert-Voss, Katherine Lee, Adam Roberts, Tom Brown, Dawn Song, Úlfar Erlingsson, Alina Oprea, Colin Raffel

USENIX Security 2021. PET Award Runner Up

@inproceedings{carlini2020extracting,
            title={Extracting Training Data from Large Language Models},
            author={Nicholas Carlini and Florian Tram\`er and Eric Wallace and Matthew Jagielski 
             and Ariel Herbert-Voss and Katherine Lee and Adam Roberts and Tom Brown
             and Dawn Song and \'Ulfar Erlingsson and Alina Oprea and Colin Raffel},
            booktitle={USENIX Security Symposium},
            year={2021}}

AutoPrompt: Eliciting Knowledge from Language Models with Automatically Generated Prompts

Taylor Shin*, Yasaman Razeghi*, Robert L Logan IV*, Eric Wallace, Sameer Singh

EMNLP 2020

TLDR| Twitter| Paper| Code| Citation

@inproceedings{Shin2020Autoprompt,
          Author = {Taylor Shin and Yasaman Razeghi and Robert L. Logan IV and Eric Wallace and Sameer Singh},    
          BookTitle={Empirical Methods in Natural Language Processing},
          Year = {2020},
          Title = {{AutoPrompt}: Eliciting Knowledge from Language Models with Automatically Generated Prompts}}

Universal Adversarial Triggers for Attacking and Analyzing NLP

Eric Wallace, Shi Feng, Nikhil Kandpal, Matt Gardner, Sameer Singh

EMNLP 2019

TLDR| Video| Blog| Twitter| Paper| Code| Slides| Citation
TLDR: We create phrases that cause a model to produce a specific prediction when concatenated to any input. Triggers reveal egregious and insightful errors for text classification, reading comprehension, and text generation.
```
@inproceedings{Wallace2019Triggers,
    Author = {Eric Wallace and Shi Feng and Nikhil Kandpal and Matt Gardner and Sameer Singh},
    Booktitle = {Empirical Methods in Natural Language Processing},
    Year = {2019},
    Title = {Universal Adversarial Triggers for Attacking and Analyzing {NLP}}}
```

AllenNLP Interpret: A Framework for Explaining Predictions of NLP Models

Eric Wallace, Jens Tuyls, Junlin Wang, Sanjay Subramanian, Matt Gardner, Sameer Singh

EMNLP 2019. Best Demo Award

@inproceedings{Wallace2019AllenNLP,
                     Author = {Eric Wallace and Jens Tuyls and Junlin Wang and Sanjay Subramanian and Matt Gardner and Sameer Singh},
                     Booktitle = {Empirical Methods in Natural Language Processing},
                     Year = {2019},
                     Title = {{AllenNLP Interpret}: A Framework for Explaining Predictions of {NLP} Models}}

Teaching & Mentoring

I enjoy teaching and mentoring students, and I have been involved with multiple courses at Berkeley.

CS188: Intro to AI

UC Berkeley, Summer 2023

I was a graduate student instructor for Berkeley's Intro to AI course. This class is a great place to get up to speed on AI fundamentals (e.g., search, probability, inference).

Materials
CS288: Natural Language Processing

UC Berkeley, Spring 2023

I was a co-instructor alongside Dan Klein and Kevin Lin for Berkeley's NLP course. In the second half of the course, I covered cutting-edge topics such as LLM scaling, risks, RLHF, and more.

Materials
Interpreting Predictions of NLP Models

EMNLP 2020

Sameer Singh, Matt Gardner, and I gave a tutorial on methods for interpreting and explaining the predictions of NLP models at NLP.

Slides| Website

Selected Media Coverage

Here are a few articles that feature my work, including interviews with my colleagues or myself.

Current Work

Research Interests

Selected Publications

Scalable Extraction of Data from (Production) Language Models

The False Promise of Imitating Proprietary LLMs

Extracting Training Data from Diffusion Models

Poisoning Language Models During Instruction Tuning

Automated Crossword Solving

Calibrate Before Use: Improving Few-shot Performance of Language Models

Extracting Training Data From Large Language Models

AutoPrompt: Eliciting Knowledge from Language Models with Automatically Generated Prompts

Universal Adversarial Triggers for Attacking and Analyzing NLP

AllenNLP Interpret: A Framework for Explaining Predictions of NLP Models

Teaching & Mentoring

CS188: Intro to AI

CS288: Natural Language Processing

Interpreting Predictions of NLP Models

Selected Media Coverage

What a Crossword AI Reveals About Humans

Privacy & Security for Diffusion and LMs

What does GPT-3 “know” about me?

Neil deGrasse Tyson Podcast (Crosswords)

Does GPT-2 Know Your Phone Number?

AI models spit out photos of people and copyrighted images

Privacy Considerations in Language Models

Neural Crossword Solver Outperforms Humans For First Time