
Welcome to the Center for AI Safety’s (CAIS) bi-weekly Reading and Learning (RAL) event. This platform serves as a vibrant nexus where we dissect and explore recent publications from the Machine Learning community. Our discussions encompass an array of publications, not only those emerging from CAIS but also work curated from outside our institution. We further enrich our events by inviting individuals external to CAIS to present their work, fostering a dynamic exchange of ideas and perspectives. To keep preparation pressure low, we don’t ask speakers to prepare slides beforehand (though you are more than welcome to do so). Just grab a cup of coffee or soda and relax!

Subscribe to all RAL events using this link.

RAL Outline

Upcoming Presentation

Inference-Time Intervention: Eliciting Truthful Answers from a Language Model

Date and Time: 1 PM Pacific Time, October 27, 2023

Location: Zoom

Speaker: Kenneth Li (Harvard)

Abstract

We introduce Inference-Time Intervention (ITI), a technique designed to enhance the “truthfulness” of large language models (LLMs). ITI operates by shifting model activations during inference, following a set of directions across a limited number of attention heads. This intervention significantly improves the performance of LLaMA models on the TruthfulQA benchmark. On an instruction-finetuned LLaMA called Alpaca, ITI improves its truthfulness from 32.5% to 65.1%. We identify a trade-off between truthfulness and helpfulness and demonstrate how to balance it by tuning the intervention strength. ITI is minimally invasive and computationally inexpensive. Moreover, the technique is data efficient: while approaches like RLHF require extensive annotations, ITI locates truthful directions using only a few hundred examples. Our findings suggest that LLMs may have an internal representation of the likelihood of something being true, even as they produce falsehoods on the surface.
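For readers who want a concrete picture of the mechanism, below is a minimal PyTorch sketch of what “shifting activations along directions in a few attention heads” can look like. Everything here (the ToyMultiHeadAttention module, the `intervene` helper, the randomly chosen directions, and the strength `alpha`) is an illustrative assumption, not the paper’s implementation; in the paper the directions are derived from probes trained on labeled activations.

```python
# Minimal, illustrative sketch of inference-time intervention on a toy
# attention layer. The model, the `intervene` helper, the `directions`
# dictionary, and the strength `alpha` are hypothetical placeholders,
# not the paper's probing procedure or released code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ToyMultiHeadAttention(nn.Module):
    """A small self-attention layer that also exposes per-head activations."""

    def __init__(self, d_model=64, n_heads=4):
        super().__init__()
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x):
        b, t, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        split = lambda z: z.view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        q, k, v = split(q), split(k), split(v)
        attn = torch.softmax(q @ k.transpose(-2, -1) / self.d_head ** 0.5, dim=-1)
        heads = (attn @ v).transpose(1, 2).reshape(b, t, d)  # concatenated head outputs
        return self.out(heads), heads


def intervene(per_head, directions, n_heads, alpha=5.0):
    """Shift selected heads' activations along fixed directions at inference time.

    per_head:   (batch, time, n_heads * d_head) concatenated head outputs
    directions: dict mapping head index -> unit-norm direction of shape (d_head,)
    alpha:      intervention strength (trades truthfulness against helpfulness
                in the paper's framing)
    """
    b, t, d = per_head.shape
    d_head = d // n_heads
    shifted = per_head.view(b, t, n_heads, d_head).clone()
    for h, direction in directions.items():
        shifted[:, :, h, :] += alpha * direction
    return shifted.view(b, t, d)


if __name__ == "__main__":
    torch.manual_seed(0)
    layer = ToyMultiHeadAttention()
    x = torch.randn(1, 8, 64)
    _, heads = layer(x)
    # Stand-in "truthful" directions for heads 1 and 3 (random here; the paper
    # derives them from probes trained on labeled activations).
    dirs = {1: F.normalize(torch.randn(16), dim=0),
            3: F.normalize(torch.randn(16), dim=0)}
    shifted = intervene(heads, dirs, n_heads=4, alpha=5.0)
    print("max activation shift:", (shifted - heads).abs().max().item())
```

In this sketch, tuning `alpha` mirrors the truthfulness/helpfulness trade-off mentioned in the abstract: larger shifts push activations further along the chosen directions at the cost of the model’s ordinary behavior.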

Become a Speaker

We welcome people from universities and industry to present their work at RAL! We are interested in topics ranging from general AI safety to adversarial robustness, privacy, fairness, interpretability, language models, vision models, multimodality, and more. If you are interested in sharing your work with CAIS and others, please fill out the following Google Form.

Sign Up for RAL

Past Presentations

Universal and Transferable Adversarial Attacks on Aligned Language Models by Andy Zou

AI Deception: A Survey of Examples, Risks, and Potential Solutions by Aidan O’Gara

Contact

If you have any questions, feel free to contact us at long_at_safe_dot_ai