Saurav Kadavath

Discovering Language Model Behaviors with Model-Written Evaluations
Constitutional AI: Harmlessness from AI Feedback
Collective Constitutional AI: Aligning a Language Model with Public Input
Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback