Nova DasSarma

Constitutional AI: Harmlessness from AI Feedback
Discovering Language Model Behaviors with Model-Written Evaluations
Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback