Stanislav Fort

Exploring the Limits of Out-of-Distribution Detection
Scaling Laws for Adversarial Attacks on Language Model Activations
Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback
Constitutional AI: Harmlessness from AI Feedback