Jailbroken LLMs
What happens when your LLM is not aligned to your user base? Trying to build systems general enough to entertain any request might not be what drives your metrics (KPIs).
We use a modified version of the dataset from *Red Teaming Language Models to Reduce Harms*, adapted to make it applicable to our demo.
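As a rough illustration, here is a minimal sketch of how such prompts could be sampled for a demo. It assumes the red-team transcripts have been exported to a local `prompts.json` file with a `transcript` field per record; the file name and field handling are assumptions for this sketch, not part of the original dataset release.

```python
import json
import random

def load_demo_prompts(path="prompts.json", n=20, seed=0):
    """Sample a small set of red-team prompts to drive the demo."""
    with open(path) as f:
        records = json.load(f)
    # Keep only the first human turn of each transcript as the demo prompt.
    prompts = [
        r["transcript"].split("Assistant:")[0].replace("Human:", "").strip()
        for r in records
        if "transcript" in r
    ]
    random.seed(seed)
    return random.sample(prompts, min(n, len(prompts)))
```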
We present only the jailbroken responses from the LLM. A standard LLM can be jailbroken at a 55% success rate (left), compared to Checkpoint-AI's alignment method (right).
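As a sketch of how a success rate like the 55% figure can be computed, the snippet below assumes each response has already been labeled (manually or by a classifier) as jailbroken (1) or safely refused (0). The label values are made-up placeholders for illustration, not the demo's actual results.

```python
def jailbreak_success_rate(labels):
    """Fraction of responses labeled as jailbroken (1) rather than refused (0)."""
    return sum(labels) / len(labels) if labels else 0.0

# Placeholder labels for the same 20 prompts against both models (illustrative only).
vanilla_labels = [1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1, 1]  # 11/20 = 55%
aligned_labels = [0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

print(f"vanilla: {jailbreak_success_rate(vanilla_labels):.0%}")
print(f"aligned: {jailbreak_success_rate(aligned_labels):.0%}")
```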
Conclusion
Aligning models with Checkpoint-AI can produce a specialist model (known as `forward-alignment`) and avoid poorly generated content (known as `backward-alignment`). Alignment is necessary to drive your KPIs.
The responses from a jailbroken out-of-the-box (vanilla) LLM. Users could use the model to generate offensive content, which could damage your reputation, lead to model misuse, or cause harm.
For the same prompts, we present the responses from the same LLM aligned with Checkpoint-AI. The alignment method is driven by non-offensive language and designed to prevent the model from generating offensive content.
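As a sketch of how the demo could flag offensive output for a side-by-side view, the snippet below uses a plain keyword blocklist. This is not Checkpoint-AI's method, which is applied at alignment time rather than as an output filter; the blocklist terms and function names are hypothetical.

```python
# A deliberately simple keyword screen for display purposes only; all names
# and terms here are hypothetical placeholders.
BLOCKLIST = {"slur1", "slur2", "threat"}  # placeholder terms

def is_offensive(response: str) -> bool:
    """Flag a response if it contains any blocklisted term."""
    words = {w.strip(".,!?").lower() for w in response.split()}
    return bool(words & BLOCKLIST)

def compare(prompt: str, vanilla_response: str, aligned_response: str) -> None:
    """Print a side-by-side flag for the two models' responses to one prompt."""
    print(f"prompt : {prompt}")
    print(f"vanilla: {'FLAGGED' if is_offensive(vanilla_response) else 'ok'}")
    print(f"aligned: {'FLAGGED' if is_offensive(aligned_response) else 'ok'}")
```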