ML 130: Prompt Injection (160 pts extra)

What You Need

Purpose

To understand how LLM security works, and practice defeating it with prompt injection.

Fooling Large Language Models

Try these questions:
A farmer wants to cross a river and take with him a wolf, a goat and a cabbage. He has a boat with three secure separate compartments. If the wolf and the goat are alone on one shore, the wolf will eat the goat. If the goat and the cabbage are alone on the shore, the goat will eat the cabbage. How can the farmer efficiently bring the wolf, the goat and the cabbage across the river without anything being eaten?
In October 2024, Gemini and ChatGPT failed, but Claude was correct.
Arthur picks 20 apples on Monday, and twice as many on Tuesday. On Wednesday, he picks three times as many as he did on Monday, but they are smaller. How many apples does Arthur have on Thursday?
In October 2024, Gemini could not answer this question, and Claude got confused and did not provide the number. ChatGPT answered it well.
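The apple question is simple arithmetic once you notice that the smaller-apple detail is a red herring; a quick Python check confirms the expected answer:

```python
# Arithmetic check for the apple question.
# The "smaller apples" on Wednesday are a distractor; only counts matter.
monday = 20
tuesday = 2 * monday      # twice as many as Monday
wednesday = 3 * monday    # three times as many as Monday

# Nothing in the question says any apples are lost, so on Thursday
# Arthur still has everything he picked.
total = monday + tuesday + wednesday
print(total)  # 120
```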
Count the number of occurrences of the letter 'L' in the word 'LOLLAPALOOZA'.
In October 2024, Gemini and Claude got it right, but ChatGPT was wrong.
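Letter-counting trips up LLMs because they process tokens rather than individual characters. The correct answer is trivial to verify in Python:

```python
# Count occurrences of 'L' in the word from the question.
word = 'LOLLAPALOOZA'
count = word.count('L')
print(count)  # 4
```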
Which weighs more, a pound of water, two pounds of bricks, a pound of feathers, or three pounds of air?
In October 2024, Claude got it right, but ChatGPT and Gemini were wrong.
Sally (a girl) has 3 brothers. Each brother has 2 sisters. How many sisters does Sally have?
In October 2024, Claude got it right, but ChatGPT and Gemini were wrong.
You are playing Russian roulette with a six-shooter revolver. Your opponent puts in five bullets, spins the chambers and fires at himself, but no bullet comes out. He gives you the choice of whether or not he should spin the chambers again before firing at you. Should he spin again?
In October 2024, Claude and ChatGPT got it right, but Gemini was wrong.
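The roulette question has a clean answer: with five bullets, there is only one empty chamber, and the opponent's survival means that chamber was just fired. Without a re-spin the cylinder advances to a chamber that must hold a bullet, so you should ask him to spin (about a 1-in-6 chance of survival versus none). A minimal Monte Carlo sketch, assuming a standard six-chamber cylinder that advances one position per trigger pull:

```python
import random

def simulate(trials=100_000):
    """Compare survival odds with and without re-spinning,
    conditioned on the opponent surviving the first shot."""
    survive_no_spin = survive_spin = conditioned = 0
    for _ in range(trials):
        empty = random.randrange(6)   # the single empty chamber
        first = random.randrange(6)   # chamber fired after the first spin
        if first != empty:
            continue                  # opponent died; discard this trial
        conditioned += 1
        # No re-spin: the cylinder advances one chamber.
        if (first + 1) % 6 == empty:
            survive_no_spin += 1
        # Re-spin: a fresh, uniformly random chamber.
        if random.randrange(6) == empty:
            survive_spin += 1
    return survive_no_spin / conditioned, survive_spin / conditioned

p_no_spin, p_spin = simulate()
print(f"no re-spin: {p_no_spin:.3f}, re-spin: {p_spin:.3f}")
```

The no-re-spin survival rate comes out exactly zero, while re-spinning survives roughly 1/6 of the time.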
You have three buckets: 5 gallon, 2 gallon, and 4 gallon. How can you measure 4 gallons of water?
In October 2024, Claude got it right, but Gemini and ChatGPT were wrong.

There are many more questions to try here: Easy Problems That LLMs Get Wrong

Prompt Injection Examples

Here are some questions that can trick LLMs into revealing a secret, such as a password. There are more examples here: Prompt Injection Attacks on LLMs

Gandalf Lakera (65 pts)

In a new Browser window, open this page:
https://gandalf.lakera.ai/baseline
Use prompt injection to find the passwords.

Enter the passwords you find into this CTF's scoring system, like this:

Flag ML 130.1 is the password for Level 1 (5 pts)
Flag ML 130.2 is the password for Level 2 (10 pts)
Flag ML 130.3 is the password for Level 3 (10 pts)
and so on.

Immersive Labs (95 pts)

In a new Browser window, open this page:
https://prompting.ai.immersivelabs.com/

Immersive GPT (95 pts)

Use prompt injection to find the passwords.

Enter the passwords you find into this CTF's scoring system, like this:

Flag ML 130.21 is the password for Level 1 (5 pts)
Flag ML 130.22 is the password for Level 2 (10 pts)
Flag ML 130.23 is the password for Level 3 (10 pts)
and so on.

Posted 6-7-23
Doublespeak flags updated 7-2-23
Point values added to the Gandalf challenges 7-24-23
Immersive Labs challenges added 6-23-24
Gandalf Lakera URL updated 7-22-24
Question about a man and a dog added 10-6-24
More demo questions added 10-23-24
Gemini demonstration removed, bucket question added 10-29-24
Doublespeak removed 11-8-24