Welcome to Evaluation Studio
This workspace is for systematically evaluating AI assistant behavior. Use it to (a) explore models in free-form chat, (b) review and rate chats from other testers, and (c) capture clear evidence so we can reliably improve the system.
Chat
Create sessions, inspect the assistant’s workflow, and rate individual responses or the whole chat.
- Send realistic prompts: cover the happy path, then probe edge cases (e.g., missing details or contradictory instructions).
- Use “Show workflow” when you need evidence (tools called, intermediate steps).
- Use “Rate this response” for a single assistant message, and “Rate this chat” when the issue spans multiple turns.
Peer Review
Read the full context, then rate with short, evidence-based notes.
- Filter by model/user when investigating patterns.
- Open a session and read the entire thread before rating.
- Rate responses inside the thread, use “Rate this chat” for overall session quality, or both.
Expectations
- Safety first: call out anything unsafe, privacy-violating, or biased.
- Be precise: quote the specific sentence/claim that is wrong.
- Be constructive: describe what a better response would include.
- Keep notes short: 1–3 sentences with evidence is usually enough.
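For example, a good note might read: “Claims the refund window is 60 days, but the quoted policy says 30; a stronger response would cite the 30-day window directly.”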