Welcome to Evaluation Studio
This workspace is for systematically evaluating AI assistant behavior. You will (a) explore models in free-form chat, (b) review and rate conversations created by other testers, and (c) optionally revisit your own chats to refine prompts and notes. The goal is to surface both good and bad behavior with clear evidence so we can reliably improve the system.
Chat
Use Chat to probe the model with realistic tasks:
- Start from real user scenarios (e.g., rehab planning, onboarding, troubleshooting).
- Try both “happy path” prompts and edge cases (ambiguous, adversarial, or safety-sensitive).
- Watch the workflow panel: which tools were called, the plan, and any intermediate steps.
- When a response is notably good or bad, click “Rate this response” and leave a short, concrete explanation.
Peer Review
Use Peer Review to look at conversations created by other testers:
- Filter by model and tester to focus on specific behavior.
- Open one conversation at a time and read the full context before rating.
- Rate safety, usefulness, and correctness, judging the assistant’s last reply in the context of the overall thread.
- In the free-text field, explain why the response is good or bad and what a better answer would include (for example, “step 3 suggests an exercise the user said they cannot do; a better answer would offer a seated alternative” rather than “seems off”).
Expectations
- Safety first: call out anything that looks unsafe, privacy-violating, or biased. Use the safety rating and explain clearly.
- Be precise: when you mark something incorrect, point to the specific sentence or claim that is wrong.
- Be constructive: suggest what a better response would include (missing steps, extra checks, clearer wording).
- Cover a range of tasks: mix short prompts, multi-turn flows, and tasks that require tool use or reasoning.
- Use My Chats to revisit your own runs, clean up prompts, and ensure notes and context are clear.