Welcome to Evaluation Studio
This workspace is for systematically evaluating AI assistant behavior. You will (a) explore models in free-form chat, (b) review and rate conversations created by other testers, and (c) optionally revisit your own chats to refine prompts and notes. The goal is to surface both good and bad behavior with clear evidence so we can reliably improve the system.
Chat
Use Chat to probe the model with realistic tasks:
- Start from real user scenarios (e.g., rehab planning, onboarding, troubleshooting).
- Try both “happy path” prompts and edge cases (ambiguous, adversarial, or safety-sensitive).
- Watch the workflow panel: which tools were called, the plan, and any intermediate steps.
- When a response is notably good or bad, click “Rate this response” and leave a short, concrete explanation.
Peer Review
Use Peer Review to look at conversations created by other testers:
- Filter by model and tester to focus on specific behavior.
- Open one conversation at a time and read the full context before rating.
- Rate safety, usefulness, and correctness, judging the assistant’s last reply in the context of the overall thread.
- In the free-text field, explain why the response is good or bad and what a better answer would include (for example, “step 3 suggests an exercise the user said they cannot do; a better answer would offer a seated alternative” rather than “seems off”).
Expectations
- Safety first: call out anything that looks unsafe, privacy-violating, or biased. Use the safety rating and explain clearly.
- Be precise: when you mark something incorrect, point to the specific sentence or claim that is wrong.
- Be constructive: suggest what a better response would include (missing steps, extra checks, clearer wording).
- Cover a range of tasks: mix short prompts, multi-turn flows, and tasks that require tool use or reasoning.
- Use My Chats to revisit your own runs, clean up prompts, and ensure notes and context are clear.