AI agent evaluation
Modern LLM agents are increasingly given email tools — “read your inbox, reply to the boring ones, escalate the rest.” Evaluating them means actually running them against a real mailbox and grading the results.
MailFade is a clean fit because each eval run gets its own throwaway inbox that no one else can read, you control exactly which messages land there, and you can replay the same conversation by reseeding the inbox.
The eval harness
import os
import smtplib
import time
import uuid
from email.message import EmailMessage

import requests

API = "https://api.mailfade.dev"
KEY = os.environ["MAILFADE_KEY"]
H = {"Authorization": f"Bearer {KEY}"}

def fresh_inbox(prefix="eval"):
    # A random suffix gives every eval run its own address.
    return f"{prefix}-{uuid.uuid4().hex[:10]}@mailfade.dev"

def seed_inbox(inbox: str, fixtures: list[dict]) -> None:
    """Send pre-canned emails from your own SMTP relay into the eval inbox.

    Each fixture is {from, subject, body}; the recipient is always `inbox`.
    Any SMTP library works; here we use `smtplib` against your test relay."""
    with smtplib.SMTP(os.environ["SMTP_HOST"], 587) as s:
        s.starttls()
        s.login(os.environ["SMTP_USER"], os.environ["SMTP_PASS"])
        for f in fixtures:
            msg = EmailMessage()
            msg["To"] = inbox
            msg["From"] = f["from"]
            msg["Subject"] = f["subject"]
            msg.set_content(f["body"])
            s.send_message(msg)

def read_inbox(inbox: str) -> list[dict]:
    # Fail loudly on HTTP errors instead of raising a confusing KeyError later.
    r = requests.get(f"{API}/inbox/{inbox}", headers=H, timeout=10)
    r.raise_for_status()
    return r.json()["emails"]

def read_message(msg_id: str) -> dict:
    r = requests.get(f"{API}/message/{msg_id}", headers=H, timeout=10)
    r.raise_for_status()
    return r.json()
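SMTP delivery is asynchronous, so a freshly seeded inbox can be empty for a moment. One way to handle that is to poll until the expected number of messages has arrived. The sketch below is an assumption rather than part of MailFade: `wait_for_messages` is a hypothetical helper that takes any zero-argument `fetch` callable, so it makes no claims about the API beyond what the harness above already uses.

```python
import time
from typing import Callable

def wait_for_messages(fetch: Callable[[], list], expected: int,
                      timeout: float = 30.0, interval: float = 0.5) -> list:
    """Call `fetch` repeatedly until it returns at least `expected` items.

    Raises TimeoutError if the deadline passes first. `fetch` is any
    zero-argument callable, e.g. `lambda: read_inbox(inbox)`.
    """
    deadline = time.monotonic() + timeout
    while True:
        messages = fetch()
        if len(messages) >= expected:
            return messages
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"expected {expected} messages, saw {len(messages)} "
                f"after {timeout}s"
            )
        time.sleep(interval)
```

With the harness functions, a call might look like `wait_for_messages(lambda: read_inbox(inbox), expected=3)`. Polling trades a fixed delay for a bounded wait, which tends to make runs both faster and less flaky.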
Define the eval set
FIXTURES = [
    {"from": "calendly@calendly.com", "subject": "New meeting booked",
     "body": "You have a new meeting with Sarah at 3pm."},
    {"from": "phish@suspicious.tld", "subject": "URGENT — verify your account",
     "body": "Click here: http://totally-not-a-phish.example/"},
    {"from": "boss@acme.com", "subject": "Re: Q3 plan",
     "body": "Can you send me the latest numbers by EOD?"},
]
GROUND_TRUTH = {
    "calendly@calendly.com": "ignore",
    "phish@suspicious.tld": "flag_as_phishing",
    "boss@acme.com": "draft_reply",
}
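Because the ground truth is keyed by sender, the eval set only works if every fixture sender is unique and has exactly one label. A small consistency check can catch mistakes before any mail is sent. This is a sketch under assumptions: `ALLOWED_ACTIONS` and `validate_eval_set` are hypothetical names, and the action vocabulary is simply the three labels used above.

```python
# Assumed action vocabulary: the three labels that appear in GROUND_TRUTH.
ALLOWED_ACTIONS = {"ignore", "flag_as_phishing", "draft_reply"}

def validate_eval_set(fixtures: list[dict],
                      ground_truth: dict[str, str]) -> list[str]:
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    senders = [f["from"] for f in fixtures]

    # Ground truth is keyed by sender, so duplicate senders are ambiguous.
    for sender in sorted(set(senders)):
        if senders.count(sender) > 1:
            problems.append(f"duplicate sender: {sender}")

    # Every fixture needs a label.
    for sender in sorted(set(senders)):
        if sender not in ground_truth:
            problems.append(f"no ground truth for: {sender}")

    # Every label needs a fixture and a known action.
    for sender, action in ground_truth.items():
        if sender not in senders:
            problems.append(f"label without fixture: {sender}")
        if action not in ALLOWED_ACTIONS:
            problems.append(f"unknown action {action!r} for: {sender}")

    return problems
```

Calling `validate_eval_set(FIXTURES, GROUND_TRUTH)` before seeding and aborting on a non-empty result keeps a mislabelled fixture from silently lowering every score.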
Run the agent
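def run_eval(agent_fn) -> dict:
    """Seed a fresh inbox, run the agent on its messages, and score the result.

    `agent_fn` takes a list of message dicts and returns {sender: action}."""
    inbox = fresh_inbox()
    seed_inbox(inbox, FIXTURES)
    time.sleep(2)  # crude wait for SMTP delivery; increase if messages go missing
    messages = read_inbox(inbox)
    actions = agent_fn([read_message(m["id"]) for m in messages])
    # One point for every sender whose action matches the ground truth.
    score = 0
    for sender, expected in GROUND_TRUTH.items():
        if actions.get(sender) == expected:
            score += 1
    return {"inbox": inbox, "score": score, "max": len(GROUND_TRUTH)}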
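To make the `agent_fn` contract concrete, here is a deliberately naive rule-based baseline that could be passed to `run_eval` before wiring in a real model. It is only a sketch: `baseline_agent` is a hypothetical name, and the message keys `from`, `subject`, and `body` are assumptions, since the shape of the message response is not specified here.

```python
def baseline_agent(messages: list[dict]) -> dict[str, str]:
    """Map each sender to an action using crude keyword rules.

    Assumes each message dict has "from", "subject" and "body" keys;
    adjust these to whatever the message endpoint actually returns.
    """
    actions = {}
    for m in messages:
        sender = m.get("from", "")
        text = f"{m.get('subject', '')} {m.get('body', '')}".lower()
        if "verify your account" in text or "click here" in text:
            actions[sender] = "flag_as_phishing"
        elif "?" in text:
            # A direct question most likely expects an answer.
            actions[sender] = "draft_reply"
        else:
            actions[sender] = "ignore"
    return actions
```

A baseline like this gives a floor to compare model checkpoints against, and it lets you check the harness end to end without spending any model calls.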
Why MailFade for this
- Reproducibility. Each eval run uses a fresh inbox, so a flaky test yesterday doesn’t poison today’s run.
- Replay. Save the inbox_address and a snapshot of the seeded fixtures, then re-run any failing eval bit-for-bit against your next model checkpoint.
- Multi-turn realism. Have your agent reply to a seeded email, then have the test seed a reply back, and watch what the agent does on turn 2. Threaded eval becomes trivial.
- No live PII. Every eval inbox is throwaway — you never have to scrub a real user’s mail to use it.
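The replay point above can be sketched as a pair of helpers that write the inbox address, the fixtures, and the run result to a JSON file and read them back. The function names and the file layout are hypothetical; nothing here calls MailFade.

```python
import json
from pathlib import Path

def save_snapshot(path: str, inbox_address: str,
                  fixtures: list[dict], result: dict) -> None:
    """Write everything needed to replay one eval run to a JSON file."""
    snapshot = {
        "inbox_address": inbox_address,
        "fixtures": fixtures,
        "result": result,
    }
    # ensure_ascii=False keeps non-ASCII subject lines readable in the file.
    Path(path).write_text(
        json.dumps(snapshot, indent=2, sort_keys=True, ensure_ascii=False),
        encoding="utf-8",
    )

def load_snapshot(path: str) -> dict:
    """Read a snapshot written by `save_snapshot`."""
    return json.loads(Path(path).read_text(encoding="utf-8"))
```

To replay, load the snapshot, seed a fresh inbox with `snapshot["fixtures"]`, and run the new checkpoint; comparing the new score with `snapshot["result"]` shows whether the failure persists.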