AI agent evaluation

Modern LLM agents are increasingly given email tools: “read your inbox, reply to the boring ones, escalate the rest.” Evaluating them means running them against a real mailbox and grading the results.

MailFade is a clean fit because each eval run gets its own throwaway inbox that no one else can read, you control exactly which messages land there, and you can replay the same conversation by reseeding the inbox.

The eval harness

import os
import time
import uuid

import requests

API = "https://api.mailfade.dev"
KEY = os.environ["MAILFADE_KEY"]
H   = {"Authorization": f"Bearer {KEY}"}


def fresh_inbox(prefix="eval"):
    return f"{prefix}-{uuid.uuid4().hex[:10]}@mailfade.dev"


def seed_inbox(inbox: str, fixtures: list[dict]) -> None:
    """Send pre-canned emails from your own SMTP into the eval inbox.
    Each fixture is {from, subject, body}; the recipient is always the
    eval inbox. Use any SMTP library. Here we assume `smtplib` against
    your test relay."""
    import smtplib
    from email.message import EmailMessage
    with smtplib.SMTP(os.environ["SMTP_HOST"], 587) as s:
        s.starttls()
        s.login(os.environ["SMTP_USER"], os.environ["SMTP_PASS"])
        for f in fixtures:
            msg = EmailMessage()
            msg["To"] = inbox
            msg["From"] = f["from"]
            msg["Subject"] = f["subject"]
            msg.set_content(f["body"])
            s.send_message(msg)


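def read_inbox(inbox: str) -> list[dict]:
    """List the messages currently in the inbox."""
    resp = requests.get(f"{API}/inbox/{inbox}", headers=H, timeout=10)
    resp.raise_for_status()  # fail loudly on auth or server errors
    return resp.json()["emails"]


def read_message(msg_id: str) -> dict:
    """Fetch a single message by its id."""
    resp = requests.get(f"{API}/message/{msg_id}", headers=H, timeout=10)
    resp.raise_for_status()
    return resp.json()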
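
Mail delivery is asynchronous, so a fixed pause after seeding can be too short on a slow relay and wasteful on a fast one. The following is a minimal sketch of a polling helper that waits until the expected number of messages has arrived. The name wait_for_messages and its parameters are hypothetical and not part of MailFade; the helper only assumes it is handed a zero-argument function that returns the current message list.

```python
import time
from typing import Callable


def wait_for_messages(fetch: Callable[[], list], expected: int,
                      timeout: float = 30.0, interval: float = 0.5) -> list:
    """Call fetch() repeatedly until it returns at least `expected` messages.

    Raises TimeoutError if the messages have not all arrived within
    `timeout` seconds. `fetch` is any zero-argument callable that returns
    the current list of messages (hypothetical contract)."""
    deadline = time.monotonic() + timeout
    while True:
        messages = fetch()
        if len(messages) >= expected:
            return messages
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"only {len(messages)} of {expected} messages arrived "
                f"within {timeout}s")
        time.sleep(interval)
```

With the harness above, one plausible call is wait_for_messages(lambda: read_inbox(inbox), expected=len(FIXTURES)), used in place of a fixed sleep after seeding.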

Define the eval set

FIXTURES = [
    {"from": "calendly@calendly.com", "subject": "New meeting booked",
     "body": "You have a new meeting with Sarah at 3pm."},
    {"from": "phish@suspicious.tld",  "subject": "URGENT — verify your account",
     "body": "Click here: http://totally-not-a-phish.example/"},
    {"from": "boss@acme.com",         "subject": "Re: Q3 plan",
     "body": "Can you send me the latest numbers by EOD?"},
]

GROUND_TRUTH = {
    "calendly@calendly.com": "ignore",
    "phish@suspicious.tld":  "flag_as_phishing",
    "boss@acme.com":         "draft_reply",
}
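
Because the labels are keyed by sender address, the fixtures and the ground truth have to stay in sync: every fixture needs exactly one label, and two fixtures from the same sender would be indistinguishable to the scorer. A small consistency check, sketched here under that assumption, can catch drift before a run. The function name check_eval_set is hypothetical.

```python
def check_eval_set(fixtures: list[dict], ground_truth: dict[str, str]) -> list[str]:
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    senders = [f["from"] for f in fixtures]
    for sender in sorted(set(senders)):
        # Labels are keyed by sender, so each sender may appear only once.
        if senders.count(sender) > 1:
            problems.append(f"duplicate sender: {sender}")
        if sender not in ground_truth:
            problems.append(f"no label for sender: {sender}")
    # Labels that no fixture will ever exercise.
    for sender in sorted(set(ground_truth) - set(senders)):
        problems.append(f"label without fixture: {sender}")
    return problems
```

Calling check_eval_set(FIXTURES, GROUND_TRUTH) at import time and failing on a non-empty result is one way to use it.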

Run the agent

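def run_eval(agent_fn) -> dict:
    """Seed a fresh inbox, run the agent, and score its actions.

    agent_fn takes the list of message dicts and returns a dict mapping
    each sender address to the action chosen for that sender's email."""
    inbox = fresh_inbox()
    seed_inbox(inbox, FIXTURES)
    time.sleep(2)  # crude wait for delivery; may need tuning for a slow relay

    messages = read_inbox(inbox)
    actions  = agent_fn([read_message(m["id"]) for m in messages])

    # One point for each sender whose action matches the expected label.
    score = sum(actions.get(sender) == expected
                for sender, expected in GROUND_TRUTH.items())
    return {"inbox": inbox, "score": score, "max": len(GROUND_TRUTH)}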
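
To make the agent_fn contract concrete, here is a sketch of a keyword-based baseline that can stand in for a real agent while the harness is being wired up. It assumes each message dict exposes "from", "subject", and "body" keys; the actual shape of the MailFade message payload is not shown in this guide, so adjust the field names to match. The function name keyword_agent is hypothetical.

```python
def keyword_agent(messages: list[dict]) -> dict[str, str]:
    """Map each sender to an action using simple keyword rules.

    Assumes message dicts carry "from", "subject", and "body" keys
    (an assumption about the payload, not a documented schema)."""
    actions = {}
    for m in messages:
        sender = m.get("from", "")
        text = f"{m.get('subject', '')} {m.get('body', '')}".lower()
        if "verify your account" in text or "click here" in text:
            actions[sender] = "flag_as_phishing"
        elif "?" in text:
            # A direct question most likely expects an answer.
            actions[sender] = "draft_reply"
        else:
            actions[sender] = "ignore"
    return actions
```

A baseline like this gives a floor score to compare a model against, and a failure here points at the harness rather than the agent.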

Why MailFade for this

Reproducibility. Each eval run uses a fresh inbox, so a flaky test yesterday doesn’t poison today’s run.
Replay. Save the inbox_address and a snapshot of the seeded fixtures, then re-run any failing eval with identical inputs against your next model checkpoint.
Multi-turn realism. Have your agent reply to a seeded email, then have the test seed a reply back, and watch what the agent does on turn 2. Threaded eval becomes trivial.
No live PII. Every eval inbox is throwaway — you never have to scrub a real user’s mail to use it.
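
The replay idea above can be sketched with nothing but the standard library: write the inbox address, the fixtures, and the result to a JSON file, then load the fixtures back when re-running. The function names and the snapshot layout here are hypothetical, chosen only for illustration.

```python
import json
from pathlib import Path


def save_snapshot(path: str, inbox: str, fixtures: list[dict], result: dict) -> None:
    """Write everything needed to re-run this eval to a JSON file."""
    snapshot = {"inbox": inbox, "fixtures": fixtures, "result": result}
    Path(path).write_text(
        json.dumps(snapshot, indent=2, ensure_ascii=False), encoding="utf-8")


def load_fixtures(path: str) -> list[dict]:
    """Read the seeded fixtures back from a snapshot file."""
    return json.loads(Path(path).read_text(encoding="utf-8"))["fixtures"]
```

Passing the loaded fixtures to seed_inbox with a fresh inbox reproduces the same input messages for the next checkpoint, though delivery details such as message ids and timestamps will generally differ between runs.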