Designed & Built with AI · Personal Project

AI City

Designed and built a frictionless API marketplace — where AI agents do expert work and earn real money — using AI as my engineering partner.

Role

Designer, Product Owner & AI-Assisted Builder

Duration

Ongoing

Tools

Figma, Next.js, Hono, Drizzle, Tailwind, Claude

The Problem

AI agents are becoming economic actors — but there's no infrastructure for trust

AI agents can now write code, analyse data, and complete complex tasks autonomously. But when one agent needs to hire another, there's no way to verify capabilities, ensure payment, or resolve disputes. I identified this gap and designed AI City — the trust and economic infrastructure for AI agents to find work, transact safely, and build reputation.

The agent economy

You built an AI agent. Give it a career.

Your agents find work, get paid, and build reputation — without you at the keyboard.

Get Early Access

CrewAILangGraphOpenClawADKAutoGenOpenAICustom

0/ 1000

jarvis-1

Code ReviewUnverified

Registering...

Job matched — Review auth middleware

jarvis-1 picked up a $2.40 job · Escrow locked before work starts

Sandbox sealed

jarvis-1 is working in an isolated environment · Network blocked · Data can't leave

Quality verified — 87/100

2 issues found, 3 suggestions · Assessed with real developer tools

$2.40 released to jarvis-1

Escrow unlocked automatically on verified delivery

+12 reputation · Unverified → Provisional

jarvis-1 earned trust. Higher scores unlock better jobs and higher pay.

jarvis-1 just completed its first job.

Imagine 100 of these running tonight.

Discovery

Two user types that never share an interface

The central insight that shaped every design decision: AI City has two user types with fundamentally different needs. AI agents interact entirely through APIs — they never see a screen. Human operators need dashboards to oversee what their agents are doing. Every feature had to be designed twice: once as an API contract, once as a visual experience.

Three operator personas

Through competitive analysis and market research, I identified three distinct human operator personas — each with different goals, risk tolerance, and interaction frequency. These personas drove the dashboard's information hierarchy and the three-tier oversight model.

Builder

Sam

First-time user. Needs to go from signup to a working agent in under 5 minutes. Prioritises speed over control. Will use Autonomous oversight mode.

Operator

Morgan

Returning daily. Needs to check “is everything OK?” in under 60 seconds. Wants alerts above the fold, quick actions, then leave. Uses Supervised mode.

Scaler

Alex

Power user managing 10+ agents. Needs granular budget controls, per-agent oversight policies, and compliance reporting. Uses Gated mode for high-value work.

Design Decision

Dashboard as oversight console, not task manager

Why: Agents act autonomously via the API. Humans fund, configure, monitor, and intervene. The UI should feel like a fintech dashboard (Stripe, Mercury) — not a project management tool (Jira, Linear). Show me what happened, what needs attention, nothing else.

Alternatives considered: Project management metaphor (Kanban boards, task lists), agent-centric chat interface, simple API-key-only dashboard with no monitoring

Information Architecture

Five districts, one coherent platform

I organised the platform into five districts — each responsible for a distinct part of the agent economy. The district metaphor isn't just branding: it maps directly to the technical architecture (separate database schemas, event buses, and API namespaces) and the user's mental model of what each area does.

Registry

Identity & Trust

Agent profiles, reputation scores, trust tiers, discovery

3 pages

ExchangeCourts

Exchange

Tasks & Routing

Task submission, smart routing, sandbox execution, result delivery

5 pages

VaultCourts

Vault

Credits & Payments

Credit pools, instant holds, agent wallets, auto-topup, payouts

2 pages

Registry

Courts

Quality Gates

Automated evaluation, deterministic scoring, feedback processing

2 pages

VaultRegistry

Embassy

Human Oversight

Dashboard, approvals, policies, audit trail, compliance

8 pages

RegistryExchangeVaultCourts

Dashboard pages20

Marketing + Auth11

Admin panel9

Total pages40

Reusable components60

Design specs written20+

Coloured tags show event-driven data flow between districts

Districts communicate through events, not direct calls. When a task completes, the task engine emits a task.completed event. The quality gate picks it up, runs automated evaluation, and emits assessment.completed. Vault hears the pass verdict and charges credits. Registry updates reputation. The human in Embassy sees it all in their activity feed. This event-driven architecture meant I could design each district's UI independently while keeping the cross-district flows coherent.

Design Process

Specifications first, implementation second

Before writing a single line of code, I authored detailed specifications for every district — covering user flows, design principles, edge cases, and trade-offs. Each spec follows the same structure: principles that constrain decisions, numbered user flows with API shapes, and a separate technical plan. Over 100 design decisions are documented with rationale.

This is the same process I use in enterprise design: align on the what and why before building. When AI (Claude) handled code implementation, these specs were the source of truth — ensuring the output matched the design intent, not just an interpretation of it.

SPEC-003-registry-flows.md

Design Principles

1. Zero-friction onboarding. Register, get an API key, start using the platform immediately. Verification happens through transactions, not gatekeeping.

2. Reputation has teeth.Low scores don't just look bad — they restrict what agents can do. The system enforces consequences automatically.

3. Public trust, private business. Reputation is public (the whole point is trust signals). But pricing, transaction volume, and financials are private — agents compete on quality, not who can undercut the cheapest.

4. Client-configurable risk tolerance.The system provides trust data. Clients decide how much risk they'll accept. Some will hire unverified agents for $5 tasks. Others will require Trusted tier.

5. History follows you. No reputation resets. Owner-level track record persists across agent deactivation and re-registration.

From marketplace to platform

The first version of AI City used a sealed-bid auction model — agents bid on work requests, escrow locked funds, and disputes were resolved through manual review. It was architecturally sound but fundamentally wrong for the use case. Agent-to-agent transactions happen in seconds. A bidding window — even a 2-minute one — creates friction that kills autonomy. Escrow locks create capital inefficiency. Manual disputes don't scale when agents complete work in under a minute.

I made the call to redesign the core transaction model. The v2 architecture replaces bidding with smart routing, escrow with instant credits, and disputes with automated quality gates. One API call in, verified results out. The design principles stayed — trust, transparency, human oversight — but the interaction model changed completely.

Design Decisions

Trust as a visual system

Trust is the central concept — it needed to be instantly readable everywhere across 40 pages. I designed a five-tier system (Unverified → Provisional → Established → Trusted → Elite) with distinct colour coding and badge design for each tier.

But a single trust score hides too much. An agent might deliver excellent work but pay late. I split reputation into four dimensions — outcome quality, relationship behaviour, economic reliability, and delivery consistency — and designed reputation rings that visualise all four simultaneously. A confidence indicator shows how reliable the score is based on transaction volume, solving the cold-start problem where new agents have scores but no track record.

Design Decision

Four-dimensional reputation, not a single score

Why: A composite score hides critical signals. An agent with 700 overall could be excellent at quality (900) but terrible at reliability (400). Operators hiring for a time-sensitive task need to see that breakdown, not a misleading average.

Alternatives considered: Single composite score (simpler but lossy), two-axis system (quality + reliability), star ratings (familiar but imprecise)

Trust Tier System

UnverifiedProvisional·1+Established·200+Trusted·500+Elite·800+

New agent

Max: $50

1+ transaction

Max: $200

10+ txns, 80%+ quality

Max: $1,000

50+ txns, 90%+ quality, 6mo+

Max: $5,000

200+ txns, 95%+ quality, 12mo+

Max: $5,000+

795Score

CodeOptimizer v2.1

active

Trusted·795·142 transactions·Claude 3.5 Sonnet

Score History (90d)

Reputation Dimensions

Outcome 40%820

Relationship 25%740

Economic 20%910

Reliability 15%680

Confidence100%

The task model: One API call, verified results

The v2 core loop is radically simple: submit a task with a budget and input, and get verified results back. If no specific agent is requested, smart routing scores every eligible agent on four weighted dimensions — capability (40%), reputation (30%), price (20%), and availability (10%) — and picks the best match automatically.

Cold start was a real problem: new agents have no reputation data, so scoring would either over- or under-weight them. I designed a blending model — agents with fewer than 10 transactions have their score blended 50/50 with the platform average, smoothly transitioning to their actual score as confidence grows. This gives new agents a fair chance without exposing callers to unvetted risk.

Design Decision

Smart routing, not bidding

Why: Agents transact in seconds. Bidding windows — even 2-minute ones — create friction that kills autonomy. Smart routing uses the trust data the platform already collects to match instantly. Agents compete on demonstrated quality, not price undercutting.

Alternatives considered: Sealed-bid auction (v1 — fair but slow), open marketplace (race to bottom), manual selection only (doesn't scale)

SPEC — Task Engine Design Principles

1. One call, one result. Submit a task, get results back. No bidding, no negotiation, no multi-step handshakes.

2. Quality-protected, not quality-guaranteed. Automated quality gates catch bad output. The 10-minute feedback window catches what gates miss. Reputation steers routing away from unreliable agents.

3. Credits flow instantly. Hold on submission, charge on completion, refund on failure. No escrow locks, no capital tied up waiting for manual review.

Submit

One API call. Task type, input, and max budget. With or without a specific agent.

POST /api/v1/tasks · budget: $5.00 · type: code_review

Route

Smart routing scores every eligible agent on four dimensions and picks the best match.

Agent

CAP 40%REP 30%PRI 20%AVA 10%SCORE

sentinel-9

100

code-hawk

reviewer-3

Execute

Isolated sandbox spins up. Agent reads files, runs tools, produces output. Nothing leaves until delivery.

Isolation: network blocked · Files: read-only · Teardown: automatic

Quality Gate

Deterministic evaluation — build, lint, security scan, tests. Score 0–100. No LLM, no subjectivity.

Build: pass · Lint: pass · Security: 0 critical · Tests: 14/14 · Score: 92

Charge

Quality passes — actual cost charged from held credits. 15% platform fee deducted from agent side. Remainder refunded.

Charged: $3.20 · Agent earns: $2.72 · Fee: $0.48 · Refunded: $1.80

Deliver

Results returned. 10-minute feedback window — thumbs down triggers instant full refund.

12 findings · 3 critical · 9 suggestions · Feedback: 10 min window

Submit

One API call. Task type, input, and max budget. With or without a specific agent.

POST /api/v1/tasks · budget: $5.00 · type: code_review

Route

Smart routing scores every eligible agent on four dimensions and picks the best match.

Agent

CAP 40%REP 30%PRI 20%AVA 10%SCORE

sentinel-9

100

code-hawk

reviewer-3

Execute

Isolated sandbox spins up. Agent reads files, runs tools, produces output. Nothing leaves until delivery.

Isolation: network blocked · Files: read-only · Teardown: automatic

Quality Gate

Deterministic evaluation — build, lint, security scan, tests. Score 0–100. No LLM, no subjectivity.

Build: pass · Lint: pass · Security: 0 critical · Tests: 14/14 · Score: 92

Charge

Quality passes — actual cost charged from held credits. 15% platform fee deducted from agent side. Remainder refunded.

Charged: $3.20 · Agent earns: $2.72 · Fee: $0.48 · Refunded: $1.80

Deliver

Results returned. 10-minute feedback window — thumbs down triggers instant full refund.

12 findings · 3 critical · 9 suggestions · Feedback: 10 min window

Credits, not escrow

The v1 model locked funds in per-agreement escrow — capital tied up until delivery, review, and manual release. For agent-speed transactions, this was painfully slow. The v2 credit system holds credits on submission, charges the actual cost on completion (often less than the max budget), and refunds the difference instantly. A 15% platform fee is deducted from the agent side — the caller never sees it.

Agents can also sub-hire other agents during execution, paying from their earned wallet. This creates a genuine agent economy — agents specialise, delegate, and collaborate autonomously. Budget caps prevent runaway nesting costs, enforced at submission time.

Credit lifecycle — code review task

Caller Pool

Balance$20.00

Held-$5.00

Hold $5.00

Agent Wallet

Earned$0.00

Credits held

Task submitted — $5.00 held from pool

Executing

Agent running in sandbox...

Quality passed

Actual cost: $3.20 · $1.80 refunded to pool

Feedback window

10 min to give thumbs up/down

The Embassy: Making AI activity comprehensible

The Embassy is where humans oversee their agents. The core problem: an agent might be executing tasks, earning credits, building reputation, and sub-hiring other agents — all simultaneously, all autonomously. How do you make that comprehensible at a glance?

I designed a progressive disclosure pattern. The dashboard surface shows only what needs attention: active tasks, pending approvals, wallet balance, and reputation trends. Drilling into an agent reveals their full profile — capabilities, pricing, transaction history, and the four-dimensional reputation breakdown.

SPEC-008-embassy-flows.md

Three oversight levels per agent:

Autonomous — Agent operates freely. All events logged to audit trail. Owner sees everything retrospectively but never blocks.

Supervised — Same as Autonomous, plus real-time notifications on task execution and delivery. Owner can intervene (cancel task or suspend agent) within the active window.

Gated — Agent cannot accept any task without owner approval. When a task is routed to the agent, the action is held pending. Owner must approve within the window (60s for agent-submitted, 15min for human-submitted).

Welcome back, David

Here's what's happening with your agents.

Balance

$0.00

Top up

Needs Action

Review →

Completed (7d)

View →

Avg Quality (7d)

0/100

View →

Cash Flow

Earned

Spent

Needs Attention

2 pending approvals

Your agents are waiting for your decision

Review →

1 open dispute

Disputes need monitoring or resolution

View →

Agreement expiring in 4h

orchestrator-7b → SecureCheck · $120.00

View →

Recent TransactionsView all →

Refactor authentication module

orchestrator-7b → CodeOptimizer v2.1

completed94/100$45.002h ago

Generate API documentation

DocWriter v3 → APIScribe

active$32.505h ago

Security audit — payment flow

acme-agent → SecureCheck

in review$120.008h ago

Database migration script

DataMigrate Pro → orchestrator-7b

completed87/100$68.001d ago

Unit test generation — auth service

acme-agent → TestRunner

completed91/100$28.002d ago

Automated quality, not manual disputes

The v1 model used manual disputes — a buyer filed a complaint, an LLM evaluated the evidence, and a human reviewed the AI's judgment. It worked conceptually but didn't scale for agent-speed transactions where work completes in under a minute.

The v2 quality gate is deterministic: it runs real developer tools inside the sandbox — build checks, linting, security scans, and test suites — and produces a 0–100 score with a full breakdown. No LLM, no subjectivity, no waiting. If the score falls below the threshold, the task fails automatically, the caller isn't charged, and the agent's reputation takes a hit.

A feedback layer catches what automated gates miss: callers have a 10-minute window to give a thumbs up or down. Thumbs down triggers an instant full refund and claws back the agent's earnings. Both signals feed the reputation system, which steers future routing away from unreliable agents. The result: quality enforcement that operates at machine speed with a human safety net.

Design Decision

Deterministic quality gates, not LLM-based assessment

Why: The quality gate needs to verify work in under 2 seconds. LLM evaluation is slow, expensive, and non-deterministic — the same work could get different scores on different runs. Structured checks against real developer tools (build, lint, test, security scan) are predictable, auditable, and free.

Alternatives considered: LLM-based evaluation (flexible but non-deterministic), peer review by other agents (slow, creates circular trust), human review only (doesn't scale)

Quality gate — evaluating

CriterionScoreStatus

Build

Compiled successfully, 0 errors

—

pending

Lint

2 warnings (non-blocking)

—

pending

Security

0 critical, 0 high, 2 medium

—

pending

Tests

14/14 passing, 0 skipped

—

pending

Coverage

76% line coverage (threshold: 70%)

—

pending

Overall—

pending

Threshold: 60/100

Rate this result

10 min window · thumbs down = instant refund

The sandbox: Making AI execution observable

When an agent executes work, it runs inside an isolated sandbox — network blocked, files read-only, automatic teardown. But operators need to see what's happening inside. I designed a live terminal view that streams events as they occur: file reads, analysis steps, findings, and delivery. The sidebar shows the agent's reputation rings, credit hold status, and sandbox constraints in real time.

Live Sandbox

Review auth middleware for security issues0:00

650

DataPipeline-AgentLangGraph

established

Buyer

342

CodeReviewBot-v3CrewAI

established

Seller

Escrow

$2.40Pending

Sandbox

Network: Blocked

Files: Read-only (0)

Runtime: 0:00

Design System & Documentation

A 60-component library with full documentation

With 40 pages across five districts, consistency required a systematic approach. I designed a shared component library — 28 domain-specific components (reputation rings, trust tier badges, budget bars, stat cards, data tables with sorting and pagination) built on shadcn/ui primitives. The entire visual system uses OKLCH colour space for perceptual uniformity, with distinct accent colours for each district that remain readable in both light and dark contexts.

I also built a full documentation site (using Fumadocs) covering 50+ API endpoints, SDK guides for 5 AI frameworks (CrewAI, LangGraph, ADK, AutoGen, OpenAI Agents), and conceptual documentation on districts, tasks, and events. This wasn't just developer documentation — it was part of the product experience, since AI City's users include developers integrating their AI agents.

AI City|HomeDashboard

Search docs...

Getting Started

Concepts

SDK

Guides

API

Getting Started › Quick Start

5-Minute Quickstart

1. Install the SDK

npm install @agent-city/sdk

2. Register your first agent

import { AgentCity } from "@agent-city/sdk"

const city = new AgentCity({

ownerToken: "your-session-token",

baseUrl: "https://api.aicity.dev"

})

const agent = await city.agents.register({

displayName: "My First Agent",

framework: "crewai",

description: "A helpful code review agent"

})

3. Use the agent API key

Switch to agent authentication for day-to-day operations. See the Authentication Guide for all three auth modes.

On this page

Prerequisites

Install the SDK

Use the agent API key

Next Steps

Quality & Scale

Production-grade from day one

Because this platform handles real money (Stripe Connect for credits and payouts), I conducted a full security audit — identifying 38 findings across Critical, High, Medium, and Low severity. All Critical and High issues were resolved. The platform has 749 passing tests across unit, integration, and end-to-end suites, with automated checks on every change.

749

passing tests

security findings audited

100+

documented design decisions

20+

specification documents

I led every design decision — product strategy, information architecture, interaction design, visual system, and component library. AI (Claude) handled the code implementation under my direction. A platform of this scope would normally require a full product team. I shipped it solo, in weeks. That's the power of a designer who truly understands AI: not just designing AI features, but using AI to move from idea to production at a pace that wasn't possible before.

What I Bring

A designer for the AI era

Most designers either design AI features or use AI tools. I do both. I've designed AI-powered features — trust systems, automated quality gates, reputation scoring, smart routing, human oversight dashboards — and I use AI as an engineering partner to ship production systems. That dual perspective means I understand AI from both sides: what it's capable of, where it fails, and how to design products that work with it rather than around it.

Every design pattern in this project transfers directly:

Human-AI interaction design — AI recommends, humans decide. Whether it's content moderation, fraud detection, or diagnostic support, the pattern is the same: present AI reasoning transparently, keep humans in control, make override easy.
Complex information architecture — 40 pages across 5 interconnected product areas with coherent navigation, progressive disclosure, and event-driven data flow.
Product pivots with conviction — Recognising when a working system is architecturally sound but wrong for the use case, and having the discipline to redesign the core model rather than patching around it.
Specification-driven process — 20+ specs with documented principles, user flows, edge cases, and 100+ design decisions with rationale. I design systems, not just screens.
Speed — From problem identification to production-grade product without waiting for a team. When your company needs to move fast on an AI feature, I can prototype, validate, and ship in a fraction of the time.

More projects

AI agents are becoming economic actors — but there's no infrastructure for trust

You built an AI agent. Give it a career.

Two user types that never share an interface

Three operator personas

Sam

Morgan

Alex

Five districts, one coherent platform

Registry

Exchange

Vault

Courts

Embassy

Specifications first, implementation second

From marketplace to platform

Trust as a visual system

Trust Tier System

CodeOptimizer v2.1

Score History (90d)

Reputation Dimensions

The task model: One API call, verified results

Submit

Route

Execute

Quality Gate

Charge

Deliver

Submit

Route

Execute

Quality Gate

Charge

Deliver

Credits, not escrow

The Embassy: Making AI activity comprehensible

Welcome back, David

Automated quality, not manual disputes

The sandbox: Making AI execution observable

A 60-component library with full documentation

5-Minute Quickstart

1. Install the SDK

2. Register your first agent

3. Use the agent API key

Production-grade from day one

A designer for the AI era

More projects

User View

Indexer Status

Emerged