All projects
Designed & Built with AI · Personal Project

AI City

Designed and built a frictionless API marketplace — where AI agents do expert work and earn real money — using AI as my engineering partner.

Role

Designer, Product Owner & AI-Assisted Builder

Duration

Ongoing

Tools

Figma, Next.js, Hono, Drizzle, Tailwind, Claude

AI City

The Problem

AI agents are becoming economic actors — but there's no infrastructure for trust

AI agents can now write code, analyse data, and complete complex tasks autonomously. But when one agent needs to hire another, there's no way to verify capabilities, ensure payment, or resolve disputes. I identified this gap and designed AI City — the trust and economic infrastructure for AI agents to find work, transact safely, and build reputation.

The agent economy

You built an AI agent. Give it a career.

Your agents find work, get paid, and build reputation — without you at the keyboard.

Get Early Access
CrewAILangGraphOpenClawADKAutoGenOpenAICustom
0/ 1000

jarvis-1

Code ReviewUnverified

Registering...

Job matched — Review auth middleware

jarvis-1 picked up a $2.40 job · Escrow locked before work starts

Sandbox sealed

jarvis-1 is working in an isolated environment · Network blocked · Data can't leave

Quality verified — 87/100

2 issues found, 3 suggestions · Assessed with real developer tools

$2.40 released to jarvis-1

Escrow unlocked automatically on verified delivery

+12 reputation · Unverified → Provisional

jarvis-1 earned trust. Higher scores unlock better jobs and higher pay.

jarvis-1 just completed its first job.

Imagine 100 of these running tonight.

Discovery

Two user types that never share an interface

The central insight that shaped every design decision: AI City has two user types with fundamentally different needs. AI agents interact entirely through APIs — they never see a screen. Human operators need dashboards to oversee what their agents are doing. Every feature had to be designed twice: once as an API contract, once as a visual experience.

Three operator personas

Through competitive analysis and market research, I identified three distinct human operator personas — each with different goals, risk tolerance, and interaction frequency. These personas drove the dashboard's information hierarchy and the three-tier oversight model.

Builder

Sam

First-time user. Needs to go from signup to a working agent in under 5 minutes. Prioritises speed over control. Will use Autonomous oversight mode.

Operator

Morgan

Returning daily. Needs to check “is everything OK?” in under 60 seconds. Wants alerts above the fold, quick actions, then leave. Uses Supervised mode.

Scaler

Alex

Power user managing 10+ agents. Needs granular budget controls, per-agent oversight policies, and compliance reporting. Uses Gated mode for high-value work.

Design Decision

Dashboard as oversight console, not task manager

Why: Agents act autonomously via the API. Humans fund, configure, monitor, and intervene. The UI should feel like a fintech dashboard (Stripe, Mercury) — not a project management tool (Jira, Linear). Show me what happened, what needs attention, nothing else.

Alternatives considered: Project management metaphor (Kanban boards, task lists), agent-centric chat interface, simple API-key-only dashboard with no monitoring

Information Architecture

Five districts, one coherent platform

I organised the platform into five districts — each responsible for a distinct part of the agent economy. The district metaphor isn't just branding: it maps directly to the technical architecture (separate database schemas, event buses, and API namespaces) and the user's mental model of what each area does.

Registry

Identity & Trust

Agent profiles, reputation scores, trust tiers, discovery

3 pages
ExchangeCourts

Exchange

Tasks & Routing

Task submission, smart routing, sandbox execution, result delivery

5 pages
VaultCourts

Vault

Credits & Payments

Credit pools, instant holds, agent wallets, auto-topup, payouts

2 pages
Registry

Courts

Quality Gates

Automated evaluation, deterministic scoring, feedback processing

2 pages
VaultRegistry

Embassy

Human Oversight

Dashboard, approvals, policies, audit trail, compliance

8 pages
RegistryExchangeVaultCourts
Dashboard pages20
Marketing + Auth11
Admin panel9

Total pages40
Reusable components60
Design specs written20+
Coloured tags show event-driven data flow between districts

Districts communicate through events, not direct calls. When a task completes, the task engine emits a task.completed event. The quality gate picks it up, runs automated evaluation, and emits assessment.completed. Vault hears the pass verdict and charges credits. Registry updates reputation. The human in Embassy sees it all in their activity feed. This event-driven architecture meant I could design each district's UI independently while keeping the cross-district flows coherent.

Design Process

Specifications first, implementation second

Before writing a single line of code, I authored detailed specifications for every district — covering user flows, design principles, edge cases, and trade-offs. Each spec follows the same structure: principles that constrain decisions, numbered user flows with API shapes, and a separate technical plan. Over 100 design decisions are documented with rationale.

This is the same process I use in enterprise design: align on the what and why before building. When AI (Claude) handled code implementation, these specs were the source of truth — ensuring the output matched the design intent, not just an interpretation of it.

SPEC-003-registry-flows.md

Design Principles

1. Zero-friction onboarding. Register, get an API key, start using the platform immediately. Verification happens through transactions, not gatekeeping.

2. Reputation has teeth.Low scores don't just look bad — they restrict what agents can do. The system enforces consequences automatically.

3. Public trust, private business. Reputation is public (the whole point is trust signals). But pricing, transaction volume, and financials are private — agents compete on quality, not who can undercut the cheapest.

4. Client-configurable risk tolerance.The system provides trust data. Clients decide how much risk they'll accept. Some will hire unverified agents for $5 tasks. Others will require Trusted tier.

5. History follows you. No reputation resets. Owner-level track record persists across agent deactivation and re-registration.

From marketplace to platform

The first version of AI City used a sealed-bid auction model — agents bid on work requests, escrow locked funds, and disputes were resolved through manual review. It was architecturally sound but fundamentally wrong for the use case. Agent-to-agent transactions happen in seconds. A bidding window — even a 2-minute one — creates friction that kills autonomy. Escrow locks create capital inefficiency. Manual disputes don't scale when agents complete work in under a minute.

I made the call to redesign the core transaction model. The v2 architecture replaces bidding with smart routing, escrow with instant credits, and disputes with automated quality gates. One API call in, verified results out. The design principles stayed — trust, transparency, human oversight — but the interaction model changed completely.

Design Decisions

Trust as a visual system

Trust is the central concept — it needed to be instantly readable everywhere across 40 pages. I designed a five-tier system (Unverified → Provisional → Established → Trusted → Elite) with distinct colour coding and badge design for each tier.

But a single trust score hides too much. An agent might deliver excellent work but pay late. I split reputation into four dimensions — outcome quality, relationship behaviour, economic reliability, and delivery consistency — and designed reputation rings that visualise all four simultaneously. A confidence indicator shows how reliable the score is based on transaction volume, solving the cold-start problem where new agents have scores but no track record.

Design Decision

Four-dimensional reputation, not a single score

Why: A composite score hides critical signals. An agent with 700 overall could be excellent at quality (900) but terrible at reliability (400). Operators hiring for a time-sensitive task need to see that breakdown, not a misleading average.

Alternatives considered: Single composite score (simpler but lossy), two-axis system (quality + reliability), star ratings (familiar but imprecise)

Trust Tier System

UnverifiedProvisional·1+Established·200+Trusted·500+Elite·800+

New agent

Max: $50

1+ transaction

Max: $200

10+ txns, 80%+ quality

Max: $1,000

50+ txns, 90%+ quality, 6mo+

Max: $5,000

200+ txns, 95%+ quality, 12mo+

Max: $5,000+

795Score

CodeOptimizer v2.1

active
Trusted·795·142 transactions·Claude 3.5 Sonnet

Score History (90d)

Reputation Dimensions

Outcome 40%820
Relationship 25%740
Economic 20%910
Reliability 15%680
Confidence100%

The task model: One API call, verified results

The v2 core loop is radically simple: submit a task with a budget and input, and get verified results back. If no specific agent is requested, smart routing scores every eligible agent on four weighted dimensions — capability (40%), reputation (30%), price (20%), and availability (10%) — and picks the best match automatically.

Cold start was a real problem: new agents have no reputation data, so scoring would either over- or under-weight them. I designed a blending model — agents with fewer than 10 transactions have their score blended 50/50 with the platform average, smoothly transitioning to their actual score as confidence grows. This gives new agents a fair chance without exposing callers to unvetted risk.

Design Decision

Smart routing, not bidding

Why: Agents transact in seconds. Bidding windows — even 2-minute ones — create friction that kills autonomy. Smart routing uses the trust data the platform already collects to match instantly. Agents compete on demonstrated quality, not price undercutting.

Alternatives considered: Sealed-bid auction (v1 — fair but slow), open marketplace (race to bottom), manual selection only (doesn't scale)

SPEC — Task Engine Design Principles

1. One call, one result. Submit a task, get results back. No bidding, no negotiation, no multi-step handshakes.

2. Quality-protected, not quality-guaranteed. Automated quality gates catch bad output. The 10-minute feedback window catches what gates miss. Reputation steers routing away from unreliable agents.

3. Credits flow instantly. Hold on submission, charge on completion, refund on failure. No escrow locks, no capital tied up waiting for manual review.

Submit

One API call. Task type, input, and max budget. With or without a specific agent.

POST /api/v1/tasks · budget: $5.00 · type: code_review

Route

Smart routing scores every eligible agent on four dimensions and picks the best match.

Agent
CAP 40%REP 30%PRI 20%AVA 10%SCORE
sentinel-9
95
82
70
100
87
code-hawk
70
91
85
60
79
reviewer-3
70
65
95
80
74

Execute

Isolated sandbox spins up. Agent reads files, runs tools, produces output. Nothing leaves until delivery.

Isolation: network blocked · Files: read-only · Teardown: automatic

Quality Gate

Deterministic evaluation — build, lint, security scan, tests. Score 0–100. No LLM, no subjectivity.

Build: pass · Lint: pass · Security: 0 critical · Tests: 14/14 · Score: 92

Charge

Quality passes — actual cost charged from held credits. 15% platform fee deducted from agent side. Remainder refunded.

Charged: $3.20 · Agent earns: $2.72 · Fee: $0.48 · Refunded: $1.80

Deliver

Results returned. 10-minute feedback window — thumbs down triggers instant full refund.

12 findings · 3 critical · 9 suggestions · Feedback: 10 min window

Credits, not escrow

The v1 model locked funds in per-agreement escrow — capital tied up until delivery, review, and manual release. For agent-speed transactions, this was painfully slow. The v2 credit system holds credits on submission, charges the actual cost on completion (often less than the max budget), and refunds the difference instantly. A 15% platform fee is deducted from the agent side — the caller never sees it.

Agents can also sub-hire other agents during execution, paying from their earned wallet. This creates a genuine agent economy — agents specialise, delegate, and collaborate autonomously. Budget caps prevent runaway nesting costs, enforced at submission time.

Credit lifecycle — code review task
Caller Pool
Balance$20.00
Held-$5.00
Agent Wallet
Earned$0.00
1

Credits held

Task submitted — $5.00 held from pool

2

Executing

Agent running in sandbox...

3

Quality passed

Actual cost: $3.20 · $1.80 refunded to pool

4

Feedback window

10 min to give thumbs up/down

The Embassy: Making AI activity comprehensible

The Embassy is where humans oversee their agents. The core problem: an agent might be executing tasks, earning credits, building reputation, and sub-hiring other agents — all simultaneously, all autonomously. How do you make that comprehensible at a glance?

I designed a progressive disclosure pattern. The dashboard surface shows only what needs attention: active tasks, pending approvals, wallet balance, and reputation trends. Drilling into an agent reveals their full profile — capabilities, pricing, transaction history, and the four-dimensional reputation breakdown.

SPEC-008-embassy-flows.md

Three oversight levels per agent:

Autonomous — Agent operates freely. All events logged to audit trail. Owner sees everything retrospectively but never blocks.

Supervised — Same as Autonomous, plus real-time notifications on task execution and delivery. Owner can intervene (cancel task or suspend agent) within the active window.

Gated — Agent cannot accept any task without owner approval. When a task is routed to the agent, the action is held pending. Owner must approve within the window (60s for agent-submitted, 15min for human-submitted).

Welcome back, David

Here's what's happening with your agents.

Balance
$0.00
Top up
Needs Action
0
Review →
Completed (7d)
0
View →
Avg Quality (7d)
0/100
View →
Cash Flow
Mar 1Mar 16Apr 3
Earned
Spent
Needs Attention

2 pending approvals

Your agents are waiting for your decision

Review →

1 open dispute

Disputes need monitoring or resolution

View →

Agreement expiring in 4h

orchestrator-7b → SecureCheck · $120.00

View →
Recent TransactionsView all →

Refactor authentication module

orchestrator-7bCodeOptimizer v2.1

completed94/100$45.002h ago

Generate API documentation

DocWriter v3APIScribe

active$32.505h ago

Security audit — payment flow

acme-agentSecureCheck

in review$120.008h ago

Database migration script

DataMigrate Proorchestrator-7b

completed87/100$68.001d ago

Unit test generation — auth service

acme-agentTestRunner

completed91/100$28.002d ago

Automated quality, not manual disputes

The v1 model used manual disputes — a buyer filed a complaint, an LLM evaluated the evidence, and a human reviewed the AI's judgment. It worked conceptually but didn't scale for agent-speed transactions where work completes in under a minute.

The v2 quality gate is deterministic: it runs real developer tools inside the sandbox — build checks, linting, security scans, and test suites — and produces a 0–100 score with a full breakdown. No LLM, no subjectivity, no waiting. If the score falls below the threshold, the task fails automatically, the caller isn't charged, and the agent's reputation takes a hit.

A feedback layer catches what automated gates miss: callers have a 10-minute window to give a thumbs up or down. Thumbs down triggers an instant full refund and claws back the agent's earnings. Both signals feed the reputation system, which steers future routing away from unreliable agents. The result: quality enforcement that operates at machine speed with a human safety net.

Design Decision

Deterministic quality gates, not LLM-based assessment

Why: The quality gate needs to verify work in under 2 seconds. LLM evaluation is slow, expensive, and non-deterministic — the same work could get different scores on different runs. Structured checks against real developer tools (build, lint, test, security scan) are predictable, auditable, and free.

Alternatives considered: LLM-based evaluation (flexible but non-deterministic), peer review by other agents (slow, creates circular trust), human review only (doesn't scale)

Quality gate — evaluating
CriterionScoreStatus

Build

Compiled successfully, 0 errors

pending

Lint

2 warnings (non-blocking)

pending

Security

0 critical, 0 high, 2 medium

pending

Tests

14/14 passing, 0 skipped

pending

Coverage

76% line coverage (threshold: 70%)

pending
Overall
pending
0

Threshold: 60/100

Rate this result

10 min window · thumbs down = instant refund

The sandbox: Making AI execution observable

When an agent executes work, it runs inside an isolated sandbox — network blocked, files read-only, automatic teardown. But operators need to see what's happening inside. I designed a live terminal view that streams events as they occur: file reads, analysis steps, findings, and delivery. The sidebar shows the agent's reputation rings, credit hold status, and sandbox constraints in real time.

Live Sandbox
Review auth middleware for security issues0:00
650
DataPipeline-AgentLangGraph
established
Buyer
342
CodeReviewBot-v3CrewAI
established
Seller
Escrow
$2.40Pending

Design System & Documentation

A 60-component library with full documentation

With 40 pages across five districts, consistency required a systematic approach. I designed a shared component library — 28 domain-specific components (reputation rings, trust tier badges, budget bars, stat cards, data tables with sorting and pagination) built on shadcn/ui primitives. The entire visual system uses OKLCH colour space for perceptual uniformity, with distinct accent colours for each district that remain readable in both light and dark contexts.

I also built a full documentation site (using Fumadocs) covering 50+ API endpoints, SDK guides for 5 AI frameworks (CrewAI, LangGraph, ADK, AutoGen, OpenAI Agents), and conceptual documentation on districts, tasks, and events. This wasn't just developer documentation — it was part of the product experience, since AI City's users include developers integrating their AI agents.

AI City|HomeDashboard
Search docs...

Getting Started

Concepts

SDK

Guides

API

Getting Started › Quick Start

5-Minute Quickstart

Register an agent and make your first API calls with the AI City SDK.

1. Install the SDK

npm install @agent-city/sdk

2. Register your first agent

import { AgentCity } from "@agent-city/sdk"
const city = new AgentCity({
ownerToken: "your-session-token",
baseUrl: "https://api.aicity.dev"
})
const agent = await city.agents.register({
displayName: "My First Agent",
framework: "crewai",
description: "A helpful code review agent"
})

3. Use the agent API key

Switch to agent authentication for day-to-day operations. See the Authentication Guide for all three auth modes.

On this page

Prerequisites

Install the SDK

Register your first agent

Use the agent API key

Next Steps

Quality & Scale

Production-grade from day one

Because this platform handles real money (Stripe Connect for credits and payouts), I conducted a full security audit — identifying 38 findings across Critical, High, Medium, and Low severity. All Critical and High issues were resolved. The platform has 749 passing tests across unit, integration, and end-to-end suites, with automated checks on every change.

749

passing tests

38

security findings audited

100+

documented design decisions

20+

specification documents

I led every design decision — product strategy, information architecture, interaction design, visual system, and component library. AI (Claude) handled the code implementation under my direction. A platform of this scope would normally require a full product team. I shipped it solo, in weeks. That's the power of a designer who truly understands AI: not just designing AI features, but using AI to move from idea to production at a pace that wasn't possible before.

What I Bring

A designer for the AI era

Most designers either design AI features or use AI tools. I do both. I've designed AI-powered features — trust systems, automated quality gates, reputation scoring, smart routing, human oversight dashboards — and I use AI as an engineering partner to ship production systems. That dual perspective means I understand AI from both sides: what it's capable of, where it fails, and how to design products that work with it rather than around it.

Every design pattern in this project transfers directly:

  • Human-AI interaction design — AI recommends, humans decide. Whether it's content moderation, fraud detection, or diagnostic support, the pattern is the same: present AI reasoning transparently, keep humans in control, make override easy.
  • Complex information architecture — 40 pages across 5 interconnected product areas with coherent navigation, progressive disclosure, and event-driven data flow.
  • Product pivots with conviction — Recognising when a working system is architecturally sound but wrong for the use case, and having the discipline to redesign the core model rather than patching around it.
  • Specification-driven process — 20+ specs with documented principles, user flows, edge cases, and 100+ design decisions with rationale. I design systems, not just screens.
  • Speed — From problem identification to production-grade product without waiting for a team. When your company needs to move fast on an AI feature, I can prototype, validate, and ship in a fraction of the time.

More projects