Architecture · Design Thinking
AutoDeclare — AI-as-Judge Pipeline
C4 Model decomposition & design rationale for the automated prompt quality and output fidelity assessment system on AWS
01 · Design Thinking — The Problem & Philosophy
AutoDeclare exists to close a feedback loop that the existing clinical AI pipeline leaves open. The pipeline generates structured psychiatric report sections from transcripts — but there is no mechanism to assess whether those prompts are well-designed or whether the outputs they produce are clinically faithful. Quality assessment is currently manual, static, and decoupled from runtime behaviour.
The Central Tension
Using an LLM to judge another LLM introduces a paradox: the judge must be objective and deterministic, yet LLMs are inherently probabilistic. The design resolves this tension by treating the judge not as a reasoning agent but as a structured rubric evaluator — given explicit scoring criteria, a fixed temperature of zero, and JSON-only output contracts. The judge is constrained in exactly the same way the clinical prompts are constrained: its generative freedom is the problem, not the feature.
Key Design Decisions
Judge A evaluates the prompt in isolation — assessing its design quality across 8 rubric dimensions. Judge B receives both the prompt and its generated output — assessing whether the output faithfully executed the prompt's intent. A single call cannot do both: merging them conflates design quality with runtime performance.
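The two-call separation can be sketched as follows. This is a minimal illustration, not the project's actual code: the `call_judge` transport is an injected placeholder, and the system-prompt texts are hypothetical stand-ins for the real `PROMPT_QUALITY_SYSTEM` and `OUTPUT_QUALITY_SYSTEM` constants.

```python
# Hypothetical sketch: Judge A sees only the prompt; Judge B sees prompt
# AND output. call_judge is an injected transport (e.g. a wrapper around
# the local Ollama endpoint) — an assumption, not the project's real API.
PROMPT_QUALITY_SYSTEM = "Score this prompt's design on 8 rubric dimensions. JSON only."
OUTPUT_QUALITY_SYSTEM = "Score how faithfully this output executed its prompt. JSON only."

def judge_a(call_judge, prompt_text):
    # Prompt in isolation — design quality only.
    return call_judge(PROMPT_QUALITY_SYSTEM, {"prompt": prompt_text})

def judge_b(call_judge, prompt_text, output_text):
    # Prompt plus output — runtime fidelity only.
    return call_judge(OUTPUT_QUALITY_SYSTEM,
                      {"prompt": prompt_text, "output": output_text})
```

Keeping the payloads disjoint is what prevents design quality and runtime performance from being conflated in a single score.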
The judge runner and the dashboard generator share no code and no in-memory state. A JSON file is the only coupling point. This means either component can be updated, rewritten, or replaced independently — and scores can be regenerated without touching the dashboard, or the dashboard re-rendered from cached scores without re-calling the judge.
Rather than restructuring the existing clinical pipeline, a single save_output_for_judge() function is inserted after each Claude API call. The pipeline continues unchanged. The capture is additive and has no effect on pipeline behaviour — the judge operates asynchronously on files written to a staging directory.
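A possible shape for the capture function, assuming the signature and staging path described in the container table (the field names inside the record are illustrative):

```python
import json
import os
import tempfile

def save_output_for_judge(run_id, prompt_name, output_text,
                          staging_dir="/judge_staging/outputs"):
    """Sketch of the single insertion point. Writes
    {run_id}_{prompt_name}.json atomically so a crash mid-write can
    never leave a partial file for the judge to pick up."""
    record = {"run_id": run_id, "prompt_name": prompt_name, "output": output_text}
    # Write to a temp file in the same directory, then rename:
    # os.replace is atomic on POSIX within one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=staging_dir, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(record, f)
    os.replace(tmp_path, os.path.join(staging_dir, f"{run_id}_{prompt_name}.json"))
```

Because the function only writes a file and returns, a failure here can be caught and logged without ever delaying report generation.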
The judge model runs entirely on the AWS EC2 instance via Ollama, which exposes an OpenAI-compatible REST API on localhost:11434. No clinical data leaves the instance. The existing openai Python SDK works without modification. Data sovereignty is architectural, not a policy promise.
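The call shape against the local endpoint might look like this. The document notes the existing openai SDK works unmodified; this sketch uses stdlib urllib instead purely to stay dependency-free, and the helper names are assumptions.

```python
import json
import urllib.request

# Ollama's OpenAI-compatible chat endpoint, on-instance only.
OLLAMA_URL = "http://localhost:11434/v1/chat/completions"

def build_judge_request(system_prompt, user_content, model="llama3.3"):
    # temperature=0 is enforced on every judge call for reproducibility.
    return {
        "model": model,
        "temperature": 0,
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_content},
        ],
    }

def call_judge(payload):
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

Since the URL is localhost, every byte of transcript and output stays on the EC2 instance by construction.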
Every output file and every scores file is prefixed with a run_id. Multiple clinical batches coexist in the staging directory without collision, and the dashboard can compare scores across runs — enabling longitudinal quality tracking without a database.
The dashboard HTTP server binds to 127.0.0.1 only. Access is via SSH tunnel from the reviewer's machine — no firewall rule changes, no authentication layer, no clinical data accessible from the public internet. Security is structural, implemented at the network layer, not the application layer.
02 · C4 Level 1 — System Context
Who and what interacts with AutoDeclare at the highest level of abstraction. The system sits alongside the existing clinical LLM pipeline on AWS — consuming its outputs without disrupting its operation.
Three external actors surround the system: the Psychiatrist who initiates runs by providing transcripts; the Anthropic Claude API which is the system being evaluated (not the evaluator); and the OSS Judge running on-instance via Ollama. The Clinical Team is the consumer of AutoDeclare's output — they see only the dashboard, never the raw scores or outputs.
03 · C4 Level 2 — Container Diagram
Inside the AWS boundary, AutoDeclare decomposes into five containers. Three are part of the existing system; two are new. The key architectural constraint is that the two new containers must not modify or block the three existing ones.
| Container | Type | Responsibility | Status |
|---|---|---|---|
| Clinical LLM Pipeline | Python script | Calls Claude API per prompt, produces structured clinical section outputs | existing · unchanged |
| Output Capture Layer | Python function | Writes {run_id}_{prompt_name}.json to /judge_staging/outputs/ after each Claude call | new · additive only |
| Judge Runner | Python script | Reads CSV + outputs; calls Ollama judge twice per prompt; writes scores.json | new · core component |
| Ollama Judge Server | Local HTTP service | Serves llama3.3 via OpenAI-compatible API on localhost:11434. On-instance only. | new · infrastructure |
| Dashboard Generator | Python script | Reads scores.json; generates self-contained rubric_dashboard.html with two chart sets | existing · modified |
04 · C4 Level 3 — Component Diagram
Inside the Judge Runner container — the core new component — five internal components chain together to transform raw prompt and output files into scored JSON.

Judge Runner Components

| Component | Input | Responsibility | Notes |
|---|---|---|---|
| CSV Loader | prompt_inventory.csv | Builds a dict keyed by Prompt Name for downstream lookups. | depends on CSV schema stability |
| Output Scanner | /judge_staging/outputs/ | Globs by run_id prefix; pairs each prompt_name with its output text. | run_id isolation prevents collision |
| Judge A | prompt text only | Calls Ollama with PROMPT_QUALITY_SYSTEM; returns 8 rubric dimension scores for the prompt in isolation. | retry ×3 on JSON parse failure |
| Judge B | prompt + output | Calls Ollama with OUTPUT_QUALITY_SYSTEM; assesses output fidelity. | cannot assess duplication cross-prompt |
| Score Writer | Judge A + B results, CSV metadata | Merges results; appends group/label; atomic write to {run_id}_scores.json. | scores.json is sole coupling point |
Dashboard Generator Components
| Component | Input | Responsibility | Notes |
|---|---|---|---|
| Score Loader | scores.json | Replaces the hardcoded PROMPTS list. Splits into prompt_quality and output_quality DataFrames. | no change to chart code |
| Prompt Quality Charts | prompt_quality DataFrame | Existing heatmap, radar chart, and bar chart — now driven by LLM-generated scores rather than hardcoded values. | existing · unchanged |
| Output Quality Charts | output_quality DataFrame | New heatmap and bar chart for 8 output dimensions: transcript_fidelity, no_hallucination, clinical_tone, etc. | new section |
| Run Selector | scores/ directory listing | Optional <select> element injected into the HTML. Allows switching between score files from different clinical runs. | longitudinal comparison |
05 · Component Role Map — All Containers
Mapping of all AutoDeclare containers and components to their architectural role, with known risks and status flags.
| Component / File | Architectural Role | Container | Notes |
|---|---|---|---|
| save_output_for_judge() | Single insertion point into existing pipeline; writes {run_id}_{prompt_name}.json atomically after each Claude response | Output Capture Layer | additive only · no pipeline impact |
| judge_runner.py · CSV Loader | Reads prompt_inventory.csv; builds dict keyed by Prompt Name for downstream lookups | Judge Runner | depends on CSV schema stability |
| judge_runner.py · Output Scanner | Globs /judge_staging/outputs/ by run_id prefix; pairs prompt_name with output text | Judge Runner | run_id isolation prevents collision |
| judge_runner.py · Judge A | Calls Ollama with PROMPT_QUALITY_SYSTEM; receives 8 rubric dimension scores for the prompt in isolation | Judge Runner | retry ×3 on JSON parse failure |
| judge_runner.py · Judge B | Calls Ollama with OUTPUT_QUALITY_SYSTEM; receives both prompt + output; assesses output fidelity | Judge Runner | cannot assess duplication cross-prompt |
| judge_runner.py · Score Writer | Merges judge A and B results; appends group/label from CSV; atomic write to {run_id}_scores.json | Judge Runner | scores.json is sole coupling point |
| ollama serve (llama3.3) | OpenAI-compatible HTTP server on localhost:11434; temperature=0 enforced per call | Ollama Judge Server | on-instance · no external calls |
| rubric_dashboard.py · Score Loader | Reads scores.json; splits into prompt_quality and output_quality DataFrames; replaces hardcoded PROMPTS | Dashboard Generator | zero change to chart generation code |
| rubric_dashboard.py · Prompt Quality Charts | Existing heatmap / radar / bar chart — unchanged; now consumes live judge scores | Dashboard Generator | existing · no code change |
| rubric_dashboard.py · Output Quality Charts | New heatmap section for 8 output quality dimensions; same Plotly pattern as existing charts | Dashboard Generator | new · mirrors existing pattern |
| rubric_dashboard.html | Self-contained observability artefact; served via Python HTTP server on localhost; accessed via SSH tunnel | Dashboard Generator | offline-capable · air-gap safe |
06 · C4 Level 4 — Code-Level Design Patterns
At the code level, five patterns emerge consistently across the AutoDeclare system. Each encodes a specific architectural decision about how to make an LLM judge reliable enough to be trusted as an automated quality signal.
Pattern 1 — JSON-Only Output Contract
Both judge system prompts instruct the model to return only valid JSON with no preamble, no markdown fences, and no explanation. Where the model supports it, response_format={"type":"json_object"} is passed. Parse failure triggers a retry up to three times before raising. This prevents partial or malformed scores from silently corrupting dashboard data.
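The retry wrapper can be sketched as below; `call_fn` stands in for whatever function actually hits the Ollama endpoint, and the three-attempt limit mirrors the retry ×3 noted in the role map.

```python
import json

def judge_with_retry(call_fn, max_attempts=3):
    """call_fn() returns the judge's raw string; parse as strict JSON,
    retrying up to max_attempts times before raising. A raised error is
    preferable to a malformed score silently reaching the dashboard."""
    last_err = None
    for _ in range(max_attempts):
        raw = call_fn()
        try:
            return json.loads(raw)
        except json.JSONDecodeError as err:
            last_err = err
    raise ValueError(
        f"judge returned invalid JSON after {max_attempts} attempts"
    ) from last_err
```

Raising (rather than returning a default score) keeps the failure loud: a missing row in scores.json is visible; a fabricated one is not.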
Pattern 2 — scores.json as Interface Boundary
The judge runner and dashboard generator share no code and no in-memory state. scores.json is the only coupling point between them. This means either component can be updated, rewritten, or replaced independently as long as the JSON schema is preserved. The dashboard can be re-rendered from cached scores without re-running the judge, and the judge can run nightly without touching the dashboard.
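A loader on the dashboard side might split the file like this. The real Score Loader builds pandas DataFrames; this sketch uses plain lists of dicts to stay stdlib-only, and the field names (`prompt_name`, `prompt_quality`, `output_quality`) are assumptions — the true contract is whatever schema judge_runner.py writes.

```python
import json

def load_scores(path):
    """Split the merged scores file into the two views the dashboard
    consumes: prompt-design scores and output-fidelity scores."""
    with open(path) as f:
        rows = json.load(f)
    # Each row carries both judges' results; fan out into two flat tables.
    prompt_quality = [r["prompt_quality"] | {"prompt": r["prompt_name"]} for r in rows]
    output_quality = [r["output_quality"] | {"prompt": r["prompt_name"]} for r in rows]
    return prompt_quality, output_quality
```

As long as this schema is preserved, either side of the boundary can be rewritten without the other noticing.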
Pattern 3 — Run ID as Partition Key
Every output file and every scores file is prefixed with a run_id. Multiple clinical batches coexist in the staging directory without collision. The dashboard can switch between runs via a <select> element, enabling longitudinal quality comparison without a database or a migration strategy. The partition key pattern is the simplest possible form of multi-tenancy.
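The partition key in action is one glob; the function name is illustrative:

```python
from pathlib import Path

def outputs_for_run(run_id, staging_dir="/judge_staging/outputs"):
    """run_id acts as the partition key: a single prefix glob isolates
    one clinical batch, so batches coexist in one flat directory."""
    return sorted(Path(staging_dir).glob(f"{run_id}_*.json"))
```

No database, no migration: adding a new run is just writing files with a new prefix.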
Pattern 4 — Temperature = 0 for Reproducibility
All judge calls use temperature=0. This ensures that re-running the judge on identical inputs produces identical scores — a necessary property for a quality assurance tool, where score drift across re-runs would undermine trust in the dashboard as an objective signal. Determinism is not a nice-to-have; it is a correctness requirement.
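Determinism is cheap to verify: re-run the judge on the same input and compare per-dimension scores. A hypothetical check might look like:

```python
def score_drift(run1, run2):
    """Maximum absolute per-dimension difference between two judge runs
    on identical inputs. With temperature=0 this should be exactly 0.0;
    anything else signals nondeterminism worth investigating."""
    return max(abs(run1[d] - run2[d]) for d in run1)
```

A nonzero drift on identical inputs is a correctness bug in the judge setup, not noise to be averaged away.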
Pattern 5 — SSH Tunnel as Structural Access Control
The dashboard HTTP server binds to 127.0.0.1 only, never 0.0.0.0. Access requires an active SSH tunnel from the reviewer's machine. No firewall rule changes, no application-layer authentication, no TLS certificate management. Security is enforced at the network layer by the OS — it cannot be misconfigured by the application, and clinical data cannot be inadvertently exposed by a deployment mistake.
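The binding is a one-liner, which is the point — there is almost nothing to misconfigure. The factory name is illustrative:

```python
from http.server import HTTPServer, SimpleHTTPRequestHandler

def make_dashboard_server(port=8000):
    # Binding to 127.0.0.1 (never 0.0.0.0) means the kernel refuses any
    # connection that does not originate on the instance itself.
    # A reviewer reaches it via an SSH tunnel, e.g.:
    #   ssh -L 8000:localhost:8000 <instance>
    return HTTPServer(("127.0.0.1", port), SimpleHTTPRequestHandler)
```

Compare the alternative: application-layer auth would add code paths that can fail open; the loopback bind cannot.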
07 · Implementation Roadmap & Risk Register
Ordered Task List
| # | Task | Effort | Dependency |
|---|---|---|---|
| 1 | Install Ollama on EC2; pull llama3.3 (or mistral-small for g4dn.xlarge) | ~2 hrs | blocks everything |
| 2 | Verify OpenAI-compatible endpoint responds at localhost:11434 | ~30 min | blocks Phase 3 |
| 3 | Create /data/judge_staging/{inputs,outputs,scores,dashboard}/ structure | ~15 min | blocks Phase 2 & 3 |
| 4 | Add save_output_for_judge() to existing pipeline (one function, one call site) | ~1 hr | blocks Phase 3 |
| 5 | Write and test judge_runner.py on 2–3 sample outputs | ~3 hrs | core deliverable |
| 6 | Update rubric_dashboard.py Score Loader to read scores.json | ~2 hrs | blocks dashboard |
| 7 | Add output quality heatmap and bar chart to dashboard | ~2 hrs | standalone |
| 8 | Wire judge_runner into pipeline as subprocess call or nightly cron | ~1 hr | standalone |
| 9 | Configure SSH tunnel and document access procedure for clinical team | ~30 min | standalone |
Risk Register
| Risk | Impact | Mitigation |
|---|---|---|
| Judge score variance across re-runs | Undermines dashboard as an objective quality signal; stakeholders lose trust | temperature=0 enforced on all calls. Run judge ×3 on fixed sample on first deploy; flag if std dev > 0.5 per dimension. |
| Duplication scoring is not cross-prompt aware | Judge A scores each prompt in isolation and cannot assess duplication relative to other prompts | Run a separate batch judge call with all prompt summaries together for comparative duplication scoring. known limitation |
| Output file accumulation | Staging directory grows unbounded; disk pressure on EC2 instance | Implement 30-day retention policy on outputs/. Scores files are small (<100KB) and can be retained indefinitely. |
| Judge hallucination on safety dimensions | A false score of 5 on clinical_safety or pii_handling would give false confidence | Manual spot-check of any score of 5 on clinical_safety or pii_handling against the actual prompt text on first deployment. high priority |
| Pipeline coupling if judge runs inline | Judge failure would delay or block clinical report delivery | Run judge_runner as an async subprocess with timeout, or via cron. Report delivery must never depend on judge availability. design constraint |
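The first mitigation in the risk register — run the judge ×3 on a fixed sample and flag dimensions whose standard deviation exceeds 0.5 — can be sketched as a small check; the function name is an assumption:

```python
import statistics

def flag_variance(runs, threshold=0.5):
    """runs: list of {dimension: score} dicts from repeated judge calls
    on the same fixed sample. Returns the dimensions whose sample std
    dev exceeds the 0.5 threshold from the risk register."""
    dims = runs[0].keys()
    return [d for d in dims
            if statistics.stdev(r[d] for r in runs) > threshold]
```

An empty result on first deploy is the evidence that temperature=0 is actually delivering the determinism the dashboard's credibility depends on.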