CarefulAI
AutoDeclare — Design Thinking & C4 Architecture


AutoDeclare — AI-as-Judge Pipeline

C4 Model decomposition & design rationale for the automated prompt quality and output fidelity assessment system on AWS

01 · Design Thinking — The Problem & Philosophy

AutoDeclare exists to close a feedback loop that the existing clinical AI pipeline leaves open. The pipeline generates structured psychiatric report sections from transcripts — but there is no mechanism to assess whether those prompts are well-designed or whether the outputs they produce are clinically faithful. Quality assessment is currently manual, static, and decoupled from runtime behaviour.

The Central Tension

Using an LLM to judge another LLM introduces a paradox: the judge must be objective and deterministic, yet LLMs are inherently probabilistic. The design resolves this tension by treating the judge not as a reasoning agent but as a structured rubric evaluator — given explicit scoring criteria, a fixed temperature of zero, and JSON-only output contracts. The judge is constrained in exactly the same way the clinical prompts are constrained: its generative freedom is the problem, not the feature.

Key Design Decisions

D-01
Two Distinct Judge Calls

Judge A evaluates the prompt in isolation — assessing its design quality across 8 rubric dimensions. Judge B receives both the prompt and its generated output — assessing whether the output faithfully executed the prompt's intent. A single call cannot do both: merging them conflates design quality with runtime performance.
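The two-call split can be sketched as follows. This is a minimal illustration, not the actual judge_runner.py code: the judge callable, the system-prompt text, and the return shape are all assumptions; only the constant names PROMPT_QUALITY_SYSTEM and OUTPUT_QUALITY_SYSTEM appear in the source document.

```python
from typing import Callable, Dict

# Placeholder rubric text -- the real system prompts live in judge_runner.py.
PROMPT_QUALITY_SYSTEM = "Score this prompt's design across 8 rubric dimensions. JSON only."
OUTPUT_QUALITY_SYSTEM = "Score this output against the prompt's intent. JSON only."

def run_judges(prompt: str, output: str,
               judge: Callable[[str, str], Dict]) -> Dict[str, Dict]:
    """Two distinct calls: Judge A sees the prompt alone, Judge B sees
    prompt + output. A single merged call would conflate design quality
    with runtime performance."""
    judge_a = judge(PROMPT_QUALITY_SYSTEM, prompt)  # prompt in isolation
    judge_b = judge(OUTPUT_QUALITY_SYSTEM,
                    f"PROMPT:\n{prompt}\n\nOUTPUT:\n{output}")
    return {"prompt_quality": judge_a, "output_quality": judge_b}
```

Injecting the judge as a callable keeps the split testable without a live model.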

D-02
scores.json as the Interface Boundary

The judge runner and the dashboard generator share no code and no in-memory state. A JSON file is the only coupling point. This means either component can be updated, rewritten, or replaced independently — and scores can be regenerated without touching the dashboard, or the dashboard re-rendered from cached scores without re-calling the judge.
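A file-as-interface boundary might look like the sketch below. The field names inside the sample are illustrative assumptions; the real schema is whatever judge_runner.py writes and the Dashboard Generator's Score Loader reads.

```python
import json
import pathlib
import tempfile

# Illustrative scores.json shape -- field names are assumptions, not the
# documented schema. The file, not shared code, is the coupling point.
sample_scores = {
    "run_id": "run1",
    "prompts": [
        {
            "prompt_name": "risk_assessment_section",
            "group": "clinical",
            "prompt_quality": {"clarity": 4, "specificity": 5},
            "output_quality": {"transcript_fidelity": 5, "no_hallucination": 4},
        }
    ],
}

def write_scores(path: pathlib.Path, scores: dict) -> None:
    """Judge-runner side: serialise and forget."""
    path.write_text(json.dumps(scores, indent=2))

def load_scores(path: pathlib.Path) -> dict:
    """Dashboard side: deserialise and render. No shared in-memory state."""
    return json.loads(path.read_text())
```

Either side can be rewritten freely as long as the JSON schema round-trips.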

D-03
Thin Output Capture Layer

Rather than restructuring the existing clinical pipeline, a single save_output_for_judge() function is inserted after each Claude API call. The pipeline continues unchanged. The capture is additive and side-effect-free — the judge operates asynchronously on files written to a staging directory.
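A capture function along these lines would satisfy the constraint. The signature, payload fields, and staging path default are assumptions for illustration; the source only fixes the function name, the {run_id}_{prompt_name}.json naming, and the staging-directory pattern.

```python
import datetime
import json
import os
import pathlib
import tempfile

STAGING = pathlib.Path("/judge_staging/outputs")  # assumed default location

def save_output_for_judge(run_id: str, prompt_name: str, output_text: str,
                          staging: pathlib.Path = STAGING) -> pathlib.Path:
    """Additive capture: called once after each Claude API call.
    Writes via temp file + os.replace so the asynchronous judge
    never reads a half-written file."""
    staging.mkdir(parents=True, exist_ok=True)
    dest = staging / f"{run_id}_{prompt_name}.json"
    payload = {
        "run_id": run_id,
        "prompt_name": prompt_name,
        "output": output_text,
        "captured_at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
    }
    fd, tmp = tempfile.mkstemp(dir=staging, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(payload, f)
    os.replace(tmp, dest)  # atomic rename on POSIX
    return dest
```

The pipeline's own control flow is untouched; a failed write can be caught and logged without affecting report generation.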

D-04
On-Instance OSS Judge (Ollama)

The judge model runs entirely on the AWS EC2 instance via Ollama, which exposes an OpenAI-compatible REST API on localhost:11434. No clinical data leaves the instance. The existing openai Python SDK works without modification. Data sovereignty is architectural, not a policy promise.
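A judge call against that endpoint could be sketched with the stdlib alone (the existing openai SDK works equally well with base_url="http://localhost:11434/v1"). The payload-builder function and its defaults are illustrative; localhost:11434, temperature 0, and the JSON-object response format come from the source.

```python
import json
import urllib.request

# Ollama's OpenAI-compatible chat completions route.
OLLAMA_URL = "http://localhost:11434/v1/chat/completions"

def build_judge_request(system: str, user: str, model: str = "llama3.3") -> dict:
    """Payload for the on-instance judge. temperature=0 and JSON-only
    output are part of the judge contract, not per-call options."""
    return {
        "model": model,
        "temperature": 0,
        "response_format": {"type": "json_object"},
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": user},
        ],
    }

def call_judge(payload: dict) -> dict:
    """Send the request to the local server. Requires a running
    `ollama serve`; no data leaves the instance."""
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())
```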

D-05
Run ID as Partition Key

Every output file and every scores file is prefixed with a run_id. Multiple clinical batches coexist in the staging directory without collision, and the dashboard can compare scores across runs — enabling longitudinal quality tracking without a database.
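The partition-key scan reduces to a prefix glob, roughly as below (the function name and pairing shape are assumptions; the {run_id}_ prefix convention is from the source):

```python
import pathlib
import tempfile

def scan_outputs(staging: pathlib.Path, run_id: str):
    """Yield (prompt_name, path) pairs for a single run. The run_id
    prefix keeps concurrent batches from colliding in one directory."""
    prefix = f"{run_id}_"
    for path in sorted(staging.glob(f"{prefix}*.json")):
        prompt_name = path.stem[len(prefix):]  # strip "<run_id>_"
        yield prompt_name, path
```

Because selection is purely lexical, no database or index is needed to keep runs separate.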

D-06
SSH Tunnel as Access Control

The dashboard HTTP server binds to 127.0.0.1 only. Access is via SSH tunnel from the reviewer's machine — no firewall rule changes, no authentication layer, no clinical data accessible from the public internet. Security is structural, implemented at the network layer, not the application layer.

02 · C4 Level 1 — System Context

Who and what interacts with AutoDeclare at the highest level of abstraction. The system sits alongside the existing clinical LLM pipeline on AWS — consuming its outputs without disrupting its operation.

[Diagram — C4 Level 1 System Context. Psychiatrist [Person] provides a transcript and receives the clinical report. Inside the AutoDeclare AI-as-Judge System [AWS EC2] sit the Clinical LLM Pipeline, Output Capture Layer, Judge Runner (×2 calls), and Dashboard Generator; data never leaves the instance. The Anthropic Claude API [External System] generates the clinical report sections. The OSS Judge (Ollama / vLLM) [On-Instance] scores prompts and outputs via JSON over the localhost:11434 judge endpoint. prompt_inventory.csv [28 prompts + metadata] supplies prompt text. The Clinical Team / QA Reviewer accesses the dashboard via SSH tunnel only.]

Four actors surround the system: the Psychiatrist, who initiates runs by providing transcripts; the Anthropic Claude API, which is the system being evaluated (not the evaluator); the OSS Judge, which runs on-instance via Ollama; and the Clinical Team, the consumer of AutoDeclare's output — they see only the dashboard, never the raw scores or outputs.

03 · C4 Level 2 — Container Diagram

Inside the AWS boundary, AutoDeclare decomposes into five containers. Two belong to the existing system (one unchanged, one modified); three are new. The key architectural constraint is that the three new containers must not modify or block the existing pipeline.

[Diagram — C4 Level 2 Containers inside the AWS EC2 boundary. The Clinical LLM Pipeline [Python · Existing] calls the Claude API per prompt and passes output text to the Output Capture Layer [Python function · New], which writes output JSON to /judge_staging/outputs/. The Judge Runner, judge_runner.py [Python · New · Core], reads prompt_inventory.csv (28 prompts · full prompt content) and the output files, makes Judge A + B calls to the Ollama Judge Server [Local HTTP · New] (llama3.3 / mistral-small, OpenAI-compatible API on localhost:11434, temperature = 0, JSON output only), and writes scores.json to /judge_staging/scores/. The Dashboard Generator [Python · Modified] reads scores.json and renders the .html dashboard.]
Container | Type | Responsibility | Status
Clinical LLM Pipeline | Python script | Calls Claude API per prompt; produces structured clinical section outputs | existing · unchanged
Output Capture Layer | Python function | Writes {run_id}_{prompt_name}.json to /judge_staging/outputs/ after each Claude call | new · additive only
Judge Runner | Python script | Reads CSV + outputs; calls the Ollama judge twice per prompt; writes scores.json | new · core component
Ollama Judge Server | Local HTTP service | Serves llama3.3 via OpenAI-compatible API on localhost:11434; on-instance only | new · infrastructure
Dashboard Generator | Python script | Reads scores.json; generates self-contained rubric_dashboard.html with two chart sets | existing · modified

04 · C4 Level 3 — Component Diagram

Inside the Judge Runner container — the core new component — five internal components chain together to transform raw prompt and output files into scored JSON.

[Diagram — C4 Level 3 components inside judge_runner.py, chained left to right: CSV Loader (reads prompt_inventory.csv, builds a dict keyed by Prompt Name → prompt lookup map) → Output Scanner (globs outputs/ by run_id prefix, pairs prompt_name + text → (name, output) pairs) → Judge A, Prompt Quality (scores the prompt text in isolation across 8 rubric dimensions → 8 scores + notes) → Judge B, Output Quality (scores the output against the prompt's intent: fidelity, safety, tone, PII → 8 scores + notes) → Score Writer (merges judge results, adds group/label from the CSV, atomic write to {run_id}_scores.json → scores.json).]

Dashboard Generator Components

Component | Input | Responsibility | Notes
Score Loader | scores.json | Replaces the hardcoded PROMPTS list; splits into prompt_quality and output_quality DataFrames | no change to chart code
Prompt Quality Charts | prompt_quality DataFrame | Existing heatmap, radar chart, and bar chart — now driven by LLM-generated scores rather than hardcoded values | existing · unchanged
Output Quality Charts | output_quality DataFrame | New heatmap and bar chart for 8 output dimensions: transcript_fidelity, no_hallucination, clinical_tone, etc. | new section
Run Selector | scores/ directory listing | Optional <select> element injected into the HTML; allows switching between score files from different clinical runs | longitudinal comparison
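The Run Selector can be sketched as a small HTML-generating helper. The function name, element id, and the wiring to a change handler are assumptions; the {run_id}_scores.json naming and the directory-listing input come from the source.

```python
import pathlib
import tempfile

def render_run_selector(scores_dir: pathlib.Path) -> str:
    """Build the optional <select> from the scores/ directory listing.
    Assumes {run_id}_scores.json naming; the real dashboard would pair
    this with a small JS handler that swaps the displayed score set."""
    suffix = "_scores.json"
    run_ids = sorted(p.name[:-len(suffix)]
                     for p in scores_dir.glob(f"*{suffix}"))
    options = "\n".join(
        f'  <option value="{r}">{r}</option>' for r in run_ids
    )
    return f'<select id="run-selector">\n{options}\n</select>'
```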

05 · Component Role Map — All Containers

Mapping of all AutoDeclare containers and components to their architectural role, with known risks and status flags.

Component / File | Architectural Role | Container | Notes
save_output_for_judge() | Single insertion point into the existing pipeline; writes {run_id}_{prompt_name}.json atomically after each Claude response | Output Capture Layer | additive only · no pipeline impact
judge_runner.py · CSV Loader | Reads prompt_inventory.csv; builds a dict keyed by Prompt Name for downstream lookups | Judge Runner | depends on CSV schema stability
judge_runner.py · Output Scanner | Globs /judge_staging/outputs/ by run_id prefix; pairs prompt_name with output text | Judge Runner | run_id isolation prevents collision
judge_runner.py · Judge A | Calls Ollama with PROMPT_QUALITY_SYSTEM; receives 8 rubric dimension scores for the prompt in isolation | Judge Runner | retry ×3 on JSON parse failure
judge_runner.py · Judge B | Calls Ollama with OUTPUT_QUALITY_SYSTEM; receives both prompt + output; assesses output fidelity | Judge Runner | cannot assess duplication cross-prompt
judge_runner.py · Score Writer | Merges Judge A and B results; appends group/label from the CSV; atomic write to {run_id}_scores.json | Judge Runner | scores.json is the sole coupling point
ollama serve (llama3.3) | OpenAI-compatible HTTP server on localhost:11434; temperature=0 enforced per call | Ollama Judge Server | on-instance · no external calls
rubric_dashboard.py · Score Loader | Reads scores.json; splits into prompt_quality and output_quality DataFrames; replaces the hardcoded PROMPTS list | Dashboard Generator | zero change to chart generation code
rubric_dashboard.py · Prompt Quality Charts | Existing heatmap / radar / bar chart — unchanged; now consumes live judge scores | Dashboard Generator | existing · no code change
rubric_dashboard.py · Output Quality Charts | New heatmap section for 8 output quality dimensions; same Plotly pattern as existing charts | Dashboard Generator | new · mirrors existing pattern
rubric_dashboard.html | Self-contained observability artefact; served via a Python HTTP server on localhost; accessed via SSH tunnel | Dashboard Generator | offline-capable · air-gap safe

06 · C4 Level 4 — Code-Level Design Patterns

At the code level, five patterns emerge consistently across the AutoDeclare system. Each encodes a specific architectural decision about how to make an LLM judge reliable enough to be trusted as an automated quality signal.

Pattern 1 — JSON-Only Output Contract

Both judge system prompts instruct the model to return only valid JSON with no preamble, no markdown fences, and no explanation. Where the model supports it, response_format={"type":"json_object"} is passed. Parse failure triggers a retry up to three times before raising. This prevents partial or malformed scores from silently corrupting dashboard data.
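A parse-with-retry wrapper along these lines would enforce the contract. The function name and fence-stripping details are assumptions; the three-attempt retry and raise-on-failure behaviour come from the source.

```python
import json
from typing import Callable

def judge_json(call: Callable[[], str], retries: int = 3) -> dict:
    """Enforce the JSON-only contract: tolerate stray markdown fences,
    retry up to `retries` times on parse failure, then raise so that
    malformed scores never silently corrupt scores.json."""
    last_err = None
    for _ in range(retries):
        raw = call().strip()
        if raw.startswith("```"):  # model ignored the no-fences rule
            raw = raw.strip("`").removeprefix("json").strip()
        try:
            return json.loads(raw)
        except json.JSONDecodeError as err:
            last_err = err
    raise ValueError(
        f"judge returned invalid JSON after {retries} attempts"
    ) from last_err
```

Raising rather than returning a partial dict keeps failure loud: a missing run is visible, a quietly wrong score is not.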

Pattern 2 — scores.json as Interface Boundary

The judge runner and dashboard generator share no code and no in-memory state. scores.json is the only coupling point between them. This means either component can be updated, rewritten, or replaced independently as long as the JSON schema is preserved. The dashboard can be re-rendered from cached scores without re-running the judge, and the judge can run nightly without touching the dashboard.

Pattern 3 — Run ID as Partition Key

Every output file and every scores file is prefixed with a run_id. Multiple clinical batches coexist in the staging directory without collision. The dashboard can switch between runs via a <select> element, enabling longitudinal quality comparison without a database or a migration strategy. The partition key pattern is the simplest possible form of multi-tenancy.

Pattern 4 — Temperature = 0 for Reproducibility

All judge calls use temperature=0. This ensures that re-running the judge on identical inputs produces identical scores — a necessary property for a quality assurance tool, where score drift across re-runs would undermine trust in the dashboard as an objective signal. Determinism is not a nice-to-have; it is a correctness requirement.

Pattern 5 — SSH Tunnel as Structural Access Control

The dashboard HTTP server binds to 127.0.0.1 only, never 0.0.0.0. Access requires an active SSH tunnel from the reviewer's machine. No firewall rule changes, no application-layer authentication, no TLS certificate management. Security is enforced at the network layer by the OS — it cannot be misconfigured by the application, and clinical data cannot be inadvertently exposed by a deployment mistake.
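The loopback-only bind reduces to a one-line server constructor; a sketch using the stdlib http.server (handler choice and port default are assumptions):

```python
from http.server import HTTPServer, SimpleHTTPRequestHandler

def make_dashboard_server(port: int = 8000) -> HTTPServer:
    """Bind to the loopback interface only -- never 0.0.0.0.
    A reviewer would then forward the port over SSH, e.g.
    ssh -L 8000:127.0.0.1:8000 user@<ec2-host>, and browse
    http://localhost:8000/rubric_dashboard.html locally."""
    return HTTPServer(("127.0.0.1", port), SimpleHTTPRequestHandler)
```

Because the bind address is fixed in code, a deployment mistake cannot widen exposure; only the OS-level tunnel grants access.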

07 · Implementation Roadmap & Risk Register

Ordered Task List

# | Task | Effort | Dependency
1 | Install Ollama on EC2; pull llama3.3 (or mistral-small for g4dn.xlarge) | ~2 hrs | blocks everything
2 | Verify the OpenAI-compatible endpoint responds at localhost:11434 | ~30 min | blocks Phase 3
3 | Create the /data/judge_staging/{inputs,outputs,scores,dashboard}/ structure | ~15 min | blocks Phases 2 & 3
4 | Add save_output_for_judge() to the existing pipeline (one function, one call site) | ~1 hr | blocks Phase 3
5 | Write and test judge_runner.py on 2–3 sample outputs | ~3 hrs | core deliverable
6 | Update the rubric_dashboard.py Score Loader to read scores.json | ~2 hrs | blocks dashboard
7 | Add the output quality heatmap and bar chart to the dashboard | ~2 hrs | standalone
8 | Wire judge_runner into the pipeline as a subprocess call or nightly cron | ~1 hr | standalone
9 | Configure the SSH tunnel and document the access procedure for the clinical team | ~30 min | standalone

Risk Register

Risk | Impact | Mitigation
Judge score variance across re-runs | Undermines the dashboard as an objective quality signal; stakeholders lose trust | temperature=0 enforced on all calls. Run the judge ×3 on a fixed sample on first deploy; flag if std dev > 0.5 per dimension.
Duplication scoring is not cross-prompt aware | Judge A scores each prompt in isolation and cannot assess duplication relative to other prompts | Run a separate batch judge call with all prompt summaries together for comparative duplication scoring. known limitation
Output file accumulation | Staging directory grows unbounded; disk pressure on the EC2 instance | Implement a 30-day retention policy on outputs/. Scores files are small (<100 KB) and can be retained indefinitely.
Judge hallucination on safety dimensions | A false score of 5 on clinical_safety or pii_handling would give false confidence | Manually spot-check any score of 5 on clinical_safety or pii_handling against the actual prompt text on first deployment. high priority
Pipeline coupling if the judge runs inline | Judge failure would delay or block clinical report delivery | Run judge_runner as an async subprocess with a timeout, or via cron. Report delivery must never depend on judge availability. design constraint
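The decoupled-invocation mitigation in the last row can be sketched as a subprocess launch with a timeout. The CLI arguments and timeout budget are assumptions; judge_runner.py as a subprocess/cron target comes from the source.

```python
import subprocess
import sys
from typing import List, Optional

def launch_judge(run_id: str, timeout_s: int = 600,
                 cmd: Optional[List[str]] = None) -> Optional[int]:
    """Run the judge out-of-band so report delivery never waits on it.
    Returns the judge's exit code, or None if it overran its budget.
    The --run-id flag is a hypothetical interface for judge_runner.py."""
    cmd = cmd or [sys.executable, "judge_runner.py", "--run-id", run_id]
    proc = subprocess.Popen(cmd)
    try:
        return proc.wait(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        proc.kill()   # judge failure must not block the clinical pipeline
        proc.wait()   # reap the killed process
        return None
```

The caller treats a None result as "scores unavailable this run" and carries on; a cron entry gives the same isolation with even less coupling.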