The Constraint
VA-Claims is a tool that helps a veteran organize their own medical records, retrieve the governing VA rating regulations, and spot evidence gaps before filing a disability claim. The input is the most sensitive data a person has: medical records, diagnoses, service history. The design constraint follows directly -- raw documents never leave the device. Not to an LLM API, not to a hosted OCR service, not to telemetry.
One scoping decision shaped everything else: the tool does not adjudicate. It does not conclude a condition is service-connected and does not predict a final rating percentage; for those judgments it routes the user to a VA-accredited representative. It surfaces the rule, the evidence, and what's missing. Keeping it an organizer rather than an oracle kept both the legal exposure and the model-capability requirements bounded.
Structural, Not Behavioral
The position I landed on after adversarial review of my own design: a privacy guarantee that depends on a model behaving well is not a guarantee. So the architecture makes the safe path the only path.
The LLM never has filesystem tools. There is no autonomous agent with read access to the records. Extraction and redaction happen in deterministic host code before any model call, and the model receives only an already-redacted string. This is also why the cloud engine uses the plain anthropic Python SDK and its Messages API rather than the Claude Agent SDK -- the Agent SDK's file tools would put raw PHI (protected health information) one tool call away from the API.
The pipeline is a deterministic ETL chain: ingest (PDF text-vs-scan detection via PyMuPDF, local OCR for scans), extract, redact, a human review gate on the redacted text, then analysis, aggregation, and a final report step that re-inserts real identifiers locally. Redacted-to-real mappings live in a Fernet-encrypted reversible vault on disk, never within the model's reach and never logged.
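The stage ordering above can be sketched in a few lines. This is a minimal illustration, not the project's code: the names Vault, placeholder_for, reinsert, and run_pipeline are hypothetical, the vault is a plain in-memory dict standing in for the Fernet-encrypted one, and a single SSN regex stands in for the whole redaction subsystem. The point it shows is structural: the engine callable only ever receives the redacted string, and nothing reaches it unless the review gate approves.

```python
import re
from dataclasses import dataclass, field
from typing import Callable

# Stand-in for the full redaction subsystem: one SSN-shaped pattern.
SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


@dataclass
class Vault:
    """Placeholder -> real value. Hypothetical in-memory stand-in;
    the real vault is described as Fernet-encrypted on disk."""
    mapping: dict[str, str] = field(default_factory=dict)

    def placeholder_for(self, real: str) -> str:
        # Reuse the same placeholder when an identifier repeats.
        for token, value in self.mapping.items():
            if value == real:
                return token
        token = f"[ID_{len(self.mapping) + 1}]"
        self.mapping[token] = real
        return token

    def reinsert(self, text: str) -> str:
        # Final report step: runs locally, after analysis.
        for token, value in self.mapping.items():
            text = text.replace(token, value)
        return text


def redact(raw: str, vault: Vault) -> str:
    return SSN.sub(lambda m: vault.placeholder_for(m.group()), raw)


def run_pipeline(
    raw: str,
    vault: Vault,
    approve: Callable[[str], bool],   # human review gate
    engine: Callable[[str], str],     # sees redacted text only
) -> str:
    redacted = redact(raw, vault)
    if not approve(redacted):
        raise PermissionError("review gate rejected the redacted text")
    analysis = engine(redacted)
    return vault.reinsert(analysis)
```

Because the raw string is never passed to the engine callable, a misbehaving engine has nothing sensitive to leak; the guarantee comes from the call graph rather than from the engine's behavior.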
What the Constraint Rules Out, and What It Forces
Ruled out: cloud OCR, hosted document AI, and any default-on telemetry. OCR stays local: Tesseract, installed via Homebrew, is the only OCR dependency, and scanned pages are rendered with PyMuPDF and piped to Tesseract over stdin with no temp files.
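A rough sketch of the render-and-pipe step, assuming the tesseract binary is on PATH and PyMuPDF is installed. The function names and the 300 dpi default are hypothetical choices for illustration. It relies on Tesseract accepting the literal arguments stdin and stdout in place of file names, so the rendered PNG goes in over a pipe and the recognized text comes back over a pipe.

```python
import subprocess


def tesseract_command(lang: str = "eng") -> list[str]:
    # "stdin" and "stdout" tell Tesseract to read the image from its
    # standard input and write plain text to its standard output.
    return ["tesseract", "stdin", "stdout", "-l", lang]


def ocr_png_bytes(png: bytes, lang: str = "eng") -> str:
    # The image bytes travel over a pipe; nothing is written to disk.
    proc = subprocess.run(
        tesseract_command(lang),
        input=png,
        capture_output=True,
        check=True,
    )
    return proc.stdout.decode("utf-8", errors="replace")


def ocr_scanned_page(pdf_path: str, page_index: int, dpi: int = 300) -> str:
    # Imported lazily so the rest of the module loads without PyMuPDF.
    import fitz  # PyMuPDF

    with fitz.open(pdf_path) as doc:
        pixmap = doc[page_index].get_pixmap(dpi=dpi)
        return ocr_png_bytes(pixmap.tobytes("png"))
```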
Forced: redaction had to become a first-class subsystem rather than a preprocessing step. It is layered deliberately: Presidio analyzer/anonymizer, clinical NER (medspaCy, when it installs; it lags Python versions), a regex backstop covering the 18 HIPAA Safe Harbor identifiers, custom recognizers for military and VA ID formats, and a vault-derived denylist that scrubs every known identifier corpus-wide. OCR-confidence gating and the human review gate sit on top, before anything reaches an engine.
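To make the layering concrete, here is a minimal sketch of just two of those layers, the denylist and the regex backstop, using only the standard library. The patterns, labels, and the redact function are illustrative assumptions, and three patterns are nowhere near full Safe Harbor coverage. The Presidio and medspaCy layers are omitted entirely.

```python
import re

# Illustrative backstop patterns only; a real backstop needs many more.
PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}[-. ]\d{3}[-. ]\d{4}\b"),
    "DATE": re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"),
}


def redact(text: str, denylist: list[str]) -> str:
    # Layer 1: scrub every identifier already known to the vault.
    # Longest terms first, so a full name is replaced before a surname.
    for term in sorted(denylist, key=len, reverse=True):
        text = re.sub(re.escape(term), "[KNOWN_ID]", text, flags=re.IGNORECASE)
    # Layer 2: regex backstop catches identifier-shaped strings that
    # the earlier layers never saw.
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text


sample = "John Q. Veteran, SSN 123-45-6789, seen 03/14/2019, call 555-867-5309"
print(redact(sample, ["John Q. Veteran"]))
# prints: [KNOWN_ID], SSN [SSN], seen [DATE], call [PHONE]
```

The layers are independent on purpose: the denylist catches known values in any format, while the regexes catch well-formed identifiers nobody has registered yet.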
The layering exists because the published numbers are sobering: redaction recall on real clinical text is around 0.74, and quasi-identifiers can re-identify people even after Safe Harbor removal. So the design over-redacts on purpose, and the documentation refuses to call the cloud path anonymous. It is "minimized and contractually protected" (redaction plus Zero Data Retention plus a BAA), which is a different and weaker claim than zero exposure.
The Engine Seam
Analysis sits behind an AnalysisEngine interface with two implementations planned. The target is LocalLLMEngine: Ollama running a 70B-class model on an M5 Pro with 64 GB of RAM -- true zero exposure. The interim option is ClaudeCloudEngine over the plain Messages API with structured outputs, gated on the signed BAA and ZDR terms. The seam means the privacy posture can tighten over time without rewriting the pipeline. Today the CLI is wired to an EchoEngine stub -- neither real engine exists yet, and that is deliberate: the pipeline, redaction, and vault all run and get tested end-to-end against the stub before any model gets within reach of the data.
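One way the seam could look, as a sketch under assumptions: AnalysisEngine and EchoEngine are named in the text, but the analyze method, the RedactedText wrapper, and run_analysis are hypothetical. The wrapper type is one possible way to make "the engine only receives redacted text" checkable at the call site.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class RedactedText:
    """Hypothetical wrapper that only the redaction stage constructs."""
    value: str


class AnalysisEngine(Protocol):
    def analyze(self, text: RedactedText) -> str:
        ...


class EchoEngine:
    """Stub engine: returns its input so the pipeline can be tested
    end-to-end without any model in the loop."""

    def analyze(self, text: RedactedText) -> str:
        return f"[echo] {text.value}"


def run_analysis(engine: AnalysisEngine, text: RedactedText) -> str:
    # Refuse bare strings so raw text cannot reach an engine by accident.
    if not isinstance(text, RedactedText):
        raise TypeError("engines accept RedactedText only")
    return engine.analyze(text)


print(run_analysis(EchoEngine(), RedactedText("Patient [ID_1] reports knee pain")))
# prints: [echo] Patient [ID_1] reports knee pain
```

A local or cloud engine would be another class with the same analyze method, so swapping or deleting one does not touch the pipeline code that calls run_analysis.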
The Honest Cost
Local-first is not free. A 70B local model is meaningfully behind frontier cloud models at exactly the tasks this tool needs: reading messy clinical prose and mapping it onto rating criteria. The cloud engine exists in the design precisely because I could not pretend that gap away.
The hardware floor is real -- 64 GB of unified memory to run the target model. The dependency story has friction too: the clinical NER stack lags Python releases (the project pins Python 3.13 because the ML dependencies lack 3.14 wheels), so the core spine -- vault, regex redaction, deterministic VA combined-ratings math, PDF detection -- was deliberately built on three light dependencies (cryptography, PyMuPDF, Pydantic) and tested independently of the ML layer.
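The combined-ratings math is a good example of what belongs in that dependency-free spine. The sketch below follows the method in 38 CFR 4.25: ratings are combined from highest to lowest, each one applied to the remaining efficiency, and the result is rounded to the nearest 10 with 5 rounding up. The function name is hypothetical, rounding half up at each intermediate step is an assumption about how the published table was built, and the bilateral factor (38 CFR 4.26) is not handled.

```python
def combine_ratings(ratings: list[int]) -> tuple[int, int]:
    """Return (combined value, final rating rounded to the nearest 10)."""
    for rating in ratings:
        if not 0 <= rating <= 100:
            raise ValueError(f"rating out of range: {rating}")

    combined = 0
    # Highest rating first; each later rating applies only to the
    # efficiency that remains after the earlier ones.
    for rating in sorted(ratings, reverse=True):
        remaining = 100 - combined
        # Integer arithmetic with round-half-up, avoiding float and
        # banker's-rounding surprises.
        combined += (remaining * rating + 50) // 100

    # Final step: nearest multiple of 10, with a trailing 5 rounded up.
    final = (combined + 5) // 10 * 10
    return combined, final


print(combine_ratings([60, 40, 20]))  # prints: (81, 80)
print(combine_ratings([50, 30]))      # prints: (65, 70)
```

The first call reproduces the worked example in the regulation itself: 60 and 40 combine to 76, 76 and 20 combine to 81, and 81 rounds to 80.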
And the deterministic-pipeline decision costs convenience. An agent with file tools would be easier to build and more flexible to use. I gave that up because for this data, the failure mode of a misbehaving agent is not a bug report. It is someone's medical history in a third party's request logs.
Status
Planning has hardened into a working spine: extraction, regex redaction, the encrypted vault, combined-ratings math, and a resumable processing ledger run and pass their tests -- 42 tests across 11 test files at the moment. The Presidio layer is on by default when its dependencies are installed (use_ner=True, with a --no-ner escape hatch), and degrades silently to regex-only when they aren't. That is the one default I'm still debating, since a silent downgrade in a redaction pipeline cuts against everything else in the design. The local engine is the next seam to fill.
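For the downgrade question, one possible alternative to a silent fallback is sketched below. It is not the current behavior. The choose_layers function, its strict flag, and the layer names are hypothetical; only use_ner comes from the text. The sketch probes for the presidio_analyzer module and either warns loudly or refuses to run when NER was requested but is unavailable.

```python
import importlib.util
import warnings
from typing import Callable


def module_available(name: str) -> bool:
    # find_spec returns None when a top-level module is not installed.
    return importlib.util.find_spec(name) is not None


def choose_layers(
    use_ner: bool = True,
    strict: bool = False,
    probe: Callable[[str], bool] = module_available,
) -> list[str]:
    layers = ["regex", "denylist"]
    if not use_ner:
        # The user opted out explicitly; nothing to report.
        return layers
    if probe("presidio_analyzer"):
        return ["presidio", *layers]
    if strict:
        # Fail closed: refuse to run with weaker redaction than requested.
        raise RuntimeError("NER requested but presidio_analyzer is not installed")
    # Fail open, but visibly.
    warnings.warn(
        "presidio_analyzer not installed; falling back to regex-only redaction",
        RuntimeWarning,
        stacklevel=2,
    )
    return layers
```

Passing the probe in as a parameter keeps the decision testable on machines with or without the ML stack installed.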
The order of construction was itself a design decision. Building and testing the deterministic parts first -- before any model is in the loop -- means the privacy machinery exists and is verified before there is anything tempting to ship. A tool like this earns trust through its boring parts, and those went in first.
If the local engine proves capable enough, the cloud path gets deleted rather than maintained. That is the direction the architecture points, and the interface seam exists so the deletion is a one-file change.