OpenMed vs. Generic LLM: A Side-by-Side Test on Clinical Text Extraction

When teams start working with clinical text, the default instinct is often to send a note to a large language model and ask it to “extract entities” or “remove PHI.” That works — until you need auditability, consistent labels, offline execution, or HIPAA-aligned de-identification.

We ran a controlled experiment comparing three approaches on the same synthetic clinical note:

OpenMed — purpose-built, local biomedical NER + PII models
Regex baseline — a common DIY pattern-matching approach
Generic LLM — a single prompt to a local chat model (Ollama / Llama 3.2 1B)

The goal was not to declare a winner on every field, but to show what each approach is actually good at — and where they diverge in ways that matter for production clinical workflows.

Experiment setup

Parameter	Value
Environment	macOS, Python 3.12 (Miniconda)
OpenMed version	`1.6.0` (`pip install "openmed[hf]"`)
LLM runtime	Ollama (`llama3.2:1b`), temperature `0`, JSON output mode
Execution	100% local — no cloud API calls, no PHI sent off-device
Script	Custom comparison script (`openmed_comparison/compare.py`)
First run	~70s (model downloads from Hugging Face)
Cached run	~7–13s

All three methods received identical input text.

Test note

Synthetic clinical note used for the experiment (not real patient data):

```text
Patient John Smith (MRN: 12345678, DOB: 03/15/1965, SSN: 123-45-6789)
was diagnosed with chronic myeloid leukemia and started imatinib 400mg daily.
Contact: john.smith@email.com, phone (555) 123-4567.
NPI: 1234567890. Address: 742 Evergreen Terrace, Springfield, IL 62704.
```

This note was designed to mix:

Medical content — disease name, drug name, dosage
HIPAA-style identifiers — name, MRN, DOB, SSN, email, phone, NPI, address
Realistic formatting — labels like MRN:, DOB:, NPI: near values

Methods compared

Method 1 — OpenMed (specialized local models)

OpenMed does not use one general chat model. It runs small encoder models (token classification) per task, then optionally de-identifies the text.

Models used:

Task	Registry key	Hugging Face model	Params	Confidence threshold
Disease NER	`disease_detection_superclinical`	`OpenMed/OpenMed-NER-DiseaseDetect-SuperClinical-184M`	~184M	0.50
Drug/chemical NER	`chemical_detection_electramed_33m`	`OpenMed/OpenMed-NER-ChemicalDetect-ElectraMed-33M`	~33M	0.45
PII extraction	default English PII model	`OpenMed/OpenMed-PII-SuperClinical-Small-44M-v1`	~44M	0.50 (default)
De-identification	same PII stack	—	—	0.70 (default; stricter than the 0.50 extraction default)

Python calls:

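```python
from openmed import analyze_text, extract_pii, deidentify

# The synthetic test note, lines joined with "\n" (the spans reported below assume this)
text = (
    "Patient John Smith (MRN: 12345678, DOB: 03/15/1965, SSN: 123-45-6789)\n"
    "was diagnosed with chronic myeloid leukemia and started imatinib 400mg daily.\n"
    "Contact: john.smith@email.com, phone (555) 123-4567.\n"
    "NPI: 1234567890. Address: 742 Evergreen Terrace, Springfield, IL 62704."
)

# Medical entity extraction: one task-specific model per call
disease = analyze_text(text, model_name="disease_detection_superclinical", confidence_threshold=0.5)
chemical = analyze_text(text, model_name="chemical_detection_electramed_33m", confidence_threshold=0.45)

# PII detection with the default English PII model
pii = extract_pii(text, lang="en")

# De-identification (mask mode)
deid = deidentify(text, lang="en")
```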

How it works: Each model tags token spans with typed labels (DISEASE, CHEM, first_name, ssn, etc.) and returns character offsets plus a per-entity confidence score computed by the classifier; the de-identification step then replaces detected PII with typed placeholders like [ssn].
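The placeholder step itself is simple once a detector has produced typed spans. A minimal sketch, assuming each detection has been reduced to a label plus start/end character offsets; the `Span` type and `mask` function are hypothetical illustrations, not OpenMed's API:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """Hypothetical detection record: a typed label plus [start, end) character offsets."""
    label: str
    start: int
    end: int


def mask(text: str, spans: list[Span]) -> str:
    """Replace each span with a typed placeholder such as "[ssn]".

    Spans are applied from right to left, so replacing one span never shifts
    the offsets of the spans still waiting to be processed. Assumes no overlaps.
    """
    out = text
    for span in sorted(spans, key=lambda s: s.start, reverse=True):
        out = out[:span.start] + f"[{span.label}]" + out[span.end:]
    return out


if __name__ == "__main__":
    sample = "DOB: 03/15/1965, SSN: 123-45-6789"
    print(mask(sample, [Span("date", 5, 15), Span("ssn", 22, 33)]))  # DOB: [date], SSN: [ssn]
```

Processing right to left is the design choice that matters: it lets the function keep using the detector's original offsets instead of recomputing them after every substitution.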

Method 2 — Regex baseline (common DIY approach)

A hand-written set of regular expressions — the kind of thing teams often ship before adopting a dedicated NLP stack.

Patterns used:

Label	Pattern
Email	`[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}`
SSN	`\b\d{3}-\d{2}-\d{4}\b`
Phone	`\(\d{3}\)\s*\d{3}-\d{4}`
Date	`\b\d{2}/\d{2}/\d{4}\b`
MRN	`MRN:\s*\d+`
NPI	`NPI:\s*\d{10}`

No medical NER. No confidence scores. No de-identification pipeline.
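The baseline is small enough to show in full. A sketch of how the six patterns could be applied, assuming matches are simply collected in table order; `regex_extract` is a hypothetical name, not the comparison script itself:

```python
import re

# The six baseline patterns, in the order of the table above.
PATTERNS = {
    "email": r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}",
    "ssn": r"\b\d{3}-\d{2}-\d{4}\b",
    "phone": r"\(\d{3}\)\s*\d{3}-\d{4}",
    "date": r"\b\d{2}/\d{2}/\d{4}\b",
    "mrn": r"MRN:\s*\d+",
    "npi": r"NPI:\s*\d{10}",
}

# The synthetic test note, lines joined with "\n".
NOTE = (
    "Patient John Smith (MRN: 12345678, DOB: 03/15/1965, SSN: 123-45-6789)\n"
    "was diagnosed with chronic myeloid leukemia and started imatinib 400mg daily.\n"
    "Contact: john.smith@email.com, phone (555) 123-4567.\n"
    "NPI: 1234567890. Address: 742 Evergreen Terrace, Springfield, IL 62704."
)


def regex_extract(text: str) -> list[dict]:
    """Collect every match as {label, text, start, end}: no scores, no medical entities."""
    hits = []
    for label, pattern in PATTERNS.items():
        for match in re.finditer(pattern, text):
            hits.append({"label": label, "text": match.group(),
                         "start": match.start(), "end": match.end()})
    return hits


for hit in regex_extract(NOTE):
    print(hit["label"], hit["text"])
```

On the test note this yields the same six hits listed in the regex results; nothing in it can recognize "John Smith", "Springfield", or "imatinib".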

Method 3 — Generic LLM (Ollama / Llama 3.2 1B)

A single zero-shot prompt asking the model to return structured JSON — mimicking how many teams first approach clinical extraction with ChatGPT-style tools.

Model: llama3.2:1b via Ollama local API
Temperature: 0
Output format: JSON (Ollama format: "json")

Prompt:

```text
Extract all medical entities and personally identifiable information (PII) from this clinical note.
Return ONLY valid JSON with this shape:
{
  "medical_entities": [{"type": "...", "text": "...", "confidence": 0.0-1.0}],
  "pii_entities": [{"type": "...", "text": "...", "confidence": 0.0-1.0}]
}

Clinical note:
---
{note text here}
---
```

No fine-tuning. No task-specific models. One prompt, one shot.
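For reference, the LLM call can be reproduced against Ollama's local REST API using only the Python standard library. A sketch assuming Ollama is serving on its default endpoint `http://localhost:11434/api/generate`; `build_payload` and `extract_with_llm` are hypothetical names, not the comparison script's:

```python
import json
import urllib.request

# Assumption: Ollama is running locally on its default port.
OLLAMA_URL = "http://localhost:11434/api/generate"

# The experiment's prompt; the note is appended between the "---" markers.
PROMPT_HEADER = (
    "Extract all medical entities and personally identifiable information (PII) "
    "from this clinical note.\n"
    "Return ONLY valid JSON with this shape:\n"
    "{\n"
    '  "medical_entities": [{"type": "...", "text": "...", "confidence": 0.0-1.0}],\n'
    '  "pii_entities": [{"type": "...", "text": "...", "confidence": 0.0-1.0}]\n'
    "}\n"
    "\n"
    "Clinical note:\n"
    "---\n"
)


def build_payload(note: str, model: str = "llama3.2:1b") -> dict:
    """Request body for /api/generate: JSON mode, temperature 0, no streaming."""
    return {
        "model": model,
        "prompt": PROMPT_HEADER + note + "\n---",
        "format": "json",               # constrain the reply to syntactically valid JSON
        "stream": False,                # one response object instead of a token stream
        "options": {"temperature": 0},  # same sampling setting as the experiment
    }


def extract_with_llm(note: str, url: str = OLLAMA_URL, timeout: float = 120.0) -> dict:
    """POST the prompt and parse the model's JSON reply (requires a running Ollama)."""
    data = json.dumps(build_payload(note)).encode("utf-8")
    request = urllib.request.Request(url, data=data, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.loads(response.read().decode("utf-8"))
    # Ollama returns the generated text as a string in the "response" field.
    return json.loads(body["response"])
```

JSON mode only guarantees that the reply parses; it says nothing about whether the keys, labels, or values follow the requested shape.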

Results

Summary at a glance

Method	Entities found	Medical	PII	De-ID output	Runtime	Local
OpenMed	13	2	11	✅ Yes	6.93s	✅
Regex	6	0	6	❌ No	<0.01s	✅
LLM (Llama 3.2 1B)	7	3	4	❌ No	2.52s	✅

OpenMed — full output

Medical entities (2):

Label	Text	Confidence	Span
`DISEASE`	chronic myeloid leukemia	0.961	89–113
`CHEM`	imatinib	0.942	126–134

PII entities (11):

Label	Text	Confidence
`first_name`	John	0.999
`last_name`	Smith	0.998
`medical_record_number`	MRN: 12345678	0.825
`date`	03/15/1965	0.820
`ssn`	123-45-6789	0.939
`email`	john.smith@email.com	0.999
`npi`	1234567890	0.240
`street_address`	742 Evergreen Terrace	0.999
`city`	Springfield	0.997
`state`	IL	0.998
`postcode`	62704	0.759

De-identified text:

```text
Patient [first_name] [last_name] ([medical_record_number], DOB: [date], SSN: [ssn])
was diagnosed with chronic myeloid leukemia and started imatinib 400mg daily.
Contact: [email], phone (555) 123-4567. NPI: [npi].
Address: [street_address], [city], [state] [postcode].
```

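Notable gap: OpenMed missed the phone number (555) 123-4567 on this run, so it survives unmasked in the de-identified text. The NPI was also a weak detection (0.240 confidence, the lowest in the table).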

Regex baseline — full output

PII only (6):

Label	Text
email	john.smith@email.com
ssn	123-45-6789
phone	(555) 123-4567
date	03/15/1965
mrn	MRN: 12345678
npi	NPI: 1234567890

Notable gaps: No names. No address components. No medical entities at all.

Generic LLM — full output

Medical entities (3):

Type (LLM-assigned)	Text returned	Confidence (self-reported)
Patient	John Smith	0.800
Chronic Myeloid Leukemia	CML	0.900
Imatinib	Imatinib 400mg daily	0.700

PII entities (4):

Type (LLM-assigned)	Text returned	Confidence (self-reported)
Patient	John Smith	0.800
SSN	123-45-6789	0.500
DOB	03/15/1965	0.600
Phone	(555) 123-4567	0.400

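Notable gaps: Missed the email, MRN, NPI, and the entire address. Returned “John Smith” twice, typed as `Patient` in both the medical and PII lists, and put entity names (“Chronic Myeloid Leukemia”, “Imatinib”) in the `type` field instead of categories.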

Notable error: The model returned “CML” as extracted text even though “CML” does not appear anywhere in the source note; it inferred the abbreviation from “chronic myeloid leukemia.” That is a hallucination: the returned string cannot be traced back to any span of the input, which is unacceptable in regulated extraction pipelines.
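Fabricated text like “CML” is cheap to catch mechanically. A minimal grounding check, assuming each LLM entity is a dict with a `"text"` key as in the prompt's schema; `ungrounded` is a hypothetical helper:

```python
def ungrounded(entities: list[dict], source: str) -> list[dict]:
    """Return the entities whose "text" never occurs in the source note.

    The match is a case-insensitive substring test. It reliably catches invented
    strings such as "CML", but passing it is weak evidence for short values
    ("IL" would also match inside "daily").
    """
    haystack = source.lower()
    return [e for e in entities if e["text"].lower() not in haystack]


line = "was diagnosed with chronic myeloid leukemia and started imatinib 400mg daily."
llm_medical = [
    {"type": "Chronic Myeloid Leukemia", "text": "CML", "confidence": 0.9},
    {"type": "Imatinib", "text": "Imatinib 400mg daily", "confidence": 0.7},
]
print(ungrounded(llm_medical, line))  # only the "CML" entry is returned
```

A check like this can reject a hallucinated span, but it cannot recover the identifiers the model never returned.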

Coverage comparison

OpenMed vs. Regex

Category	Shared (both caught)	Only OpenMed	Only Regex
Count	4	9	2

Shared: date, SSN, email, MRN
Only OpenMed: both medical entities, patient names, address (street, city, state, zip), bare NPI digits
Only Regex: phone number, NPI with label prefix (NPI: 1234567890)

OpenMed vs. LLM

Category	Shared (both caught)	Only OpenMed	Only LLM
Count	2	11	5

Shared: date, SSN
Only OpenMed: disease name (exact span), imatinib (exact span), names, email, MRN, address fields, NPI digits
Only LLM: phone, patient name as an undifferentiated “John Smith” blob (returned twice, once per category), imatinib with dosage appended, hallucinated “CML”
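The shared/only counts in both tables are consistent with exact-string matching on the returned text, where the "only" columns count each returned entity (so the LLM's duplicated “John Smith” counts twice, and `1234567890` and `NPI: 1234567890` count as different entities). A sketch under that assumption; the comparison script's actual matching rule may differ:

```python
# Entity texts exactly as listed in the results tables.
OPENMED = [
    "chronic myeloid leukemia", "imatinib",
    "John", "Smith", "MRN: 12345678", "03/15/1965", "123-45-6789",
    "john.smith@email.com", "1234567890",
    "742 Evergreen Terrace", "Springfield", "IL", "62704",
]
REGEX = ["john.smith@email.com", "123-45-6789", "(555) 123-4567",
         "03/15/1965", "MRN: 12345678", "NPI: 1234567890"]
LLM = ["John Smith", "CML", "Imatinib 400mg daily",                  # medical_entities
       "John Smith", "123-45-6789", "03/15/1965", "(555) 123-4567"]  # pii_entities


def coverage(a: list[str], b: list[str]) -> tuple[int, int, int]:
    """Return (shared, only_a, only_b) under exact string matching.

    "shared" counts distinct texts returned by both methods; the "only" counts
    are per returned entity, so duplicates are counted each time they appear.
    """
    set_a, set_b = set(a), set(b)
    shared = len(set_a & set_b)
    only_a = sum(1 for text in a if text not in set_b)
    only_b = sum(1 for text in b if text not in set_a)
    return shared, only_a, only_b


print(coverage(OPENMED, REGEX))  # (4, 9, 2)
print(coverage(OPENMED, LLM))    # (2, 11, 5)
```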

Interpretation

1. OpenMed is a toolkit, not a chatbot

OpenMed uses small, task-specific encoder models (33M–184M parameters) trained for biomedical NER and PII — not a generative LLM. That architectural choice shows up in the output:

Typed, stable labels (DISEASE, CHEM, first_name, ssn)
Character-level span offsets for audit trails
Confidence scores produced by the classifier itself, not self-reported guesses
A built-in de-identification API with multiple redaction strategies

For regulated workflows, this structure matters as much as raw accuracy.
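Character offsets are what make an audit trail checkable: anyone can slice the source at the reported positions and confirm the text. A minimal check against the two medical spans in the OpenMed results, assuming the note's lines are joined with single `\n` characters (the reported offsets only line up under that assumption); `audit_spans` is a hypothetical helper:

```python
# The synthetic note, lines joined with single newline characters.
NOTE = (
    "Patient John Smith (MRN: 12345678, DOB: 03/15/1965, SSN: 123-45-6789)\n"
    "was diagnosed with chronic myeloid leukemia and started imatinib 400mg daily.\n"
    "Contact: john.smith@email.com, phone (555) 123-4567.\n"
    "NPI: 1234567890. Address: 742 Evergreen Terrace, Springfield, IL 62704."
)

# (label, text, start, end) as reported in the OpenMed medical-entity table.
REPORTED = [
    ("DISEASE", "chronic myeloid leukemia", 89, 113),
    ("CHEM", "imatinib", 126, 134),
]


def audit_spans(text: str, entities: list[tuple[str, str, int, int]]) -> list[tuple[str, str, int, int]]:
    """Return every entity whose [start, end) slice does not reproduce its reported text."""
    return [e for e in entities if text[e[2]:e[3]] != e[1]]


print(audit_spans(NOTE, REPORTED))  # []
```

An LLM answer like “CML” cannot even enter this check, because it comes with no offsets to verify.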

2. LLMs are flexible but unreliable for extraction

The LLM caught the one identifier OpenMed missed (the phone number) and returned plausible-looking JSON quickly. But it also:

Hallucinated entity text ("CML" never appeared in the note)
Used inconsistent labels (Patient as both medical and PII)
Missed several identifiers (email, MRN, address)
Produced confidence scores with no calibration — the model assigns numbers that look authoritative but are not grounded in classification probability

For a demo, that may be fine. For production de-identification or coding pipelines, it is a liability.
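Inconsistent labels are easier to catch than to prevent. A sketch of a post-hoc check on the LLM's JSON, assuming a fixed label vocabulary; `ALLOWED_TYPES` and `validate` are hypothetical, and the vocabulary is illustrative, not one defined by OpenMed or Ollama:

```python
# Hypothetical label vocabulary; a real pipeline would define its own.
ALLOWED_TYPES = {
    "medical_entities": {"DISEASE", "DRUG"},
    "pii_entities": {"NAME", "DATE", "SSN", "PHONE", "EMAIL", "MRN", "NPI", "ADDRESS"},
}


def validate(payload: dict) -> list[str]:
    """Return human-readable problems; an empty list means shape and labels look as expected."""
    problems = []
    for key, allowed in ALLOWED_TYPES.items():
        items = payload.get(key)
        if not isinstance(items, list):
            problems.append(f"{key}: missing or not a list")
            continue
        for i, item in enumerate(items):
            if not isinstance(item, dict):
                problems.append(f"{key}[{i}]: not an object")
                continue
            if item.get("type") not in allowed:
                problems.append(f"{key}[{i}]: unexpected type {item.get('type')!r}")
            if not isinstance(item.get("text"), str) or not item["text"]:
                problems.append(f"{key}[{i}]: missing text")
            score = item.get("confidence")
            if isinstance(score, bool) or not isinstance(score, (int, float)) or not 0.0 <= score <= 1.0:
                problems.append(f"{key}[{i}]: confidence {score!r} is not a number in [0, 1]")
    return problems


# The LLM output from this run, transcribed from the tables above.
LLM_OUTPUT = {
    "medical_entities": [
        {"type": "Patient", "text": "John Smith", "confidence": 0.8},
        {"type": "Chronic Myeloid Leukemia", "text": "CML", "confidence": 0.9},
        {"type": "Imatinib", "text": "Imatinib 400mg daily", "confidence": 0.7},
    ],
    "pii_entities": [
        {"type": "Patient", "text": "John Smith", "confidence": 0.8},
        {"type": "SSN", "text": "123-45-6789", "confidence": 0.5},
        {"type": "DOB", "text": "03/15/1965", "confidence": 0.6},
        {"type": "Phone", "text": "(555) 123-4567", "confidence": 0.4},
    ],
}

for problem in validate(LLM_OUTPUT):
    print(problem)
```

Applied to this run's output, it flags six of the seven entries; only `SSN` uses an allowed type. Validation does not make the model more accurate, but it turns silent label drift into an explicit error.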

3. Regex is fast but structurally limited

Regex found the well-formatted patterns in under 10 ms (<0.01s in the summary table) and correctly caught the phone number OpenMed missed. But it cannot extract diseases, drugs, or free-text names and addresses without a separate NER stack, which is essentially what OpenMed provides out of the box.
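The regex result suggests a cheap hybrid: keep the model's spans and backfill only regex hits that don't overlap any of them. A minimal sketch, assuming both detectors emit dicts with `label`, `start`, and `end`; `overlaps` and `backfill` are hypothetical helpers. Model spans win on conflict because they carry finer-grained labels, so the regex acts only as a safety net for formats the model missed:

```python
def overlaps(a_start: int, a_end: int, b_start: int, b_end: int) -> bool:
    """True if the half-open ranges [a_start, a_end) and [b_start, b_end) intersect."""
    return a_start < b_end and b_start < a_end


def backfill(model_spans: list[dict], regex_spans: list[dict]) -> list[dict]:
    """Keep every model span; add a regex span only where no model span overlaps it."""
    merged = list(model_spans)
    for r in regex_spans:
        if not any(overlaps(r["start"], r["end"], m["start"], m["end"]) for m in model_spans):
            merged.append(r)
    return sorted(merged, key=lambda s: s["start"])


if __name__ == "__main__":
    text = "phone (555) 123-4567. NPI: 1234567890."
    model = [{"label": "npi", "start": 27, "end": 37}]
    regex = [{"label": "phone", "start": 6, "end": 20}, {"label": "npi", "start": 22, "end": 37}]
    for span in backfill(model, regex):
        print(span["label"], text[span["start"]:span["end"]])
    # phone (555) 123-4567
    # npi 1234567890
```

Applied to the full note, this rule would add only the phone number: the regex's `NPI: 1234567890` overlaps OpenMed's bare `1234567890` span, and the remaining regex hits coincide with OpenMed spans.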

4. Privacy posture is a first-class difference

All three methods ran locally in this experiment. In a typical cloud-LLM setup, the same note would be sent to an external API. OpenMed’s default posture — download models once, run inference on your hardware, no runtime telemetry — is aligned with PHI handling requirements in ways a prompt-to-ChatGPT workflow is not.

5. No single method wins every field

Strength	Best approach in this test
Medical NER (disease + drug)	OpenMed
Broad PII with typed labels + de-ID	OpenMed
Phone number detection	Regex / LLM
Speed on trivial patterns	Regex
Flexibility / zero-setup	LLM
Auditability + span fidelity	OpenMed

The practical takeaway: use the right tool per layer — specialized encoders for extraction and de-ID, LLMs for reasoning over already-de-identified text if needed.

Conclusion

This experiment on a single synthetic note surfaced a clear pattern:

OpenMed excels at structured, typed, local extraction and de-identification — the kind of output you want before data enters analytics, training, or downstream AI pipelines.
Generic LLMs can approximate the task with a prompt, but introduce hallucination risk, inconsistent schemas, and uncalibrated confidence — fine for exploration, risky for compliance.
Regex remains useful for well-formatted identifiers but cannot replace medical NER or nuanced PII detection on its own.

OpenMed is not a replacement for frontier models in clinical reasoning. It is complementary infrastructure: extract and de-identify first, reason second.

Reproduce this experiment

Requirements: Python 3.12+, openmed[hf], Ollama with llama3.2:1b

```bash
pip install "openmed[hf]"
ollama pull llama3.2:1b

python openmed_comparison/compare.py
```

Pass a custom note:

```bash
python openmed_comparison/compare.py your_note.txt
```
