When teams start working with clinical text, the default instinct is often to send a note to a large language model and ask it to “extract entities” or “remove PHI.” That works — until you need auditability, consistent labels, offline execution, or HIPAA-aligned de-identification.
We ran a controlled experiment comparing three approaches on the same synthetic clinical note:
- OpenMed — purpose-built, local biomedical NER + PII models
- Regex baseline — a common DIY pattern-matching approach
- Generic LLM — a single prompt to a local chat model (Ollama / Llama 3.2 1B)
The goal was not to declare a winner on every field, but to show what each approach is actually good at — and where they diverge in ways that matter for production clinical workflows.
Experiment setup
| Parameter | Value |
|---|---|
| Environment | macOS, Python 3.12 (Miniconda) |
| OpenMed version | 1.6.0 (pip install "openmed[hf]") |
| LLM runtime | Ollama (llama3.2:1b), temperature 0, JSON output mode |
| Execution | 100% local — no cloud API calls, no PHI sent off-device |
| Script | Custom comparison script (openmed_comparison/compare.py) |
| First run | ~70s (model downloads from Hugging Face) |
| Cached run | ~7–13s |
All three methods received identical input text.
Test note
Synthetic clinical note used for the experiment (not real patient data):
Patient John Smith (MRN: 12345678, DOB: 03/15/1965, SSN: 123-45-6789)
was diagnosed with chronic myeloid leukemia and started imatinib 400mg daily.
Contact: john.smith@email.com, phone (555) 123-4567.
NPI: 1234567890. Address: 742 Evergreen Terrace, Springfield, IL 62704.

This note was designed to mix:
- Medical content: disease name, drug name, dosage
- HIPAA-style identifiers: name, MRN, DOB, SSN, email, phone, NPI, address
- Realistic formatting: labels such as MRN:, DOB:, and NPI: next to their values
Methods compared
Method 1 — OpenMed (specialized local models)
OpenMed does not use one general chat model. It runs small encoder models (token classification) per task, then optionally de-identifies the text.
Models used:
| Task | Registry key | Hugging Face model | Params | Confidence threshold |
|---|---|---|---|---|
| Disease NER | disease_detection_superclinical | OpenMed/OpenMed-NER-DiseaseDetect-SuperClinical-184M | ~184M | 0.50 |
| Drug/chemical NER | chemical_detection_electramed_33m | OpenMed/OpenMed-NER-ChemicalDetect-ElectraMed-33M | ~33M | 0.45 |
| PII extraction | default English PII model | OpenMed/OpenMed-PII-SuperClinical-Small-44M-v1 | ~44M | 0.50 (default) |
| De-identification | same PII stack | — | — | 0.70 (default, safer for redaction) |
Python calls:
from openmed import analyze_text, extract_pii, deidentify

# `text` holds the synthetic clinical note shown above.

# Medical entity extraction
disease = analyze_text(text, model_name="disease_detection_superclinical", confidence_threshold=0.5)
chemical = analyze_text(text, model_name="chemical_detection_electramed_33m", confidence_threshold=0.45)

# PII detection
pii = extract_pii(text, lang="en")

# De-identification (mask mode)
deid = deidentify(text, lang="en")

How it works: Each model tags token spans with typed labels (DISEASE, CHEM, first_name, ssn, etc.) and returns character offsets and calibrated confidence scores. The de-identification step then replaces detected PII with typed placeholders like [ssn].
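The placeholder replacement described above can be sketched in plain Python. This is an illustration of the general technique, not OpenMed's implementation: the `Span` type and `mask_spans` helper are hypothetical names, and the sketch assumes half-open character offsets and non-overlapping spans.

```python
from typing import NamedTuple


class Span(NamedTuple):
    """A detected entity: a label plus half-open character offsets [start, end)."""
    label: str
    start: int
    end: int


def mask_spans(text: str, spans: list[Span]) -> str:
    """Replace each span with a typed placeholder such as [ssn]."""
    # Work right to left so earlier offsets stay valid after each replacement.
    for span in sorted(spans, key=lambda s: s.start, reverse=True):
        text = text[:span.start] + f"[{span.label}]" + text[span.end:]
    return text


note = "SSN: 123-45-6789, email john.smith@email.com"
spans = [Span("ssn", 5, 16), Span("email", 24, 44)]
masked = mask_spans(note, spans)
print(masked)  # SSN: [ssn], email [email]
```

Replacing from the end of the string backwards avoids having to recompute offsets after each substitution changes the text length.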
Method 2 — Regex baseline (common DIY approach)
A hand-written set of regular expressions — the kind of thing teams often ship before adopting a dedicated NLP stack.
Patterns used:
| Label | Pattern |
|---|---|
| Email | [A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,} |
| SSN | \b\d{3}-\d{2}-\d{4}\b |
| Phone | \(\d{3}\)\s*\d{3}-\d{4} |
| Date | \b\d{2}/\d{2}/\d{4}\b |
| MRN | MRN:\s*\d+ |
| NPI | NPI:\s*\d{10} |
No medical NER. No confidence scores. No de-identification pipeline.
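A minimal sketch of such a baseline, using the patterns from the table, might look like the following. The `regex_extract` function and the output dictionary shape are assumptions for illustration; the actual comparison script may be organized differently.

```python
import re

# Patterns copied from the table above, keyed by lowercase label.
PATTERNS = {
    "email": r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}",
    "ssn": r"\b\d{3}-\d{2}-\d{4}\b",
    "phone": r"\(\d{3}\)\s*\d{3}-\d{4}",
    "date": r"\b\d{2}/\d{2}/\d{4}\b",
    "mrn": r"MRN:\s*\d+",
    "npi": r"NPI:\s*\d{10}",
}


def regex_extract(text: str) -> list[dict]:
    """Run every pattern over the text and collect labeled matches with offsets."""
    found = []
    for label, pattern in PATTERNS.items():
        for match in re.finditer(pattern, text):
            found.append({
                "label": label,
                "text": match.group(),
                "start": match.start(),
                "end": match.end(),
            })
    return found


NOTE = (
    "Patient John Smith (MRN: 12345678, DOB: 03/15/1965, SSN: 123-45-6789)\n"
    "was diagnosed with chronic myeloid leukemia and started imatinib 400mg daily.\n"
    "Contact: john.smith@email.com, phone (555) 123-4567.\n"
    "NPI: 1234567890. Address: 742 Evergreen Terrace, Springfield, IL 62704."
)

hits = regex_extract(NOTE)
for hit in hits:
    print(hit["label"], "->", hit["text"])
```

Note that the MRN and NPI patterns capture the label prefix along with the digits, which affects how matches line up against other methods.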
Method 3 — Generic LLM (Ollama / Llama 3.2 1B)
A single zero-shot prompt asking the model to return structured JSON — mimicking how many teams first approach clinical extraction with ChatGPT-style tools.
Model: llama3.2:1b via Ollama local API
Temperature: 0
Output format: JSON (Ollama format: "json")
Prompt:
Extract all medical entities and personally identifiable information (PII) from this clinical note.
Return ONLY valid JSON with this shape:
{
"medical_entities": [{"type": "...", "text": "...", "confidence": 0.0-1.0}],
"pii_entities": [{"type": "...", "text": "...", "confidence": 0.0-1.0}]
}
Clinical note:
---
{note text here}
---

No fine-tuning. No task-specific models. One prompt, one shot.
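For readers who want to try this, here is a rough sketch of sending that prompt to a local Ollama server using only the standard library. It assumes Ollama's default address (`localhost:11434`) and the `/api/generate` endpoint; the original script may have used a different endpoint or client, and `build_request` and `extract_with_llm` are hypothetical helper names.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # assumed default local endpoint

PROMPT_HEADER = (
    "Extract all medical entities and personally identifiable information (PII) "
    "from this clinical note.\n"
    "Return ONLY valid JSON with this shape:\n"
    "{\n"
    '"medical_entities": [{"type": "...", "text": "...", "confidence": 0.0-1.0}],\n'
    '"pii_entities": [{"type": "...", "text": "...", "confidence": 0.0-1.0}]\n'
    "}\n"
    "Clinical note:\n"
    "---\n"
)


def build_request(note: str, model: str = "llama3.2:1b") -> dict:
    """Assemble the request body: JSON output mode, temperature 0, no streaming."""
    return {
        "model": model,
        "prompt": PROMPT_HEADER + note + "\n---",
        "format": "json",
        "stream": False,
        "options": {"temperature": 0},
    }


def extract_with_llm(note: str) -> dict:
    """POST the prompt to the local server and parse the model's JSON reply."""
    body = json.dumps(build_request(note)).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=120) as response:
        reply = json.loads(response.read())
    # With streaming disabled, the generated text arrives in the "response" field.
    return json.loads(reply["response"])
```

Calling `extract_with_llm(note)` requires a running Ollama instance with the model already pulled. Even in JSON mode, the shape of the returned object is not guaranteed to match the requested schema, so downstream code should validate it.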
Results
Summary at a glance
| Method | Entities found | Medical | PII | De-ID output | Runtime | Local |
|---|---|---|---|---|---|---|
| OpenMed | 13 | 2 | 11 | ✅ Yes | 6.93s | ✅ |
| Regex | 6 | 0 | 6 | ❌ No | <0.01s | ✅ |
| LLM (Llama 3.2 1B) | 7 | 3 | 4 | ❌ No | 2.52s | ✅ |
OpenMed — full output
Medical entities (2):
| Label | Text | Confidence | Span |
|---|---|---|---|
| DISEASE | chronic myeloid leukemia | 0.961 | 89–113 |
| CHEM | imatinib | 0.942 | 126–134 |
PII entities (11):
| Label | Text | Confidence |
|---|---|---|
| first_name | John | 0.999 |
| last_name | Smith | 0.998 |
| medical_record_number | MRN: 12345678 | 0.825 |
| date | 03/15/1965 | 0.820 |
| ssn | 123-45-6789 | 0.939 |
| email | john.smith@email.com | 0.999 |
| npi | 1234567890 | 0.240 |
| street_address | 742 Evergreen Terrace | 0.999 |
| city | Springfield | 0.997 |
| state | IL | 0.998 |
| postcode | 62704 | 0.759 |
De-identified text:
Patient [first_name] [last_name] ([medical_record_number], DOB: [date], SSN: [ssn])
was diagnosed with chronic myeloid leukemia and started imatinib 400mg daily.
Contact: [email], phone (555) 123-4567. NPI: [npi].
Address: [street_address], [city], [state] [postcode].

Notable gap: OpenMed missed the phone number (555) 123-4567 on this run.
Regex baseline — full output
PII only (6):
| Label | Text |
|---|---|
| email | john.smith@email.com |
| ssn | 123-45-6789 |
| phone | (555) 123-4567 |
| date | 03/15/1965 |
| mrn | MRN: 12345678 |
| npi | NPI: 1234567890 |
Notable gaps: No names. No address components. No medical entities at all.
Generic LLM — full output
Medical entities (3):
| Type (LLM-assigned) | Text returned | Confidence (self-reported) |
|---|---|---|
| Patient | John Smith | 0.800 |
| Chronic Myeloid Leukemia | CML | 0.900 |
| Imatinib | Imatinib 400mg daily | 0.700 |
PII entities (4):
| Type (LLM-assigned) | Text returned | Confidence (self-reported) |
|---|---|---|
| Patient | John Smith | 0.800 |
| SSN | 123-45-6789 | 0.500 |
| DOB | 03/15/1965 | 0.600 |
| Phone | (555) 123-4567 | 0.400 |
Notable gaps: Missed email, MRN, NPI, and full address. Duplicated “John Smith” across medical and PII categories with inconsistent typing.
Notable error: The model returned "CML" as extracted text even though “CML” does not appear anywhere in the source note — it inferred an abbreviation from “chronic myeloid leukemia.” That is a form of hallucination unacceptable in regulated extraction pipelines.
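One inexpensive guard against this failure mode is to check that every returned string actually occurs in the source. The sketch below is an assumed post-processing step, not part of the experiment; `split_grounded` is a hypothetical helper, and the source string is an abbreviated version of the test note. Matching is case-insensitive, so "Imatinib 400mg daily" passes while "CML" does not.

```python
def split_grounded(entities: list[dict], source: str) -> tuple[list[dict], list[dict]]:
    """Separate entities whose text occurs in the source from those that do not."""
    haystack = source.casefold()  # case-insensitive comparison
    grounded, ungrounded = [], []
    for entity in entities:
        bucket = grounded if entity["text"].casefold() in haystack else ungrounded
        bucket.append(entity)
    return grounded, ungrounded


# Abbreviated version of the test note, for illustration only.
source = (
    "Patient John Smith was diagnosed with chronic myeloid leukemia "
    "and started imatinib 400mg daily."
)
llm_medical = [
    {"type": "Patient", "text": "John Smith"},
    {"type": "Chronic Myeloid Leukemia", "text": "CML"},
    {"type": "Imatinib", "text": "Imatinib 400mg daily"},
]

grounded, ungrounded = split_grounded(llm_medical, source)
print([e["text"] for e in ungrounded])  # ['CML']
```

A stricter pipeline could use case-sensitive matching, which would also flag the capitalization change in "Imatinib".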
Coverage comparison
OpenMed vs. Regex
| Category | Shared (both caught) | Only OpenMed | Only Regex |
|---|---|---|---|
| Count | 4 | 9 | 2 |
Shared: date, SSN, email, MRN
Only OpenMed: both medical entities, patient names, address (street, city, state, zip), bare NPI digits
Only Regex: phone number, NPI with label prefix (NPI: 1234567890)
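The counts above are consistent with comparing the two outputs by exact extracted text, which is why the bare NPI digits and the prefixed "NPI: 1234567890" land in different columns. The matching rule is an inference from the reported numbers rather than something the script is confirmed to do; under that assumption, the comparison reduces to set operations:

```python
# Extracted strings from the result tables above.
openmed_texts = {
    "chronic myeloid leukemia", "imatinib",
    "John", "Smith", "MRN: 12345678", "03/15/1965", "123-45-6789",
    "john.smith@email.com", "1234567890",
    "742 Evergreen Terrace", "Springfield", "IL", "62704",
}
regex_texts = {
    "john.smith@email.com", "123-45-6789", "(555) 123-4567",
    "03/15/1965", "MRN: 12345678", "NPI: 1234567890",
}

shared = openmed_texts & regex_texts        # caught by both methods
only_openmed = openmed_texts - regex_texts  # caught by OpenMed alone
only_regex = regex_texts - openmed_texts    # caught by regex alone

print(len(shared), len(only_openmed), len(only_regex))  # 4 9 2
print(sorted(only_regex))
```

Exact-text matching is strict. A span-overlap comparison would treat the two NPI results as the same detection.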
OpenMed vs. LLM
| Category | Shared (both caught) | Only OpenMed | Only LLM |
|---|---|---|---|
| Count | 2 | 11 | 5 |
Shared: date, SSN
Only OpenMed: disease name (exact span), imatinib (exact span), names, email, MRN, address fields, NPI digits
Only LLM: phone, patient name as undifferentiated blob, imatinib with dosage appended, hallucinated “CML”
Interpretation
1. OpenMed is a toolkit, not a chatbot
OpenMed uses small, task-specific encoder models (33M–184M parameters) trained for biomedical NER and PII — not a generative LLM. That architectural choice shows up in the output:
- Typed, stable labels (DISEASE, CHEM, first_name, ssn)
- Character-level span offsets for audit trails
- Calibrated confidence scores from the model, not self-reported guesses
- A built-in de-identification API with multiple redaction strategies
For regulated workflows, this structure matters as much as raw accuracy.
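Span offsets make each extraction independently checkable: slicing the source at the reported offsets should reproduce the reported text. The sketch below illustrates that audit using the two medical entities from the results. It assumes half-open offsets and single-character line breaks in the note, and `find_span_mismatches` is a hypothetical helper.

```python
NOTE = (
    "Patient John Smith (MRN: 12345678, DOB: 03/15/1965, SSN: 123-45-6789)\n"
    "was diagnosed with chronic myeloid leukemia and started imatinib 400mg daily.\n"
    "Contact: john.smith@email.com, phone (555) 123-4567.\n"
    "NPI: 1234567890. Address: 742 Evergreen Terrace, Springfield, IL 62704."
)

entities = [
    {"label": "DISEASE", "text": "chronic myeloid leukemia", "start": 89, "end": 113},
    {"label": "CHEM", "text": "imatinib", "start": 126, "end": 134},
]


def find_span_mismatches(text: str, entities: list[dict]) -> list[dict]:
    """Return entities whose offsets do not reproduce their reported text."""
    return [e for e in entities if text[e["start"]:e["end"]] != e["text"]]


mismatches = find_span_mismatches(NOTE, entities)
print(mismatches)  # []
```

The same check cannot be run on the LLM output in this experiment, since the prompt did not ask for offsets.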
2. LLMs are flexible but unreliable for extraction
The LLM caught some things OpenMed missed (phone number) and returned plausible-looking JSON quickly. But it also:
- Hallucinated entity text ("CML" never appeared in the note)
- Used inconsistent labels (Patient as both medical and PII)
- Missed several identifiers (email, MRN, NPI, address)
- Produced confidence scores with no calibration — the model assigns numbers that look authoritative but are not grounded in classification probability
For a demo, that may be fine. For production de-identification or coding pipelines, it is a liability.
3. Regex is fast but structurally limited
Regex found obvious formatted patterns in under 10 milliseconds and correctly caught the phone number OpenMed missed. But it cannot extract diseases, drugs, or unstructured names/addresses without a separate NER stack, which is essentially what OpenMed provides out of the box.
4. Privacy posture is a first-class difference
All three methods ran locally in this experiment. In a typical cloud-LLM setup, the same note would be sent to an external API. OpenMed’s default posture — download models once, run inference on your hardware, no runtime telemetry — is aligned with PHI handling requirements in ways a prompt-to-ChatGPT workflow is not.
5. No single method wins every field
| Strength | Best approach in this test |
|---|---|
| Medical NER (disease + drug) | OpenMed |
| Broad PII with typed labels + de-ID | OpenMed |
| Phone number detection | Regex / LLM |
| Speed on trivial patterns | Regex |
| Flexibility / zero-setup | LLM |
| Auditability + span fidelity | OpenMed |
The practical takeaway: use the right tool per layer — specialized encoders for extraction and de-ID, LLMs for reasoning over already-de-identified text if needed.
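The layering can be expressed as a simple composition in which the reasoning step only ever receives masked text. This is a schematic sketch: both stages are passed in as callables, and the two stand-in functions below are placeholders rather than real OpenMed or LLM calls.

```python
from typing import Callable


def reason_over_deidentified(
    note: str,
    deidentify_fn: Callable[[str], str],
    reason_fn: Callable[[str], str],
) -> str:
    """De-identify first, then hand only the masked text to the reasoning step."""
    masked = deidentify_fn(note)
    return reason_fn(masked)


# Stand-ins for illustration; a real pipeline would plug in actual components.
def fake_deidentify(note: str) -> str:
    return note.replace("John Smith", "[first_name] [last_name]")


def fake_reason(masked: str) -> str:
    return f"LLM saw: {masked}"


result = reason_over_deidentified(
    "Patient John Smith started imatinib.", fake_deidentify, fake_reason
)
print(result)  # LLM saw: Patient [first_name] [last_name] started imatinib.
```

Keeping the raw note out of the reasoning function's arguments makes it harder to leak identifiers downstream by accident.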
Conclusion
This experiment on a single synthetic note surfaced a clear pattern:
- OpenMed excels at structured, typed, local extraction and de-identification — the kind of output you want before data enters analytics, training, or downstream AI pipelines.
- Generic LLMs can approximate the task with a prompt, but introduce hallucination risk, inconsistent schemas, and uncalibrated confidence — fine for exploration, risky for compliance.
- Regex remains useful for well-formatted identifiers but cannot replace medical NER or nuanced PII detection on its own.
OpenMed is not a replacement for frontier models in clinical reasoning. It is complementary infrastructure: extract and de-identify first, reason second.
Reproduce this experiment
Requirements: Python 3.12+, openmed[hf], Ollama with llama3.2:1b
pip install "openmed[hf]"
ollama pull llama3.2:1b
python openmed_comparison/compare.py

Pass a custom note:

python openmed_comparison/compare.py your_note.txt

References
- OpenMed — project homepage
- OpenMed on GitHub
- OpenMed on PyPI
- OpenMed paper (arXiv:2508.01630)
- Ollama — local LLM runtime

