OpenMed vs. Generic LLM: A Side-by-Side Test on Clinical Text Extraction

When teams start working with clinical text, the default instinct is often to send a note to a large language model and ask it to “extract entities” or “remove PHI.” That works — until you need auditability, consistent labels, offline execution, or HIPAA-aligned de-identification.

We ran a controlled experiment comparing three approaches on the same synthetic clinical note:

  1. OpenMed — purpose-built, local biomedical NER + PII models
  2. Regex baseline — a common DIY pattern-matching approach
  3. Generic LLM — a single prompt to a local chat model (Ollama / Llama 3.2 1B)

The goal was not to declare a winner on every field, but to show what each approach is actually good at — and where they diverge in ways that matter for production clinical workflows.


Experiment setup

| Parameter | Value |
| --- | --- |
| Environment | macOS, Python 3.12 (Miniconda) |
| OpenMed version | 1.6.0 (pip install "openmed[hf]") |
| LLM runtime | Ollama (llama3.2:1b), temperature 0, JSON output mode |
| Execution | 100% local: no cloud API calls, no PHI sent off-device |
| Script | Custom comparison script (openmed_comparison/compare.py) |
| First run | ~70s (model downloads from Hugging Face) |
| Cached run | ~7–13s |

All three methods received identical input text.


Test note

Synthetic clinical note used for the experiment (not real patient data):

Text
Patient John Smith (MRN: 12345678, DOB: 03/15/1965, SSN: 123-45-6789)
was diagnosed with chronic myeloid leukemia and started imatinib 400mg daily.
Contact: john.smith@email.com, phone (555) 123-4567.
NPI: 1234567890. Address: 742 Evergreen Terrace, Springfield, IL 62704.

This note was designed to mix:

  • Medical content — disease name, drug name, dosage
  • HIPAA-style identifiers — name, MRN, DOB, SSN, email, phone, NPI, address
  • Realistic formatting — labels like MRN:, DOB:, NPI: near values

Methods compared

Method 1 — OpenMed (specialized local models)

OpenMed does not use one general chat model. It runs small encoder models (token classification) per task, then optionally de-identifies the text.

Models used:

| Task | Registry key | Hugging Face model | Params | Confidence threshold |
| --- | --- | --- | --- | --- |
| Disease NER | disease_detection_superclinical | OpenMed/OpenMed-NER-DiseaseDetect-SuperClinical-184M | ~184M | 0.50 |
| Drug/chemical NER | chemical_detection_electramed_33m | OpenMed/OpenMed-NER-ChemicalDetect-ElectraMed-33M | ~33M | 0.45 |
| PII extraction | default English PII model | OpenMed/OpenMed-PII-SuperClinical-Small-44M-v1 | ~44M | 0.50 (default) |
| De-identification | same PII stack | | | 0.70 (default, safer for redaction) |

Python calls:

Python
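from openmed import analyze_text, extract_pii, deidentify

# The synthetic test note shown above
text = (
    "Patient John Smith (MRN: 12345678, DOB: 03/15/1965, SSN: 123-45-6789)\n"
    "was diagnosed with chronic myeloid leukemia and started imatinib 400mg daily.\n"
    "Contact: john.smith@email.com, phone (555) 123-4567.\n"
    "NPI: 1234567890. Address: 742 Evergreen Terrace, Springfield, IL 62704."
)

# Medical entity extraction (one specialized model per entity family)
disease = analyze_text(text, model_name="disease_detection_superclinical", confidence_threshold=0.5)
chemical = analyze_text(text, model_name="chemical_detection_electramed_33m", confidence_threshold=0.45)

# PII detection with the default English PII model
pii = extract_pii(text, lang="en")

# De-identification (mask mode): detected PII is replaced with typed placeholders
deid = deidentify(text, lang="en")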

How it works: Each model tags token spans with typed labels (DISEASE, CHEM, first_name, ssn, etc.) and returns character offsets and calibrated confidence scores. The de-identification step then replaces detected PII with typed placeholders such as [ssn].
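To make the span-to-placeholder step concrete, here is a minimal sketch of offset-based masking. It is not OpenMed's implementation: the `Span` type, the `mask_spans` helper, and the sample string are hypothetical names introduced only for illustration.

```python
from typing import NamedTuple


class Span(NamedTuple):
    """Hypothetical entity span; not an OpenMed type."""
    label: str  # entity type, e.g. "ssn"
    start: int  # character offset where the entity begins
    end: int    # character offset one past the entity's last character


def mask_spans(text: str, spans: list[Span]) -> str:
    """Replace each span with a typed placeholder such as [ssn]."""
    # Work right to left so earlier offsets stay valid after each replacement.
    for span in sorted(spans, key=lambda s: s.start, reverse=True):
        text = text[:span.start] + f"[{span.label}]" + text[span.end:]
    return text


sample = "SSN: 123-45-6789, email john.smith@email.com"
spans = [
    Span("ssn", 5, 16),     # "123-45-6789"
    Span("email", 24, 44),  # "john.smith@email.com"
]
masked = mask_spans(sample, spans)
print(masked)
```

Replacing from the end of the string backwards is the key design choice: a placeholder rarely has the same length as the text it replaces, so a left-to-right pass would shift every later offset.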


Method 2 — Regex baseline (common DIY approach)

A hand-written set of regular expressions — the kind of thing teams often ship before adopting a dedicated NLP stack.

Patterns used:

| Label | Pattern |
| --- | --- |
| Email | `[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}` |
| SSN | `\b\d{3}-\d{2}-\d{4}\b` |
| Phone | `\(\d{3}\)\s*\d{3}-\d{4}` |
| Date | `\b\d{2}/\d{2}/\d{4}\b` |
| MRN | `MRN:\s*\d+` |
| NPI | `NPI:\s*\d{10}` |

No medical NER. No confidence scores. No de-identification pipeline.
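The baseline can be sketched in a few lines with Python's standard `re` module. The patterns are the ones listed above; the `regex_extract` helper and the output dictionary shape are assumptions made for this sketch, not the original comparison script.

```python
import re

# Patterns copied from the table above.
PATTERNS = {
    "email": r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}",
    "ssn": r"\b\d{3}-\d{2}-\d{4}\b",
    "phone": r"\(\d{3}\)\s*\d{3}-\d{4}",
    "date": r"\b\d{2}/\d{2}/\d{4}\b",
    "mrn": r"MRN:\s*\d+",
    "npi": r"NPI:\s*\d{10}",
}

NOTE = (
    "Patient John Smith (MRN: 12345678, DOB: 03/15/1965, SSN: 123-45-6789)\n"
    "was diagnosed with chronic myeloid leukemia and started imatinib 400mg daily.\n"
    "Contact: john.smith@email.com, phone (555) 123-4567.\n"
    "NPI: 1234567890. Address: 742 Evergreen Terrace, Springfield, IL 62704."
)


def regex_extract(text: str) -> list[dict]:
    """Run every pattern over the text and collect labeled matches with offsets."""
    found = []
    for label, pattern in PATTERNS.items():
        for match in re.finditer(pattern, text):
            found.append({
                "label": label,
                "text": match.group(),
                "start": match.start(),
                "end": match.end(),
            })
    return found


hits = regex_extract(NOTE)
for hit in hits:
    print(f'{hit["label"]:<6} {hit["text"]}')
```

Note that the MRN and NPI patterns capture the label prefix along with the digits, which matters later when outputs are compared by exact text.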


Method 3 — Generic LLM (Ollama / Llama 3.2 1B)

A single zero-shot prompt asking the model to return structured JSON — mimicking how many teams first approach clinical extraction with ChatGPT-style tools.

Model: llama3.2:1b via Ollama local API
Temperature: 0
Output format: JSON (Ollama format: "json")

Prompt:

Text
Extract all medical entities and personally identifiable information (PII) from this clinical note.
Return ONLY valid JSON with this shape:
{
  "medical_entities": [{"type": "...", "text": "...", "confidence": 0.0-1.0}],
  "pii_entities": [{"type": "...", "text": "...", "confidence": 0.0-1.0}]
}

Clinical note:
---
{note text here}
---

No fine-tuning. No task-specific models. One prompt, one shot.
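For readers who want to reproduce this step, the request could look roughly like the sketch below. It assumes Ollama is listening on its default local endpoint (`http://localhost:11434/api/generate`); the helper names `build_request` and `query_ollama` are hypothetical and are not taken from the original script.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # assumed default local endpoint

PROMPT_HEADER = """Extract all medical entities and personally identifiable information (PII) from this clinical note.
Return ONLY valid JSON with this shape:
{
  "medical_entities": [{"type": "...", "text": "...", "confidence": 0.0-1.0}],
  "pii_entities": [{"type": "...", "text": "...", "confidence": 0.0-1.0}]
}

Clinical note:
---
"""


def build_request(note: str, model: str = "llama3.2:1b") -> dict:
    """Assemble the JSON body for a single non-streaming generation call."""
    return {
        "model": model,
        "prompt": PROMPT_HEADER + note + "\n---",
        "format": "json",                # ask Ollama to constrain output to JSON
        "stream": False,                 # return one complete response object
        "options": {"temperature": 0},
    }


def query_ollama(note: str) -> dict:
    """Send the prompt to the local Ollama server and parse the model's JSON reply."""
    body = json.dumps(build_request(note)).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=120) as response:
        reply = json.load(response)
    # The generated text is in the "response" field and is itself a JSON string.
    return json.loads(reply["response"])


payload = build_request("Patient John Smith (MRN: 12345678) ...")
print(json.dumps(payload["options"]))
```

Calling `query_ollama(note)` requires a running Ollama instance with the model already pulled; `build_request` alone has no side effects, so the payload can be inspected offline.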


Results

Summary at a glance

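| Method | Entities found | Medical | PII | De-ID output | Runtime | Local |
| --- | --- | --- | --- | --- | --- | --- |
| OpenMed | 13 | 2 | 11 | ✅ Yes | 6.93s | Yes |
| Regex | 6 | 0 | 6 | ❌ No | <0.01s | Yes |
| LLM (Llama 3.2 1B) | 7 | 3 | 4 | ❌ No | 2.52s | Yes |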

OpenMed — full output

Medical entities (2):

| Label | Text | Confidence | Span |
| --- | --- | --- | --- |
| DISEASE | chronic myeloid leukemia | 0.961 | 89–113 |
| CHEM | imatinib | 0.942 | 126–134 |

PII entities (11):

| Label | Text | Confidence |
| --- | --- | --- |
| first_name | John | 0.999 |
| last_name | Smith | 0.998 |
| medical_record_number | MRN: 12345678 | 0.825 |
| date | 03/15/1965 | 0.820 |
| ssn | 123-45-6789 | 0.939 |
| email | john.smith@email.com | 0.999 |
| npi | 1234567890 | 0.240 |
| street_address | 742 Evergreen Terrace | 0.999 |
| city | Springfield | 0.997 |
| state | IL | 0.998 |
| postcode | 62704 | 0.759 |

De-identified text:

Text
Patient [first_name] [last_name] ([medical_record_number], DOB: [date], SSN: [ssn])
was diagnosed with chronic myeloid leukemia and started imatinib 400mg daily.
Contact: [email], phone (555) 123-4567. NPI: [npi].
Address: [street_address], [city], [state] [postcode].

Notable gap: OpenMed missed the phone number (555) 123-4567 on this run.


Regex baseline — full output

PII only (6):

| Label | Text |
| --- | --- |
| email | john.smith@email.com |
| ssn | 123-45-6789 |
| phone | (555) 123-4567 |
| date | 03/15/1965 |
| mrn | MRN: 12345678 |
| npi | NPI: 1234567890 |

Notable gaps: No names. No address components. No medical entities at all.


Generic LLM — full output

Medical entities (3):

| Type (LLM-assigned) | Text returned | Confidence (self-reported) |
| --- | --- | --- |
| Patient | John Smith | 0.800 |
| Chronic Myeloid Leukemia | CML | 0.900 |
| Imatinib | Imatinib 400mg daily | 0.700 |

PII entities (4):

| Type (LLM-assigned) | Text returned | Confidence (self-reported) |
| --- | --- | --- |
| Patient | John Smith | 0.800 |
| SSN | 123-45-6789 | 0.500 |
| DOB | 03/15/1965 | 0.600 |
| Phone | (555) 123-4567 | 0.400 |

Notable gaps: Missed email, MRN, NPI, and full address. Duplicated “John Smith” across medical and PII categories with inconsistent typing.

Notable error: The model returned "CML" as extracted text even though “CML” does not appear anywhere in the source note — it inferred an abbreviation from “chronic myeloid leukemia.” That is a form of hallucination unacceptable in regulated extraction pipelines.
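A cheap guard against this failure mode is to require every returned entity to appear verbatim in the source. The sketch below is one possible check, not part of the original experiment; `find_ungrounded` is a hypothetical helper, and it deliberately ignores letter case so that "Imatinib 400mg daily" still counts as grounded.

```python
NOTE = (
    "Patient John Smith (MRN: 12345678, DOB: 03/15/1965, SSN: 123-45-6789)\n"
    "was diagnosed with chronic myeloid leukemia and started imatinib 400mg daily.\n"
    "Contact: john.smith@email.com, phone (555) 123-4567.\n"
    "NPI: 1234567890. Address: 742 Evergreen Terrace, Springfield, IL 62704."
)

# Entity texts as returned by the LLM in the tables above.
llm_entities = [
    {"type": "Patient", "text": "John Smith"},
    {"type": "Chronic Myeloid Leukemia", "text": "CML"},
    {"type": "Imatinib", "text": "Imatinib 400mg daily"},
    {"type": "SSN", "text": "123-45-6789"},
    {"type": "DOB", "text": "03/15/1965"},
    {"type": "Phone", "text": "(555) 123-4567"},
]


def find_ungrounded(source: str, entities: list[dict]) -> list[dict]:
    """Return entities whose text does not occur in the source (case-insensitive)."""
    haystack = source.casefold()
    return [e for e in entities if e["text"].casefold() not in haystack]


ungrounded = find_ungrounded(NOTE, llm_entities)
for entity in ungrounded:
    print(f'not in source: {entity["text"]!r} (type {entity["type"]!r})')
```

A substring check only proves the text exists somewhere in the note. It does not recover which occurrence the model meant, which is one reason span offsets from a token classifier are easier to audit.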


Coverage comparison

OpenMed vs. Regex

| Category | Shared (both caught) | Only OpenMed | Only Regex |
| --- | --- | --- | --- |
| Count | 4 | 9 | 2 |

Shared: date, SSN, email, MRN
Only OpenMed: both medical entities, patient names, address (street, city, state, zip), bare NPI digits
Only Regex: phone number, NPI with label prefix (NPI: 1234567890)


OpenMed vs. LLM

| Category | Shared (both caught) | Only OpenMed | Only LLM |
| --- | --- | --- | --- |
| Count | 2 | 11 | 5 |

Shared: date, SSN
Only OpenMed: disease name (exact span), imatinib (exact span), names, email, MRN, address fields, NPI digits
Only LLM: phone, patient name as undifferentiated blob, imatinib with dosage appended, hallucinated “CML”
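The counts above are consistent with comparing outputs by exact entity text. The original script's matching rule is not shown, so the sketch below is an assumption: the `compare` helper is hypothetical, and it keeps duplicates in the "only" lists, which is why the LLM's repeated "John Smith" is counted twice.

```python
openmed = [
    "chronic myeloid leukemia", "imatinib", "John", "Smith", "MRN: 12345678",
    "03/15/1965", "123-45-6789", "john.smith@email.com", "1234567890",
    "742 Evergreen Terrace", "Springfield", "IL", "62704",
]
regex = [
    "john.smith@email.com", "123-45-6789", "(555) 123-4567",
    "03/15/1965", "MRN: 12345678", "NPI: 1234567890",
]
llm = [
    "John Smith", "CML", "Imatinib 400mg daily",
    "John Smith", "123-45-6789", "03/15/1965", "(555) 123-4567",
]


def compare(a: list[str], b: list[str]) -> tuple[list[str], list[str], list[str]]:
    """Split two entity lists into shared texts, texts only in a, and texts only in b."""
    a_set, b_set = set(a), set(b)
    shared = sorted(a_set & b_set)
    only_a = [text for text in a if text not in b_set]  # duplicates preserved
    only_b = [text for text in b if text not in a_set]
    return shared, only_a, only_b


for name, other in (("Regex", regex), ("LLM", llm)):
    shared, only_openmed, only_other = compare(openmed, other)
    print(f"OpenMed vs. {name}: shared={len(shared)}, "
          f"only OpenMed={len(only_openmed)}, only {name}={len(only_other)}")
```

Exact matching is strict: "1234567890" and "NPI: 1234567890" land in different buckets even though both methods found the same identifier. An offset-overlap comparison would be more forgiving, at the cost of needing spans from every method.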


Interpretation

1. OpenMed is a toolkit, not a chatbot

OpenMed uses small, task-specific encoder models (33M–184M parameters) trained for biomedical NER and PII — not a generative LLM. That architectural choice shows up in the output:

  • Typed, stable labels (DISEASE, CHEM, first_name, ssn)
  • Character-level span offsets for audit trails
  • Calibrated confidence scores from the model, not self-reported guesses
  • A built-in de-identification API with multiple redaction strategies

For regulated workflows, this structure matters as much as raw accuracy.

2. LLMs are flexible but unreliable for extraction

The LLM caught some things OpenMed missed (phone number) and returned plausible-looking JSON quickly. But it also:

  • Hallucinated entity text ("CML" never appeared in the note)
  • Used inconsistent labels (Patient as both medical and PII)
  • Missed several identifiers (email, MRN, NPI, address)
  • Produced confidence scores with no calibration — the model assigns numbers that look authoritative but are not grounded in classification probability

For a demo, that may be fine. For production de-identification or coding pipelines, it is a liability.

3. Regex is fast but structurally limited

Regex found the obviously formatted patterns in under 10 ms and correctly caught the phone number OpenMed missed. But it cannot extract diseases, drugs, or unstructured names and addresses without a separate NER stack, which is essentially what OpenMed provides out of the box.
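Because the two approaches miss different things, one option is to use regex as a fallback layer behind the model. The sketch below illustrates the idea on a single line of the note; it was not part of the experiment, the model span is a hard-coded stand-in rather than real OpenMed output, and `merge_with_fallback` is a hypothetical helper.

```python
import re

Span = tuple[str, int, int]  # (label, start, end), end exclusive


def overlaps(a: Span, b: Span) -> bool:
    """True when two character ranges share at least one character."""
    return a[1] < b[2] and b[1] < a[2]


def merge_with_fallback(model_spans: list[Span], regex_spans: list[Span]) -> list[Span]:
    """Keep every model span; add a regex span only where the model found nothing."""
    merged = list(model_spans)
    for candidate in regex_spans:
        if not any(overlaps(candidate, kept) for kept in model_spans):
            merged.append(candidate)
    return sorted(merged, key=lambda span: span[1])


line = "Contact: john.smith@email.com, phone (555) 123-4567."

# Stand-in for model output: the email was detected, the phone number was not.
email = "john.smith@email.com"
model_spans = [("email", line.index(email), line.index(email) + len(email))]

fallback_patterns = {
    "email": r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}",
    "phone": r"\(\d{3}\)\s*\d{3}-\d{4}",
}
regex_spans = [
    (label, m.start(), m.end())
    for label, pattern in fallback_patterns.items()
    for m in re.finditer(pattern, line)
]

merged = merge_with_fallback(model_spans, regex_spans)
for label, start, end in merged:
    print(label, repr(line[start:end]))
```

Giving the model priority on overlap keeps its typed labels and confidence scores where they exist, while regex only fills the gaps.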

4. Privacy posture is a first-class difference

All three methods ran locally in this experiment. In a typical cloud-LLM setup, the same note would be sent to an external API. OpenMed’s default posture — download models once, run inference on your hardware, no runtime telemetry — is aligned with PHI handling requirements in ways a prompt-to-ChatGPT workflow is not.

5. No single method wins every field

| Strength | Best approach in this test |
| --- | --- |
| Medical NER (disease + drug) | OpenMed |
| Broad PII with typed labels + de-ID | OpenMed |
| Phone number detection | Regex / LLM |
| Speed on trivial patterns | Regex |
| Flexibility / zero-setup | LLM |
| Auditability + span fidelity | OpenMed |

The practical takeaway: use the right tool per layer — specialized encoders for extraction and de-ID, LLMs for reasoning over already-de-identified text if needed.


Conclusion

This experiment on a single synthetic note surfaced a clear pattern:

  • OpenMed excels at structured, typed, local extraction and de-identification — the kind of output you want before data enters analytics, training, or downstream AI pipelines.
  • Generic LLMs can approximate the task with a prompt, but introduce hallucination risk, inconsistent schemas, and uncalibrated confidence — fine for exploration, risky for compliance.
  • Regex remains useful for well-formatted identifiers but cannot replace medical NER or nuanced PII detection on its own.

OpenMed is not a replacement for frontier models in clinical reasoning. It is complementary infrastructure: extract and de-identify first, reason second.


Reproduce this experiment

Requirements: Python 3.12+, openmed[hf], Ollama with llama3.2:1b

Bash
pip install "openmed[hf]"
ollama pull llama3.2:1b

python openmed_comparison/compare.py

Pass a custom note:

Bash
python openmed_comparison/compare.py your_note.txt

Copyright © 2026 PYCAD. All Rights Reserved.