Model Card: Core Schema Parsing LLM (Microbiology)

Model Overview

This model is a domain-adapted sequence-to-sequence language model designed to parse free-text microbiology phenotype descriptions into a structured core schema of laboratory test results and traits.

The model is intended to augment deterministic rule-based and extended parsers by recovering fields that may be missed due to complex phrasing, implicit descriptions, or uncommon linguistic constructions. It is not designed to operate as a standalone classifier or diagnostic system.

Base Model

Base architecture: google/flan-t5-base

Model type: Encoder–decoder (Seq2Seq), instruction-tuned

The FLAN-T5 base model was selected due to its strong instruction-following behaviour, stability during fine-tuning, and suitability for structured text generation tasks on limited hardware.

Training Data

The model was fine-tuned on 8,700 curated microbiology phenotype examples, each consisting of:

A free-text phenotype description

A deterministic target serialization of core schema fields and values

Data preprocessing:

The name field and all non-core schema fields were explicitly removed to prevent label leakage.

Target outputs were serialized deterministically using sorted schema keys (Field: Value format).

Inputs and targets were constrained to schema-relevant content only.

The dataset was split 80/20 into training and validation subsets.

Training Procedure

  • Epochs: 3

  • Optimizer: AdamW (default Hugging Face Trainer)

  • Learning rate: 1e-5

Batching:

  • Per-device batch size: 1

  • Gradient accumulation: 8 (effective batch size = 8)

Sequence lengths:

  • Max input length: 2048 tokens

  • Max output length: 2048 tokens

Precision:

  • bf16 on supported hardware (A100), otherwise fp16

Stability measures:

  • Gradient checkpointing enabled

  • Gradient clipping (max_grad_norm = 1.0)

  • Warmup ratio of 0.03

The model was trained using the Hugging Face Trainer API and saved after completion of all epochs.
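The hyperparameters listed above can be collected into a Hugging Face training configuration roughly as follows. This is a minimal sketch, not the exact training script: the output directory is an assumption, and the bf16/fp16 switch simply checks for hardware support as the card describes.

```python
import torch
from transformers import Seq2SeqTrainingArguments

# bf16 on supported hardware (e.g. A100), otherwise fp16 on older GPUs.
use_bf16 = torch.cuda.is_available() and torch.cuda.is_bf16_supported()
use_fp16 = torch.cuda.is_available() and not use_bf16

args = Seq2SeqTrainingArguments(
    output_dir="flan-t5-core-schema",   # assumed path, not from the card
    num_train_epochs=3,
    learning_rate=1e-5,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,      # effective batch size = 1 * 8 = 8
    bf16=use_bf16,
    fp16=use_fp16,
    gradient_checkpointing=True,        # stability on limited hardware
    max_grad_norm=1.0,                  # gradient clipping
    warmup_ratio=0.03,
)
```

Gradient accumulation trades wall-clock time for memory: with a per-device batch of 1, eight accumulated steps reproduce the gradient of a batch of 8 without exceeding GPU memory.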

Intended Use

This model is intended for:

Structured parsing of microbiology phenotype text into predefined schema fields

Use as a third-stage parser alongside rule-based and extended parsers

Supporting downstream deterministic scoring, ranking, and retrieval systems

Not intended for:

Standalone clinical diagnosis

Autonomous decision-making

Use without additional validation layers

Integration Context

In production, the model is used as a fallback and recovery mechanism within a hybrid parsing pipeline:

  • Rule-based parser (high precision)

  • Extended parser (schema-aware)

  • LLM parser (coverage and robustness)

Outputs are reconciled and validated downstream before being used for identification or explanation.
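The reconciliation order above can be sketched as a precedence merge. The parser callables and field names here are hypothetical stand-ins; each stage is assumed to return a dict of the schema fields it could extract.

```python
def parse_phenotype(text: str, rule_parser, extended_parser, llm_parser) -> dict:
    """Merge parser outputs, giving precedence to higher-precision stages.

    Stages are applied from lowest to highest precision, so fields from
    the rule-based parser overwrite LLM and extended-parser values,
    while the LLM still contributes fields the other stages missed.
    """
    result: dict = {}
    for parser in (llm_parser, extended_parser, rule_parser):
        result.update(parser(text))
    return result
```

This ordering reflects the pipeline's design: the LLM parser widens coverage, but it never overrides a field the deterministic parsers recovered.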

Limitations

Performance depends on coverage of the training schema; the model cannot generalize to fields outside it.

The model may hallucinate field values if used outside its intended constrained pipeline.

It is sensitive to inputs whose style or terminology deviates substantially from the training data.

Ethical and Safety Considerations

The model does not provide medical advice or diagnoses.

Outputs should always be reviewed in conjunction with deterministic logic and domain expertise.

Training data was curated to minimize leakage and unintended inference.

Author

Developed and fine-tuned by Zain Asad as part of the BactAI-D project.
