---
license: cc-by-nc-4.0
library_name: clmbr
tags:
- healthcare
- femr
- medical
extra_gated_prompt: "You agree to all terms outlined in 'The EHRSHOT Credentialed Health Data License' (see https://shahlab.stanford.edu/ehrshot_license). Access requires a verified CITI training certificate using the same process outlined by PhysioNet (see https://physionet.org/about/citi-course/). Please complete the 'Data or Specimens Only Research' course and please provide proof via the verification URL, which takes the form https://www.citiprogram.org/verify/?XXXXXX. You agree to not use the model to conduct experiments that cause harm to human subjects."
extra_gated_fields:
  Full Name: text
  Email: text
  Affiliation: text
  CITI Certification Verification URL: text
  I agree to all terms outlined in 'The EHRSHOT Credentialed Health Data License': checkbox
  I agree to use this model for non-commercial use ONLY: checkbox
---

# CLMBR-T-Base

This is a 141 million parameter autoregressive foundation model pretrained on 2.57 million deidentified EHRs from Stanford Medicine. It is the model from [(Wornow et al. 2023)](https://arxiv.org/abs/2307.02028), and is based on the CLMBR architecture originally described in [(Steinberg et al. 2021)](https://www.sciencedirect.com/science/article/pii/S1532046420302653).

As input, this model expects a sequence of coded medical events that have been mapped to Standard Concepts within the [OMOP-CDM vocabulary](https://ohdsi.github.io/CommonDataModel/index.html). The model generates representations of patients, which can then be used for downstream prediction tasks. Input patients should be provided in the [MEDS](https://github.com/Medical-Event-Data-Standard/) schema.

## Model Details

### Model Description

- **Developed by:** Shah lab @ Stanford University
- **Funded by:** Stanford Healthcare
- **Shared by:** Shah lab @ Stanford University
- **Model type:** CLMBR [(Steinberg et al. 2021)](https://www.sciencedirect.com/science/article/pii/S1532046420302653)
- **Language(s) (NLP):** Electronic health record codes
- **License:** CC-BY-NC 4.0
- **Finetuned from model:** N/A -- trained from scratch

### Model Sources

- **Website:** [https://ehrshot.stanford.edu/](https://ehrshot.stanford.edu/)
- **GitHub:** [https://github.com/som-shahlab/ehrshot-benchmark/](https://github.com/som-shahlab/ehrshot-benchmark/)
- **Paper:** [EHRSHOT: An EHR Benchmark for Few-Shot Evaluation of Foundation Models](https://arxiv.org/abs/2307.02028)

## Uses

This model is intended to generate representations of patients based on the structured data within their electronic health record. These representations can then be used for downstream tasks such as predicting diagnoses, detecting anomalies, or performing propensity score matching for causal inference.

### Direct Use

You will likely want to tune the model for your downstream use case.

### Out-of-Scope Use

This model is for research purposes only. It is not intended for use in any real-world decision making that impacts patients, providers, or hospital operations.

## Bias, Risks, and Limitations

This model was trained on a corpus of 2.57 million patients from Stanford Medicine. It will thus reflect the patterns of how care is delivered at Stanford Medicine, in addition to the racial and socioeconomic makeup of Stanford Medicine's patient base. The model may not generalize well to other hospitals and demographic mixes.

While this is technically a generative model, we have not tested its generative abilities and thus do not anticipate it being used to generate synthetic EHR records. We aim to explore its generative abilities in future work.

## How to Get Started with the Model

Use the code below to get started with the model. First, download the necessary libraries.
```bash
pip install torch==2.1.1 femr==0.2.3 datasets==2.15.0 xformers transformers==4.35.2
```

Second, run the following Python script to run inference on a single patient:

```python
import datetime

import torch

import femr.models.processor
import femr.models.tokenizer
import femr.models.transformer

model_name = "StanfordShahLab/clmbr-t-base"

# Load tokenizer / batch loader
tokenizer = femr.models.tokenizer.FEMRTokenizer.from_pretrained(model_name)
batch_processor = femr.models.processor.FEMRBatchProcessor(tokenizer)

# Load model
model = femr.models.transformer.FEMRModel.from_pretrained(model_name)

# Create an example patient to run inference on.
# This patient follows the MEDS schema: https://github.com/Medical-Event-Data-Standard
example_patient = {
    'patient_id': 30,
    'events': [{
        'time': datetime.datetime(2011, 5, 8),
        'measurements': [
            {'code': 'SNOMED/184099003'},
            {'code': 'Visit/IP'},
        ],
    },
    {
        'time': datetime.datetime(2012, 6, 9),
        'measurements': [
            {'code': 'Visit/OP'},
            {'code': 'SNOMED/3950001'},
        ],
    }]
}

# Tokenize the patient and collate into a batch of size 1
raw_batch = batch_processor.convert_patient(example_patient, tensor_type="pt")
batch = batch_processor.collate([raw_batch])

# Run model
with torch.no_grad():
    _, result = model(**batch)

print(result['timestamps'].cpu().numpy().astype('datetime64[s]'))
print(result['patient_ids'])
print(result['representations'])
```

## Training Details

Full training details are provided in our accompanying paper, [EHRSHOT (Wornow et al. 2023)](https://arxiv.org/abs/2307.02028).

### Training Data

The model is trained on 2.57 million patients from the [Stanford Medicine Research Data Repository (STARR)](https://academic.oup.com/jamiaopen/article/6/3/ooad054/7236015), which contains EHR data from both Stanford Health Care (primarily adult care) and Lucile Packard Children’s Hospital (primarily pediatric care). The dataset contains only structured data (i.e. no clinical text or images) and covers demographics (e.g. age, sex, race), diagnoses, procedures, laboratory results, medication prescriptions, and other coded clinical observations. The data is formatted according to the [Observational Medical Outcomes Partnership Common Data Model (OMOP-CDM)](https://ohdsi.github.io/CommonDataModel/cdm53.html). All data that we work with is deidentified.

### Training Procedure

We train our model using an autoregressive next code prediction objective, i.e. predicting the next code in a patient's timeline given their previous codes.

#### Preprocessing

We use the [FEMR](https://github.com/som-shahlab/femr/tree/main) Python library for data preprocessing.

#### Training Hyperparameters

* Learning rate: 0.00001
* Context window size: 496
* Internal dropout: 0
* Layers: 12
* Hidden dimension: 768

## Evaluation

We evaluate this model on [the EHRSHOT benchmark](https://ehrshot.stanford.edu). Information on the benchmark, its tasks, and results is detailed in [Wornow et al. 2023](https://arxiv.org/pdf/2307.02028.pdf).

## Technical Specifications

This model uses the CLMBR architecture from [(Steinberg et al. 2021)](https://www.sciencedirect.com/science/article/pii/S1532046420302653). The objective is an autoregressive next token prediction task. Please see [Wornow et al. 2023](https://arxiv.org/pdf/2307.02028.pdf) for more details on the specific model architecture.

## Vocabulary

CLMBR is a language model and requires defining a token vocabulary `V`. However, unlike natural languages, the vocabulary of a structured EHR language model is defined by *medical codes*: tokens map to standardized concepts in medical ontologies. Since the union of all tokens from all ontologies, `V_all`, results in a prohibitively large vocabulary, we derive `~V` by filtering to the top `k` most frequent codes as follows:

1. **Knowledge Graphs (G):** A set of `n` medical ontologies (knowledge graphs), `G = ({G_1, G_2, ..., G_n})`, defined by [Athena's OMOP Vocabulary List](https://athena.ohdsi.org/vocabulary/list).
2. **Medical Codes as Tokens:** Each knowledge graph `G_i` has a set of unique medical codes `M_i`. The union of all these codes serves as the tokens in our complete vocabulary `V_all = M_1 ∪ M_2 ∪ ... ∪ M_n`. Our final, filtered vocabulary is then `~V = sort_freq(V_all)[1:k]`, where frequency is calculated over our [STARR EHR OMOP](https://academic.oup.com/jamiaopen/article/6/3/ooad054/7236015) dataset.

**CLMBR Vocabulary Summary**

- 21 Source Ontologies/Knowledge Graphs
- 65,536 tokens (the number of values representable by a `uint16_t`)

| PREFIX | SOURCE | SIZE | EXAMPLE TOKENS |
|:---------------------|:-------------------------------------------------------------------------------------------------|---------:|:---------------------------------------------------|
| LOINC | Logical Observation Identifiers Names and Codes (Regenstrief Institute) | 37,590 | 31790-9, 20449-5 |
| SNOMED | Systematic Nomenclature of Medicine - Clinical Terms (IHTSDO) | 18,174 | 105013009, 200755008 |
| RxNorm | RxNorm (NLM) | 4,678 | 2375327, 372375 |
| CPT4 | Current Procedural Terminology version 4 (AMA) | 3,730 | 00790, 36818 |
| RxNorm Extension | OMOP RxNorm Extension | 255 | OMOP358911, OMOP2153393 |
| ICD10PCS | ICD-10 Procedure Coding System (CMS) | 233 | 10907ZC, 4A0234Z |
| ICD9Proc | International Classification of Diseases, Ninth Revision, Clinical Modification, Volume 3 (NCHS) | 196 | 68.29, 03.93 |
| Cancer Modifier | Diagnostic Modifiers of Cancer (OMOP) | 88 | c-8th\_AJCC/UICC-Stage-2C, p-7th\_AJCC/UICC-Stage-3B |
| HCPCS | Healthcare Common Procedure Coding System (CMS) | 54 | C1878, P7001 |
| ICDO3 | International Classification of Diseases for Oncology, Third Edition (WHO) | 52 | NULL-C34.8, C56.9 |
| CVX | CDC Vaccine Administered CVX (NCIRD) | 41 | 151, 158 |
| Domain | OMOP | 27 | OMOP generated |
| Race | Race and Ethnicity Code Set (USBC) | 5 | 5, 4 |
| OMOP Extension | OMOP Extension (OHDSI) | 3 | OMOP5160861, OMOP4912978 |
| Gender | OMOP Gender | 2 | F, M |
| Ethnicity | OMOP Ethnicity | 2 | Not Hispanic, Hispanic |
| CMS Place of Service | Place of Service Codes for Professional Claims (CMS) | 2 | OMOP4822036, 02 |
| Medicare Specialty | Medicare provider/supplier specialty codes (CMS) | 1 | A0 |
| Condition Type | OMOP | 1 | OMOP4822053 |
| CARE_SITE | STANFORD_CUSTOM | 396 | 7930934, 7929373 |
| Visit | STANFORD_CUSTOM | 6 | ERIP, ER |

## Citation

**BibTeX:**

```
@inproceedings{wornow2023ehrshot,
    title={EHRSHOT: An EHR Benchmark for Few-Shot Evaluation of Foundation Models},
    author={Michael Wornow and Rahul Thapa and Ethan Steinberg and Jason Fries and Nigam Shah},
    booktitle={Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track},
    year={2023}
}
```

## Model Card Authors

Michael Wornow, Ethan Steinberg, Rahul Thapa, Jason Fries, Nigam H. Shah

## Model Card Contact

Michael Wornow (mwornow@stanford.edu)
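
## Appendix: Using the Representations Downstream

The intended workflow is to extract frozen patient representations (the `result['representations']` tensor from the inference example) and fit a lightweight classifier on top of them. The sketch below is illustrative only, not an official FEMR or EHRSHOT API: the random 768-dimensional features and binary labels are placeholders standing in for real extracted representations and real task labels.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Placeholders: in practice these would be one 768-dimensional CLMBR
# representation per patient (taken at the task's prediction time) and
# the corresponding task labels.
representations = rng.normal(size=(100, 768))
labels = rng.integers(0, 2, size=100)

# Fit a simple logistic-regression probe on frozen representations;
# the CLMBR encoder itself is not updated.
clf = LogisticRegression(max_iter=1000)
clf.fit(representations[:80], labels[:80])

accuracy = clf.score(representations[80:], labels[80:])
print(f"held-out accuracy: {accuracy:.2f}")
```

This frozen-encoder plus linear-probe setup mirrors the few-shot evaluation protocol used in the EHRSHOT paper, where small labeled cohorts are the norm and retraining the full model is impractical.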