Miking98 committed
Commit bd3ae7f
1 Parent(s): b47c23d

Upload folder using huggingface_hub

README.md ADDED
@@ -0,0 +1,142 @@
+
+ ---
+ license: cc-by-nc-4.0
+ library_name: hyena-large-16384-clmbr
+ tags:
+ - healthcare
+ - medical
+ extra_gated_prompt: "You agree to all terms outlined in 'The EHRSHOT Credentialed Health Data License' (see https://shahlab.stanford.edu/ehrshot_license). Access requires a verified CITI training certificate using the same process outlined by PhysioNet (see https://physionet.org/about/citi-course/). Please complete the 'Data or Specimens Only Research' course and please provide proof via the verification URL, which takes the form https://www.citiprogram.org/verify/?XXXXXX. You agree to not use the model to conduct experiments that cause harm to human subjects."
+ extra_gated_fields:
+   Full Name: text
+   Email: text
+   Affiliation: text
+   CITI Certification Verification URL: text
+   I agree to all terms outlined in 'The EHRSHOT Credentialed Health Data License': checkbox
+   I agree to use this model for non-commercial use ONLY: checkbox
+ ---
+
+ # hyena-large-16384-clmbr
+
+ This is a **hyena** model with a context length of **16384** tokens and **125299200** parameters, from the [Context Clues paper](TODO).
+
+ It is a foundation model trained from scratch on the structured data within 2.57 million deidentified EHRs from Stanford Medicine.
+
+ As input, this model expects a sequence of coded medical events that have been mapped to Standard Concepts within the [OMOP-CDM vocabulary](https://ohdsi.github.io/CommonDataModel/index.html). As output, the model can generate either (a) synthetic future timelines or (b) a vector representation of a patient, which can then be used for downstream prediction tasks.
+
+ ## Usage
+
+ First, install the `hf_ehr` package:
+ ```bash
+ pip install transformers torch hf_ehr
+ ```
+
+ Second, run this Python script to run inference on a patient and extract a representation:
+
+ ```python
+ from transformers import AutoModelForCausalLM
+ from hf_ehr.data.tokenization import CLMBRTokenizer
+ from hf_ehr.config import Event
+ from typing import List, Dict
+ import torch
+
+ ####################################
+ # 1. Load model and tokenizer
+ # config.json maps the model classes to custom HyenaDNA code via `auto_map`, so `trust_remote_code=True` is required.
+ model = AutoModelForCausalLM.from_pretrained("StanfordShahLab/hyena-large-16384-clmbr", trust_remote_code=True)
+ # `convert_events_to_tokens` (step 3) is a method of the hf_ehr tokenizer, so load `CLMBRTokenizer` directly.
+ tokenizer = CLMBRTokenizer.from_pretrained("StanfordShahLab/hyena-large-16384-clmbr")
+
+ ####################################
+ # 2. Define patient as sequence of `Event` objects. Only `code` is required.
+ patient: List[Event] = [
+     Event(code='SNOMED/3950001', value=None, unit=None, start=None, end=None, omop_table=None),
+     Event(code='Gender/F', value=None, unit=None, start=None, end=None, omop_table=None),
+     Event(code='Ethnicity/Hispanic', value=None, unit=None, start=None, end=None, omop_table=None),
+     Event(code='SNOMED/609040007', value=None, unit=None, start=None, end=None, omop_table=None),
+     Event(code='LOINC/2236-8', value=-3.0, unit=None, start=None, end=None, omop_table=None),
+     Event(code='SNOMED/12199005', value=26.3, unit=None, start=None, end=None, omop_table=None),
+ ]
+
+ ####################################
+ # 3. Tokenize patient
+ batch: Dict[str, torch.Tensor] = tokenizer([ patient ], add_special_tokens=True, return_tensors='pt')
+ # > batch = {
+ #     'input_ids': tensor([[ 5, 0, 7, 9, 27, 2049, 6557, 22433, 1]]),
+ #     'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0]]),
+ #     'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1]])
+ # }
+ textual_tokens: List[str] = tokenizer.convert_events_to_tokens(patient)
+ # > textual_tokens = ['SNOMED/3950001', 'Gender/F', 'Ethnicity/Hispanic', 'SNOMED/609040007', 'LOINC/2236-8 || None || -1.7976931348623157e+308 - 4.0', 'SNOMED/12199005 || None || 26.0 - 28.899999618530273']
+
+ ####################################
+ # 4. Run model
+ logits = model(**batch).logits
+ # > logits.shape = torch.Size([1, 9, 39818])
+
+ ####################################
+ # 5. Get patient representation for finetuning (usually we choose the last token's logits)
+ representation = logits[:, -1, :]
+ # > representation.shape = torch.Size([1, 39818])
+ ```
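The `logits[:, -1, :]` step in the README works for a single patient. If several patients of different lengths were batched together, the last column could hold padding for the shorter ones. The sketch below shows one way to find the last real position per patient from an `attention_mask`. It is framework-free, it assumes right-side padding (which the README does not confirm), and `last_real_positions` is a hypothetical helper, not part of `hf_ehr`.

```python
from typing import List


def last_real_positions(attention_mask: List[List[int]]) -> List[int]:
    """Return, for each row, the index of the last position whose mask is 1.

    Assumes right-side padding: all 1s come before all 0s in every row.
    """
    positions = []
    for row in attention_mask:
        n_real = sum(row)  # number of non-padding tokens in this row
        if n_real == 0:
            raise ValueError("row contains no real tokens")
        positions.append(n_real - 1)
    return positions


# Two patients padded to length 5; the second has only 3 real tokens.
mask = [
    [1, 1, 1, 1, 1],
    [1, 1, 1, 0, 0],
]
idx = last_real_positions(mask)
print(idx)  # [4, 2]
```

With PyTorch tensors, these indices could then be applied as `logits[torch.arange(len(idx)), torch.tensor(idx)]` to pick one vector per patient.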
+
+ ## Model Details
+
+ - **Developed by:** Shah Lab @ Stanford University
+ - **Funded by:** Stanford Health Care
+ - **Shared by:** Shah Lab @ Stanford University
+ - **Model type:** hyena
+ - **Languages:** Electronic health record codes (as standardized by the [OMOP-CDM](https://ohdsi.github.io/CommonDataModel/index.html))
+ - **License:** CC BY-NC 4.0
+ - **Finetuned from model:** N/A -- trained from scratch
+
+ ## Uses
+
+ This model is intended to generate representations for patients based on the structured data within their electronic health record.
+ These representations can then be used for downstream tasks such as predicting diagnoses, detecting anomalies, or performing propensity score matching for causal inference.
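One common way to use frozen patient representations for a downstream prediction task is a linear probe: a logistic regression trained on the vectors while the foundation model stays fixed. The sketch below illustrates that idea and is not the authors' evaluation pipeline. The 2-dimensional toy vectors stand in for real representations, and `fit_linear_probe` and `predict_proba` are hypothetical names.

```python
import math
from typing import List, Tuple


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba(w: List[float], b: float, x: List[float]) -> float:
    """Probability of the positive label for one representation vector."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


def fit_linear_probe(
    X: List[List[float]], y: List[int], lr: float = 0.5, epochs: int = 200
) -> Tuple[List[float], float]:
    """Fit logistic regression with full-batch gradient descent."""
    dim, n = len(X[0]), len(X)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for xi, yi in zip(X, y):
            err = predict_proba(w, b, xi) - yi  # gradient of the log loss w.r.t. the logit
            for j in range(dim):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


# Toy "representations" (made up for illustration) with binary task labels.
X = [[2.0, 0.1], [1.5, -0.3], [-1.8, 0.2], [-2.2, -0.1]]
y = [1, 1, 0, 0]
w, b = fit_linear_probe(X, y)
preds = [int(predict_proba(w, b, x) >= 0.5) for x in X]
print(preds)  # [1, 1, 0, 0]
```

In practice a library implementation of logistic regression would replace the hand-written loop. The point is only that the foundation model's weights are not updated.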
+
+ ### Direct Use
+
+ You will likely want to fine-tune the model for your downstream use case.
+
+ ### Out-of-Scope Use
+
+ This model is for research purposes only. It is not for use in any real-world decision making that impacts patients, providers, or hospital operations.
+
+ ## Bias, Risks, and Limitations
+
+ This model was trained on a corpus of 2 billion tokens sourced from 2.57 million patients from Stanford Medicine.
+ The model will thus reflect the patterns of how care is delivered at Stanford Medicine, in addition to the racial and socioeconomic makeup of Stanford Medicine's patient base.
+ This model may not generalize well to other hospitals and demographic mixes.
+
+ While this is technically a generative model, we have not tested its generative abilities and thus do not anticipate it being used to generate synthetic EHR records.
+ We aim to explore its generative abilities in future work.
+
+ ## Training Details
+
+ Full training details are provided in our accompanying paper, [TODO]
+
+ ### Training Data
+
+ The model is trained on 2 billion tokens sourced from 2.57 million patients from the [Stanford Medicine Research Data Repository (STARR)](https://academic.oup.com/jamiaopen/article/6/3/ooad054/7236015),
+ which contains structured EHR data from both Stanford Health Care (primarily adult care) and Lucile Packard Children’s Hospital (primarily pediatric care).
+ The dataset contains only structured data (i.e., no clinical text or images) and covers demographics (e.g., age, sex, race), diagnoses, procedures, laboratory results, medication prescriptions, and other coded clinical observations.
+ The data is formatted according to the [Observational Medical Outcomes Partnership Common Data Model (OMOP-CDM)](https://ohdsi.github.io/CommonDataModel/cdm53.html).
+ All data that we work with is deidentified.
+
+ ### Training Procedure
+
+ We train our model using an autoregressive next-code prediction objective, i.e., the model predicts the next code in a patient's timeline given their previous codes.
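A next-code prediction objective is usually implemented as a cross-entropy loss in which the logits at position `t` are scored against the token at position `t + 1`. The sketch below spells out that shift in plain Python. It illustrates the general technique, not the authors' training code, and `next_code_loss` is a hypothetical name.

```python
import math
from typing import List


def log_softmax(row: List[float]) -> List[float]:
    """Log-probabilities from raw logits, using the max-shift trick for stability."""
    m = max(row)
    lse = m + math.log(sum(math.exp(v - m) for v in row))
    return [v - lse for v in row]


def next_code_loss(logits: List[List[float]], input_ids: List[int]) -> float:
    """Mean cross-entropy of predicting input_ids[t + 1] from logits[t]."""
    if len(logits) != len(input_ids):
        raise ValueError("need one logit row per input token")
    if len(input_ids) < 2:
        raise ValueError("need at least two tokens to form a prediction target")
    n_targets = len(input_ids) - 1  # the final position has no next token
    total = 0.0
    for t in range(n_targets):
        total -= log_softmax(logits[t])[input_ids[t + 1]]
    return total / n_targets


# Toy timeline over a 4-code vocabulary. Uniform logits give a loss of ln(4).
ids = [0, 2, 3, 1]
uniform_logits = [[0.0] * 4 for _ in ids]
loss = next_code_loss(uniform_logits, ids)
print(abs(loss - math.log(4)) < 1e-9)  # True
```

A model that puts all of its probability mass on the correct next code at every position would drive this loss toward zero.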
+
+ ## Citation
+
+ **BibTeX:**
+ ```
+ @article{TODO,
+   title={TODO},
+   author={TODO},
+   booktitle={TODO},
+   year={TODO}
+ }
+ ```
+
+ ## Model Card Authors
+
+ Michael Wornow, Suhana Bedi, Ethan Steinberg
config.json ADDED
@@ -0,0 +1,42 @@
+ {
+   "_name_or_path": "LongSafari/hyenadna-large-1m-seqlen-hf",
+   "activation_freq": 10,
+   "architectures": [
+     "HyenaDNAForCausalLM"
+   ],
+   "auto_map": {
+     "AutoConfig": "LongSafari/hyenadna-large-1m-seqlen-hf--configuration_hyena.HyenaConfig",
+     "AutoModel": "LongSafari/hyenadna-large-1m-seqlen-hf--modeling_hyena.HyenaDNAModel",
+     "AutoModelForCausalLM": "LongSafari/hyenadna-large-1m-seqlen-hf--modeling_hyena.HyenaDNAForCausalLM",
+     "AutoModelForSequenceClassification": "LongSafari/hyenadna-large-1m-seqlen-hf--modeling_hyena.HyenaDNAForSequenceClassification"
+   },
+   "bos_token_id": 0,
+   "cls_token_id": 5,
+   "d_inner": 1024,
+   "d_model": 768,
+   "emb_dim": 5,
+   "embed_dropout": 0.1,
+   "eos_token_id": 1,
+   "filter_order": 64,
+   "hyena_dropout": 0.0,
+   "hyena_filter_dropout": 0.0,
+   "hyena_order": 2,
+   "initializer_range": 0.02,
+   "layer_norm_epsilon": 1e-05,
+   "mask_token_id": 6,
+   "max_seq_len": 16384,
+   "model_type": "hyenadna",
+   "n_layer": 16,
+   "num_inner_mlps": 2,
+   "pad_token_id": 4,
+   "pad_vocab_size_multiple": 8,
+   "sep_token_id": 3,
+   "short_filter_order": 3,
+   "tie_word_embeddings": false,
+   "torch_dtype": "float32",
+   "train_freq": true,
+   "transformers_version": "4.44.2",
+   "unk_token_id": 2,
+   "use_bias": true,
+   "vocab_size": 39818
+ }
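The `vocab_size` of 39818 in this config matches the last dimension of the logits shown in the README (`torch.Size([1, 9, 39818])`). As a rough worked example, the sketch below reads a subset of the fields above and estimates the size of the token embedding table. It assumes the table has exactly `vocab_size` rows of width `d_model` and ignores any effect of `pad_vocab_size_multiple`; neither assumption has been checked against the model code.

```python
import json

# Subset of the config.json fields shown in the diff above.
config_text = """
{
  "d_model": 768,
  "max_seq_len": 16384,
  "n_layer": 16,
  "tie_word_embeddings": false,
  "torch_dtype": "float32",
  "vocab_size": 39818
}
"""
config = json.loads(config_text)

# Storage cost per value for a few common dtypes.
BYTES_PER_VALUE = {"float32": 4, "float16": 2, "bfloat16": 2}

# Assumption: the input embedding is a (vocab_size x d_model) table.
embedding_params = config["vocab_size"] * config["d_model"]
embedding_bytes = embedding_params * BYTES_PER_VALUE[config["torch_dtype"]]

print(embedding_params)  # 30580224
print(embedding_bytes)   # 122320896
```

Under these assumptions the embedding table alone would account for roughly a quarter of the 125299200 parameters stated in the README.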
generation_config.json ADDED
@@ -0,0 +1,7 @@
+ {
+   "_from_model_config": true,
+   "bos_token_id": 0,
+   "eos_token_id": 1,
+   "pad_token_id": 4,
+   "transformers_version": "4.44.2"
+ }
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:f1341db1fdf82e548b8ff869e7e7a5e283079c9baa52e84feb72ba6cc1e1530b
+ size 464132736
pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:b8d9b42fc04c81ee58781f6693066223a28054f332f3b2d25028c7fcf2d27ab7
+ size 507681242
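The two weight files are committed as Git LFS pointers: short text files of `key value` lines giving the spec version, a SHA-256 object ID, and the size in bytes. The sketch below parses the `pytorch_model.bin` pointer and compares its size with the float32 footprint implied by the README's parameter count. `parse_lfs_pointer` is a hypothetical helper, and reading the leftover bytes as serialization overhead is a guess, not something the commit states.

```python
from typing import Dict


def parse_lfs_pointer(text: str) -> Dict[str, str]:
    """Split each 'key value' line of a Git LFS pointer file into a dict entry."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    return fields


pointer = """version https://git-lfs.github.com/spec/v1
oid sha256:b8d9b42fc04c81ee58781f6693066223a28054f332f3b2d25028c7fcf2d27ab7
size 507681242
"""
fields = parse_lfs_pointer(pointer)
size_bytes = int(fields["size"])

# The README states 125299200 parameters; config.json states float32 (4 bytes each).
expected_weight_bytes = 125299200 * 4
print(expected_weight_bytes)               # 501196800
print(size_bytes - expected_weight_bytes)  # 6484442
```

The `model.safetensors` pointer reports a smaller size (464132736 bytes). This sketch does not attempt to explain that difference.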
tokenizer_config.json ADDED
The diff for this file is too large to render. See raw diff