SaborDay committed on
Commit
f10fbc2
1 Parent(s): a1a89f1

Update README.md

  1. README.md +64 -22
README.md CHANGED
@@ -10,8 +10,8 @@ widget:
10
  ---
11
  ![](ft_sections.png)
12
 
13
- This is a small language model designed for scientific research application. It is fine tuned to analyzing randomized clinical trial abstracts and to classify sentences into four key sections: Background, Methods, Results, and Conclusion.
14
- This makes it easier and faster for researchers to understand and organize important information from clinical studies.
15
 
16
  ## Model Details
17
 
@@ -19,7 +19,7 @@ This makes it easier and faster for researchers to understand and organize impor
19
  The publication rate of Randomized Controlled Trials (RCTs) is consistently increasing,
20
  with more than 1 million RCTs already published.
21
  Approximately half of these publications are listed in PubMed,
22
- posing a significant challenge for medical researchers seeking specific information.
23
 
24
  When searching for prior studies, such as for writing systematic reviews,
25
  researchers often skim through abstracts to quickly determine if the papers meet their criteria of interest.
@@ -27,8 +27,8 @@ This task is facilitated when abstracts are structured, meaning the text within
27
  like objective, method, result, and conclusion.
28
  However, more than half of the RCT abstracts published are unstructured, complicating the rapid identification of relevant information.
29
 
30
- This model classifies each sentence of an abstract into a corresponding heading can greatly accelerate the process of locating the desired information.
31
- This classification not only aids researchers but also benefits various downstream applications, including automatic text summarization, information extraction, and information retrieval.
32
 
33
 
34
  - **Developed by: Salvatore Saporito
@@ -56,26 +56,60 @@ Prompt Format:
56
 
57
  Usage:
58
 
59
- from peft import PeftModel, PeftConfig
60
 
61
- #Load the model weights from hub
 
 
62
  model_id = "SaborDay/Phi2_RCT1M-ft-heading"
63
- trained_model = PeftModel.from_pretrained(model, model_id)
64
-
65
- #Run inference
66
- outputs = trained_model.generate(**inputs, max_length=1000)
67
  text = tokenizer.batch_decode(outputs,skip_special_tokens=True)[0]
68
  print(text)
 
69
 
70
 
71
  Example:
72
  Application on unseen data
73
 
74
- PROMPT: '###Unstruct:\nKawasaki disease (KD) is a systemic vasculitis that causes abnormalities in the coronary arteries. Interleukin (IL)-41 is a novel immunoregulatory cytokine involved in the pathogenesis of some inflammatory and immune-related diseases. However, the role of IL-41 in KD is unclear. The purpose of this study was to detect the expression of IL-41 in the plasma of children with KD and its relationship with the disease.\nA total of 44 children with KD and 37 healthy controls (HC) were recruited for this study. Plasma concentrations of IL-41 were determined by ELISA. Correlations between plasma IL-41 levels and KD-related clinical parameters were analyzed by Pearson correlation and multivariate linear regression analysis. Receiver operating characteristic curve analysis was used to assess the clinical value of IL-41 in the diagnosis of KD.\nOur results showed that plasma IL-41 levels were significantly elevated in children with KD compared with HC. Correlation analysis demonstrated that IL-41 levels were positively correlated with D-dimer and N-terminal pro-B-type natriuretic peptide, and negatively correlated with IgM, mean corpuscular hemoglobin concentration, total protein, albumin and pre-albumin. Multivariable linear regression analysis revealed that IgM and mean corpuscular hemoglobin concentrations were associated with IL-41. Receiver operating characteristic curve analysis showed that the area under the curve of IL-41 was 0.7101, with IL-41 providing 88.64 % sensitivity and 54.05 % specificity.\nOur study indicated that plasma IL-41 levels in children with KD were significantly higher than those in HC, and may provide a potential diagnostic biomarker for KD.\n###Struct:\n
75
 
76
  Answer Phi2_RCT1M-ft-heading:
77
 
78
- BACKGROUND: Kawasaki disease (KD) is a systemic vasculitis that causes abnormalities in the coronary arteries. Interleukin (IL)-41 is a novel immunoregulatory cytokine involved in the pathogenesis of some inflammatory and immune-related diseases. However, the role of IL-41 in KD is unclear. The purpose of this study was to detect the expression of IL-41 in the plasma of children with KD and its relationship with the disease. METHODS: A total of 44 children with KD and 37 healthy controls (HC) were recruited for this study. Plasma concentrations of IL-41 were determined by ELISA. Correlations between plasma IL-41 levels and KD-related clinical parameters were analyzed by Pearson correlation and multivariate linear regression analysis. Receiver operating characteristic curve analysis was used to assess the clinical value of IL-41 in the diagnosis of KD. RESULTS: Our results showed that plasma IL-41 levels were significantly elevated in children with KD compared with HC. Correlation analysis demonstrated that IL-41 levels were positively correlated with D-dimer and N-terminal pro-B-type natriuretic peptide, and negatively correlated with IgM, mean corpuscular hemoglobin concentration, total protein, albumin and pre-albumin. Multivariable linear regression analysis revealed that IgM and mean corpuscular hemoglobin concentrations were associated with IL-41. Receiver operating characteristic curve analysis showed that the area under the curve of IL-41 was 0.7101, with IL-41 providing 88.64 % sensitivity and 54.05 % specificity. CONCLUSIONS: Our study indicated that plasma IL-41 levels in children with KD were significantly higher than those in HC, and may provide a potential diagnostic biomarker for KD.']
79
 
80
  ## Training Details
81
 
@@ -96,10 +130,16 @@ Generation of dedicated prompt for Causal_LM modelling.
96
  bnb_4bit_quant_type='nf4',
97
  bnb_4bit_compute_dtype=torch.bfloat16,
98
  bnb_4bit_use_double_quant=True)
99
-
100
  ## Evaluation
101
 
102
- <!-- This section describes the evaluation protocols and provides the results. -->
 
 
103
 
104
  ### Testing Data, Factors & Metrics
105
 
@@ -109,7 +149,7 @@ Generation of dedicated prompt for Causal_LM modelling.
109
 
110
  #### Metrics
111
 
112
- Coming soon
113
 
114
  ## Technical Specifications [optional]
115
 
@@ -118,12 +158,11 @@ Coming soon
118
  LoraConfig(
119
  r=16,
120
  lora_alpha=32,
121
- target_modules=[
122
- 'q_proj','k_proj','v_proj','dense','fc1','fc2'],
123
- bias="none",
124
- lora_dropout=0.05,
125
- task_type="CAUSAL_LM",
126
- )
127
 
128
  ### Compute Infrastructure
129
 
@@ -137,7 +176,10 @@ Coming soon
137
 
138
  ## Model Card Contact
139
 
 
140
 
141
  ## References
142
 
143
  https://arxiv.org/abs/1710.06071
 
 
 
10
  ---
11
  ![](ft_sections.png)
12
 
13
+ A small language model designed for scientific research applications. Phi-2 was fine-tuned to analyze randomized clinical trial abstracts and to classify their sentences into four key sections: Background, Methods, Results, and Conclusion.
14
+ This model helps researchers understand and organize key information from clinical studies.
15
 
16
  ## Model Details
17
 
 
19
  The publication rate of Randomized Controlled Trials (RCTs) is consistently increasing,
20
  with more than 1 million RCTs already published.
21
  Approximately half of these publications are listed in PubMed,
22
+ posing a significant data-volume challenge for medical researchers seeking specific information.
23
 
24
  When searching for prior studies, such as for writing systematic reviews,
25
  researchers often skim through abstracts to quickly determine if the papers meet their criteria of interest.
 
27
  like objective, method, result, and conclusion.
28
  However, more than half of the RCT abstracts published are unstructured, complicating the rapid identification of relevant information.
29
 
30
+ This model classifies each sentence of an abstract into a corresponding 'canonical' section, greatly accelerating the process of locating the desired information.
31
+ This classification not only aids researchers but may also benefit other downstream applications, including automatic text summarization, information extraction, and information retrieval.
32
 
33
 
34
  - **Developed by: Salvatore Saporito
 
56
 
57
  Usage:
58
 
59
+ import torch
60
+ from transformers import AutoModelForCausalLM, AutoTokenizer
61
+ from transformers import BitsAndBytesConfig
62
+ from peft import PeftModel
63
 
64
+ #Base model, tokenizer, and adapter identifiers
65
+ tokenizer_name = "microsoft/phi-2"
66
+ basemodel_name = "microsoft/phi-2"
67
  model_id = "SaborDay/Phi2_RCT1M-ft-heading"
68
+
69
+ #Load tokenizer and base model weights
70
+ tokenizer = AutoTokenizer.from_pretrained(tokenizer_name,trust_remote_code=True)
71
+
72
+ model = AutoModelForCausalLM.from_pretrained(basemodel_name, device_map='auto', trust_remote_code=True)
73
+
74
+ #Load adapter
75
+ fine_tuned_model = PeftModel.from_pretrained(model, model_id)
76
+
77
+ #Run inference
78
+ outputs = fine_tuned_model.generate(**inputs, max_length=1000)
79
+
80
  text = tokenizer.batch_decode(outputs,skip_special_tokens=True)[0]
81
  print(text)
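
The usage snippet calls `generate` on `inputs`, which it never defines. A minimal sketch of the missing tokenization step, assuming `model` and `tokenizer` have been loaded as in the snippet and that `prompt` follows the `###Unstruct:` / `###Struct:` format; the helper name `generate_structured` is hypothetical and not part of the README:

```python
def generate_structured(model, tokenizer, prompt, max_length=1000):
    """Tokenize `prompt`, run generation, and return the decoded text."""
    # Tokenize to PyTorch tensors and move each tensor to the model's device.
    encoded = tokenizer(prompt, return_tensors="pt")
    inputs = {name: tensor.to(model.device) for name, tensor in encoded.items()}
    # max_length counts prompt tokens plus generated tokens, as in the usage snippet.
    outputs = model.generate(**inputs, max_length=max_length)
    # Decoder-only models return the prompt followed by the completion.
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]
```

Because `max_length` includes the prompt, a very long abstract leaves fewer tokens for the structured answer; raising the limit is a judgment call that depends on the input.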
82
+
83
 
84
 
85
  Example:
86
  Application on unseen data
87
 
88
+ PROMPT: '###Unstruct:\nKawasaki disease (KD) is a systemic vasculitis that causes abnormalities in the coronary arteries.
89
+ Interleukin (IL)-41 is a novel immunoregulatory cytokine involved in the pathogenesis of some inflammatory and immune-related diseases.
90
+ However, the role of IL-41 in KD is unclear.
91
+ The purpose of this study was to detect the expression of IL-41 in the plasma of children with KD and its relationship with the disease.
92
+ A total of 44 children with KD and 37 healthy controls (HC) were recruited for this study. Plasma concentrations of IL-41 were determined by ELISA.
93
+ Correlations between plasma IL-41 levels and KD-related clinical parameters were analyzed by Pearson correlation and multivariate linear regression analysis.
94
+ Receiver operating characteristic curve analysis was used to assess the clinical value of IL-41 in the diagnosis of KD.
95
+ Our results showed that plasma IL-41 levels were significantly elevated in children with KD compared with HC.
96
+ Correlation analysis demonstrated that IL-41 levels were positively correlated with D-dimer and N-terminal pro-B-type natriuretic peptide, and negatively correlated with IgM, mean corpuscular hemoglobin concentration, total protein, albumin and pre-albumin. Multivariable linear regression analysis revealed that IgM and mean corpuscular hemoglobin concentrations were associated with IL-41. Receiver operating characteristic curve analysis showed that the area under the curve of IL-41 was 0.7101, with IL-41 providing 88.64 % sensitivity and 54.05 % specificity.
97
+ Our study indicated that plasma IL-41 levels in children with KD were significantly higher than those in HC, and may provide a potential diagnostic biomarker for KD.
98
+ ###Struct:'
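
The prompt above wraps the raw abstract between two markers. A minimal sketch of a helper that builds it, assuming the markers are exactly `###Unstruct:` and `###Struct:` as in the example; the function name `build_prompt` is hypothetical, and whether a trailing newline after `###Struct:` is needed is an assumption left to the caller (the earlier revision of the example ended with one):

```python
def build_prompt(abstract: str, trailing_newline: bool = False) -> str:
    """Wrap an unstructured abstract in the prompt markers used in the example."""
    # Strip surrounding whitespace but keep the abstract's internal line breaks.
    body = abstract.strip()
    prompt = f"###Unstruct:\n{body}\n###Struct:"
    # Optionally end with a newline, as the earlier revision of the example did.
    return prompt + "\n" if trailing_newline else prompt
```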
99
 
100
  Answer Phi2_RCT1M-ft-heading:
101
 
102
+ BACKGROUND: Kawasaki disease (KD) is a systemic vasculitis that causes abnormalities in the coronary arteries.
103
+ Interleukin (IL)-41 is a novel immunoregulatory cytokine involved in the pathogenesis of some inflammatory and immune-related diseases.
104
+ However, the role of IL-41 in KD is unclear.
105
+ The purpose of this study was to detect the expression of IL-41 in the plasma of children with KD and its relationship with the disease.
106
+ METHODS: A total of 44 children with KD and 37 healthy controls (HC) were recruited for this study.
107
+ Plasma concentrations of IL-41 were determined by ELISA.
108
+ Correlations between plasma IL-41 levels and KD-related clinical parameters were analyzed by Pearson correlation and multivariate linear regression analysis.
109
+ Receiver operating characteristic curve analysis was used to assess the clinical value of IL-41 in the diagnosis of KD.
110
+ RESULTS: Our results showed that plasma IL-41 levels were significantly elevated in children with KD compared with HC.
111
+ Correlation analysis demonstrated that IL-41 levels were positively correlated with D-dimer and N-terminal pro-B-type natriuretic peptide, and negatively correlated with IgM, mean corpuscular hemoglobin concentration, total protein, albumin and pre-albumin. Multivariable linear regression analysis revealed that IgM and mean corpuscular hemoglobin concentrations were associated with IL-41. Receiver operating characteristic curve analysis showed that the area under the curve of IL-41 was 0.7101, with IL-41 providing 88.64 % sensitivity and 54.05 % specificity.
112
+ CONCLUSIONS: Our study indicated that plasma IL-41 levels in children with KD were significantly higher than those in HC, and may provide a potential diagnostic biomarker for KD.
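
The answer is plain text with upper-case headings, so downstream tools may want it split into sections. A minimal sketch, assuming the model emits only the four headings seen in this example (`BACKGROUND`, `METHODS`, `RESULTS`, `CONCLUSIONS`), each followed by a colon; the function name `split_sections` is hypothetical:

```python
import re

# Headings taken from the example answer; other label sets are not handled.
HEADING_RE = re.compile(r"\b(BACKGROUND|METHODS|RESULTS|CONCLUSIONS):\s*")

def split_sections(answer: str) -> dict:
    """Map each heading in the generated answer to its text."""
    # Keep only what follows the ###Struct: marker, if the prompt was echoed back.
    _, _, tail = answer.rpartition("###Struct:")
    # With one capture group, re.split yields [prefix, heading, text, heading, text, ...].
    parts = HEADING_RE.split(tail)
    sections = {}
    for heading, text in zip(parts[1::2], parts[2::2]):
        # Normalize whitespace; a repeated heading overwrites the earlier entry.
        sections[heading] = " ".join(text.split())
    return sections
```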
113
 
114
  ## Training Details
115
 
 
130
  bnb_4bit_quant_type='nf4',
131
  bnb_4bit_compute_dtype=torch.bfloat16,
132
  bnb_4bit_use_double_quant=True)
133
+
134
+ #### Training Run metrics
135
+
136
+ https://wandb.ai/salvatore-saporito-phd/huggingface/runs/5fcnxthk?nw=nwusersalvatoresaporitophd
137
+
138
  ## Evaluation
139
 
140
+ The model was evaluated on a subset of previously considered abstracts (PubMed 20k RCT).
141
+ Each evaluation sample was verified, using its PMID, to be absent from the training set. Dataset source:
142
+ https://github.com/Franck-Dernoncourt/pubmed-rct/tree/master/PubMed_20k_RCT
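
The PMID check described above can be sketched as a set intersection. This assumes the PubMed RCT text layout, in which each abstract starts with a `###<PMID>` line followed by `LABEL<TAB>sentence` lines; the function names are hypothetical and this is not the author's actual verification script:

```python
def read_pmids(lines):
    """Collect abstract IDs from lines in the assumed PubMed RCT text layout."""
    # ID lines look like "###24293578"; sentence lines start with a label instead.
    return {line[3:].strip() for line in lines if line.startswith("###")}

def leaked_pmids(train_lines, eval_lines):
    """Return the PMIDs that appear in both the training and evaluation data."""
    return read_pmids(train_lines) & read_pmids(eval_lines)
```

An empty result from `leaked_pmids` indicates that no evaluation abstract shares a PMID with the training data.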
143
 
144
  ### Testing Data, Factors & Metrics
145
 
 
149
 
150
  #### Metrics
151
 
152
+ [WIP]
153
 
154
  ## Technical Specifications [optional]
155
 
 
158
  LoraConfig(
159
  r=16,
160
  lora_alpha=32,
161
+ target_modules=['q_proj','k_proj','v_proj','dense','fc1','fc2'],
162
+ bias="none",
163
+ lora_dropout=0.05,
164
+ task_type="CAUSAL_LM",
165
+ )
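
To give a sense of scale for this configuration: LoRA adds two low-rank matrices beside each targeted weight, so an adapted `d_in x d_out` layer gains `r * (d_in + d_out)` trainable parameters. A rough count follows, assuming Phi-2 dimensions of hidden size 2560, MLP size 10240, and 32 layers (these figures are assumptions, not stated in this README):

```python
def lora_param_count(d_in, d_out, r):
    """Trainable parameters LoRA adds to one linear layer: A is r x d_in, B is d_out x r."""
    return r * (d_in + d_out)

R, ALPHA = 16, 32                        # values from the LoraConfig above
HIDDEN, MLP, LAYERS = 2560, 10240, 32    # assumed Phi-2 dimensions

# (input size, output size) of each targeted module within one layer.
shapes = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, HIDDEN),
    "v_proj": (HIDDEN, HIDDEN),
    "dense": (HIDDEN, HIDDEN),
    "fc1": (HIDDEN, MLP),
    "fc2": (MLP, HIDDEN),
}

per_layer = sum(lora_param_count(d_in, d_out, R) for d_in, d_out in shapes.values())
total = per_layer * LAYERS
scaling = ALPHA / R  # factor applied to the low-rank update

print(f"per layer: {per_layer:,}  total: {total:,}  scaling: {scaling}")
```

With `bias="none"`, no bias terms are trained, so the count covers only the low-rank matrices.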
 
166
 
167
  ### Compute Infrastructure
168
 
 
176
 
177
  ## Model Card Contact
178
 
179
+ Salvatore Saporito - [email protected]
180
 
181
  ## References
182
 
183
  https://arxiv.org/abs/1710.06071
184
+ https://arxiv.org/abs/2106.09685
185
+ https://arxiv.org/pdf/2309.05463