---
license: apache-2.0
datasets:
- LinhDuong/chatdoctor-200k
language:
- en
pipeline_tag: text-generation
tags:
- medical
- doctor
- chat
- qa
- question-answering
thumbnail: https://huggingface.co/Narrativaai/BioGPT-Large-finetuned-chatdoctor/resolve/main/cdl.png

---


<div style="text-align:center;width:250px;height:250px;">
    <img src="https://huggingface.co/Narrativaai/BioGPT-Large-finetuned-chatdoctor/resolve/main/cdl.png" alt="chat doctor bioGPT logo">
</div>


# BioGPT (Large) 🧬 fine-tuned on ChatDoctor 🩺 for QA

[Microsoft's BioGPT Large](https://huggingface.co/microsoft/BioGPT-Large) fine-tuned on the ChatDoctor dataset for question answering.


## Intended Use

This is a research model only and must **NOT** be used outside this scope.


## Limitations

TBA

## Model

[Microsoft's BioGPT Large](https://huggingface.co/microsoft/BioGPT-Large):

Pre-trained language models have attracted increasing attention in the biomedical domain, inspired by their great success in the general natural language domain. Among the two main branches of pre-trained language models in the general language domain, i.e. BERT (and its variants) and GPT (and its variants), the first one has been extensively studied in the biomedical domain, such as BioBERT and PubMedBERT. While they have achieved great success on a variety of discriminative downstream biomedical tasks, the lack of generation ability constrains their application scope. In this paper, we propose BioGPT, a domain-specific generative Transformer language model pre-trained on large-scale biomedical literature. We evaluate BioGPT on six biomedical natural language processing tasks and demonstrate that our model outperforms previous models on most tasks. Especially, we get 44.98%, 38.42% and 40.76% F1 score on BC5CDR, KD-DTI and DDI end-to-end relation extraction tasks, respectively, and 78.2% accuracy on PubMedQA, creating a new record. Our case study on text generation further demonstrates the advantage of BioGPT on biomedical literature to generate fluent descriptions for biomedical terms.


## Dataset

The ChatDoctor-200K dataset is taken from this paper: https://arxiv.org/pdf/2303.14070.pdf

The dataset is composed of:

- 100k real conversations between patients and doctors from HealthCareMagic.com [HealthCareMagic-100k](https://drive.google.com/file/d/1lyfqIwlLSClhgrCutWuEe_IACNq6XNUt/view?usp=sharing).

- 10k real conversations between patients and doctors from icliniq.com [icliniq-10k](https://drive.google.com/file/d/1ZKbqgYqWc7DJHs3N9TQYQVPdDQmZaClA/view?usp=sharing).

- 5k generated conversations between patients and physicians from ChatGPT [GenMedGPT-5k](https://drive.google.com/file/d/1nDTKZ3wZbZWTkFMBkxlamrzbNz0frugg/view?usp=sharing) and [disease database](https://github.com/Kent0n-Li/ChatDoctor/blob/main/format_dataset.csv).


## Usage
```py
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig


model_id = "Narrativaai/BioGPT-Large-finetuned-chatdoctor"

tokenizer = AutoTokenizer.from_pretrained("microsoft/BioGPT-Large")

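# Use the GPU when one is available; the model and its inputs must live on the same device.
device = "cuda" if torch.cuda.is_available() else "cpu"

model = AutoModelForCausalLM.from_pretrained(model_id).to(device)
model.eval()

def answer_question(
        prompt,
        temperature=0.1,
        top_p=0.75,
        top_k=40,
        num_beams=2,
        **kwargs,
):
    # Tokenize the prompt and move the tensors to the model's device.
    inputs = tokenizer(prompt, return_tensors="pt")
    input_ids = inputs["input_ids"].to(device)
    attention_mask = inputs["attention_mask"].to(device)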
    generation_config = GenerationConfig(
        temperature=temperature,
        top_p=top_p,
        top_k=top_k,
        num_beams=num_beams,
        **kwargs,
    )
    with torch.no_grad():
        generation_output = model.generate(
            input_ids=input_ids,
            attention_mask=attention_mask,
            generation_config=generation_config,
            return_dict_in_generate=True,
            output_scores=True,
            max_new_tokens=512,
            eos_token_id=tokenizer.eos_token_id,
        )
    s = generation_output.sequences[0]
    output = tokenizer.decode(s, skip_special_tokens=True)
    return output.split(" Response:")[1]

example_prompt = """
Below is an instruction that describes a task, paired with an input that provides further context.Write a response that appropriately completes the request.

### Instruction:
If you are a doctor, please answer the medical questions based on the patient's description.

### Input:
Hi i have sore lumps under the skin on my legs. they started on my left ankle and are approx 1 - 2cm diameter and are spreading up onto my thies. I am eating panadol night and anti allergy pills (Atarax). I have had this for about two weeks now. Please advise.

### Response:
"""

print(answer_question(example_prompt))
```
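
The example prompt above follows a fixed instruction/input/response layout. As a minimal sketch, the two helpers below build that prompt from a patient message and pull the answer out of the decoded text. The names `build_prompt`, `extract_response`, `PROMPT_TEMPLATE` and `DEFAULT_INSTRUCTION` are hypothetical and not part of this model or of `transformers`; the template text is copied verbatim from the example prompt in this card.

```py
# Hypothetical helpers (not part of the model or of transformers).
# The template text is copied verbatim from the example prompt in this card.
PROMPT_TEMPLATE = (
    "\n"
    "Below is an instruction that describes a task, paired with an input "
    "that provides further context.Write a response that appropriately "
    "completes the request.\n"
    "\n"
    "### Instruction:\n"
    "{instruction}\n"
    "\n"
    "### Input:\n"
    "{patient_input}\n"
    "\n"
    "### Response:\n"
)

DEFAULT_INSTRUCTION = (
    "If you are a doctor, please answer the medical questions "
    "based on the patient's description."
)


def build_prompt(patient_input, instruction=DEFAULT_INSTRUCTION):
    """Fill the card's prompt layout with a patient message."""
    return PROMPT_TEMPLATE.format(
        instruction=instruction,
        patient_input=patient_input.strip(),
    )


def extract_response(decoded_text, marker="Response:"):
    """Return the text after the first marker, or the whole text if it is absent."""
    _, found, tail = decoded_text.partition(marker)
    return tail.strip() if found else decoded_text.strip()
```

With these in place, `answer_question(build_prompt("..."))` would take the same kind of input as the example above. `extract_response` falls back to the full decoded text instead of raising an `IndexError` when the marker is missing, which may be a safer choice than indexing the result of `split` directly.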

## Citation
```
@misc {narrativa_2023,
	author       = { {Narrativa} },
	title        = { BioGPT-Large-finetuned-chatdoctor (Revision 13764c0) },
	year         = 2023,
	url          = { https://huggingface.co/Narrativaai/BioGPT-Large-finetuned-chatdoctor },
	doi          = { 10.57967/hf/0601 },
	publisher    = { Hugging Face }
}
```