Quantization made by Richard Erkhov.

[GitHub](https://github.com/RichardErkhov)

[Discord](https://discord.gg/pvy7H8DZMG)

[Request more models](https://github.com/RichardErkhov/quant_request)

Tongda1-1.5B-BKI - AWQ
- Model creator: https://huggingface.co/Tongda/
- Original model: https://huggingface.co/Tongda/Tongda1-1.5B-BKI/
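Since this repository carries the AWQ-quantized weights, they should load through the standard `transformers` API once `autoawq` is installed. A minimal sketch; the repo id below is a placeholder assumption, so substitute the actual id of this repository:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder repo id -- replace with the actual id of this quantized repository.
repo_id = "RichardErkhov/Tongda1-1.5B-BKI-awq"

# AWQ checkpoints load through the regular API when `autoawq` is available;
# device_map="auto" places the weights on GPU, which AWQ kernels require.
model = AutoModelForCausalLM.from_pretrained(repo_id, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(repo_id)
```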

Original model description:
---
license: apache-2.0
datasets:
- Tongda/bid-announcement-zh-v1.0
base_model:
- Qwen/Qwen2-1.5B-Instruct
pipeline_tag: text-generation
tags:
- text-generation-inference
library_name: transformers
---

## **Model Overview**

This model is a fine-tuned version of Qwen2-1.5B-Instruct using Low-Rank Adaptation (LoRA). It is designed specifically for extracting key information from bidding and bid-winning announcements, identifying structured fields such as project names, announcement types, budget amounts, and deadlines across varied notice formats.

The base model, Qwen2-1.5B-Instruct, is a large language model optimized for instruction-following tasks, and this fine-tuned version applies those capabilities to precise data extraction from Chinese bid announcements.

---

## **Use Cases**

The model can be used in applications that require automatic extraction of structured data from text documents, particularly in government bidding and procurement workflows. For instance, given [this sample announcement](https://www.qhggzyjy.gov.cn/ggzy/jyxx/001002/001002001/20240827/1358880795267533.html), the model produces:

```
项目名称:"大通县公安局警用无人自动化机场项目"
公告类型:"采购公告-竞磋"
行业分类:"其他"
发布时间:"2024-08-27"
预算金额:"941500.00元"
采购人:"大通县公安局(本级)"
响应文件截至提交时间:"2024-09-10 09:00"
开标地址:"大通县政府采购服务中心"
所在地区:"青海省"
```

The fields are, in order: project name, announcement type, industry category, publication date, budget amount (in yuan), purchaser, response-submission deadline, bid-opening venue, and region.
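For downstream use, those colon-separated lines can be turned into a dictionary. A minimal sketch (the regex simply mirrors the `field:"value"` shape of the sample above; it is not part of the model):

```python
import re

sample = '''项目名称:"大通县公安局警用无人自动化机场项目"
发布时间:"2024-08-27"
预算金额:"941500.00元"'''

# Each line has the shape <field>:"<value>"; capture both sides.
record = dict(re.findall(r'(\S+):"([^"]*)"', sample))
print(record["预算金额"])  # -> 941500.00元
```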
---

## **Key Features**

1. **Fine-tuned with LoRA**: The model has been adapted using LoRA, a parameter-efficient fine-tuning method, allowing it to focus on this specific task while preserving the capabilities of the large base model.

2. **Robust Information Extraction**: The model is trained to extract and validate crucial fields, including budget values, submission deadlines, and industry classifications, producing accurate outputs even across variable announcement formats.

3. **Language & Domain Specificity**: The model excels at parsing official Chinese bidding announcements and accurately extracting the information required by downstream processes.

---

## **Model Architecture**

- **Base Model**: Qwen2-1.5B-Instruct
- **Fine-Tuning Technique**: LoRA (see the sketch below)
- **Training Data**: Structured and unstructured government bidding announcements
- **Framework**: Hugging Face Transformers & PEFT (Parameter-Efficient Fine-Tuning)

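The card does not publish the actual LoRA hyperparameters, but a fine-tuning setup of this kind typically looks like the following PEFT sketch; every value here (rank, alpha, target modules) is an illustrative assumption:

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-1.5B-Instruct")

# Hypothetical hyperparameters -- the actual training settings are not published.
config = LoraConfig(
    r=8,                                  # low-rank dimension
    lora_alpha=16,                        # scaling factor
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, config)
model.print_trainable_parameters()  # only the LoRA adapters are trainable
```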
## **Technical Specifications**

- **Device Compatibility**: CUDA (GPU-enabled)
- **Tokenization**: Uses `AutoTokenizer` from Hugging Face, optimized for instruction-following tasks.

## **Requirements**

```shell
pip install --upgrade 'transformers>=4.44.2' 'torch>=2.0' accelerate
```
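An optional sanity check that the installed versions and a CUDA device are in place:

```python
import torch
import transformers

print("transformers", transformers.__version__)
print("torch", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
```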
## **Usage Example**

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Fall back to CPU when no GPU is present.
device = "cuda" if torch.cuda.is_available() else "cpu"

model = AutoModelForCausalLM.from_pretrained("Tongda/Tongda1-1.5B-BKI", device_map="auto", torch_dtype=torch.float16)
tokenizer = AutoTokenizer.from_pretrained("Tongda/Tongda1-1.5B-BKI")

model.eval()

# System prompt (Chinese). It instructs the model to extract 项目名称 (project name),
# 公告类型 (announcement type), 行业分类 (industry category), 发布时间 (publication date),
# 预算金额 (budget amount, normalized to yuan, converting units such as 万元), 采购人 (purchaser),
# 响应文件截至提交时间 (response-submission deadline), 开标地址 (bid-opening venue), and
# 所在地区 (region), and to output them as JSON; 公告类型 and 行业分类 must each be chosen
# from a fixed list of 12 categories.
instruction = "分析给定的公告,提取其中的“项目名称”、“公告类型”、“行业分类”、“发布时间”、“预算金额”、“采购人”、“响应文件截至提交时间”、”开标地址“、“所在地区”,并将其以json格式进行输出。如果公告出现“最高投标限价”相关的值,则“预算金额”为该值。请再三确认提取的值为项目的“预算金额”,而不是其他和“预算金额”无关的数值,否则“预算金额”中填入'None'。如果确认提取到了“预算金额”,请重点确认提取到的金额的单位,所有的“预算金额”单位为“元”。当涉及到进制转换的计算(比如“万元”转换为“元”单位)时,必须进行进制转换。其中“公告类型”只能从以下12类中挑选:采购公告-招标、采购公告-邀标、采购公告-询价、采购公告-竞谈、采购公告-竞磋、采购公告-竞价、采购公告-单一来源、采购公告-变更、采购结果-中标、采购结果-终止、采购结果-废标、采购结果-合同。其中,“行业分类”只能从以下12类中挑选:建筑与基础设施、信息技术与通信、能源与环保、交通与物流、金融与保险、医疗与健康、教育与文化、农业与林业、制造与工业、政府与公共事业、旅游与娱乐、其他。"

# The content of any bid announcement.
input_report = "#### 通答产业园区(2024年-2027年)智能一体化项目公开招标公告..."

messages = [
    {"role": "system", "content": instruction},
    {"role": "user", "content": input_report}
]

text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(device)

# Passing the full encoding also supplies the attention mask.
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=512
)
# Strip the prompt tokens so only the newly generated text remains.
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]

print(response)
```
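The prompt asks for JSON plus amounts normalized to 元 (1 万元 = 10,000 元), so downstream code will usually validate the response and normalize units. A minimal sketch, assuming plain numeric strings without thousands separators:

```python
import json

def to_yuan(amount: str) -> float:
    """Normalize a budget string to yuan, e.g. "94.15万元" -> 941500.0."""
    if amount.endswith("万元"):
        return float(amount[:-2]) * 10_000
    return float(amount.rstrip("元"))

try:
    data = json.loads(response)   # the prompt requests JSON output
except json.JSONDecodeError:
    data = None                   # fall back if the model drifted from strict JSON

if data and data.get("预算金额") not in (None, "None"):
    print(to_yuan(data["预算金额"]))
```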

---

## **Evaluation & Performance**

Tongda1-1.5B-BKI performs strongly on this information extraction task. It improves substantially over its baseline, Qwen2-1.5B-Instruct, on every reported metric for extracting key information from tender announcements, and it also scores above the larger Qwen2.5-3B-Instruct and Qwen2-7B-Instruct models as well as the hosted glm-4-flash model. The evaluation results for each model:

| Model               | ROUGE-1 | ROUGE-2 | ROUGE-Lsum | BLEU  |
|---------------------|---------|---------|------------|-------|
| Tongda1-1.5B-BKI    | 0.853   | 0.787   | 0.853      | 0.852 |
| Qwen2-1.5B-Instruct | 0.412   | 0.231   | 0.411      | 0.431 |
| Qwen2.5-3B-Instruct | 0.686   | 0.578   | 0.687      | 0.755 |
| Qwen2-7B-Instruct   | 0.703   | 0.578   | 0.703      | 0.789 |
| glm-4-flash         | 0.774   | 0.655   | 0.775      | 0.816 |

![image/png](https://cdn-uploads.huggingface.co/production/uploads/65ebcbd0c8577b39464e6dc0/Qiyi7onDe99b2USArl0oG.png)
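For reference, scores of this kind can be computed with the Hugging Face `evaluate` package. A sketch assuming paired prediction/reference strings; character-level tokenization is used here because whitespace splitting fits Chinese poorly:

```python
import evaluate

predictions = ["预算金额:941500.00元"]   # model outputs (placeholders)
references  = ["预算金额:941500.00元"]   # gold annotations (placeholders)

# `tokenizer=list` splits strings into characters, a common choice for Chinese.
rouge = evaluate.load("rouge")
print(rouge.compute(predictions=predictions, references=references, tokenizer=list))

bleu = evaluate.load("bleu")
print(bleu.compute(predictions=predictions, references=[[r] for r in references], tokenizer=list))
```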

---

## **Limitations**

- **Language Limitation**: The model is trained primarily on Chinese bidding announcements; performance on other languages or on non-bidding content may be limited.
- **Strict Formatting**: Accuracy may drop when announcements deviate significantly from common structures.

---

## **Citation**

If you use this model, please consider citing it as follows:

```bibtex
@misc{Tongda1-1.5B-BKI,
  title={Tongda1-1.5B-BKI: LoRA Fine-tuned Model for Bidding Announcements},
  author={Ted-Z},
  year={2024}
}
```

## **Contact**

For further inquiries or fine-tuning services, please contact us at [Tongda](https://www.tongdaai.com/).