root committed on
Commit
5c5c629
·
1 Parent(s): 92bcd1d

abstractfunctionadded

README.md CHANGED
@@ -1,5 +1,5 @@
1
  ---
2
- title: Literature Review Assistant (a.k.a. Don't Want to Read the Literature)
3
  emoji: 📚
4
  colorFrom: blue
5
  colorTo: indigo
@@ -9,11 +9,13 @@ sdk_version: "4.25.0"
9
  app_file: app.py
10
  pinned: false
11
  ---
12
- # MedicalReviewAgent (Don't Want to Read the Literature)
 
13
  ## Project Overview
14
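- - An agent that writes literature reviews for me; the goal is for it to handle collecting literature content, text classification and summarization, comparison of scientific facts, and review writing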
15
- - Plans to use RAG, function calling, and related techniques
16
- - Still very much a work in progress; guidance from experts is welcome!
 
17
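  - [Hugging Face demo link](https://huggingface.co/spaces/Yijun-Yang/ReadReview/). ZeroGPU quota is stingy, so local inference is disabled in this Space; use an API rather than a local model, because selecting a local model raises an error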
18
 
19
  ## Flowcharts
@@ -97,7 +99,7 @@ huggingface-cli download maidalun1020/bce-reranker-base_v1 --local-dir /root/mod
97
  ```bash
98
  conda activate ReviewAgent
99
  cd MedicalReviewAgent
100
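- python3 app.py --model_downloaded True # Use this if the models are already downloaded to /root/models; the flag switches to a config file whose model paths are local paths rather than HF repo paths (for running on your own GPU)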
101
  python3 app.py # Use this if you do not plan to use models stored locally in /root/models; this is the build configuration used by the HF Space
102
  ```
103
  Gradio runs locally on port 7860
 
1
  ---
2
+ title: Literature Review Assistant
3
  emoji: 📚
4
  colorFrom: blue
5
  colorTo: indigo
 
9
  app_file: app.py
10
  pinned: false
11
  ---
12
+ [English](README_en.md) | 中文
13
+ # MedicalReviewAgent
14
  ## Project Overview
15
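+ - A medical literature review assistant built on RAG and an agent workflow. It lets users configure local or remote large language models, search PubMed by keyword or PMID to fetch literature, upload PDF files, and create and manage literature databases. Users can generate databases with different parameter settings to suit different needs.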
16
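+ - The clustering and annotation of text chunks is the novel part: a clustering algorithm groups the large number of text chunks, so the large model only needs to read a few representative chunks and annotate each cluster to produce an overall picture of the database contents.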
17
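+ - Finally, the review-writing feature produces a complete passage of review text, with relevant references, in response to the user's question.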
18
+ - Overall, this small tool aims to help researchers efficiently search, manage, read, and summarize the literature.
19
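  - [Hugging Face demo link](https://huggingface.co/spaces/Yijun-Yang/ReadReview/). ZeroGPU quota is stingy, so local inference is disabled in this Space; use an API rather than a local model, because selecting a local model raises an error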
20
 
21
  ## Flowcharts
 
99
  ```bash
100
  conda activate ReviewAgent
101
  cd MedicalReviewAgent
102
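+ python3 applocal.py --model_downloaded True # Use this if the models are already downloaded to /root/models; the flag switches to a config file whose model paths are local paths rather than HF repo paths (for running on your own GPU)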
103
  python3 app.py # Use this if you do not plan to use models stored locally in /root/models; this is the build configuration used by the HF Space
104
  ```
105
  Gradio runs locally on port 7860
README_en.md ADDED
@@ -0,0 +1,70 @@
1
+ English | [中文](README.md)
2
+ # MedicalReviewAgent
3
+
4
+ ## Project Overview
5
+
6
+ - MedicalReviewAgent is a medical literature review assistant built on RAG and an agent workflow. It lets users configure local or remote large language models, search PubMed by keyword or PMID, upload PDF files, and create and manage literature databases. Users can generate databases with different parameter settings to suit different needs.
7
+ - Text-block clustering and annotation is the tool's novel feature. A clustering algorithm groups the large number of text blocks, so the large model only needs to read a few representative blocks and annotate each cluster to produce an overall picture of the database content.
8
+ - The "write review" feature allows generating complete review text with references based on user queries.
9
+ - Overall, this tool is designed to help researchers efficiently retrieve, manage, read, and summarize literature.
10
+ - [Hugging Face demo](https://huggingface.co/spaces/Yijun-Yang/ReadReview/). Note: ZeroGPU quota is limited, so local inference is disabled in the Space; use a remote API, because selecting the local model raises an error.
11
+
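The "representative blocks" idea in the overview can be sketched as follows. This is only an illustration under stated assumptions: the diff does not show how the project embeds or clusters text, so the embeddings and cluster labels are taken as already computed, and `centroid` and `pick_representatives` are hypothetical helper names, not functions from the repository.

```python
import math


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def pick_representatives(embeddings, labels):
    """Map each cluster label to the index of the block nearest its centroid."""
    members = {}
    for idx, label in enumerate(labels):
        members.setdefault(label, []).append(idx)

    representatives = {}
    for label, idxs in members.items():
        center = centroid([embeddings[i] for i in idxs])
        # The block closest to the centroid stands in for the whole cluster.
        representatives[label] = min(
            idxs, key=lambda i: math.dist(embeddings[i], center)
        )
    return representatives


# Toy 2-D "embeddings": three blocks per cluster.
embeddings = [[0, 0], [0, 2], [0, 1], [10, 10], [10, 14], [10, 11]]
labels = [0, 0, 0, 1, 1, 1]
print(pick_representatives(embeddings, labels))  # {0: 2, 1: 5}
```

Only the selected blocks would then be sent to the model for annotation, which is what keeps the number of LLM calls small relative to the number of blocks.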
12
+ ## Workflow Diagrams
13
+
14
+ ### Literature and Knowledge Base Construction
15
+ ![Literature and Knowledge Base Construction Diagram](https://github.com/jabberwockyang/MedicalReviewAgent/assets/52541128/d70a2ec1-7a20-4b5b-a91c-bf649f657319)
16
+
17
+ ### Human-Computer Collaborative Writing
18
+ ![Human-Computer Collaborative Writing Diagram](https://github.com/jabberwockyang/MedicalReviewAgent/assets/52541128/fc394d8b-1668-4349-9adc-1c4c0a7e0a8b)
19
+
20
+ ## Features
21
+
22
+ 1. **Model Service Configuration**
23
+ - **Remote Model Selection**: Allows users to choose between remote or local large models from various providers like Kimi, Deepseek, Zhipuai, and GPT.
24
+
25
+ 2. **Literature Search + Database Creation**
26
+ - **Literature Search**: Users can enter keywords, set the search quantity, and conduct PubMed PMC literature searches.
27
+ - **Literature Database Management**: Supports deleting existing literature databases and provides real-time updates on the library's overview.
28
+ - **Database Creation**: Users can set block size and cluster numbers for text clustering.
29
+ - **Database Management**: Supports creating new databases, deleting existing ones, and viewing database overviews.
30
+
31
+ 3. **Writing Reviews**
32
+ - **Sampling Annotated Article Clusters**: Users can choose block size and cluster numbers, set the sampling annotation ratio, and start the annotation process.
33
+ - **Inspiration Generation**: Based on annotated article clusters, the large model provides inspiration to help generate the framework of questions needed for the review.
34
+ - **Review Generation**: Users can input the content or topic they wish to write about, click the generate review button, and the system will automatically generate review text with references.
35
+
36
+ ## Highlights
37
+
38
+ 1. **Efficient Literature Search and Management**: Finds related literature quickly by keyword and supports uploading existing PDFs, which makes the library easy to build and manage.
39
+ 2. **Flexible Database Generation**: Provides flexible parameters for database generation, supports multiple generations and updates to ensure timeliness and accuracy.
40
+ 3. **Intelligent Review Generation**: Utilizes advanced large model technology for automated article cluster annotation and inspiration generation, helping users quickly produce high-quality review text.
41
+ 4. **User-Friendly Interface**: Intuitive interface and detailed usage instructions make it easy for users to start and use all features.
42
+ 5. **Remote and Local Model Support**: Supports a variety of large model providers to meet different user needs. Whether using local or remote models, configurations can be flexibly adjusted.
43
+
44
+ ## Installation and Running
45
+
46
+ Create a new conda environment:
47
+ ```bash
48
+ conda create --name ReviewAgent python=3.10.14
49
+ conda activate ReviewAgent
50
+ ```
51
+ Clone the GitHub repository:
52
+ ```bash
53
+ git clone https://github.com/jabberwockyang/MedicalReviewAgent.git
54
+ cd MedicalReviewAgent
55
+ pip install -r requirements.txt
56
+ ```
57
+ Download models with huggingface-cli (optional, HF will download on first call, but there might be firewall issues):
58
+ ```bash
59
+ cd /root && mkdir models
60
+ cd /root/models
61
+ # login required
62
+ huggingface-cli download Qwen/Qwen1.5-7B-Chat --local-dir /root/models/Qwen1.5-7B-Chat
63
+ huggingface-cli download maidalun1020/bce-embedding-base_v1 --local-dir /root/models/bce-embedding-base_v1
64
+ huggingface-cli download maidalun1020/bce-reranker-base_v1 --local-dir /root/models/bce-reranker-base_v1
65
+ ```
66
+ Start the service:
67
+ ```bash
68
+ conda activate ReviewAgent
69
+ cd MedicalReviewAgent
70
+ python3 app.py --model_downloaded True # Use this if models
app.py CHANGED
@@ -82,7 +82,7 @@ def udate_model_dropdown(remote_company):
82
  'kimi': ['moonshot-v1-128k'],
83
  'deepseek': ['deepseek-chat'],
84
  'zhipuai': ['glm-4'],
85
- 'gpt': ['gpt-4-32k-0613','gpt-3.5-turbo']
86
  }
87
  return gr.Dropdown(choices= model_choices[remote_company])
88
 
@@ -107,7 +107,7 @@ def update_remote_config(remote_ornot,remote_company = None,api = None,baseurl =
107
  return gr.Button("配置已保存")
108
 
109
  # @spaces.GPU(duration=120)
110
- def get_ready(query:str,chunksize=None,k=None):
111
 
112
  with open(CONFIG_PATH, encoding='utf8') as f:
113
  config = pytoml.load(f)
@@ -124,6 +124,8 @@ def get_ready(query:str,chunksize=None,k=None):
124
  except:
125
  pass
126
 
 
 
127
  if query == 'annotation':
128
  if not chunksize or not k:
129
  raise ValueError('chunksize or k not provided')
@@ -182,9 +184,11 @@ def update_repo_info():
182
  pmc_success = repo_info['pmc_success_d']
183
  scihub_success = repo_info['scihub_success_d']
184
  failed_download = repo_info['failed_download']
 
 
185
 
186
  number_of_upload = number_of_pdf-scihub_success
187
- return keywords, retmax, search_len, import_len, failed_pmid_len, pmc_success, scihub_success, number_of_pdf, failed_download, number_of_upload
188
  else:
189
  return None,None,None,None,None,None,None,None,None,number_of_pdf
190
  else:
@@ -223,33 +227,39 @@ def delete_articles_repo():
223
  repodir, workdir, _ = get_ready('repo_work')
224
  if os.path.exists(repodir):
225
  shutil.rmtree(repodir)
 
226
  if os.path.exists(workdir):
227
  shutil.rmtree(workdir)
 
228
 
229
  return gr.Textbox(label="文献库概况",lines =3,
230
  value = '文献库和相关数据库已删除',
231
  visible = True)
232
 
233
  def update_repo():
234
- keys, retmax, search_len, import_len, _, pmc_success, scihub_success, pdflen, failed, pdflen = update_repo_info()
 
 
235
  newinfo = ""
236
  if keys == None:
237
  newinfo += '无关键词搜索相关信息\n'
238
  newinfo += '无导入的PMID\n'
239
- if pdflen:
240
- newinfo += f'上传的PDF数量: {pdflen}\n'
241
  else:
242
  newinfo += '无上传的PDF\n'
243
  else:
244
  newinfo += f'关键词搜索:'
245
- newinfo += f' 关键词: {keys}\n'
246
- newinfo += f' 搜索上限: {retmax}\n'
247
- newinfo += f' 搜索到的PMID数量: {search_len}\n'
248
  newinfo += f'导入的PMID数量: {import_len}\n'
249
- newinfo += f'成功获取PMC全文数量: {pmc_success}\n'
250
- newinfo += f'成功获取SciHub全文数量: {scihub_success}\n'
251
- newinfo += f"下载失败的ID: {failed}\n"
252
- newinfo += f'上传的PDF数量: {pdflen}\n'
 
 
253
 
254
  return gr.Textbox(label="文献库概况",lines =1,
255
  value = newinfo,
@@ -259,26 +269,35 @@ def update_database_info():
259
  with open(CONFIG_PATH, encoding='utf8') as f:
260
  config = pytoml.load(f)
261
  workdir = config['feature_store']['work_dir']
262
- chunkdirs = glob.glob(os.path.join(workdir, 'chunksize_*'))
263
- chunkdirs.sort()
264
- list_of_chunksize = [int(chunkdir.split('_')[-1]) for chunkdir in chunkdirs]
265
- # print(list_of_chunksize)
266
- jsonobj = {}
267
- for chunkdir in chunkdirs:
268
- k_dir = glob.glob(os.path.join(chunkdir, 'cluster_features','cluster_features_*'))
269
- k_dir.sort()
270
- list_of_k = [int(k.split('_')[-1]) for k in k_dir]
271
- jsonobj[int(chunkdir.split('_')[-1])] = list_of_k
 
 
 
 
 
 
272
 
273
-
274
- new_options = [f"chunksize:{chunksize}, k:{k}" for chunksize in list_of_chunksize for k in jsonobj[chunksize]]
 
275
 
276
- return new_options, jsonobj
277
 
278
  @spaces.GPU(duration=120)
279
  def generate_database(chunksize:int,nclusters:str|list[str]):
280
  # Run the database-generation function here
281
  repodir, workdir, _ = get_ready('repo_work')
 
 
282
  if not os.path.exists(repodir):
283
  return gr.Textbox(label="数据库已生成",value = '请先生成文献库',visible = True)
284
  nclusters = [int(i) for i in nclusters]
@@ -295,12 +314,17 @@ def generate_database(chunksize:int,nclusters:str|list[str]):
295
  chunk_size=chunksize,
296
  n_clusters=nclusters,
297
  config_path=CONFIG_PATH)
 
298
 
299
  # walk all files in repo dir
300
- file_opr = FileOperation()
301
  files = file_opr.scan_dir(repo_dir=repodir)
302
  fs_init.initialize(files=files, work_dir=workdir,file_opr=file_opr)
303
  file_opr.summarize(files)
 
 
 
 
 
304
  del fs_init
305
  cache.pop('default')
306
  texts, _ = update_database_info()
@@ -310,6 +334,7 @@ def delete_database():
310
  _, workdir, _ = get_ready('repo_work')
311
  if os.path.exists(workdir):
312
  shutil.rmtree(workdir)
 
313
  return gr.Textbox(label="数据库概况",lines =3,value = '数据库已删除',visible = True)
314
 
315
  def update_database_textbox():
@@ -319,17 +344,24 @@ def update_database_textbox():
319
  else:
320
  return gr.Textbox(label="数据库概况",value = '\n'.join(texts),visible = True)
321
 
322
- def update_chunksize_dropdown():
323
  _, jsonobj = update_database_info()
324
- return gr.Dropdown(choices= jsonobj.keys())
 
 
 
 
325
 
326
- def update_ncluster_dropdown(chunksize:int):
327
  _, jsonobj = update_database_info()
328
- nclusters = jsonobj[chunksize]
 
 
 
329
  return gr.Dropdown(choices= nclusters)
330
 
331
  # @spaces.GPU(duration=120)
332
- def annotation(n,chunksize:int,nclusters:int,remote_ornot:bool):
333
  '''
334
  use llm to annotate cluster
335
  n: percentage of clusters to annotate
@@ -340,7 +372,7 @@ def annotation(n,chunksize:int,nclusters:int,remote_ornot:bool):
340
  else:
341
  backend = 'local'
342
 
343
- clusterdir, samples, assistant, theme = get_ready('annotation',chunksize,nclusters)
344
  new_obj_list = []
345
  n = round(n * len(samples.keys()))
346
  for cluster_no in random.sample(samples.keys(), n):
@@ -369,14 +401,14 @@ def annotation(n,chunksize:int,nclusters:int,remote_ornot:bool):
369
  return '\n\n'.join([obj['annotation'] for obj in new_obj_list])
370
 
371
  # @spaces.GPU(duration=120)
372
- def inspiration(annotation:str,chunksize:int,nclusters:int,remote_ornot:bool):
373
  query = 'inspiration'
374
  if remote_ornot:
375
  backend = 'remote'
376
  else:
377
  backend = 'local'
378
 
379
- clusterdir, annoresult, assistant, theme = get_ready('inspiration',chunksize,nclusters)
380
  new_obj_list = []
381
 
382
  if annotation is not None: # if the user wants to get inspiration from specific clusters only
@@ -418,13 +450,13 @@ def getpmcurls(references):
418
  return urls
419
 
420
  @spaces.GPU(duration=120)
421
- def summarize_text(query,chunksize:int,remote_ornot:bool):
422
  if remote_ornot:
423
  backend = 'remote'
424
  else:
425
  backend = 'local'
426
 
427
- assistant,_ = get_ready('summarize',chunksize=chunksize,k=None)
428
  code, reply, references = assistant.generate(query=query,
429
  history=[],
430
  groupname='',backend = backend)
@@ -611,6 +643,7 @@ def main_interface():
611
  with gr.Accordion("聚类标注相关参数", open=True):
612
  with gr.Row():
613
  update_options = gr.Button("更新数据库情况", scale=0)
 
614
  chunksize = gr.Dropdown([], label="选择块大小", scale=0)
615
  nclusters = gr.Dropdown([], label="选择聚类数", scale=0)
616
  ntoread = gr.Slider(
@@ -637,22 +670,23 @@ def main_interface():
637
  output_references = gr.Markdown(label="参考文献")
638
 
639
  update_options.click(update_chunksize_dropdown,
 
640
  outputs=[chunksize])
641
 
642
  chunksize.change(update_ncluster_dropdown,
643
- inputs=[chunksize],
644
  outputs= [nclusters])
645
 
646
  annotation_button.click(annotation,
647
- inputs = [ntoread, chunksize, nclusters,remote_ornot],
648
  outputs=[annotation_output])
649
 
650
  inspiration_button.click(inspiration,
651
- inputs= [annotation_output, chunksize, nclusters,remote_ornot],
652
  outputs=[inspiration_output])
653
 
654
  write_button.click(summarize_text,
655
- inputs=[query, chunksize,remote_ornot],
656
  outputs =[output_text,output_references])
657
 
658
  demo.launch(share=False, server_name='0.0.0.0', debug=True,show_error=True,allowed_paths=['img_0.jpg'])
 
82
  'kimi': ['moonshot-v1-128k'],
83
  'deepseek': ['deepseek-chat'],
84
  'zhipuai': ['glm-4'],
85
+ 'gpt': ['gpt-4-32k-0613','gpt-3.5-turbo','gpt-4']
86
  }
87
  return gr.Dropdown(choices= model_choices[remote_company])
88
 
 
107
  return gr.Button("配置已保存")
108
 
109
  # @spaces.GPU(duration=120)
110
+ def get_ready(query:str,chunksize=None,k=None,use_abstract=False):
111
 
112
  with open(CONFIG_PATH, encoding='utf8') as f:
113
  config = pytoml.load(f)
 
124
  except:
125
  pass
126
 
127
+ if use_abstract:
128
+ workdir = workdir + '_ab'
129
  if query == 'annotation':
130
  if not chunksize or not k:
131
  raise ValueError('chunksize or k not provided')
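The `use_abstract` switch added to `get_ready` selects a sibling working directory by appending `_ab`. A minimal sketch of that selection is below; `resolve_workdir` and `ABSTRACT_SUFFIX` are hypothetical names, and the path normalization is an added assumption, not something the commit does. It guards against a configured path with a trailing slash, where plain concatenation would yield `work/_ab` (a subdirectory) instead of `work_ab` (a sibling).

```python
import os

ABSTRACT_SUFFIX = "_ab"  # suffix used in the diff; the constant name is hypothetical


def resolve_workdir(workdir: str, use_abstract: bool = False) -> str:
    """Return the full-text work dir, or its abstract-only sibling."""
    # normpath drops any trailing separator so the suffix joins the last component.
    base = os.path.normpath(workdir)
    return base + ABSTRACT_SUFFIX if use_abstract else base


print(resolve_workdir("data/work/", use_abstract=True))
```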
 
184
  pmc_success = repo_info['pmc_success_d']
185
  scihub_success = repo_info['scihub_success_d']
186
  failed_download = repo_info['failed_download']
187
+ abstract_success = repo_info['abstract_success']
188
+ failed_abstract = repo_info['failed_abstract']
189
 
190
  number_of_upload = number_of_pdf-scihub_success
191
+ return keywords, retmax, search_len, import_len, failed_pmid_len, pmc_success, scihub_success, failed_download, abstract_success, failed_abstract, number_of_upload
192
  else:
193
  return None,None,None,None,None,None,None,None,None,number_of_pdf
194
  else:
 
227
  repodir, workdir, _ = get_ready('repo_work')
228
  if os.path.exists(repodir):
229
  shutil.rmtree(repodir)
230
+ shutil.rmtree(repodir + '_ab', ignore_errors=True)  # the abstract repo may not exist yet
231
  if os.path.exists(workdir):
232
  shutil.rmtree(workdir)
233
+ shutil.rmtree(workdir + '_ab', ignore_errors=True)  # the abstract work dir may not exist yet
234
 
235
  return gr.Textbox(label="文献库概况",lines =3,
236
  value = '文献库和相关数据库已删除',
237
  visible = True)
238
 
239
  def update_repo():
240
+ # keywords, retmax, search_len, import_len, failed_pmid_len, pmc_success, scihub_success, failed_download, abstract_success, failed_abstract, number_of_upload
241
+ # None,None,None,None,None,None,None,None,None,number_of_pdf
242
+ keys, retmax, search_len, import_len, _, pmc_success, scihub_success, failed, abstract_success, failed_abstract, pdfuplo = update_repo_info()
243
  newinfo = ""
244
  if keys == None:
245
  newinfo += '无关键词搜索相关信息\n'
246
  newinfo += '无导入的PMID\n'
247
+ if pdfuplo:
248
+ newinfo += f'上传的PDF数量: {pdfuplo}\n'
249
  else:
250
  newinfo += '无上传的PDF\n'
251
  else:
252
  newinfo += f'关键词搜索:'
253
+ newinfo += f' 关键词: {keys}\n'
254
+ newinfo += f' 搜索上限: {retmax}\n'
255
+ newinfo += f' 搜索到的PMID数量: {search_len}\n'
256
  newinfo += f'导入的PMID数量: {import_len}\n'
257
+ newinfo += f' 成功获取PMC全文数量: {pmc_success}\n'
258
+ newinfo += f' 成功获取SciHub全文数量: {scihub_success}\n'
259
+ newinfo += f" 下载失败的ID: {failed}\n"
260
+ newinfo += f" 成功获取摘要的数量: {abstract_success}\n"
261
+ newinfo += f" 获取摘要失败的数量: {failed_abstract}\n"
262
+ newinfo += f'上传的PDF数量: {pdfuplo}\n'
263
 
264
  return gr.Textbox(label="文献库概况",lines =1,
265
  value = newinfo,
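One thing to watch in this hunk: the success branch of `update_repo_info` now returns eleven values and `update_repo` unpacks eleven, but the unchanged fallback branch still returns ten (nine `None` plus `number_of_pdf`), so that path would raise `ValueError` on unpacking. A hedged sketch of one way to make the arity impossible to get wrong is a named tuple with defaults. `RepoInfo` and `uploads_only` are hypothetical names, and the field types are guesses from the variable names in the diff.

```python
from typing import NamedTuple, Optional


class RepoInfo(NamedTuple):
    """Field order mirrors the eleven-value return in the diff; types are assumed."""
    keywords: Optional[str] = None
    retmax: Optional[int] = None
    search_len: Optional[int] = None
    import_len: Optional[int] = None
    failed_pmid_len: Optional[int] = None
    pmc_success: Optional[int] = None
    scihub_success: Optional[int] = None
    failed_download: Optional[list] = None
    abstract_success: Optional[int] = None
    failed_abstract: Optional[int] = None
    number_of_upload: Optional[int] = None


def uploads_only(number_of_pdf: int) -> RepoInfo:
    """Fallback when no search metadata exists: every field None except uploads."""
    return RepoInfo(number_of_upload=number_of_pdf)


info = uploads_only(3)
# Positional unpacking keeps working because the tuple always has eleven fields.
keys, *_, pdfuplo = info
print(keys, pdfuplo)  # None 3
```

Callers can also read `info.abstract_success` by name, which avoids having to keep a comment listing the positional order in sync with the return statements.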
 
269
  with open(CONFIG_PATH, encoding='utf8') as f:
270
  config = pytoml.load(f)
271
  workdir = config['feature_store']['work_dir']
272
+ abworkdir = workdir + '_ab'
273
+ options = []
274
+ total_json_obj = {}
275
+ for dir in [workdir,abworkdir]:
276
+ tag = 'FullText' if '_ab' not in dir else 'Abstract'
277
+
278
+ chunkdirs = glob.glob(os.path.join(dir, 'chunksize_*'))
279
+ chunkdirs.sort()
280
+ list_of_chunksize = [int(chunkdir.split('_')[-1]) for chunkdir in chunkdirs]
281
+ # print(list_of_chunksize)
282
+ jsonobj = {}
283
+ for chunkdir in chunkdirs:
284
+ k_dir = glob.glob(os.path.join(chunkdir, 'cluster_features','cluster_features_*'))
285
+ k_dir.sort()
286
+ list_of_k = [int(k.split('_')[-1]) for k in k_dir]
287
+ jsonobj[int(chunkdir.split('_')[-1])] = list_of_k
288
 
289
+ total_json_obj[tag] = jsonobj
290
+ newoptions = [f"{tag}, chunksize:{chunksize}, k:{k}" for chunksize in list_of_chunksize for k in jsonobj[chunksize]]
291
+ options.extend(newoptions)
292
 
293
+ return options, total_json_obj
294
 
295
  @spaces.GPU(duration=120)
296
  def generate_database(chunksize:int,nclusters:str|list[str]):
297
  # Run the database-generation function here
298
  repodir, workdir, _ = get_ready('repo_work')
299
+ abrepodir = repodir + '_ab'
300
+ abworkdir = workdir + '_ab'
301
  if not os.path.exists(repodir):
302
  return gr.Textbox(label="数据库已生成",value = '请先生成文献库',visible = True)
303
  nclusters = [int(i) for i in nclusters]
 
314
  chunk_size=chunksize,
315
  n_clusters=nclusters,
316
  config_path=CONFIG_PATH)
317
+ file_opr = FileOperation()
318
 
319
  # walk all files in repo dir
 
320
  files = file_opr.scan_dir(repo_dir=repodir)
321
  fs_init.initialize(files=files, work_dir=workdir,file_opr=file_opr)
322
  file_opr.summarize(files)
323
+
324
+ files = file_opr.scan_dir(repo_dir=abrepodir)
325
+ fs_init.initialize(files=files, work_dir=abworkdir,file_opr=file_opr)
326
+ file_opr.summarize(files)
327
+
328
  del fs_init
329
  cache.pop('default')
330
  texts, _ = update_database_info()
 
334
  _, workdir, _ = get_ready('repo_work')
335
  if os.path.exists(workdir):
336
  shutil.rmtree(workdir)
337
+ shutil.rmtree(workdir+'_ab', ignore_errors=True)  # the abstract work dir may not exist yet
338
  return gr.Textbox(label="数据库概况",lines =3,value = '数据库已删除',visible = True)
339
 
340
  def update_database_textbox():
 
344
  else:
345
  return gr.Textbox(label="数据库概况",value = '\n'.join(texts),visible = True)
346
 
347
+ def update_chunksize_dropdown(use_abstract):
348
  _, jsonobj = update_database_info()
349
+ if use_abstract:
350
+ choices = jsonobj['Abstract'].keys()
351
+ else:
352
+ choices = jsonobj['FullText'].keys()
353
+ return gr.Dropdown(choices= choices)
354
 
355
+ def update_ncluster_dropdown(chunksize:int,use_abstract:bool):
356
  _, jsonobj = update_database_info()
357
+ if use_abstract:
358
+ nclusters = jsonobj['Abstract'][chunksize]
359
+ else:
360
+ nclusters = jsonobj['FullText'][chunksize]
361
  return gr.Dropdown(choices= nclusters)
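The two dropdown callbacks index the nested dictionary directly, so `jsonobj['Abstract'][chunksize]` raises `KeyError` when the selected chunk size exists only for the other corpus (for instance after toggling the checkbox). A minimal sketch of the lookup as pure functions that fall back to an empty list is shown here; the function names are hypothetical and the Gradio wrapping (`gr.Dropdown(choices=...)`) is left out so the logic can be checked on its own.

```python
def corpus_tag(use_abstract: bool) -> str:
    """Tag names match the keys built by update_database_info in the diff."""
    return "Abstract" if use_abstract else "FullText"


def chunksize_choices(index: dict, use_abstract: bool) -> list:
    """Chunk sizes available for the selected corpus, as a plain list."""
    return list(index.get(corpus_tag(use_abstract), {}))


def ncluster_choices(index: dict, chunksize, use_abstract: bool) -> list:
    """Cluster counts for one chunk size; empty when the combination is absent."""
    return list(index.get(corpus_tag(use_abstract), {}).get(chunksize, []))


index = {"FullText": {512: [5, 20], 1024: [10]}, "Abstract": {256: [3]}}
print(chunksize_choices(index, use_abstract=False))      # [512, 1024]
print(ncluster_choices(index, 512, use_abstract=True))   # []
```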
362
 
363
  # @spaces.GPU(duration=120)
364
+ def annotation(n,chunksize:int,nclusters:int,remote_ornot:bool,use_abstract:bool):
365
  '''
366
  use llm to annotate cluster
367
  n: percentage of clusters to annotate
 
372
  else:
373
  backend = 'local'
374
 
375
+ clusterdir, samples, assistant, theme = get_ready('annotation',chunksize,nclusters,use_abstract)
376
  new_obj_list = []
377
  n = round(n * len(samples.keys()))
378
  for cluster_no in random.sample(samples.keys(), n):
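A portability note on the sampling line above: `random.sample` requires a sequence. Passing a set-like object such as `dict.keys()` emits a `DeprecationWarning` on Python 3.9 and 3.10 (the README pins 3.10.14) and raises `TypeError` from 3.11 onward. The sketch below shows the same sampling step with an explicit list and a clamped sample size; `pick_clusters` and its `seed` parameter are hypothetical additions for illustration.

```python
import random


def pick_clusters(samples: dict, ratio: float, seed=None) -> list:
    """Pick round(ratio * len(samples)) distinct cluster keys at random."""
    keys = list(samples)  # random.sample needs a sequence, not a keys view
    # Clamp so a ratio outside [0, 1] cannot request an invalid sample size.
    n = max(0, min(len(keys), round(ratio * len(keys))))
    return random.Random(seed).sample(keys, n)


samples = {str(i): [] for i in range(10)}
print(len(pick_clusters(samples, 0.3, seed=0)))  # 3
```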
 
401
  return '\n\n'.join([obj['annotation'] for obj in new_obj_list])
402
 
403
  # @spaces.GPU(duration=120)
404
+ def inspiration(annotation:str,chunksize:int,nclusters:int,remote_ornot:bool,use_abstract:bool):
405
  query = 'inspiration'
406
  if remote_ornot:
407
  backend = 'remote'
408
  else:
409
  backend = 'local'
410
 
411
+ clusterdir, annoresult, assistant, theme = get_ready('inspiration',chunksize,nclusters,use_abstract)
412
  new_obj_list = []
413
 
414
  if annotation is not None: # if the user wants to get inspiration from specific clusters only
 
450
  return urls
451
 
452
  @spaces.GPU(duration=120)
453
+ def summarize_text(query,chunksize:int,remote_ornot:bool,use_abstract:bool):
454
  if remote_ornot:
455
  backend = 'remote'
456
  else:
457
  backend = 'local'
458
 
459
+ assistant,_ = get_ready('summarize',chunksize=chunksize,k=None,use_abstract=use_abstract)
460
  code, reply, references = assistant.generate(query=query,
461
  history=[],
462
  groupname='',backend = backend)
 
643
  with gr.Accordion("聚类标注相关参数", open=True):
644
  with gr.Row():
645
  update_options = gr.Button("更新数据库情况", scale=0)
646
+ use_abstract = gr.Checkbox(label="是否仅使用摘要",scale=0)
647
  chunksize = gr.Dropdown([], label="选择块大小", scale=0)
648
  nclusters = gr.Dropdown([], label="选择聚类数", scale=0)
649
  ntoread = gr.Slider(
 
670
  output_references = gr.Markdown(label="参考文献")
671
 
672
  update_options.click(update_chunksize_dropdown,
673
+ inputs=[use_abstract],
674
  outputs=[chunksize])
675
 
676
  chunksize.change(update_ncluster_dropdown,
677
+ inputs=[chunksize,use_abstract],
678
  outputs= [nclusters])
679
 
680
  annotation_button.click(annotation,
681
+ inputs = [ntoread, chunksize, nclusters,remote_ornot,use_abstract],
682
  outputs=[annotation_output])
683
 
684
  inspiration_button.click(inspiration,
685
+ inputs= [annotation_output, chunksize, nclusters,remote_ornot,use_abstract],
686
  outputs=[inspiration_output])
687
 
688
  write_button.click(summarize_text,
689
+ inputs=[query, chunksize,remote_ornot,use_abstract],
690
  outputs =[output_text,output_references])
691
 
692
  demo.launch(share=False, server_name='0.0.0.0', debug=True,show_error=True,allowed_paths=['img_0.jpg'])
applocal.py CHANGED
@@ -82,7 +82,7 @@ def udate_model_dropdown(remote_company):
82
  'kimi': ['moonshot-v1-128k'],
83
  'deepseek': ['deepseek-chat'],
84
  'zhipuai': ['glm-4'],
85
- 'gpt': ['gpt-4-32k-0613','gpt-3.5-turbo']
86
  }
87
  return gr.Dropdown(choices= model_choices[remote_company])
88
 
@@ -107,7 +107,7 @@ def update_remote_config(remote_ornot,remote_company = None,api = None,baseurl =
107
  return gr.Button("配置已保存")
108
 
109
  # @spaces.GPU(duration=360)
110
- def get_ready(query:str,chunksize=None,k=None):
111
 
112
  with open(CONFIG_PATH, encoding='utf8') as f:
113
  config = pytoml.load(f)
@@ -124,6 +124,8 @@ def get_ready(query:str,chunksize=None,k=None):
124
  except:
125
  pass
126
 
 
 
127
  if query == 'annotation':
128
  if not chunksize or not k:
129
  raise ValueError('chunksize or k not provided')
@@ -182,9 +184,11 @@ def update_repo_info():
182
  pmc_success = repo_info['pmc_success_d']
183
  scihub_success = repo_info['scihub_success_d']
184
  failed_download = repo_info['failed_download']
 
 
185
 
186
  number_of_upload = number_of_pdf-scihub_success
187
- return keywords, retmax, search_len, import_len, failed_pmid_len, pmc_success, scihub_success, number_of_pdf, failed_download, number_of_upload
188
  else:
189
  return None,None,None,None,None,None,None,None,None,number_of_pdf
190
  else:
@@ -223,33 +227,39 @@ def delete_articles_repo():
223
  repodir, workdir, _ = get_ready('repo_work')
224
  if os.path.exists(repodir):
225
  shutil.rmtree(repodir)
 
226
  if os.path.exists(workdir):
227
  shutil.rmtree(workdir)
 
228
 
229
  return gr.Textbox(label="文献库概况",lines =3,
230
  value = '文献库和相关数据库已删除',
231
  visible = True)
232
 
233
  def update_repo():
234
- keys, retmax, search_len, import_len, _, pmc_success, scihub_success, pdflen, failed, pdflen = update_repo_info()
 
 
235
  newinfo = ""
236
  if keys == None:
237
  newinfo += '无关键词搜索相关信息\n'
238
  newinfo += '无导入的PMID\n'
239
- if pdflen:
240
- newinfo += f'上传的PDF数量: {pdflen}\n'
241
  else:
242
  newinfo += '无上传的PDF\n'
243
  else:
244
  newinfo += f'关键词搜索:'
245
- newinfo += f' 关键词: {keys}\n'
246
- newinfo += f' 搜索上限: {retmax}\n'
247
- newinfo += f' 搜索到的PMID数量: {search_len}\n'
248
  newinfo += f'导入的PMID数量: {import_len}\n'
249
- newinfo += f'成功获取PMC全文数量: {pmc_success}\n'
250
- newinfo += f'成功获取SciHub全文数量: {scihub_success}\n'
251
- newinfo += f"下载失败的ID: {failed}\n"
252
- newinfo += f'上传的PDF数量: {pdflen}\n'
 
 
253
 
254
  return gr.Textbox(label="文献库概况",lines =1,
255
  value = newinfo,
@@ -259,26 +269,35 @@ def update_database_info():
259
  with open(CONFIG_PATH, encoding='utf8') as f:
260
  config = pytoml.load(f)
261
  workdir = config['feature_store']['work_dir']
262
- chunkdirs = glob.glob(os.path.join(workdir, 'chunksize_*'))
263
- chunkdirs.sort()
264
- list_of_chunksize = [int(chunkdir.split('_')[-1]) for chunkdir in chunkdirs]
265
- # print(list_of_chunksize)
266
- jsonobj = {}
267
- for chunkdir in chunkdirs:
268
- k_dir = glob.glob(os.path.join(chunkdir, 'cluster_features','cluster_features_*'))
269
- k_dir.sort()
270
- list_of_k = [int(k.split('_')[-1]) for k in k_dir]
271
- jsonobj[int(chunkdir.split('_')[-1])] = list_of_k
 
 
 
 
 
 
272
 
273
-
274
- new_options = [f"chunksize:{chunksize}, k:{k}" for chunksize in list_of_chunksize for k in jsonobj[chunksize]]
 
275
 
276
- return new_options, jsonobj
277
 
278
  # @spaces.GPU(duration=360)
279
  def generate_database(chunksize:int,nclusters:str|list[str]):
280
  # Run the database-generation function here
281
  repodir, workdir, _ = get_ready('repo_work')
 
 
282
  if not os.path.exists(repodir):
283
  return gr.Textbox(label="数据库已生成",value = '请先生成文献库',visible = True)
284
  nclusters = [int(i) for i in nclusters]
@@ -295,12 +314,17 @@ def generate_database(chunksize:int,nclusters:str|list[str]):
295
  chunk_size=chunksize,
296
  n_clusters=nclusters,
297
  config_path=CONFIG_PATH)
 
298
 
299
  # walk all files in repo dir
300
- file_opr = FileOperation()
301
  files = file_opr.scan_dir(repo_dir=repodir)
302
  fs_init.initialize(files=files, work_dir=workdir,file_opr=file_opr)
303
  file_opr.summarize(files)
 
 
 
 
 
304
  del fs_init
305
  cache.pop('default')
306
  texts, _ = update_database_info()
@@ -310,6 +334,7 @@ def delete_database():
310
  _, workdir, _ = get_ready('repo_work')
311
  if os.path.exists(workdir):
312
  shutil.rmtree(workdir)
 
313
  return gr.Textbox(label="数据库概况",lines =3,value = '数据库已删除',visible = True)
314
 
315
  def update_database_textbox():
@@ -319,17 +344,24 @@ def update_database_textbox():
319
  else:
320
  return gr.Textbox(label="数据库概况",value = '\n'.join(texts),visible = True)
321
 
322
- def update_chunksize_dropdown():
323
  _, jsonobj = update_database_info()
324
- return gr.Dropdown(choices= jsonobj.keys())
 
 
 
 
325
 
326
- def update_ncluster_dropdown(chunksize:int):
327
  _, jsonobj = update_database_info()
328
- nclusters = jsonobj[chunksize]
 
 
 
329
  return gr.Dropdown(choices= nclusters)
330
 
331
  # @spaces.GPU(duration=360)
332
- def annotation(n,chunksize:int,nclusters:int,remote_ornot:bool):
333
  '''
334
  use llm to annotate cluster
335
  n: percentage of clusters to annotate
@@ -340,7 +372,7 @@ def annotation(n,chunksize:int,nclusters:int,remote_ornot:bool):
340
  else:
341
  backend = 'local'
342
 
343
- clusterdir, samples, assistant, theme = get_ready('annotation',chunksize,nclusters)
344
  new_obj_list = []
345
  n = round(n * len(samples.keys()))
346
  for cluster_no in random.sample(samples.keys(), n):
@@ -369,14 +401,14 @@ def annotation(n,chunksize:int,nclusters:int,remote_ornot:bool):
369
  return '\n\n'.join([obj['annotation'] for obj in new_obj_list])
370
 
371
  # @spaces.GPU(duration=360)
372
- def inspiration(annotation:str,chunksize:int,nclusters:int,remote_ornot:bool):
373
  query = 'inspiration'
374
  if remote_ornot:
375
  backend = 'remote'
376
  else:
377
  backend = 'local'
378
 
379
- clusterdir, annoresult, assistant, theme = get_ready('inspiration',chunksize,nclusters)
380
  new_obj_list = []
381
 
382
  if annotation is not None: # if the user wants to get inspiration from specific clusters only
@@ -418,13 +450,13 @@ def getpmcurls(references):
418
  return urls
419
 
420
  # @spaces.GPU(duration=360)
421
- def summarize_text(query,chunksize:int,remote_ornot:bool):
422
  if remote_ornot:
423
  backend = 'remote'
424
  else:
425
  backend = 'local'
426
 
427
- assistant,_ = get_ready('summarize',chunksize=chunksize,k=None)
428
  code, reply, references = assistant.generate(query=query,
429
  history=[],
430
  groupname='',backend = backend)
@@ -611,6 +643,7 @@ def main_interface():
611
  with gr.Accordion("聚类标注相关参数", open=True):
612
  with gr.Row():
613
  update_options = gr.Button("更新数据库情况", scale=0)
 
614
  chunksize = gr.Dropdown([], label="选择块大小", scale=0)
615
  nclusters = gr.Dropdown([], label="选择聚类数", scale=0)
616
  ntoread = gr.Slider(
@@ -637,22 +670,23 @@ def main_interface():
637
  output_references = gr.Markdown(label="参考文献")
638
 
639
  update_options.click(update_chunksize_dropdown,
 
640
  outputs=[chunksize])
641
 
642
  chunksize.change(update_ncluster_dropdown,
643
- inputs=[chunksize],
644
  outputs= [nclusters])
645
 
646
  annotation_button.click(annotation,
647
- inputs = [ntoread, chunksize, nclusters,remote_ornot],
648
  outputs=[annotation_output])
649
 
650
  inspiration_button.click(inspiration,
651
- inputs= [annotation_output, chunksize, nclusters,remote_ornot],
652
  outputs=[inspiration_output])
653
 
654
  write_button.click(summarize_text,
655
- inputs=[query, chunksize,remote_ornot],
656
  outputs =[output_text,output_references])
657
 
658
  demo.launch(share=False, server_name='0.0.0.0', debug=True,show_error=True,allowed_paths=['img_0.jpg'])
 
82
  'kimi': ['moonshot-v1-128k'],
83
  'deepseek': ['deepseek-chat'],
84
  'zhipuai': ['glm-4'],
85
+ 'gpt': ['gpt-4-32k-0613','gpt-3.5-turbo','gpt-4']
86
  }
87
  return gr.Dropdown(choices= model_choices[remote_company])
88
 
 
107
  return gr.Button("配置已保存")
108
 
109
  # @spaces.GPU(duration=360)
110
+ def get_ready(query:str,chunksize=None,k=None,use_abstract=False):
111
 
112
  with open(CONFIG_PATH, encoding='utf8') as f:
113
  config = pytoml.load(f)
 
124
  except:
125
  pass
126
 
127
+ if use_abstract:
128
+ workdir = workdir + '_ab'
129
  if query == 'annotation':
130
  if not chunksize or not k:
131
  raise ValueError('chunksize or k not provided')
 
184
  pmc_success = repo_info['pmc_success_d']
185
  scihub_success = repo_info['scihub_success_d']
186
  failed_download = repo_info['failed_download']
187
+ abstract_success = repo_info['abstract_success']
188
+ failed_abstract = repo_info['failed_abstract']
189
 
190
  number_of_upload = number_of_pdf-scihub_success
191
+ return keywords, retmax, search_len, import_len, failed_pmid_len, pmc_success, scihub_success, failed_download, abstract_success, failed_abstract, number_of_upload
192
  else:
193
  return None,None,None,None,None,None,None,None,None,None,number_of_pdf
194
  else:
 
227
  repodir, workdir, _ = get_ready('repo_work')
228
  if os.path.exists(repodir):
229
  shutil.rmtree(repodir)
230
+ shutil.rmtree(repodir + '_ab', ignore_errors=True)
231
  if os.path.exists(workdir):
232
  shutil.rmtree(workdir)
233
+ shutil.rmtree(workdir + '_ab', ignore_errors=True)
234
 
235
  return gr.Textbox(label="文献库概况",lines =3,
236
  value = '文献库和相关数据库已删除',
237
  visible = True)
238
 
239
  def update_repo():
240
+ # keywords, retmax, search_len, import_len, failed_pmid_len, pmc_success, scihub_success, failed_download, abstract_success, failed_abstract, number_of_upload
241
+ # None,None,None,None,None,None,None,None,None,None,number_of_pdf
242
+ keys, retmax, search_len, import_len, _, pmc_success, scihub_success, failed, abstract_success, failed_abstract, pdfuplo = update_repo_info()
243
  newinfo = ""
244
  if keys is None:
245
  newinfo += '无关键词搜索相关信息\n'
246
  newinfo += '无导入的PMID\n'
247
+ if pdfuplo:
248
+ newinfo += f'上传的PDF数量: {pdfuplo}\n'
249
  else:
250
  newinfo += '无上传的PDF\n'
251
  else:
252
  newinfo += '关键词搜索:\n'
253
+ newinfo += f' 关键词: {keys}\n'
254
+ newinfo += f' 搜索上限: {retmax}\n'
255
+ newinfo += f' 搜索到的PMID数量: {search_len}\n'
256
  newinfo += f'导入的PMID数量: {import_len}\n'
257
+ newinfo += f' 成功获取PMC全文数量: {pmc_success}\n'
258
+ newinfo += f' 成功获取SciHub全文数量: {scihub_success}\n'
259
+ newinfo += f" 下载失败的ID: {failed}\n"
260
+ newinfo += f" 成功获取摘要的数量: {abstract_success}\n"
261
+ newinfo += f" 获取摘要失败的数量: {failed_abstract}\n"
262
+ newinfo += f'上传的PDF数量: {pdfuplo}\n'
263
 
264
  return gr.Textbox(label="文献库概况",lines =1,
265
  value = newinfo,
 
269
  with open(CONFIG_PATH, encoding='utf8') as f:
270
  config = pytoml.load(f)
271
  workdir = config['feature_store']['work_dir']
272
+ abworkdir = workdir + '_ab'
273
+ options = []
274
+ total_json_obj = {}
275
+ for dir in [workdir,abworkdir]:
276
+ tag = 'FullText' if '_ab' not in dir else 'Abstract'
277
+
278
+ chunkdirs = glob.glob(os.path.join(dir, 'chunksize_*'))
279
+ chunkdirs.sort()
280
+ list_of_chunksize = [int(chunkdir.split('_')[-1]) for chunkdir in chunkdirs]
281
+ # print(list_of_chunksize)
282
+ jsonobj = {}
283
+ for chunkdir in chunkdirs:
284
+ k_dir = glob.glob(os.path.join(chunkdir, 'cluster_features','cluster_features_*'))
285
+ k_dir.sort()
286
+ list_of_k = [int(k.split('_')[-1]) for k in k_dir]
287
+ jsonobj[int(chunkdir.split('_')[-1])] = list_of_k
288
 
289
+ total_json_obj[tag] = jsonobj
290
+ newoptions = [f"{tag}, chunksize:{chunksize}, k:{k}" for chunksize in list_of_chunksize for k in jsonobj[chunksize]]
291
+ options.extend(newoptions)
292
 
293
+ return options, total_json_obj
294
 
295
  # @spaces.GPU(duration=360)
296
  def generate_database(chunksize:int,nclusters:str|list[str]):
297
  # Run the database-generation function here
298
  repodir, workdir, _ = get_ready('repo_work')
299
+ abrepodir = repodir + '_ab'
300
+ abworkdir = workdir + '_ab'
301
  if not os.path.exists(repodir):
302
  return gr.Textbox(label="数据库已生成",value = '请先生成文献库',visible = True)
303
  nclusters = [int(i) for i in nclusters]
 
314
  chunk_size=chunksize,
315
  n_clusters=nclusters,
316
  config_path=CONFIG_PATH)
317
+ file_opr = FileOperation()
318
 
319
  # walk all files in repo dir
 
320
  files = file_opr.scan_dir(repo_dir=repodir)
321
  fs_init.initialize(files=files, work_dir=workdir,file_opr=file_opr)
322
  file_opr.summarize(files)
323
+
324
+ files = file_opr.scan_dir(repo_dir=abrepodir)
325
+ fs_init.initialize(files=files, work_dir=abworkdir,file_opr=file_opr)
326
+ file_opr.summarize(files)
327
+
328
  del fs_init
329
  cache.pop('default')
330
  texts, _ = update_database_info()
 
334
  _, workdir, _ = get_ready('repo_work')
335
  if os.path.exists(workdir):
336
  shutil.rmtree(workdir)
337
+ shutil.rmtree(workdir+'_ab', ignore_errors=True)
338
  return gr.Textbox(label="数据库概况",lines =3,value = '数据库已删除',visible = True)
339
 
340
  def update_database_textbox():
 
344
  else:
345
  return gr.Textbox(label="数据库概况",value = '\n'.join(texts),visible = True)
346
 
347
+ def update_chunksize_dropdown(use_abstract):
348
  _, jsonobj = update_database_info()
349
+ if use_abstract:
350
+ choices = jsonobj['Abstract'].keys()
351
+ else:
352
+ choices = jsonobj['FullText'].keys()
353
+ return gr.Dropdown(choices= choices)
354
 
355
+ def update_ncluster_dropdown(chunksize:int,use_abstract:bool):
356
  _, jsonobj = update_database_info()
357
+ if use_abstract:
358
+ nclusters = jsonobj['Abstract'][chunksize]
359
+ else:
360
+ nclusters = jsonobj['FullText'][chunksize]
361
  return gr.Dropdown(choices= nclusters)
362
 
363
  # @spaces.GPU(duration=360)
364
+ def annotation(n,chunksize:int,nclusters:int,remote_ornot:bool,use_abstract:bool):
365
  '''
366
  use llm to annotate cluster
367
  n: percentage of clusters to annotate
 
372
  else:
373
  backend = 'local'
374
 
375
+ clusterdir, samples, assistant, theme = get_ready('annotation',chunksize,nclusters,use_abstract)
376
  new_obj_list = []
377
  n = round(n * len(samples.keys()))
378
  for cluster_no in random.sample(list(samples.keys()), n):
 
401
  return '\n\n'.join([obj['annotation'] for obj in new_obj_list])
402
 
403
  # @spaces.GPU(duration=360)
404
+ def inspiration(annotation:str,chunksize:int,nclusters:int,remote_ornot:bool,use_abstract:bool):
405
  query = 'inspiration'
406
  if remote_ornot:
407
  backend = 'remote'
408
  else:
409
  backend = 'local'
410
 
411
+ clusterdir, annoresult, assistant, theme = get_ready('inspiration',chunksize,nclusters,use_abstract)
412
  new_obj_list = []
413
 
414
  if annotation is not None: # if the user wants to get inspiration from specific clusters only
 
450
  return urls
451
 
452
  # @spaces.GPU(duration=360)
453
+ def summarize_text(query,chunksize:int,remote_ornot:bool,use_abstract:bool):
454
  if remote_ornot:
455
  backend = 'remote'
456
  else:
457
  backend = 'local'
458
 
459
+ assistant,_ = get_ready('summarize',chunksize=chunksize,k=None,use_abstract=use_abstract)
460
  code, reply, references = assistant.generate(query=query,
461
  history=[],
462
  groupname='',backend = backend)
 
643
  with gr.Accordion("聚类标注相关参数", open=True):
644
  with gr.Row():
645
  update_options = gr.Button("更新数据库情况", scale=0)
646
+ use_abstract = gr.Checkbox(label="是否仅使用摘要",scale=0)
647
  chunksize = gr.Dropdown([], label="选择块大小", scale=0)
648
  nclusters = gr.Dropdown([], label="选择聚类数", scale=0)
649
  ntoread = gr.Slider(
 
670
  output_references = gr.Markdown(label="参考文献")
671
 
672
  update_options.click(update_chunksize_dropdown,
673
+ inputs=[use_abstract],
674
  outputs=[chunksize])
675
 
676
  chunksize.change(update_ncluster_dropdown,
677
+ inputs=[chunksize,use_abstract],
678
  outputs= [nclusters])
679
 
680
  annotation_button.click(annotation,
681
+ inputs = [ntoread, chunksize, nclusters,remote_ornot,use_abstract],
682
  outputs=[annotation_output])
683
 
684
  inspiration_button.click(inspiration,
685
+ inputs= [annotation_output, chunksize, nclusters,remote_ornot,use_abstract],
686
  outputs=[inspiration_output])
687
 
688
  write_button.click(summarize_text,
689
+ inputs=[query, chunksize,remote_ornot,use_abstract],
690
  outputs =[output_text,output_references])
691
 
692
  demo.launch(share=False, server_name='0.0.0.0', debug=True,show_error=True,allowed_paths=['img_0.jpg'])
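
The directory scan behind `update_database_info` can be tried in isolation. The sketch below is a minimal, standalone approximation, not the committed code: the function name `list_database_options` and the `demo` helper are hypothetical, and it assumes the on-disk layout implied by the diff (`<workdir>/chunksize_<N>/cluster_features/cluster_features_<k>`, mirrored under `<workdir>_ab` for abstracts). Unlike the diff, which sorts directory names as strings, it sorts chunk sizes numerically.

```python
import glob
import os
import tempfile


def list_database_options(workdir):
    """Scan the full-text store and its '_ab' (abstract) twin.

    Hypothetical helper; assumes the layout
    <base>/chunksize_<N>/cluster_features/cluster_features_<k>.
    """
    options = []
    layout = {}
    for base, tag in ((workdir, "FullText"), (workdir + "_ab", "Abstract")):
        per_chunksize = {}
        for chunkdir in glob.glob(os.path.join(base, "chunksize_*")):
            # The trailing number of the directory name is the chunk size.
            chunksize = int(chunkdir.rsplit("_", 1)[-1])
            kdirs = glob.glob(
                os.path.join(chunkdir, "cluster_features", "cluster_features_*")
            )
            per_chunksize[chunksize] = sorted(
                int(kdir.rsplit("_", 1)[-1]) for kdir in kdirs
            )
        # Numeric ordering, so 512 comes before 1024.
        layout[tag] = dict(sorted(per_chunksize.items()))
        for chunksize, ks in layout[tag].items():
            options.extend(f"{tag}, chunksize:{chunksize}, k:{k}" for k in ks)
    return options, layout


def demo():
    """Build a throwaway directory tree and scan it."""
    with tempfile.TemporaryDirectory() as root:
        workdir = os.path.join(root, "workdir")
        for base, chunksize, k in [
            (workdir, 512, 10),
            (workdir, 1024, 20),
            (workdir + "_ab", 512, 10),
        ]:
            os.makedirs(
                os.path.join(
                    base,
                    f"chunksize_{chunksize}",
                    "cluster_features",
                    f"cluster_features_{k}",
                )
            )
        return list_database_options(workdir)
```

A missing `_ab` directory simply yields an empty `Abstract` entry, so the dropdown callbacks can index `layout['Abstract']` without a `KeyError` on the tag itself.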
config.ini CHANGED
@@ -4,8 +4,8 @@ embedding_model_path = "/root/models/bce-embedding-base_v1"
4
  reranker_model_path = "/root/models/bce-reranker-base_v1"
5
  repo_dir = "repodir"
6
  work_dir = "workdir"
7
- n_clusters = [10, 20]
8
- chunk_size = 1024
9
 
10
  [web_search]
11
  x_api_key = "${YOUR-API-KEY}"
 
4
  reranker_model_path = "/root/models/bce-reranker-base_v1"
5
  repo_dir = "repodir"
6
  work_dir = "workdir"
7
+ n_clusters = [10]
8
+ chunk_size = 2482
9
 
10
  [web_search]
11
  x_api_key = "${YOUR-API-KEY}"
huixiangdou/service/findarticles.py CHANGED
@@ -88,6 +88,7 @@ class ArticleRetrieval:
88
  for docsum in root.findall('DocSum'):
89
  pmcid = None
90
  doi = None
 
91
  id_value = docsum.find('Id').text
92
  for item in docsum.findall('.//Item[@Name="doi"]'):
93
  doi = item.text
@@ -155,11 +156,16 @@ class ArticleRetrieval:
155
  def fetch_full_text(self):
156
  if not os.path.exists(self.repo_dir):
157
  os.makedirs(self.repo_dir)
 
 
158
  print(f"Saving articles to {self.repo_dir}.")
159
  self.pmc_success = 0
160
  self.scihub_success = 0
 
161
  self.failed_download = []
 
162
  downloaded = os.listdir(self.repo_dir)
 
163
  for id in tqdm(self.pmc_ids, desc="Fetching full texts", unit="article"):
164
  # check if file already downloaded
165
  if f"{id}.txt" in downloaded:
@@ -194,6 +200,27 @@ class ArticleRetrieval:
194
  self.scihub_success += 1
195
  else:
196
  self.failed_download.append(doi)
 
197
 
198
  def save_config(self):
199
  config = {
@@ -213,6 +240,8 @@ class ArticleRetrieval:
213
  "pmc_success_d": self.pmc_success,
214
  "scihub_success_d": self.scihub_success,
215
  "failed_download": self.failed_download,
 
 
216
 
217
  }
218
  with open(os.path.join(self.repo_dir, 'info.json'), 'w') as f:
 
88
  for docsum in root.findall('DocSum'):
89
  pmcid = None
90
  doi = None
91
+ abstract = None
92
  id_value = docsum.find('Id').text
93
  for item in docsum.findall('.//Item[@Name="doi"]'):
94
  doi = item.text
 
156
  def fetch_full_text(self):
157
  if not os.path.exists(self.repo_dir):
158
  os.makedirs(self.repo_dir)
159
+ os.makedirs(self.repo_dir + '_ab')
160
+
161
  print(f"Saving articles to {self.repo_dir}.")
162
  self.pmc_success = 0
163
  self.scihub_success = 0
164
+ self.abstract_success = 0
165
  self.failed_download = []
166
+ self.failed_abstract = []
167
  downloaded = os.listdir(self.repo_dir)
168
+ downloaded_ab = os.listdir(self.repo_dir + '_ab')
169
  for id in tqdm(self.pmc_ids, desc="Fetching full texts", unit="article"):
170
  # check if file already downloaded
171
  if f"{id}.txt" in downloaded:
 
200
  self.scihub_success += 1
201
  else:
202
  self.failed_download.append(doi)
203
+ for pmid in tqdm(self.pmids, desc="Fetching abstract texts", unit="article"):
204
+ # check if file already downloaded
205
+ if f"{pmid}.txt" in downloaded_ab:
206
+ print(f"File already downloaded: {pmid}")
207
+ self.abstract_success += 1
208
+ continue
209
+ base_url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"
210
+ params = {
211
+ "db": "pubmed",
212
+ "id": pmid,
213
+ }
214
+
215
+ response = requests.get(base_url, params=params)
216
+ root = ET.fromstring(response.content)
217
+ abstract = root.find('.//AbstractText')
218
+ if abstract is not None:
219
+ with open(os.path.join(self.repo_dir + '_ab',f'{pmid}.txt'), 'w', encoding='utf-8') as f:
220
+ f.write(''.join(abstract.itertext()))
221
+ self.abstract_success += 1
222
+ else:
223
+ self.failed_abstract.append(pmid)
224
 
225
  def save_config(self):
226
  config = {
 
240
  "pmc_success_d": self.pmc_success,
241
  "scihub_success_d": self.scihub_success,
242
  "failed_download": self.failed_download,
243
+ "abstract_success": self.abstract_success,
244
+ "failed_abstract": self.failed_abstract
245
 
246
  }
247
  with open(os.path.join(self.repo_dir, 'info.json'), 'w') as f:
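
The abstract-fetching loop above keeps only the first `AbstractText` element and writes its `.text`. Structured abstracts may carry several labelled sections, and inline markup inside an element can cut `.text` short. The sketch below shows one way to collect every section. It is an illustration under stated assumptions: `extract_abstract` is a hypothetical helper, and `SAMPLE_XML` is a hand-written fragment shaped like a PubMed record, not a real response.

```python
import xml.etree.ElementTree as ET

# Hand-written sample (an assumption for illustration), shaped like a
# PubMed article record with a structured abstract and inline markup.
SAMPLE_XML = """<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <Article>
        <Abstract>
          <AbstractText Label="BACKGROUND">Aspirin is <i>widely</i> used.</AbstractText>
          <AbstractText Label="RESULTS">Risk fell by 12%.</AbstractText>
        </Abstract>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>"""


def extract_abstract(xml_payload):
    """Return all abstract sections joined by newlines, or None if absent."""
    root = ET.fromstring(xml_payload)
    parts = []
    for node in root.iterfind(".//AbstractText"):
        # itertext() also yields text nested inside child tags such as <i>.
        text = "".join(node.itertext()).strip()
        if not text:
            continue
        label = node.get("Label")
        parts.append(f"{label}: {text}" if label else text)
    return "\n".join(parts) if parts else None
```

Returning `None` for a record without an abstract keeps the caller's `failed_abstract` bookkeeping unchanged.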
huixiangdou/service/retriever.py CHANGED
@@ -40,7 +40,7 @@ class Retriever:
40
  search_type='similarity',
41
  search_kwargs={
42
  'score_threshold': 0.15,
43
- 'k': 5
44
  })
45
 
46
  self.reordering = LongContextReorder()
 
40
  search_type='similarity',
41
  search_kwargs={
42
  'score_threshold': 0.15,
43
+ 'k': 10
44
  })
45
 
46
  self.reordering = LongContextReorder()
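
Raising `k` from 5 to 10 doubles the maximum number of chunks handed to the reranker, while `score_threshold` still bounds how weak a match may be. The pure-Python sketch below only illustrates that intended "threshold, then top-k" behaviour. It is not the vector store's implementation, whether the threshold is actually applied with `search_type='similarity'` depends on the backing store, and the names `cosine` and `retrieve` are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_vec, chunks, k=10, score_threshold=0.15):
    """Keep chunks scoring at least score_threshold, best first, at most k.

    chunks is a list of (text, vector) pairs; returns (score, text) pairs.
    """
    scored = [(cosine(query_vec, vec), text) for text, vec in chunks]
    kept = [item for item in scored if item[0] >= score_threshold]
    kept.sort(key=lambda item: item[0], reverse=True)
    return kept[:k]
```

With a larger `k`, the threshold becomes the effective limit more often, so the two settings are worth tuning together.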