# ArxivDigest 
This repo provides a personalized daily digest of newly published arXiv papers, ranked by GPT relevancy ratings against a natural-language description of your research interests.

You can try it out at [https://huggingface.co/spaces/AutoLLM/ArxivDigest](https://huggingface.co/spaces/AutoLLM/ArxivDigest) using your own OpenAI API key.

You can also create a daily subscription pipeline to email you the results.

## 📚 Contents

- [What this repo does](#🔍-what-this-repo-does)
  * [Examples](#some-examples-of-results)
- [Usage](#💡-usage)
  * [Running as a GitHub Action using SendGrid (Recommended)](#running-as-a-github-action-using-sendgrid-recommended)
  * [Running as a GitHub Action with SMTP credentials](#running-as-a-github-action-with-smtp-credentials)
  * [Running as a GitHub Action without emails](#running-as-a-github-action-without-emails)
  * [Running from the command line](#running-from-the-command-line)
  * [Running with a user interface](#running-with-a-user-interface)
- [Roadmap](#✅-roadmap)
- [Extending and Contributing](#💁-extending-and-contributing)

## 🔍 What this repo does

Staying up to date on [arXiv](https://arxiv.org) papers can take a considerable amount of time, with on the order of hundreds of new papers each day to filter through. There is an [official daily digest service](https://info.arxiv.org/help/subscribe.html); however, large categories like [cs.AI](https://arxiv.org/list/cs.AI/recent) still have 50-100 papers a day. Determining whether these papers are relevant and important to you means reading through each title and abstract, which is time-consuming.

This repository offers a method to curate a daily digest, sorted by relevance, using large language models. These models are conditioned based on your personal research interests, which are described in natural language. 

* You modify the configuration file `config.yaml` with an arXiv Subject, some set of Categories, and a natural language statement about the type of papers you are interested in.  
* The code pulls all the abstracts for papers in those categories and ranks how relevant they are to your interest on a scale of 1-10 using `gpt-3.5-turbo`.
* The code then emits an HTML digest listing all the relevant papers, and optionally emails it to you using [SendGrid](https://sendgrid.com). You will need to have a SendGrid account with an API key for this functionality to work.  
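
The three steps above can be sketched roughly as follows. This is a minimal illustration rather than the repository's actual code: `Paper`, `rank_papers`, `render_digest`, and the `threshold` parameter are hypothetical names, and `keyword_score` is a trivial stand-in for the `gpt-3.5-turbo` relevance rating so that the example runs without an API key.

```python
from dataclasses import dataclass
from html import escape
from typing import Callable, List, Tuple


@dataclass
class Paper:
    """Hypothetical container for one arXiv entry."""
    title: str
    abstract: str


def rank_papers(
    papers: List[Paper],
    score_fn: Callable[[Paper], int],
    threshold: int = 7,
) -> List[Tuple[int, Paper]]:
    """Score each paper on a 1-10 scale and keep those at or above
    `threshold`, most relevant first."""
    scored = []
    for paper in papers:
        score = max(1, min(10, score_fn(paper)))  # clamp to the 1-10 scale
        if score >= threshold:
            scored.append((score, paper))
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


def render_digest(ranked: List[Tuple[int, Paper]]) -> str:
    """Render the ranked papers as a small HTML list."""
    items = "\n".join(
        f"<li><b>{escape(paper.title)}</b> (relevance {score}/10)</li>"
        for score, paper in ranked
    )
    return f"<ol>\n{items}\n</ol>"


def keyword_score(paper: Paper) -> int:
    """Stand-in for the LLM rating (assumption: the real score comes from GPT)."""
    return 9 if "language model" in paper.abstract.lower() else 2


papers = [
    Paper("Scaling pretraining", "We study large language model pretraining."),
    Paper("Option pricing", "A new closed-form bound for exotic options."),
]
ranked = rank_papers(papers, keyword_score)
digest_html = render_digest(ranked)
print(digest_html)
```

Passing the scoring function in as an argument keeps the ranking and rendering logic independent of whichever model produces the ratings.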

### Testing it out with Hugging Face:

We provide a demo at [https://huggingface.co/spaces/AutoLLM/ArxivDigest](https://huggingface.co/spaces/AutoLLM/ArxivDigest). Simply enter your [OpenAI API key](https://platform.openai.com/account/api-keys) and then fill in the configuration on the right. Note that we do not store your key.

![hfexample](./readme_images/hf_example.png)

You can also send yourself an email of the digest by creating a SendGrid account and [API key](https://app.SendGrid.com/settings/api_keys).

### Some examples of results:

#### Digest Configuration:
- Subject/Topic: Computer Science
- Categories: Artificial Intelligence, Computation and Language 
- Interest: 
  - Large language model pretraining and fine-tuning
  - Multimodal machine learning
  - Not interested in specific applications, for example, information extraction, summarization, etc.
  - Not interested in papers that focus on specific languages, e.g., Arabic, Chinese, etc.

#### Result:
![example1](./readme_images/example_1.png)

#### Digest Configuration:
- Subject/Topic: Quantitative Finance
- Interest: "making lots of money"

#### Result:
![example2](./readme_images/example_2.png)

## 💡 Usage

### Running as a GitHub Action using SendGrid (Recommended)

The recommended way to get started using this repository is to:

1. Fork the repository
2. Modify `config.yaml` and merge the changes into your main branch. If you want a different schedule than Sunday through Thursday at 1:25PM UTC, then also modify the file `.github/workflows/daily_pipeline.yaml`.
3. Create or fetch your API key for [OpenAI](https://platform.openai.com/account/api-keys). Note: you will need an OpenAI account.
4. Create or fetch your api key for [SendGrid](https://app.SendGrid.com/settings/api_keys). You will need a SendGrid account. The free tier will generally suffice. Make sure to [verify your sender identity](https://docs.sendgrid.com/for-developers/sending-email/sender-identity).
5. Set the following secrets [(under settings, Secrets and variables, repository secrets)](https://docs.github.com/en/actions/security-guides/encrypted-secrets#creating-encrypted-secrets-for-a-repository):
   - `OPENAI_API_KEY`
   - `SENDGRID_API_KEY`
   - `FROM_EMAIL` This value must match the email you used to create the SendGrid API key. This is not needed if you have it set in `config.yaml`.
   - `TO_EMAIL` Only if you don't have it set in `config.yaml`
6. Manually trigger the action or wait until the scheduled action takes place.

![artifact](./readme_images/trigger.png)
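
If you change the schedule in step 2, it can help to see how the default maps onto cron syntax. GitHub Actions schedules use five-field POSIX cron expressions evaluated in UTC, with day-of-week `0` meaning Sunday, so "Sunday through Thursday at 1:25PM UTC" can be written as `25 13 * * 0-4`. The workflow file may spell it differently; the constant and helper below are hypothetical and only illustrate the mapping.

```python
from datetime import datetime, timezone

# Assumed cron form of "Sunday through Thursday at 1:25PM UTC".
# Fields: minute, hour, day of month, month, day of week (0 = Sunday).
DEFAULT_CRON = "25 13 * * 0-4"


def matches_default_schedule(moment: datetime) -> bool:
    """Return True if a timezone-aware `moment` falls on the default schedule."""
    utc = moment.astimezone(timezone.utc)
    # Python counts Monday as 0; cron counts Sunday as 0.
    cron_day = (utc.weekday() + 1) % 7
    return cron_day <= 4 and utc.hour == 13 and utc.minute == 25


sunday = datetime(2023, 6, 4, 13, 25, tzinfo=timezone.utc)
friday = datetime(2023, 6, 2, 13, 25, tzinfo=timezone.utc)
print(matches_default_schedule(sunday))
print(matches_default_schedule(friday))
```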


### Running as a GitHub Action with SMTP credentials

An alternative way to get started using this repository is to:

1. Fork the repository
2. Modify `config.yaml` and merge the changes into your main branch. If you want a different schedule than Sunday through Thursday at 1:25PM UTC, then also modify the file `.github/workflows/daily_pipeline.yaml`.
3. Create or fetch your API key for [OpenAI](https://platform.openai.com/account/api-keys). Note: you will need an OpenAI account.
4. Find your email provider's SMTP settings and set the secret `MAIL_CONNECTION` to that. It should be in the form `smtp://user:password@server:port` or `smtp+starttls://user:password@server:port`. Alternatively, if you are using Gmail, you can set `MAIL_USERNAME` and `MAIL_PASSWORD` instead, using an [application password](https://support.google.com/accounts/answer/185833).
5. Set the following secrets [(under settings, Secrets and variables, repository secrets)](https://docs.github.com/en/actions/security-guides/encrypted-secrets#creating-encrypted-secrets-for-a-repository):
   - `OPENAI_API_KEY`
   - `MAIL_CONNECTION` (see above)
   - `MAIL_PASSWORD` (only if you don't have `MAIL_CONNECTION` set)
   - `MAIL_USERNAME` (only if you don't have `MAIL_CONNECTION` set)
   - `FROM_EMAIL` (only if you don't have it set in `config.yaml`)
   - `TO_EMAIL` (only if you don't have it set in `config.yaml`)
6. Manually trigger the action or wait until the scheduled action takes place.
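
Before saving `MAIL_CONNECTION` as a secret, you may want to check locally that the string has the expected shape. The sketch below uses only the Python standard library; `parse_mail_connection` is a hypothetical helper, not part of this repository, and the address in the example is made up.

```python
from urllib.parse import urlsplit


def parse_mail_connection(url: str) -> dict:
    """Split an smtp:// or smtp+starttls:// connection string into its parts."""
    parts = urlsplit(url)
    if parts.scheme not in ("smtp", "smtp+starttls"):
        raise ValueError(f"unsupported scheme: {parts.scheme!r}")
    if not parts.hostname or parts.port is None:
        raise ValueError("connection string must include server and port")
    return {
        "starttls": parts.scheme == "smtp+starttls",
        "username": parts.username,
        "password": parts.password,
        "server": parts.hostname,
        "port": parts.port,
    }


settings = parse_mail_connection(
    "smtp+starttls://alice:s3cret@mail.example.com:587"
)
# Avoid printing the password when checking real credentials.
print(settings["server"], settings["port"], settings["starttls"])
```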

### Running as a GitHub Action without emails

If you do not wish to create a SendGrid account or use your email authentication, the action will also emit an artifact containing the HTML output. Simply do not create the SendGrid or SMTP secrets.

You can access this digest as part of the GitHub Action artifact.

![artifact](./readme_images/artifact.png)

### Running from the command line

If you do not wish to fork this repository, and would prefer to clone and run it locally instead:

1. Install the requirements in `src/requirements.txt`
2. Modify the configuration file `config.yaml`
3. Create or fetch your API key for [OpenAI](https://platform.openai.com/account/api-keys). Note: you will need an OpenAI account.
4. Create or fetch your API key for [SendGrid](https://app.SendGrid.com/settings/api_keys) (optional, if you want the script to email you).
5. Set the following secrets as environment variables: 
   - `OPENAI_API_KEY`
   - `SENDGRID_API_KEY` (only if using SendGrid)
   - `FROM_EMAIL` (only if using SendGrid and if you don't have it set in `config.yaml`. Note that this value must match the email you used to create the SendGrid API key.)
   - `TO_EMAIL` (only if using SendGrid and if you don't have it set in `config.yaml`)
6. Run `python action.py`.
7. If you are not using SendGrid, the HTML of the digest will be written to `digest.html`. You can then use your favorite web browser to view it.
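
A quick pre-flight check of the environment variables from step 5 can save a failed run. The following is a sketch under stated assumptions: `missing_variables` and its two flags are hypothetical and not part of `action.py`; they simply encode the conditions listed above.

```python
import os
from typing import List, Mapping


def missing_variables(
    env: Mapping[str, str],
    use_sendgrid: bool,
    emails_in_config: bool,
) -> List[str]:
    """Return the names of required variables that are unset or empty."""
    required = ["OPENAI_API_KEY"]
    if use_sendgrid:
        required.append("SENDGRID_API_KEY")
        if not emails_in_config:
            # Only needed when config.yaml does not already provide them.
            required += ["FROM_EMAIL", "TO_EMAIL"]
    return [name for name in required if not env.get(name)]


missing = missing_variables(os.environ, use_sendgrid=True, emails_in_config=False)
if missing:
    print("Missing environment variables:", ", ".join(missing))
else:
    print("All required environment variables are set.")
```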

You may want to use something like crontab to schedule the digest.

### Running with a user interface

Install the requirements in `src/requirements.txt` as well as `gradio`. Set the environment variables `OPENAI_API_KEY`, `FROM_EMAIL` and `SENDGRID_API_KEY`. Ensure that `FROM_EMAIL` matches the email you used to create the SendGrid API key.

Run `python src/app.py` and go to the local URL. From there you will be able to preview the papers from today, as well as the generated digests.

## ✅ Roadmap

- [x] Support personalized paper recommendation using LLM.
- [x] Send emails for daily digest.
- [ ] Implement a ranking factor to prioritize content from specific authors.
- [ ] Support open-source models, e.g., LLaMA, Vicuna, MPT etc.
- [ ] Fine-tune an open-source model to better support paper ranking and stay updated with the latest research concepts.


## 💁 Extending and Contributing

You may (and are encouraged to) modify the code in this repository to suit your personal needs. If you think your modifications would be in any way useful to others, please submit a pull request.

These types of modifications include things like changes to the prompt, different language models, or additional ways for the digest to be delivered to you.