Richard Fan committed

Commit ef10e9f · 1 Parent(s): f7d455a

initial commit

.github/workflows/.daily_pipeline.yaml.swp ADDED
Binary file (12.3 kB)
 
.github/workflows/daily_pipeline.yaml ADDED
@@ -0,0 +1,80 @@
+ # This workflow will install Python dependencies, run tests and lint with a single version of Python
+ # For more information see: https://help.github.com/actions/language-and-framework-guides/using-python-with-github-actions
+
+ name: Daily pipeline
+
+ on:
+   workflow_dispatch: {}
+   schedule:
+     # * is a special character in YAML so you have to quote this string
+     # Feel free to change this cron schedule
+     # Currently it's scheduled for 1:25 pm UTC, Sun-Thurs
+     - cron: '25 13 * * 0-4'
+
+ jobs:
+   build_and_test:
+     runs-on: ubuntu-latest
+     steps:
+       - uses: actions/checkout@v2
+       - name: Set up Python 3.8
+         uses: actions/setup-python@v2
+         with:
+           python-version: 3.8
+       - name: Install dependencies
+         run: |
+           python -m pip install --upgrade pip
+           pip install -r src/requirements.txt
+       - name: Generate Digest
+         run: |
+           python src/action.py
+         env:
+           OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
+           SENDGRID_API_KEY: ${{ secrets.SENDGRID_API_KEY }}
+           FROM_EMAIL: ${{ secrets.FROM_EMAIL }}
+           TO_EMAIL: ${{ secrets.TO_EMAIL }}
+       - name: Upload Artifact
+         uses: actions/upload-artifact@v3
+         with:
+           name: digest.html
+           path: digest.html
+       # Only send mail when SendGrid or SMTP secrets are actually configured
+       - name: check
+         id: check
+         env:
+           SENDGRID_API_KEY: ${{ secrets.SENDGRID_API_KEY }}
+           MAIL_USERNAME: ${{ secrets.MAIL_USERNAME }}
+           MAIL_PASSWORD: ${{ secrets.MAIL_PASSWORD }}
+           MAIL_CONNECTION: ${{ secrets.MAIL_CONNECTION }}
+         if: "${{ env.SENDGRID_API_KEY != '' && (env.MAIL_CONNECTION || env.MAIL_USERNAME != '' && env.MAIL_PASSWORD != '') }}"
+         run: echo "DEFINED=true" >> $GITHUB_OUTPUT
+       - name: Test step
+         env:
+           DEFINED: ${{ steps.check.outputs.DEFINED }}
+         run: echo "$DEFINED"
+       - name: Send mail
+         uses: dawidd6/action-send-mail@v3
+         env:
+           DEFINED: ${{ steps.check.outputs.DEFINED }}
+         if: ${{ env.DEFINED == 'true' }}
+         with:
+           # Specify connection via URL (replaces server_address, server_port, secure,
+           # username and password)
+           #
+           # Format:
+           #
+           # * smtp://user:password@server:port
+           # * smtp+starttls://user:password@server:port
+           connection_url: ${{ secrets.MAIL_CONNECTION }}
+           # Required mail server address if not connection_url:
+           server_address: smtp.gmail.com
+           # Server port, default 25:
+           server_port: 465
+           username: ${{ secrets.MAIL_USERNAME }}
+           password: ${{ secrets.MAIL_PASSWORD }}
+           secure: true
+           subject: Personalized arXiv Digest
+           to: ${{ secrets.TO_EMAIL }}
+           from: "Personalized arxiv digest"
+           html_body: file://digest.html
+           ignore_cert: true
+           convert_markdown: true
+           priority: normal
README.md CHANGED
@@ -1,2 +1,105 @@
- # Arxiv-Digest
- Personalized Arxiv Digest using Large Language Models
+ # Personalized-Arxiv-digest
+ This repo aims to provide a better daily digest for newly published arxiv papers, based on your own research interests and descriptions.
+
+ ## What this repo does
+
+ Staying up to date on [arxiv](https://arxiv.org) papers can take a considerable amount of time, with on the order of hundreds of new papers each day to filter through. There is an [official daily digest service](https://info.arxiv.org/help/subscribe.html); however, large subtopics like [cs.AI](https://arxiv.org/list/cs.AI/recent) still see 50-100 papers a day. Determining whether these papers are relevant and important to you means reading through each title and abstract.
+
+ This repository provides a way to have this daily digest sorted by relevance via large language models (sketched below):
+
+ * You modify the configuration file `config.yaml` with an arxiv topic, some set of subtopics, and a natural language statement about the type of papers you are interested in.
+ * The code pulls all the abstracts for papers in those subtopics and ranks how relevant they are to your interest on a scale of 1-10 using gpt-3.5-turbo.
+ * The code then emits an HTML digest listing all the relevant papers, and optionally emails it to you using [SendGrid](https://sendgrid.com). You will need a SendGrid account with an API key for this functionality to work.
+
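+ Under the hood, this amounts to roughly the following (a minimal sketch using the helpers in `src/`; it assumes you run it from the repository root with `src` on your `PYTHONPATH` and `OPENAI_API_KEY` set):
+
+ ```python
+ from download_new_papers import get_papers
+ from relevancy import generate_relevance_score
+
+ papers = get_papers("cs")  # scrape today's new arXiv submissions for a topic
+ relevant, hallucinated = generate_relevance_score(
+     papers,
+     query={"interest": "Large language model pretraining and finetuning"},
+     threshold_score=7,      # papers scoring below this are dropped
+     num_paper_in_prompt=8,  # abstracts batched into each gpt-3.5-turbo call
+ )
+ ```
+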
+ ### Some examples:
+
+ - Topic: cs.AI, cs.CL
+ - Interest:
+   - Large language model pretraining and finetuning
+   - Multimodal machine learning
+   - Do not care about specific applications, for example, information extraction, summarization, etc.
+   - Not interested in papers focusing on specific languages, e.g., Arabic, Chinese, etc.
+
+ ![example1](./readme_images/example_1.png)
+
+
+ - Topic: q-fin
+ - Interest: "making lots of money"
+
+ ![example2](./readme_images/example_2.png)
+
+ ## Usage
+
+ ### Running as a GitHub Action using SendGrid
+
+ The recommended way to get started using this repository is to:
+
+ 1. Fork the repository.
+ 2. Modify `config.yaml` and merge the changes into your main branch. If you want a different schedule than Sunday through Thursday at 1:25PM UTC, then also modify the file `.github/workflows/daily_pipeline.yaml`.
+ 3. Create or fetch your API key for [OpenAI](https://platform.openai.com/account/api-keys). Note: you will need an OpenAI account.
+ 4. Create or fetch your API key for [SendGrid](https://app.SendGrid.com/settings/api_keys). You will need a SendGrid account. The free tier will generally suffice.
+ 5. Set the following secrets in your repository settings:
+    - `OPENAI_API_KEY`
+    - `SENDGRID_API_KEY`
+    - `FROM_EMAIL` (only if you don't have it set in `config.yaml`)
+    - `TO_EMAIL` (only if you don't have it set in `config.yaml`)
+ 6. Manually trigger the action or wait until the scheduled action takes place.
+
+ ![artifact](./readme_images/trigger.png)
+
+
+ ### Running as a GitHub Action with SMTP credentials
+
+ An alternative way to get started using this repository is to:
+
+ 1. Fork the repository.
+ 2. Modify `config.yaml` and merge the changes into your main branch. If you want a different schedule than Sunday through Thursday at 1:25PM UTC, then also modify the file `.github/workflows/daily_pipeline.yaml`.
+ 3. Create or fetch your API key for [OpenAI](https://platform.openai.com/account/api-keys). Note: you will need an OpenAI account.
+ 4. Find your email provider's SMTP settings and set the secret `MAIL_CONNECTION` to that. It should be in the form `smtp://user:password@server:port` or `smtp+starttls://user:password@server:port`. Alternatively, if you are using Gmail, you can set `MAIL_USERNAME` and `MAIL_PASSWORD` instead. If you are (understandably) apprehensive about using your email credentials here, you can create something like an [application password](https://support.google.com/accounts/answer/185833) instead.
+ 5. Set the following secrets in your repository settings:
+    - `OPENAI_API_KEY`
+    - `MAIL_CONNECTION` (see above)
+    - `MAIL_PASSWORD` (only if you don't have `MAIL_CONNECTION` set)
+    - `MAIL_USERNAME` (only if you don't have `MAIL_CONNECTION` set)
+    - `FROM_EMAIL` (only if you don't have it set in `config.yaml`)
+    - `TO_EMAIL` (only if you don't have it set in `config.yaml`)
+ 6. Manually trigger the action or wait until the scheduled action takes place.
+
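+ For example, with a Gmail application password, `MAIL_CONNECTION` could look something like `smtp+starttls://your.name%40gmail.com:[email protected]:587` (hypothetical credentials; special characters such as `@` in the username or password must be URL-encoded).
+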
69
+ #### Running as a github action without emails
70
+
71
+ If you do not wish to create a SendGrid account or use your email authentication, the action will also emit an artifact containing the HTML output. Simply do not create the SendGrid or SMTP secrets.
72
+
73
+ You can access this digest as part of the github action artifact.
74
+
75
+ ![artifact](./readme_images/artifact.png)
76
+
77
+ ### Running from the command line
78
+
79
+ If you do not wish to fork this repository, and would prefer to clone and run it locally instead:
80
+
81
+ 1. Install the requirements in `src/requirements.txt`
82
+ 2. Modify the configuration file `config.yaml`
83
+ 3. Create or fetch your api key for [OpenAI](https://platform.openai.com/account/api-keys). Note: you will need an OpenAI account.
84
+ 4. Create or fetch your api key for [SendGrid](https://app.SendGrid.com/settings/api_keys) (optional, if you want the script to email you)
85
+ 5. Set the following secrets:
86
+ - `OPENAI_API_KEY`
87
+ - `SENDGRID_API_KEY` (only if using SendGrid)
88
+ - `FROM_EMAIL` (only if using SendGrid and if you don't have them set in `config.yaml`)
89
+ - `TO_EMAIL` (only if using SendGrid and if you don't have them set in `config.yaml`)
90
+ 6. Run `python action.py`.
91
+ 7. If you are not using SendGrid, the html of the digest will be written to `digest.html`. You can then use your favorite webbrowser to view it.
92
+
93
+ You may want to use something like crontab to schedule the digest.
94
+
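+ For example, a crontab entry mirroring the GitHub Actions schedule might look like `25 13 * * 0-4 cd ~/ArxivDigest && python src/action.py` (a hypothetical path; the environment variables above must also be visible to cron).
+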
+ ### Running with a user interface
+
+ Install the requirements in `src/requirements.txt` as well as `gradio`. Set the environment variables `OPENAI_API_KEY`, `FROM_EMAIL` and `SENDGRID_API_KEY`.
+
+ Run `python src/app.py` and go to the local URL. From there you will be able to preview the papers from today, as well as the generated digests.
+
+ ## Extending and Contributing
+
+ You may (and are encouraged to) modify the code in this repository to suit your personal needs. If you think your modifications would be in any way useful to others, please submit a pull request.
+
+ These types of modifications include things like changes to the prompt, different language models, or additional ways for the digest to be delivered to you.
config.yaml ADDED
@@ -0,0 +1,37 @@
+ # For physics topics, use the specific subtopics, e.g. "Astrophysics"
+ topic: "Computer Science"
+ # An empty list here will include all categories in a topic
+ # Including more categories will result in more calls to the large language model
+ categories: ["Artificial Intelligence", "Computation and Language"]
+
+ # The email address that the digest will be sent from. Must be the address matching
+ # your SendGrid API key.
+ # Leaving this empty will cause the script to use the
+ # FROM_EMAIL environment variable instead
+ from_email: ""
+
+ # The email address you are going to send the digest to
+ # Leaving this empty will cause the script to use the
+ # TO_EMAIL environment variable instead
+ to_email: ""
+
+ # Relevance score threshold. Papers whose abstracts receive a score below this
+ # from the large language model will be filtered out of the digest.
+ #
+ # Must be within 1-10
+ threshold: 7
+
+ # A natural language statement that the large language model will use to judge which papers are relevant
+ #
+ # For example:
+ # "I am interested in complexity theory papers that establish upper bounds"
+ # "gas chromatography, mass spectrometry"
+ # "making lots of money"
+ #
+ # This can be empty, which will just return the full list of papers with no judgment or filtering,
+ # in whatever order arXiv responds with.
+ interest: |
+   1. Large language model pretraining and finetuning
+   2. Multimodal machine learning
+   3. Do not care about specific applications, for example, information extraction, summarization, etc.
+   4. Not interested in papers focusing on specific languages, e.g., Arabic, Chinese, etc.
readme_images/artifact.png ADDED
readme_images/example_1.png ADDED
readme_images/example_2.png ADDED
readme_images/trigger.png ADDED
src/action.py ADDED
@@ -0,0 +1,142 @@
+ from sendgrid import SendGridAPIClient
+ from sendgrid.helpers.mail import Mail, Email, To, Content
+
+ from datetime import date
+
+ import argparse
+ import yaml
+ import os
+
+ from relevancy import generate_relevance_score, process_subject_fields
+ from download_new_papers import get_papers
+
+
+ # Hackathon quality code. Don't judge too harshly.
+ # Feel free to submit pull requests to improve the code.
+
+ # Top-level arXiv topics and their URL abbreviations
+ topics = {
+     "Physics": "",
+     "Mathematics": "math",
+     "Computer Science": "cs",
+     "Quantitative Biology": "q-bio",
+     "Quantitative Finance": "q-fin",
+     "Statistics": "stat",
+     "Electrical Engineering and Systems Science": "eess",
+     "Economics": "econ"
+ }
+
+ physics_topics = {
+     "Astrophysics": "astro-ph",
+     "Condensed Matter": "cond-mat",
+     "General Relativity and Quantum Cosmology": "gr-qc",
+     "High Energy Physics - Experiment": "hep-ex",
+     "High Energy Physics - Lattice": "hep-lat",
+     "High Energy Physics - Phenomenology": "hep-ph",
+     "High Energy Physics - Theory": "hep-th",
+     "Mathematical Physics": "math-ph",
+     "Nonlinear Sciences": "nlin",
+     "Nuclear Experiment": "nucl-ex",
+     "Nuclear Theory": "nucl-th",
+     "Physics": "physics",
+     "Quantum Physics": "quant-ph"
+ }
+
+
+ # TODO: surely there's a better way
+ category_map = {
+     "Astrophysics": ["Astrophysics of Galaxies", "Cosmology and Nongalactic Astrophysics", "Earth and Planetary Astrophysics", "High Energy Astrophysical Phenomena", "Instrumentation and Methods for Astrophysics", "Solar and Stellar Astrophysics"],
+     "Condensed Matter": ["Disordered Systems and Neural Networks", "Materials Science", "Mesoscale and Nanoscale Physics", "Other Condensed Matter", "Quantum Gases", "Soft Condensed Matter", "Statistical Mechanics", "Strongly Correlated Electrons", "Superconductivity"],
+     "General Relativity and Quantum Cosmology": ["None"],
+     "High Energy Physics - Experiment": ["None"],
+     "High Energy Physics - Lattice": ["None"],
+     "High Energy Physics - Phenomenology": ["None"],
+     "High Energy Physics - Theory": ["None"],
+     "Mathematical Physics": ["None"],
+     "Nonlinear Sciences": ["Adaptation and Self-Organizing Systems", "Cellular Automata and Lattice Gases", "Chaotic Dynamics", "Exactly Solvable and Integrable Systems", "Pattern Formation and Solitons"],
+     "Nuclear Experiment": ["None"],
+     "Nuclear Theory": ["None"],
+     "Physics": ["Accelerator Physics", "Applied Physics", "Atmospheric and Oceanic Physics", "Atomic and Molecular Clusters", "Atomic Physics", "Biological Physics", "Chemical Physics", "Classical Physics", "Computational Physics", "Data Analysis, Statistics and Probability", "Fluid Dynamics", "General Physics", "Geophysics", "History and Philosophy of Physics", "Instrumentation and Detectors", "Medical Physics", "Optics", "Physics and Society", "Physics Education", "Plasma Physics", "Popular Physics", "Space Physics"],
+     "Quantum Physics": ["None"],
+     "Mathematics": ["Algebraic Geometry", "Algebraic Topology", "Analysis of PDEs", "Category Theory", "Classical Analysis and ODEs", "Combinatorics", "Commutative Algebra", "Complex Variables", "Differential Geometry", "Dynamical Systems", "Functional Analysis", "General Mathematics", "General Topology", "Geometric Topology", "Group Theory", "History and Overview", "Information Theory", "K-Theory and Homology", "Logic", "Mathematical Physics", "Metric Geometry", "Number Theory", "Numerical Analysis", "Operator Algebras", "Optimization and Control", "Probability", "Quantum Algebra", "Representation Theory", "Rings and Algebras", "Spectral Theory", "Statistics Theory", "Symplectic Geometry"],
+     "Computer Science": ["Artificial Intelligence", "Computation and Language", "Computational Complexity", "Computational Engineering, Finance, and Science", "Computational Geometry", "Computer Science and Game Theory", "Computer Vision and Pattern Recognition", "Computers and Society", "Cryptography and Security", "Data Structures and Algorithms", "Databases", "Digital Libraries", "Discrete Mathematics", "Distributed, Parallel, and Cluster Computing", "Emerging Technologies", "Formal Languages and Automata Theory", "General Literature", "Graphics", "Hardware Architecture", "Human-Computer Interaction", "Information Retrieval", "Information Theory", "Logic in Computer Science", "Machine Learning", "Mathematical Software", "Multiagent Systems", "Multimedia", "Networking and Internet Architecture", "Neural and Evolutionary Computing", "Numerical Analysis", "Operating Systems", "Other Computer Science", "Performance", "Programming Languages", "Robotics", "Social and Information Networks", "Software Engineering", "Sound", "Symbolic Computation", "Systems and Control"],
+     "Quantitative Biology": ["Biomolecules", "Cell Behavior", "Genomics", "Molecular Networks", "Neurons and Cognition", "Other Quantitative Biology", "Populations and Evolution", "Quantitative Methods", "Subcellular Processes", "Tissues and Organs"],
+     "Quantitative Finance": ["Computational Finance", "Economics", "General Finance", "Mathematical Finance", "Portfolio Management", "Pricing of Securities", "Risk Management", "Statistical Finance", "Trading and Market Microstructure"],
+     "Statistics": ["Applications", "Computation", "Machine Learning", "Methodology", "Other Statistics", "Statistics Theory"],
+     "Electrical Engineering and Systems Science": ["Audio and Speech Processing", "Image and Video Processing", "Signal Processing", "Systems and Control"],
+     "Economics": ["Econometrics", "General Economics", "Theoretical Economics"]
+ }
+
+
+ def generate_body(topic, categories, interest, threshold):
+     # Resolve the topic to its arXiv URL abbreviation
+     if topic == "Physics":
+         raise RuntimeError("You must choose a physics subtopic.")
+     elif topic in physics_topics:
+         abbr = physics_topics[topic]
+     elif topic in topics:
+         abbr = topics[topic]
+     else:
+         raise RuntimeError(f"Invalid topic {topic}")
+     if categories:
+         for category in categories:
+             if category not in category_map[topic]:
+                 raise RuntimeError(f"{category} is not a category of {topic}")
+         # Keep only papers whose subjects intersect the requested categories
+         papers = get_papers(abbr)
+         papers = [
+             t for t in papers
+             if bool(set(process_subject_fields(t['subjects'])) & set(categories))]
+     else:
+         papers = get_papers(abbr)
+     if interest:
+         # Score each paper against the interest statement; keep those above the threshold
+         relevancy, hallucination = generate_relevance_score(
+             papers,
+             query={"interest": interest},
+             threshold_score=threshold,
+             num_paper_in_prompt=8)
+         body = "<br><br>".join(
+             [f'Title: <a href="{paper["main_page"]}">{paper["title"]}</a><br>Authors: {paper["authors"]}<br>Score: {paper["Relevancy score"]}<br>Reason: {paper["Reasons for match"]}'
+              for paper in relevancy])
+         if hallucination:
+             body = "Warning: the model hallucinated some papers. We have tried to remove them, but the scores may not be accurate.<br><br>" + body
+     else:
+         # No interest statement: list everything, unranked
+         body = "<br><br>".join(
+             [f'Title: <a href="{paper["main_page"]}">{paper["title"]}</a><br>Authors: {paper["authors"]}'
+              for paper in papers])
+     return body
+
+
+ if __name__ == "__main__":
+     parser = argparse.ArgumentParser()
+     parser.add_argument("--config", help="yaml config file to use", default="config.yaml")
+     args = parser.parse_args()
+     with open(args.config, "r") as f:
+         config = yaml.safe_load(f)
+     if "OPENAI_API_KEY" not in os.environ:
+         raise RuntimeError("No OpenAI API key found")
+
+     topic = config["topic"]
+     categories = config["categories"]
+     from_email = config.get("from_email") or os.environ.get("FROM_EMAIL")
+     to_email = config.get("to_email") or os.environ.get("TO_EMAIL")
+     threshold = config["threshold"]
+     interest = config["interest"]
+     with open("digest.html", "w") as f:
+         body = generate_body(topic, categories, interest, threshold)
+         f.write(body)
+     if os.environ.get('SENDGRID_API_KEY', None):
+         sg = SendGridAPIClient(api_key=os.environ.get('SENDGRID_API_KEY'))
+         from_email = Email(from_email)  # Must be your verified sender
+         to_email = To(to_email)
+         subject = date.today().strftime("Personalized arXiv Digest, %d %b %Y")
+         content = Content("text/html", body)
+         mail = Mail(from_email, to_email, subject, content)
+         mail_json = mail.get()
+
+         # Send an HTTP POST request to /mail/send
+         response = sg.client.mail.send.post(request_body=mail_json)
+         if 200 <= response.status_code < 300:
+             print("Send email: Success!")
+         else:
+             print(f"Send email: Failure ({response.status_code}, {response.text})")
+     else:
+         print("No SendGrid API key found. Skipping email.")
src/app.py ADDED
@@ -0,0 +1,170 @@
+ import gradio as gr
+ from download_new_papers import get_papers
+ from relevancy import generate_relevance_score, process_subject_fields
+ from sendgrid.helpers.mail import Mail, Email, To, Content
+ import sendgrid
+ import os
+
+ topics = {
+     "Physics": "",
+     "Mathematics": "math",
+     "Computer Science": "cs",
+     "Quantitative Biology": "q-bio",
+     "Quantitative Finance": "q-fin",
+     "Statistics": "stat",
+     "Electrical Engineering and Systems Science": "eess",
+     "Economics": "econ"
+ }
+
+ physics_topics = {
+     "Astrophysics": "astro-ph",
+     "Condensed Matter": "cond-mat",
+     "General Relativity and Quantum Cosmology": "gr-qc",
+     "High Energy Physics - Experiment": "hep-ex",
+     "High Energy Physics - Lattice": "hep-lat",
+     "High Energy Physics - Phenomenology": "hep-ph",
+     "High Energy Physics - Theory": "hep-th",
+     "Mathematical Physics": "math-ph",
+     "Nonlinear Sciences": "nlin",
+     "Nuclear Experiment": "nucl-ex",
+     "Nuclear Theory": "nucl-th",
+     "Physics": "physics",
+     "Quantum Physics": "quant-ph"
+ }
+
+ categories_map = {
+     "Astrophysics": ["Astrophysics of Galaxies", "Cosmology and Nongalactic Astrophysics", "Earth and Planetary Astrophysics", "High Energy Astrophysical Phenomena", "Instrumentation and Methods for Astrophysics", "Solar and Stellar Astrophysics"],
+     "Condensed Matter": ["Disordered Systems and Neural Networks", "Materials Science", "Mesoscale and Nanoscale Physics", "Other Condensed Matter", "Quantum Gases", "Soft Condensed Matter", "Statistical Mechanics", "Strongly Correlated Electrons", "Superconductivity"],
+     "General Relativity and Quantum Cosmology": ["None"],
+     "High Energy Physics - Experiment": ["None"],
+     "High Energy Physics - Lattice": ["None"],
+     "High Energy Physics - Phenomenology": ["None"],
+     "High Energy Physics - Theory": ["None"],
+     "Mathematical Physics": ["None"],
+     "Nonlinear Sciences": ["Adaptation and Self-Organizing Systems", "Cellular Automata and Lattice Gases", "Chaotic Dynamics", "Exactly Solvable and Integrable Systems", "Pattern Formation and Solitons"],
+     "Nuclear Experiment": ["None"],
+     "Nuclear Theory": ["None"],
+     "Physics": ["Accelerator Physics", "Applied Physics", "Atmospheric and Oceanic Physics", "Atomic and Molecular Clusters", "Atomic Physics", "Biological Physics", "Chemical Physics", "Classical Physics", "Computational Physics", "Data Analysis, Statistics and Probability", "Fluid Dynamics", "General Physics", "Geophysics", "History and Philosophy of Physics", "Instrumentation and Detectors", "Medical Physics", "Optics", "Physics and Society", "Physics Education", "Plasma Physics", "Popular Physics", "Space Physics"],
+     "Quantum Physics": ["None"],
+     "Mathematics": ["Algebraic Geometry", "Algebraic Topology", "Analysis of PDEs", "Category Theory", "Classical Analysis and ODEs", "Combinatorics", "Commutative Algebra", "Complex Variables", "Differential Geometry", "Dynamical Systems", "Functional Analysis", "General Mathematics", "General Topology", "Geometric Topology", "Group Theory", "History and Overview", "Information Theory", "K-Theory and Homology", "Logic", "Mathematical Physics", "Metric Geometry", "Number Theory", "Numerical Analysis", "Operator Algebras", "Optimization and Control", "Probability", "Quantum Algebra", "Representation Theory", "Rings and Algebras", "Spectral Theory", "Statistics Theory", "Symplectic Geometry"],
+     "Computer Science": ["Artificial Intelligence", "Computation and Language", "Computational Complexity", "Computational Engineering, Finance, and Science", "Computational Geometry", "Computer Science and Game Theory", "Computer Vision and Pattern Recognition", "Computers and Society", "Cryptography and Security", "Data Structures and Algorithms", "Databases", "Digital Libraries", "Discrete Mathematics", "Distributed, Parallel, and Cluster Computing", "Emerging Technologies", "Formal Languages and Automata Theory", "General Literature", "Graphics", "Hardware Architecture", "Human-Computer Interaction", "Information Retrieval", "Information Theory", "Logic in Computer Science", "Machine Learning", "Mathematical Software", "Multiagent Systems", "Multimedia", "Networking and Internet Architecture", "Neural and Evolutionary Computing", "Numerical Analysis", "Operating Systems", "Other Computer Science", "Performance", "Programming Languages", "Robotics", "Social and Information Networks", "Software Engineering", "Sound", "Symbolic Computation", "Systems and Control"],
+     "Quantitative Biology": ["Biomolecules", "Cell Behavior", "Genomics", "Molecular Networks", "Neurons and Cognition", "Other Quantitative Biology", "Populations and Evolution", "Quantitative Methods", "Subcellular Processes", "Tissues and Organs"],
+     "Quantitative Finance": ["Computational Finance", "Economics", "General Finance", "Mathematical Finance", "Portfolio Management", "Pricing of Securities", "Risk Management", "Statistical Finance", "Trading and Market Microstructure"],
+     "Statistics": ["Applications", "Computation", "Machine Learning", "Methodology", "Other Statistics", "Statistics Theory"],
+     "Electrical Engineering and Systems Science": ["Audio and Speech Processing", "Image and Video Processing", "Signal Processing", "Systems and Control"],
+     "Economics": ["Econometrics", "General Economics", "Theoretical Economics"]
+ }
+
+
+ def sample(email, topic, physics_topic, categories, interest):
+     # Preview a handful of today's papers for the chosen topic/categories
+     if topic == "Physics":
+         if isinstance(physics_topic, list):
+             raise gr.Error("You must choose a physics topic.")
+         topic = physics_topic
+         abbr = physics_topics[topic]
+     else:
+         abbr = topics[topic]
+     if categories:
+         papers = get_papers(abbr)
+         papers = [
+             t for t in papers
+             if bool(set(process_subject_fields(t['subjects'])) & set(categories))][:4]
+     else:
+         papers = get_papers(abbr, limit=4)
+     if interest:
+         relevancy, _ = generate_relevance_score(
+             papers,
+             query={"interest": interest},
+             threshold_score=0,
+             num_paper_in_prompt=4)
+         return "\n\n".join([paper["summarized_text"] for paper in relevancy])
+     else:
+         return "\n\n".join(f"Title: {paper['title']}\nAuthors: {paper['authors']}" for paper in papers)
+
+
+ def change_subsubject(subject, physics_subject):
+     # Populate the subtopic dropdown for the currently selected subject
+     if subject != "Physics":
+         return gr.Dropdown.update(choices=categories_map[subject], value=[], visible=True)
+     else:
+         if physics_subject and not isinstance(physics_subject, list):
+             return gr.Dropdown.update(choices=categories_map[physics_subject], value=[], visible=True)
+         else:
+             return gr.Dropdown.update(choices=[], value=[], visible=False)
+
+
+ def change_physics(subject):
+     # Show the physics-category dropdown only when "Physics" is selected
+     if subject != "Physics":
+         return gr.Dropdown.update(visible=False, value=[])
+     else:
+         return gr.Dropdown.update(choices=list(physics_topics.keys()), visible=True)
+
+
+ def test(email, topic, physics_topic, categories, interest):
+     # Generate a digest and send it to the given address via SendGrid
+     if topic == "Physics":
+         if isinstance(physics_topic, list):
+             raise gr.Error("You must choose a physics topic.")
+         topic = physics_topic
+         abbr = physics_topics[topic]
+     else:
+         abbr = topics[topic]
+     if categories:
+         papers = get_papers(abbr)
+         papers = [
+             t for t in papers
+             if bool(set(process_subject_fields(t['subjects'])) & set(categories))][:4]
+     else:
+         papers = get_papers(abbr, limit=4)
+     if interest:
+         relevancy, hallucination = generate_relevance_score(
+             papers,
+             query={"interest": interest},
+             threshold_score=7,
+             num_paper_in_prompt=8)
+         body = "<br><br>".join([f'Title: <a href="{paper["main_page"]}">{paper["title"]}</a><br>Authors: {paper["authors"]}<br>Score: {paper["Relevancy score"]}<br>Reason: {paper["Reasons for match"]}' for paper in relevancy])
+         if hallucination:
+             body = "Warning: the model hallucinated some papers. We have tried to remove them, but the scores may not be accurate.<br><br>" + body
+     else:
+         body = "<br><br>".join([f'Title: <a href="{paper["main_page"]}">{paper["title"]}</a><br>Authors: {paper["authors"]}' for paper in papers])
+     sg = sendgrid.SendGridAPIClient(api_key=os.environ.get('SENDGRID_API_KEY'))
+     from_email = Email("[email protected]")  # Change to your verified sender
+     to_email = To(email)
+     subject = "arXiv digest"
+     content = Content("text/html", body)
+     mail = Mail(from_email, to_email, subject, content)
+     mail_json = mail.get()
+
+     # Send an HTTP POST request to /mail/send
+     response = sg.client.mail.send.post(request_body=mail_json)
+     if 200 <= response.status_code < 300:
+         return "Send test email: Success!"
+     else:
+         return f"Send test email: Failure ({response.status_code})"
+
+
+ with gr.Blocks() as demo:
+     with gr.Column():
+         email = gr.Textbox(label="Email address")
+         subject = gr.Radio(
+             list(topics.keys()), label="Topic to subscribe to"
+         )
+         physics_subject = gr.Dropdown(physics_topics, value=[], multiselect=False, label="Physics category", visible=False, info="")
+         subsubject = gr.Dropdown(
+             [], value=[], multiselect=True, label="Subtopic", info="", visible=False)
+         subject.change(fn=change_physics, inputs=[subject], outputs=physics_subject)
+         subject.change(fn=change_subsubject, inputs=[subject, physics_subject], outputs=subsubject)
+         physics_subject.change(fn=change_subsubject, inputs=[subject, physics_subject], outputs=subsubject)
+
+         interest = gr.Textbox(label="A natural language description of what you are interested in. Press enter to update.")
+         sample_output = gr.Textbox(label="Examples")
+         test_btn = gr.Button("Send email")
+         output = gr.Textbox(label="Test email status")
+         test_btn.click(fn=test, inputs=[email, subject, physics_subject, subsubject, interest], outputs=output)
+         subject.change(fn=sample, inputs=[email, subject, physics_subject, subsubject, interest], outputs=sample_output)
+         physics_subject.change(fn=sample, inputs=[email, subject, physics_subject, subsubject, interest], outputs=sample_output)
+         subsubject.change(fn=sample, inputs=[email, subject, physics_subject, subsubject, interest], outputs=sample_output)
+         interest.submit(fn=sample, inputs=[email, subject, physics_subject, subsubject, interest], outputs=sample_output)
+
+ demo.launch()
src/download_new_papers.py ADDED
@@ -0,0 +1,64 @@
+ # encoding: utf-8
+ import os
+ import tqdm
+ from bs4 import BeautifulSoup as bs
+ import urllib.request
+ import json
+ import datetime
+ import pytz
+
+
+ def _download_new_papers(field_abbr):
+     NEW_SUB_URL = f'https://arxiv.org/list/{field_abbr}/new'  # e.g. https://arxiv.org/list/cs/new
+     page = urllib.request.urlopen(NEW_SUB_URL)
+     soup = bs(page, "html.parser")
+     content = soup.body.find("div", {'id': 'content'})
+
+     # find the first h3 element in content
+     h3 = content.find("h3").text  # e.g.: New submissions for Wed, 10 May 23
+     date = h3.replace("New submissions for", "").strip()
+
+     # each <dt> holds the arXiv id and links; the matching <dd> holds title, authors, subjects and abstract
+     dt_list = content.dl.find_all("dt")
+     dd_list = content.dl.find_all("dd")
+     arxiv_base = "https://arxiv.org/abs/"
+
+     assert len(dt_list) == len(dd_list)
+     new_paper_list = []
+     for i in tqdm.tqdm(range(len(dt_list))):
+         paper = {}
+         paper_number = dt_list[i].text.strip().split(" ")[2].split(":")[-1]
+         paper['main_page'] = arxiv_base + paper_number
+         paper['pdf'] = arxiv_base.replace('abs', 'pdf') + paper_number
+
+         paper['title'] = dd_list[i].find("div", {"class": "list-title mathjax"}).text.replace("Title: ", "").strip()
+         paper['authors'] = dd_list[i].find("div", {"class": "list-authors"}).text \
+             .replace("Authors:\n", "").replace("\n", "").strip()
+         paper['subjects'] = dd_list[i].find("div", {"class": "list-subjects"}).text.replace("Subjects: ", "").strip()
+         paper['abstract'] = dd_list[i].find("p", {"class": "mathjax"}).text.replace("\n", " ").strip()
+         new_paper_list.append(paper)
+
+     # check if ./data exists; if not, create it
+     if not os.path.exists("./data"):
+         os.makedirs("./data")
+
+     # save new_paper_list to a jsonl file, with one paper dictionary per line
+     date = datetime.date.fromtimestamp(datetime.datetime.now(tz=pytz.timezone("America/New_York")).timestamp())
+     date = date.strftime("%a, %d %b %y")
+     with open(f"./data/{field_abbr}_{date}.jsonl", "w") as f:
+         for paper in new_paper_list:
+             f.write(json.dumps(paper) + "\n")
+
+
+ def get_papers(field_abbr, limit=None):
+     # Serve today's papers from the on-disk cache, downloading them first if needed
+     date = datetime.date.fromtimestamp(datetime.datetime.now(tz=pytz.timezone("America/New_York")).timestamp())
+     date = date.strftime("%a, %d %b %y")
+     if not os.path.exists(f"./data/{field_abbr}_{date}.jsonl"):
+         _download_new_papers(field_abbr)
+     results = []
+     with open(f"./data/{field_abbr}_{date}.jsonl", "r") as f:
+         for i, line in enumerate(f.readlines()):
+             if limit and i == limit:
+                 return results
+             results.append(json.loads(line))
+     return results
src/relevancy.py ADDED
@@ -0,0 +1,172 @@
+ """
+ run:
+ python -m relevancy run_all_day_paper \
+   --output_dir ./data \
+   --model_name="gpt-3.5-turbo" \
+ """
+ import time
+ import json
+ import os
+ import pprint
+ import random
+ import re
+ import string
+ from datetime import datetime
+
+ import numpy as np
+ import tqdm
+ import utils
+
+
+ def encode_prompt(query, prompt_papers):
+     """Encode multiple prompt instructions into a single string."""
+     prompt = open("src/relevancy_prompt.txt").read() + "\n"
+     prompt += query['interest']
+
+     for idx, task_dict in enumerate(prompt_papers):
+         (title, authors, abstract) = task_dict["title"], task_dict["authors"], task_dict["abstract"]
+         if not title:
+             raise ValueError(f"paper {idx + 1} is missing a title")
+         prompt += f"###\n"
+         prompt += f"{idx + 1}. Title: {title}\n"
+         prompt += f"{idx + 1}. Authors: {authors}\n"
+         prompt += f"{idx + 1}. Abstract: {abstract}\n"
+     prompt += f"\n Generate response:\n1."
+     print(prompt)
+     return prompt
+
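+ # The model is expected to reply with one JSON object per paper, in input order, e.g.
+ # (illustrative): 1. {"Relevancy score": "8/10", "Reasons for match": "..."}
+ # post_process_chat_gpt_response() parses those lines back onto the paper dicts below.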
+
+
+ def post_process_chat_gpt_response(paper_data, response, threshold_score=8):
+     selected_data = []
+     if response is None:
+         return [], False
+     json_items = response['message']['content'].replace("\n\n", "\n").split("\n")
+     pattern = r"^\d+\. "
+     try:
+         score_items = [json.loads(re.sub(pattern, "", line)) for line in json_items if "relevancy score" in line.lower()]
+     except Exception:
+         pprint.pprint([re.sub(pattern, "", line) for line in json_items if "relevancy score" in line.lower()])
+         raise RuntimeError("failed to parse the model response as JSON")
+     pprint.pprint(score_items)
+     scores = []
+     for item in score_items:
+         temp = item["Relevancy score"]
+         if "/" in temp:
+             scores.append(int(temp.split("/")[0]))
+         else:
+             scores.append(int(temp))
+     # If the model returned more items than papers, it hallucinated extra entries
+     if len(score_items) != len(paper_data):
+         score_items = score_items[:len(paper_data)]
+         hallucination = True
+     else:
+         hallucination = False
+
+     for idx, inst in enumerate(score_items):
+         # if the decoding stops due to length, the last example is likely truncated so we discard it
+         if scores[idx] < threshold_score:
+             continue
+         output_str = "Title: " + paper_data[idx]["title"] + "\n"
+         output_str += "Authors: " + paper_data[idx]["authors"] + "\n"
+         output_str += "Link: " + paper_data[idx]["main_page"] + "\n"
+         for key, value in inst.items():
+             paper_data[idx][key] = value
+             output_str += key + ": " + str(value) + "\n"
+         paper_data[idx]['summarized_text'] = output_str
+         selected_data.append(paper_data[idx])
+     return selected_data, hallucination
+
+
+ def find_word_in_string(w, s):
+     return re.compile(r"\b({0})\b".format(w), flags=re.IGNORECASE).search(s)
+
+
+ def process_subject_fields(subjects):
+     all_subjects = subjects.split(";")
+     all_subjects = [s.split(" (")[0] for s in all_subjects]
+     return all_subjects
+
+
+ def generate_relevance_score(
+     all_papers,
+     query,
+     model_name="gpt-3.5-turbo",
+     threshold_score=8,
+     num_paper_in_prompt=4,
+     temperature=0.4,
+     top_p=1.0,
+     sorting=True
+ ):
+     ans_data = []
+     request_idx = 1
+     hallucination = False
+     for id in tqdm.tqdm(range(0, len(all_papers), num_paper_in_prompt)):
+         prompt_papers = all_papers[id:id+num_paper_in_prompt]
+         # only sampling from the seed tasks
+         prompt = encode_prompt(query, prompt_papers)
+
+         decoding_args = utils.OpenAIDecodingArguments(
+             temperature=temperature,
+             n=1,
+             max_tokens=1072,  # hard-code to maximize the length. the requests will be automatically adjusted
+             top_p=top_p,
+         )
+         request_start = time.time()
+         response = utils.openai_completion(
+             prompts=prompt,
+             model_name=model_name,
+             batch_size=1,
+             decoding_args=decoding_args,
+             logit_bias={"100257": -100},  # prevent the <|endoftext|> token from being generated
+             # "100265":-100, "100276":-100 for <|im_end|> and <endofprompt> tokens
+         )
+         print("response", response['message']['content'])
+         request_duration = time.time() - request_start
+
+         process_start = time.time()
+         batch_data, hallu = post_process_chat_gpt_response(prompt_papers, response, threshold_score=threshold_score)
+         hallucination = hallucination or hallu
+         ans_data.extend(batch_data)
+
+         print(f"Request {request_idx} took {request_duration:.2f}s")
+         print(f"Post-processing took {time.time() - process_start:.2f}s")
+         request_idx += 1
+
+     if sorting:
+         # "Relevancy score" may be "8" or "8/10"; sort numerically on the leading integer
+         ans_data = sorted(ans_data, key=lambda x: int(str(x["Relevancy score"]).split("/")[0]), reverse=True)
+
+     return ans_data, hallucination
+
+
+ def run_all_day_paper(
+     query={"interest": "", "subjects": ["Computation and Language", "Artificial Intelligence"]},
+     date=None,
+     data_dir="../data",
+     model_name="gpt-3.5-turbo",
+     threshold_score=8,
+     num_paper_in_prompt=8,
+     temperature=0.4,
+     top_p=1.0
+ ):
+     if date is None:
+         date = datetime.today().strftime('%a, %d %b %y')
+         # string format such as Wed, 10 May 23
+     print("the date for the arxiv data is: ", date)
+
+     all_papers = [json.loads(l) for l in open(f"{data_dir}/{date}.jsonl", "r")]
+     print(f"We found {len(all_papers)} papers.")
+
+     all_papers_in_subjects = [
+         t for t in all_papers
+         if bool(set(process_subject_fields(t['subjects'])) & set(query['subjects']))
+     ]
+     print(f"After filtering subjects, we have {len(all_papers_in_subjects)} papers left.")
+     ans_data, _ = generate_relevance_score(all_papers_in_subjects, query, model_name, threshold_score, num_paper_in_prompt, temperature, top_p)
+     utils.write_ans_to_file(ans_data, date, output_dir="../outputs")
+     return ans_data
+
+
+ if __name__ == "__main__":
+     query = {"interest": """
+ 1. Large language model pretraining and finetuning
+ 2. Multimodal machine learning
+ 3. Do not care about specific applications, for example, information extraction, summarization, etc.
+ 4. Not interested in papers focusing on specific languages, e.g., Arabic, Chinese, etc.\n""",
+              "subjects": ["Computation and Language"]}
+     ans_data = run_all_day_paper(query)
src/relevancy_prompt.txt ADDED
@@ -0,0 +1,7 @@
+ You have been asked to read a list of a few arxiv papers, each with title, authors and abstract.
+ Based on my specific research interests, provide a relevancy score out of 10 for each paper, with a higher score indicating greater relevance. A relevancy score of more than 7 means the paper deserves the person's attention for details.
+ Additionally, please generate a 1-2 sentence summary for each paper explaining why it's relevant to my research interests.
+ Please keep the paper order the same as in the input list, with one json format per line. Example is:
+ 1. {"Relevancy score": "an integer score out of 10", "Reasons for match": "1-2 sentence short reasonings"}
+
+ My research interests are:
src/requirements.txt ADDED
@@ -0,0 +1,7 @@
+ beautifulsoup4==4.12.2
+ tqdm==4.65.0
+ pytz==2023.3
+ numpy==1.24.2
+ openai==0.27.4
+ sendgrid==6.10.0
+ pyyaml==6.0
src/utils.py ADDED
@@ -0,0 +1,148 @@
+ import dataclasses
+ import logging
+ import math
+ import os
+ import io
+ import sys
+ import time
+ import json
+ from typing import Optional, Sequence, Union
+
+ import openai
+ import tqdm
+ from openai import openai_object
+ import copy
+
+ StrOrOpenAIObject = Union[str, openai_object.OpenAIObject]
+
+ openai_org = os.getenv("OPENAI_ORG")
+ if openai_org is not None:
+     openai.organization = openai_org
+     logging.warning(f"Switching to organization: {openai_org} for OAI API key.")
+
+
+ @dataclasses.dataclass
+ class OpenAIDecodingArguments(object):
+     max_tokens: int = 1800
+     temperature: float = 0.2
+     top_p: float = 1.0
+     n: int = 1
+     stream: bool = False
+     stop: Optional[Sequence[str]] = None
+     presence_penalty: float = 0.0
+     frequency_penalty: float = 0.0
+     # logprobs: Optional[int] = None
+
+
+ def openai_completion(
+     prompts,  # : Union[str, Sequence[str], Sequence[dict[str, str]], dict[str, str]],
+     decoding_args: OpenAIDecodingArguments,
+     model_name="text-davinci-003",
+     sleep_time=2,
+     batch_size=1,
+     max_instances=sys.maxsize,
+     max_batches=sys.maxsize,
+     return_text=False,
+     **decoding_kwargs,
+ ) -> Union[Union[StrOrOpenAIObject], Sequence[StrOrOpenAIObject], Sequence[Sequence[StrOrOpenAIObject]]]:
+     """Decode with OpenAI API.
+
+     Args:
+         prompts: A string or a list of strings to complete. If it is a chat model the strings should be formatted
+             as explained here: https://github.com/openai/openai-python/blob/main/chatml.md. If it is a chat model
+             it can also be a dictionary (or list thereof) as explained here:
+             https://github.com/openai/openai-cookbook/blob/main/examples/How_to_format_inputs_to_ChatGPT_models.ipynb
+         decoding_args: Decoding arguments.
+         model_name: Model name. Can be either in the format of "org/model" or just "model".
+         sleep_time: Time to sleep once the rate-limit is hit.
+         batch_size: Number of prompts to send in a single request. Only for non chat model.
+         max_instances: Maximum number of prompts to decode.
+         max_batches: Maximum number of batches to decode. This argument will be deprecated in the future.
+         return_text: If True, return text instead of full completion object (which contains things like logprob).
+         decoding_kwargs: Additional decoding arguments. Pass in `best_of` and `logit_bias` if you need them.
+
+     Returns:
+         A completion or a list of completions.
+         Depending on return_text, return_openai_object, and decoding_args.n, the completion type can be one of
+             - a string (if return_text is True)
+             - an openai_object.OpenAIObject object (if return_text is False)
+             - a list of objects of the above types (if decoding_args.n > 1)
+     """
+     is_chat_model = "gpt-3.5" in model_name or "gpt-4" in model_name
+     is_single_prompt = isinstance(prompts, (str, dict))
+     if is_single_prompt:
+         prompts = [prompts]
+
+     if max_batches < sys.maxsize:
+         logging.warning(
+             "`max_batches` will be deprecated in the future, please use `max_instances` instead. "
+             "Setting `max_instances` to `max_batches * batch_size` for now."
+         )
+         max_instances = max_batches * batch_size
+
+     prompts = prompts[:max_instances]
+     num_prompts = len(prompts)
+     prompt_batches = [
+         prompts[batch_id * batch_size : (batch_id + 1) * batch_size]
+         for batch_id in range(int(math.ceil(num_prompts / batch_size)))
+     ]
+
+     completions = []
+     for batch_id, prompt_batch in tqdm.tqdm(
+         enumerate(prompt_batches),
+         desc="prompt_batches",
+         total=len(prompt_batches),
+     ):
+         batch_decoding_args = copy.deepcopy(decoding_args)  # cloning the decoding_args
+
+         while True:
+             try:
+                 shared_kwargs = dict(
+                     model=model_name,
+                     **batch_decoding_args.__dict__,
+                     **decoding_kwargs,
+                 )
+                 if is_chat_model:
+                     completion_batch = openai.ChatCompletion.create(
+                         messages=[
+                             {"role": "system", "content": "You are a helpful assistant."},
+                             {"role": "user", "content": prompt_batch[0]}
+                         ],
+                         **shared_kwargs
+                     )
+                 else:
+                     completion_batch = openai.Completion.create(prompt=prompt_batch, **shared_kwargs)
+
+                 choices = completion_batch.choices
+
+                 for choice in choices:
+                     choice["total_tokens"] = completion_batch.usage.total_tokens
+                 completions.extend(choices)
+                 break
+             except openai.error.OpenAIError as e:
+                 logging.warning(f"OpenAIError: {e}.")
+                 if "Please reduce your prompt" in str(e):
+                     batch_decoding_args.max_tokens = int(batch_decoding_args.max_tokens * 0.8)
+                     logging.warning(f"Reducing target length to {batch_decoding_args.max_tokens}, Retrying...")
+                 else:
+                     logging.warning("Hit request rate limit; retrying...")
+                     time.sleep(sleep_time)  # Annoying rate limit on requests.
+
+     if return_text:
+         completions = [completion.text for completion in completions]
+     if decoding_args.n > 1:
+         # make completions a nested list, where each entry is a consecutive decoding_args.n of original entries.
+         completions = [completions[i : i + decoding_args.n] for i in range(0, len(completions), decoding_args.n)]
+     if is_single_prompt:
+         # Return non-tuple if only 1 input and 1 generation.
+         (completions,) = completions
+     return completions
+
+
+ def write_ans_to_file(ans_data, file_prefix, output_dir="./output"):
+     # Entries may be dicts (scored papers), so serialize each one as a JSON line
+     if not os.path.exists(output_dir):
+         os.makedirs(output_dir)
+     filename = os.path.join(output_dir, file_prefix + ".txt")
+     with open(filename, "w") as f:
+         for ans in ans_data:
+             f.write(json.dumps(ans) + "\n")