
metadata
title: CS-GY-6613 Project Milestone 3
colorFrom: blue
colorTo: red
sdk: streamlit
app_file: milestone-3.py
pinned: false

cs-gy-6613-project

Project for CS-GY-6613 Spring 2023

Milestone 4

Training

For documentation regarding the model training, please go to the Colab notebook here: https://github.com/dbleek/cs-gy-6613-project/blob/milestone-4/dmb443_csgy_6613_project_model_trainer.ipynb

Writing the App

Code: https://github.com/dbleek/cs-gy-6613-project/blob/main/milestone-3.py

First, I loaded the January 2016 HUPD data again and filtered out any applications in the validation dataset that were neither accepted nor rejected. In the absence of a test set, I drew applications only from the validation dataset, since they were used only during the validation phase of training. I then randomly selected five accepted applications and five rejected applications to use as my app's sample data.
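The filtering and sampling step can be sketched roughly as follows. This is a minimal illustration, not the app's actual code: the `decision` field name, the `ACCEPTED`/`REJECTED`/`PENDING` values, the `sample_balanced` helper, and the stand-in `validation` list are all assumptions made for the example.

```python
import random

def sample_balanced(records, per_class=5, seed=0):
    """Pick `per_class` random indices for each final decision (hypothetical helper)."""
    rng = random.Random(seed)
    sample = []
    for label in ("ACCEPTED", "REJECTED"):
        # Only applications with this final decision are eligible;
        # anything else (e.g. pending) is filtered out here.
        indices = [i for i, r in enumerate(records) if r["decision"] == label]
        sample.extend(rng.sample(indices, per_class))
    return sample

# Made-up stand-in for the validation split; real records carry many more fields.
validation = [{"decision": d} for d in ["ACCEPTED", "REJECTED", "PENDING"] * 10]

sample_indices = sample_balanced(validation)
print(sample_indices)
```

Fixing the seed keeps the same ten applications across app restarts, which is one reasonable choice when the samples are meant to be a stable demo set.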

I then loaded the model and the DistilBERT tokenizer as was done during training, except that I loaded the model trained on the HUPD data instead of the base DistilBERT model.

The patent numbers of the 10 sample applications are added as keys in a dictionary that maps each number to the application's index in the dataset. Whenever the selectbox is changed, a helper function called load_data uses this index to select the corresponding application from the dataset and populate the text inputs. These inputs include the application's title, decision, abstract, and claims, although only the last two are passed to the model.
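A rough sketch of that lookup, with the Streamlit wiring left out: the records, patent numbers, and field names below are made up for illustration, and this `load_data` returns the values rather than writing them into the page, which may differ from how the app's helper works.

```python
# Made-up stand-in for the filtered validation split (field names are assumptions).
dataset = [
    {"patent_number": "A-001", "title": "Widget", "decision": "ACCEPTED",
     "abstract": "A widget.", "claims": "1. A widget comprising a body."},
    {"patent_number": "A-002", "title": "Gadget", "decision": "REJECTED",
     "abstract": "A gadget.", "claims": "1. A gadget comprising a handle."},
]
sample_indices = [0, 1]  # indices of the sampled applications

# Patent number -> index in the dataset; the keys double as selectbox options.
patent_index = {dataset[i]["patent_number"]: i for i in sample_indices}

def load_data(patent_number):
    """Return the fields that would populate the text inputs for one application."""
    record = dataset[patent_index[patent_number]]
    return {key: record[key] for key in ("title", "decision", "abstract", "claims")}

fields = load_data("A-002")
# Only the abstract and claims are passed on to the model.
model_input = (fields["abstract"], fields["claims"])
print(model_input)
```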

When the user presses the "Get Patentability Score" button, the abstract and claims are submitted as a form and run through the tokenizer, just as they were during training. The tokens are then passed to the model. The model outputs the logits, which are then run through a softmax to calculate the predicted probability of each label (0 for rejected, 1 for accepted) for the input.
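To make the last step concrete, here is a small pure-Python softmax over a pair of logits. The logit values are made up, and the app itself presumably relies on its deep learning library's softmax rather than a hand-written one; this is only a sketch of the arithmetic.

```python
import math

def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    # Subtracting the largest logit avoids overflow without changing the result.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up logits in label order: index 0 = rejected, index 1 = accepted.
logits = [-0.4, 0.9]
p_rejected, p_accepted = softmax(logits)

# The probability of label 1 is what the page reports as the patentability score.
patentability_score = p_accepted
print(f"{patentability_score:.2f}")
```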

Finally, the probability of the application being accepted is displayed on the page as the application's patentability score. The user will need to scroll down to see the message.

From the selected samples, the model correctly predicts 7 out of 10 applications. In other words, for 7 applications, the patentability score is 0.5 or above when the USPTO accepted the application and less than 0.5 when the USPTO rejected it. This tracks with the 0.73 accuracy metric from training. The model seems to do a little better at correctly predicting accepted applications than rejected ones, which is understandable given the slight skew in the data I used to train it.
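The 0.5 threshold rule described above can be written out as a short check. The scores and decisions below are invented purely to show the calculation and are not the app's actual outputs; `is_correct` is a hypothetical helper name.

```python
def is_correct(score, decision, threshold=0.5):
    """A prediction counts as correct when the score falls on the decision's side of the threshold."""
    predicted = "ACCEPTED" if score >= threshold else "REJECTED"
    return predicted == decision

# Invented (score, USPTO decision) pairs, for illustration only.
results = [
    (0.81, "ACCEPTED"),  # correct: score is at or above 0.5
    (0.35, "REJECTED"),  # correct: score is below 0.5
    (0.62, "REJECTED"),  # wrong: a rejected application scored above 0.5
    (0.44, "ACCEPTED"),  # wrong: an accepted application scored below 0.5
]

correct = sum(is_correct(score, decision) for score, decision in results)
accuracy = correct / len(results)
print(correct, accuracy)
```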

Landing Page and Demo Video

https://sites.google.com/nyu.edu/dmb443-cs-gy-6613-project

Milestone 3

USPTO Patentability Classifier: https://huggingface.co/spaces/dbleek/cs-gy-6613-project-final

Milestone 2

Sentiment Analysis App: https://huggingface.co/spaces/dbleek/cs-gy-6613-project

Milestone 1

For Milestone 1, I used the quick start instructions from VS Code to connect to a remote Ubuntu container:

https://code.visualstudio.com/docs/devcontainers/containers#_quick-start-open-an-existing-folder-in-a-container
