smtnkc committed
Commit 6d42f96 · Parent: 59416a1

Using rank_by_scip

Files changed (4):
  1. INSTRUCTIONS.md +4 -6
  2. README.md +0 -105
  3. app.py +6 -7
  4. predict.py +4 -4
INSTRUCTIONS.md CHANGED
@@ -24,24 +24,21 @@ The output will be a dataframe with the following columns:
 | `log10(sc)` | Log-scaled semantic change |
 | `log10(sp)` | Log-scaled sequence probability |
 | `log10(ip)` | Log-scaled inverse perplexity |
-| `log10(gr)` | Log-scaled grammaticality where `gr = (sp + ip) / 2` |
 | `rank_by_sc` | Rank by semantic change |
 | `rank_by_sp` | Rank by sequence probability |
 | `rank_by_ip` | Rank by inverse perplexity |
-| `rank_by_gr` | Rank by grammaticality |
 | `rank_by_scsp` | Rank by semantic change + Rank by sequence probability |
 | `rank_by_scip` | Rank by semantic change + Rank by inverse perplexity |
-| `rank_by_scgr` | Rank by semantic change + Rank by grammaticality |
 
-**Note:** All ranks are in descending order, with the default sorting metric being `rank_by_scgr`.
+**Note:** All ranks are in descending order, with the default sorting metric being `rank_by_scip`.
 
 See [the output](https://huggingface.co/spaces/smtnkc/cov-snn-app/resolve/main/output.csv) for [the example target file](https://huggingface.co/spaces/smtnkc/cov-snn-app/resolve/main/target.csv).
 
 ### The Ranking Mechanism
 
-In the original implementation of [Constrained Semantic Change Search](https://www.science.org/doi/10.1126/science.abd7331) (CSCS), grammaticality (`gr`) is determined by sequence probability (`sp`). We propose a more robust metric for grammaticality by averaging sequence probability (`sp`) and inverse perplexity (`ip`).
+In the original implementation of [Constrained Semantic Change Search](https://www.science.org/doi/10.1126/science.abd7331) (CSCS), grammaticality (`gr`) is determined by sequence probability (`sp`). We propose using inverse perplexity (`ip`) as a more robust metric for grammaticality.
 
-Sequences with both high semantic change (`sc`) and high grammaticality (`gr`) are expected to have a greater escape potential. We rank the sequences in descending order, assigning smaller rank values to those with higher escape potential. Consequently, the output is sorted based on `rank_by_scgr`, with the top element possessing the smallest `rank_by_scgr` and indicating the sequence with the highest escape potential.
+Sequences with both high semantic change (`sc`) and high grammaticality (`gr`) are expected to have a greater escape potential. We rank the sequences in descending order, assigning smaller rank values to those with higher escape potential. Consequently, the output is sorted based on `rank_by_scip`, with the top element possessing the smallest `rank_by_scip` and indicating the sequence with the highest escape potential.
 
 ### Model Details
@@ -93,6 +90,7 @@ scanpy==1.9.3
 scikit-learn==1.2.2
 scipy==1.10.1
 plotly==5.24.1
+huggingface-hub==0.25.2
 torch-optimizer==0.3.0
 torchmetrics==0.9.0
 torch==1.12.1+cu113
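
The combined ranking that `rank_by_scip` implements can be sketched in a few lines of pandas. The score values below are made-up illustration numbers, not real model output:

```python
import pandas as pd

# Illustration values only; real sc/ip come from the model.
df = pd.DataFrame({
    "accession_id": ["A", "B", "C"],
    "sc": [410.8, 300.2, 512.1],  # semantic change
    "ip": [9e-05, 5e-05, 7e-05],  # inverse perplexity
})

# Ranks are assigned in descending order: the highest score gets rank 1.
df["rank_by_sc"] = df["sc"].rank(ascending=False).astype(int)
df["rank_by_ip"] = df["ip"].rank(ascending=False).astype(int)

# Combined rank: smaller values indicate higher escape potential.
df["rank_by_scip"] = df["rank_by_sc"] + df["rank_by_ip"]

# Sorting ascending puts the highest escape potential at the top.
df = df.sort_values("rank_by_scip").reset_index(drop=True)
```

Ties in the combined rank are possible (two sequences can sum to the same value), in which case their relative order in the output is arbitrary.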
README.md CHANGED
@@ -10,108 +10,3 @@ pinned: false
 license: cc-by-nc-nd-4.0
 short_description: Predict viral escape potential of novel SARS-CoV-2 variants
 ---
-
-## Running Instructions
-
-1. Upload a CSV file with the target sequences. CSV file must have ``accession_id`` and ``sequence`` columns. See [example target file](https://huggingface.co/spaces/smtnkc/cov-snn-app/resolve/main/target.csv) which includes 5 New Omicron (``EPI_ISL_177...``) and 5 Eris (``EPI_ISL_189...``) sequences.
-
-2. Application will compare the given sequences with the average Omicron embedding. This embedding has been generated using [2000 Omicron sequences](https://huggingface.co/spaces/smtnkc/cov-snn-app/resolve/main/omicron.csv).
-
-3. You will get a JSON response in the following format:
-
-```json
-{
-    "EPI_ISL_18905639": {
-        "sc": 410.788391,
-        "sp": 0.000113,
-        "ip": 9e-05,
-        "log10(sc)": 2.613618,
-        "log10(sp)": -3.946409,
-        "log10(ip)": -4.047602,
-        "rank_by_sc": 2,
-        "rank_by_sp": 3,
-        "rank_by_ip": 2,
-        "rank_by_scsp": 5,
-        "rank_by_scip": 4
-    },
-    ...
-}
-```
-
-where:
-
-* ``sc``: semantic change
-* ``sp``: sequence probability
-* ``ip``: inverse perplexity
-* ``rank_by_sc``: rank by semantic change (In descending order)
-* ``rank_by_sp``: rank by sequence probability (In descending order)
-* ``rank_by_ip``: rank by inverse perplexity (In descending order)
-* ``rank_by_scsp``: rank by semantic change + rank by sequence probability (Both descending order)
-* ``rank_by_scip``: rank by semantic change + rank by inverse perplexity (Both in descending order)
-
-
-## The Ranking Mechanism
-
-We measure grammaticality using either sequence probability (`sp`) or inverse perplexity (`ip`). Sequences that have a higher semantic change (`sc`) and higher grammaticality tend to have a higher escape potential. We rank sequences in descending order, meaning sequences with smaller rank values have a higher escape potential. In the JSON response, the sequences are sorted by ascending `rank_by_scsp`. Therefore, the top element has the smallest `rank_by_scsp`, representing the sequence with the highest escape potential.
-
-See [the response](https://huggingface.co/spaces/smtnkc/cov-snn-app/resolve/main/response.json) for [the example target file](https://huggingface.co/spaces/smtnkc/cov-snn-app/resolve/main/target.csv).
-
-
-## Model Details
-
-This application uses the model checkpoint with the highest zero-shot test accuracy (91.5%).
-
-### Training Parameters:
-
-| Parameter | Value |
-|---------|-----|
-| Base Model | CoV-RoBERTa_2048 |
-| Loss function | Contrastive |
-| Max Seq Len | 1280 |
-| Positive Set | Omicron |
-| Negative Set | Delta |
-| Pooling | Max |
-| ReLU | 0.2 |
-| Dropout | 0.1 |
-| Learning Rate | 0.001 |
-| Batch size | 32 |
-| Margin | 2.0 |
-| Epochs | [0, 9] |
-
-
-### Training Results:
-
-| Checkpoint | Test Loss | Test Acc | Zero-shot Loss | Zero-shot Acc |
-|---------|-----|-----|-----|-----|
-| 2 | 0.0236 | 99.7% | 1.0627 | 50.0% |
-| 4 | 0.2941 | 89.4% | 0.2286 | 91.5% |
-
-
-## Dependencies
-
-```bash
-conda create -n spaces python=3.8
-conda activate spaces
-pip install -r requirements.txt
-```
-
-**requirements.txt**
-```bash
-numpy==1.21.0
-pandas==2.0.2
-sentence-transformers==2.2.2
-transformers==4.30.2
-tokenizers==0.13.3
-scanpy==1.9.3
-scikit-learn==1.2.2
-scipy==1.10.1
-torch-optimizer==0.3.0
-torchmetrics==0.9.0
-torch==1.12.1+cu113
-torchvision==0.13.1+cu113
-torchaudio==0.12.1
---extra-index-url https://download.pytorch.org/whl/cu113
-```
-
-#### For More Information:
app.py CHANGED
@@ -53,7 +53,7 @@ def main():
     results_df = process_target_data(average_embedding, target_dataset)
 
     # Reverse the rank_sc_sp by subtracting it from the maximum rank value plus one
-    results_df['Escape Potential'] = results_df['rank_by_scgr'].max() + 1 - results_df['rank_by_scgr']
+    results_df['Escape Potential'] = results_df['rank_by_scip'].max() + 1 - results_df['rank_by_scip']
 
     # Create scatter plot with manual color assignment
     fig = px.scatter(
@@ -69,17 +69,17 @@
             "log10(sp)": True,  # display log10(sp)
             "log10(sc)": True,  # display log10(sc)
             "log10(ip)": True,  # display log10(ip)
-            "log10(gr)": True,  # display log10(gr)
+            # "log10(gr)": True,  # display log10(gr)
             "sp": False,  # display actual sp
             "sc": False,  # display actual sc
             "ip": False,  # display actual ip
-            "gr": False,  # display actual gr
+            # "gr": False,  # display actual gr
             "rank_by_sc": True,  # display rank by sc
             "rank_by_sp": True,  # display rank by sp
             "rank_by_ip": True,  # display rank by ip
             "rank_by_scsp": True,  # display rank by scsp
             "rank_by_scip": True,  # display rank by scip
-            "rank_by_scgr": True,  # display rank by scgr
+            # "rank_by_scgr": True,  # display rank by scgr
             "Escape Potential": False
         },
     )
@@ -120,9 +120,8 @@ def main():
 
     # Display the results as a DataFrame
     st.dataframe(results_df[["accession_id", "log10(sc)", "log10(sp)", "log10(ip)",
-                             "log10(gr)", "rank_by_sc", "rank_by_sp",
-                             "rank_by_ip", "rank_by_gr", "rank_by_scsp", "rank_by_scip",
-                             "rank_by_scgr"]], hide_index=True)
+                             "rank_by_sc", "rank_by_sp", "rank_by_ip", "rank_by_scsp", "rank_by_scip"
+                             ]], hide_index=True)
 
     # Display the README.md file
     st.markdown(readme_text)
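
The `Escape Potential` column in app.py simply flips the combined rank so that larger values mean higher escape potential, which reads more naturally as a plot axis. A minimal sketch of that reversal, with hypothetical rank values:

```python
import pandas as pd

# Hypothetical combined ranks: 1 = highest escape potential.
ranks = pd.Series([1, 2, 3, 4], name="rank_by_scip")

# max() + 1 - rank maps rank 1 to the largest value and the worst
# rank to 1, so the best sequence plots highest on the axis.
escape_potential = ranks.max() + 1 - ranks
```

The `+ 1` keeps the reversed scale strictly positive: without it, the worst-ranked sequence would map to 0.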
predict.py CHANGED
@@ -195,7 +195,7 @@ def get_sc_sp_ip(average_embedding, target_dataset):
 
 
 
-def get_results_dict(target_dataset, sc_scores, sp_scores, ip_scores):
+def get_results_df(target_dataset, sc_scores, sp_scores, ip_scores):
     # Create a DataFrame with the results
     results_df = target_dataset.copy()
     results_df["sc"] = sc_scores
@@ -239,7 +239,7 @@ def get_results_dict(target_dataset, sc_scores, sp_scores, ip_scores):
     return results_df
 
 
-def process_target_data(average_embedding, target_data):
-    sc_scores, sp_scores, ip_scores = get_sc_sp_ip(average_embedding, target_data)
-    results_df = get_results_dict(target_data, sc_scores, sp_scores, ip_scores)
+def process_target_data(average_embedding, target_dataset):
+    sc_scores, sp_scores, ip_scores = get_sc_sp_ip(average_embedding, target_dataset)
+    results_df = get_results_df(target_dataset, sc_scores, sp_scores, ip_scores)
     return results_df
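
The renamed predict.py helpers wire together as sketched below. The `get_sc_sp_ip` body here is a stub returning fixed values; the real function scores each sequence with the model and is not reproduced in this commit:

```python
import pandas as pd

def get_sc_sp_ip(average_embedding, target_dataset):
    # Stub: the real implementation computes semantic change, sequence
    # probability, and inverse perplexity per sequence.
    n = len(target_dataset)
    return [1.0] * n, [0.5] * n, [0.25] * n

def get_results_df(target_dataset, sc_scores, sp_scores, ip_scores):
    # Attach the three scores to a copy of the input dataframe,
    # leaving the caller's dataframe unmodified.
    results_df = target_dataset.copy()
    results_df["sc"] = sc_scores
    results_df["sp"] = sp_scores
    results_df["ip"] = ip_scores
    return results_df

def process_target_data(average_embedding, target_dataset):
    sc_scores, sp_scores, ip_scores = get_sc_sp_ip(average_embedding, target_dataset)
    return get_results_df(target_dataset, sc_scores, sp_scores, ip_scores)

target = pd.DataFrame({"accession_id": ["A", "B"], "sequence": ["MFVF", "MFVL"]})
results = process_target_data(None, target)
```

Copying the input before attaching scores is the reason `get_results_df` calls `target_dataset.copy()`: the uploaded target dataframe stays untouched.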