smtnkc committed
Commit 6d42f96 · 1 parent: 59416a1
Using rank_by_scip

Files changed:
- INSTRUCTIONS.md +4 -6
- README.md +0 -105
- app.py +6 -7
- predict.py +4 -4
INSTRUCTIONS.md CHANGED
@@ -24,24 +24,21 @@ The output will be a dataframe with the following columns:
 | `log10(sc)` | Log-scaled semantic change |
 | `log10(sp)` | Log-scaled sequence probability |
 | `log10(ip)` | Log-scaled inverse perplexity |
-| `log10(gr)` | Log-scaled grammaticality where `gr = (sp + ip) / 2` |
 | `rank_by_sc` | Rank by semantic change |
 | `rank_by_sp` | Rank by sequence probability |
 | `rank_by_ip` | Rank by inverse perplexity |
-| `rank_by_gr` | Rank by grammaticality |
 | `rank_by_scsp` | Rank by semantic change + Rank by sequence probability |
 | `rank_by_scip` | Rank by semantic change + Rank by inverse perplexity |
-| `rank_by_scgr` | Rank by semantic change + Rank by grammaticality |
 
-**Note:** All ranks are in descending order, with the default sorting metric being `…
+**Note:** All ranks are in descending order, with the default sorting metric being `rank_by_scip`.
 
 See [the output](https://huggingface.co/spaces/smtnkc/cov-snn-app/resolve/main/output.csv) for [the example target file](https://huggingface.co/spaces/smtnkc/cov-snn-app/resolve/main/target.csv).
 
 ### The Ranking Mechanism
 
-In the original implementation of [Constrained Semantic Change Search](https://www.science.org/doi/10.1126/science.abd7331) (CSCS), grammaticality (`gr`) is determined by sequence probability (`sp`). We propose a more robust metric for grammaticality…
+In the original implementation of [Constrained Semantic Change Search](https://www.science.org/doi/10.1126/science.abd7331) (CSCS), grammaticality (`gr`) is determined by sequence probability (`sp`). We propose using inverse perplexity (`ip`) as a more robust metric for grammaticality.
 
-Sequences with both high semantic change (`sc`) and high grammaticality (`gr`) are expected to have a greater escape potential. We rank the sequences in descending order, assigning smaller rank values to those with higher escape potential. Consequently, the output is sorted based on `…
+Sequences with both high semantic change (`sc`) and high grammaticality (`gr`) are expected to have a greater escape potential. We rank the sequences in descending order, assigning smaller rank values to those with higher escape potential. Consequently, the output is sorted based on `rank_by_scip`, with the top element possessing the smallest `rank_by_scip` and indicating the sequence with the highest escape potential.
 
 ### Model Details
 
@@ -93,6 +90,7 @@ scanpy==1.9.3
 scikit-learn==1.2.2
 scipy==1.10.1
 plotly==5.24.1
+huggingface-hub==0.25.2
 torch-optimizer==0.3.0
 torchmetrics==0.9.0
 torch==1.12.1+cu113
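The ranking scheme that the updated INSTRUCTIONS.md describes can be sketched in a few lines of pandas (a minimal illustration; the accession IDs and score values below are invented, not taken from the app):

```python
import pandas as pd

# Toy scores for three candidate sequences (invented values).
df = pd.DataFrame({
    "accession_id": ["A", "B", "C"],
    "sc": [410.7, 120.3, 388.1],   # semantic change
    "ip": [9e-05, 2e-05, 7e-05],   # inverse perplexity
})

# Ranks are assigned in descending order: a higher score gets a smaller rank.
df["rank_by_sc"] = df["sc"].rank(ascending=False).astype(int)
df["rank_by_ip"] = df["ip"].rank(ascending=False).astype(int)

# Combined metric: sum of the two individual ranks.
df["rank_by_scip"] = df["rank_by_sc"] + df["rank_by_ip"]

# The output is sorted ascending by rank_by_scip, so the top row is the
# sequence with the highest escape potential.
df = df.sort_values("rank_by_scip").reset_index(drop=True)
print(df.loc[0, "accession_id"])  # → A
```

With tied scores, `rank()` would assign fractional ranks; the app's exact tie-breaking behavior is not visible in this diff.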
README.md CHANGED
@@ -10,108 +10,3 @@ pinned: false
 license: cc-by-nc-nd-4.0
 short_description: Predict viral escape potential of novel SARS-CoV-2 variants
 ---
-
-## Running Instructions
-
-1. Upload a CSV file with the target sequences. CSV file must have ``accession_id`` and ``sequence`` columns. See [example target file](https://huggingface.co/spaces/smtnkc/cov-snn-app/resolve/main/target.csv) which includes 5 New Omicron (``EPI_ISL_177...``) and 5 Eris (``EPI_ISL_189...``) sequences.
-
-2. Application will compare the given sequences with the average Omicron embedding. This embedding has been generated using [2000 Omicron sequences](https://huggingface.co/spaces/smtnkc/cov-snn-app/resolve/main/omicron.csv).
-
-3. You will get a JSON response in the following format:
-
-```json
-{
-  "EPI_ISL_18905639": {
-    "sc": 410.788391,
-    "sp": 0.000113,
-    "ip": 9e-05,
-    "log10(sc)": 2.613618,
-    "log10(sp)": -3.946409,
-    "log10(ip)": -4.047602,
-    "rank_by_sc": 2,
-    "rank_by_sp": 3,
-    "rank_by_ip": 2,
-    "rank_by_scsp": 5,
-    "rank_by_scip": 4
-  },
-  ...
-}
-```
-
-where:
-
-* ``sc``: semantic change
-* ``sp``: sequence probability
-* ``ip``: inverse perplexity
-* ``rank_by_sc``: rank by semantic change (in descending order)
-* ``rank_by_sp``: rank by sequence probability (in descending order)
-* ``rank_by_ip``: rank by inverse perplexity (in descending order)
-* ``rank_by_scsp``: rank by semantic change + rank by sequence probability (both in descending order)
-* ``rank_by_scip``: rank by semantic change + rank by inverse perplexity (both in descending order)
-
-
-## The Ranking Mechanism
-
-We measure grammaticality using either sequence probability (`sp`) or inverse perplexity (`ip`). Sequences that have a higher semantic change (`sc`) and higher grammaticality tend to have a higher escape potential. We rank sequences in descending order, meaning sequences with smaller rank values have a higher escape potential. In the JSON response, the sequences are sorted by ascending `rank_by_scsp`. Therefore, the top element has the smallest `rank_by_scsp`, representing the sequence with the highest escape potential.
-
-See [the response](https://huggingface.co/spaces/smtnkc/cov-snn-app/resolve/main/response.json) for [the example target file](https://huggingface.co/spaces/smtnkc/cov-snn-app/resolve/main/target.csv).
-
-
-## Model Details
-
-This application uses the model checkpoint with the highest zero-shot test accuracy (91.5%).
-
-### Training Parameters:
-
-| Parameter | Value |
-|---------|-----|
-| Base Model | CoV-RoBERTa_2048 |
-| Loss function | Contrastive |
-| Max Seq Len | 1280 |
-| Positive Set | Omicron |
-| Negative Set | Delta |
-| Pooling | Max |
-| ReLU | 0.2 |
-| Dropout | 0.1 |
-| Learning Rate | 0.001 |
-| Batch size | 32 |
-| Margin | 2.0 |
-| Epochs | [0, 9] |
-
-
-### Training Results:
-
-| Checkpoint | Test Loss | Test Acc | Zero-shot Loss | Zero-shot Acc |
-|---------|-----|-----|-----|-----|
-| 2 | 0.0236 | 99.7% | 1.0627 | 50.0% |
-| 4 | 0.2941 | 89.4% | 0.2286 | 91.5% |
-
-
-## Dependencies
-
-```bash
-conda create -n spaces python=3.8
-conda activate spaces
-pip install -r requirements.txt
-```
-
-**requirements.txt**
-```bash
-numpy==1.21.0
-pandas==2.0.2
-sentence-transformers==2.2.2
-transformers==4.30.2
-tokenizers==0.13.3
-scanpy==1.9.3
-scikit-learn==1.2.2
-scipy==1.10.1
-torch-optimizer==0.3.0
-torchmetrics==0.9.0
-torch==1.12.1+cu113
-torchvision==0.13.1+cu113
-torchaudio==0.12.1
---extra-index-url https://download.pytorch.org/whl/cu113
-```
-
-#### For More Information:
-[[email protected]](mailto:[email protected])
app.py CHANGED
@@ -53,7 +53,7 @@ def main():
     results_df = process_target_data(average_embedding, target_dataset)
 
     # Reverse the rank_sc_sp by subtracting it from the maximum rank value plus one
-    results_df['Escape Potential'] = results_df['…
+    results_df['Escape Potential'] = results_df['rank_by_scip'].max() + 1 - results_df['rank_by_scip']
 
     # Create scatter plot with manual color assignment
     fig = px.scatter(
@@ -69,17 +69,17 @@ def main():
         "log10(sp)": True, # display log10(sp)
         "log10(sc)": True, # display log10(sc)
         "log10(ip)": True, # display log10(ip)
-        "log10(gr)": True, # display log10(gr)
+        #"log10(gr)": True, # display log10(gr)
         "sp": False, # display actual sp
         "sc": False, # display actual sc
         "ip": False, # display actual ip
-        "gr": False, # display actual gr
+        #"gr": False, # display actual gr
         "rank_by_sc": True, # display rank by sc
         "rank_by_sp": True, # display rank by sp
         "rank_by_ip": True, # display rank by ip
         "rank_by_scsp": True, # display rank by scsp
         "rank_by_scip": True, # display rank by scip
-        "rank_by_scgr": True, # display rank by scgr
+        #"rank_by_scgr": True, # display rank by scgr
         "Escape Potential": False
     },
 )
@@ -120,9 +120,8 @@ def main():
 
     # Display the results as a DataFrame
     st.dataframe(results_df[["accession_id", "log10(sc)", "log10(sp)", "log10(ip)",
-                             "…
-                             …
-                             "rank_by_scgr"]], hide_index=True)
+                             "rank_by_sc", "rank_by_sp", "rank_by_ip", "rank_by_scsp", "rank_by_scip"
+                             ]], hide_index=True)
 
     # Display the README.md file
     st.markdown(readme_text)
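The new line 56 in app.py reflects each rank about the maximum so that the best (smallest) `rank_by_scip` becomes the largest plotted value. A standalone sketch with invented toy ranks:

```python
import pandas as pd

# Toy ranks for three sequences (invented values).
results_df = pd.DataFrame({"rank_by_scip": [4, 2, 6]})

# Subtract each rank from (max rank + 1): with max 6, rank 2 -> 5 and
# rank 6 -> 1, so smaller ranks (higher escape potential) plot larger.
results_df["Escape Potential"] = (
    results_df["rank_by_scip"].max() + 1 - results_df["rank_by_scip"]
)
print(results_df["Escape Potential"].tolist())  # → [3, 5, 1]
```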
predict.py CHANGED
@@ -195,7 +195,7 @@ def get_sc_sp_ip(average_embedding, target_dataset):
 
 
 
-def get_results_dict(target_dataset, sc_scores, sp_scores, ip_scores):
+def get_results_df(target_dataset, sc_scores, sp_scores, ip_scores):
     # Create a DataFrame with the results
     results_df = target_dataset.copy()
     results_df["sc"] = sc_scores
@@ -239,7 +239,7 @@ def get_results_dict(target_dataset, sc_scores, sp_scores, ip_scores):
     return results_df
 
 
-def process_target_data(average_embedding, …
-    sc_scores, sp_scores, ip_scores = get_sc_sp_ip(average_embedding, …
-    results_df = …
+def process_target_data(average_embedding, target_dataset):
+    sc_scores, sp_scores, ip_scores = get_sc_sp_ip(average_embedding, target_dataset)
+    results_df = get_results_df(target_dataset, sc_scores, sp_scores, ip_scores)
     return results_df
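The predict.py change splits result assembly (`get_results_df`) from orchestration (`process_target_data`). A rough sketch of how the two functions compose, with a stub standing in for the real `get_sc_sp_ip` model call (the stub scores and the three-column assembly shown here are assumptions for illustration):

```python
import pandas as pd

def get_results_df(target_dataset, sc_scores, sp_scores, ip_scores):
    # Assemble per-sequence scores into one results frame.
    results_df = target_dataset.copy()
    results_df["sc"] = sc_scores
    results_df["sp"] = sp_scores
    results_df["ip"] = ip_scores
    return results_df

def process_target_data(average_embedding, target_dataset):
    # Stub in place of the real get_sc_sp_ip(average_embedding, target_dataset);
    # the embedding is unused here.
    n = len(target_dataset)
    sc_scores, sp_scores, ip_scores = [1.0] * n, [0.5] * n, [0.1] * n
    return get_results_df(target_dataset, sc_scores, sp_scores, ip_scores)

# Hypothetical two-row target set matching the required CSV columns.
target = pd.DataFrame({"accession_id": ["A", "B"], "sequence": ["MFV", "MLV"]})
out = process_target_data(average_embedding=None, target_dataset=target)
print(list(out.columns))  # → ['accession_id', 'sequence', 'sc', 'sp', 'ip']
```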