smtnkc committed
Commit 6d42f96 · Parent: 59416a1

Using rank_by_scip

Files changed (4):
  1. INSTRUCTIONS.md +4 -6
  2. README.md +0 -105
  3. app.py +6 -7
  4. predict.py +4 -4
INSTRUCTIONS.md CHANGED
@@ -24,24 +24,21 @@ The output will be a dataframe with the following columns:
 | `log10(sc)` | Log-scaled semantic change |
 | `log10(sp)` | Log-scaled sequence probability |
 | `log10(ip)` | Log-scaled inverse perplexity |
-| `log10(gr)` | Log-scaled grammaticality where `gr = (sp + ip) / 2` |
 | `rank_by_sc` | Rank by semantic change |
 | `rank_by_sp` | Rank by sequence probability |
 | `rank_by_ip` | Rank by inverse perplexity |
-| `rank_by_gr` | Rank by grammaticality |
 | `rank_by_scsp` | Rank by semantic change + Rank by sequence probability |
 | `rank_by_scip` | Rank by semantic change + Rank by inverse perplexity |
-| `rank_by_scgr` | Rank by semantic change + Rank by grammaticality |
 
-**Note:** All ranks are in descending order, with the default sorting metric being `rank_by_scgr`.
+**Note:** All ranks are in descending order, with the default sorting metric being `rank_by_scip`.
 
 See [the output](https://huggingface.co/spaces/smtnkc/cov-snn-app/resolve/main/output.csv) for [the example target file](https://huggingface.co/spaces/smtnkc/cov-snn-app/resolve/main/target.csv).
 
 ### The Ranking Mechanism
 
-In the original implementation of [Constrained Semantic Change Search](https://www.science.org/doi/10.1126/science.abd7331) (CSCS), grammaticality (`gr`) is determined by sequence probability (`sp`). We propose a more robust metric for grammaticality by averaging sequence probability (`sp`) and inverse perplexity (`ip`).
+In the original implementation of [Constrained Semantic Change Search](https://www.science.org/doi/10.1126/science.abd7331) (CSCS), grammaticality (`gr`) is determined by sequence probability (`sp`). We propose using inverse perplexity (`ip`) as a more robust metric for grammaticality.
 
-Sequences with both high semantic change (`sc`) and high grammaticality (`gr`) are expected to have a greater escape potential. We rank the sequences in descending order, assigning smaller rank values to those with higher escape potential. Consequently, the output is sorted based on `rank_by_scgr`, with the top element possessing the smallest `rank_by_scgr` and indicating the sequence with the highest escape potential.
+Sequences with both high semantic change (`sc`) and high grammaticality (`gr`) are expected to have a greater escape potential. We rank the sequences in descending order, assigning smaller rank values to those with higher escape potential. Consequently, the output is sorted based on `rank_by_scip`, with the top element possessing the smallest `rank_by_scip` and indicating the sequence with the highest escape potential.
 
 ### Model Details
@@ -93,6 +90,7 @@ scanpy==1.9.3
 scikit-learn==1.2.2
 scipy==1.10.1
 plotly==5.24.1
+huggingface-hub==0.25.2
 torch-optimizer==0.3.0
 torchmetrics==0.9.0
 torch==1.12.1+cu113
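
The combined ranking that `rank_by_scip` implements can be sketched in a few lines of pandas. The score values below are made-up illustration numbers, not real model output:

```python
import pandas as pd

# Illustration values only; real sc/ip come from the model.
df = pd.DataFrame({
    "accession_id": ["A", "B", "C"],
    "sc": [410.8, 300.2, 512.1],  # semantic change
    "ip": [9e-05, 5e-05, 7e-05],  # inverse perplexity
})

# Ranks are assigned in descending order: the highest score gets rank 1.
df["rank_by_sc"] = df["sc"].rank(ascending=False).astype(int)
df["rank_by_ip"] = df["ip"].rank(ascending=False).astype(int)

# Combined rank: smaller values indicate higher escape potential.
df["rank_by_scip"] = df["rank_by_sc"] + df["rank_by_ip"]

# Sorting ascending puts the highest escape potential at the top.
df = df.sort_values("rank_by_scip").reset_index(drop=True)
```

Ties in the combined rank are possible (two sequences can sum to the same value), in which case their relative order in the output is arbitrary.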
README.md CHANGED
@@ -10,108 +10,3 @@ pinned: false
 license: cc-by-nc-nd-4.0
 short_description: Predict viral escape potential of novel SARS-CoV-2 variants
 ---
-
-## Running Instructions
-
-1. Upload a CSV file with the target sequences. CSV file must have ``accession_id`` and ``sequence`` columns. See [example target file](https://huggingface.co/spaces/smtnkc/cov-snn-app/resolve/main/target.csv) which includes 5 New Omicron (``EPI_ISL_177...``) and 5 Eris (``EPI_ISL_189...``) sequences.
-
-2. Application will compare the given sequences with the average Omicron embedding. This embedding has been generated using [2000 Omicron sequences](https://huggingface.co/spaces/smtnkc/cov-snn-app/resolve/main/omicron.csv).
-
-3. You will get a JSON response in the following format:
-
-```json
-{
-    "EPI_ISL_18905639": {
-        "sc": 410.788391,
-        "sp": 0.000113,
-        "ip": 9e-05,
-        "log10(sc)": 2.613618,
-        "log10(sp)": -3.946409,
-        "log10(ip)": -4.047602,
-        "rank_by_sc": 2,
-        "rank_by_sp": 3,
-        "rank_by_ip": 2,
-        "rank_by_scsp": 5,
-        "rank_by_scip": 4
-    },
-    ...
-}
-```
-
-where:
-
-* ``sc``: semantic change
-* ``sp``: sequence probability
-* ``ip``: inverse perplexity
-* ``rank_by_sc``: rank by semantic change (In descending order)
-* ``rank_by_sp``: rank by sequence probability (In descending order)
-* ``rank_by_ip``: rank by inverse perplexity (In descending order)
-* ``rank_by_scsp``: rank by semantic change + rank by sequence probability (Both descending order)
-* ``rank_by_scip``: rank by semantic change + rank by inverse perplexity (Both in descending order)
-
-
-## The Ranking Mechanism
-
-We measure grammaticality using either sequence probability (`sp`) or inverse perplexity (`ip`). Sequences that have a higher semantic change (`sc`) and higher grammaticality tend to have a higher escape potential. We rank sequences in descending order, meaning sequences with smaller rank values have a higher escape potential. In the JSON response, the sequences are sorted by ascending `rank_by_scsp`. Therefore, the top element has the smallest `rank_by_scsp`, representing the sequence with the highest escape potential.
-
-See [the response](https://huggingface.co/spaces/smtnkc/cov-snn-app/resolve/main/response.json) for [the example target file](https://huggingface.co/spaces/smtnkc/cov-snn-app/resolve/main/target.csv).
-
-
-## Model Details
-
-This application uses the model checkpoint with the highest zero-shot test accuracy (91.5%).
-
-### Training Parameters:
-
-| Parameter | Value |
-|---------|-----|
-| Base Model | CoV-RoBERTa_2048 |
-| Loss function | Contrastive |
-| Max Seq Len | 1280 |
-| Positive Set | Omicron |
-| Negative Set | Delta |
-| Pooling | Max |
-| ReLU | 0.2 |
-| Dropout | 0.1 |
-| Learning Rate | 0.001 |
-| Batch size | 32 |
-| Margin | 2.0 |
-| Epochs | [0, 9] |
-
-
-### Training Results:
-
-| Checkpoint | Test Loss | Test Acc | Zero-shot Loss | Zero-shot Acc |
-|---------|-----|-----|-----|-----|
-| 2 | 0.0236 | 99.7% | 1.0627 | 50.0% |
-| 4 | 0.2941 | 89.4% | 0.2286 | 91.5% |
-
-
-## Dependencies
-
-```bash
-conda create -n spaces python=3.8
-conda activate spaces
-pip install -r requirements.txt
-```
-
-**requirements.txt**
-```bash
-numpy==1.21.0
-pandas==2.0.2
-sentence-transformers==2.2.2
-transformers==4.30.2
-tokenizers==0.13.3
-scanpy==1.9.3
-scikit-learn==1.2.2
-scipy==1.10.1
-torch-optimizer==0.3.0
-torchmetrics==0.9.0
-torch==1.12.1+cu113
-torchvision==0.13.1+cu113
-torchaudio==0.12.1
---extra-index-url https://download.pytorch.org/whl/cu113
-```
-
-#### For More Information:
app.py CHANGED
@@ -53,7 +53,7 @@ def main():
     results_df = process_target_data(average_embedding, target_dataset)
 
     # Reverse the rank_sc_sp by subtracting it from the maximum rank value plus one
-    results_df['Escape Potential'] = results_df['rank_by_scgr'].max() + 1 - results_df['rank_by_scgr']
+    results_df['Escape Potential'] = results_df['rank_by_scip'].max() + 1 - results_df['rank_by_scip']
 
     # Create scatter plot with manual color assignment
     fig = px.scatter(
@@ -69,17 +69,17 @@
             "log10(sp)": True,  # display log10(sp)
             "log10(sc)": True,  # display log10(sc)
             "log10(ip)": True,  # display log10(ip)
-            "log10(gr)": True,  # display log10(gr)
+            # "log10(gr)": True,  # display log10(gr)
             "sp": False,  # display actual sp
             "sc": False,  # display actual sc
             "ip": False,  # display actual ip
-            "gr": False,  # display actual gr
+            # "gr": False,  # display actual gr
             "rank_by_sc": True,  # display rank by sc
             "rank_by_sp": True,  # display rank by sp
             "rank_by_ip": True,  # display rank by ip
             "rank_by_scsp": True,  # display rank by scsp
             "rank_by_scip": True,  # display rank by scip
-            "rank_by_scgr": True,  # display rank by scgr
+            # "rank_by_scgr": True,  # display rank by scgr
             "Escape Potential": False
         },
     )
@@ -120,9 +120,8 @@ def main():
 
     # Display the results as a DataFrame
     st.dataframe(results_df[["accession_id", "log10(sc)", "log10(sp)", "log10(ip)",
-                             "log10(gr)", "rank_by_sc", "rank_by_sp",
-                             "rank_by_ip", "rank_by_gr", "rank_by_scsp", "rank_by_scip",
-                             "rank_by_scgr"]], hide_index=True)
+                             "rank_by_sc", "rank_by_sp", "rank_by_ip", "rank_by_scsp", "rank_by_scip"
+                             ]], hide_index=True)
 
     # Display the README.md file
     st.markdown(readme_text)
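
The `Escape Potential` column in app.py simply flips the combined rank so that larger values mean higher escape potential, which reads more naturally as a plot axis. A minimal sketch of that reversal, with hypothetical rank values:

```python
import pandas as pd

# Hypothetical combined ranks: 1 = highest escape potential.
ranks = pd.Series([1, 2, 3, 4], name="rank_by_scip")

# max() + 1 - rank maps rank 1 to the largest value and the worst
# rank to 1, so the best sequence plots highest on the axis.
escape_potential = ranks.max() + 1 - ranks
```

The `+ 1` keeps the reversed scale strictly positive: without it, the worst-ranked sequence would map to 0.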
predict.py CHANGED
@@ -195,7 +195,7 @@ def get_sc_sp_ip(average_embedding, target_dataset):
 
 
 
-def get_results_dict(target_dataset, sc_scores, sp_scores, ip_scores):
+def get_results_df(target_dataset, sc_scores, sp_scores, ip_scores):
     # Create a DataFrame with the results
     results_df = target_dataset.copy()
     results_df["sc"] = sc_scores
@@ -239,7 +239,7 @@ def get_results_dict(target_dataset, sc_scores, sp_scores, ip_scores):
     return results_df
 
 
-def process_target_data(average_embedding, target_data):
-    sc_scores, sp_scores, ip_scores = get_sc_sp_ip(average_embedding, target_data)
-    results_df = get_results_dict(target_data, sc_scores, sp_scores, ip_scores)
+def process_target_data(average_embedding, target_dataset):
+    sc_scores, sp_scores, ip_scores = get_sc_sp_ip(average_embedding, target_dataset)
+    results_df = get_results_df(target_dataset, sc_scores, sp_scores, ip_scores)
     return results_df
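
The renamed predict.py helpers wire together as sketched below. The `get_sc_sp_ip` body here is a stub returning fixed values; the real function scores each sequence with the model and is not reproduced in this commit:

```python
import pandas as pd

def get_sc_sp_ip(average_embedding, target_dataset):
    # Stub: the real implementation computes semantic change, sequence
    # probability, and inverse perplexity per sequence.
    n = len(target_dataset)
    return [1.0] * n, [0.5] * n, [0.25] * n

def get_results_df(target_dataset, sc_scores, sp_scores, ip_scores):
    # Attach the three scores to a copy of the input dataframe,
    # leaving the caller's dataframe unmodified.
    results_df = target_dataset.copy()
    results_df["sc"] = sc_scores
    results_df["sp"] = sp_scores
    results_df["ip"] = ip_scores
    return results_df

def process_target_data(average_embedding, target_dataset):
    sc_scores, sp_scores, ip_scores = get_sc_sp_ip(average_embedding, target_dataset)
    return get_results_df(target_dataset, sc_scores, sp_scores, ip_scores)

target = pd.DataFrame({"accession_id": ["A", "B"], "sequence": ["MFVF", "MFVL"]})
results = process_target_data(None, target)
```

Copying the input before attaching scores is the reason `get_results_df` calls `target_dataset.copy()`: the uploaded target dataframe stays untouched.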