ia-nechaev committed on
Commit
cdf1ffe
·
verified ·
1 Parent(s): c1c9fa4

Update README.md

  1. README.md +3 -3
README.md CHANGED
@@ -5,7 +5,7 @@ An updated version of Standard-Based Impact Classification (SBIC) method of CSR
  ## Contents
 
  1. Labeled dataset (150 International companies, 230 CSR GRI reports, 2017-2021 period, 57k paragraphs, 360k sentences)
- 2. Dataset for prediction (150 German PLC companies, 1.2k CSR reports, 2010-2021 period, 645k paragraphs) (available upon request)
+ 2. Dataset for prediction (150 German PLC companies, 1.2k CSR reports, 2010-2021 period, 645k paragraphs) (full dataset available upon request)
  3. Claculated text embeddings of both datasets
  4. Script to predict the labels
 
@@ -35,7 +35,7 @@ Before running the script, make sure you have the following input files in the w
 
  1. **Data Files**:
  - labeled dataset: `labeled.csv`
- - dataset for prediction: `prediction.csv`
+ - dataset for prediction: `prediction_demo.csv`
 
  2. **Precomputed Embeddings**:
  - labeled dataset: `embeddings_labeled.pkl`
@@ -57,7 +57,7 @@ The script follows these main steps:
  2. **Perform Cosine Similarity Search**: Finds the most relevant reports (sentences) using `semantic_search` from `sentence-transformers`.
  3. **Apply K-Nearest Neighbor (KNN) Algorithm**: Selects top similar reports (sentences) and aggregates predictions.
  4. **Use Sigmoid Activation for Classification**: Applies a threshold to generate final classification outputs.
- 5. **Save Results**: Exports `df_results_0_50k.csv` containing the processed data.
+ 5. **Save Results**: Exports `df_results_0_50k.csv` containing the processed data for the first 50k of records.
 
  ## **Output File**
 
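
For readers of this diff who want a feel for steps 2 to 4 in the last hunk (cosine similarity search, KNN aggregation, sigmoid thresholding), here is a minimal sketch. It is not the repository's script: it replaces `semantic_search` from `sentence-transformers` with a hand-written cosine similarity so that it runs on the standard library alone. The function names (`top_k`, `predict`), the similarity-weighted vote, `k=3`, and the `0.5` threshold are all assumptions made for illustration, and the toy embeddings and labels are invented.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_k(query, corpus, k):
    """Return the k (score, index) pairs most similar to the query."""
    scored = [(cosine(query, emb), idx) for idx, emb in enumerate(corpus)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def predict(query, corpus, labels, k=3, threshold=0.5):
    """Hypothetical aggregation: each neighbour votes +score or -score per
    class, and the summed vote is passed through a sigmoid and thresholded."""
    neighbours = top_k(query, corpus, k)
    n_classes = len(labels[0])
    preds = []
    for c in range(n_classes):
        vote = sum(
            score * (1.0 if labels[idx][c] else -1.0)
            for score, idx in neighbours
        )
        preds.append(int(sigmoid(vote) >= threshold))
    return preds


# Toy labeled embeddings with two classes (invented for illustration).
corpus = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
labels = [[1, 0], [1, 0], [0, 1], [0, 1]]

print(predict([1.0, 0.05], corpus, labels, k=3))  # prints [1, 0]
```

With real data, the toy lists would presumably be replaced by the precomputed embeddings and the labels from `labeled.csv`; how the actual script aggregates neighbours and where it sets its threshold is not visible in this diff.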