Update README.md
README.md
CHANGED
@@ -5,7 +5,7 @@ An updated version of Standard-Based Impact Classification (SBIC) method of CSR
 ## Contents
 
 1. Labeled dataset (150 International companies, 230 CSR GRI reports, 2017-2021 period, 57k paragraphs, 360k sentences)
-2. Dataset for prediction (150 German PLC companies, 1.2k CSR reports, 2010-2021 period, 645k paragraphs) (available upon request)
+2. Dataset for prediction (150 German PLC companies, 1.2k CSR reports, 2010-2021 period, 645k paragraphs) (full dataset available upon request)
 3. Claculated text embeddings of both datasets
 4. Script to predict the labels
 
@@ -35,7 +35,7 @@ Before running the script, make sure you have the following input files in the w
 
 1. **Data Files**:
 - labeled dataset: `labeled.csv`
-- dataset for prediction: `
+- dataset for prediction: `prediction_demo.csv`
 
 2. **Precomputed Embeddings**:
 - labeled dataset: `embeddings_labeled.pkl`
@@ -57,7 +57,7 @@ The script follows these main steps:
 2. **Perform Cosine Similarity Search**: Finds the most relevant reports (sentences) using `semantic_search` from `sentence-transformers`.
 3. **Apply K-Nearest Neighbor (KNN) Algorithm**: Selects top similar reports (sentences) and aggregates predictions.
 4. **Use Sigmoid Activation for Classification**: Applies a threshold to generate final classification outputs.
-5. **Save Results**: Exports `df_results_0_50k.csv` containing the
+5. **Save Results**: Exports `df_results_0_50k.csv` containing the processed data for the first 50k of records.
 
 ## **Output File**
 
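
Note for readers of this diff: steps 2 to 4 in the last hunk (cosine similarity search, KNN aggregation, sigmoid plus threshold) can be illustrated with a small sketch. This is not the repository's script. The README names `semantic_search` from `sentence-transformers`; the sketch swaps in a plain-Python cosine similarity so it runs without dependencies. All function names, the choice of `k=3`, the signed similarity-weighted vote, and the `0.5` threshold are assumptions made for illustration only.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_neighbors(query, corpus, k):
    """Return (index, similarity) pairs for the k corpus vectors most similar to query."""
    scored = [(i, cosine_similarity(query, vec)) for i, vec in enumerate(corpus)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


def sigmoid(x):
    """Logistic function mapping a real score to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def knn_predict(query, corpus, labels, k=3, threshold=0.5):
    """Hypothetical aggregation: signed, similarity-weighted vote of the k nearest
    labeled vectors, squashed by a sigmoid and compared with a threshold.

    labels holds 0/1 values aligned with corpus. Returns (predicted_label, probability).
    """
    neighbors = top_k_neighbors(query, corpus, k)
    # Neighbors labeled 1 add their similarity to the score; neighbors labeled 0 subtract it.
    score = sum(sim if labels[i] == 1 else -sim for i, sim in neighbors)
    prob = sigmoid(score)
    return int(prob >= threshold), prob


# Toy stand-ins for the precomputed embeddings and their labels (invented data).
labeled_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
labeled_classes = [1, 1, 0, 0]

label, prob = knn_predict([1.0, 0.05], labeled_embeddings, labeled_classes)
print(label, round(prob, 3))
```

In the described pipeline, the toy vectors would be replaced by the embeddings loaded from files such as `embeddings_labeled.pkl`, and the predictions would be written to the results CSV.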