---
license: gpl-3.0
language:
- en
pipeline_tag: text-classification
tags:
- sbic
- csr
datasets:
- ia-nechaev/sbic-method2
---

# sbic-method2

An updated version of the **Standard-Based Impact Classification (SBIC) method** for CSR report analysis, in accordance with the GRI framework.

## Contents

1. Labeled dataset (150 international companies, 230 CSR GRI reports, 2017-2021 period, 57k paragraphs, 360k sentences)
2. Dataset for prediction (150 German PLC companies, 1.2k CSR reports, 2010-2021 period, 645k paragraphs; full dataset available upon request)
3. Calculated text embeddings for both datasets
4. Script to predict the labels


Instructions for running the code are below.

---

# **Multilabel Classification Steps**  

This code performs report similarity search using **cosine similarity**, the **K-Nearest Neighbor (KNN) algorithm**, and a **Sigmoid activation function** to classify reports based on their embeddings.  

## **Prerequisites**  

Ensure you have the following installed before running the script:  

- Python 3.8+  
- Required Python libraries (install using the command below)  

```bash
pip install numpy pandas torch sentence-transformers scikit-learn
```

## **Input Files**  

Before running the script, make sure you have the following input files in the working directory:  

1. **Data Files**:  
   - labeled dataset: `labeled.csv`  
   - dataset for prediction: `prediction_demo.csv`  

2. **Precomputed Embeddings**:  
   - labeled dataset: `embeddings_labeled.pkl`  
   - dataset for prediction: `embeddings_prediction.pkl`
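
The inputs can be sanity-checked with a short loading snippet. The file names follow this README; the tiny stand-in files written at the top are only there so the snippet runs anywhere (in the real workflow these files already exist, so that first block can be skipped):

```python
import pickle

import numpy as np
import pandas as pd

# Stand-in files so the snippet is self-contained; skip this block when the
# real labeled.csv / prediction_demo.csv / *.pkl files are already present.
pd.DataFrame({"text": ["a", "b"]}).to_csv("labeled.csv", index=False)
pd.DataFrame({"text": ["c"]}).to_csv("prediction_demo.csv", index=False)
with open("embeddings_labeled.pkl", "wb") as f:
    pickle.dump(np.zeros((2, 4), dtype=np.float32), f)
with open("embeddings_prediction.pkl", "wb") as f:
    pickle.dump(np.zeros((1, 4), dtype=np.float32), f)

# Load the data files and their precomputed embeddings
df_labeled = pd.read_csv("labeled.csv")
df_pred = pd.read_csv("prediction_demo.csv")
with open("embeddings_labeled.pkl", "rb") as f:
    emb_labeled = pickle.load(f)
with open("embeddings_prediction.pkl", "rb") as f:
    emb_pred = pickle.load(f)

# Each dataset row should have exactly one embedding vector
assert len(emb_labeled) == len(df_labeled)
assert len(emb_pred) == len(df_pred)
```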
      
## **Running the Script**  

Run the script using the following command:  

```bash
python script.py
```

## **Processing Steps**  

The script follows these main steps:  

1. **Load Data & Pretrained Embeddings**  
2. **Perform Cosine Similarity Search**: Finds the most relevant reports (sentences) using `semantic_search` from `sentence-transformers`.  
3. **Apply K-Nearest Neighbor (KNN) Algorithm**: Selects top similar reports (sentences) and aggregates predictions.  
4. **Use Sigmoid Activation for Classification**: Applies a threshold to generate final classification outputs.  
5. **Save Results**: Exports `df_results_0_50k.csv` containing the processed data for the first 50,000 records.  
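
Steps 2-4 can be sketched with NumPy alone. This is a minimal illustration, not the script itself (which uses `semantic_search` from `sentence-transformers`); the synthetic embeddings, label matrix, `K`, and threshold values are all assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for the precomputed embeddings (unit-normalized rows)
emb_labeled = rng.normal(size=(6, 8))
emb_labeled /= np.linalg.norm(emb_labeled, axis=1, keepdims=True)
emb_pred = rng.normal(size=(2, 8))
emb_pred /= np.linalg.norm(emb_pred, axis=1, keepdims=True)

# Hypothetical multilabel matrix: one row per labeled sentence, 3 labels
labels = np.array([[1, 0, 0], [1, 1, 0], [0, 1, 0],
                   [0, 0, 1], [1, 0, 1], [0, 1, 1]], dtype=float)

K, THRESHOLD = 3, 0.5  # assumed neighborhood size and sigmoid cutoff

# Step 2: cosine similarity (rows are unit-normalized, so dot product = cosine)
sim = emb_pred @ emb_labeled.T                # shape (n_pred, n_labeled)

# Step 3: KNN -- indices of the K most similar labeled sentences per query
nn = np.argsort(-sim, axis=1)[:, :K]

# Step 4: similarity-weighted label vote, squashed with a sigmoid, thresholded
preds = []
for q in range(len(emb_pred)):
    logits = (sim[q, nn[q]][:, None] * labels[nn[q]]).sum(axis=0)
    preds.append((1.0 / (1.0 + np.exp(-logits)) > THRESHOLD).astype(int))
```

Each entry of `preds` is a binary label vector for one query sentence; the real script assembles these predictions into `df_results_0_50k.csv`.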

## **Output File**  

The processed results will be saved in: `df_results_0_50k.csv`  

## **Execution Time**

Execution time depends on the number of test samples and system resources. The script prints the total processing time upon completion.
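
A simple way to reproduce such timing in your own runs is with `time.perf_counter` (the exact mechanism the script uses is not specified here, so this is just one option):

```python
import time

start = time.perf_counter()
# ... run the classification pipeline here ...
elapsed = time.perf_counter() - start
print(f"Total processing time: {elapsed:.1f} s")
```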