---
license: gpl-3.0
language:
- en
pipeline_tag: text-classification
tags:
- sbic
- csr
datasets:
- ia-nechaev/sbic-method2
---
# sbic-method2
An updated version of the **Standard-Based Impact Classification (SBIC) method** for CSR report analysis, aligned with the GRI framework.
## Contents
1. Labeled dataset (150 international companies, 230 CSR GRI reports, 2017-2021 period, 57k paragraphs, 360k sentences)
2. Dataset for prediction (150 German PLC companies, 1.2k CSR reports, 2010-2021 period, 645k paragraphs) (full dataset available upon request)
3. Calculated text embeddings of both datasets
4. Script to predict the labels
Instructions for running the code are provided below.
---
# **Multilabel Classification Steps**
This code performs report similarity search and classifies reports from their embeddings using **cosine similarity**, the **K-Nearest Neighbors (KNN) algorithm**, and a **sigmoid activation function**.
## **Prerequisites**
Ensure you have the following installed before running the script:
- Python 3.8+
- Required Python libraries (install using the command below)
```bash
pip install numpy pandas torch sentence-transformers scikit-learn
```
## **Input Files**
Before running the script, make sure you have the following input files in the working directory:
1. **Data Files**:
- labeled dataset: `labeled.csv`
- dataset for prediction: `prediction_demo.csv`
2. **Precomputed Embeddings**:
- labeled dataset: `embeddings_labeled.pkl`
- dataset for prediction: `embeddings_prediction.pkl`
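A minimal sketch of what loading these inputs looks like. The file names come from this README; the column names, embedding shapes, and the tiny stand-in files created here (so the sketch runs without the real data) are illustrative assumptions, not the actual dataset schema.

```python
import pickle
import numpy as np
import pandas as pd

# Create tiny stand-in files so the sketch runs end to end
# (replace with the real files from this repository).
pd.DataFrame({"text": ["a", "b"]}).to_csv("labeled.csv", index=False)
pd.DataFrame({"text": ["c"]}).to_csv("prediction_demo.csv", index=False)
with open("embeddings_labeled.pkl", "wb") as f:
    pickle.dump(np.random.rand(2, 384).astype("float32"), f)
with open("embeddings_prediction.pkl", "wb") as f:
    pickle.dump(np.random.rand(1, 384).astype("float32"), f)

# Loading, as the script is expected to do:
df_labeled = pd.read_csv("labeled.csv")
df_pred = pd.read_csv("prediction_demo.csv")
with open("embeddings_labeled.pkl", "rb") as f:
    emb_labeled = pickle.load(f)
with open("embeddings_prediction.pkl", "rb") as f:
    emb_pred = pickle.load(f)

print(emb_labeled.shape, emb_pred.shape)
```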
## **Running the Script**
Run the script using the following command:
```bash
python script.py
```
## **Processing Steps**
The script follows these main steps:
1. **Load Data & Pretrained Embeddings**
2. **Perform Cosine Similarity Search**: Finds the most relevant reports (sentences) using `semantic_search` from `sentence-transformers`.
3. **Apply the K-Nearest Neighbors (KNN) Algorithm**: Selects the top similar reports (sentences) and aggregates their label predictions.
4. **Use Sigmoid Activation for Classification**: Applies a sigmoid function and a threshold to generate the final multilabel outputs.
5. **Save Results**: Exports `df_results_0_50k.csv` containing the processed data for the first 50k of records.
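Steps 2-4 can be sketched on synthetic data as follows. The real script uses `semantic_search` from `sentence-transformers`; plain NumPy is used here so the sketch runs standalone, and the embedding size, number of classes, `k`, and the 0.5 threshold are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
emb_labeled = rng.normal(size=(100, 8))      # labeled-set embeddings
labels = rng.integers(0, 2, size=(100, 5))   # multilabel targets (5 classes)
emb_query = rng.normal(size=(3, 8))          # embeddings to classify

def normalize(x):
    return x / np.linalg.norm(x, axis=1, keepdims=True)

# Step 2: cosine similarity between queries and the labeled set
sims = normalize(emb_query) @ normalize(emb_labeled).T   # shape (3, 100)

# Step 3: KNN - take the k most similar labeled items and aggregate
# their label vectors, weighted by similarity
k = 10
idx = np.argsort(-sims, axis=1)[:, :k]                   # top-k neighbor indices
weights = np.take_along_axis(sims, idx, axis=1)
scores = np.einsum("qk,qkc->qc", weights, labels[idx]) / weights.sum(axis=1, keepdims=True)

# Step 4: sigmoid activation plus a threshold for the final labels
probs = 1 / (1 + np.exp(-scores))
preds = (probs >= 0.5).astype(int)                       # binary label matrix, shape (3, 5)
print(preds.shape)
```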
## **Output File**
The processed results will be saved in: `df_results_0_50k.csv`
## **Execution Time**
Execution time depends on the number of test samples and system resources. The script prints the total processing time upon completion.