license: gpl-3.0
language:
- en
pipeline_tag: text-classification
tags:
- sbic
- csr
datasets:
- ia-nechaev/sbic-method2
sbic-method2
An updated version of the Standard-Based Impact Classification (SBIC) method for analyzing CSR reports in accordance with the GRI framework.
Contents
- Labeled dataset (150 International companies, 230 CSR GRI reports, 2017-2021 period, 57k paragraphs, 360k sentences)
- Dataset for prediction (150 German PLC companies, 1.2k CSR reports, 2010-2021 period, 645k paragraphs) (full dataset available upon request)
- Calculated text embeddings of both datasets
- Script to predict the labels
Instructions on how to run the code are given below.
Multilabel Classification Steps
This code classifies reports based on their embeddings. It combines a cosine-similarity search, the K-Nearest Neighbor (KNN) algorithm, and a sigmoid activation function.
Prerequisites
Ensure you have the following installed before running the script:
- Python 3.8+
- Required Python libraries (install using the command below)
pip install numpy pandas torch sentence-transformers scikit-learn
Input Files
Before running the script, make sure you have the following input files in the working directory:
Data Files:
- labeled dataset: labeled.csv
- dataset for prediction: prediction_demo.csv
Precomputed Embeddings:
- labeled dataset: embeddings_labeled.pkl
- dataset for prediction: embeddings_prediction.pkl
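A quick pre-flight check can confirm that the four input files are in place before the script starts. The sketch below is not part of the published script; the helper name `missing_inputs` is hypothetical, and only the file names are taken from the list above.

```python
from pathlib import Path

# File names as listed in this README.
REQUIRED_FILES = [
    "labeled.csv",
    "prediction_demo.csv",
    "embeddings_labeled.pkl",
    "embeddings_prediction.pkl",
]


def missing_inputs(directory="."):
    """Return the required input files that are not present in `directory`."""
    base = Path(directory)
    return [name for name in REQUIRED_FILES if not (base / name).is_file()]


if __name__ == "__main__":
    missing = missing_inputs()
    if missing:
        print("Missing input files:", ", ".join(missing))
    else:
        print("All input files found.")
```

Running this in the working directory lists whichever files still need to be downloaded.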
Running the Script
Run the script using the following command:
python script.py
Processing Steps
The script follows these main steps:
- Load Data & Precomputed Embeddings
- Perform Cosine Similarity Search: Finds the most relevant reports (sentences) using semantic_search from sentence-transformers.
- Apply K-Nearest Neighbor (KNN) Algorithm: Selects the top similar reports (sentences) and aggregates predictions.
- Use Sigmoid Activation for Classification: Applies a threshold to generate final classification outputs.
- Save Results: Exports df_results_0_50k.csv, containing the processed data for the first 50k records.
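The steps above can be illustrated on toy data. The following is a minimal sketch, not the published script: it uses a plain NumPy cosine top-k search in place of `semantic_search`, and the aggregation rule (mean of the neighbors' labels), the values of `k`, `threshold`, and `scale`, and all function names are assumptions made for illustration only.

```python
import numpy as np


def cosine_top_k(query_emb, corpus_emb, k):
    """Return (indices, scores) of the k most cosine-similar corpus rows per query."""
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    c = corpus_emb / np.linalg.norm(corpus_emb, axis=1, keepdims=True)
    sims = q @ c.T                              # shape: (n_query, n_corpus)
    idx = np.argsort(-sims, axis=1)[:, :k]      # k best matches per query
    return idx, np.take_along_axis(sims, idx, axis=1)


def knn_multilabel(query_emb, corpus_emb, corpus_labels, k=2, threshold=0.5, scale=10.0):
    """Assumed aggregation: average the neighbors' labels, squash, then threshold."""
    idx, _ = cosine_top_k(query_emb, corpus_emb, k)
    vote = corpus_labels[idx].mean(axis=1)      # shape: (n_query, n_labels), in [0, 1]
    prob = 1.0 / (1.0 + np.exp(-scale * (vote - 0.5)))  # sigmoid centered at 0.5
    return (prob >= threshold).astype(int), prob


# Toy stand-ins for the labeled and prediction embeddings (2-D for readability).
labeled_emb = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
labeled_y = np.array([[1, 0], [1, 0], [0, 1], [0, 1]])   # two hypothetical labels
query_emb = np.array([[1.0, 0.05], [0.05, 1.0]])

idx, scores = cosine_top_k(query_emb, labeled_emb, k=2)
pred, prob = knn_multilabel(query_emb, labeled_emb, labeled_y, k=2)
print(pred)
```

Because each label gets its own sigmoid score and threshold, a sentence can receive several labels at once, which is what makes the scheme multilabel rather than multiclass.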
Output File
The processed results will be saved in: df_results_0_50k.csv
Execution Time
Execution time depends on the number of test samples and system resources. The script prints the total processing time upon completion.
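For timing a smaller subset before committing to a full run, a wrapper along the following lines may help. It is a sketch only: the helper `timed` is hypothetical, and the summed range merely stands in for the real workload.

```python
import time


def timed(fn, *args, **kwargs):
    """Run fn and return (result, elapsed_seconds) using a monotonic clock."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


# Placeholder workload; replace with the call you want to measure.
result, elapsed = timed(sum, range(1_000_000))
print(f"Total processing time: {elapsed:.2f} s")
```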