metadata

license: gpl-3.0
language:
  - en
pipeline_tag: text-classification
tags:
  - sbic
  - csr
datasets:
  - ia-nechaev/sbic-method2

sbic-method2

An updated version of the Standard-Based Impact Classification (SBIC) method for analyzing CSR reports in accordance with the GRI framework.

Labeled dataset (150 international companies, 230 CSR GRI reports, 2017-2021 period, 57k paragraphs, 360k sentences)
Dataset for prediction (150 German PLC companies, 1.2k CSR reports, 2010-2021 period, 645k paragraphs) (full dataset available upon request)
Calculated text embeddings of both datasets
Script to predict the labels

Instructions on how to run the code are given below.

Multilabel Classification Steps

This code classifies report text based on embeddings: it runs a cosine-similarity search, applies the K-Nearest Neighbor (KNN) algorithm to the closest matches, and passes the result through a sigmoid activation function.

Prerequisites

Ensure you have the following installed before running the script:

Python 3.8+
Required Python libraries (install using the command below)

pip install numpy pandas torch sentence-transformers scikit-learn

Input Files

Before running the script, make sure you have the following input files in the working directory:

Data Files:
- labeled dataset: labeled.csv
- dataset for prediction: prediction_demo.csv
Precomputed Embeddings:
- labeled dataset: embeddings_labeled.pkl
- dataset for prediction: embeddings_prediction.pkl
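
A quick pre-flight check can confirm that all four files are in place before the script starts. The sketch below is not part of the repository: the helper name `missing_inputs` is hypothetical, and only the file names are taken from the list above.

```python
from pathlib import Path

# File names as listed above; the helper itself is a hypothetical addition.
REQUIRED_FILES = (
    "labeled.csv",
    "prediction_demo.csv",
    "embeddings_labeled.pkl",
    "embeddings_prediction.pkl",
)


def missing_inputs(directory="."):
    """Return the required input files that are not present in `directory`."""
    base = Path(directory)
    return [name for name in REQUIRED_FILES if not (base / name).is_file()]


if __name__ == "__main__":
    missing = missing_inputs()
    if missing:
        print("Missing input files:", ", ".join(missing))
    else:
        print("All input files found.")
```

Running it from the working directory lists any file that still needs to be downloaded or copied.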

Running the Script

Run the script using the following command:

python script.py

Processing Steps

The script follows these main steps:

Load Data & Precomputed Embeddings
Perform Cosine Similarity Search: Finds the most relevant reports (sentences) using semantic_search from sentence-transformers.
Apply K-Nearest Neighbor (KNN) Algorithm: Selects top similar reports (sentences) and aggregates predictions.
Use Sigmoid Activation for Classification: Applies a threshold to generate final classification outputs.
Save Results: Exports df_results_0_50k.csv containing the processed data for the first 50k records.
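
The similarity, KNN, and sigmoid steps above can be illustrated with a minimal sketch. It is not the repository's script. It replaces `semantic_search` with a plain-Python cosine similarity so that it runs without dependencies. The aggregation rule (each neighbour votes with its similarity score), the function names, and the default values of `k` and `threshold` are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def predict_labels(query, labeled_embeddings, labeled_targets, k=3, threshold=0.5):
    """Return (labels, probabilities) for one query embedding."""
    # Step 1: cosine similarity between the query and every labeled embedding.
    scored = [(cosine(query, emb), idx) for idx, emb in enumerate(labeled_embeddings)]
    # Step 2: keep the k most similar labeled items (the nearest neighbours).
    neighbours = sorted(scored, reverse=True)[:k]
    # Step 3 (assumed aggregation): each neighbour votes +similarity for a label
    # it carries and -similarity for a label it lacks.
    n_labels = len(labeled_targets[0])
    votes = [0.0] * n_labels
    for score, idx in neighbours:
        for j, has_label in enumerate(labeled_targets[idx]):
            votes[j] += score if has_label else -score
    # Step 4: squash the votes with a sigmoid and apply the threshold.
    probabilities = [sigmoid(v) for v in votes]
    labels = [int(p >= threshold) for p in probabilities]
    return labels, probabilities


if __name__ == "__main__":
    # Toy 2-dimensional embeddings with two possible labels per item.
    embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
    targets = [[1, 0], [1, 0], [0, 1]]
    labels, _ = predict_labels([1.0, 0.05], embeddings, targets, k=2)
    print(labels)  # [1, 0]
```

With this voting rule a summed vote of zero maps to a probability of 0.5, so a threshold of 0.5 corresponds to a similarity-weighted majority among the neighbours.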

Output File

The processed results will be saved in: df_results_0_50k.csv

Execution Time

Execution time depends on the number of test samples and system resources. The script prints the total processing time upon completion.
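
To time a single step rather than the whole run, a small wrapper around `time.perf_counter` is enough. This is a sketch only: the helper name `timed` is hypothetical, and the script's own timing code may differ.

```python
import time


def timed(func, *args, **kwargs):
    """Run `func` and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start


if __name__ == "__main__":
    # Stand-in workload; replace with the step you want to measure.
    _, elapsed = timed(sum, range(1_000_000))
    print(f"Total processing time: {elapsed:.2f} s")
```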
