HCC TIIC Random Forest Model

Developed by: Yifu (Evan) Zuo

This is a Random Forest classifier that automatically classifies tumor-infiltrating immune cells (TIICs) in hepatocellular carcinoma (HCC) tumor microenvironments into 40 categories, based on expression data from 107 CD45+ genes.

How to use it

1. Download the model from Files

This is straightforward: head to the Files tab of this repository and download the model. The RF model in pickle format is 2.1 GB.

2. Create a New Interactive Python Notebook

Open Jupyter Notebook or Google Colab, and create a new notebook file. This environment will allow you to interactively run Python commands and visualize outputs step-by-step.

3. Import Required Libraries

Start by importing the required libraries in your notebook:

import joblib
import pandas as pd
from sklearn.impute import SimpleImputer
import matplotlib.pyplot as plt

These libraries are needed to load the model, handle the data, and create visualizations.

4. Load the Downloaded Model

Use the following command to load the model into your notebook:

loaded_rf_model = joblib.load('path_to_downloaded_model.pkl')

Replace 'path_to_downloaded_model.pkl' with the actual file path of the downloaded model.
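After loading, it can help to confirm that the object is what you expect. The sketch below is illustrative only: it trains a tiny stand-in classifier on made-up data (the names `describe_model`, `toy_model`, and the toy labels are hypothetical), saves and reloads it with joblib, and reads the fitted attributes `n_features_in_` and `classes_`. Going by the description above, the real model would presumably report 107 features and 40 classes.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestClassifier


def describe_model(model):
    """Return the number of input features and the class labels of a fitted classifier."""
    return model.n_features_in_, list(model.classes_)


# Toy stand-in for the real model: 5 features, 3 classes (made-up data).
rng = np.random.default_rng(0)
X_toy = rng.random((30, 5))
y_toy = np.repeat(["B cell", "NK cell", "T cell"], 10)
toy_model = RandomForestClassifier(n_estimators=5, random_state=0).fit(X_toy, y_toy)

# Round-trip through a pickle file, as you would with the downloaded model.
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "toy_model.pkl")
    joblib.dump(toy_model, model_path)
    reloaded = joblib.load(model_path)

n_features, class_labels = describe_model(reloaded)
print(n_features, class_labels)
```

With the real model, you would call `describe_model(loaded_rf_model)` instead. Pickled scikit-learn models are generally best loaded with the same scikit-learn version that saved them, and only from sources you trust.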

5. Load the Data in CSV Format

Read the CSV file into a pandas DataFrame:

data = pd.read_csv('path_to_csv_file.csv')

The CSV file should meet the following requirements:

• Each row should represent a cell.

• Each column should represent a gene.

• The required genes must be present in the data (see Step 9 for the full list).

Before loading the data in CSV format, make sure the UMI counts for each gene are normalized. The UMI counts should be scaled to 10,000, as is standard practice. R and Seurat are recommended for the conversion to CSV.
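If you prefer to normalize in Python rather than R, the scaling can be sketched as follows. This is a minimal illustration, assuming that "scaled to 10,000" means each cell's total count is rescaled to sum to 10,000; the function name `scale_to_10k` and the toy counts are hypothetical. Note that Seurat's default LogNormalize method also applies a log transform after scaling. This README does not state whether the model expects one, so the sketch stops at scaling.

```python
import pandas as pd


def scale_to_10k(counts: pd.DataFrame, target_sum: float = 10_000) -> pd.DataFrame:
    """Scale each row (cell) so that its counts sum to target_sum."""
    totals = counts.sum(axis=1)
    # Cells with no counts keep a divisor of 1 so they stay all-zero.
    totals = totals.where(totals > 0, 1)
    return counts.div(totals, axis=0) * target_sum


# Toy raw UMI counts: rows are cells, columns are genes.
raw_counts = pd.DataFrame(
    {"CD3D": [2, 0, 5], "CD8A": [1, 0, 0], "NKG7": [1, 0, 5]},
    index=["cell_1", "cell_2", "cell_3"],
)
normalized = scale_to_10k(raw_counts)
print(normalized.sum(axis=1))
```

In a real dataset the per-cell total should be computed over all measured genes, so normalize before subsetting to the required genes.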

7. Preprocess the Data for Model Compatibility

Prepare the data before feeding it to the model.

• Replace hyphens in column names with dots:

data.columns = data.columns.str.replace('-', '.')

• Drop irrelevant rows and columns:

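# Rename columns based on the mapping dictionary
data.rename(columns=feature_mapping, inplace=True)

Ensure that feature_mapping is defined in your code before running this line. It should be a dictionary that maps the column names in your data to the gene names the model expects.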
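The preprocessing steps above might look like this on a small example. Everything here is an assumption for illustration: the toy column names, the `Unnamed: 0` barcode column (which pandas typically produces when a CSV was written with an unnamed row-name column), and the alias mapping `{"CD20": "MS4A1"}`. Your own `feature_mapping` depends on how your data is labeled.

```python
import pandas as pd

# Toy export: one barcode column plus two gene columns.
data = pd.DataFrame(
    {
        "Unnamed: 0": ["cell_1", "cell_2"],  # barcode column left by the export
        "HLA-DRA": [1.5, 0.0],
        "CD20": [0.0, 3.2],
    }
)

# Use cell barcodes as the index so they are not mistaken for a gene column.
data = data.set_index("Unnamed: 0").rename_axis("cell")

# Replace hyphens with dots so names match the model's feature names.
data.columns = data.columns.str.replace("-", ".", regex=False)

# Hypothetical alias mapping; adjust it to your own data.
feature_mapping = {"CD20": "MS4A1"}
data = data.rename(columns=feature_mapping)

print(list(data.columns))
```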

9. Select the Required Features for Prediction

Define the list of genes to be used by the model:

selected_features = ['CD3D', 'CD3E', 'CD3G', 'CCR7', 'LEF1', 'SELL', 'TCF7', 'S1PR1', 'ANXA1', 'ANXA2', 
                     'IL7R', 'CD74', 'TYROBP', 'CD4', 'HAVCR2', 'PDCD1', 'GZMB', 'ITGAE', 'CXCL13', 'FOXP3', 
                     'CTLA4', 'IL2RA', 'MKI67', 'STMN1', 'CMC1', 'CD8A', 'CD8B', 'CX3CR1', 'KLRG1', 'FCGR3A', 
                     'FGFBP2', 'GZMH', 'GZMK', 'CCL4', 'CCL5', 'NKG7', 'KLRD1', 'KLRF1', 'GNLY', 'IL32', 
                     'SLC4A10', 'KLRB1', 'ZBTB16', 'NCR3', 'NCAM1', 'CCL3', 'IFNG', 'CD69', 'HSPA1A', 
                     'XCL1', 'AREG', 'CD160', 'TIGIT', 'CXCR4', 'ZNF331', 'DNAJB1', 'HSPA1B', 'HSPA6', 
                     'TUBB', 'CST3', 'LYZ', 'CD14', 'VCAN', 'S100A9', 'RNASE2', 'S100A12', 'FCER1G', 'LST1', 
                     'AIF1', 'IFITM3', 'CD1C', 'FCER1A', 'CLEC10A', 'VEGFA', 'IRF4', 'RGS2', 'CLEC9A', 
                     'IRF8', 'IDO1', 'CLNK', 'XCR1', 'LAMP3', 'CD274', 'LTB', 'CCL19', 'CCL21', 'CD68', 
                     'THBS1', 'S100A8', 'CD163', 'SIGLEC1', 'C1QA', 'SLC40A1', 'GPNMB', 'APOE', 'SAT1', 
                     'HLA.DQB1', 'S100A4', 'HLA.DRA', 'HLA.DQA1', 'MARCO', 'CD79A', 'CPA3', 'KIT', 'CD19', 
                     'MS4A1', 'CD22']
X_test_data = data[selected_features]
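Selecting columns with `data[selected_features]` raises a `KeyError` if any required gene is absent, so it can be useful to check first. The helper below is a hypothetical sketch (the name `find_missing_features` and the toy data are not part of this repository) that lists which required genes are missing from a DataFrame.

```python
import pandas as pd


def find_missing_features(data: pd.DataFrame, required: list[str]) -> list[str]:
    """Return the required column names that are not present in data."""
    return [gene for gene in required if gene not in data.columns]


# Toy example with a shortened gene list.
required_demo = ["CD3D", "CD8A", "HLA.DRA"]
demo = pd.DataFrame({"CD3D": [1.0], "HLA.DRA": [0.5], "EXTRA": [9.0]})

missing = find_missing_features(demo, required_demo)
print(missing)  # ['CD8A']
```

With your own data, you would call `find_missing_features(data, selected_features)` and resolve any missing genes before selecting the columns.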

10. Handle Missing Values in the Data

Replace missing values (NaN) with the mean of each column using SimpleImputer:

imputer = SimpleImputer(strategy='mean')
X_test_data = imputer.fit_transform(X_test_data)
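One caveat: by default, `SimpleImputer` discards columns that contain only missing values, which would change the number of features passed to the model. As an alternative sketch (not the author's method), the same mean imputation can be done in pandas while keeping every column. The function name `impute_column_means` and the fallback value of 0 for all-missing columns are assumptions.

```python
import numpy as np
import pandas as pd


def impute_column_means(X: pd.DataFrame, fallback: float = 0.0) -> pd.DataFrame:
    """Fill NaN with each column's mean; all-NaN columns get the fallback value."""
    filled = X.fillna(X.mean())
    # Columns with no observed values have a NaN mean, so fill them separately.
    return filled.fillna(fallback)


# Toy data: CD3D has one gap, CD8A is entirely missing.
X_demo = pd.DataFrame(
    {"CD3D": [1.0, np.nan, 3.0], "CD8A": [np.nan, np.nan, np.nan]}
)
X_filled = impute_column_means(X_demo)
print(X_filled)
```

The result keeps the original shape and column names, so the feature count still matches what the model expects.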

11. Make Predictions with the Loaded Model

Use the model to make predictions:

predictions = loaded_rf_model.predict(X_test_data)
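Beyond hard labels, scikit-learn random forests expose `predict_proba`, which gives a per-class probability for each cell and can serve as a rough confidence score. The sketch below uses a toy classifier trained on made-up data (`toy_model` and the two labels are stand-ins, not the real model) to show how the top label and its probability can be collected side by side.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Toy stand-in for the loaded model: 4 features, 2 classes (made-up data).
rng = np.random.default_rng(0)
X_train = rng.random((40, 4))
y_train = np.repeat(["B cell", "T cell"], 20)
toy_model = RandomForestClassifier(n_estimators=10, random_state=0).fit(X_train, y_train)

# Class probabilities: one row per cell, one column per entry in classes_.
X_new = rng.random((5, 4))
probabilities = toy_model.predict_proba(X_new)

summary = pd.DataFrame(
    {
        "label": toy_model.classes_[probabilities.argmax(axis=1)],
        "confidence": probabilities.max(axis=1),
    }
)
print(summary)
```

With the real model, you would replace `toy_model` with `loaded_rf_model` and `X_new` with `X_test_data`. Cells with a low top probability may deserve a manual look.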

12. Add Predictions to the Data and Display the Updated Data

data['label'] = predictions
print(data.head())
plt.figure(figsize=(10, 4))
plt.title('Predicted Cell Type Distribution')
data['label'].value_counts().plot.bar(rot=0)
plt.show()
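With up to 40 categories, the bar chart's axis labels can become crowded, so a table of counts and percentages may be easier to read. The sketch below uses a toy `label` series in place of `data['label']`; the column names `n_cells` and `percent` are arbitrary choices.

```python
import pandas as pd

# Toy predictions standing in for data['label'].
labels = pd.Series(["T cell", "T cell", "B cell", "NK cell"], name="label")

# Count cells per predicted type and express each count as a percentage.
counts = labels.value_counts()
summary = pd.DataFrame(
    {"n_cells": counts, "percent": (counts / counts.sum() * 100).round(1)}
)
print(summary)
```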