---
metrics:
- accuracy
- precision
- recall
- f1
pipeline_tag: tabular-classification
tags:
- medical
- biology
- code
---

# HCC TIIC Random Forest Model

**Developed by:** Yifu (Evan) Zuo

This is a Random Forest classifier that automatically classifies tumor-infiltrating immune cells (TIICs) in hepatocellular carcinoma (HCC) tumor microenvironments into 40 categories, based on expression data from 107 CD45+ genes.

## How to use it

#### 1. Download the model from Files

Head to the Files tab of this repository and download the model. The RF model in pickle format is 2.1 GB.

#### 2. Create a New Interactive Python Notebook

Open Jupyter Notebook or Google Colab and create a new notebook file. This environment lets you run Python commands interactively and inspect the outputs step by step.

#### 3. Import Required Libraries

Start by importing the required libraries in your notebook:

```
import joblib
import pandas as pd
from sklearn.impute import SimpleImputer
import matplotlib.pyplot as plt
```

These libraries are needed to load the model, handle the data, and create visualizations.

#### 4. Load the Downloaded Model

Use the following command to load the model into your notebook:

```
loaded_rf_model = joblib.load('path_to_downloaded_model.pkl')
```

Replace `'path_to_downloaded_model.pkl'` with the actual file path of the downloaded model.

#### 5. Load the Data in CSV Format

Load the data from a CSV file: `data = pd.read_csv('path_to_csv_file.csv')`

- Each row should represent a cell.
- Each column should represent a gene.
- The required genes must be present in the data (see Step 9 for the full list).

Before loading the data in CSV format, make sure the UMI counts for each gene are normalized. The UMI counts should be scaled to 10,000, as is standard practice. R and Seurat are recommended for the conversion to CSV.

#### 7. Preprocess the Data for Model Compatibility

Prepare the data before feeding it to the model.
- Replace hyphens in column names with dots:

```
data.columns = data.columns.str.replace('-', '.')
```

- Drop irrelevant rows and columns, then rename the remaining columns with your mapping dictionary:

```
# Rename columns based on the mapping dictionary
data.rename(columns=feature_mapping, inplace=True)
```

`feature_mapping` is not defined in this guide. Make sure it is correctly defined in your own code (for example, as a dictionary that maps your column names to the gene names the model expects) before running this line.

#### 9. Select the Required Features for Prediction

Define the list of genes to be used by the model:

```
selected_features = [
    'CD3D', 'CD3E', 'CD3G', 'CCR7', 'LEF1', 'SELL', 'TCF7', 'S1PR1',
    'ANXA1', 'ANXA2', 'IL7R', 'CD74', 'TYROBP', 'CD4', 'HAVCR2', 'PDCD1',
    'GZMB', 'ITGAE', 'CXCL13', 'FOXP3', 'CTLA4', 'IL2RA', 'MKI67', 'STMN1',
    'CMC1', 'CD8A', 'CD8B', 'CX3CR1', 'KLRG1', 'FCGR3A', 'FGFBP2', 'GZMH',
    'GZMK', 'CCL4', 'CCL5', 'NKG7', 'KLRD1', 'KLRF1', 'GNLY', 'IL32',
    'SLC4A10', 'KLRB1', 'ZBTB16', 'NCR3', 'NCAM1', 'CCL3', 'IFNG', 'CD69',
    'HSPA1A', 'XCL1', 'AREG', 'CD160', 'TIGIT', 'CXCR4', 'ZNF331', 'DNAJB1',
    'HSPA1B', 'HSPA6', 'TUBB', 'CST3', 'LYZ', 'CD14', 'VCAN', 'S100A9',
    'RNASE2', 'S100A12', 'FCER1G', 'LST1', 'AIF1', 'IFITM3', 'CD1C', 'FCER1A',
    'CLEC10A', 'VEGFA', 'IRF4', 'RGS2', 'CLEC9A', 'IRF8', 'IDO1', 'CLNK',
    'XCR1', 'LAMP3', 'CD274', 'LTB', 'CCL19', 'CCL21', 'CD68', 'THBS1',
    'S100A8', 'CD163', 'SIGLEC1', 'C1QA', 'SLC40A1', 'GPNMB', 'APOE', 'SAT1',
    'HLA.DQB1', 'S100A4', 'HLA.DRA', 'HLA.DQA1', 'MARCO', 'CD79A', 'CPA3',
    'KIT', 'CD19', 'MS4A1', 'CD22',
]

# Keep only the 107 genes the model uses
X_test_data = data[selected_features]
```

#### 10. Handle Missing Values in the Data

Replace missing values (NaN) with the mean of each column using `SimpleImputer`:

```
imputer = SimpleImputer(strategy='mean')
# fit_transform returns a NumPy array, not a DataFrame
X_test_data = imputer.fit_transform(X_test_data)
```

#### 11. Make Predictions with the Loaded Model

Use the model to make predictions:

```
predictions = loaded_rf_model.predict(X_test_data)
```

#### 12. Add Predictions to the Data and Display the Updated Data

```
# Attach the predicted cell type to each cell
data['label'] = predictions
print(data.head())

# Bar chart of how many cells were assigned to each predicted type
plt.figure(figsize=(10, 4))
plt.title('Predicted Cell Type Distribution')
data['label'].value_counts().plot.bar(rot=0)
plt.show()
```
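
Selecting `data[selected_features]` in Step 9 raises a `KeyError` if any required gene is missing from the CSV, so it can help to check the column names first. The snippet below is a minimal sketch of such a check. The helper name `find_missing_genes` and the toy column names are illustrative assumptions and are not part of the original workflow.

```python
def find_missing_genes(columns, required_genes):
    """Return the required genes absent from `columns`, in their original order.

    Hyphens in column names are converted to dots first, mirroring the
    preprocessing in Step 7 (e.g. 'HLA-DRA' becomes 'HLA.DRA').
    """
    available = {str(name).replace('-', '.') for name in columns}
    return [gene for gene in required_genes if gene not in available]


# Toy example with made-up column names (not real data).
example_columns = ['CD3D', 'CD3E', 'HLA-DRA', 'ACTB']
example_required = ['CD3D', 'CD3E', 'CD3G', 'HLA.DRA']

missing = find_missing_genes(example_columns, example_required)
print(missing)  # ['CD3G']
```

In the notebook, the same check would be called as `find_missing_genes(data.columns, selected_features)` before Step 9. An empty list means every required gene is present.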