Update README.md
README.md
@@ -1 +1,105 @@
---
metrics:
- accuracy
- precision
- recall
- f1
pipeline_tag: tabular-classification
tags:
- medical
- biology
- code
---
# HCC TIIC Random Forest Model
**Developed by:** Yifu (Evan) Zuo

This is a Random Forest classifier that automatically classifies tumor-infiltrating immune cells (TIICs) in the hepatocellular carcinoma (HCC) tumor microenvironment into 40 categories, based on expression data from 107 CD45+ genes.

## How to use it

#### 1. Download the model from Files
This is straightforward: open the Files tab of this repository and download the model. The RF model in pickle format is 2.1 GB.

#### 2. Create a New Interactive Python Notebook
Open Jupyter Notebook or Google Colab, and create a new notebook file. This environment will allow you to interactively run Python commands and visualize outputs step-by-step.

#### 3. Import Required Libraries
Start by importing the required libraries in your notebook:
```
import joblib
import pandas as pd
from sklearn.impute import SimpleImputer
import matplotlib.pyplot as plt
```

These libraries are needed to load the model, handle the data, and create visualizations.

#### 4. Load the Downloaded Model
Use the following command to load the model into your notebook:
```
loaded_rf_model = joblib.load('path_to_downloaded_model.pkl')
```
Replace `'path_to_downloaded_model.pkl'` with the actual file path of the downloaded model.

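Before predicting, it can help to inspect the loaded object. The sketch below is only an illustration: it trains and saves a tiny stand-in forest (`toy_model`, a hypothetical name) so that it runs without the large download, then reads back two standard scikit-learn attributes, `n_features_in_` and `classes_`. Going by the description above, the real model would be expected to report 107 features and 40 classes, but that is an assumption to verify, not a guarantee.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Tiny stand-in model (hypothetical); the real model is loaded the same way.
rng = np.random.default_rng(0)
X_toy = rng.random((20, 3))
y_toy = np.repeat(["type_a", "type_b"], 10)
toy_model = RandomForestClassifier(n_estimators=5, random_state=0).fit(X_toy, y_toy)

# Save and reload it, mirroring joblib.load('path_to_downloaded_model.pkl').
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "toy_rf_model.pkl")
    joblib.dump(toy_model, model_path)
    loaded_toy_model = joblib.load(model_path)

# Sanity checks worth running on any loaded classifier.
print("features expected:", loaded_toy_model.n_features_in_)
print("number of classes:", len(loaded_toy_model.classes_))
```

Running the same two `print` lines on `loaded_rf_model` shows how many input columns it expects and which labels it can return.
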
#### 5. Load the Data in CSV Format
Read the CSV file into a pandas DataFrame:
`data = pd.read_csv('path_to_csv_file.csv')`

• Each row should represent a cell.

• Each column should represent a gene.

• The required genes must be present in the data (see Step 9 for the full list).

Before loading the data in CSV format, make sure the UMI counts for each gene are normalized. The UMI counts should be scaled to 10,000 per cell, as is standard practice. R and Seurat are recommended for the conversion to CSV.

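If the counts are still raw when they reach Python, the scaling can be sketched in pandas. This is a minimal example under one assumption: that "scaled to 10,000" means each cell's counts are rescaled to sum to 10,000. Whether the model also expects a log transform (which Seurat's default `LogNormalize` method applies after scaling) is not stated here, so none is applied. The function name `normalize_to_10k` and the toy matrix are hypothetical.

```python
import pandas as pd


def normalize_to_10k(counts: pd.DataFrame, target: float = 10_000) -> pd.DataFrame:
    """Rescale each row (cell) so its counts sum to `target`."""
    totals = counts.sum(axis=1)
    # Cells with no counts keep a divisor of 1 so they stay all-zero.
    totals = totals.where(totals > 0, 1)
    return counts.div(totals, axis=0) * target


# Toy matrix: rows are cells, columns are genes.
raw_counts = pd.DataFrame(
    {"CD3D": [2, 0, 0], "CD4": [1, 3, 0], "CD8A": [1, 1, 0]},
    index=["cell_1", "cell_2", "cell_3"],
)
normalized = normalize_to_10k(raw_counts)
print(normalized.sum(axis=1))
```
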
#### 7. Preprocess the Data for Model Compatibility
Prepare the data before feeding it to the model.

• Replace hyphens in column names with dots:
```
data.columns = data.columns.str.replace('-', '.')
```
• Rename columns to the names the model expects:
```
# Rename columns based on the mapping dictionary
data.rename(columns=feature_mapping, inplace=True)
```
Ensure that `feature_mapping` is defined in your code before running this step.

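The two preprocessing steps can be tried on a toy table first. The `feature_mapping` below is purely hypothetical, since the README does not define one; substitute whatever mapping your own column names need.

```python
import pandas as pd

# Toy data with one column that needs renaming and two with hyphens.
data = pd.DataFrame({"HLA-DRA": [1.0], "HLA-DQB1": [2.0], "Cd3d": [3.0]})

# Replace hyphens with dots (literal match, not a regular expression).
data.columns = data.columns.str.replace("-", ".", regex=False)

# Hypothetical mapping from the names in your file to the model's names.
feature_mapping = {"Cd3d": "CD3D"}
data = data.rename(columns=feature_mapping)

print(list(data.columns))  # ['HLA.DRA', 'HLA.DQB1', 'CD3D']
```
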
#### 9. Select the Required Features for Prediction
Define the list of genes to be used by the model:
```
selected_features = ['CD3D', 'CD3E', 'CD3G', 'CCR7', 'LEF1', 'SELL', 'TCF7', 'S1PR1', 'ANXA1', 'ANXA2',
                     'IL7R', 'CD74', 'TYROBP', 'CD4', 'HAVCR2', 'PDCD1', 'GZMB', 'ITGAE', 'CXCL13', 'FOXP3',
                     'CTLA4', 'IL2RA', 'MKI67', 'STMN1', 'CMC1', 'CD8A', 'CD8B', 'CX3CR1', 'KLRG1', 'FCGR3A',
                     'FGFBP2', 'GZMH', 'GZMK', 'CCL4', 'CCL5', 'NKG7', 'KLRD1', 'KLRF1', 'GNLY', 'IL32',
                     'SLC4A10', 'KLRB1', 'ZBTB16', 'NCR3', 'NCAM1', 'CCL3', 'IFNG', 'CD69', 'HSPA1A',
                     'XCL1', 'AREG', 'CD160', 'TIGIT', 'CXCR4', 'ZNF331', 'DNAJB1', 'HSPA1B', 'HSPA6',
                     'TUBB', 'CST3', 'LYZ', 'CD14', 'VCAN', 'S100A9', 'RNASE2', 'S100A12', 'FCER1G', 'LST1',
                     'AIF1', 'IFITM3', 'CD1C', 'FCER1A', 'CLEC10A', 'VEGFA', 'IRF4', 'RGS2', 'CLEC9A',
                     'IRF8', 'IDO1', 'CLNK', 'XCR1', 'LAMP3', 'CD274', 'LTB', 'CCL19', 'CCL21', 'CD68',
                     'THBS1', 'S100A8', 'CD163', 'SIGLEC1', 'C1QA', 'SLC40A1', 'GPNMB', 'APOE', 'SAT1',
                     'HLA.DQB1', 'S100A4', 'HLA.DRA', 'HLA.DQA1', 'MARCO', 'CD79A', 'CPA3', 'KIT', 'CD19',
                     'MS4A1', 'CD22']
X_test_data = data[selected_features]
```
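Indexing with the full gene list raises a `KeyError` if any gene is absent, so it can be useful to report the missing genes explicitly first. The helper below is a hypothetical sketch (`select_required_features` is not part of this repository), shown with a three-gene list to keep it short. It also returns the columns in the order of the list, which matters once the data is converted to a plain array.

```python
import pandas as pd


def select_required_features(data: pd.DataFrame, required: list) -> pd.DataFrame:
    """Return `data` restricted to `required`, in that order, or fail clearly."""
    missing = [gene for gene in required if gene not in data.columns]
    if missing:
        raise KeyError(f"{len(missing)} required gene(s) missing: {missing}")
    return data[required]


# Short stand-in for the full selected_features list.
demo_features = ["CD3D", "HLA.DRA", "MS4A1"]
demo_data = pd.DataFrame(
    {"MS4A1": [0.0, 4.2], "CD3D": [3.1, 0.0], "HLA.DRA": [1.5, 2.0], "ACTB": [9.0, 8.0]}
)

X_demo = select_required_features(demo_data, demo_features)
print(list(X_demo.columns))  # ['CD3D', 'HLA.DRA', 'MS4A1']
```
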
#### 10. Handle Missing Values in the Data
Replace missing values (NaN) with the mean of each column using SimpleImputer:
```
imputer = SimpleImputer(strategy='mean')
X_test_data = imputer.fit_transform(X_test_data)
```
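Note that `fit_transform` returns a NumPy array by default, so the gene names and cell index are lost. If you want to keep them for inspection, one option is to wrap the result back into a DataFrame. The toy values below are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy expression table with one missing value per gene.
X_toy = pd.DataFrame(
    {"CD3D": [1.0, np.nan, 3.0], "CD4": [0.0, 2.0, np.nan]},
    index=["cell_1", "cell_2", "cell_3"],
)

imputer = SimpleImputer(strategy="mean")
# Rebuild a DataFrame so column names and the cell index are preserved.
X_imputed = pd.DataFrame(
    imputer.fit_transform(X_toy), columns=X_toy.columns, index=X_toy.index
)
print(X_imputed)
```

Here the missing `CD3D` value becomes 2.0 and the missing `CD4` value becomes 1.0, the means of the observed entries in each column.
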
#### 11. Make Predictions with the Loaded Model
Use the model to make predictions:
```
predictions = loaded_rf_model.predict(X_test_data)
```
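A scikit-learn random forest also exposes `predict_proba`, which gives a per-class probability for each cell and can serve as a rough confidence score. The sketch below uses a small stand-in forest trained on random numbers with placeholder labels (not this model's 40 categories), so it runs without the real model. The same calls should apply to `loaded_rf_model`, assuming it is a standard `RandomForestClassifier`.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Stand-in model trained on random data with placeholder labels.
rng = np.random.default_rng(0)
X_train = rng.random((60, 4))
y_train = np.repeat(["type_a", "type_b", "type_c"], 20)
toy_model = RandomForestClassifier(n_estimators=10, random_state=0).fit(X_train, y_train)

X_new = rng.random((5, 4))
proba = toy_model.predict_proba(X_new)            # shape: (cells, classes)
predicted = toy_model.classes_[proba.argmax(axis=1)]
confidence = proba.max(axis=1)                    # probability of the top class

for label, score in zip(predicted, confidence):
    print(f"{label}: {score:.2f}")
```

Cells whose top probability is low may deserve a manual look before their labels are used downstream.
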
#### 12. Add Predictions to the Data and Display the Updated Data
```
data['label'] = predictions
print(data.head())
plt.figure(figsize=(10, 4))
plt.title('Predicted Cell Type Distribution')
data['label'].value_counts().plot.bar(rot=0)
plt.show()
```
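With up to 40 possible labels, a table of counts and fractions can be easier to read than a bar chart. A minimal sketch with placeholder labels (the `demo` frame stands in for `data` after the `label` column has been added):

```python
import pandas as pd

# Placeholder predictions; in practice use data['label'].
demo = pd.DataFrame({"label": ["type_a", "type_a", "type_b", "type_c"]})

counts = demo["label"].value_counts()
summary = pd.DataFrame({"count": counts, "fraction": counts / counts.sum()})
print(summary)
```
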