---
tags:
- xgboost
- python
- tabular-classification
- chemistry
- pair-distribution-function
model_file: MLstructureMining_model.bin
metrics:
- accuracy
pipeline_tag: tabular-classification
license: apache-2.0
language:
- en
---
# Model description
MLstructureMining is a tree-based machine learning classifier that matches X-ray pair distribution function (PDF) data to prototype
patterns from a large database of crystal structures. It screens large quantities of data in seconds, enabling real-time structure characterization.
The code used to train the model is available in the [MLstructureMining-workflow repository](https://github.com/EmilSkaaning/MLstructureMining-workflow), and the Python implementation is available in the
[MLstructureMining repository](https://github.com/EmilSkaaning/MLstructureMining/tree/main) or as the wheel file `mlstructuremining-4.1.0-py3-none-any.whl`.
## Evaluation Results
MLstructureMining was trained on PDFs from 10,833 crystal structures obtained from the [Crystallography Open Database](https://www.crystallography.net/cod/) (COD).
The Pearson Correlation Coefficient (PCC) was used to cluster structures with similar PDFs, resulting in a total of 6,062 labels. 100 PDFs were simulated per structure,
and the data were split into training, validation, and test sets in an 80/10/10 ratio.
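
The clustering and splitting steps above can be illustrated with a minimal sketch. It is not the authors' training code: the greedy grouping scheme, the `threshold` value of 0.9, the function names, and the synthetic signals are all assumptions made for illustration, and the source does not state how the 80/10/10 split was drawn.

```python
import numpy as np


def cluster_by_pcc(pdfs: np.ndarray, threshold: float = 0.9) -> np.ndarray:
    """Greedily group PDFs whose Pearson correlation with a cluster
    representative is at least `threshold` (hypothetical scheme)."""
    corr = np.corrcoef(pdfs)  # each row is one PDF -> (n, n) PCC matrix
    labels = np.full(len(pdfs), -1, dtype=int)
    representatives = []  # index of the first PDF in each cluster
    for i in range(len(pdfs)):
        for label, rep in enumerate(representatives):
            if corr[i, rep] >= threshold:
                labels[i] = label
                break
        else:
            # No sufficiently similar cluster exists: start a new one.
            representatives.append(i)
            labels[i] = len(representatives) - 1
    return labels


def split_indices(n: int, seed: int = 0):
    """Shuffle n sample indices and split them 80/10/10."""
    idx = np.random.default_rng(seed).permutation(n)
    n_train, n_val = n * 8 // 10, n // 10
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]


# Synthetic stand-ins for simulated PDFs (not real data).
r = np.linspace(0, 30, 300)
pdfs = np.stack([
    np.sin(r),                         # structure A
    np.sin(r) + 0.01 * np.cos(7 * r),  # near-duplicate of A
    np.sin(3 * r),                     # clearly different structure
])
labels = cluster_by_pcc(pdfs)
train_idx, val_idx, test_idx = split_indices(100)
```

In this toy case the first two signals share a label because they are almost perfectly correlated, which mirrors how 10,833 structures can collapse into fewer labels.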
We evaluate the model on accuracy. To test the robustness of MLstructureMining, we use zeroth-order optimization (ZOO)
from the [Adversarial Robustness Toolbox](https://github.com/Trusted-AI/adversarial-robustness-toolbox) (ART) library to perform adversarial attacks.
| Metric | Value |
|----------|---------|
| Accuracy | 91% |
| Top-3 Accuracy | 99% |
| ZOO Accuracy | 89% |
| ZOO Top-3 Accuracy | 97% |
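
For readers unfamiliar with the top-3 metric in the table above, the following sketch shows one common way to compute top-k accuracy from a matrix of predicted class probabilities. The function name and the toy inputs are illustrative assumptions, not part of the MLstructureMining code base.

```python
import numpy as np


def top_k_accuracy(proba: np.ndarray, y_true: np.ndarray, k: int = 3) -> float:
    """Fraction of samples whose true label is among the k most probable classes."""
    # argsort is ascending, so the last k columns hold the k largest probabilities.
    top_k = np.argsort(proba, axis=1)[:, -k:]
    hits = (top_k == y_true[:, None]).any(axis=1)
    return float(hits.mean())


# Toy probabilities for three samples and four classes (not model output).
proba = np.array([
    [0.1, 0.2, 0.3, 0.4],
    [0.4, 0.3, 0.2, 0.1],
    [0.3, 0.1, 0.4, 0.2],
])
y_true = np.array([3, 3, 0])

top1 = top_k_accuracy(proba, y_true, k=1)
top3 = top_k_accuracy(proba, y_true, k=3)
```

With `k=1` this reduces to ordinary accuracy, so top-k accuracy can never be lower than accuracy on the same predictions.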
# How to Get Started with the Model
Use the code below to get started with the model.
```python
import numpy as np
import pandas as pd
import xgboost as xgb

# NOTE: `clean_string` and `extract_filenames` are not defined in this snippet.
# They are assumed to come from the MLstructureMining Python implementation
# linked above; import or copy them into scope before calling `show_best`.


def show_best(pred: np.ndarray,
              best_list: np.ndarray,
              df_stru_catalog: pd.DataFrame,
              num_show: int) -> None:
    """
    Display the best predictions based on the model output.

    Parameters
    ----------
    pred : np.ndarray
        Predictions from the model for a single PDF.
    best_list : np.ndarray
        Class indices sorted by ascending prediction value.
    df_stru_catalog : pd.DataFrame
        The structure catalog associated with the model.
    num_show : int
        Number of top predictions to show.

    Returns
    -------
    None
    """
    # Walk the last `num_show` entries in reverse so the best match comes first.
    for count, idx in enumerate(reversed(best_list[-num_show:])):
        print(f"\n{count}) Probability: {pred[idx]*100:3.1f}%")

        compo = clean_string(df_stru_catalog.iloc[idx]["composition"])
        sgs = clean_string(df_stru_catalog.iloc[idx]["space_group_symmetry"])

        print(f' COD-IDs: {df_stru_catalog.iloc[idx]["Label"].rsplit(".",1)[0]}, composition: {compo[0]}, space group: {sgs[0]}')

        # List structures that were clustered under the same label, if any.
        if not pd.isna(df_stru_catalog.at[idx, "Similar"]):
            similar_files = extract_filenames(df_stru_catalog.at[idx, "Similar"])
            for jdx in range(len(similar_files)):
                print(f' COD-IDs: {similar_files[jdx]}, composition: {compo[jdx]}, space group: {sgs[jdx]}')


N_CPU = 8  # Number of CPUs used
NUM_SHOW = 5  # Show the top X predictions

# Load model
bst = xgb.Booster({'nthread': N_CPU})
bst.load_model("MLstructureMining_model.bin")

# Load your data ("your_data.csv" is a placeholder; replace it with your own file)
data = pd.read_csv("your_data.csv")
data_xgb = xgb.DMatrix(data)

# Load labels (the structure catalog)
df_stru_catalog = pd.read_csv("labels.csv", index_col=0)

# Do inference
pred = bst.predict(data_xgb)

# Show the best matches for the first PDF in the data
best_list = np.argsort(pred)
show_best(pred[0], best_list[0], df_stru_catalog, NUM_SHOW)
```
# Model Card Authors
Emil T. S. Kjær
# Model Card Contact
[email protected]
# Citation
In review.
**BibTeX:**
```
[More Information Needed]
```