---
tags:
- xgboost
- python
- tabular-classification
- chemistry
- pair-distribution-function
model_file: MLstructureMining_model.bin
metrics:
- accuracy
pipeline_tag: tabular-classification
license: apache-2.0
language:
- en
---

# Model description

MLstructureMining is a tree-based machine learning classifier designed to rapidly match X-ray pair distribution function (PDF) data to prototype patterns from a large database of crystal structures. It provides real-time structure characterization by screening vast quantities of data in seconds.

The code used to train the model is available in the [MLstructureMining-workflow repository](https://github.com/EmilSkaaning/MLstructureMining-workflow). The Python implementation is available in the [MLstructureMining repository](https://github.com/EmilSkaaning/MLstructureMining/tree/main) or as the wheel file `mlstructuremining-4.1.0-py3-none-any.whl`.

## Evaluation Results

MLstructureMining was trained on PDFs from 10,833 crystal structures obtained from the [Crystallography Open Database](https://www.crystallography.net/cod/) (COD). The Pearson correlation coefficient (PCC) was used to cluster structures with similar PDFs, resulting in a total of 6,062 labels. 100 PDFs were simulated per structure, and the data were split into training, validation, and test sets in an 80/10/10 ratio.
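
The PCC-based grouping described above can be illustrated with a small sketch. This is not the authors' procedure: the function name `group_by_pcc`, the 0.9 threshold, the greedy assignment strategy, and the synthetic sinusoids standing in for PDFs are all assumptions made for illustration.

```python
import numpy as np


def group_by_pcc(pdfs: np.ndarray, threshold: float = 0.9) -> np.ndarray:
    """Greedily group rows of `pdfs` by Pearson correlation (hypothetical helper).

    Each PDF joins the first existing group whose representative has a PCC of
    at least `threshold` with it; otherwise it starts a new group.
    """
    pcc = np.corrcoef(pdfs)  # (n, n) pairwise Pearson correlation between rows
    labels = np.full(len(pdfs), -1, dtype=int)
    representatives = []  # row index of the first member of each group
    for i in range(len(pdfs)):
        for label, rep in enumerate(representatives):
            if pcc[i, rep] >= threshold:
                labels[i] = label
                break
        else:
            # No sufficiently similar group found: start a new one.
            representatives.append(i)
            labels[i] = len(representatives) - 1
    return labels


# Synthetic stand-ins for PDFs: two nearly identical signals and one different.
r = np.linspace(0.1, 30.0, 300)
pdfs = np.stack([
    np.sin(2.0 * r),
    1.01 * np.sin(2.0 * r) + 0.001,
    np.cos(5.0 * r),
])
labels = group_by_pcc(pdfs)
print(labels)
```

With a grouping of this kind, structures whose PDFs are nearly indistinguishable share one label, which is why the number of labels (6,062) is smaller than the number of structures (10,833).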

We evaluate the model on accuracy. To test the robustness of MLstructureMining, we use zeroth-order optimization (ZOO) from the [Adversarial Robustness Toolbox](https://github.com/Trusted-AI/adversarial-robustness-toolbox) (ART) library to perform adversarial attacks.

| Metric | Value |
|----------|---------|
| Accuracy | 91% |
| Top-3 Accuracy | 99% |
| ZOO Accuracy | 89% |
| ZOO Top-3 Accuracy | 97% |
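
For readers unfamiliar with the top-3 metric in the table, the sketch below shows one common way to compute top-k accuracy from a matrix of class probabilities. The function name `top_k_accuracy` and the toy numbers are assumptions for illustration; they are not the evaluation code or data used for this model.

```python
import numpy as np


def top_k_accuracy(proba: np.ndarray, y_true: np.ndarray, k: int = 3) -> float:
    """Fraction of samples whose true class is among the k most probable classes."""
    # Indices of the k largest probabilities in each row.
    top_k = np.argsort(proba, axis=1)[:, -k:]
    hits = (top_k == y_true[:, None]).any(axis=1)
    return float(hits.mean())


# Toy example: 3 samples, 4 classes.
proba = np.array([
    [0.10, 0.60, 0.20, 0.10],
    [0.40, 0.30, 0.20, 0.10],
    [0.05, 0.15, 0.30, 0.50],
])
y_true = np.array([1, 2, 0])

top1 = top_k_accuracy(proba, y_true, k=1)
top3 = top_k_accuracy(proba, y_true, k=3)
print(f"top-1: {top1:.2f}, top-3: {top3:.2f}")
```

Top-1 accuracy counts only the single most probable class, while top-3 also counts a prediction as correct when the true class is ranked second or third.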

# How to Get Started with the Model

Use the code below to get started with the model.

```python
import numpy as np
import pandas as pd
import xgboost as xgb

# NOTE: clean_string and extract_filenames are not defined in this snippet
# (presumably helpers from the MLstructureMining Python implementation);
# they must be in scope before show_best is called.


def show_best(pred: np.ndarray,
              best_list: np.ndarray,
              df_stru_catalog: pd.DataFrame,
              num_show: int) -> None:
    """
    Display the best predictions based on the model output.

    Parameters
    ----------
    pred : np.ndarray
        Class probabilities predicted by the model for a single PDF.
    best_list : np.ndarray
        Class indices sorted by ascending probability.
    df_stru_catalog : pd.DataFrame
        The structure catalog associated with the model.
    num_show : int
        Number of top predictions to show.

    Returns
    -------
    None
    """
    # Walk the num_show most probable classes, most probable first.
    for count, idx in enumerate(reversed(best_list[-num_show:])):
        print(f"\n{count}) Probability: {pred[idx]*100:3.1f}%")

        # Look up the catalog row by position for every column.
        row = df_stru_catalog.iloc[idx]
        compo = clean_string(row["composition"])
        sgs = clean_string(row["space_group_symmetry"])

        print(f'    COD-IDs: {row["Label"].rsplit(".",1)[0]}, composition: {compo[0]}, space group: {sgs[0]}')
        # Other structures listed as similar for this label, if any.
        if not pd.isna(row["Similar"]):
            similar_files = extract_filenames(row["Similar"])
            for jdx in range(len(similar_files)):
                print(f'    COD-IDs: {similar_files[jdx]}, composition: {compo[jdx]}, space group: {sgs[jdx]}')


N_CPU = 8  # Number of CPUs used
NUM_SHOW = 5  # Show the X best predictions

# Load model
bst = xgb.Booster({'nthread': N_CPU})
bst.load_model("MLstructureMining_model.bin")

# Load your data ("your_data.csv" is a placeholder for your own PDF data)
data = pd.read_csv("your_data.csv")
data_xgb = xgb.DMatrix(data)

# Load labels (the structure catalog)
df_stru_catalog = pd.read_csv("labels.csv", index_col=0)

# Do inference
pred = bst.predict(data_xgb)

# Show the best predictions for the first PDF in the data
best_list = np.argsort(pred)
show_best(pred[0], best_list[0], df_stru_catalog, NUM_SHOW)
```
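
The snippet above calls `clean_string` and `extract_filenames` without defining them. If they are not available from the MLstructureMining package in your environment, stand-ins along the following lines may work. This is a sketch under an unverified assumption: that the catalog stores these fields as stringified Python lists (for example `"['Ce O2', 'Ce O2']"`) and that file names carry a single extension. The real helpers may behave differently.

```python
import ast
from typing import List


def clean_string(raw) -> List[str]:
    """Hypothetical stand-in: turn a stringified list into a list of strings.

    Falls back to a one-element list when `raw` is not a Python literal.
    """
    try:
        value = ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        return [str(raw).strip()]
    if isinstance(value, (list, tuple)):
        return [str(item).strip() for item in value]
    return [str(value).strip()]


def extract_filenames(raw) -> List[str]:
    """Hypothetical stand-in: parse file names and drop their extension."""
    return [name.rsplit(".", 1)[0] for name in clean_string(raw)]


compositions = clean_string("['Ce O2', 'Co Fe2 O4']")
space_group = clean_string("Fm-3m")
similar = extract_filenames("['1000041.cif', '1000042.cif']")
print(compositions, space_group, similar)
```

Check the output against a few rows of `labels.csv` before relying on these stand-ins, since the actual column format is not documented here.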

# Model Card Authors

Emil T. S. Kjær

# Model Card Contact

[email protected]

# Citation

In review.

**BibTeX:**

```
@Article{D4DD00001C,
author ="Kjær, Emil T. S. and Anker, Andy S. and Kirsch, Andrea and Lajer, Joakim and Aalling-Frederiksen, Olivia and Billinge, Simon J. L. and Jensen, Kirsten M. Ø.",
title ="MLstructureMining: a machine learning tool for structure identification from X-ray pair distribution functions",
journal ="Digital Discovery",
year ="2024",
pages ="-",
publisher ="RSC",
doi ="10.1039/D4DD00001C",
url ="http://dx.doi.org/10.1039/D4DD00001C",
abstract ="Synchrotron X-ray techniques are essential for studies of the intrinsic relationship between synthesis{,} structure{,} and properties of materials. Modern synchrotrons can produce up to 1 petabyte of data per day. Such amounts of data can speed up materials development{,} but also comes with a staggering growth in workload{,} as the data generated must be stored and analyzed. We present an approach for quickly identifying an atomic structure model from pair distribution function (PDF) data from (nano)crystalline materials. Our model{,} MLstructureMining{,} uses a tree-based machine learning (ML) classifier. MLstructureMining has been trained to classify chemical structures from a PDF and gives a top-3 accuracy of 99% on simulated PDFs not seen during training{,} with a total of 6062 possible classes. We also demonstrate that MLstructureMining can identify the chemical structure from experimental PDFs from nanoparticles of CoFe2O4 and CeO2{,} and we show how it can be used to treat an in situ PDF series collected during Bi2Fe4O9 formation. Additionally{,} we show how MLstructureMining can be used in combination with the well-known methods{,} principal component analysis (PCA) and non-negative matrix factorization (NMF) to analyze data from in situ experiments. MLstructureMining thus allows for real-time structure characterization by screening vast quantities of crystallographic information files in seconds."}
```