---
tags:
- xgboost
- python
- tabular-classification
- chemistry
- pair-distribution-function
model_file: MLstructureMining_model.bin
metrics:
- accuracy
pipeline_tag: tabular-classification
license: apache-2.0
language:
- en
---

# Model description
MLstructureMining is a tree-based machine learning classifier designed to rapidly match X-ray pair distribution function (PDF) data to prototype 
patterns from a large database of crystal structures, providing real-time structure characterization by screening vast quantities of data in seconds.

The code used to train the model can be found [HERE](https://github.com/EmilSkaaning/MLstructureMining-workflow), and the Python implementation can be found 
[HERE](https://github.com/EmilSkaaning/MLstructureMining/tree/main) or in the wheel file `mlstructuremining-4.1.0-py3-none-any.whl`.

## Evaluation Results

MLstructureMining has been trained on PDFs from 10,833 crystal structures obtained from the [Crystallography Open Database](https://www.crystallography.net/cod/) (COD). 
The Pearson Correlation Coefficient (PCC) was used to cluster structures with similar PDFs, resulting in a total of 6,062 labels. 100 PDFs were simulated per structure,
and the data were split into training, validation and test sets with an 80/10/10 ratio.
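
The clustering and splitting steps described above can be sketched as follows. This is only an illustration under stated assumptions: the greedy grouping rule, the `threshold` parameter, and splitting at the level of individual PDFs are hypothetical choices, since this card does not specify the PCC cut-off or the exact procedure used in the training workflow.

```python
import numpy as np


def cluster_by_pcc(pdfs: np.ndarray, threshold: float) -> np.ndarray:
    """Greedily group PDFs whose Pearson correlation is >= threshold.

    `threshold` is a hypothetical parameter; the value used to build the
    6,062 labels is not stated in this model card.
    """
    pcc = np.corrcoef(pdfs)  # (n, n) Pearson matrix; each row of `pdfs` is one PDF
    labels = np.full(len(pdfs), -1, dtype=int)
    next_label = 0
    for i in range(len(pdfs)):
        if labels[i] != -1:
            continue  # already assigned to an earlier cluster
        members = (pcc[i] >= threshold) & (labels == -1)
        members[i] = True
        labels[members] = next_label
        next_label += 1
    return labels


def split_indices(n: int, seed: int = 0):
    """Shuffle n sample indices and split them 80/10/10."""
    idx = np.random.default_rng(seed).permutation(n)
    n_train = (8 * n) // 10
    n_val = n // 10
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]


# Toy example: the first two curves are nearly identical, the third differs.
r = np.linspace(0, 10, 200)
toy_pdfs = np.stack([np.sin(r), np.sin(r) + 0.01 * np.cos(3 * r), np.cos(2 * r)])
toy_labels = cluster_by_pcc(toy_pdfs, threshold=0.95)
train_idx, val_idx, test_idx = split_indices(100)
```

In this toy case the two near-identical curves share a label and the third receives its own, mirroring how structures with similar PDFs collapse into a single class.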

We evaluate the model on accuracy. To test the robustness of MLstructureMining, we use the zeroth-order optimization (ZOO) attack 
from the [Adversarial Robustness Toolbox](https://github.com/Trusted-AI/adversarial-robustness-toolbox) (ART) library to perform adversarial attacks.

| Metric   | Value   |
|----------|---------|
| Accuracy | 91% |
| Top-3 Accuracy | 99% |
| ZOO Accuracy | 89% |
| ZOO Top-3 Accuracy | 97% |
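
For readers unfamiliar with the top-3 metric in the table, the sketch below shows one common way to compute top-k accuracy from a matrix of class probabilities. The function name `top_k_accuracy` and the toy data are illustrative assumptions, not the evaluation code used for the reported numbers.

```python
import numpy as np


def top_k_accuracy(proba: np.ndarray, y_true: np.ndarray, k: int) -> float:
    """Fraction of samples whose true class is among the k most probable classes."""
    # Indices of the k highest-probability classes for every row
    top_k = np.argsort(proba, axis=1)[:, -k:]
    hits = (top_k == y_true[:, None]).any(axis=1)
    return float(hits.mean())


# Toy probabilities for 3 samples and 3 classes
proba = np.array([[0.1, 0.6, 0.3],
                  [0.5, 0.2, 0.3],
                  [0.2, 0.3, 0.5]])
y_true = np.array([1, 2, 0])

top1 = top_k_accuracy(proba, y_true, k=1)
top2 = top_k_accuracy(proba, y_true, k=2)
top3 = top_k_accuracy(proba, y_true, k=3)
```

With `k=1` this reduces to ordinary accuracy, so the same function covers both kinds of row in the table.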

# How to Get Started with the Model

Use the code below to get started with the model.

```python
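import numpy as np
import pandas as pd
import xgboost as xgb

# NOTE: `clean_string` and `extract_filenames` are helper functions that are not
# defined in this snippet. They are assumed to come from the MLstructureMining
# Python implementation linked above and must be imported or defined before use.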

def show_best(pred: np.ndarray, 
              best_list: np.ndarray, 
              df_stru_catalog: pd.DataFrame, 
              num_show: int) -> None:
    """
    Display the best predictions based on the model output.

    Parameters
    ----------
    pred : np.ndarray
        Predictions from the model.
    best_list : np.ndarray
        Class indices sorted by ascending probability (e.g. from np.argsort).
    df_stru_catalog : pd.DataFrame
        The structure catalog associated with the model.
    num_show : int
        Number of top predictions to show.

    Returns
    -------
    None
    """
    for count, idx in enumerate(reversed(best_list[-num_show:])):
        print(f"\n{count}) Probability: {pred[idx]*100:3.1f}%")

        compo = clean_string(df_stru_catalog.iloc[idx]["composition"])
        sgs = clean_string(df_stru_catalog.iloc[idx]["space_group_symmetry"])

        print(f'    COD-IDs: {df_stru_catalog.iloc[idx]["Label"].rsplit(".",1)[0]}, composition: {compo[0]}, space group: {sgs[0]}')
        if not pd.isna(df_stru_catalog.at[idx, "Similar"]):
            similar_files = extract_filenames(df_stru_catalog.at[idx, "Similar"])
            # compo and sgs were already computed for this row above
            for jdx, similar_file in enumerate(similar_files):
                print(f'    COD-IDs: {similar_file}, composition: {compo[jdx]}, space group: {sgs[jdx]}')


N_CPU = 8  # Number of CPU threads to use
NUM_SHOW = 5  # Number of top predictions to show

# Load model
bst = xgb.Booster({'nthread': N_CPU})
bst.load_model("MLstructureMining_model.bin")

# Load your data (replace "your_data.csv" with the path to your own PDF data)
data = pd.read_csv("your_data.csv")
data_xgb = xgb.DMatrix(data)

# Load the structure catalog (labels)
df_stru_catalog = pd.read_csv("labels.csv", index_col=0)

# Run inference
pred = bst.predict(data_xgb)

# Show the top predictions for the first sample
best_list = np.argsort(pred)
show_best(pred[0], best_list[0], df_stru_catalog, NUM_SHOW)
```


# Model Card Authors

Emil T. S. Kjær

# Model Card Contact

[email protected]

# Citation

In review.

**BibTeX:**
```
@Article{D4DD00001C,
author ="Kjær, Emil T. S. and Anker, Andy S. and Kirsch, Andrea and Lajer, Joakim and Aalling-Frederiksen, Olivia and Billinge, Simon J. L. and Jensen, Kirsten M. Ø.",
title  ="MLstructureMining: a machine learning tool for structure identification from X-ray pair distribution functions",
journal  ="Digital Discovery",
year  ="2024",
pages  ="-",
publisher  ="RSC",
doi  ="10.1039/D4DD00001C",
url  ="http://dx.doi.org/10.1039/D4DD00001C",
abstract  ="Synchrotron X-ray techniques are essential for studies of the intrinsic relationship between synthesis{,} structure{,} and properties of materials. Modern synchrotrons can produce up to 1 petabyte of data per day. Such amounts of data can speed up materials development{,} but also comes with a staggering growth in workload{,} as the data generated must be stored and analyzed. We present an approach for quickly identifying an atomic structure model from pair distribution function (PDF) data from (nano)crystalline materials. Our model{,} MLstructureMining{,} uses a tree-based machine learning (ML) classifier. MLstructureMining has been trained to classify chemical structures from a PDF and gives a top-3 accuracy of 99% on simulated PDFs not seen during training{,} with a total of 6062 possible classes. We also demonstrate that MLstructureMining can identify the chemical structure from experimental PDFs from nanoparticles of CoFe2O4 and CeO2{,} and we show how it can be used to treat an in situ PDF series collected during Bi2Fe4O9 formation. Additionally{,} we show how MLstructureMining can be used in combination with the well-known methods{,} principal component analysis (PCA) and non-negative matrix factorization (NMF) to analyze data from in situ experiments. MLstructureMining thus allows for real-time structure characterization by screening vast quantities of crystallographic information files in seconds."}
```