MLstructureMining / README.md
metadata
tags:
  - xgboost
  - python
  - tabular-classification
  - chemistry
  - pair-distribution-function
model_file: MLstructureMining_model.bin
metrics:
  - accuracy
pipeline_tag: tabular-classification
license: apache-2.0
language:
  - en

Model description

MLstructureMining is a tree-based machine learning classifier designed to rapidly match X-ray pair distribution function (PDF) data to prototype patterns from a large database of crystal structures, providing real-time structure characterization by screening vast quantities of data in seconds.

The code used to train the model can be found HERE, and the Python implementation is available HERE or as the wheel file `mlstructuremining-4.1.0-py3-none-any.whl`.

Evaluation Results

MLstructureMining was trained on PDFs from 10,833 crystal structures obtained from the Crystallography Open Database (COD). The Pearson Correlation Coefficient (PCC) was used to cluster structures with similar PDFs, resulting in a total of 6,062 labels. 100 PDFs were simulated per structure, and the data were split into training, validation and test sets with an 80/10/10 ratio.
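To make the PCC-based clustering step concrete, here is a minimal sketch of how structures with near-identical PDFs could be merged under one label. It is not the training code: the greedy grouping strategy, the function name `group_by_pcc`, the `threshold` value and the synthetic PDFs are all assumptions made for illustration, since the procedure and cutoff actually used are not stated here.

```python
import numpy as np


def group_by_pcc(pdfs: np.ndarray, threshold: float) -> np.ndarray:
    """Greedily group PDFs whose PCC with a seed PDF reaches the threshold.

    pdfs has shape (n_structures, n_points). The threshold is a
    hypothetical cutoff, not a value taken from the model card.
    """
    pcc = np.corrcoef(pdfs)  # (n_structures, n_structures) Pearson matrix
    labels = np.full(len(pdfs), -1, dtype=int)
    next_label = 0
    for i in range(len(pdfs)):
        if labels[i] != -1:
            continue  # already merged into an earlier group
        # Every still-unlabelled structure that correlates strongly with
        # seed structure i joins its group (i itself has a PCC of 1).
        members = (labels == -1) & (pcc[i] >= threshold)
        labels[members] = next_label
        next_label += 1
    return labels


# Synthetic stand-ins for simulated PDFs on a common r-grid.
r = np.linspace(0.1, 30.0, 300)
pdf_a = np.sin(2.0 * r) * np.exp(-0.05 * r)
pdf_b = 0.9 * pdf_a                          # same shape, rescaled
pdf_c = np.cos(3.1 * r) * np.exp(-0.05 * r)  # different pattern

labels = group_by_pcc(np.stack([pdf_a, pdf_b, pdf_c]), threshold=0.9)
print(labels)
```

Because the PCC is insensitive to overall scale, the rescaled PDF shares a label with its source while the third pattern gets its own. The same kind of merging would explain why 10,833 structures collapse to fewer labels.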

We evaluate the model on accuracy. To test the robustness of MLstructureMining, we use zeroth-order optimization (ZOO) from the Adversarial Robustness Toolbox (ART) library to perform adversarial attacks.

| Metric | Value |
|---|---|
| Accuracy | 91% |
| Top-3 Accuracy | 99% |
| ZOO Accuracy | 89% |
| ZOO Top-3 Accuracy | 97% |
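For readers unfamiliar with the top-3 metric, the sketch below shows one common way to compute top-k accuracy from a matrix of class probabilities. The helper name `top_k_accuracy` and the toy numbers are illustrative assumptions and are not taken from the evaluation code. The same function would apply unchanged to predictions on adversarially perturbed inputs.

```python
import numpy as np


def top_k_accuracy(proba: np.ndarray, y_true: np.ndarray, k: int) -> float:
    """Fraction of samples whose true label is among the k most probable classes."""
    # argsort is ascending, so the last k columns hold the k largest probabilities.
    top_k = np.argsort(proba, axis=1)[:, -k:]
    hits = (top_k == y_true[:, None]).any(axis=1)
    return float(hits.mean())


# Toy probabilities for 4 samples and 4 classes (made-up numbers).
proba = np.array([
    [0.70, 0.20, 0.05, 0.05],
    [0.10, 0.30, 0.40, 0.20],
    [0.25, 0.30, 0.35, 0.10],
    [0.40, 0.30, 0.20, 0.10],
])
y_true = np.array([0, 1, 0, 3])

top1 = top_k_accuracy(proba, y_true, k=1)
top3 = top_k_accuracy(proba, y_true, k=3)
print(f"top-1: {top1:.2f}, top-3: {top3:.2f}")
```

Here only the first sample is correct at top-1, while the second and third are recovered once the three most probable classes are considered.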

How to Get Started with the Model

Use the code below to get started with the model.

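import numpy as np
import pandas as pd
import xgboost as xgb

# NOTE: show_best relies on two helpers, clean_string and extract_filenames,
# that are not defined in this snippet. They must be made available (for
# example from the Python implementation referenced above) before running it.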

def show_best(pred: np.ndarray, 
              best_list: np.ndarray, 
              df_stru_catalog: pd.DataFrame, 
              num_show: int) -> None:
    """
    Display the best predictions based on the model output.

    Parameters
    ----------
    pred : np.ndarray
        Predictions from the model.
    best_list : np.ndarray
        List of best predictions.
    df_stru_catalog : pd.DataFrame
        The structure catalog associated with the model.
    num_show : int
        Number of top predictions to show.

    Returns
    -------
    None
    """
    for count, idx in enumerate(reversed(best_list[-num_show:])):
        print(f"\n{count}) Probability: {pred[idx]*100:3.1f}%")

        compo = clean_string(df_stru_catalog.iloc[idx]["composition"])
        sgs = clean_string(df_stru_catalog.iloc[idx]["space_group_symmetry"])

        print(f'    COD-IDs: {df_stru_catalog.iloc[idx]["Label"].rsplit(".",1)[0]}, composition: {compo[0]}, space group: {sgs[0]}')
        if not pd.isna(df_stru_catalog.at[idx, "Similar"]):
            similar_files = extract_filenames(df_stru_catalog.at[idx, "Similar"])
            compo = clean_string(df_stru_catalog.iloc[idx]["composition"])
            sgs = clean_string(df_stru_catalog.iloc[idx]["space_group_symmetry"])
            for jdx in range(len(similar_files)):
                print(f'    COD-IDs: {similar_files[jdx]}, composition: {compo[jdx]}, space group: {sgs[jdx]}')


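N_CPU = 8  # Number of CPU threads used by XGBoost
NUM_SHOW = 5  # Number of top predictions to show

# Load model
bst = xgb.Booster({'nthread': N_CPU})
bst.load_model("MLstructureMining_model.bin")

# Load your data (replace "your_data.csv" with the path to your own PDF data)
data = pd.read_csv("your_data.csv")
data_xgb = xgb.DMatrix(data)

# Load the structure catalog that maps each class index to its label(s)
df_stru_catalog = pd.read_csv("labels.csv", index_col=0)

# Do inference; pred holds one row of class probabilities per input PDF
pred = bst.predict(data_xgb)

# Rank the classes of each row (ascending) and show the top predictions for the first PDF
best_list = np.argsort(pred, axis=1)
show_best(pred[0], best_list[0], df_stru_catalog, NUM_SHOW)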
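The ranking logic in the snippet above can be checked without the model file. The sketch below uses a made-up probability vector (an assumption, not model output) to show why `show_best` takes the last `num_show` entries of the `np.argsort` result and reverses them.

```python
import numpy as np

# Made-up class probabilities for one sample (not model output).
pred_row = np.array([0.05, 0.60, 0.10, 0.25])
num_show = 3

best_list = np.argsort(pred_row)             # ascending: least likely first
top = list(reversed(best_list[-num_show:]))  # most likely first

for rank, idx in enumerate(top):
    print(f"{rank}) class {idx}: {pred_row[idx] * 100:3.1f}%")
```

Since `np.argsort` sorts in ascending order, the most probable classes sit at the end of `best_list`, and reversing the slice puts the best match first.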

Model Card Authors

Emil T. S. Kjær

Model Card Contact

[email protected]

Citation

In review.

BibTeX:

[More Information Needed]