---
tags:
- xgboost
- python
- tabular-classification
- chemistry
- pair-distribution-function
model_file: MLstructureMining_model.bin
metrics:
- accuracy
pipeline_tag: tabular-classification
license: apache-2.0
language:
- en
---

# Model description

MLstructureMining is a tree-based machine learning classifier designed to rapidly match X-ray pair distribution function (PDF) data to prototype patterns from a large database of crystal structures, providing real-time structure characterization by screening vast quantities of data in seconds.

The code used to train the model can be found [HERE](https://github.com/EmilSkaaning/MLstructureMining-workflow), and the Python implementation is available [HERE](https://github.com/EmilSkaaning/MLstructureMining/tree/main) or as the wheel file `mlstructuremining-4.1.0-py3-none-any.whl`.

## Evaluation Results

MLstructureMining was trained on PDFs from 10,833 crystal structures obtained from the [Crystallography Open Database](https://www.crystallography.net/cod/) (COD). The Pearson Correlation Coefficient (PCC) was used to cluster structures with similar PDFs, resulting in a total of 6,062 labels. 100 PDFs were simulated per structure, and the data were split into training, validation, and test sets with an 80/10/10 ratio.

We evaluate the model on accuracy. To test the robustness of MLstructureMining, we use zeroth-order optimization (ZOO) from the [Adversarial Robustness Toolbox](https://github.com/Trusted-AI/adversarial-robustness-toolbox) (ART) library to perform adversarial attacks.

| Metric             | Value |
|--------------------|-------|
| Accuracy           | 91%   |
| Top-3 Accuracy     | 99%   |
| ZOO Accuracy       | 89%   |
| ZOO Top-3 Accuracy | 97%   |

# How to Get Started with the Model

Use the code below to get started with the model.

```python
import numpy as np
import pandas as pd
import xgboost as xgb

# NOTE: clean_string and extract_filenames are helper functions used by
# show_best that are not defined in this snippet. They are assumed to come
# from the MLstructureMining Python implementation linked above and must be
# imported or copied from there before calling show_best.


def show_best(pred: np.ndarray, best_list: np.ndarray, df_stru_catalog: pd.DataFrame, num_show: int) -> None:
    """
    Display the best predictions based on the model output.

    Parameters
    ----------
    pred : np.ndarray
        Predictions from the model.
    best_list : np.ndarray
        List of best predictions.
    df_stru_catalog : pd.DataFrame
        The structure catalog associated with the model.
    num_show : int
        Number of top predictions to show.

    Returns
    -------
    None
    """
    # best_list is sorted in ascending order, so walk the last num_show entries backwards
    for count, idx in enumerate(reversed(best_list[-num_show:])):
        print(f"\n{count}) Probability: {pred[idx]*100:3.1f}%")

        compo = clean_string(df_stru_catalog.iloc[idx]["composition"])
        sgs = clean_string(df_stru_catalog.iloc[idx]["space_group_symmetry"])

        print(f'    COD-IDs: {df_stru_catalog.iloc[idx]["Label"].rsplit(".",1)[0]}, composition: {compo[0]}, space group: {sgs[0]}')

        # Also list structures that were clustered under the same label
        if not pd.isna(df_stru_catalog.at[idx, "Similar"]):
            similar_files = extract_filenames(df_stru_catalog.at[idx, "Similar"])
            for jdx in range(len(similar_files)):
                print(f'    COD-IDs: {similar_files[jdx]}, composition: {compo[jdx]}, space group: {sgs[jdx]}')


N_CPU = 8  # Number of CPU threads used
NUM_SHOW = 5  # Number of top predictions to show

# Load model
bst = xgb.Booster({'nthread': N_CPU})
bst.load_model("MLstructureMining_model.bin")

# Load your data (replace "your_data.csv" with the path to your own file)
data = pd.read_csv("your_data.csv")
data_xgb = xgb.DMatrix(data)

# Load labels (the structure catalog used by show_best)
df_stru_catalog = pd.read_csv("labels.csv", index_col=0)

# Do inference
pred = bst.predict(data_xgb)

# Show the top predictions for the first sample
best_list = np.argsort(pred)  # ascending order along the class axis
show_best(pred[0], best_list[0], df_stru_catalog, NUM_SHOW)
```

# Model Card Authors

Emil T. S. Kjær

# Model Card Contact

emil.thyge.kjaer@gmail.com

# Citation

In review.

**BibTeX:**

```
[More Information Needed]
```
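
# Appendix: Top-k Accuracy Sketch

The evaluation results report both accuracy and top-3 accuracy. As a minimal sketch of how such a metric can be computed, the example below counts a sample as correct when its true class is among the `k` highest-probability classes. It assumes predictions form an `(n_samples, n_classes)` probability array, as indexed in the getting-started code. The function name `top_k_accuracy` and the toy arrays are illustrative assumptions, not part of the MLstructureMining code or its evaluation data.

```python
import numpy as np


def top_k_accuracy(probs: np.ndarray, y_true: np.ndarray, k: int = 3) -> float:
    """Fraction of samples whose true class is among the k most probable classes."""
    # argsort is ascending, so the last k columns hold the k most probable classes
    top_k = np.argsort(probs, axis=1)[:, -k:]
    # A sample is a hit if any of its top-k class indices equals the true label
    hits = (top_k == y_true[:, None]).any(axis=1)
    return float(hits.mean())


# Toy data with made-up probabilities (3 samples, 4 classes); not model output
probs = np.array([
    [0.10, 0.60, 0.25, 0.05],
    [0.50, 0.30, 0.15, 0.05],
    [0.25, 0.05, 0.30, 0.40],
])
y_true = np.array([1, 2, 1])

top1 = top_k_accuracy(probs, y_true, k=1)
top3 = top_k_accuracy(probs, y_true, k=3)
print(f"top-1: {top1:.2f}, top-3: {top3:.2f}")  # top-1: 0.33, top-3: 0.67
```

With `k=1` this reduces to ordinary accuracy, which is why the two metrics can share one function.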