	---
	tags:
	- xgboost
	- python
	- tabular-classification
	- chemistry
	- pair-distribution-function
	model_file: MLstructureMining_model.bin
	metrics:
	- accuracy
	pipeline_tag: tabular-classification
	license: apache-2.0
	language:
	- en
	---

	# Model description
	MLstructureMining is a tree-based machine learning classifier designed to rapidly match X-ray pair distribution function (PDF) data to prototype
	patterns from a large database of crystal structures, providing real-time structure characterization by screening vast quantities of data in seconds.

	The code used to train the model is available [here](https://github.com/EmilSkaaning/MLstructureMining-workflow), and the Python implementation is available
	[here](https://github.com/EmilSkaaning/MLstructureMining/tree/main) or as the wheel file `mlstructuremining-4.1.0-py3-none-any.whl`.

	## Evaluation Results

	MLstructureMining was trained on PDFs from 10,833 crystal structures obtained from the [Crystallography Open Database](https://www.crystallography.net/cod/) (COD).
	The Pearson correlation coefficient (PCC) was used to cluster structures with similar PDFs, resulting in a total of 6,062 labels. 100 PDFs were simulated per structure,
	and the data were split into training, validation, and test sets with an 80/10/10 ratio.
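
The grouping of structures by PCC can be illustrated with a minimal sketch. This is not the training workflow's implementation: the greedy grouping strategy, the `group_by_pcc` name, the `0.9` threshold, and the toy signals are all assumptions made for illustration, since the card does not state the clustering procedure or cut-off.

```python
import numpy as np


def group_by_pcc(pdfs: np.ndarray, threshold: float = 0.9) -> np.ndarray:
    """Greedily assign one label per group of similar PDFs.

    pdfs has shape (n_structures, n_points). A structure joins the first
    existing group whose representative PDF has PCC >= threshold with it;
    otherwise it starts a new group. The threshold is a hypothetical value.
    """
    pcc = np.corrcoef(pdfs)  # pairwise Pearson matrix, shape (n, n)
    labels = np.full(len(pdfs), -1, dtype=int)
    representatives = []  # index of the first member of each group
    for i in range(len(pdfs)):
        for label, rep in enumerate(representatives):
            if pcc[i, rep] >= threshold:
                labels[i] = label
                break
        else:
            # No existing group is similar enough: open a new one.
            labels[i] = len(representatives)
            representatives.append(i)
    return labels


# Toy signals standing in for simulated PDFs (not real data).
r = np.linspace(0.0, 30.0, 301)
toy_pdfs = np.stack([
    np.sin(r),
    1.5 * np.sin(r) + 0.01 * np.cos(7 * r),  # almost the same shape as the first
    np.sin(2.3 * r),                         # clearly different shape
])
labels = group_by_pcc(toy_pdfs, threshold=0.9)
print(labels)
```

Because PCC is invariant to scaling and offset, the first two toy signals end up in the same group even though their amplitudes differ, which is the property that makes it a natural similarity measure for PDFs.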

	We evaluate the model on accuracy. To test the robustness of MLstructureMining, we use the zeroth-order optimization (ZOO) attack
	from the [Adversarial Robustness Toolbox](https://github.com/Trusted-AI/adversarial-robustness-toolbox) (ART) library to perform adversarial attacks.

	| Metric             | Value |
	|--------------------|-------|
	| Accuracy           | 91%   |
	| Top-3 Accuracy     | 99%   |
	| ZOO Accuracy       | 89%   |
	| ZOO Top-3 Accuracy | 97%   |
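
For readers unfamiliar with the top-3 metric, the sketch below shows one common way to compute top-k accuracy from a matrix of class probabilities. The `top_k_accuracy` helper and the toy numbers are illustrative assumptions, not the evaluation code behind the table.

```python
import numpy as np


def top_k_accuracy(proba: np.ndarray, y_true: np.ndarray, k: int) -> float:
    """Fraction of rows whose true label is among the k most probable classes."""
    # argsort is ascending, so the last k columns hold the k most probable classes.
    top_k = np.argsort(proba, axis=1)[:, -k:]
    hits = (top_k == y_true[:, None]).any(axis=1)
    return float(hits.mean())


# Toy probabilities for 4 samples and 4 classes (not model output).
proba = np.array([
    [0.70, 0.20, 0.05, 0.05],  # true class 0: top-1 hit
    [0.10, 0.50, 0.30, 0.10],  # true class 2: top-1 miss, top-3 hit
    [0.40, 0.30, 0.20, 0.10],  # true class 3: missed even in the top 3
    [0.05, 0.05, 0.10, 0.80],  # true class 3: top-1 hit
])
y_true = np.array([0, 2, 3, 3])

top1 = top_k_accuracy(proba, y_true, k=1)
top3 = top_k_accuracy(proba, y_true, k=3)
print(f"top-1: {top1:.2f}, top-3: {top3:.2f}")
```

With `k=1` this reduces to ordinary accuracy, so the same function covers both rows of each pair in the table.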

	# How to Get Started with the Model

	Use the code below to get started with the model.

	```python
	import numpy as np
	import pandas as pd
	import xgboost as xgb

	# NOTE: clean_string and extract_filenames are helper functions that this snippet
	# calls but does not define. They are assumed to come from the MLstructureMining
	# Python implementation linked above and must be available before running it.

	def show_best(pred: np.ndarray,
	              best_list: np.ndarray,
	              df_stru_catalog: pd.DataFrame,
	              num_show: int) -> None:
	    """
	    Display the best predictions based on the model output.

	    Parameters
	    ----------
	    pred : np.ndarray
	        Predicted class probabilities for a single sample.
	    best_list : np.ndarray
	        Class indices sorted by ascending probability.
	    df_stru_catalog : pd.DataFrame
	        The structure catalog associated with the model.
	    num_show : int
	        Number of top predictions to show.

	    Returns
	    -------
	    None
	    """
	    # best_list is ascending, so take the last num_show entries in reverse order.
	    for count, idx in enumerate(reversed(best_list[-num_show:])):
	        print(f"\n{count}) Probability: {pred[idx]*100:3.1f}%")

	        compo = clean_string(df_stru_catalog.iloc[idx]["composition"])
	        sgs = clean_string(df_stru_catalog.iloc[idx]["space_group_symmetry"])

	        print(f' COD-IDs: {df_stru_catalog.iloc[idx]["Label"].rsplit(".",1)[0]}, composition: {compo[0]}, space group: {sgs[0]}')
	        # Also list structures that were clustered under the same label, if any.
	        if not pd.isna(df_stru_catalog.at[idx, "Similar"]):
	            similar_files = extract_filenames(df_stru_catalog.at[idx, "Similar"])
	            for jdx in range(len(similar_files)):
	                print(f' COD-IDs: {similar_files[jdx]}, composition: {compo[jdx]}, space group: {sgs[jdx]}')


	N_CPU = 8  # Number of CPU threads used
	NUM_SHOW = 5  # Number of top predictions to show

	# Load model
	bst = xgb.Booster({'nthread': N_CPU})
	bst.load_model("MLstructureMining_model.bin")

	# Load your data ("your_data.csv" is a placeholder; replace it with your own file)
	data = pd.read_csv("your_data.csv")
	data_xgb = xgb.DMatrix(data)

	# Load labels (the structure catalog)
	df_stru_catalog = pd.read_csv("labels.csv", index_col=0)

	# Do inference
	pred = bst.predict(data_xgb)

	# Show the top predictions for the first sample
	best_list = np.argsort(pred)
	show_best(pred[0], best_list[0], df_stru_catalog, NUM_SHOW)
	```
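
The ranking step at the end of the snippet can be tried without the model or the catalog helpers. The sketch below is a simplified stand-in, assuming a single row of class probabilities and a plain list of label names; `top_n`, the label names, and the probabilities are hypothetical and not part of the MLstructureMining package.

```python
from typing import List, Sequence, Tuple

import numpy as np


def top_n(pred_row: np.ndarray, label_names: Sequence[str], n: int) -> List[Tuple[str, float]]:
    """Return the n most probable labels with their probabilities, best first."""
    # argsort is ascending; reverse it and keep the first n indices.
    order = np.argsort(pred_row)[::-1][:n]
    return [(label_names[i], float(pred_row[i])) for i in order]


# Toy probabilities for one sample over four hypothetical labels.
pred_row = np.array([0.05, 0.60, 0.25, 0.10])
names = ["label_a", "label_b", "label_c", "label_d"]

ranked = top_n(pred_row, names, n=3)
for rank, (name, prob) in enumerate(ranked, start=1):
    print(f"{rank}) {name}: {prob * 100:.1f}%")
```

Applied to real output, `pred_row` would be one row of the array returned by `bst.predict`, and `label_names` would come from the label catalog.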


	# Model Card Authors

	Emil T. S. Kjær

	# Model Card Contact

	[email protected]

	# Citation

	In review.

	BibTeX:
	```
	[More Information Needed]
	```