---
base_model: mistralai/Mistral-7B-v0.3
license: agpl-3.0
datasets:
- openfoodfacts/spellcheck-dataset
- openfoodfacts/spellcheck-benchmark
---

# Open Food Facts - Ingredients spellcheck model

When a product is added to the database, all its details, such as allergens, additives, or nutritional values, are either written down by the contributor or automatically extracted from the product pictures using OCR.

However, the information extracted by OCR often contains typos and errors caused by poor-quality pictures: low definition, curved products, light reflections, etc.

To solve this problem, we developed an **Ingredient Spellcheck** 🍊, a model capable of correcting typos in a list of ingredients following a defined guideline. The model, based on [Mistral-7B-v0.3](https://huggingface.co/mistralai/Mistral-7B-v0.3), was fine-tuned on thousands of corrected lists of ingredients extracted from the database.

## Model Details

### Model Description

The Open Food Facts Ingredients Spellcheck is a version of [Mistral-7B-v0.3](https://huggingface.co/mistralai/Mistral-7B-v0.3) fine-tuned on thousands of corrected lists of ingredients extracted from the Open Food Facts (OFF) database.

The training dataset and the evaluation benchmark are both available in the Open Food Facts HF repository:

* **Training dataset:** https://huggingface.co/datasets/openfoodfacts/spellcheck-dataset
* **Evaluation benchmark:** https://huggingface.co/datasets/openfoodfacts/spellcheck-benchmark

The project is currently in development. You can find it in the Open Food Facts GitHub repo. A demo of this model is also available in HF Spaces.

- **Repository:** https://github.com/openfoodfacts/openfoodfacts-ai/tree/develop/spellcheck
- **Demo:** https://huggingface.co/spaces/jeremyarancio/ingredients-spellcheck

## Uses

This model takes the list of ingredients of a product as input and returns the corrected list. It follows a spellcheck guideline, which was used to build the training and evaluation datasets.
You can find this guideline in the [Spellcheck project README](https://github.com/openfoodfacts/openfoodfacts-ai/tree/spellcheck/spellcheck).

To match the training setup, the input list of ingredients needs to be embedded into the following prompt:

```python
def prepare_instruction(text: str) -> str:
    """Prepare instruction prompt for fine-tuning and inference.
    Identical to instruction during training.

    Args:
        text (str): List of ingredients

    Returns:
        str: Instruction.
    """
    instruction = (
        "###Correct the list of ingredients:\n"
        + text
        + "\n\n###Correction:\n"
    )
    return instruction
```

## Training Details

The training information for the model is available in the [CometML Experiment Tracker](https://www.comet.com/jeremyarancio/spellcheck/e223b404168f4d4c8e633cbd0909b60d?compareXAxis=step&experiment-tab=panels&showOutliers=true&smoothing=0&viewId=vhfLDppdrZXnthxtP5Lnb3tep&xAxis=step), along with the other experiments.

The model was trained on AWS SageMaker using an ml.g5.2xlarge instance for 3 epochs.

## Evaluation

The model is evaluated on the [benchmark](https://huggingface.co/datasets/openfoodfacts/spellcheck-benchmark) using a custom evaluation algorithm.

In short, lists of ingredients are separated into 3 parts: *original*, *reference*, *prediction*. Using a [sequence alignment algorithm](https://en.wikipedia.org/wiki/Sequence_alignment) between *original*-*reference* and *original*-*prediction* respectively, we can tell which tokens were supposed to be corrected and which ones were actually corrected. This yields a correction precision and recall.

The complete explanation of the algorithm is available in the [Spellcheck README](https://github.com/openfoodfacts/openfoodfacts-ai/tree/develop/spellcheck#-evaluation-metrics-and-algorithm).
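
To make the idea concrete, here is a minimal sketch of alignment-based scoring. It is not the project's actual evaluation code: it assumes whitespace tokenization, uses Python's `difflib.SequenceMatcher` as a stand-in for the alignment step, and assumes "localisation" means editing the right position regardless of the replacement. The helper names (`token_edits`, `ratio`, `score`) are hypothetical.

```python
from difflib import SequenceMatcher


def token_edits(original: list[str], corrected: list[str]) -> dict:
    """Map each edited span of `original` to the tokens that replace it."""
    edits = {}
    matcher = SequenceMatcher(a=original, b=corrected, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        if tag == "replace" and i2 - i1 == j2 - j1:
            # Same-length replacement: record one edit per original token.
            for k in range(i2 - i1):
                edits[(i1 + k, i1 + k + 1)] = (corrected[j1 + k],)
        else:
            # Insertion, deletion, or uneven replacement: keep the span whole.
            edits[(i1, i2)] = tuple(corrected[j1:j2])
    return edits


def ratio(hits: int, total: int) -> float:
    # Arbitrary convention for this sketch: 0.0 when there is nothing to count.
    return hits / total if total else 0.0


def score(original: str, reference: str, prediction: str) -> dict:
    """Compare the edits made by the prediction with those of the reference."""
    tokens = original.split()  # assumption: whitespace tokenization
    ref_edits = token_edits(tokens, reference.split())
    pred_edits = token_edits(tokens, prediction.split())

    # Localisation: the prediction edited a span the reference also edited.
    located = [span for span in pred_edits if span in ref_edits]
    # Correction: same span and same replacement tokens as the reference.
    corrected = [s for s in located if pred_edits[s] == ref_edits[s]]

    return {
        "correction_precision": ratio(len(corrected), len(pred_edits)),
        "correction_recall": ratio(len(corrected), len(ref_edits)),
        "localisation_precision": ratio(len(located), len(pred_edits)),
        "localisation_recall": ratio(len(located), len(ref_edits)),
    }


scores = score(
    original="farine de bl, sucre, huille de palme, sel",
    reference="farine de blé, sucre, huile de palme, sel",
    prediction="farine de blé, sucre, huil de palme, sal",
)
print(scores)
```

In this toy example the prediction fixes `bl,` correctly, edits `huille` at the right place but with the wrong replacement, and wrongly changes `sel`. Both reference edits are located (localisation recall 1.0), but only one of them is corrected (correction recall 0.5).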

### Metrics:

* Correction precision: **0.67**
* Correction recall: **0.62**
* Localisation precision: **0.75**
* Localisation recall: **0.69**

## Additional links:

* Open Food Facts website: https://world.openfoodfacts.org/discover
* Open Food Facts GitHub: https://github.com/openfoodfacts