---
license: creativeml-openrail-m
datasets:
- HuggingFaceTB/smoltalk
language:
- en
base_model:
- meta-llama/Llama-3.2-1B-Instruct
pipeline_tag: text-generation
library_name: OpenVINO
tags:
- Llama
- SmolTalk
- openvino
---

The **Llama-SmolTalk-3.2-1B-Instruct** model is a lightweight, instruction-tuned model designed for efficient text generation and conversational AI tasks. With a 1B-parameter architecture, this model strikes a balance between performance and resource efficiency, making it ideal for applications requiring concise, contextually relevant outputs. The model has been fine-tuned to deliver robust instruction-following capabilities, catering to both structured and open-ended queries.

### Key Features:
1. **Instruction-Tuned Performance**: Optimized to understand and execute user-provided instructions across diverse domains.
2. **Lightweight Architecture**: With just 1 billion parameters, the model provides efficient computation and storage without compromising output quality.
3. **Versatile Use Cases**: Suitable for tasks like content generation, conversational interfaces, and basic problem-solving.

### Intended Applications:
- **Conversational AI**: Engage users with dynamic and contextually aware dialogue.
- **Content Generation**: Produce summaries, explanations, or other creative text outputs efficiently.
- **Instruction Execution**: Follow user commands to generate precise and relevant responses.

### Technical Details:
The model leverages the OpenVINO IR format for inference, with a tokenizer optimized for seamless text input processing.

#### Dataset description

This is a synthetic dataset designed for supervised finetuning (SFT) of LLMs. It was used to build the [SmolLM2-Instruct](https://huggingface.co/HuggingFaceTB/SmolLM2-1.7B-Instruct) family of models and contains 1M samples.

During the development of SmolLM2, we observed that models finetuned on public SFT datasets underperformed compared to other models with proprietary instruction datasets. To address this gap, we created new synthetic datasets that improve instruction following while covering diverse tasks including text editing, rewriting, summarization, and reasoning. Through a series of data ablations at 1.7B scale, we enhanced our SFT mix by incorporating public datasets to strengthen specific capabilities such as mathematics, coding, system prompt following and long-context understanding.

All the new datasets were generated with [distilabel](https://github.com/argilla-io/distilabel) and you can find the generation code here: https://github.com/huggingface/smollm/tree/main/distilabel_pipelines.

#### Dataset composition

The mix consists of:

**New datasets**
- *Smol-Magpie-Ultra*: the core component of our mix, consisting of 400K samples generated using the Magpie pipeline with Llama-3.1-405B-Instruct. We also heavily curate and filter this dataset compared to the original Magpie-Pro pipeline. SmolLM models trained on this dataset alone outperform those trained on popular public datasets like OpenHermes and Magpie Pro across key benchmarks including IFEval and MT-Bench.
- Smol-constraints: a 36K-sample dataset that trains models to follow specific constraints, such as generating responses with a fixed number of sentences or words, or incorporating specified words in the output. The dataset has been decontaminated against IFEval to prevent overlap.
- Smol-rewrite: a 50K-sample collection focused on text rewriting tasks, such as adjusting tone to be more friendly or professional.
  Note that Smol-Magpie-Ultra also includes some rewriting, editing, and summarization examples.
- Smol-summarize: a 100K-sample dataset specialized in email and news summarization.

**Existing public datasets**

To enhance capabilities in mathematics, coding, system prompts, and long-context understanding, we fine-tuned SmolLM2-1.7B on various public SFT datasets and included subsets of the best performing ones using tuned ratios. These include:

- OpenHermes2.5: we added 100K samples from [OpenHermes2.5](https://huggingface.co/datasets/teknium/OpenHermes-2.5), since we found that it helps preserve and boost benchmarks such as MMLU, WinoGrande, and BBH.
- MetaMathQA: we add this [dataset](https://huggingface.co/datasets/meta-math/MetaMathQA) to improve the model on mathematics and reasoning; we include 50K random samples.
- NuminaMath-CoT: we find that this [dataset](https://huggingface.co/datasets/AI-MO/NuminaMath-CoT) helps on mathematics, especially the hard problems found in benchmarks such as MATH.
- Self-Oss-Starcoder2-Instruct: we use this [dataset](https://huggingface.co/datasets/bigcode/self-oss-instruct-sc2-exec-filter-50k) to improve coding capabilities.
- SystemChats2.0: to make the model support a variety of system prompt formats we add 30K samples from the [SystemChat-2.0](https://huggingface.co/datasets/cognitivecomputations/SystemChat-2.0) dataset. Note that the Smol-rewrite and Smol-summarize datasets also include system prompts.
- LongAlign: we find that finetuning the model on only short samples makes it lose long-context abilities beyond 2048 tokens, so we add English samples (with fewer than 16k tokens) from the [LongAlign-10k](https://huggingface.co/datasets/THUDM/LongAlign-10k) dataset and train with an 8192-token sequence length.
- Everyday-conversations: this [dataset](https://huggingface.co/datasets/HuggingFaceTB/everyday-conversations-llama3.1-2k) includes multi-turn everyday conversations such as greetings and was used in SmolLM v1 post-training.
- APIGen-Function-Calling: we use 80K samples from [apigen-function-calling](https://huggingface.co/datasets/argilla/apigen-function-calling), which is a mix of the [Synth-APIGen-v0.1](https://huggingface.co/datasets/argilla/Synth-APIGen-v0.1) and [xlam-function-calling-60k](https://huggingface.co/datasets/Salesforce/xlam-function-calling-60k) datasets.
- Explore-Instruct-Rewriting: 30K samples from this rewriting [dataset](https://huggingface.co/datasets/Wanfq/Explore_Instruct_Rewriting_32k).

You can find the code for generating the new datasets with [distilabel](https://github.com/argilla-io/distilabel) here: https://github.com/huggingface/smollm. The ablation details will be included in an upcoming blog post.

#### License

All the new datasets (Smol-Magpie-Ultra, Smol-constraints, Smol-rewrite, Smol-summarize) are licensed under [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0).
For the existing public datasets, please refer to the original datasets listed in [Dataset composition](#dataset-composition) for their licenses.

---

## Prompt format

```
<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 26 July 2024

{system_prompt}<|eot_id|><|start_header_id|>user<|end_header_id|>

{prompt}<|eot_id|><|start_header_id|>assistant<|end_header_id|>
```

## This is the OpenVINO IR format of the model, quantized in int8

The model was created with the Optimum-Intel library CLI command.

#### Dependencies required to create the model

There is an open clash between the dependency versions of optimum-intel and openvino-genai:

> ⚠️ Exporting tokenizers to OpenVINO is not supported for tokenizers version > 0.19 and openvino version <= 2024.4. Please downgrade to tokenizers version <= 0.19 to export tokenizers to OpenVINO.

So for the model conversion the only dependencies you need are:

```
pip install -U "openvino>=2024.3.0" "openvino-genai"
pip install "torch>=2.1" "nncf>=2.7" "transformers>=4.40.0" "onnx<1.16.2" "optimum>=1.16.1" "accelerate" "datasets>=2.14.6" "git+https://github.com/huggingface/optimum-intel.git" --extra-index-url https://download.pytorch.org/whl/cpu
```

The instructions are from the amazing [OpenVINO notebooks](https://docs.openvino.ai/2024/notebooks/llm-question-answering-with-output.html#prerequisites).
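If you want to compare what actually got resolved in your environment against the version list further down, the installed versions can be checked with the Python standard library alone; a minimal sketch (PyPI package names as installed by the commands above):

```python
from importlib.metadata import version, PackageNotFoundError

# Packages pinned by the install commands above (names as published on PyPI)
for pkg in ("openvino", "openvino-genai", "openvino-tokenizers", "optimum",
            "optimum-intel", "transformers", "tokenizers", "nncf"):
    try:
        print(f"{pkg}=={version(pkg)}")
    except PackageNotFoundError:
        print(f"{pkg} is not installed")
```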
Note that a vanilla `pip install` of these packages (without the constraints and extra index above) will create clashes among dependency versions.
This command will install, among others:

```
tokenizers==0.20.3
torch==2.5.1+cpu
transformers==4.46.3
nncf==2.14.0
numpy==2.1.3
onnx==1.16.1
openvino==2024.5.0
openvino-genai==2024.5.0.0
openvino-telemetry==2024.5.0
openvino-tokenizers==2024.5.0.0
optimum==1.23.3
optimum-intel @ git+https://github.com/huggingface/optimum-intel.git@c454b0000279ac9801302d726fbbbc1152733315
```

#### How to quantize the original model

After the previous step you are able to run the following command (assuming you downloaded all the model weights and files into a subfolder called `Llama-SmolTalk-3.2-1B-Instruct` from the [official model repository](https://huggingface.co/prithivMLmods/Llama-SmolTalk-3.2-1B-Instruct)):

```bash
optimum-cli export openvino --model .\Llama-SmolTalk-3.2-1B-Instruct\ --task text-generation-with-past --trust-remote-code --weight-format int8 ov_Llama-SmolTalk-3.2-1B-Instruct
```

This will start the export process and print a stream of log messages; it should complete without any fatal error.
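If you prefer to do the conversion from Python instead of the CLI, optimum-intel's `OVModelForCausalLM` can export and int8-quantize the weights in one call. A minimal sketch (paths illustrative; note that the CLI route above also converts the tokenizer into the OpenVINO format that `openvino-genai` expects, so treat this as an Optimum-side alternative rather than a full replacement):

```python
from optimum.intel import OVModelForCausalLM
from transformers import AutoTokenizer

src = r".\Llama-SmolTalk-3.2-1B-Instruct"   # local copy of the original model
dst = "ov_Llama-SmolTalk-3.2-1B-Instruct"   # output folder for the OpenVINO IR

# export=True converts the PyTorch checkpoint to OpenVINO IR on the fly;
# load_in_8bit=True applies int8 weight-only quantization (via NNCF)
model = OVModelForCausalLM.from_pretrained(src, export=True, load_in_8bit=True)
model.save_pretrained(dst)

# keep the Hugging Face tokenizer files next to the converted model
AutoTokenizer.from_pretrained(src).save_pretrained(dst)
```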

#### Dependencies required to run the model with `openvino-genai`

If you simply need to run models already converted to the OpenVINO IR format, you only need to install openvino-genai:

```
pip install openvino-genai==2024.5.0
```

## How to use the model with openvino-genai

This follows the official tutorial at [https://docs.openvino.ai/2024/notebooks/llm-question-answering-with-output.html](https://docs.openvino.ai/2024/notebooks/llm-question-answering-with-output.html), with changes because here we are using chat templates; refer to [https://huggingface.co/docs/transformers/main/chat_templating](https://huggingface.co/docs/transformers/main/chat_templating).

```python
# MAIN IMPORTS
import warnings
warnings.filterwarnings(action='ignore')
import datetime
import sys
from transformers import AutoTokenizer  # for chat templating
import openvino_genai as ov_genai
import tiktoken


def countTokens(text):
    """
    Use tiktoken to count the number of tokens
    text -> str input
    Return -> int number of tokens counted
    """
    encoding = tiktoken.get_encoding("r50k_base")
    numoftokens = len(encoding.encode(text))
    return numoftokens


# LOADING THE MODEL
print('Loading the model', end='')
model_dir = 'ov_Llama-SmolTalk-3.2-1B-Instruct'
pipe = ov_genai.LLMPipeline(model_dir, 'CPU')
# PROMPT FORMATTING - we use tokenizer chat templating
tokenizer = AutoTokenizer.from_pretrained(model_dir)
print('✅ done')
print('Ready for generation')
print('Starting now Normal Chat based interface with NO TURNS - chat history disabled...')
counter = 1
while True:
    # Reset history ALWAYS - every turn starts from an empty conversation
    history = []
    userinput = ""
    print("\033[1;30m")  # dark grey
    print("Enter your text (end input with Ctrl+D on Unix or Ctrl+Z on Windows) - type quit! to exit the chatroom:")
    print("\033[91;1m")  # red
    lines = sys.stdin.readlines()
    for line in lines:
        userinput += line  # readlines() already keeps the trailing newline
    if lines and "quit!" in lines[0].lower():
        print("\033[0mBYE BYE!")
        break
    history.append({"role": "user", "content": userinput})
    # add_generation_prompt=True appends the assistant header so the model starts replying
    tokenized_chat = tokenizer.apply_chat_template(history, tokenize=False, add_generation_prompt=True)
    # START PIPELINE - eos_token_id 128009 is the Llama 3 <|eot_id|> token
    start = datetime.datetime.now()
    print("\033[92;1m")  # green
    streamer = lambda x: print(x, end='', flush=True)
    output = pipe.generate(tokenized_chat, temperature=0.2, do_sample=True,
                           max_new_tokens=500, repetition_penalty=1.178,
                           streamer=streamer, eos_token_id=128009)
    print('')
    delta = datetime.datetime.now() - start
    totalseconds = delta.total_seconds()
    totaltokens = countTokens(output)
    genspeed = totaltokens / totalseconds
    # PRINT THE STATISTICS
    print('---')
    print(f'Generated in {delta}')
    print(f'🧮 Total number of generated tokens: {totaltokens}')
    print(f'⏱️ Generation time: {totalseconds:.0f} seconds')
    print(f'📈 speed: {genspeed:.2f} t/s')
```
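As an alternative to resetting the history and templating the prompt manually on every turn, `openvino-genai` also exposes a chat mode (`start_chat()` / `finish_chat()`) in which the pipeline keeps the conversation state and applies the chat template stored with the converted tokenizer itself, so `transformers` is not needed at inference time. A minimal sketch under those assumptions (same model folder as above, generation parameters illustrative):

```python
import openvino_genai as ov_genai

model_dir = 'ov_Llama-SmolTalk-3.2-1B-Instruct'
pipe = ov_genai.LLMPipeline(model_dir, 'CPU')

# generation parameters analogous to the script above
config = ov_genai.GenerationConfig()
config.max_new_tokens = 500
config.do_sample = True
config.temperature = 0.2
config.repetition_penalty = 1.178

pipe.start_chat()  # the pipeline now keeps the multi-turn history internally
try:
    while True:
        prompt = input('user > ')
        if prompt.strip().lower() == 'quit!':
            break
        # in chat mode the chat template is applied by the pipeline itself
        pipe.generate(prompt, config, lambda token: print(token, end='', flush=True))
        print()
finally:
    pipe.finish_chat()  # clears the internal chat history
```

In this mode previous turns are remembered until `finish_chat()` is called, which is the opposite of the history-disabled behaviour of the script above.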