---
license: other
license_name: falcon-llm-license
license_link: https://falconllm.tii.ae/falcon3/falcon-3-terms-and-conditions.html
base_model:
- tiiuae/Falcon3-10B-Instruct-1.58bit
---

# Quantized Falcon-3 10B Instruct 1.58bit

This is my 1.58-bit quantization, produced locally by following the instructions for the BitNet library. I tested the results locally and am now sharing the quantized model here.

Please note that the quantized model's performance is comparable to or lower than that of smaller dense models such as Llama 3.2 1B. While I've included evaluation results for comparison, the model may not match its dense counterpart or similarly sized models. I've shared it nonetheless so that others can verify and explore its capabilities.

On my Ryzen 7900 with a 3060 (12 GB VRAM, not used according to the output), I got this result:

```sh
$ python run_inference.py -m models/Falcon3-10B-Instruct-1.58bit/ggml-model-i2_s.gguf -p "how many letters 'a' do we have in the word 'Ananas'"
build: 3954 (957b59d2) with cc (Debian 12.2.0-14) 12.2.0 for x86_64-linux-gnu
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.name str = Falcon3-10B-Instruct-1.58bit
llama_model_loader: - kv 2: llama.block_count u32 = 40
llama_model_loader: - kv 3: llama.context_length u32 = 32768
llama_model_loader: - kv 4: llama.embedding_length u32 = 3072
llama_model_loader: - kv 5: llama.feed_forward_length u32 = 23040
llama_model_loader: - kv 6: llama.attention.head_count u32 = 12
llama_model_loader: - kv 7: llama.attention.head_count_kv u32 = 4
llama_model_loader: - kv 8: llama.rope.freq_base f32 = 1000042.000000
llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 10: general.file_type u32 = 40
llama_model_loader: - kv 11: llama.vocab_size u32 = 131072
llama_model_loader: - kv 12: llama.rope.dimension_count u32 = 256
llama_model_loader: - kv 13: tokenizer.ggml.add_space_prefix bool = false
llama_model_loader: - kv 14: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 15: tokenizer.ggml.pre str = falcon3
llama_model_loader: - kv 16: tokenizer.ggml.tokens arr[str,131072] = [">>TITLE<<", ">>ABSTRACT<<", ">>INTR...
llama_model_loader: - kv 17: tokenizer.ggml.token_type arr[i32,131072] = [3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
llama_model_loader: - kv 18: tokenizer.ggml.merges arr[str,128810] = ["N E", "Ġ Ġ", "Ġ t", "Ġ a", "> >...
llama_model_loader: - kv 19: tokenizer.ggml.eos_token_id u32 = 11
llama_model_loader: - kv 20: tokenizer.ggml.padding_token_id u32 = 2023
llama_model_loader: - kv 21: tokenizer.chat_template str = {% if tools %}{% for message in messa...
llama_model_loader: - kv 22: general.quantization_version u32 = 2
llama_model_loader: - type f32: 81 tensors
llama_model_loader: - type f16: 2 tensors
llama_model_loader: - type i2_s: 280 tensors
# other stuff
llm_load_vocab: special tokens cache size = 2024
llm_load_vocab: token to piece cache size = 0.8741 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 131072
llm_load_print_meta: n_merges = 128810
llm_load_print_meta: vocab_only = 0
llm_load_print_meta: n_ctx_train = 32768
llm_load_print_meta: n_embd = 3072
llm_load_print_meta: n_layer = 40
llm_load_print_meta: n_head = 12
llm_load_print_meta: n_head_kv = 4
llm_load_print_meta: n_rot = 256
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_embd_head_k = 256
llm_load_print_meta: n_embd_head_v = 256
llm_load_print_meta: n_gqa = 3
llm_load_print_meta: n_embd_k_gqa = 1024
llm_load_print_meta: n_embd_v_gqa = 1024
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-06
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 23040
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 1000042.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 32768
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: ssm_dt_b_c_rms = 0
llm_load_print_meta: model type = 13B
llm_load_print_meta: model ftype = I2_S - 2 bpw ternary
llm_load_print_meta: model params = 10.31 B
llm_load_print_meta: model size = 3.71 GiB (3.09 BPW)
llm_load_print_meta: general.name = Falcon3-10B-Instruct-1.58bit
llm_load_print_meta: BOS token = 11 '<|endoftext|>'
llm_load_print_meta: EOS token = 11 '<|endoftext|>'
llm_load_print_meta: EOT token = 11 '<|endoftext|>'
llm_load_print_meta: PAD token = 2023 '<|pad|>'
llm_load_print_meta: LF token = 2150 'Ä'
llm_load_print_meta: EOG token = 11 '<|endoftext|>'
llm_load_print_meta: max token length = 256
llm_load_tensors: ggml ctx size = 0.17 MiB
llm_load_tensors: CPU buffer size = 3801.96 MiB
..............................................................
llama_new_context_with_model: n_batch is less than GGML_KQ_MASK_PAD - increasing to 32
llama_new_context_with_model: n_ctx = 2048
llama_new_context_with_model: n_batch = 32
llama_new_context_with_model: n_ubatch = 32
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 1000042.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CPU KV buffer size = 320.00 MiB
llama_new_context_with_model: KV self size = 320.00 MiB, K (f16): 160.00 MiB, V (f16): 160.00 MiB
llama_new_context_with_model: CPU output buffer size = 0.50 MiB
llama_new_context_with_model: CPU compute buffer size = 16.38 MiB
llama_new_context_with_model: graph nodes = 1286
llama_new_context_with_model: graph splits = 1
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 2
system_info: n_threads = 2 (n_threads_batch = 2) / 24 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | RISCV_VECT = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
sampler seed: 4226296889
sampler params:
	repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
	top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
	mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> top-k -> tail-free -> typical -> top-p -> min-p -> temp-ext -> softmax -> dist
generate: n_ctx = 2048, n_batch = 1, n_predict = 128, n_keep = 0

how many letters 'a' do we have in the word 'Ananas' in Python?
<|assistant|>
In Python, you can use the built-in count() function in the str class to count the number of occurrences of a specific character in a string. Here is a code snippet that accomplishes this:

``python
word = 'Ananas'
count = word.count('a')
print(count)
``

When you run this code, it will output:

``
3
``

This means that in the word 'Ananas', there are 3 occurrences of the letter 'a'. [end of text]

llama_perf_sampler_print: sampling time = 5.12 ms / 144 runs ( 0.04 ms per token, 28130.49 tokens per second)
llama_perf_context_print: load time = 336.46 ms
llama_perf_context_print: prompt eval time = 1775.89 ms / 18 tokens ( 98.66 ms per token, 10.14 tokens per second)
llama_perf_context_print: eval time = 12286.29 ms / 125 runs ( 98.29 ms per token, 10.17 tokens per second)
llama_perf_context_print: total time = 14075.18 ms / 143 tokens
```

10.17 t/s at eval time, and the model got the right answer.
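For the record, 3 is the right count when both 'A' and 'a' are included; a strictly case-sensitive count, like the `count('a')` call in the generated snippet, gives 2. A quick sanity check:

```js
const word = "Ananas";

// Case-insensitive: counts 'A' and 'a'
console.log((word.match(/a/gi) || []).length); // 3

// Case-sensitive: counts only lowercase 'a'
console.log((word.match(/a/g) || []).length); // 2
```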
Let's try another one:

```sh
python run_inference.py -m models/Falcon3-10B-Instruct-1.58bit/ggml-model-i2_s.gguf -p "In our production code, we're building a utility function for generating totals with default values for missing parameters. However, the behavior is not as expected. Here's a simplified version of the code:\n\n```js\nexport const calculateTotal = (base, multiplier = 2, increment = 2) => {\n return base * multiplier + increment;\n};\n\nconst data = {\n baseAmount: 10,\n extraParams: [3, 5] // These should map to multiplier and increment\n};\n\nconst total = calculateTotal(data.baseAmount, ...data.extraParams);\nconsole.log(total); // Expecting 12, but the output seems wrong\n```\n\nCan you help figure out why this is happening and how to fix it?"
# Same as above, just changed the prompt
sampler seed: 2694760223
sampler params:
	repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
	top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
	mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> top-k -> tail-free -> typical -> top-p -> min-p -> temp-ext -> softmax -> dist
generate: n_ctx = 2048, n_batch = 1, n_predict = 128, n_keep = 0

In our production code, we're building a utility function for generating totals with default values for missing parameters. However, the behavior is not as expected. Here's a simplified version of the code: Can you help figure out why this is happening and how to fix it? I'm using Visual Studio 2005 (2003.01).
<|assistant|>
This error is caused by the fact that you're trying to call a virtual method that does not exist in the base class. You're trying to call `GetCount` in the `IMyDataContext` class, but this method does not exist in the base class `IMyDataContext`. This is because `GetCount` is a virtual function in `IMyDataContext`, and it's meant to be overridden in the

llama_perf_sampler_print: sampling time = 5.50 ms / 188 runs ( 0.03 ms per token, 34212.92 tokens per second)
llama_perf_context_print: load time = 318.08 ms
llama_perf_context_print: prompt eval time = 5935.47 ms / 60 tokens ( 98.92 ms per token, 10.11 tokens per second)
llama_perf_context_print: eval time = 12688.14 ms / 127 runs ( 99.91 ms per token, 10.01 tokens per second)
llama_perf_context_print: total time = 18638.24 ms / 187 tokens
```

It hallucinated and gave a wrong answer.
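For reference, the expected explanation is that the spread arguments override the default parameters: spreading `[3, 5]` fills the `multiplier` and `increment` slots, so the call returns 35 (10 * 3 + 5) rather than falling back to the defaults. A minimal sketch of that behavior:

```js
const calculateTotal = (base, multiplier = 2, increment = 2) => base * multiplier + increment;

const data = { baseAmount: 10, extraParams: [3, 5] };

// The spread fills the second and third positional parameters, overriding both defaults
console.log(calculateTotal(data.baseAmount, ...data.extraParams)); // 35 (10 * 3 + 5)

// Defaults only apply when the corresponding argument is missing (or undefined)
console.log(calculateTotal(data.baseAmount)); // 22 (10 * 2 + 2)
```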
Last try:

```sh
# Same as above, just changed the prompt
sampler seed: 4052516490
sampler params:
	repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
	top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
	mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> top-k -> tail-free -> typical -> top-p -> min-p -> temp-ext -> softmax -> dist
generate: n_ctx = 2048, n_batch = 1, n_predict = 128, n_keep = 0

Provide a summary with about two sentences for the following article: The efficient-market hypothesis (EMH) is a hypothesis in financial economics that states that asset prices reflect all available information. A direct implication is that it is impossible to “beat the market” consistently on a risk-adjusted basis since market prices should only react to new information. Because the EMH is formulated in terms of risk adjustment, it only makes testable predictions when coupled with a particular model of risk. As a result, research in financial economics since at least the 1990s has focused on market anomalies, that is, deviations from specific models of risk. The idea that financial market returns are difficult to predict goes back to Bachelier, Mandelbrot, and Samuelson, but is closely associated with Eugene Fama, in part due to his influential 1970 review of the theoretical and empirical research. The EMH provides the basic logic for modern risk-based theories of asset prices, and frameworks such as consumption-based asset pricing and intermediary asset pricing can be thought of as the combination of a model of risk with the EMH. Many decades of empirical research on return predictability has found mixed evidence. Research in the 1950s and 1960s often found a lack of predictability (e.g. Ball and Brown 1968; Fama, Fisher, Jensen, and Roll 1969), yet the 1980s-2000s saw an explosion of discovered return predictors (e.g. Rosenberg, Reid, and Lanstein 1985; Campbell and Shiller 1988; Jegadeesh and Titman 1993). Since the 2010s, studies have often found that return predictability has become more elusive, as predictability fails to work out-of-sample (Goyal and Welch 2008), or has been weakened by advances in trading technology and investor learning (Chordia, Subrahmanyam, and Tong 2014; McLean and Pontiff 2016; Martineau 2021). Summary:
<|assistant|>
The Efficient Market Hypothesis (EMH) is a financial economics theory that suggests that asset prices reflect all available information. Since the EMH is formulated in terms of risk adjustment, it only makes testable predictions when coupled with a particular model of risk. Today, research in financial economics since the 1990s has focused on market anomalies, that is, deviations from specific models of risk. [end of text]

llama_perf_sampler_print: sampling time = 3.88 ms / 539 runs ( 0.01 ms per token, 139060.89 tokens per second)
llama_perf_context_print: load time = 320.65 ms
llama_perf_context_print: prompt eval time = 45040.32 ms / 452 tokens ( 99.65 ms per token, 10.04 tokens per second)
llama_perf_context_print: eval time = 8845.43 ms / 86 runs ( 102.85 ms per token, 9.72 tokens per second)
llama_perf_context_print: total time = 53896.92 ms / 538 tokens
```

Not bad: it stuck to the prompt this time and gave a good summary.

## License

This model is licensed under the TII Falcon License. See the [LICENSE](https://falconllm.tii.ae/falcon3/falcon-3-terms-and-conditions.html) for details.

## Quantization Details

- **Quantization Type**: i2_s (2-bpw ternary weights with symmetric quantization)
- **Bits**: 1.58
- **Group Size**: 128
- **Quantization Tool**: [BitNet](https://github.com/microsoft/BitNet)

## How to Use

As of today, llama.cpp does not run this model; you can wait until it adds support, or use BitNet as described below.

### BitNet

```bash
git clone https://github.com/microsoft/BitNet.git
cd BitNet

# They recommend conda, but I tried venv and it worked fine on Debian 12
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt

# You can skip this one:
# python setup_env.py --hf-repo your_hf_username/Falcon3-10B-Instruct-1.58bit -q i2_s

# Move the model to the models folder, then:
python run_inference.py -m models/Falcon3-10B-Instruct-1.58bit/ggml-model-i2_s.gguf -p "What is 1.58bit quantization in LLM and why its iteresting for gpu poor people?" -cnv
# 1.58-bit quantization is a method in which floating-point numbers in a neural network are represented using fewer bits, specifically 1.58 bits. This technique reduces the number of bits needed to represent a number, which can lead to improved performance and lower memory usage.
```

## Evaluation

As provided [here](https://huggingface.co/tiiuae/Falcon3-10B-Instruct-1.58bit#benchmarks) and [here](https://huggingface.co/tiiuae/Falcon3-1B-Instruct#benchmarks).

| Benchmark | Llama-3.2-1B | Qwen2.5-1.5B | Llama3-8B-1.58-100B-tokens | Falcon3-10B-Instruct-1.58bit |
|---|---|---|---|---|
| IFEval | **55.8** | 44.4 | 17.91 | 54.37 |
| MUSR | 32.4 | **36.8** | 4.87 | 2.57 |
| GPQA | 25.3 | **29.6** | 1.83 | 4.27 |
| BBH | 30.3 | **38.5** | 5.36 | 6.59 |
| MMLU-PRO | 11.3 | **21.3** | 2.78 | 6.62 |
| MATH | **3.9** | 0.2 | 0.26 | 2.44 |

## Useful Links

- View the [release blogpost](https://huggingface.co/blog/falcon3).

## Contributions

Contributions are welcome! If you encounter any issues or have suggestions for improvement, please submit an issue or pull request.