Quantized Falcon-3 10B Instruct 1.58bit
This is my 1.58-bit quantization of Falcon3-10B-Instruct, produced locally by following the instructions for the BitNet library. I tested the results locally and am now sharing the quantized model here.
Please note that the quantized model's performance is comparable to, or lower than, smaller dense models such as Llama 3.2 1B. I've included evaluation results for comparison, but the model should not be expected to match its dense counterpart or similarly sized models. I'm sharing it anyway so others can verify and explore its capabilities.
On my Ryzen 7900 with an RTX 3060 12 GB (not used, according to the output), I got this result:
$ python run_inference.py -m models/Falcon3-10B-Instruct-1.58bit/ggml-model-i2_s.gguf -p "how many letters 'a' do we have in the word 'Ananas'"
build: 3954 (957b59d2) with cc (Debian 12.2.0-14) 12.2.0 for x86_64-linux-gnu
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.name str = Falcon3-10B-Instruct-1.58bit
llama_model_loader: - kv 2: llama.block_count u32 = 40
llama_model_loader: - kv 3: llama.context_length u32 = 32768
llama_model_loader: - kv 4: llama.embedding_length u32 = 3072
llama_model_loader: - kv 5: llama.feed_forward_length u32 = 23040
llama_model_loader: - kv 6: llama.attention.head_count u32 = 12
llama_model_loader: - kv 7: llama.attention.head_count_kv u32 = 4
llama_model_loader: - kv 8: llama.rope.freq_base f32 = 1000042.000000
llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 10: general.file_type u32 = 40
llama_model_loader: - kv 11: llama.vocab_size u32 = 131072
llama_model_loader: - kv 12: llama.rope.dimension_count u32 = 256
llama_model_loader: - kv 13: tokenizer.ggml.add_space_prefix bool = false
llama_model_loader: - kv 14: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 15: tokenizer.ggml.pre str = falcon3
llama_model_loader: - kv 16: tokenizer.ggml.tokens arr[str,131072] = [">>TITLE<<", ">>ABSTRACT<<", ">>INTR...
llama_model_loader: - kv 17: tokenizer.ggml.token_type arr[i32,131072] = [3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
llama_model_loader: - kv 18: tokenizer.ggml.merges arr[str,128810] = ["N E", "Ġ Ġ", "Ġ t", "Ġ a", "> >...
llama_model_loader: - kv 19: tokenizer.ggml.eos_token_id u32 = 11
llama_model_loader: - kv 20: tokenizer.ggml.padding_token_id u32 = 2023
llama_model_loader: - kv 21: tokenizer.chat_template str = {% if tools %}{% for message in messa...
llama_model_loader: - kv 22: general.quantization_version u32 = 2
llama_model_loader: - type f32: 81 tensors
llama_model_loader: - type f16: 2 tensors
llama_model_loader: - type i2_s: 280 tensors
# other stuff
llm_load_vocab: special tokens cache size = 2024
llm_load_vocab: token to piece cache size = 0.8741 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 131072
llm_load_print_meta: n_merges = 128810
llm_load_print_meta: vocab_only = 0
llm_load_print_meta: n_ctx_train = 32768
llm_load_print_meta: n_embd = 3072
llm_load_print_meta: n_layer = 40
llm_load_print_meta: n_head = 12
llm_load_print_meta: n_head_kv = 4
llm_load_print_meta: n_rot = 256
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_embd_head_k = 256
llm_load_print_meta: n_embd_head_v = 256
llm_load_print_meta: n_gqa = 3
llm_load_print_meta: n_embd_k_gqa = 1024
llm_load_print_meta: n_embd_v_gqa = 1024
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-06
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 23040
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 1000042.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 32768
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: ssm_dt_b_c_rms = 0
llm_load_print_meta: model type = 13B
llm_load_print_meta: model ftype = I2_S - 2 bpw ternary
llm_load_print_meta: model params = 10.31 B
llm_load_print_meta: model size = 3.71 GiB (3.09 BPW)
llm_load_print_meta: general.name = Falcon3-10B-Instruct-1.58bit
llm_load_print_meta: BOS token = 11 '<|endoftext|>'
llm_load_print_meta: EOS token = 11 '<|endoftext|>'
llm_load_print_meta: EOT token = 11 '<|endoftext|>'
llm_load_print_meta: PAD token = 2023 '<|pad|>'
llm_load_print_meta: LF token = 2150 'Ä'
llm_load_print_meta: EOG token = 11 '<|endoftext|>'
llm_load_print_meta: max token length = 256
llm_load_tensors: ggml ctx size = 0.17 MiB
llm_load_tensors: CPU buffer size = 3801.96 MiB
..............................................................
llama_new_context_with_model: n_batch is less than GGML_KQ_MASK_PAD - increasing to 32
llama_new_context_with_model: n_ctx = 2048
llama_new_context_with_model: n_batch = 32
llama_new_context_with_model: n_ubatch = 32
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 1000042.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CPU KV buffer size = 320.00 MiB
llama_new_context_with_model: KV self size = 320.00 MiB, K (f16): 160.00 MiB, V (f16): 160.00 MiB
llama_new_context_with_model: CPU output buffer size = 0.50 MiB
llama_new_context_with_model: CPU compute buffer size = 16.38 MiB
llama_new_context_with_model: graph nodes = 1286
llama_new_context_with_model: graph splits = 1
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 2
system_info: n_threads = 2 (n_threads_batch = 2) / 24 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | RISCV_VECT = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
sampler seed: 4226296889
sampler params:
repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> top-k -> tail-free -> typical -> top-p -> min-p -> temp-ext -> softmax -> dist
generate: n_ctx = 2048, n_batch = 1, n_predict = 128, n_keep = 0
how many letters 'a' do we have in the word 'Ananas' in Python?
<|assistant|>
In Python, you can use the built-in count() function in the str class to count the number of occurrences of a specific character in a string.
Here is a code snippet that accomplishes this:
```python
word = 'Ananas'
count = word.count('a')
print(count)
```
When you run this code, it will output:
```
3
```
This means that in the word 'Ananas', there are 3 occurrences of the letter 'a'. [end of text]
llama_perf_sampler_print: sampling time = 5.12 ms / 144 runs ( 0.04 ms per token, 28130.49 tokens per second)
llama_perf_context_print: load time = 336.46 ms
llama_perf_context_print: prompt eval time = 1775.89 ms / 18 tokens ( 98.66 ms per token, 10.14 tokens per second)
llama_perf_context_print: eval time = 12286.29 ms / 125 runs ( 98.29 ms per token, 10.17 tokens per second)
llama_perf_context_print: total time = 14075.18 ms / 143 tokens
10.17 t/s at eval time, and it got the right answer.
Let's try another one:
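For reference, `str.count` is case-sensitive, so the snippet the model showed would actually print 2; the answer 3 only holds if the uppercase 'A' is counted as well. A quick check:

```python
word = 'Ananas'
print(word.count('a'))           # 2: counts only lowercase 'a'
print(word.lower().count('a'))   # 3: case-insensitive count
```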
python run_inference.py -m models/Falcon3-10B-Instruct-1.58bit/ggml-model-i2_s.gguf -p "In our production code, we're building a utility function for generating totals with default values for missing parameters. However, the behavior is not as expected. Here's a simplified version of the code:\n\n```js\nexport const calculateTotal = (base, multiplier = 2, increment = 2) => {\n return base * multiplier + increment;\n};\n\nconst data = {\n baseAmount: 10,\n extraParams: [3, 5] // These should map to multiplier and increment\n};\n\nconst total = calculateTotal(data.baseAmount, ...data.extraParams);\nconsole.log(total); // Expecting 12, but the output seems wrong\n```\n\nCan you help figure out why this is happening and how to fix it?"
# Same as above, just changed the prompt
sampler seed: 2694760223
sampler params:
repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> top-k -> tail-free -> typical -> top-p -> min-p -> temp-ext -> softmax -> dist
generate: n_ctx = 2048, n_batch = 1, n_predict = 128, n_keep = 0
In our production code, we're building a utility function for generating totals with default values for missing parameters. However, the behavior is not as expected. Here's a simplified version of the code:
Can you help figure out why this is happening and how to fix it?
I'm using Visual Studio 2005 (2003.01).
<|assistant|>
This error is caused by the fact that you're trying to call a virtual method that does not exist in the base class.
You're trying to call `GetCount` in the `IMyDataContext` class, but this method does not exist in the base class `IMyDataContext`. This is because `GetCount` is a virtual function in `IMyDataContext`, and it's meant to be overridden in the
llama_perf_sampler_print: sampling time = 5.50 ms / 188 runs ( 0.03 ms per token, 34212.92 tokens per second)
llama_perf_context_print: load time = 318.08 ms
llama_perf_context_print: prompt eval time = 5935.47 ms / 60 tokens ( 98.92 ms per token, 10.11 tokens per second)
llama_perf_context_print: eval time = 12688.14 ms / 127 runs ( 99.91 ms per token, 10.01 tokens per second)
llama_perf_context_print: total time = 18638.24 ms / 187 tokens
Hallucinated and gave a wrong answer.
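For reference, the behavior the prompt asks about is straightforward: spreading `data.extraParams` passes 3 and 5 as `multiplier` and `increment`, so the defaults never apply and the call returns 10 * 3 + 5 = 35. A minimal Python analogue of the JavaScript snippet (with hypothetical names mirroring the prompt) shows the same mechanism:

```python
def calculate_total(base, multiplier=2, increment=2):
    return base * multiplier + increment

data = {"base_amount": 10, "extra_params": [3, 5]}

# Unpacking the list overrides the defaults, just like the JS spread operator:
print(calculate_total(data["base_amount"], *data["extra_params"]))  # 10 * 3 + 5 = 35

# Defaults only apply when the arguments are omitted:
print(calculate_total(data["base_amount"]))  # 10 * 2 + 2 = 22
```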
Last try:
# Same as above, just changed the prompt
sampler seed: 4052516490
sampler params:
repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> top-k -> tail-free -> typical -> top-p -> min-p -> temp-ext -> softmax -> dist
generate: n_ctx = 2048, n_batch = 1, n_predict = 128, n_keep = 0
Provide a summary with about two sentences for the following article:
The efficient-market hypothesis (EMH) is a hypothesis in financial economics that states that asset prices reflect all available information. A direct implication is that it is impossible to “beat the market” consistently on a risk-adjusted basis since market prices should only react to new information. Because the EMH is formulated in terms of risk adjustment, it only makes testable predictions when coupled with a particular model of risk. As a result, research in financial economics since at least the 1990s has focused on market anomalies, that is, deviations from specific models of risk. The idea that financial market returns are difficult to predict goes back to Bachelier, Mandelbrot, and Samuelson, but is closely associated with Eugene Fama, in part due to his influential 1970 review of the theoretical and empirical research. The EMH provides the basic logic for modern risk-based theories of asset prices, and frameworks such as consumption-based asset pricing and intermediary asset pricing can be thought of as the combination of a model of risk with the EMH. Many decades of empirical research on return predictability has found mixed evidence. Research in the 1950s and 1960s often found a lack of predictability (e.g. Ball and Brown 1968; Fama, Fisher, Jensen, and Roll 1969), yet the 1980s-2000s saw an explosion of discovered return predictors (e.g. Rosenberg, Reid, and Lanstein 1985; Campbell and Shiller 1988; Jegadeesh and Titman 1993). Since the 2010s, studies have often found that return predictability has become more elusive, as predictability fails to work out-of-sample (Goyal and Welch 2008), or has been weakened by advances in trading technology and investor learning (Chordia, Subrahmanyam, and Tong 2014; McLean and Pontiff 2016; Martineau 2021).
Summary:
<|assistant|>
The Efficient Market Hypothesis (EMH) is a financial economics theory that suggests that asset prices reflect all available information. Since the EMH is formulated in terms of risk adjustment, it only makes testable predictions when coupled with a particular model of risk. Today, research in financial economics since the 1990s has focused on market anomalies, that is, deviations from specific models of risk. [end of text]
llama_perf_sampler_print: sampling time = 3.88 ms / 539 runs ( 0.01 ms per token, 139060.89 tokens per second)
llama_perf_context_print: load time = 320.65 ms
llama_perf_context_print: prompt eval time = 45040.32 ms / 452 tokens ( 99.65 ms per token, 10.04 tokens per second)
llama_perf_context_print: eval time = 8845.43 ms / 86 runs ( 102.85 ms per token, 9.72 tokens per second)
llama_perf_context_print: total time = 53896.92 ms / 538 tokens
Not bad. It kept to the prompt this time and gave a good summary.
License
This model is licensed under the TII Falcon License. See the LICENSE for details.
Quantization Details
- Quantization Type: i2_s (2 bpw ternary, weight-only symmetric quantization)
- Bits: 1.58
- Group Size: 128
- Quantization Tool: BitNet
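For intuition, 1.58 bits corresponds to log2(3): each weight takes one of three values {-1, 0, +1}, plus a floating-point scale. Below is a minimal sketch of absmean-style ternary quantization as described in the BitNet b1.58 paper; it illustrates the idea only and is not the exact i2_s packing used by the BitNet tool:

```python
import numpy as np

def ternary_quantize(w: np.ndarray, eps: float = 1e-8):
    """Quantize a weight tensor to {-1, 0, +1} with a per-tensor absmean scale."""
    scale = np.abs(w).mean() + eps            # gamma = mean(|W|)
    q = np.clip(np.round(w / scale), -1, 1)   # round, then clip to ternary values
    return q.astype(np.int8), scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(4, 4).astype(np.float32)
q, s = ternary_quantize(w)
print(q)                                     # entries in {-1, 0, 1}
print(np.abs(w - dequantize(q, s)).mean())   # mean quantization error
```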
How to Use
As of today, llama.cpp does not run this model; you can wait for upstream support to land.
BitNet
git clone https://github.com/microsoft/BitNet.git
cd BitNet
# They recommend conda, but I tried venv and it worked fine on Debian 12
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
# Optional (you can skip this one): python setup_env.py --hf-repo your_hf_username/Falcon3-10B-Instruct-1.58bit -q i2_s
# Move the model into the models/ folder, then run:
python run_inference.py -m models/Falcon3-10B-Instruct-1.58bit/ggml-model-i2_s.gguf -p "What is 1.58bit quantization in LLM and why its iteresting for gpu poor people?" -cnv
# 1.58-bit quantization is a method in which floating-point numbers in a neural network are represented using fewer bits, specifically 1.58 bits. This technique reduces the number of bits needed to represent a number, which can lead to improved performance and lower memory usage.
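If you prefer to fetch the GGUF file programmatically instead of moving it by hand, something like the following should work (a sketch using huggingface_hub; the repo id is assumed to be israellaguan/Falcon3-10B-Instruct-1.58bit-i2_s):

```python
from huggingface_hub import snapshot_download

# Download the quantized GGUF into the folder BitNet's run_inference.py expects.
snapshot_download(
    repo_id="israellaguan/Falcon3-10B-Instruct-1.58bit-i2_s",  # assumed repo id
    local_dir="models/Falcon3-10B-Instruct-1.58bit",
    allow_patterns=["*.gguf"],
)
```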
Evaluation
| Benchmark | Llama-3.2-1B | Qwen2.5-1.5B | Llama3-8B-1.58-100B-tokens | Falcon3-10B-Instruct-1.58bit |
|---|---|---|---|---|
| IFEval | 55.8 | 44.4 | 17.91 | 54.37 |
| MUSR | 32.4 | 36.8 | 4.87 | 2.57 |
| GPQA | 25.3 | 29.6 | 1.83 | 4.27 |
| BBH | 30.3 | 38.5 | 5.36 | 6.59 |
| MMLU-PRO | 11.3 | 21.3 | 2.78 | 6.62 |
| MATH | 3.9 | 0.2 | 0.26 | 2.44 |
Useful Links
- View the release blogpost.
Contributions
Contributions are welcome! If you encounter any issues or have suggestions for improvement, please submit an issue or pull request.
Model tree for israellaguan/Falcon3-10B-Instruct-1.58bit-i2_s
- Base model: tiiuae/Falcon3-10B-Base