
Zillow
company
Verified
AI & ML interests
None defined yet.
Zillow's activity

csabakecskemeti posted an update 5 days ago

csabakecskemeti posted an update 12 days ago
-UPDATED-
4-bit inference is working! The blog post has been updated with a code snippet and requirements.txt:
https://devquasar.com/uncategorized/all-about-amd-and-rocm/
-UPDATED-
I've played around with an MI100 and ROCm and collected my experience in a blog post:
https://devquasar.com/uncategorized/all-about-amd-and-rocm/
Unfortunately I could not make inference or training work with the model loaded in 8-bit or with BnB, but I did everything else and documented my findings.
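For reference, 4-bit loading with transformers + bitsandbytes looks roughly like the sketch below. This is not the exact snippet from the blog post; the model name is only an example.

```python
# Minimal sketch of 4-bit inference with transformers + bitsandbytes
# (not the exact snippet from the blog post; the model name is an example).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Llama-3.1-8B-Instruct"  # example model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",  # places layers on the GPU (ROCm build of PyTorch) if available
)

inputs = tokenizer("Hello from the MI100!", return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```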

csabakecskemeti posted an update 17 days ago
Testing training on AMD/ROCm for the first time!
I've got my hands on an AMD Instinct MI100. Used, it's about the same price as a V100, but on paper it has more compute (V100: ~14 vs. MI100: ~23 FP32 TFLOPS), and its HBM has a faster clock, so the memory bandwidth is 1.2 TB/s.
For quantized inference it's a beast (the MI50 was also surprisingly fast).
For LoRA training in this quick test I could not make the bnb config work, so I'm running the fine-tune on the full-size model.
I'll share all the install, setup and settings I've learned in a blog post, together with the cooling shroud 3D design.

csabakecskemeti posted an update 27 days ago
I found that if we apply the reasoning system prompt (published on the
NousResearch/DeepHermes-3-Llama-3-8B-Preview model card), other models also react to it and start mimicking reasoning, some better, some worse. I've seen internal monologue and self-questioning.
Here's a blog post about it:
http://devquasar.com/ai/reasoning-system-prompt/
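If you want to try this yourself, the sketch below shows the general approach with a transformers chat template. The system prompt string is a placeholder; the actual prompt is on the DeepHermes model card, and the target model here is only an example.

```python
# Sketch: apply a reasoning-style system prompt to an arbitrary chat model.
# REASONING_PROMPT is a placeholder; use the actual prompt from the
# NousResearch/DeepHermes-3-Llama-3-8B-Preview model card.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.1-8B-Instruct"  # any chat model you want to test
REASONING_PROMPT = "<paste the reasoning system prompt from the DeepHermes model card>"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

messages = [
    {"role": "system", "content": REASONING_PROMPT},
    {"role": "user", "content": "How many weighings do I need to find the odd coin among 12?"},
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
out = model.generate(inputs, max_new_tokens=512)
print(tokenizer.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True))
```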

csabakecskemeti posted an update about 1 month ago
Check out my idea:
LLmaaS - Local LLM as a Service
With LLmaaS, I propose leveraging locally running LLMs as a service, providing a standardized way for websites to access and utilize them for LLM-powered operations directly on the user's device.
Demo, code, and a more detailed description:
https://devquasar.com/llmaas/
https://github.com/csabakecskemeti/LLmaaS
https://youtu.be/OOWGr8jcP5Q
Call for contributors
Join me in developing the LLmaaS proxy to make it a general-purpose tool for leveraging local LLMs on the web, with built-in security measures.
I'm looking for help to make the proxy more generic so it supports multiple local LLM services without any changes on the HTML side.
Also looking for ideas on how to make the HTML part more modular and easier to use.
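The real proxy is in the GitHub repo above; purely to illustrate the idea, here is a minimal, hypothetical sketch of a local proxy a web page could call, forwarding chat requests to a locally running OpenAI-compatible LLM server. The endpoint, port and route names are assumptions, not the LLmaaS API.

```python
# Hypothetical sketch of the LLmaaS idea (not the actual proxy from the repo):
# a tiny local HTTP proxy that browser pages can call, which forwards the request
# to a locally running OpenAI-compatible LLM server (e.g. a llama.cpp server).
from flask import Flask, request, jsonify
import requests

LOCAL_LLM_URL = "http://localhost:8080/v1/chat/completions"  # assumed local endpoint

app = Flask(__name__)

@app.after_request
def allow_cors(resp):
    # Let browser pages call the local proxy; a real deployment needs proper security.
    resp.headers["Access-Control-Allow-Origin"] = "*"
    resp.headers["Access-Control-Allow-Headers"] = "Content-Type"
    return resp

@app.route("/llmaas", methods=["POST", "OPTIONS"])
def llmaas():
    if request.method == "OPTIONS":
        return ("", 204)
    payload = request.get_json()
    # Forward the chat request to the local LLM and return its answer to the page.
    r = requests.post(LOCAL_LLM_URL, json=payload, timeout=120)
    return jsonify(r.json()), r.status_code

if __name__ == "__main__":
    app.run(host="127.0.0.1", port=5001)
```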

csabakecskemeti posted an update about 1 month ago
I've made an uncensored version of DeepSeek-R1-Distill-Llama-8B via a merge. It passes the "say f***" censorship test.
Tested with lm-evaluation-harness on the standard Open LLM Leaderboard tests + HellaSwag; scores improved on most of them. Details on the model card.
Model:
DevQuasar/DevQuasar-R1-Uncensored-Llama-8B
Quants:
DevQuasar/DevQuasar-R1-Uncensored-Llama-8B-GGUF

csabakecskemeti posted an update about 2 months ago
I've run the Open LLM Leaderboard evaluations + HellaSwag on
deepseek-ai/DeepSeek-R1-Distill-Llama-8B and compared them to
meta-llama/Llama-3.1-8B-Instruct, and at first glance R1 does not beat Llama overall.
If anyone wants to double-check, the results are posted here:
https://github.com/csabakecskemeti/lm_eval_results
Have I made some mistake, or is (at least this distilled version) not as good as or better than the competition?
I'll run the same on the Qwen 7B distilled version too.
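For anyone reproducing the comparison, a run can be launched programmatically with lm-evaluation-harness roughly as below. This is a sketch; the exact task list and few-shot settings behind the posted numbers are in the results repo above.

```python
# Sketch of running lm-evaluation-harness programmatically; the exact task list and
# few-shot settings used for the posted numbers are documented in the results repo.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=deepseek-ai/DeepSeek-R1-Distill-Llama-8B,dtype=bfloat16",
    tasks=["hellaswag"],  # add the other Open LLM Leaderboard tasks the same way
    batch_size=8,
)
print(results["results"])
```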

csabakecskemeti posted an update about 2 months ago
NVIDIA's new AceInstruct and AceMath models quantized here:
DevQuasar/nvidia-aceinstruct-and-acemath-678d716f736603ddc8d7cbd4
(some are still uploading, please be patient)

csabakecskemeti posted an update 2 months ago
Managed to run the Q2-quantized DeepSeek V3 Base locally.
The quants are uploading (probably ~10-12 hrs) here: DevQuasar/deepseek-ai.DeepSeek-V3-Base-GGUF

csabakecskemeti posted an update 2 months ago
Just wondering why the number of parameters shown in the model attributes / model size changed from 685B to 684B after converting
deepseek-ai/DeepSeek-V3-Base from FP8 to BF16:
DevQuasar/deepseek-ai.DeepSeek-V3-Base-bf16
and it's not just for me:
opensourcerelease/DeepSeek-V3-Base-bf16
??

csabakecskemeti posted an update 2 months ago
Happy New Year, Hugging Face community!
In 2025, I'll continue my quantization (and some fine-tuning) efforts to support open-source AI and make knowledge free for everyone.
https://huggingface.co/DevQuasar
https://devquasar.com/

csabakecskemeti posted an update 2 months ago
The
deepseek-ai/DeepSeek-V3-Base
model was featured today on CNBC tech news. The whale made a splash by using FP8 and shrinking the cost of training significantly!
https://youtu.be/NJljq429cGk?si=kgk-ogPTMfJKsaA2

csabakecskemeti posted an update 2 months ago
I've built a small utility to split safetensors files, file by file.
The need came up when I tried to convert the new DeepSeek V3 model from FP8 to BF16.
The only Ada-architecture GPU I have is an RTX 4080, and its 16GB of VRAM just wasn't enough for the conversion.
BTW: I'll upload the BF16 version here:
DevQuasar/deepseek-ai.DeepSeek-V3-Base-bf16
(it will take a while, days with my upload speed)
If anyone has the resources to test it, I'd appreciate feedback on whether it's working or not.
The tool is available here:
https://github.com/csabakecskemeti/ai_utils/blob/main/safetensor_splitter.py
It splits every file into n pieces by layers where possible and creates a new "model.safetensors.index.json" file.
I've tested it with Llama 3.1 8B and multiple split sizes, and validated it using the inference pipeline.
Use --help for usage.
Please note the current version expects the model to already be split into multiple files and to have a "model.safetensors.index.json" layer-to-safetensor mapping file.
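The actual splitter is the script linked above; as a rough illustration of the core idea (load a shard, write its tensors back out in smaller pieces, and update the weight map), something along these lines, with hypothetical file names:

```python
# Illustrative sketch of the core idea behind the splitter (the real tool is in the
# repo above): split one safetensors shard into two smaller shards by its tensors
# and record the new weight map in "model.safetensors.index.json".
import json
from safetensors.torch import load_file, save_file

shard = "model-00001-of-00002.safetensors"  # hypothetical input shard
tensors = load_file(shard)                  # {tensor_name: torch.Tensor}

names = sorted(tensors.keys())
half = len(names) // 2
parts = [names[:half], names[half:]]

weight_map = {}
for i, part in enumerate(parts, start=1):
    out_name = f"{shard}.part{i}.safetensors"
    save_file({n: tensors[n] for n in part}, out_name)
    weight_map.update({n: out_name for n in part})

# Merge the new mapping into the existing index so transformers can find every tensor.
with open("model.safetensors.index.json") as f:
    index = json.load(f)
index["weight_map"].update(weight_map)
with open("model.safetensors.index.json", "w") as f:
    json.dump(index, f, indent=2)
```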

csabakecskemeti posted an update 3 months ago
tiiuae Falcon3 10B Q8 playground:
https://huggingface.co/spaces/DevQuasar/Mi50
Also find my tiiuae Falcon3 Quant collection here:
https://huggingface.co/collections/DevQuasar/tiiuae-falcon3-676236626f3c57d1a19c6c1d
Enjoy!

csabakecskemeti posted an update 3 months ago
The AMD Instinct MI50 (~$110) is surprisingly fast for inference with quantized models.
This Space runs a Llama 3.1 8B Q8 with llama.cpp:
https://huggingface.co/spaces/DevQuasar/Mi50
A little blog post about the hardware:
http://devquasar.com/uncategorized/amd-radeon-instinct-mi50-cheap-inference/
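If you'd rather run the same model locally instead of through the Space, a minimal llama-cpp-python sketch looks roughly like this; the GGUF path, context size and layer offload count are placeholders, and the Space itself runs llama.cpp directly.

```python
# Minimal llama-cpp-python sketch for running a Q8 GGUF locally (path, n_ctx and
# n_gpu_layers are placeholders; the Space runs llama.cpp directly).
from llama_cpp import Llama

llm = Llama(
    model_path="Llama-3.1-8B-Instruct-Q8_0.gguf",  # hypothetical local file
    n_gpu_layers=-1,  # offload all layers to the GPU (ROCm/HIP build of llama.cpp)
    n_ctx=8192,
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Say hello from the MI50."}],
    max_tokens=64,
)
print(out["choices"][0]["message"]["content"])
```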

csabakecskemeti posted an update 3 months ago
Fine-tuned a Llama 3.2 3B on the MS Orca-Agents dataset for analytical reasoning.
r=16, alpha=32
If you want to give it a try:
Model:
DevQuasar/analytical_reasoning_r16a32_unsloth-Llama-3.2-3B-Instruct-bnb-4bit
Adapter:
DevQuasar/analytical_reasoning_r16a32_unsloth-Llama-3.2-3B-Instruct-bnb-4bit_adapter
Quants:
DevQuasar/analytical_reasoning_r16a32_unsloth-Llama-3.2-3B-Instruct-bnb-4bit-GGUF
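For context, the r/alpha above correspond to a PEFT LoraConfig roughly like the sketch below. The target modules, dropout and the rest of the training setup here are assumptions, not the exact recipe used for this run.

```python
# Sketch of the LoRA settings mentioned above in PEFT terms; target_modules, dropout
# and the surrounding training setup are assumptions, not the exact recipe.
from peft import LoraConfig

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed attention projections
    task_type="CAUSAL_LM",
)
# The config is then passed to get_peft_model(base_model, lora_config) or to an
# SFT trainer before fine-tuning on the dataset.
```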

csabakecskemeti posted an update 4 months ago
I have this small utility: no_more_typo
It runs in the background and can call an LLM to update the text on the clipboard. I think it's ideal for fixing typos and syntax.
I have just added the option to use custom prompt templates to perform different tasks.
Details, code and executable:
https://github.com/csabakecskemeti/no_more_typo
https://devquasar.com/no-more-typo/
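no_more_typo itself is the standalone tool in the repo above, but the core loop is simple enough to sketch in a few lines of Python against an OpenAI-compatible endpoint. Everything below (endpoint, model name, prompt template) is an assumption for illustration only.

```python
# Rough Python sketch of the no_more_typo idea (the real utility is the standalone
# tool in the repo): read the clipboard, ask an LLM to fix the text, write it back.
# The endpoint, model name and prompt template are assumptions for illustration.
import pyperclip
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")
PROMPT_TEMPLATE = "Fix typos and grammar, keep the meaning:\n\n{text}"

text = pyperclip.paste()
resp = client.chat.completions.create(
    model="local-model",  # placeholder model name for a local server
    messages=[{"role": "user", "content": PROMPT_TEMPLATE.format(text=text)}],
)
pyperclip.copy(resp.choices[0].message.content)
```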

csabakecskemeti posted an update 4 months ago
Repurposed my older AI workstation into a homelab server; it has received 2x V100 + 1x P40.
I can reach a huge 210k-token context size with MegaBeam-Mistral-7B-512k-GGUF at ~70+ tok/s, or run Llama-3.1-Nemotron-70B-Instruct-HF-GGUF with 50k context at ~10 tok/s (V100s only: 40k ctx and 15 tok/s).
Also able to LoRA fine-tune with similar performance to an RTX 3090.
It moved to the garage so there are no complaints from the family about the noise. Will move to a rack soon :D

csabakecskemeti posted an update 4 months ago
Some time ago, I built a predictive LLM router that routes chat requests between small and large LLM models based on prompt classification. It dynamically selects the most suitable model depending on the complexity of the user input, ensuring optimal performance while maintaining conversation context. I also fine-tuned a RoBERTa model to use with the package, but you can plug and play any classifier of your choice.
Project's homepage:
https://devquasar.com/llm-predictive-router/
Pypi:
https://pypi.org/project/llm-predictive-router/
Model:
DevQuasar/roberta-prompt_classifier-v0.1
Training data:
DevQuasar/llm_router_dataset-synth
Git:
https://github.com/csabakecskemeti/llm_predictive_router_package
Feel free to check it out, and/or contribute.
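Conceptually, the routing step looks something like the sketch below: classify the prompt with the RoBERTa classifier and pick a model based on the predicted complexity. The label names and model choices here are assumptions; the package and model card document the real interface.

```python
# Conceptual sketch of the predictive routing step (the packaged implementation is
# on PyPI/GitHub above). Label names and the small/large model choices are assumptions.
from transformers import pipeline

classifier = pipeline(
    "text-classification", model="DevQuasar/roberta-prompt_classifier-v0.1"
)

def pick_model(prompt: str) -> str:
    label = classifier(prompt)[0]["label"]
    # Route complex prompts to the large model, everything else to the small one.
    return "large-llm" if label == "large_llm" else "small-llm"

print(pick_model("What's the capital of France?"))
print(pick_model("Prove there are infinitely many primes and explain each step."))
```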

csabakecskemeti posted an update 4 months ago
I've built a small open utility pip package called LLM-Forwarder that allows you to inject context, such as adding a private RAG, into existing chat applications by forwarding the app through the LLM-Forwarder. In the forwarder server, you can configure custom code to re-process chat messages and alter the user prompt, for example, by adding extra context.
https://pypi.org/project/llm-forwarder/
More details:
https://devquasar.com/llmforwarder/
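The kind of custom message re-processing it enables looks roughly like this; a hypothetical example only, since the actual hook and configuration API is documented with the package.

```python
# Hypothetical example of the kind of message re-processing LLM-Forwarder enables
# (the actual hook/configuration API is documented with the package): prepend
# retrieved RAG context to the latest user message before forwarding it.
def inject_context(messages, retrieve):
    """messages: OpenAI-style chat messages; retrieve: callable returning context text."""
    last_user = next(m for m in reversed(messages) if m["role"] == "user")
    context = retrieve(last_user["content"])
    last_user["content"] = f"Context:\n{context}\n\nQuestion:\n{last_user['content']}"
    return messages

# Example usage with a stubbed retriever:
msgs = [{"role": "user", "content": "What does our internal deploy script do?"}]
print(inject_context(msgs, retrieve=lambda q: "(top documents from the private index)"))
```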