Tim Dolan
AI & ML interests
Articles
Organizations
macadeliccc's activity
@
decorator in ChatGPT. Once the function is selected, the model will either extract or improve your prompt (depending on how you ask).I have also included 2 notebooks that cover different ways to access Flux for your specific use case. The first method covers how to access flux via LitServe from Lightning AI. LitServe is a bare-bones inference engine with a focus on modularity rather than raw performance. LitServe supports text generation models as well as image generation, which is great for some use cases, but does not provide the caching mechanisms from a dedicated image generation solution.
Since dedicated caching mechanisms are so crucial to performance, I also included an example for how to integrate SwarmUI/ComfyUI to utilize a more dedicated infrastructure that may already be running as part of your tech stack. Resulting in a Llama-3.1 capable of utilizing specific ComfyUI JSON configs, and many different settings.
Lastly, I tested the response times for each over a small batch request to simulate a speed test.
It becomes clear quickly how efficient caching mechanisms can greatly reduce the generation time, even in a scenario where another model is called. An average 4.5 second response time is not bad at all when you consider that an 8B model is calling a 12B parameter model for a secondary generation.
Repo: https://github.com/tdolan21/tool-calling-playground
LitServe: https://github.com/Lightning-AI/LitServe
SwarmUI: https://github.com/mcmonkeyprojects/SwarmUI
This can be completed in a multistep process similar to cohere's platform. If you have tried the cohere playground with web scraping, this will feel very similar. In my experience, the Llama 3.1 version is much better due to the larger context window. Both tools are great, but the difference is the ollama + playwright version is completely controlled by you.
All you need to do is wrap your scraper in a function:
async def query_web_scraper(url: str) -> dict:
scraper = WebScraper(headless=False)
return await scraper.query_page_content(url)
and then make your request:
# First API call: Send the query and function description to the model
response = ollama.chat(
model=model,
messages=messages,
tools=[
{
'type': 'function',
'function': {
'name': 'query_web_scraper',
'description': 'Scrapes the content of a web page and returns the structured JSON object with titles, articles, and associated links.',
'parameters': {
'type': 'object',
'properties': {
'url': {
'type': 'string',
'description': 'The URL of the web page to scrape.',
},
},
'required': ['url'],
},
},
},
]
)
To learn more:
Github w/ Playground: https://github.com/tdolan21/tool-calling-playground/blob/main/notebooks/ollama-playwright-web-scraping.ipynb
Complete Guide: https://medium.com/@tdolan21/building-an-llm-powered-web-scraper-with-ollama-and-playwright-6274d5d938b5
Step 1: Pull docker images
docker pull apostacyh/vllm:lmcache-0.1.0
Step 2: Start vLLM + LMCache
model=mistralai/Mistral-7B-Instruct-v0.2 # Replace with your model name
sudo docker run --runtime nvidia --gpus '"device=0"' \
-v <Huggingface cache dir on your local machine>:/root/.cache/huggingface \
-p 8000:8000 \
--env "HF_TOKEN=<Your huggingface access token>" \
--ipc=host \
--network=host \
apostacyh/vllm:lmcache-0.1.0 \
--model $model --gpu-memory-utilization 0.6 --port 8000 \
--lmcache-config-file /lmcache/LMCache/examples/example-local.yaml
You can add another vLLM instance as long as its on a separate GPU by simply deploying another:
# The second vLLM instance listens at port 8001
model=mistralai/Mistral-7B-Instruct-v0.2 # Replace with your model name
sudo docker run --runtime nvidia --gpus '"device=1"' \
-v <Huggingface cache dir on your local machine>:/root/.cache/huggingface \
-p 8001:8001 \
--env "HF_TOKEN=<Your huggingface token>" \
--ipc=host \
--network=host \
apostacyh/vllm:lmcache-0.1.0 \
--model $model --gpu-memory-utilization 0.7 --port 8001 \
--lmcache-config-file /lmcache/LMCache/examples/example.yaml
This method supports local, remote or hybrid backends so whichever vLLM deployment method you are already using should work with the LMCache container (excluding BentoML).
LMCache: https://github.com/LMCache/LMCache/tree/dev
vLLM: https://github.com/vllm-project/vllm
In this colab, we simply apply a supervised finetune to phi-3 using the sharegpt format.
def formatting_prompts_func(examples):
convos = examples["conversations"]
texts = []
mapper = {"system": "system\n", "human": "\nuser\n", "gpt": "\nassistant\n"}
end_mapper = {"system": "", "human": "", "gpt": ""}
for convo in convos:
text = "".join(f"{mapper[(turn := x['from'])]} {x['value']}\n{end_mapper[turn]}" for x in convo)
texts.append(f"{text}{EOS_TOKEN}")
return {"text": texts}
dataset = dataset.map(formatting_prompts_func, batched=True)
print(dataset['text'][8])
Opus Samantha consists of 1848 samples with the samantha personality. The dataset covers a wide variety of topics such as logical reasoning, mathematics, legal, and rp.
This notebook serves as a viable option to finetune Phi-3 until Unsloth supports phi-3, which should be very soon. When that happens check out AutoSloth for both SFT, DPO, and langfuse format RAG fine tuning on free tier colab hardware.
Resources:
Dataset: macadeliccc/opus_samantha
Colab: https://colab.research.google.com/drive/1e8LILflDQ2Me52hwS7uIfuJ9DxE2oQzM?usp=sharing
AutoSloth: https://colab.research.google.com/drive/1Zo0sVEb2lqdsUm9dy2PTzGySxdF9CNkc#scrollTo=bpimlPXVz-CZ
This game-changing technique allows for rapid quantization of models like Llama-2-70B in under 5 minutes, outperforming traditional methods by 50x in speed and offering high-quality compression without calibration data.
Mobius Labs innovative approach not only significantly reduces memory requirements but also enables the use of large models on consumer-grade GPUs, paving the way for more accessible and efficient machine learning research.
Mobius Labs' method utilizes a robust optimization formulation to determine the optimal quantization parameters, specifically targeting the minimization of errors between original and dequantized weights. This involves employing a loss function that promotes sparsity and utilizes a non-convex lp<1-norm, making the problem challenging yet solvable through a Half-Quadratic solver.
This solver simplifies the problem by introducing an extra variable and dividing the optimization into manageable sub-problems. Their implementation cleverly fixes the scale parameter to simplify calculations and focuses on optimizing the zero-point, utilizing closed-form solutions for each sub-problem to bypass the need for gradient calculations.
Check out the colab demo where you are able to quantize models (text generation and multimodal) for use with vLLM or Timm backend as well as transformers!
AutoHQQ: π https://colab.research.google.com/drive/1cG_5R_u9q53Uond7F0JEdliwvoeeaXVN?usp=sharing
Code: https://github.com/mobiusml/hqq
HQQ Blog post: https://mobiusml.github.io/hqq_blog/
Edit: Here is an example of how powerful HQQ can be: macadeliccc/Nous-Hermes-2-Mixtral-8x7B-DPO-HQQ
Citations:
@misc {badri2023hqq,
title = {Half-Quadratic Quantization of Large Machine Learning Models},
url = {https://mobiusml.github.io/hqq_blog/},
author = {Hicham Badri and Appu Shaji},
month = {November},
year = {2023}
}
thanks for letting me know. I have updated the post with the correct link https://colab.research.google.com/drive/1l9zh_VX0X4ylbzpGckCjH5yEflFsLW04?usp=sharing
With Bonito, you can generate synthetic datasets for a wide variety of supported tasks.
The Bonito model introduces a novel approach for conditional task generation, transforming unannotated text into task-specific training datasets to facilitate zero-shot adaptation of large language models on specialized data.
This methodology not only improves the adaptability of LLMs to new domains but also showcases the effectiveness of synthetic instruction tuning datasets in achieving substantial performance gains.
AutoBonitoπ: https://colab.research.google.com/drive/1l9zh_VX0X4ylbzpGckCjH5yEflFsLW04?usp=sharing
Original Repo: https://github.com/BatsResearch/bonito?tab=readme-ov-file
Paper: Learning to Generate Instruction Tuning Datasets for Zero-Shot Task Adaptation (2402.18334)
Thank you! π€
Thank you! I will try it again and let you know if there is any issues
Love the concept! I am not sure if you are just making requests to an API, but I gave it a simple paragraph and it ran for 3800 seconds with no response.
I was going to suggest more powerful hardware, but that may not be the answer.
Here it is on colab. I am cloning from a new branch that has greatly increased the generation time. This method will save you a lot of money, but will get merged into main branch soon. When that happens, update the git clone command and everything else will be the same. https://colab.research.google.com/drive/1IjDeNVBsRru2-hM_9aauD6gVI-VvJnpk?usp=sharing
If you are interested in a notebook, do let me know. I have one but the whole experiment will cost at minimum 250 dollars. I do not know how much it will cost if you use the new vllm method. It seems to offer quite a few improvements.
UCLA-AGI has proven that large language models, even weaker large language models, can improve themselves with data only produced by original model. The question they answer in their paper is:
"Can we empower a weak LLM to improve itself without acquiring additional human annotated data?"
They answer this question by the proposal and testing of a novel fine-tuning method they call Self-Play fIne-tuNing (SPIN). The process starts by applying a supervised fine-tune (SFT) to zephyr-7b using all 200k samples of HuggingfaceH4/ultrachat_200k to eliminate the need for a human annotator.
Once the model has completed SFT, the SPIN method suggests generating 50k samples of synthetic data pairs of 'chosen' and 'rejected' samples. The model will be fine tuned on those generations, and this process will repeat for another 3 iterations for a total 200k samples.
This experiment is unique because they propose that their method can yield upwards of 10% performance gains without using any additional human annotated data. The strategy was designed to improve the less strong language models, but with further experimentation could be a formidable strategy for improving language models.
If you would like to explore this strategy for yourself, here are some resources:
Colab: https://colab.research.google.com/drive/1IjDeNVBsRru2-hM_9aauD6gVI-VvJnpk?usp=sharing
Github: https://github.com/uclaml/SPIN
The product of the experiment: UCLA-AGI/zephyr-7b-sft-full-SPIN-iter3
Paper: Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models (2401.01335)
You are correct. Unsloth does utilize peft and trl, but as I understand it they are not necessarily cross-compatible.
Thank you I will test this feature and integrate into the notebook
To my knowledge that is not currently supported. I have provided the training arguments that unsloth documents, but I will do some more research and testing to provide a more comprehensive answer.
Here's the unsloth docs that show the supported training arguments https://github.com/unslothai/unsloth/tree/main?tab=readme-ov-file#-documentation
Unsloth is a framework for fine tuning language models boasting a 0% loss in accuracy while using no approximation methods. They offer a trainer for both supervised fine-tuning (SFT) and direct preference optimization (DPO) that can increase speed of fine-tuning by up to 5x.
This is achieved by adding LoRa adapters. This way they only need to train 1 to 10% of the total parameters. You can export a LoRa adapter or merge to 16-bit for a full finetune. The resulting model is prepared for use in vLLM for faster inference.
Additionally, Huggingface has integrated Unsloth into the documentation for DPO training and reported 18.6% performance gains on T4.
This sets a new standard for fine-tuning large language models. If you would like to explore this methodology for yourself I have provided a notebook "AutoSloth," where you can fine tune using either SFT or DPO and it will upload to HF with a prefilled Unsloth README π¦₯ and a Q8_0 quantization.
The SFT example is set up for free tier usage, but the DPO example is set up for an A100. The DPO example can be altered to work on T4 but I wanted to include more than one example.
Colab Stats during training:
+ Model: unsloth/mistral-7b-bnb-4bit
+ Dataset: yahma/alpaca-cleaned
+ Batch size: 2
+ Gradient steps: 4
+ System RAM: 8.5 / 51.0 GB
+ VRAM (T4): 13.6 / 15.0 GB
Resources:
π¦₯Unsloth: https://github.com/unslothai/unsloth
π¦₯AutoSloth: https://colab.research.google.com/drive/1Zo0sVEb2lqdsUm9dy2PTzGySxdF9CNkc?usp=sharing
π€HF-Unsloth-docs: https://huggingface.co/docs/trl/main/en/dpo_trainer#accelerate-dpo-fine-tuning-using-unsloth
π€HF-Unsloth Blog Post: https://huggingface.co/blog/unsloth-trl
I tried a 7b on my 4090 and it used ~16.4GB. T4 is 16GB so it very well may not work.
You could definitely do some experimenting with a smaller model and a T4 though
No a 7b model should work on V100 as well. If you want to test a larger model then itβs likely you will need the A100
I believe that this is related to the particular laser method. rmt_laser_snr and laser_snr_math are functional. I will look into the errors and see if its related to my notebook.
Layer-Selective Rank Reduction (LASER) is a denoising method that improves reasoning by the strategic removal of higher-order components from weight matrices in the multi-layer perceptron (MLP) layers without the need for additional parameters or training data. This process leverages singular value decomposition to identify and eliminate these components. This simple, yet effective, method has shown to improve question-answering performance by up to 27.4 percentage points.
LaserRMT implements this through a process by calculating signal to noise ratio (SNR) for each layer and selectively reducing the rank of these layers.The SNR method meticulously computes the SNR by leveraging singular value decomposition (SVD) to separate the signal (higher-order components) from the noise (lower-order components) within the weight matrices of the model's layers. The SNR calculation is what determines which layers would benefit from rank reduction without compromising the models integrity.
If a layer is identified that could benefit from rank reduction, then the layer will enter an incremental process where the weight matrices are reduced and reconstructed by retaining only the singular values that surpass the threshold. In the case of laserRMT, the threshold is calculated by Marchenko-Pastur Law.
@staticmethod
def marchenko_pastur_threshold(sigma, n, m):
beta = n / m if n < m else m / n
threshold = sigma * np.sqrt((1 + np.sqrt(beta))**2)
return thr
The two primary benefits of applying this method are reducing computational overhead of large language models and simultaneously improving output quality.
Credit to @ehartford @fernandofernandes @DavidGF for laserRMT
Resources:
βοΈ AutoLaser: https://colab.research.google.com/drive/11j0e-w6BfvqeFN1gUrpOqdW0vcKqfVqP?usp=sharing
laserRMT: https://github.com/cognitivecomputations/laserRMT
The Truth is in There: Improving Reasoning in Language Models with Layer-Selective Rank Reduction (2312.13558)
imatrix
quantization in place of quip#Quip-# is a quantization method proposed by [Cornell-RelaxML](https://github.com/Cornell-RelaxML) that claims tremendous performance gains using only 2-bit precision.
RelaxML proposes that quantizing a model from 16 bit to 2 bit precision they can utilize Llama-2-70B on a single 24GB GPU.
QuIP# aims to revolutionize model quantization through a blend of incoherence processing and advanced lattice codebooks. By switching to a Hadamard transform-based incoherence approach, QuIP# enhances GPU efficiency, making weight matrices more Gaussian-like and ideal for quantization with its improved lattice codebooks.
This new method has already seen some adoption by projects like llama.cpp. The use of the Quip-# methodology has been implemented in the form of imatrix calculations. The importance matrix is calculated from a dataset such as wiki.train.raw and will output the perplexity on the given dataset.
This interim step can improve the results of the quantized model. If you would like to explore this process for yourself:
llama.cpp - https://github.com/ggerganov/llama.cpp/
Quip# paper - https://cornell-relaxml.github.io/quip-sharp/
AutoQuip# colab - https://colab.research.google.com/drive/1rPDvcticCekw8VPNjDbh_UcivVBzgwEW?usp=sharing
Other impressive quantization projects to watch:
+ AQLM
https://github.com/Vahe1994/AQLM
https://arxiv.org/abs/2401.06118