[繁](./README_ZH.md) | [简](./README_SC.md) | EN
[🛠️ Operation Principles](#operation-principles) | [📁 File Structure](#file-structure) | [🖥️ Usage Instructions](#usage-instructions) | [👀 Example Results](#example-results)
[📣 Common Errors](#common-errors) | [🙋🏻‍♂️ Frequently Asked Questions](#frequently-asked-questions)
# LIHKG Language Model (LiLM)

Inspired by [Yi Lin](https://www.youtube.com/@lyi)'s [bilibot project](https://github.com/linyiLYi/bilibot/tree/main) and [video](https://www.youtube.com/watch?v=52clfKcM4M4&t=1s), this experimental project fine-tunes a language model on responses from users of the [LIHKG forum](https://lihkg.com), known for their distinctive linguistic style, to create a Cantonese post-response generation model.

After balancing computing costs against the [Chinese capability of base models](https://github.com/jeinlee1991/chinese-llm-benchmark), the open-source base model selected for this project is [Qwen/Qwen1.5-32B-Chat](https://huggingface.co/Qwen/Qwen1.5-32B-Chat), which has 32 billion parameters. Fine-tuning is performed with the [LoRA algorithm](https://arxiv.org/abs/2106.09685) on an M3 Max 128GB and an M2 Ultra 192GB, using the machine-learning framework [MLX](https://github.com/ml-explore/mlx) for Apple Silicon and the [MLX-LM LoRA fine-tuning example](https://github.com/ml-explore/mlx-examples/blob/main/llms/mlx_lm/LORA.md#fine-tune).

The fine-tuned model shows a significant improvement in Cantonese ability, and its tone and style are deeply influenced by the community of [LIHKG](https://zh.wikipedia.org/zh-hk/LIHKG討論區) users. For details, see [Example Results](#example-results). The fine-tuned model is available on Hugging Face: [alphrc/lilm](https://huggingface.co/alphrc/lilm/tree/main).

To learn more about artificial intelligence and to see more innovative and interesting projects in the future, please follow [alphrc](https://github.com/alphrc).

### GitHub
- https://github.com/alphrc/lilm/tree/main

### Project Motivation
- This project aims to demonstrate the language-style imitation capability of large language models, based on Cantonese spoken data and the unique linguistic style of a forum. It is intended primarily for popular education, academic research, and technical demonstration, so the content is relatively detailed.

### Usage Limitations
- The model is trained on public data. Although sensitive content has been cleaned as far as possible, biases present in the training data may remain, so inappropriate use of the model should be avoided.
- The generated text reflects a specific community culture; understand the relevant background before use.
- Test thoroughly before any real-world application, avoid use in sensitive or controversial situations, and set up monitoring mechanisms to prevent the generation of inappropriate content.

### Remarks
- All project code is self-written. Members of the open-source community are welcome to review the project, provide feedback and suggestions, and participate directly in its improvement.
- This project is an exercise in using third-party training frameworks and models; the main challenges were system configuration, data fetching, data engineering, repeated trial and error, and long waits.
- Configuration information and content are organized in a `.env` file so that users can adjust them to individual or organizational needs, ensuring flexibility and applicability. Its format is provided in `.env.template`; rename the file to `.env` to use it.

## Operation Principles

### Fine-tuning

Large [pre-trained language models](https://www.kaggle.com/code/vad13irt/language-model-pre-training) possess basic and general human language response capabilities. By [fine-tuning](https://en.wikipedia.org/wiki/Fine-tuning_(deep_learning)) such a model with specific textual data, it can learn that data further, enhancing its ability to mimic the data's tone, style, information, and word usage. It is important to note that fine-tuning with specific data does not grant the model language abilities from scratch; rather, it deepens the model's understanding of local textual information and patterns on top of its pre-trained capabilities.

### Dataset

This project scrapes large-scale public data from the [LIHKG forum](https://lihkg.com) and processes the raw data into a fine-tuning dataset. To enhance data quality, the filtering criteria are:

- The first response to the post is not by the post author, ensuring the completeness of the information on which the response is based.
- The response is positively rated, ensuring it aligns with the mainstream opinions of the forum.
- The total number of reactions to the response is no less than 20, to reduce noise.
- It is not a reply to another response.
- It is not the post author's own response.
- It does not contain any external links or embeds.
- It does not contain sensitive words.
- The total number of words, plus the system message, does not exceed 2048.

These responses, combined with the corresponding post's title, content, and category, along with a [system message](https://promptmetheus.com/resources/llm-knowledge-base/system-message), are converted into the [format](https://github.com/ml-explore/mlx-examples/blob/main/llms/mlx_lm/LORA.md#data) required by the MLX-LM LoRA fine-tuning example and randomly shuffled to generate the total dataset. The total dataset is divided into a training set (80%), a validation set (10%), and a test set (10%); the posts in the test set do not appear in the training or validation sets, in order to validate [generalization](https://towardsdatascience.com/generalization-in-ai-systems-79c5b6347f2c) and prevent [overfitting](https://en.wikipedia.org/wiki/Overfitting). The final version of the training set includes about 60,000 posts meeting the criteria, yielding 27,792 training items; the validation and test sets each contain 3,474 items.
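For illustration, the minimal sketch below builds one training record in the chat format that the MLX-LM LoRA example expects (`{"messages": [...]}`) and appends it to `dataset/chat/train.jsonl`. The system message, the scraped-post fields, and the user-prompt template here are hypothetical placeholders, not the project's actual prompts or schema.

```python
import json

# Hypothetical scraped post and selected response (placeholders, not real data).
post = {
    "category": "創意台",
    "title": "example post title",
    "content": "example post content",
    "response": "example highly-rated response",
}

# Hypothetical system message; the project's actual prompt may differ.
system_message = "你是一個連登(LIHKG)用戶,請用連登的語氣回覆以下帖子。"

record = {
    "messages": [
        {"role": "system", "content": system_message},
        {"role": "user", "content": f"類別:「{post['category']}」\n標題:「{post['title']}」\n內容:「{post['content']}」"},
        {"role": "assistant", "content": post["response"]},
    ]
}

# Append one JSON object per line, as required by the MLX-LM chat data format.
with open("dataset/chat/train.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(record, ensure_ascii=False) + "\n")
```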
### Base Model

The open-source base model [Qwen/Qwen1.5-32B-Chat](https://huggingface.co/Qwen/Qwen1.5-32B-Chat) has 32 billion parameters at BF16 precision. When MLX-LM is run for the first time and no model is found in `~/.cache`, it automatically downloads the model from Hugging Face to `~/.cache/huggingface/hub/models--Qwen--Qwen1.5-32B-Chat`; users do not need to download it manually in advance. The model is about 65GB in size and is downloaded in several blocks; if the download is interrupted, the already-downloaded blocks are kept and the download resumes next time, so there is no need to start over.

### LoRA

In traditional training and fine-tuning, all parameters in some large weight matrices of the model must be adjusted at the same time, which demands significant memory and computing power. In contrast, [LoRA (Low Rank Adaptation)](https://arxiv.org/abs/2106.09685) uses two smaller matrices to approximate the change to each large matrix, significantly reducing the number of trainable parameters. This allows the model to be fine-tuned on devices with less memory and greatly reduces training time. In practice, the base model has 32.5B parameters in total; after applying LoRA to all 63 attention layers, the number of learnable parameters drops to 8.3M, only 0.026% of the original.

Using MLX-LM LoRA to fine-tune the model does not alter the model's original parameters; instead, it generates adapters that are used together with the model. During fine-tuning, MLX-LM automatically creates an `adapters/` folder in the current working directory and saves adapter checkpoints in `.safetensors` format, each about 33.6MB. These checkpoints can be used later to continue fine-tuning.
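As a rough illustration of why LoRA shrinks the trainable parameter count, the sketch below (plain NumPy, with hypothetical dimensions and rank, not the actual Qwen1.5 configuration) approximates the update to a frozen weight matrix `W` with two low-rank factors `A` and `B`, so that only `A` and `B` need gradients.

```python
import numpy as np

d_model, rank = 4096, 8           # hypothetical hidden size and LoRA rank
alpha = 16                        # hypothetical LoRA scaling factor

# Frozen pre-trained weight (not updated during LoRA fine-tuning).
W = np.random.randn(d_model, d_model).astype(np.float32)

# Trainable low-rank factors: B (d_model x rank) and A (rank x d_model).
A = np.random.randn(rank, d_model).astype(np.float32) * 0.01
B = np.zeros((d_model, rank), dtype=np.float32)   # zero-init so the update starts at 0

def lora_linear(x: np.ndarray) -> np.ndarray:
    """Effective weight is W + (alpha / rank) * B @ A; only A and B are trained."""
    return x @ (W + (alpha / rank) * (B @ A)).T

full_params = W.size              # parameters a full fine-tune would update
lora_params = A.size + B.size     # parameters LoRA actually trains
print(f"full: {full_params:,}  lora: {lora_params:,}  ratio: {lora_params / full_params:.4%}")
# full: 16,777,216  lora: 65,536  ratio: 0.3906%  (for this single hypothetical matrix)
```

Applied across all the attention layers of the real model, this same reduction is what brings the learnable parameters down to roughly 8.3M, and merging the scaled `B @ A` term back into `W` is essentially what the model-fusion step described later performs.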
### Gradient Checkpointing

Gradient checkpointing is a technique for saving memory when training large neural networks. During training, effective [backpropagation](https://brilliant.org/wiki/backpropagation/#:~:text=Backpropagation%2C%20short%20for%20%22backward%20propagation,to%20the%20neural%20network's%20weights.) normally requires keeping the outputs of intermediate layers for gradient computation, which consumes substantial memory, especially in deep networks. With gradient checkpointing, only the outputs of certain key layers are saved during the forward pass; when gradients are needed, the missing intermediate values are recomputed from these saved checkpoints. This preserves training correctness while significantly reducing memory use.
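The toy sketch below (plain Python, unrelated to MLX-LM's internal implementation) illustrates the idea on a chain of scalar layers `y_i = w_i * y_{i-1}`: the forward pass stores only every k-th activation, and the backward pass recomputes the missing ones from the nearest checkpoint before applying the chain rule.

```python
def forward_checkpointed(x, weights, every=4):
    """Forward pass that keeps only every `every`-th activation as a checkpoint."""
    ckpts, y = {0: x}, x
    for i, w in enumerate(weights, start=1):
        y = y * w
        if i % every == 0:
            ckpts[i] = y          # store a checkpoint instead of every activation
    return y, ckpts

def backward_checkpointed(weights, ckpts, every=4):
    """Compute dL/dw_i for L = y_n, recomputing missing activations per segment."""
    n = len(weights)
    grads, upstream = [0.0] * n, 1.0          # upstream = dL/dy_i, starting at dL/dy_n = 1
    for i in range(n, 0, -1):
        start = ((i - 1) // every) * every    # nearest checkpoint at or before y_{i-1}
        y = ckpts[start]
        for j in range(start, i - 1):         # recompute y_{i-1} from the checkpoint
            y = y * weights[j]
        grads[i - 1] = upstream * y           # dL/dw_i = dL/dy_i * y_{i-1}
        upstream *= weights[i - 1]            # dL/dy_{i-1} = dL/dy_i * w_i
    return grads

weights = [1.1, 0.9, 1.2, 0.8, 1.05, 0.95, 1.3, 0.7]
y, ckpts = forward_checkpointed(2.0, weights)
print(backward_checkpointed(weights, ckpts))  # same gradients, far fewer stored activations
```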
### Model Fusion

After fine-tuning is complete, MLX-LM can merge the adapter and the original model into a complete model in the `model/lilm` folder in the current working directory, approximately 65GB in size. Afterwards, this model can be used directly via the folder path, without loading the original model and the adapter separately.

## File Structure

- `src/` : Python code
  - `data.py` : Multithreaded data fetching via proxies, formatting, and preliminary processing (a proxy is required to run)
  - `dataset.py` : Data processing, transformation, and filtering
  - `run.py` : LiLM model packaging and a basic user interface
- `data/` : Raw data obtained by data fetching, stored as `.csv`
- `dataset/` : Processed training data, divided into `completion/` and `chat/` formats
- `adapters/` : Adapters and configuration automatically generated by `mlx_lm.lora`
- `adapters-llama3-70b/` : Adapters for Llama3-70B
- `model/lilm` : Fused model formed by merging the base model and the adapter, generated by the fusion script below
- `demo/` : Example data, used by `run.py`

## Usage Instructions

### Hardware Requirements

This project uses Apple's proprietary MLX framework, so it can only run on macOS systems with Apple Silicon chips (M1 or later). The local machine needs about 75GB of RAM for smooth inference and about 122GB of RAM for smooth fine-tuning.

### Environment Setup

Run the following shell script to set up the environment with [Anaconda](https://www.anaconda.com) and install all dependencies listed in `requirements.txt`.

```bash
conda create -n lilm python=3.9
conda activate lilm
pip install -r requirements.txt
```

### Monitoring System Resource Usage (Optional)

Use the `asitop` module to monitor computer resource usage (CPU, GPU, and RAM) in real time through a graphical interface and ensure the program is running normally.

```bash
sudo asitop
```

### Inference Using the Base Model

The model is downloaded automatically the first time it is run; `--model` can be either the full name of the model on Hugging Face or its local path.

```bash
mlx_lm.generate \
    --model Qwen/Qwen1.5-32B-Chat \
    --prompt "What is LIHKG?"
```

### Fine-tuning

After preparing the `train.jsonl` and `valid.jsonl` datasets in `dataset/chat`, start fine-tuning the model from scratch and generate the `adapters/` folder.

```bash
mlx_lm.lora \
    --model Qwen/Qwen1.5-32B-Chat \
    --train \
    --data dataset/chat \
    --iters 600 \
    --grad-checkpoint
```

### Continue Fine-tuning

Continue fine-tuning from an existing adapter; `--resume-adapter-file` must be a `.safetensors` file.

```bash
mlx_lm.lora \
    --model Qwen/Qwen1.5-32B-Chat \
    --resume-adapter-file adapters/adapters.safetensors \
    --train \
    --data dataset/chat \
    --iters 600 \
    --grad-checkpoint
```

🚨 Please note, you are likely to encounter [this error](#error-1).

### Inference with Adapter

Perform generation using the base model combined with an adapter; the adapter must be a `.safetensors` file.

```bash
mlx_lm.generate \
    --model Qwen/Qwen1.5-32B-Chat \
    --adapter-path adapters/adapters.safetensors \
    --prompt "What is LIHKG?"
```

### Fusion of Base Model and Adapter

The latest checkpoint `adapters.safetensors` in `adapters/` is automatically selected for fusion, and the fused model is saved to `model/lilm`.

```bash
mlx_lm.fuse \
    --model Qwen/Qwen1.5-32B-Chat \
    --adapter-path adapters \
    --save-path model/lilm
```

### Inference Using the Fused Model

Pass the path of the fused model to `--model`.

```bash
mlx_lm.generate \
    --model model/lilm \
    --prompt "What is LIHKG?"
```

### Model Quantization (Optional)

Use [quantization](https://blog.csdn.net/jinzhuojun/article/details/106955059) to reduce the precision of model parameters, compress the model size, speed up inference, and lower memory usage. `--hf-path` works as before: it can be the full name of the model on Hugging Face or the model's local path, and `--mlx-path` is the path where the compressed model is stored. Testing shows, however, that quantization significantly decreases model accuracy, and the quantized model cannot be run with Hugging Face's Transformers.

```bash
mlx_lm.convert \
    --hf-path model/lilm \
    --mlx-path model/lilm-4Bit \
    -q
```

### Running LiLM

Use `src/run.py` to run the fused model; you can choose the `interactive` mode and enter a post link to generate a response.

```bash
python src/run.py
```

## Example Results

LiLM shows significant improvement over the base model in Cantonese ability, and its language style is clearly influenced by the LIHKG discussion forum. The following content is for illustration only and may be offensive; sensitive words are displayed as "X".

### Example 1

**Prompt ([Original Post](https://lihkg.com/thread/3699748/page/1)):**
> 類別:「創意台」