---
license: llama2
language:
- en
base_model:
- meta-llama/CodeLlama-13b-hf
pipeline_tag: text-generation
tags:
- code
- gguf
- llama.cpp
- llmstudio
---

# Model Card for TestGen-Dart v0.2 (GGUF Version)

This model card provides information about **TestGen-Dart v0.2 (GGUF Version)**, a fine-tuned version of Meta's Code Llama 13B model, optimized for generating unit tests in Dart for mobile applications. This GGUF-quantized model is designed to run efficiently with frameworks like **LLMStudio** and **llama.cpp**, enabling deployment on resource-constrained hardware while maintaining robust performance.

---

## Model Details

### Model Description

**TestGen-Dart v0.2** is a fine-tuned version of Code Llama 13B, specifically adapted for generating unit test cases for Dart code. The GGUF quantization enables its use on lightweight, consumer-grade systems without significant performance loss.

- **Developed by:** Jacob Hoffmann, Demian Frister (Karlsruhe Institute of Technology - KIT, AIFB-BIS)
- **Funded by:** Helmholtz Association's Initiative and Networking Fund on the HAICORE@FZJ partition
- **Shared by:** Jacob Hoffmann, Demian Frister
- **Model type:** Fine-tuned Code Llama 13B for test generation in Dart
- **Language(s):** English
- **License:** LLaMA 2 Community License
- **Finetuned from model:** Meta's Code Llama 13B

### Model Sources

- **Repository:** [GitHub Repository](https://github.com/example/repo) (placeholder)
- **Paper:** ["Generating Software Tests for Mobile Applications Using Fine-Tuned Large Language Models"](https://doi.org/10.1145/3644032.3644454) (published at AST '24)
- **Demo:** Coming soon

---

## Uses

### Direct Use

The model can be used in a zero-shot setting with **llama.cpp** or **LLMStudio** to generate unit tests in Dart. Provide the class code as input, and the model outputs structured unit tests using Dart's `test` package.
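As a minimal sketch of this input format, the zero-shot prompt can be assembled programmatically before being passed to the runtime. The instruction wording below is a hypothetical template modeled on the usage example in this card, not a documented prompt format:

```python
def build_prompt(dart_class_source: str) -> str:
    """Build a zero-shot prompt asking the model for Dart unit tests.

    The instruction text is an assumed template based on this card's
    llama.cpp usage example; adjust it to your own prompting style.
    """
    return (
        "Generate unit tests in Dart for the following class:\n"
        + dart_class_source
    )


# Example: the Calculator class used elsewhere in this card.
calculator = "class Calculator { int add(int a, int b) { return a + b; } }"
prompt = build_prompt(calculator)
print(prompt)
```

The resulting string can be passed directly as the `--prompt` argument to llama.cpp or pasted into the LLMStudio chat window.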
### Downstream Use

This model is suitable for integration into developer tools, IDE extensions, or continuous integration pipelines to automate test generation for Dart-based applications.

### Out-of-Scope Use

- Do not use this model for tasks unrelated to Dart test generation.
- Avoid using this model to improve or train other LLMs not based on LLaMA or its derivatives, per the LLaMA 2 Community License.
- Misuse for malicious purposes, such as generating incorrect or harmful test cases, is prohibited.

---

## Running the GGUF Model with llama.cpp

To use this GGUF-quantized model with llama.cpp:

1. Clone the llama.cpp repository and build the binaries:

   ```bash
   git clone https://github.com/ggerganov/llama.cpp
   cd llama.cpp
   make
   ```

2. Place the GGUF file in the models directory:

   ```bash
   mkdir -p models/testgen-dart-v0.2
   mv /path/to/CodeLlama-13B-TestGen-Dart_v0.2.gguf models/testgen-dart-v0.2/
   ```

3. Run the model:

   ```bash
   ./main -m models/testgen-dart-v0.2/CodeLlama-13B-TestGen-Dart_v0.2.gguf --prompt "Generate unit tests in Dart for the following class:\nclass Calculator { int add(int a, int b) { return a + b; } }"
   ```

## Training Details

### Training Data

The fine-tuning dataset consists of **16,252 Dart code-test pairs** extracted from open-source GitHub repositories using Google BigQuery. The data was subjected to quality filtering and deduplication to ensure high relevance and consistency.

### Training Procedure

- **Fine-tuning Approach:** Supervised Fine-Tuning (SFT) with QLoRA for memory efficiency.
- **Hardware:** Training was conducted on a single NVIDIA A100 GPU.
- **Optimization:** Flash Attention 2 was utilized for enhanced performance.
- **Duration:** The training process ran for up to 32 hours.
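A minimal sketch of what such a QLoRA setup could look like with the Hugging Face `transformers` and `peft` libraries is shown below. The learning rate, epoch count, FP16 precision, and AdamW optimizer match the hyperparameters listed in this card; the LoRA rank, alpha, and target modules are assumptions, as the card does not specify them:

```python
# Configuration sketch of a QLoRA fine-tuning setup (not the authors'
# exact script). LoRA rank/alpha and target modules are assumed values.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig, TrainingArguments
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # QLoRA: 4-bit quantized base weights
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,  # FP16 compute, as listed below
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/CodeLlama-13b-hf",
    quantization_config=bnb_config,
    device_map="auto",
)

lora_config = LoraConfig(
    r=16,                                  # assumed rank
    lora_alpha=32,                         # assumed scaling
    target_modules=["q_proj", "v_proj"],   # assumed attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

training_args = TrainingArguments(
    output_dir="testgen-dart-v0.2",
    learning_rate=5e-5,   # as listed in this card
    num_train_epochs=3,   # as listed in this card
    fp16=True,
    optim="adamw_torch",
)
```

The adapter weights produced by such a run would then be merged into the base model and converted to GGUF for llama.cpp inference.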
### Training Hyperparameters

- **Mixed Precision:** FP16
- **Optimizer:** AdamW
- **Learning Rate:** 5e-5
- **Epochs:** 3

### Environmental Impact

- **Hardware Type:** NVIDIA A100 GPU
- **Hours Used:** 32 hours
- **Carbon Emitted:** 13.099 kgCO2eq

---

## Evaluation

### Testing Data, Factors & Metrics

- **Testing Data:** A subset of **42 Dart files** from the training dataset, evaluated in a zero-shot setting.
- **Factors:** Syntax correctness, functional correctness.
- **Metrics:** pass@1, syntax error rate, functional correctness rate.

### Results

- **Syntax Correctness:** +76% improvement compared to the base model.
- **Functional Correctness:** +16.67% improvement compared to the base model.

---

## Citation

If you use this model in your research, please cite:

**BibTeX:**

```bibtex
@inproceedings{hoffmann2024testgen,
  title={Generating Software Tests for Mobile Applications Using Fine-Tuned Large Language Models},
  author={Hoffmann, Jacob and Frister, Demian},
  booktitle={Proceedings of the 5th ACM/IEEE International Conference on Automation of Software Test (AST 2024)},
  year={2024},
  doi={10.1145/3644032.3644454}
}
```

## Model Card Contact

- **Jacob Hoffmann**: [jacob.hoffmann@partner.kit.edu](mailto:jacob.hoffmann@partner.kit.edu)
- **Demian Frister**: [demian.frister@kit.edu](mailto:demian.frister@kit.edu)