fullstack committed on
Commit
2a25271
1 Parent(s): 5e16211

Upload folder using huggingface_hub

.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ tokenizer.json filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,200 @@
## Credits and Acknowledgments

TURBOPASTA is built upon the excellent work of [Fast Apply](https://github.com/kortix-ai/fast-apply) by Kortix AI. Our model leverages their dataset and builds on their pioneering approach to code merging and transformation. Key inspirations include:

- Dataset structure and generation methodology
- XML-based prompt engineering
- Evaluation metrics and benchmarking approaches

Special thanks to:

- The Kortix AI team for open-sourcing Fast Apply
- Their foundational work on high-speed code-transformation models
- The comprehensive dataset they have made available to the community

While TURBOPASTA introduces its own innovations, the groundwork laid by Fast Apply was instrumental in making this project possible. We encourage users interested in code-transformation models to also check out the original Fast Apply releases:

- [FastApply-7B-v1.0](https://huggingface.co/Kortix/FastApply-7B-v1.0)
- [FastApply-1.5B-v1.0](https://huggingface.co/Kortix/FastApply-1.5B-v1.0)
- [FastApply-dataset-v1.0](https://huggingface.co/datasets/Kortix/FastApply-dataset-v1.0)

This project is licensed under Apache-2.0, consistent with Fast Apply's open-source ethos.

---

Based on a dataset inspired by https://www.kortix.ai/

# TURBOPASTA LoRA Adapter for Qwen2.5-3B

A LoRA adapter for unsloth/Qwen2.5-3B that merges code updates using chain-of-thought reasoning while strictly preserving the original code's structure and formatting.

## Technical Specifications

### Base Model

- Model: unsloth/Qwen2.5-3B
- LoRA Rank: 64
- Target Modules: v_proj, o_proj, down_proj, up_proj, q_proj, k_proj, gate_proj
- Task: CAUSAL_LM
- Dropout: 0
- Alpha: 32

### Input/Output Format

Input XML structure:
```xml
<instruction>You are a coding assistant that helps merge code updates, ensuring every modification is fully integrated. Merge all changes from the snippet into the code. Preserve the code's structure, order, comments, and indentation exactly.</instruction>

<fastapply>
<code>
{original_code}
</code>
<update>
{update_snippet}
</update>
<finalcode>
{merged_result}
</finalcode>
</fastapply>
```

The model supports multiple `<fastapply>` blocks for few-shot in-context learning. Set `</fastapply>` as your stop token.
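A few-shot prompt with multiple solved blocks can be assembled mechanically. A minimal sketch, where the helper name `build_fewshot_prompt` and its signature are illustrative rather than part of the released tooling:

```python
# Instruction text taken verbatim from the input format above.
INSTRUCTION = (
    "<instruction>You are a coding assistant that helps merge code updates, "
    "ensuring every modification is fully integrated. Merge all changes from the "
    "snippet into the code. Preserve the code's structure, order, comments, and "
    "indentation exactly.</instruction>"
)

def build_fewshot_prompt(examples, original_code, update_snippet):
    """examples: list of (code, update, finalcode) tuples used as solved shots."""
    parts = [INSTRUCTION]
    # Each solved example is a complete <fastapply> block.
    for code, update, final in examples:
        parts.append(
            "<fastapply>\n<code>\n%s\n</code>\n<update>\n%s\n</update>\n"
            "<finalcode>\n%s\n</finalcode>\n</fastapply>" % (code, update, final)
        )
    # The final block is left open after </update>; the model is expected to
    # complete <finalcode> and stop at the </fastapply> stop token.
    parts.append(
        "<fastapply>\n<code>\n%s\n</code>\n<update>\n%s\n</update>"
        % (original_code, update_snippet)
    )
    return "\n".join(parts)
```

Leaving the last block open after `</update>` is what makes the stop-token scheme work: generation naturally ends once the model closes the block.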
## Deployment

### vLLM Server Setup

```bash
export VLLM_ALLOW_RUNTIME_LORA_UPDATING=1
export VLLM_ALLOW_LONG_MAX_MODEL_LEN=1

vllm serve unsloth/qwen2.5-3b \
    --gpu-memory-utilization=1 \
    --port 6002 \
    --served-model-name="turbopasta" \
    --trust-remote-code \
    --max-model-len 8192 \
    --disable-log-requests \
    --enable-lora \
    --lora-modules lora=./dataset/output/turbopasta/lora_model \
    --max-lora-rank 64
```

### Client Implementation

```python
import re

import requests

def merge_code(original_code: str, update_snippet: str,
               vllm_url: str = "http://localhost:6002/v1/completions") -> dict:
    xml_content = (
        '<instruction>You are a coding assistant that helps merge code updates, ensuring every modification is fully '
        'integrated. Merge all changes from the snippet into the code. Preserve the code\'s structure, order, comments, '
        'and indentation exactly.</instruction>\n'
        '<fastapply>\n'
        '  <code>\n'
        f'{original_code}\n'
        '  </code>\n'
        '  <update>\n'
        f'{update_snippet}\n'
        '  </update>'
    )

    response = requests.post(
        vllm_url,
        json={
            "prompt": xml_content,
            "max_tokens": 6000,
            "temperature": 0.1,
            "model": "lora",
            "stop": ["</fastapply>"]
        },
        timeout=300  # requests interprets this in seconds
    )
    response.raise_for_status()

    completion = response.json()["choices"][0]["text"]

    def extract_tag(tag: str) -> str:
        match = re.search(f'<{tag}>(.*?)</{tag}>', completion, re.DOTALL)
        return match.group(1).strip() if match else ""

    return {
        "merged_code": extract_tag("finalcode")
    }
```

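The tag-parsing step can be exercised without a running server. A small self-contained sketch, using a fabricated sample completion (the sample text is made up for illustration):

```python
import re

def extract_tag(completion: str, tag: str) -> str:
    # Pull the inner text of the first <tag>...</tag> pair, if present.
    match = re.search(f'<{tag}>(.*?)</{tag}>', completion, re.DOTALL)
    return match.group(1).strip() if match else ""

# Fabricated example of what the model emits before the </fastapply> stop token.
sample = "\n<finalcode>\ndef add(a, b):\n    return a + b\n</finalcode>\n"
merged = extract_tag(sample, "finalcode")
```

Note the fallback: if the model fails to emit a well-formed `<finalcode>` block, `extract_tag` returns an empty string, which callers should treat as a failed merge rather than an empty file.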
### Batch Processing

The model works with the included data processor for parallel processing of code updates:

```python
from request_processor import RequestProcessor

processor = RequestProcessor(
    input_file="updates.jsonl",
    output_file="merged.jsonl",
    num_threads=24
)
processor.process_file()
```

Input JSONL format:
```json
{
  "id": "update_id",
  "original_code": "...",
  "update_snippet": "...",
  "file_path": "path/to/file"
}
```
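Input files in this shape can be produced with the standard `json` module, one object per line. A minimal sketch (record values are placeholders):

```python
import json

updates = [
    {"id": "u1", "original_code": "x = 1\n", "update_snippet": "x = 2\n",
     "file_path": "src/a.py"},
    {"id": "u2", "original_code": "y = 1\n", "update_snippet": "y = 3\n",
     "file_path": "src/b.py"},
]

# One JSON object per line, as the processor expects.
with open("updates.jsonl", "w") as f:
    for record in updates:
        f.write(json.dumps(record) + "\n")

# Reading it back line by line mirrors how a JSONL consumer iterates.
with open("updates.jsonl") as f:
    loaded = [json.loads(line) for line in f]
```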

Output JSONL format:
```json
{
  "id": "update_id",
  "original_code": "...",
  "update_snippet": "...",
  "merged_code": "...",
  "file_path": "path/to/file",
  "processed_at": "2024-10-24 02:52:33"
}
```

## Implementation and Performance Considerations

- Uses thread pooling for parallel processing
- Atomic writes with file locking
- Progress tracking with tqdm
- Automatic error handling and logging
- Configurable thread count for optimization
- Temperature set to 0.1 for consistent merges

## Error Handling

Errors are captured in the output JSONL:
```json
{
  "error": "error message",
  "processed_at": "timestamp"
}
```

Monitor errors in real time:
```bash
tail -f merged.jsonl | grep error
```
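For an after-the-fact summary instead of live tailing, error records can be filtered in a few lines of Python. The sample records below are made up; in practice each line would come from `merged.jsonl`:

```python
import json

# Fabricated processor output: one successful merge, one failure.
lines = [
    '{"id": "u1", "merged_code": "x = 2", "processed_at": "2024-10-24 02:52:33"}',
    '{"error": "timeout", "processed_at": "2024-10-24 02:53:01"}',
]

# Error records carry an "error" key instead of "merged_code".
errors = [rec for rec in map(json.loads, lines) if "error" in rec]
```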

## Model Training Details

This model was trained using Force Multiplier's autotuning pipeline with the following key characteristics:

- Base Model: unsloth/Qwen2.5-3B
- Training Type: Few-shot learning with chain-of-thought reasoning
- Special Focus: Code structure preservation and merge accuracy
- LoRA Configuration: Optimized for code understanding and generation

## Limitations

- Maximum context length of 8192 tokens
- Best suited for single-file code changes
- May require multiple passes for complex refactoring
- Not recommended for binary file merges
adapter_config.json ADDED
@@ -0,0 +1,34 @@
{
  "alpha_pattern": {},
  "auto_mapping": null,
  "base_model_name_or_path": "unsloth/Qwen2.5-3B",
  "bias": "none",
  "fan_in_fan_out": false,
  "inference_mode": true,
  "init_lora_weights": true,
  "layer_replication": null,
  "layers_pattern": null,
  "layers_to_transform": null,
  "loftq_config": {},
  "lora_alpha": 32,
  "lora_dropout": 0,
  "megatron_config": null,
  "megatron_core": "megatron.core",
  "modules_to_save": null,
  "peft_type": "LORA",
  "r": 64,
  "rank_pattern": {},
  "revision": null,
  "target_modules": [
    "v_proj",
    "o_proj",
    "down_proj",
    "up_proj",
    "q_proj",
    "k_proj",
    "gate_proj"
  ],
  "task_type": "CAUSAL_LM",
  "use_dora": false,
  "use_rslora": false
}
adapter_model.safetensors ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:b69d737bfcf70429b6aa88aed71703f4beb07daf8a51a6088706c17b14f2afc3
size 479005064
added_tokens.json ADDED
@@ -0,0 +1,25 @@
{
  "</tool_call>": 151658,
  "<tool_call>": 151657,
  "<|PAD_TOKEN|>": 151665,
  "<|box_end|>": 151649,
  "<|box_start|>": 151648,
  "<|endoftext|>": 151643,
  "<|file_sep|>": 151664,
  "<|fim_middle|>": 151660,
  "<|fim_pad|>": 151662,
  "<|fim_prefix|>": 151659,
  "<|fim_suffix|>": 151661,
  "<|im_end|>": 151645,
  "<|im_start|>": 151644,
  "<|image_pad|>": 151655,
  "<|object_ref_end|>": 151647,
  "<|object_ref_start|>": 151646,
  "<|quad_end|>": 151651,
  "<|quad_start|>": 151650,
  "<|repo_name|>": 151663,
  "<|video_pad|>": 151656,
  "<|vision_end|>": 151653,
  "<|vision_pad|>": 151654,
  "<|vision_start|>": 151652
}
merges.txt ADDED
The diff for this file is too large to render. See raw diff
 
special_tokens_map.json ADDED
@@ -0,0 +1,31 @@
{
  "additional_special_tokens": [
    "<|im_start|>",
    "<|im_end|>",
    "<|object_ref_start|>",
    "<|object_ref_end|>",
    "<|box_start|>",
    "<|box_end|>",
    "<|quad_start|>",
    "<|quad_end|>",
    "<|vision_start|>",
    "<|vision_end|>",
    "<|vision_pad|>",
    "<|image_pad|>",
    "<|video_pad|>"
  ],
  "eos_token": {
    "content": "<|endoftext|>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "pad_token": {
    "content": "<|PAD_TOKEN|>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  }
}
tokenizer.json ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:fab42efe8d17406525a9154b728cf9e957629a8ed7ce997770efdd71128c6a1a
size 11422086
tokenizer_config.json ADDED
@@ -0,0 +1,215 @@
{
  "add_bos_token": false,
  "add_prefix_space": false,
  "added_tokens_decoder": {
    "151643": {
      "content": "<|endoftext|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "151644": {
      "content": "<|im_start|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "151645": {
      "content": "<|im_end|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "151646": {
      "content": "<|object_ref_start|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "151647": {
      "content": "<|object_ref_end|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "151648": {
      "content": "<|box_start|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "151649": {
      "content": "<|box_end|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "151650": {
      "content": "<|quad_start|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "151651": {
      "content": "<|quad_end|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "151652": {
      "content": "<|vision_start|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "151653": {
      "content": "<|vision_end|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "151654": {
      "content": "<|vision_pad|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "151655": {
      "content": "<|image_pad|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "151656": {
      "content": "<|video_pad|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "151657": {
      "content": "<tool_call>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "151658": {
      "content": "</tool_call>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "151659": {
      "content": "<|fim_prefix|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "151660": {
      "content": "<|fim_middle|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "151661": {
      "content": "<|fim_suffix|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "151662": {
      "content": "<|fim_pad|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "151663": {
      "content": "<|repo_name|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "151664": {
      "content": "<|file_sep|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "151665": {
      "content": "<|PAD_TOKEN|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    }
  },
  "additional_special_tokens": [
    "<|im_start|>",
    "<|im_end|>",
    "<|object_ref_start|>",
    "<|object_ref_end|>",
    "<|box_start|>",
    "<|box_end|>",
    "<|quad_start|>",
    "<|quad_end|>",
    "<|vision_start|>",
    "<|vision_end|>",
    "<|vision_pad|>",
    "<|image_pad|>",
    "<|video_pad|>"
  ],
  "bos_token": null,
  "clean_up_tokenization_spaces": false,
  "eos_token": "<|endoftext|>",
  "errors": "replace",
  "model_max_length": 32768,
  "pad_token": "<|PAD_TOKEN|>",
  "padding_side": "right",
  "split_special_tokens": false,
  "tokenizer_class": "Qwen2Tokenizer",
  "unk_token": null
}
vocab.json ADDED
The diff for this file is too large to render. See raw diff