---
library_name: transformers
license: gemma
metrics:
- accuracy
- perplexity
base_model:
- google/gemma-2-2b
---

# Model Card for oopere/pruned40-gemma-2-2b

<!-- Provide a quick summary of what the model is/does. -->

This model is a pruned version of the Gemma-2-2b architecture, with a 40% parameter reduction in the MLP layers. The pruning aims to improve computational efficiency while maintaining acceptable performance on specific tasks.

This model is not intended to be used directly; rather, it is meant to be fine-tuned for specific tasks, where it can match or exceed the performance obtained by fine-tuning the base model for the same task.
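
A minimal loading sketch (illustrative, not taken from the pruning notebook): the checkpoint loads with the standard `transformers` API and is then fine-tuned for the target task. The FP16 dtype below is an assumption; adjust it to your hardware.

```python
# Minimal loading sketch for the pruned checkpoint.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "oopere/pruned40-gemma-2-2b"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16)

# The pruned checkpoint is intended as a starting point for task-specific
# fine-tuning (e.g., with Trainer or PEFT), not for direct zero-shot use.
```
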
## Model Details

- **Model Type:** Pruned version of Gemma-2-2b using structured pruning
- **Original Model:** google/gemma-2-2b
- **Pruning Method:** Structured pruning of MLP layers using importance scores based on absolute maximum weights
- **Size Reduction:** 11.36% (from 2.2B to 1.95B parameters)
- **Architecture:** Same as the original Gemma 2 architecture, but with reduced MLP layer sizes (see the configuration check below)
- **Language(s):** Same as the original model
- **License:** Gemma
- **Developed by:** [Pere Martra](https://huggingface.co/oopere)
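
As a quick, illustrative check of the reduced MLP width (assuming both checkpoints expose `intermediate_size` in their `transformers` configs):

```python
# Sketch: compare the MLP (intermediate) width of the base and pruned configs.
from transformers import AutoConfig

base_cfg = AutoConfig.from_pretrained("google/gemma-2-2b")
pruned_cfg = AutoConfig.from_pretrained("oopere/pruned40-gemma-2-2b")

print("base intermediate_size:  ", base_cfg.intermediate_size)
print("pruned intermediate_size:", pruned_cfg.intermediate_size)
```
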
### Key Findings

- Maintains moderate performance on binary classification tasks (BoolQ)
- Significant but manageable degradation on reasoning tasks (ARC-Easy)
- Substantial impact on long-range comprehension (LAMBADA)
- Notable increase in perplexity (from 3.71 to 29.68 on LAMBADA-OpenAI)

### Limitations

- Considerable reduction in performance on complex language understanding tasks
- Significant degradation in long-range dependency handling
- May not be suitable for applications requiring high accuracy on language completion tasks
- Best suited for simpler classification tasks

### Implementation Details

- **Pruning Notebook:** [Detailed implementation and methodology](https://github.com/peremartra/Large-Language-Model-Notebooks-Course/blob/main/6-PRUNING/6_3_pruning_structured_llama3.2-1b_OK.ipynb)
- **GitHub Repository:** [LLM Course](https://github.com/peremartra/Large-Language-Model-Notebooks-Course)
### Pruning Method

- **Technique:** Structured pruning targeting MLP layers (a simplified sketch follows this list)
- **Pruning Ratio:** 40% of neurons removed from MLP layers
- **Selection Criteria:** Importance scoring based on absolute maximum weights
- **Architecture Specifics:** Original architecture structure maintained during pruning
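
The snippet below is an illustrative sketch of this idea, not the exact code from the linked notebook: each MLP neuron is scored by the absolute maximum of its weights, the top 60% are kept, and the Gemma-style `gate_proj`/`up_proj`/`down_proj` projections are sliced accordingly (attribute names are assumptions; see the notebook above for the actual implementation).

```python
# Illustrative sketch of structured MLP pruning by absolute-maximum weight
# importance (assumes a Gemma-style MLP with gate_proj/up_proj/down_proj, no biases).
import torch

def prune_mlp_neurons(mlp, keep_ratio=0.6):
    # Score each intermediate neuron by the largest absolute weight attached to it.
    importance = torch.maximum(
        mlp.gate_proj.weight.abs().max(dim=1).values,
        mlp.up_proj.weight.abs().max(dim=1).values,
    )
    n_keep = int(importance.numel() * keep_ratio)
    keep = torch.topk(importance, n_keep).indices.sort().values

    # Keep the selected rows of gate/up projections and the matching columns of down_proj.
    mlp.gate_proj.weight.data = mlp.gate_proj.weight.data[keep, :]
    mlp.gate_proj.out_features = n_keep
    mlp.up_proj.weight.data = mlp.up_proj.weight.data[keep, :]
    mlp.up_proj.out_features = n_keep
    mlp.down_proj.weight.data = mlp.down_proj.weight.data[:, keep]
    mlp.down_proj.in_features = n_keep
    return mlp

# Applying this to every decoder layer (and updating config.intermediate_size)
# would yield the reduced-MLP architecture described in this card.
```
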
### Hardware Requirements

#### Memory Requirements

- **Base Model:**
  - Parameters: ~4.4 GB (FP16)
  - Total Runtime Memory: ~5.5 GB
- **Pruned Model (40%):**
  - Parameters: ~3.9 GB (FP16)
  - Total Runtime Memory: ~4.9 GB
- **Memory Reduction:**
  - Parameter Memory: 11.36%
  - Total Runtime Memory: ~10.9%
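
A rough sanity check of the parameter-memory figures above, assuming 2 bytes per parameter in FP16:

```python
# Back-of-the-envelope check of the FP16 parameter-memory figures above.
def fp16_param_memory_gb(num_params: float) -> float:
    return num_params * 2 / 1e9  # 2 bytes per FP16 parameter

print(fp16_param_memory_gb(2.20e9))  # ~4.4 GB (base, 2.2B parameters)
print(fp16_param_memory_gb(1.95e9))  # ~3.9 GB (pruned, ~11.4% fewer parameters)
```
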
#### Notes

- Memory requirements assume FP16 precision
- Actual memory usage may vary depending on:
  - Batch size
  - Sequence length
  - Implementation details
  - Runtime environment

#### Minimum Requirements

- GPU Memory: 6 GB for the base model, 5 GB for the pruned model
- CPU Memory: 16 GB recommended for both models
## Acknowledgments

- Thanks to [Mariusz Kurman](https://huggingface.co/mkurman) for creating [llama-pruning](https://github.com/MedITSolutionsKurman/llama-pruning), a library that implements and extends this pruning methodology. |