---
datasets:
- stanfordnlp/imdb
language:
- en
---

# Model Card for SwarmFormer-Base

SwarmFormer-Base is a compact transformer variant that achieves competitive performance on text classification tasks through a hierarchical architecture combining local swarm-based updates with cluster-level global attention.

## Model Details

### Model Description

SwarmFormer-Base consists of:
- Token embedding layer with heavy dropout (0.4)
- Multiple SwarmFormer layers
- Mean pooling layer
- Final classification layer
- Comprehensive dropout throughout (0.3-0.4)

- **Developed by**: Jordan Legg, Mikus Sturmanis, Takara.ai
- **Funded by**: Takara.ai
- **Shared by**: Takara.ai
- **Model type**: Hierarchical transformer
- **Language(s)**: English
- **License**: Not specified
- **Finetuned from model**: Trained from scratch

### Model Sources

- **Repository**: https://github.com/takara-ai/SwarmFormer
- **Paper**: "SwarmFormer: Local-Global Hierarchical Attention via Swarming Token Representations"
- **Demo**: Not available

## Uses

### Direct Use

- Text classification
- Sentiment analysis
- Document processing

### Downstream Use

- Feature extraction for NLP tasks
- Transfer learning
- Building block for larger systems

### Out-of-Scope Use

- Text generation
- Machine translation
- Tasks requiring >768 tokens
- Real-time processing without adequate hardware

## Bias, Risks, and Limitations

- Fixed cluster size (4 tokens)
- Maximum sequence length: 768 tokens
- Potential information loss in clustering
- Limited evaluation (English text classification only)

## Training Details

### Training Data

- Dataset: IMDB Movie Reviews (50k samples)
- Augmentation techniques:
  - Sentence-level shuffling
  - Controlled synonym replacement
  - Hierarchical sample creation

### Training Procedure

#### Model Architecture Details

The model is built from the five components listed below; a sketch of how the per-layer components can be wired together follows the list.

1. **Token Embedding Layer**:
   ```python
   - Embedding layer (vocab_size → d_model)
   - Dropout rate: 0.4
   ```

2. **Local Swarm Aggregator**:
   ```python
   - Input processing dropout: 0.3
   - Local aggregation MLP:
     - Linear(d_model → d_model)
     - GELU activation
     - Dropout(0.3)
     - Linear(d_model → d_model)
   - Gate network:
     - Linear(2*d_model → d_model)
     - GELU activation
     - Linear(d_model → d_model)
     - Sigmoid activation
   - Output dropout: 0.3
   ```

3. **Clustering Mechanism**:
   - Groups tokens into fixed-size clusters (size = 4)
   - Computes a mean representation per cluster

4. **Global Cluster Attention**:
   ```python
   - Query/Key/Value projections: Linear(d_model → d_model)
   - Scaled dot-product attention
   - Attention dropout: 0.3
   - Output dropout: 0.3
   ```

5. **Broadcast Updater**:
   ```python
   - Linear projection: d_model → d_model
   - Dropout: 0.1
   - Gate network:
     - Linear(2*d_model → d_model)
     - GELU activation
     - Linear(d_model → d_model)
     - Sigmoid activation
   ```
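The repository above contains the reference implementation. As a reading aid, here is a minimal PyTorch sketch of how the per-layer components (items 2-5) could be wired together into one SwarmFormer layer. The neighborhood-mean aggregation, the gated residual update rules, and the requirement that the sequence length be a multiple of the cluster size are illustrative assumptions, not details taken from the paper or the official code.

```python
# Hedged sketch of one SwarmFormer layer, reconstructed from the component
# descriptions above. Aggregation and gating details are assumptions.
import torch
import torch.nn as nn


class LocalSwarmAggregator(nn.Module):
    def __init__(self, d_model: int, dropout: float = 0.3):
        super().__init__()
        self.input_dropout = nn.Dropout(dropout)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, d_model), nn.GELU(),
            nn.Dropout(dropout), nn.Linear(d_model, d_model),
        )
        self.gate = nn.Sequential(
            nn.Linear(2 * d_model, d_model), nn.GELU(),
            nn.Linear(d_model, d_model), nn.Sigmoid(),
        )
        self.output_dropout = nn.Dropout(dropout)

    def forward(self, x):                      # x: (batch, seq, d_model)
        x = self.input_dropout(x)
        # Assumption: "local" information is the mean over a 3-token neighborhood.
        neighbors = nn.functional.avg_pool1d(
            x.transpose(1, 2), kernel_size=3, stride=1, padding=1,
            count_include_pad=False,
        ).transpose(1, 2)
        update = self.mlp(neighbors)
        g = self.gate(torch.cat([x, update], dim=-1))
        return self.output_dropout(x + g * (update - x))   # gated residual (assumed)


class GlobalClusterAttention(nn.Module):
    def __init__(self, d_model: int, dropout: float = 0.3):
        super().__init__()
        self.q = nn.Linear(d_model, d_model)
        self.k = nn.Linear(d_model, d_model)
        self.v = nn.Linear(d_model, d_model)
        self.attn_dropout = nn.Dropout(dropout)
        self.output_dropout = nn.Dropout(dropout)
        self.scale = d_model ** -0.5

    def forward(self, c):                      # c: (batch, n_clusters, d_model)
        scores = torch.matmul(self.q(c), self.k(c).transpose(-2, -1)) * self.scale
        attn = self.attn_dropout(scores.softmax(dim=-1))
        return self.output_dropout(torch.matmul(attn, self.v(c)))


class BroadcastUpdater(nn.Module):
    def __init__(self, d_model: int, dropout: float = 0.1):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(d_model, d_model), nn.Dropout(dropout))
        self.gate = nn.Sequential(
            nn.Linear(2 * d_model, d_model), nn.GELU(),
            nn.Linear(d_model, d_model), nn.Sigmoid(),
        )

    def forward(self, x, c, cluster_size):     # x: tokens, c: attended clusters
        broadcast = self.proj(c).repeat_interleave(cluster_size, dim=1)
        g = self.gate(torch.cat([x, broadcast], dim=-1))
        return x + g * broadcast               # gated residual (assumed)


class SwarmFormerLayer(nn.Module):
    def __init__(self, d_model: int = 192, cluster_size: int = 4, t_local: int = 3):
        super().__init__()
        self.cluster_size = cluster_size
        self.t_local = t_local
        self.local = LocalSwarmAggregator(d_model)
        self.attention = GlobalClusterAttention(d_model)
        self.broadcast = BroadcastUpdater(d_model)

    def forward(self, x):                      # x: (batch, seq, d_model); seq % cluster_size == 0
        for _ in range(self.t_local):          # repeated local swarm updates
            x = self.local(x)
        b, l, d = x.shape
        clusters = x.reshape(b, l // self.cluster_size, self.cluster_size, d).mean(dim=2)
        clusters = self.attention(clusters)    # global attention between clusters
        return self.broadcast(x, clusters, self.cluster_size)


# Example: the Base configuration uses d_model=192, cluster size 4, T_local=3.
layer = SwarmFormerLayer(d_model=192, cluster_size=4, t_local=3)
tokens = torch.randn(2, 768, 192)              # batch of 2 at the 768-token maximum length
out = layer(tokens)                            # (2, 768, 192)
```

A full SwarmFormer-Base model would stack two such layers between the embedding layer (dropout 0.4) and the mean-pooling classification head described above.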
#### Training Hyperparameters

- Embedding dimension: 192
- Number of layers: 2
- Local update steps (T_local): 3
- Cluster size: 4
- Batch size: 48
- Learning rate: 4.74 × 10⁻⁴
- Weight decay: 0.0381
- Dropout rates:
  - Embedding: 0.4
  - Local aggregation: 0.3
  - Attention: 0.3
  - Final: 0.4

## Evaluation

### Testing Data, Factors & Metrics

- IMDB test split (25k samples)
- Full FP32 inference
- Batch size: 256

### Results

- Accuracy: 89.03%
- Precision: 87.22%
- Recall: 91.46%
- F1: 89.29%
- Mean batch latency: 4.83 ms
- Peak memory: 9.13 GB

## Technical Specifications

### Model Architecture and Objective

Complete architecture flow:

1. Input → Token Embedding (with dropout)
2. For each layer:
   - Multiple iterations of Local Swarm Updates
   - Cluster Formation
   - Global Attention between clusters
   - Broadcast updates back to tokens
3. Mean pooling across the sequence
4. Final dropout and classification

### Compute Infrastructure

- GPU: NVIDIA RTX 2080 Ti or equivalent
- VRAM: 10 GB+ recommended
- Framework: PyTorch

### Software Requirements

```python
import torch
import torch.nn as nn
```

## Citation

```bibtex
@article{legg2025swarmformer,
  title={SwarmFormer: Local-Global Hierarchical Attention via Swarming Token Representations},
  author={Legg, Jordan and Sturmanis, Mikus and {Takara.ai}},
  journal={Takara.ai Research},
  year={2025},
  url={https://takara.ai/papers/SwarmFormer-Local-Global-Hierarchical-Attention-via-Swarming-Token-Representations.pdf}
}
```

## Model Card Authors

Jordan Legg, Mikus Sturmanis, Takara.ai Research Team

## Model Card Contact

research@takara.ai