Model Card for SwarmFormer-Base
SwarmFormer-Base is a compact transformer variant that achieves competitive performance on text classification tasks through a hierarchical architecture combining local swarm-based updates with cluster-level global attention.
Model Details
Model Description
SwarmFormer-Base consists of:
- Token embedding layer with heavy dropout (0.4)
- Multiple SwarmFormer layers
- Mean pooling layer
- Final classification layer
- Comprehensive dropout throughout (0.3-0.4)
- Developed by: Jordan Legg, Mikus Sturmanis, Takara.ai
- Funded by: Takara.ai
- Shared by: Takara.ai
- Model type: Hierarchical transformer
- Language(s): English
- License: Not specified
- Finetuned from model: Not applicable (trained from scratch)
Model Sources
- Repository: https://github.com/takara-ai/SwarmFormer
- Paper: "SwarmFormer: Local-Global Hierarchical Attention via Swarming Token Representations"
- Demo: Not available
Uses
Direct Use
- Text classification
- Sentiment analysis
- Document processing
Downstream Use
- Feature extraction for NLP tasks
- Transfer learning
- Building block for larger systems
Out-of-Scope Use
- Text generation
- Machine translation
- Tasks requiring >768 tokens
- Real-time processing without adequate hardware
Bias, Risks, and Limitations
- Fixed cluster size (4 tokens)
- Maximum sequence length: 768 tokens
- Potential information loss in clustering
- Limited evaluation (English text classification only)
Training Details
Training Data
- Dataset: IMDB Movie Review (50k samples)
- Augmentation techniques:
  - Sentence-level shuffling
  - Controlled synonym replacement
  - Hierarchical sample creation
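The card lists these augmentations without implementation details; the snippet below is a minimal illustrative sketch, not the authors' actual pipeline. The period-based sentence splitting, the caller-supplied synonym table, and the replacement probability `p` are all assumptions.

```python
import random

def shuffle_sentences(text: str, seed: int = 0) -> str:
    """Sentence-level shuffling (splitting on '.' is an assumed heuristic)."""
    rng = random.Random(seed)
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    rng.shuffle(sentences)
    return ". ".join(sentences) + "."

def replace_synonyms(text: str, synonyms: dict, p: float = 0.1, seed: int = 0) -> str:
    """Controlled synonym replacement: swap a word with probability p when the
    caller-supplied table has a synonym (table and rate are assumptions)."""
    rng = random.Random(seed)
    out = []
    for word in text.split():
        if rng.random() < p and word.lower() in synonyms:
            out.append(synonyms[word.lower()])
        else:
            out.append(word)
    return " ".join(out)
```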
Training Procedure
Model Architecture Details
Token Embedding Layer:
- Embedding layer (vocab_size → d_model)
- Dropout rate: 0.4
Local Swarm Aggregator:
- Input processing dropout: 0.3
- Local aggregation MLP:
  - Linear(d_model → d_model)
  - GELU activation
  - Dropout(0.3)
  - Linear(d_model → d_model)
- Gate network:
  - Linear(2*d_model → d_model)
  - GELU activation
  - Linear(d_model → d_model)
  - Sigmoid activation
- Output dropout: 0.3
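Putting the listed pieces together, a minimal PyTorch sketch of the local swarm aggregator might look as follows. Only the layer shapes and dropout rates come from this card; the neighbourhood rule (mean over a fixed local window equal to the cluster size) and the gated residual blend are assumptions.

```python
import torch
import torch.nn as nn

class LocalSwarmAggregator(nn.Module):
    """Sketch of the local swarm aggregator described above (assumed update rule)."""

    def __init__(self, d_model: int, cluster_size: int = 4, dropout: float = 0.3):
        super().__init__()
        self.cluster_size = cluster_size
        self.input_dropout = nn.Dropout(dropout)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, d_model),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(d_model, d_model),
        )
        self.gate = nn.Sequential(
            nn.Linear(2 * d_model, d_model),
            nn.GELU(),
            nn.Linear(d_model, d_model),
            nn.Sigmoid(),
        )
        self.output_dropout = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model); seq_len assumed divisible by cluster_size
        b, n, d = x.shape
        x = self.input_dropout(x)
        # Mean over each local window acts as the "swarm" signal (assumption).
        local_mean = x.view(b, n // self.cluster_size, self.cluster_size, d).mean(dim=2)
        local_mean = local_mean.repeat_interleave(self.cluster_size, dim=1)
        update = self.mlp(local_mean)
        # Gated blend between the token state and the local update.
        g = self.gate(torch.cat([x, update], dim=-1))
        out = x + g * (update - x)
        return self.output_dropout(out)
```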
Clustering Mechanism:
- Groups tokens into fixed-size clusters (size=4)
- Computes mean representation per cluster
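Since clustering is described as fixed windows of 4 tokens averaged per cluster, it reduces to a reshape-and-mean. A minimal helper, assuming the sequence length has been padded to a multiple of the cluster size:

```python
import torch

def cluster_means(x: torch.Tensor, cluster_size: int = 4) -> torch.Tensor:
    # x: (batch, seq_len, d_model) with seq_len divisible by cluster_size
    b, n, d = x.shape
    return x.view(b, n // cluster_size, cluster_size, d).mean(dim=2)
```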
Global Cluster Attention:
- Query/Key/Value projections: Linear(d_model → d_model)
- Scaled dot-product attention
- Attention dropout: 0.3
- Output dropout: 0.3
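A minimal sketch of the cluster-level attention using the listed projections and dropout rates; the single-head formulation is an assumption, since the card does not state a head count.

```python
import math
import torch
import torch.nn as nn

class GlobalClusterAttention(nn.Module):
    """Scaled dot-product attention over cluster representations (single-head assumed)."""

    def __init__(self, d_model: int, dropout: float = 0.3):
        super().__init__()
        self.q = nn.Linear(d_model, d_model)
        self.k = nn.Linear(d_model, d_model)
        self.v = nn.Linear(d_model, d_model)
        self.attn_dropout = nn.Dropout(dropout)
        self.out_dropout = nn.Dropout(dropout)

    def forward(self, clusters: torch.Tensor) -> torch.Tensor:
        # clusters: (batch, num_clusters, d_model)
        q, k, v = self.q(clusters), self.k(clusters), self.v(clusters)
        scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
        attn = self.attn_dropout(scores.softmax(dim=-1))
        return self.out_dropout(attn @ v)
```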
Broadcast Updater:
- Linear projection: d_model → d_model
- Dropout: 0.1
- Gate network:
  - Linear(2*d_model → d_model)
  - GELU activation
  - Linear(d_model → d_model)
  - Sigmoid activation
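A sketch of the broadcast step, under the assumption that each attended cluster vector is copied back to its member tokens and merged through the listed gate network.

```python
import torch
import torch.nn as nn

class BroadcastUpdater(nn.Module):
    """Pushes attended cluster vectors back to member tokens (gated blend assumed)."""

    def __init__(self, d_model: int, cluster_size: int = 4, dropout: float = 0.1):
        super().__init__()
        self.cluster_size = cluster_size
        self.proj = nn.Linear(d_model, d_model)
        self.dropout = nn.Dropout(dropout)
        self.gate = nn.Sequential(
            nn.Linear(2 * d_model, d_model),
            nn.GELU(),
            nn.Linear(d_model, d_model),
            nn.Sigmoid(),
        )

    def forward(self, tokens: torch.Tensor, clusters: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, seq_len, d_model); clusters: (batch, seq_len/cluster_size, d_model)
        broadcast = clusters.repeat_interleave(self.cluster_size, dim=1)
        update = self.dropout(self.proj(broadcast))
        g = self.gate(torch.cat([tokens, update], dim=-1))
        return tokens + g * update
```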
Training Hyperparameters
- Embedding dimension: 192
- Number of layers: 2
- Local update steps (T_local): 3
- Cluster size: 4
- Batch size: 48
- Learning rate: 4.74 × 10⁻⁴
- Weight decay: 0.0381
- Dropout rates:
  - Embedding: 0.4
  - Local aggregation: 0.3
  - Attention: 0.3
  - Final: 0.4
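For reference, the reported hyperparameters can be collected into a single config object; the field names below are illustrative and not the repository's actual API.

```python
from dataclasses import dataclass

@dataclass
class SwarmFormerBaseConfig:
    # Values as reported in this card; field names are illustrative.
    d_model: int = 192
    num_layers: int = 2
    t_local: int = 3
    cluster_size: int = 4
    batch_size: int = 48
    learning_rate: float = 4.74e-4
    weight_decay: float = 0.0381
    embedding_dropout: float = 0.4
    local_dropout: float = 0.3
    attention_dropout: float = 0.3
    final_dropout: float = 0.4
```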
Evaluation
Testing Data, Factors & Metrics
- IMDB test split (25k samples)
- Full FP32 inference
- Batch size: 256
Results
- Accuracy: 89.03%
- Precision: 87.22%
- Recall: 91.46%
- F1: 89.29%
- Mean batch latency: 4.83ms
- Peak memory: 9.13GB
Technical Specifications
Model Architecture and Objective
Complete architecture flow:
- Input → Token Embedding (with dropout)
- For each layer:
  - Multiple iterations of Local Swarm Updates
  - Cluster Formation
  - Global Attention between clusters
  - Broadcast updates back to tokens
- Mean pooling across sequence
- Final dropout and classification
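Combining the component sketches above, the end-to-end flow could look like the following. This is illustrative only: `num_classes=2` (IMDB) and the `vocab_size` argument are assumptions, and `LocalSwarmAggregator`, `GlobalClusterAttention`, `BroadcastUpdater`, and `cluster_means` refer to the earlier sketches in this card rather than the released code.

```python
import torch
import torch.nn as nn

class SwarmFormerClassifier(nn.Module):
    """Sketch of the architecture flow above, built from the component sketches."""

    def __init__(self, vocab_size: int, d_model: int = 192, num_layers: int = 2,
                 t_local: int = 3, cluster_size: int = 4, num_classes: int = 2):
        super().__init__()
        self.t_local = t_local
        self.cluster_size = cluster_size
        self.embed = nn.Embedding(vocab_size, d_model)
        self.embed_dropout = nn.Dropout(0.4)
        self.locals = nn.ModuleList(
            [LocalSwarmAggregator(d_model, cluster_size) for _ in range(num_layers)])
        self.globals_ = nn.ModuleList(
            [GlobalClusterAttention(d_model) for _ in range(num_layers)])
        self.broadcasts = nn.ModuleList(
            [BroadcastUpdater(d_model, cluster_size) for _ in range(num_layers)])
        self.final_dropout = nn.Dropout(0.4)
        self.classifier = nn.Linear(d_model, num_classes)

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        x = self.embed_dropout(self.embed(input_ids))
        for local, glob, broadcast in zip(self.locals, self.globals_, self.broadcasts):
            for _ in range(self.t_local):      # T_local local swarm updates
                x = local(x)
            clusters = cluster_means(x, self.cluster_size)
            clusters = glob(clusters)          # global attention between clusters
            x = broadcast(x, clusters)         # broadcast back to tokens
        pooled = x.mean(dim=1)                 # mean pooling across the sequence
        return self.classifier(self.final_dropout(pooled))
```

Usage would then be `logits = model(input_ids)` with `input_ids` of shape (batch, seq_len), where seq_len is padded to a multiple of the cluster size.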
Compute Infrastructure
- GPU: NVIDIA RTX 2080 Ti or equivalent
- VRAM: 10GB+ recommended
- Framework: PyTorch
Software Requirements
```python
import torch
import torch.nn as nn
```
Citation
```bibtex
@article{legg2025swarmformer,
  title={SwarmFormer: Local-Global Hierarchical Attention via Swarming Token Representations},
  author={Legg, Jordan and Sturmanis, Mikus and {Takara.ai}},
  journal={Takara.ai Research},
  year={2025},
  url={https://takara.ai/papers/SwarmFormer-Local-Global-Hierarchical-Attention-via-Swarming-Token-Representations.pdf}
}
```
Model Card Authors
Jordan Legg, Mikus Sturmanis, Takara.ai Research Team
Model Card Contact