Model Card for SwarmFormer-Base
SwarmFormer-Base is a compact transformer variant that achieves competitive performance on text classification tasks through a hierarchical architecture combining local swarm-based updates with cluster-level global attention.
Model Details
Model Description
SwarmFormer-Base consists of:
- Token embedding layer with heavy dropout (0.4)
- Multiple SwarmFormer layers
- Mean pooling layer
- Final classification layer
- Comprehensive dropout throughout (0.3-0.4)
- Developed by: Jordan Legg, Mikus Sturmanis, Takara.ai
- Funded by: Takara.ai
- Shared by: Takara.ai
- Model type: Hierarchical transformer
- Language(s): English
- License: Not specified
- Finetuned from model: Not applicable (trained from scratch)
Model Sources
- Repository: https://github.com/takara-ai/SwarmFormer
- Paper: "SwarmFormer: Local-Global Hierarchical Attention via Swarming Token Representations"
- Demo: Not available
Uses
Direct Use
- Text classification
- Sentiment analysis
- Document processing
Downstream Use
- Feature extraction for NLP tasks
- Transfer learning
- Building block for larger systems
Out-of-Scope Use
- Text generation
- Machine translation
- Tasks requiring >768 tokens
- Real-time processing without adequate hardware
Bias, Risks, and Limitations
- Fixed cluster size (4 tokens)
- Maximum sequence length: 768 tokens
- Potential information loss in clustering
- Limited evaluation (English text classification only)
Training Details
Training Data
- Dataset: IMDB Movie Review (50k samples)
- Augmentation techniques:
  - Sentence-level shuffling
  - Controlled synonym replacement
  - Hierarchical sample creation
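The card lists these augmentations without implementation details; the snippet below is a minimal illustrative sketch, not the authors' actual pipeline. The period-based sentence splitting, the caller-supplied synonym table, and the replacement probability `p` are all assumptions.

```python
import random

def shuffle_sentences(text: str, seed: int = 0) -> str:
    """Sentence-level shuffling (splitting on '.' is an assumed heuristic)."""
    rng = random.Random(seed)
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    rng.shuffle(sentences)
    return ". ".join(sentences) + "."

def replace_synonyms(text: str, synonyms: dict, p: float = 0.1, seed: int = 0) -> str:
    """Controlled synonym replacement: swap a word with probability p when the
    caller-supplied table has a synonym (table and rate are assumptions)."""
    rng = random.Random(seed)
    out = []
    for word in text.split():
        if rng.random() < p and word.lower() in synonyms:
            out.append(synonyms[word.lower()])
        else:
            out.append(word)
    return " ".join(out)
```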
Training Procedure
Model Architecture Details
Token Embedding Layer:
- Embedding layer (vocab_size → d_model)
- Dropout rate: 0.4
Local Swarm Aggregator:
- Input processing dropout: 0.3
- Local aggregation MLP:
  - Linear(d_model → d_model)
  - GELU activation
  - Dropout(0.3)
  - Linear(d_model → d_model)
- Gate network:
  - Linear(2*d_model → d_model)
  - GELU activation
  - Linear(d_model → d_model)
  - Sigmoid activation
- Output dropout: 0.3
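Putting the listed pieces together, a minimal PyTorch sketch of the local swarm aggregator might look as follows. Only the layer shapes and dropout rates come from this card; the neighbourhood rule (mean over a fixed local window equal to the cluster size) and the gated residual blend are assumptions.

```python
import torch
import torch.nn as nn

class LocalSwarmAggregator(nn.Module):
    """Sketch of the local swarm aggregator described above (assumed update rule)."""

    def __init__(self, d_model: int, cluster_size: int = 4, dropout: float = 0.3):
        super().__init__()
        self.cluster_size = cluster_size
        self.input_dropout = nn.Dropout(dropout)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, d_model),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(d_model, d_model),
        )
        self.gate = nn.Sequential(
            nn.Linear(2 * d_model, d_model),
            nn.GELU(),
            nn.Linear(d_model, d_model),
            nn.Sigmoid(),
        )
        self.output_dropout = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model); seq_len assumed divisible by cluster_size
        b, n, d = x.shape
        x = self.input_dropout(x)
        # Mean over each local window acts as the "swarm" signal (assumption).
        local_mean = x.view(b, n // self.cluster_size, self.cluster_size, d).mean(dim=2)
        local_mean = local_mean.repeat_interleave(self.cluster_size, dim=1)
        update = self.mlp(local_mean)
        # Gated blend between the token state and the local update.
        g = self.gate(torch.cat([x, update], dim=-1))
        out = x + g * (update - x)
        return self.output_dropout(out)
```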
Clustering Mechanism:
- Groups tokens into fixed-size clusters (size=4)
- Computes mean representation per cluster
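Since clustering is described as fixed windows of 4 tokens averaged per cluster, it reduces to a reshape-and-mean. A minimal helper, assuming the sequence length has been padded to a multiple of the cluster size:

```python
import torch

def cluster_means(x: torch.Tensor, cluster_size: int = 4) -> torch.Tensor:
    # x: (batch, seq_len, d_model) with seq_len divisible by cluster_size
    b, n, d = x.shape
    return x.view(b, n // cluster_size, cluster_size, d).mean(dim=2)
```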
Global Cluster Attention:
- Query/Key/Value projections: Linear(d_model → d_model)
- Scaled dot-product attention
- Attention dropout: 0.3
- Output dropout: 0.3
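A minimal sketch of the cluster-level attention using the listed projections and dropout rates; the single-head formulation is an assumption, since the card does not state a head count.

```python
import math
import torch
import torch.nn as nn

class GlobalClusterAttention(nn.Module):
    """Scaled dot-product attention over cluster representations (single-head assumed)."""

    def __init__(self, d_model: int, dropout: float = 0.3):
        super().__init__()
        self.q = nn.Linear(d_model, d_model)
        self.k = nn.Linear(d_model, d_model)
        self.v = nn.Linear(d_model, d_model)
        self.attn_dropout = nn.Dropout(dropout)
        self.out_dropout = nn.Dropout(dropout)

    def forward(self, clusters: torch.Tensor) -> torch.Tensor:
        # clusters: (batch, num_clusters, d_model)
        q, k, v = self.q(clusters), self.k(clusters), self.v(clusters)
        scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
        attn = self.attn_dropout(scores.softmax(dim=-1))
        return self.out_dropout(attn @ v)
```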
Broadcast Updater:
- Linear projection: d_model → d_model
- Dropout: 0.1
- Gate network:
  - Linear(2*d_model → d_model)
  - GELU activation
  - Linear(d_model → d_model)
  - Sigmoid activation
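A sketch of the broadcast step, under the assumption that each attended cluster vector is copied back to its member tokens and merged through the listed gate network.

```python
import torch
import torch.nn as nn

class BroadcastUpdater(nn.Module):
    """Pushes attended cluster vectors back to member tokens (gated blend assumed)."""

    def __init__(self, d_model: int, cluster_size: int = 4, dropout: float = 0.1):
        super().__init__()
        self.cluster_size = cluster_size
        self.proj = nn.Linear(d_model, d_model)
        self.dropout = nn.Dropout(dropout)
        self.gate = nn.Sequential(
            nn.Linear(2 * d_model, d_model),
            nn.GELU(),
            nn.Linear(d_model, d_model),
            nn.Sigmoid(),
        )

    def forward(self, tokens: torch.Tensor, clusters: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, seq_len, d_model); clusters: (batch, seq_len/cluster_size, d_model)
        broadcast = clusters.repeat_interleave(self.cluster_size, dim=1)
        update = self.dropout(self.proj(broadcast))
        g = self.gate(torch.cat([tokens, update], dim=-1))
        return tokens + g * update
```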
Training Hyperparameters
- Embedding dimension: 192
- Number of layers: 2
- Local update steps (T_local): 3
- Cluster size: 4
- Batch size: 48
- Learning rate: 4.74 × 10⁻⁴
- Weight decay: 0.0381
- Dropout rates:
  - Embedding: 0.4
  - Local aggregation: 0.3
  - Attention: 0.3
  - Final: 0.4
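For reference, the reported hyperparameters can be collected into a single config object; the field names below are illustrative and not the repository's actual API.

```python
from dataclasses import dataclass

@dataclass
class SwarmFormerBaseConfig:
    # Values as reported in this card; field names are illustrative.
    d_model: int = 192
    num_layers: int = 2
    t_local: int = 3
    cluster_size: int = 4
    batch_size: int = 48
    learning_rate: float = 4.74e-4
    weight_decay: float = 0.0381
    embedding_dropout: float = 0.4
    local_dropout: float = 0.3
    attention_dropout: float = 0.3
    final_dropout: float = 0.4
```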
Evaluation
Testing Data, Factors & Metrics
- IMDB test split (25k samples)
- Full FP32 inference
- Batch size: 256
Results
- Accuracy: 89.03%
- Precision: 87.22%
- Recall: 91.46%
- F1: 89.29%
- Mean batch latency: 4.83ms
- Peak memory: 9.13GB
Technical Specifications
Model Architecture and Objective
Complete architecture flow:
- Input → Token Embedding (with dropout)
- For each layer:
  - Multiple iterations of Local Swarm Updates
  - Cluster Formation
  - Global Attention between clusters
  - Broadcast updates back to tokens
- Mean pooling across sequence
- Final dropout and classification
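Combining the component sketches above, the end-to-end flow could look like the following. This is illustrative only: `num_classes=2` (IMDB) and the `vocab_size` argument are assumptions, and `LocalSwarmAggregator`, `GlobalClusterAttention`, `BroadcastUpdater`, and `cluster_means` refer to the earlier sketches in this card rather than the released code.

```python
import torch
import torch.nn as nn

class SwarmFormerClassifier(nn.Module):
    """Sketch of the architecture flow above, built from the component sketches."""

    def __init__(self, vocab_size: int, d_model: int = 192, num_layers: int = 2,
                 t_local: int = 3, cluster_size: int = 4, num_classes: int = 2):
        super().__init__()
        self.t_local = t_local
        self.cluster_size = cluster_size
        self.embed = nn.Embedding(vocab_size, d_model)
        self.embed_dropout = nn.Dropout(0.4)
        self.locals = nn.ModuleList(
            [LocalSwarmAggregator(d_model, cluster_size) for _ in range(num_layers)])
        self.globals_ = nn.ModuleList(
            [GlobalClusterAttention(d_model) for _ in range(num_layers)])
        self.broadcasts = nn.ModuleList(
            [BroadcastUpdater(d_model, cluster_size) for _ in range(num_layers)])
        self.final_dropout = nn.Dropout(0.4)
        self.classifier = nn.Linear(d_model, num_classes)

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        x = self.embed_dropout(self.embed(input_ids))
        for local, glob, broadcast in zip(self.locals, self.globals_, self.broadcasts):
            for _ in range(self.t_local):      # T_local local swarm updates
                x = local(x)
            clusters = cluster_means(x, self.cluster_size)
            clusters = glob(clusters)          # global attention between clusters
            x = broadcast(x, clusters)         # broadcast back to tokens
        pooled = x.mean(dim=1)                 # mean pooling across the sequence
        return self.classifier(self.final_dropout(pooled))
```

Usage would then be `logits = model(input_ids)` with `input_ids` of shape (batch, seq_len), where seq_len is padded to a multiple of the cluster size.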
Compute Infrastructure
- GPU: NVIDIA RTX 2080 Ti or equivalent
- VRAM: 10GB+ recommended
- Framework: PyTorch
Software Requirements
```python
import torch
import torch.nn as nn
```
Citation
```bibtex
@article{legg2025swarmformer,
  title={SwarmFormer: Local-Global Hierarchical Attention via Swarming Token Representations},
  author={Legg, Jordan and Sturmanis, Mikus and {Takara.ai}},
  journal={Takara.ai Research},
  year={2025},
  url={https://takara.ai/papers/SwarmFormer-Local-Global-Hierarchical-Attention-via-Swarming-Token-Representations.pdf}
}
```
Model Card Authors
Jordan Legg, Mikus Sturmanis, Takara.ai Research Team
Model Card Contact