Model Card for SwarmFormer-Base

SwarmFormer-Base is a compact transformer variant that achieves competitive performance on text classification tasks through a hierarchical architecture combining local swarm-based updates with cluster-level global attention.

Model Details

Model Description

SwarmFormer-Base consists of:

  • Token embedding layer with heavy dropout (0.4)

  • Multiple SwarmFormer layers

  • Mean pooling layer

  • Final classification layer

  • Comprehensive dropout throughout (0.3-0.4)

  • Developed by: Jordan Legg, Mikus Sturmanis, Takara.ai

  • Funded by: Takara.ai

  • Shared by: Takara.ai

  • Model type: Hierarchical transformer

  • Language(s): English

  • License: Not specified

  • Finetuned from model: Trained from scratch

Model Sources

  • Paper: https://takara.ai/papers/SwarmFormer-Local-Global-Hierarchical-Attention-via-Swarming-Token-Representations.pdf

Uses

Direct Use

  • Text classification
  • Sentiment analysis
  • Document processing

Downstream Use

  • Feature extraction for NLP tasks
  • Transfer learning
  • Building block for larger systems

Out-of-Scope Use

  • Text generation
  • Machine translation
  • Tasks requiring >768 tokens
  • Real-time processing without adequate hardware

Bias, Risks, and Limitations

  • Fixed cluster size (4 tokens)
  • Maximum sequence length: 768 tokens
  • Potential information loss in clustering
  • Limited evaluation (English text classification only)

Training Details

Training Data

  • Dataset: IMDB Movie Review (50k samples)
  • Augmentation techniques (sketched in the example after this list):
    • Sentence-level shuffling
    • Controlled synonym replacement
    • Hierarchical sample creation
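
The card names the augmentation techniques but not their exact settings. The snippet below is a minimal illustrative sketch of sentence-level shuffling and synonym replacement only; the synonym table and the replacement probability p are placeholders, and the hierarchical sample creation step is not reproduced here.

import random

# Illustrative synonym table; the card does not specify the actual synonym
# source or replacement rate used during training.
SYNONYMS = {"good": ["great", "fine"], "bad": ["poor", "awful"]}

def shuffle_sentences(text, rng=random):
    # Sentence-level shuffling: split on periods, permute, and rejoin.
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    rng.shuffle(sentences)
    return ". ".join(sentences) + "."

def replace_synonyms(text, p=0.1, rng=random):
    # Controlled synonym replacement: swap a word for a listed synonym
    # with (hypothetical) probability p.
    out = []
    for word in text.split():
        key = word.lower()
        if key in SYNONYMS and rng.random() < p:
            out.append(rng.choice(SYNONYMS[key]))
        else:
            out.append(word)
    return " ".join(out)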

Training Procedure

Model Architecture Details

  1. Token Embedding Layer:

    - Embedding layer (vocab_size → d_model)
    - Dropout rate: 0.4

  2. Local Swarm Aggregator:

    - Input processing dropout: 0.3
    - Local aggregation MLP:
      - Linear(d_model → d_model)
      - GELU activation
      - Dropout(0.3)
      - Linear(d_model → d_model)
    - Gate network:
      - Linear(2*d_model → d_model)
      - GELU activation
      - Linear(d_model → d_model)
      - Sigmoid activation
    - Output dropout: 0.3

  3. Clustering Mechanism:

    - Groups tokens into fixed-size clusters (size = 4)
    - Computes mean representation per cluster

  4. Global Cluster Attention:

    - Query/Key/Value projections: Linear(d_model → d_model)
    - Scaled dot-product attention
    - Attention dropout: 0.3
    - Output dropout: 0.3

  5. Broadcast Updater:

    - Linear projection: d_model → d_model
    - Dropout: 0.1
    - Gate network:
      - Linear(2*d_model → d_model)
      - GELU activation
      - Linear(d_model → d_model)
      - Sigmoid activation
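
The official implementation is not included in this card. The PyTorch sketch below mirrors items 2–5 above: layer sizes, activations, and dropout rates come from the list, while the class names, the gated residual mixing, and the cluster broadcast via repeat_interleave are assumptions.

import torch
import torch.nn as nn

class LocalSwarmAggregator(nn.Module):
    # Item 2: local MLP update with a sigmoid gate. How the gate mixes the
    # token state with the aggregated update is not specified; assumed here.
    def __init__(self, d_model):
        super().__init__()
        self.input_dropout = nn.Dropout(0.3)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, d_model), nn.GELU(),
            nn.Dropout(0.3), nn.Linear(d_model, d_model))
        self.gate = nn.Sequential(
            nn.Linear(2 * d_model, d_model), nn.GELU(),
            nn.Linear(d_model, d_model), nn.Sigmoid())
        self.output_dropout = nn.Dropout(0.3)

    def forward(self, x):                      # x: (batch, seq, d_model)
        update = self.mlp(self.input_dropout(x))
        g = self.gate(torch.cat([x, update], dim=-1))
        return self.output_dropout(g * update + (1 - g) * x)

def cluster_means(x, cluster_size=4):
    # Item 3: group tokens into fixed-size clusters and average each cluster.
    # Assumes the sequence length is divisible by cluster_size.
    b, s, d = x.shape
    return x.view(b, s // cluster_size, cluster_size, d).mean(dim=2)

class GlobalClusterAttention(nn.Module):
    # Item 4: scaled dot-product attention over cluster representations.
    def __init__(self, d_model):
        super().__init__()
        self.q = nn.Linear(d_model, d_model)
        self.k = nn.Linear(d_model, d_model)
        self.v = nn.Linear(d_model, d_model)
        self.attn_dropout = nn.Dropout(0.3)
        self.output_dropout = nn.Dropout(0.3)

    def forward(self, c):                      # c: (batch, n_clusters, d_model)
        q, k, v = self.q(c), self.k(c), self.v(c)
        scores = q @ k.transpose(-2, -1) / (c.size(-1) ** 0.5)
        attn = self.attn_dropout(scores.softmax(dim=-1))
        return self.output_dropout(attn @ v)

class BroadcastUpdater(nn.Module):
    # Item 5: push cluster-level context back to the tokens in each cluster;
    # the gated mixing is again an assumption.
    def __init__(self, d_model):
        super().__init__()
        self.proj = nn.Linear(d_model, d_model)
        self.dropout = nn.Dropout(0.1)
        self.gate = nn.Sequential(
            nn.Linear(2 * d_model, d_model), nn.GELU(),
            nn.Linear(d_model, d_model), nn.Sigmoid())

    def forward(self, x, c, cluster_size=4):   # x: tokens, c: attended clusters
        ctx = self.dropout(self.proj(c)).repeat_interleave(cluster_size, dim=1)
        g = self.gate(torch.cat([x, ctx], dim=-1))
        return g * ctx + (1 - g) * x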
    

Training Hyperparameters

  • Embedding dimension: 192
  • Number of layers: 2
  • Local update steps (T_local): 3
  • Cluster size: 4
  • Batch size: 48
  • Learning rate: 4.74 × 10⁻⁴
  • Weight decay: 0.0381
  • Dropout rates:
    • Embedding: 0.4
    • Local aggregation: 0.3
    • Attention: 0.3
    • Final: 0.4
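
For convenience, these hyperparameters can be grouped into a single configuration object. The dataclass below is a sketch with assumed field names; the card does not state which optimizer was used with the learning rate and weight decay above.

from dataclasses import dataclass

@dataclass
class SwarmFormerBaseConfig:
    # Values taken from the hyperparameter list above; field names are assumed.
    d_model: int = 192
    num_layers: int = 2
    t_local: int = 3            # local update steps per layer
    cluster_size: int = 4
    batch_size: int = 48
    learning_rate: float = 4.74e-4
    weight_decay: float = 0.0381
    embedding_dropout: float = 0.4
    local_dropout: float = 0.3
    attention_dropout: float = 0.3
    final_dropout: float = 0.4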

Evaluation

Testing Data, Factors & Metrics

  • IMDB test split (25k samples)
  • Full FP32 inference
  • Batch size: 256
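
A minimal evaluation loop consistent with this setup is sketched below. The model/dataloader interface and the positive-class index are assumptions; the metric formulas are the standard binary-classification definitions used for the results that follow.

import torch

@torch.no_grad()
def evaluate(model, loader, device="cuda"):
    # FP32 inference over the IMDB test split; the loader is assumed to yield
    # (input_ids, labels) batches with label 1 as the positive class.
    model.eval().to(device)
    tp = fp = fn = correct = total = 0
    for input_ids, labels in loader:
        preds = model(input_ids.to(device)).argmax(dim=-1).cpu()
        correct += (preds == labels).sum().item()
        total += labels.numel()
        tp += ((preds == 1) & (labels == 1)).sum().item()
        fp += ((preds == 1) & (labels == 0)).sum().item()
        fn += ((preds == 0) & (labels == 1)).sum().item()
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {"accuracy": correct / total,
            "precision": precision,
            "recall": recall,
            "f1": 2 * precision * recall / (precision + recall)}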

Results

  • Accuracy: 89.03%
  • Precision: 87.22%
  • Recall: 91.46%
  • F1: 89.29%
  • Mean batch latency: 4.83ms
  • Peak memory: 9.13GB

Technical Specifications

Model Architecture and Objective

Complete architecture flow:

  1. Input → Token Embedding (with dropout)
  2. For each layer:
    • Multiple iterations of Local Swarm Updates
    • Cluster Formation
    • Global Attention between clusters
    • Broadcast updates back to tokens
  3. Mean pooling across sequence
  4. Final dropout and classification
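
Reusing the module sketches from the Model Architecture Details section (LocalSwarmAggregator, cluster_means, GlobalClusterAttention, BroadcastUpdater), the flow above could be assembled roughly as follows. The class names, constructor defaults, and the binary output head are assumptions, not the released implementation.

import torch.nn as nn

class SwarmFormerLayer(nn.Module):
    # One layer of step 2: repeated local swarm updates, cluster formation,
    # global attention between clusters, and a broadcast back to tokens.
    def __init__(self, d_model, cluster_size=4, t_local=3):
        super().__init__()
        self.local = LocalSwarmAggregator(d_model)
        self.global_attn = GlobalClusterAttention(d_model)
        self.broadcast = BroadcastUpdater(d_model)
        self.cluster_size = cluster_size
        self.t_local = t_local

    def forward(self, x):
        for _ in range(self.t_local):
            x = self.local(x)
        clusters = self.global_attn(cluster_means(x, self.cluster_size))
        return self.broadcast(x, clusters, self.cluster_size)

class SwarmFormerBase(nn.Module):
    # Steps 1, 3, and 4: embedding with dropout, mean pooling, classification.
    def __init__(self, vocab_size, d_model=192, num_layers=2, num_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.embed_dropout = nn.Dropout(0.4)
        self.layers = nn.ModuleList(
            [SwarmFormerLayer(d_model) for _ in range(num_layers)])
        self.final_dropout = nn.Dropout(0.4)
        self.classifier = nn.Linear(d_model, num_classes)

    def forward(self, input_ids):              # input_ids: (batch, seq)
        x = self.embed_dropout(self.embed(input_ids))
        for layer in self.layers:
            x = layer(x)
        return self.classifier(self.final_dropout(x.mean(dim=1)))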

Compute Infrastructure

  • GPU: NVIDIA RTX 2080 Ti or equivalent
  • VRAM: 10GB+ recommended
  • Framework: PyTorch

Software Requirements

The model is implemented in PyTorch; the core imports are:

import torch
import torch.nn as nn
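
Since the released checkpoint is stored as FP32 Safetensors (see Model Format below), the weights can be inspected with the safetensors library. The filename is an assumption, and the state-dict keys depend on the official model code.

from safetensors.torch import load_file

# Load the checkpoint and list parameter names and shapes so they can be
# mapped onto an implementation of the model (filename is assumed).
state_dict = load_file("model.safetensors")
print({name: tuple(tensor.shape) for name, tensor in state_dict.items()})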

Citation

@article{legg2025swarmformer,
  title={SwarmFormer: Local-Global Hierarchical Attention via Swarming Token Representations},
  author={Legg, Jordan and Sturmanis, Mikus and {Takara.ai}},
  journal={Takara.ai Research},
  year={2025},
  url={https://takara.ai/papers/SwarmFormer-Local-Global-Hierarchical-Attention-via-Swarming-Token-Representations.pdf}
}

Model Card Authors

Jordan Legg, Mikus Sturmanis, Takara.ai Research Team

Model Card Contact

[email protected]

Model Format

  • Hugging Face repository: takara-ai/SwarmFormer-Sentiment-Base
  • Parameters: 6.75M
  • Tensor type: FP32 (Safetensors)