Contra-Topic-bottleneck-t5-large: Linear Topic Extraction using Bottleneck T5

A lightweight approach to topic extraction leveraging the Bottleneck T5 autoencoder architecture with learned transformation matrices. This project provides three specialized transformation matrices for mapping content embeddings to topic embeddings across different domains.

Check out the blog

TL;DR: Transform content embeddings into topic embeddings using domain-specific 1024×1024 transformation matrices, one trained per dataset across three distinct domains. Built on top of the Bottleneck T5 architecture for efficient topic extraction with no model fine-tuning.

Motivation

Large Language Models (LLMs) have become the go-to solution for many NLP tasks, including topic extraction and classification. However, they come with significant overhead:

  • High computational requirements
  • Large memory footprint
  • Considerable inference latency
  • Complex deployment needs
  • Limited to pre-specified classes

This project offers a lightweight alternative specifically for topic extraction by leveraging the semantic structure of the Bottleneck T5's latent space. Instead of training a new model or fine-tuning existing ones, we learn a simple linear transformation between content and topic embeddings, providing:

  • Fast inference (milliseconds)
  • Minimal memory footprint (single 1024×1024 matrix per domain)
  • Simple deployment (basic matrix multiplication)
  • No need for GPU at inference time
  • Generative in nature (topics are decoded as free text rather than chosen from a fixed label set)

Architecture Overview

Base Model

  • Bottleneck T5 autoencoder (large, ~770M parameters) by Linus Lee (@thesephist)
  • Encodes text into a single 1024-dimensional latent embedding and can decode a latent back into text
  • Used as-is to produce embeddings and decode topics; no fine-tuning is performed

Transformation Layers

  • Three domain-specific transformation matrices (1024×1024 each)
  • Linear mapping from content to topic space
  • Learned using simple Mean Squared Error optimization (a minimal training sketch follows this list)
  • Total additional parameters: ~1M per domain (~3M across all three domains)
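
The exact script used to learn these matrices is not included here; the following is a minimal sketch of fitting such a matrix with plain MSE optimization in PyTorch. The tensor shapes follow this card, while the stand-in random data, optimizer, and step count are illustrative assumptions.

import torch

# Hypothetical stand-ins: one row per (content, topic) pair, where each row is a
# 1024-dim Bottleneck T5 embedding produced by BottleneckT5Autoencoder.embed().
content_embeddings = torch.randn(10_000, 1024)
topic_embeddings = torch.randn(10_000, 1024)

# A bias-free linear layer is exactly a learnable 1024×1024 transformation matrix.
mapper = torch.nn.Linear(1024, 1024, bias=False)
optimizer = torch.optim.Adam(mapper.parameters(), lr=1e-3)
loss_fn = torch.nn.MSELoss()

for step in range(200):
    optimizer.zero_grad()
    loss = loss_fn(mapper(content_embeddings), topic_embeddings)
    loss.backward()
    optimizer.step()

# Transposed so it can be applied as `content_embedding @ transformation_matrix`,
# matching the usage shown in the Implementation section.
transformation_matrix = mapper.weight.detach().T
print(transformation_matrix.shape)  # torch.Size([1024, 1024])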

Datasets and Performance Metrics

1. ArXiv Abstracts Dataset (ankitagr01/dynamic_topic_modeling_arxiv_abstracts)

Scientific paper abstracts paired with their research topics, providing a test bed for academic content classification.

Performance Metrics:

  • Training MSE: 0.00225 (error on samples used to learn transformation)
  • Testing MSE: 0.00268 (error on held-out validation set)
  • Inter-topic MSE: 0.00620 (minimum squared distance between different topic embeddings)

Use Cases:

  • Automated paper categorization
  • Research trend analysis
  • Academic content recommendation

2. TopicSUM Dataset (knkarthick/topicsum)

241,171 dialogue samples with human-annotated topic labels, ideal for conversational content analysis.

Performance Metrics:

  • Training MSE: 0.00252
  • Testing MSE: 0.00255
  • Inter-topic MSE: 0.00737

Use Cases:

  • Meeting summarization
  • Customer service dialogue categorization
  • Chat log analysis

3. MSD Manual Topics (nuvocare/MSD_manual_topics_user_base)

Medical content from Merck's Manual, featuring both professional and patient-oriented content.

Performance Metrics:

  • Training MSE: 0.00174
  • Testing MSE: 0.00197
  • Inter-topic MSE: 0.00566

Use Cases:

  • Medical document classification
  • Healthcare content organization
  • Patient information routing

Understanding the Metrics

Computational Requirements

Resource            Requirement         Notes
Storage             ~9MB per matrix     1024×1024 float32 values
Memory              ~27MB total         All three domain matrices
Inference Time      ~10ms               On CPU, per text sample
Training Hardware   P100 GPU            Free tier on Kaggle
Training Time       ~4 hours total      Mostly embedding generation
Base Model          ~770M parameters    Loaded only during embedding creation

Performance Metrics Explained

  1. Training MSE (Mean Squared Error)

    • Measures how well the transformation matrix maps content to topic embeddings
    • Calculated on the 80% training split
    • Lower values indicate better alignment between transformed content and actual topic embeddings
  2. Testing MSE

    • Same metric but on 20% held-out test set
    • Indicates generalization capability
    • Similar train/test values suggest good generalization; a test MSE slightly above the training MSE is expected and healthy
  3. Inter-topic MSE

    • Minimum squared distance between any pair of topic embeddings
    • Higher values indicate better topic separation
    • Critical for preventing topic confusion
    • Example: MSD's 0.00566 means medical topics maintain distinct representations (a sketch of how these metrics are computed follows this list)
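
As a concrete illustration, here is a minimal sketch of how these numbers can be computed with PyTorch. The random stand-in tensors are assumptions; in practice they are the held-out content/topic embedding pairs and the learned matrix.

import torch

# Stand-ins for the held-out 20% split and the learned matrix.
transformation_matrix = torch.randn(1024, 1024)
test_content = torch.randn(2_000, 1024)
test_topics = torch.randn(2_000, 1024)

# Testing MSE: how closely transformed content embeddings land on the topic embeddings.
testing_mse = torch.mean((test_content @ transformation_matrix - test_topics) ** 2)

# Inter-topic MSE: minimum mean squared difference between any two distinct topic
# embeddings; larger values mean topics stay better separated in latent space.
squared_dists = torch.cdist(test_topics, test_topics, p=2) ** 2 / test_topics.shape[1]
squared_dists.fill_diagonal_(float('inf'))  # exclude each topic's distance to itself
inter_topic_mse = squared_dists.min()

print(testing_mse.item(), inter_topic_mse.item())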

Comparative Analysis

  • MSD dataset shows best training performance (0.00174 MSE)
    • Likely due to well-structured medical vocabulary
    • Clear topic boundaries in medical domain
  • TopicSUM has highest inter-topic MSE (0.00737)
    • Reflects diverse nature of conversational topics
    • Important for distinguishing between varied dialogue contexts
  • ArXiv results balance between the two
    • Scientific content has natural overlap between fields
    • Still maintains good topic separation (0.00620 inter-topic MSE)

Implementation

Try it out in this Colab notebook: https://colab.research.google.com/drive/1_SuTiL3QS-PUYjSrugqqD5mQlMv8Hbfc?usp=sharing

1. Base Model Wrapper

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

class BottleneckT5Autoencoder:
    def __init__(self, model_path: str, device='cpu'):
        self.device = device
        self.tokenizer = AutoTokenizer.from_pretrained(model_path, model_max_length=512)
        self.model = AutoModelForCausalLM.from_pretrained(
            model_path,
            trust_remote_code=True
        ).to(device)
        self.model.eval()

    @torch.no_grad()
    def embed(self, text: str) -> torch.FloatTensor:
        """Encode a text into a single latent embedding (1024-dim for the large model)."""
        inputs = self.tokenizer(text, return_tensors='pt').to(self.device)
        decoder_inputs = self.tokenizer('', return_tensors='pt').to(self.device)
        return self.model(
            **inputs,
            decoder_input_ids=decoder_inputs['input_ids'],
            encode_only=True,
        )[0]

    @torch.no_grad()
    def generate_from_latent(self, latent: torch.FloatTensor, max_length=512, temperature=1.0) -> str:
        """Decode a latent embedding back into text."""
        # The model's custom code offsets the dummy input's latent by perturb_vector
        # during generation, steering the output toward the target latent.
        dummy_text = '.'
        dummy = self.embed(dummy_text)
        perturb_vector = latent - dummy
        self.model.perturb_vector = perturb_vector
        input_ids = self.tokenizer(dummy_text, return_tensors='pt').to(self.device).input_ids
        output = self.model.generate(
            input_ids=input_ids,
            max_length=max_length,
            do_sample=True,
            temperature=temperature,
            top_p=0.9,
            num_return_sequences=1,
        )
        return self.tokenizer.decode(output[0], skip_special_tokens=True)
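
A quick roundtrip check of the wrapper above (the model path is a placeholder, since this card does not name the exact Bottleneck T5 checkpoint):

# Hypothetical usage; point model_path at the Bottleneck T5 autoencoder you are using.
model_path = 'path-or-hub-id-of-bottleneck-t5-large'
autoencoder = BottleneckT5Autoencoder(model_path=model_path, device='cpu')

embedding = autoencoder.embed('Quantum computing uses qubits to perform computation.')
print(embedding.shape)  # a 1024-dim latent for the large model

reconstruction = autoencoder.generate_from_latent(embedding)
print(reconstruction)  # should roughly paraphrase the original sentence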

2. Topic Mapper

Transformations Available:

  1. https://huggingface.co/AmanPriyanshu/Contra-Topic-bottleneck-t5-large/resolve/main/transformation_matrix_topicsum.pt
  2. https://huggingface.co/AmanPriyanshu/Contra-Topic-bottleneck-t5-large/resolve/main/transformation_matrix_arxiv.pt
  3. https://huggingface.co/AmanPriyanshu/Contra-Topic-bottleneck-t5-large/resolve/main/transformation_matrix_msd.pt

import requests
import torch

# Download one of the matrices listed above (ArXiv here) and load it as a float32 tensor.
url = 'https://huggingface.co/AmanPriyanshu/Contra-Topic-bottleneck-t5-large/resolve/main/transformation_matrix_arxiv.pt'
file_path = 'transformation_matrix.pt'
with open(file_path, 'wb') as f:
    f.write(requests.get(url).content)
transformation_matrix = torch.load(file_path, weights_only=False).float()
print(transformation_matrix.shape, type(transformation_matrix))  # expected shape: torch.Size([1024, 1024])
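
If you want to switch between domains at runtime, a small helper along these lines keeps all three matrices in memory (roughly the ~27MB total noted earlier). The function and dictionary key names are illustrative; the URLs are the ones listed above.

import requests
import torch

MATRIX_URLS = {
    'topicsum': 'https://huggingface.co/AmanPriyanshu/Contra-Topic-bottleneck-t5-large/resolve/main/transformation_matrix_topicsum.pt',
    'arxiv': 'https://huggingface.co/AmanPriyanshu/Contra-Topic-bottleneck-t5-large/resolve/main/transformation_matrix_arxiv.pt',
    'msd': 'https://huggingface.co/AmanPriyanshu/Contra-Topic-bottleneck-t5-large/resolve/main/transformation_matrix_msd.pt',
}

def load_matrix(domain: str) -> torch.Tensor:
    """Download the matrix for the given domain and return it as a float32 tensor."""
    path = f'transformation_matrix_{domain}.pt'
    response = requests.get(MATRIX_URLS[domain])
    response.raise_for_status()
    with open(path, 'wb') as f:
        f.write(response.content)
    return torch.load(path, weights_only=False).float()

matrices = {domain: load_matrix(domain) for domain in MATRIX_URLS}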

3. Final Conversion


# model_path: hub ID or local path of the Bottleneck T5 autoencoder checkpoint
# device: 'cpu' or 'cuda'; content: the text to extract a topic for
autoencoder = BottleneckT5Autoencoder(model_path=model_path, device=device)
content_embedding = autoencoder.embed(content)

# Map the content embedding into topic space and decode it back into text
topic_embedding = content_embedding @ transformation_matrix
topic = autoencoder.generate_from_latent(topic_embedding)
print(topic)

Limitations and Future Work

  1. Representation Quality

    • System inherits Bottleneck T5's encoding limitations
    • Performance depends on input text fitting model's training distribution
  2. Domain Specificity

    • Each matrix is domain-optimized
    • Cross-domain performance not guaranteed
    • Future work: Investigate domain adaptation techniques
  3. Fixed Dimensionality

    • Locked to Bottleneck T5's 1024D space
    • Potential future work: Dimension reduction studies
  4. Linear Transformation Limitations

    • Assumes linear relationship between content and topic spaces
    • Future work: Explore non-linear transformations

Memory and Computation Requirements

  • Transformation Matrix: ~9MB per domain (one 1024 × 1024 matrix)
  • Inference Time: ~10ms on CPU (matrix multiplication)
  • Total Model Size: ~27MB (all three domains)
  • Base Model: ~770M parameters (loaded only during embedding creation)

Acknowledgments

Special thanks to:

  • Linus Lee (@thesephist) for the Bottleneck T5 model
  • The T5 team at Google Research
  • Dataset providers:
    • @ankitagr01 for the ArXiv abstracts dataset
    • @knkarthick for the TopicSUM dataset
    • @nuvocare for the MSD Manual topics dataset
  • Kaggle for providing free P100 GPU resources

License

MIT

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.
