---
license: apache-2.0
language:
- en
base_model:
- answerdotai/ModernBERT-base
pipeline_tag: text-classification
library_name: transformers
tags:
- modernbert
- m-bert
---

# **MBERT Context Specifier**

*MBERT Context Specifier* is a 150M-parameter text classifier that assigns contextual labels to its input. It is built on ModernBERT, a modernized bidirectional encoder-only (BERT-style) Transformer pre-trained on 2 trillion tokens of English and code data with a native context length of up to 8,192 tokens. The architecture incorporates the following features:

1. **Rotary Positional Embeddings (RoPE):** Enables long-context support.
2. **Local-Global Alternating Attention:** Improves efficiency when processing long inputs.
3. **Unpadding and Flash Attention:** Enable efficient inference.

ModernBERT’s native long context makes the model well suited to tasks that involve lengthy documents, such as retrieval, classification, and semantic search over large corpora. Because the backbone was trained on both text and code, it also supports downstream tasks such as code retrieval and hybrid (text + code) semantic search.

# **Run inference**

```python
from transformers import pipeline

# load the model from huggingface.co/models using the repository id
classifier = pipeline(
    task="text-classification",
    model="prithivMLmods/MBERT-Context-Specifier",
    device=0,  # GPU index; use device=-1 (or omit) to run on CPU
)

sample = "The global market for sustainable technologies has seen rapid growth over the past decade as businesses increasingly prioritize environmental sustainability."

# returns the predicted label and score for the input
classifier(sample)
```
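
For more control over tokenization, for example to make the 8,192-token window explicit, the checkpoint can also be loaded without the pipeline helper. The sketch below is a minimal example; it assumes a standard sequence-classification head, and the label names come from the checkpoint's `id2label` config (not listed here).

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_id = "prithivMLmods/MBERT-Context-Specifier"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)
model.eval()

text = "Quarterly earnings beat expectations, driven by strong cloud revenue."

# tokenize with the full 8,192-token window available to ModernBERT
inputs = tokenizer(text, truncation=True, max_length=8192, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

probs = logits.softmax(dim=-1)[0]
pred_id = int(probs.argmax())
# label names depend on the checkpoint's id2label mapping
print(model.config.id2label[pred_id], float(probs[pred_id]))
```
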
# **Intended Use**

The MBERT Context Specifier is designed for the following purposes:

1. **Text and Code Classification:**
   - Assigning contextual labels to long text or code inputs.
   - Suitable for tasks that require semantic understanding of both text and code.

2. **Long-Document Processing:**
   - Ideal for tasks such as document retrieval, summarization, and classification within lengthy documents (up to 8,192 tokens).

3. **Semantic Search:**
   - Enables semantic and hybrid (text + code) search across large corpora.
   - Applicable in industries with domain-specific retrieval needs (e.g., legal, healthcare, and finance).

4. **Code Retrieval and Documentation:**
   - Retrieving relevant code snippets and understanding context in large codebases and technical documentation.

5. **Language Understanding and Analysis:**
   - General-purpose tasks such as question answering, summarization, and sentiment analysis over long text inputs.

6. **Efficient Inference with Long Contexts:**
   - Optimized for processing long inputs with minimal computational overhead, thanks to Flash Attention and RoPE (see the batched-inference sketch below).
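
The sketch below shows one way to batch several long documents through the pipeline while keeping each input inside the 8,192-token window. The documents and batch size are illustrative; recent `transformers` releases forward extra keyword arguments such as `truncation` and `max_length` to the tokenizer, and if your version does not, the same effect can be achieved by loading the tokenizer and model directly as in the earlier sketch.

```python
from transformers import pipeline

classifier = pipeline(
    task="text-classification",
    model="prithivMLmods/MBERT-Context-Specifier",
    device=0,  # GPU index; use -1 (or omit) for CPU
)

# placeholder documents; in practice these can be full reports or source files
documents = [
    "Annual sustainability report: emissions fell 12% year over year ...",
    "def load_config(path):\n    with open(path) as f:\n        return f.read()",
]

# batch the inputs and cap each at the 8,192-token context window;
# truncation/max_length go to the tokenizer, batch_size to the pipeline
results = classifier(documents, batch_size=8, truncation=True, max_length=8192)

for doc, result in zip(documents, results):
    print(result, "<-", doc[:40].replace("\n", " "))
```
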
# **Limitations**

1. **Domain-Specific Performance:**
   - Although pre-trained on a large corpus of text and code, MBERT may require fine-tuning to perform well in niche or highly specialized domains.

2. **Tokenization Constraints:**
   - Inputs exceeding the 8,192-token limit must be truncated or pre-processed intelligently to avoid losing critical information (a chunking sketch follows this list).

3. **Bias in Training Data:**
   - The pre-training data (text + code) may carry biases from its source corpora, which can lead to biased classifications or retrievals in certain contexts.

4. **Code-Specific Challenges:**
   - Although MBERT supports code understanding, it may struggle with niche programming languages or highly domain-specific coding conventions without fine-tuning.

5. **Inference Costs on Resource-Constrained Devices:**
   - Processing long-context inputs can be computationally expensive, making MBERT less suitable for edge devices or environments with limited compute.

6. **Multilingual Support:**
   - Optimized for English and code, MBERT may perform sub-optimally in other languages unless fine-tuned on multilingual data.
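
One workaround for limitation 2, sketched below rather than provided by this repository, is to split an over-length document into windows of at most 8,192 tokens, classify each window, and aggregate the per-window probabilities. The mean-pooling aggregation and the window/stride sizes here are illustrative choices, not a tuned strategy.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_id = "prithivMLmods/MBERT-Context-Specifier"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id).eval()


def classify_long_document(text: str, window: int = 8192, stride: int = 512):
    """Classify a document longer than the context window by averaging
    per-window probabilities (a simple heuristic)."""
    token_ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    step = window - stride
    window_probs = []
    for start in range(0, max(len(token_ids), 1), step):
        # decode a window of tokens back to text and re-encode with special tokens
        chunk_text = tokenizer.decode(token_ids[start:start + window])
        inputs = tokenizer(chunk_text, truncation=True, max_length=window, return_tensors="pt")
        with torch.no_grad():
            logits = model(**inputs).logits
        window_probs.append(logits.softmax(dim=-1))
    mean_probs = torch.cat(window_probs).mean(dim=0)
    label_id = int(mean_probs.argmax())
    return model.config.id2label[label_id], float(mean_probs[label_id])


# usage: label, score = classify_long_document(very_long_report)
```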