---
license: apache-2.0
language:
- en
base_model:
- answerdotai/ModernBERT-base
pipeline_tag: text-classification
library_name: transformers
tags:
- modernbert
- m-bert
---

# **MBERT Context Specifier**

*MBERT Context Specifier* is a 150M-parameter text classifier that assigns contextual labels to its input. It is built on ModernBERT, a modernized bidirectional encoder-only (BERT-style) Transformer pre-trained on 2 trillion tokens of English and code data with a native context length of up to 8,192 tokens. The architecture incorporates the following features:

1. **Rotary Positional Embeddings (RoPE):** Enables long-context support.
2. **Local-Global Alternating Attention:** Improves efficiency when processing long inputs.
3. **Unpadding and Flash Attention:** Enable efficient inference.

ModernBERT’s native long context makes the model well suited to tasks that involve lengthy documents, such as retrieval, classification, and semantic search over large corpora. Because the backbone was trained on both text and code, it also supports downstream tasks such as code retrieval and hybrid (text + code) semantic search.

# **Run inference**

```python
from transformers import pipeline

# load the model from huggingface.co/models using the repository id
classifier = pipeline(
    task="text-classification",
    model="prithivMLmods/MBERT-Context-Specifier",
    device=0,  # GPU index; use device=-1 (or omit) to run on CPU
)

sample = "The global market for sustainable technologies has seen rapid growth over the past decade as businesses increasingly prioritize environmental sustainability."

# returns the predicted label and score for the input
classifier(sample)
```
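
For more control over tokenization, for example to make the 8,192-token window explicit, the checkpoint can also be loaded without the pipeline helper. The sketch below is a minimal example; it assumes a standard sequence-classification head, and the label names come from the checkpoint's `id2label` config (not listed here).

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_id = "prithivMLmods/MBERT-Context-Specifier"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)
model.eval()

text = "Quarterly earnings beat expectations, driven by strong cloud revenue."

# tokenize with the full 8,192-token window available to ModernBERT
inputs = tokenizer(text, truncation=True, max_length=8192, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

probs = logits.softmax(dim=-1)[0]
pred_id = int(probs.argmax())
# label names depend on the checkpoint's id2label mapping
print(model.config.id2label[pred_id], float(probs[pred_id]))
```
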
# **Intended Use**

The MBERT Context Specifier is designed for the following purposes:

1. **Text and Code Classification:**
   - Assigning contextual labels to long text or code inputs.
   - Suitable for tasks that require semantic understanding of both text and code.

2. **Long-Document Processing:**
   - Ideal for tasks such as document retrieval, summarization, and classification within lengthy documents (up to 8,192 tokens).

3. **Semantic Search:**
   - Enables semantic and hybrid (text + code) search across large corpora.
   - Applicable in industries with domain-specific retrieval needs (e.g., legal, healthcare, and finance).

4. **Code Retrieval and Documentation:**
   - Retrieving relevant code snippets and understanding context in large codebases and technical documentation.

5. **Language Understanding and Analysis:**
   - General-purpose tasks such as question answering, summarization, and sentiment analysis over long text inputs.

6. **Efficient Inference with Long Contexts:**
   - Optimized for processing long inputs with minimal computational overhead, thanks to Flash Attention and RoPE (see the batched-inference sketch below).
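
The sketch below shows one way to batch several long documents through the pipeline while keeping each input inside the 8,192-token window. The documents and batch size are illustrative; recent `transformers` releases forward extra keyword arguments such as `truncation` and `max_length` to the tokenizer, and if your version does not, the same effect can be achieved by loading the tokenizer and model directly as in the earlier sketch.

```python
from transformers import pipeline

classifier = pipeline(
    task="text-classification",
    model="prithivMLmods/MBERT-Context-Specifier",
    device=0,  # GPU index; use -1 (or omit) for CPU
)

# placeholder documents; in practice these can be full reports or source files
documents = [
    "Annual sustainability report: emissions fell 12% year over year ...",
    "def load_config(path):\n    with open(path) as f:\n        return f.read()",
]

# batch the inputs and cap each at the 8,192-token context window;
# truncation/max_length go to the tokenizer, batch_size to the pipeline
results = classifier(documents, batch_size=8, truncation=True, max_length=8192)

for doc, result in zip(documents, results):
    print(result, "<-", doc[:40].replace("\n", " "))
```
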
# **Limitations**

1. **Domain-Specific Performance:**
   - Although pre-trained on a large corpus of text and code, MBERT may require fine-tuning to perform well in niche or highly specialized domains.

2. **Tokenization Constraints:**
   - Inputs exceeding the 8,192-token limit must be truncated or pre-processed intelligently to avoid losing critical information (a chunking sketch follows this list).

3. **Bias in Training Data:**
   - The pre-training data (text + code) may carry biases from its source corpora, which can lead to biased classifications or retrievals in certain contexts.

4. **Code-Specific Challenges:**
   - Although MBERT supports code understanding, it may struggle with niche programming languages or highly domain-specific coding conventions without fine-tuning.

5. **Inference Costs on Resource-Constrained Devices:**
   - Processing long-context inputs can be computationally expensive, making MBERT less suitable for edge devices or environments with limited compute.

6. **Multilingual Support:**
   - Optimized for English and code, MBERT may perform sub-optimally in other languages unless fine-tuned on multilingual data.
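
One workaround for limitation 2, sketched below rather than provided by this repository, is to split an over-length document into windows of at most 8,192 tokens, classify each window, and aggregate the per-window probabilities. The mean-pooling aggregation and the window/stride sizes here are illustrative choices, not a tuned strategy.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_id = "prithivMLmods/MBERT-Context-Specifier"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id).eval()


def classify_long_document(text: str, window: int = 8192, stride: int = 512):
    """Classify a document longer than the context window by averaging
    per-window probabilities (a simple heuristic)."""
    token_ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    step = window - stride
    window_probs = []
    for start in range(0, max(len(token_ids), 1), step):
        # decode a window of tokens back to text and re-encode with special tokens
        chunk_text = tokenizer.decode(token_ids[start:start + window])
        inputs = tokenizer(chunk_text, truncation=True, max_length=window, return_tensors="pt")
        with torch.no_grad():
            logits = model(**inputs).logits
        window_probs.append(logits.softmax(dim=-1))
    mean_probs = torch.cat(window_probs).mean(dim=0)
    label_id = int(mean_probs.argmax())
    return model.config.id2label[label_id], float(mean_probs[label_id])


# usage: label, score = classify_long_document(very_long_report)
```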