---
library_name: transformers
datasets:
- WebOrganizer/FormatAnnotations-Llama-3.1-8B
- WebOrganizer/FormatAnnotations-Llama-3.1-405B-FP8
base_model:
- Alibaba-NLP/gte-base-en-v1.5
---
# WebOrganizer/FormatClassifier-NoURL

[[Paper](https://arxiv.org/abs/2502.10341)] [[Website](https://weborganizer.allenai.org)] [[GitHub](https://github.com/CodeCreator/WebOrganizer)]

The FormatClassifier-NoURL organizes web content into 24 categories based on the text contents of web pages (without using URL information).
The model is a [gte-base-en-v1.5](https://huggingface.co/Alibaba-NLP/gte-base-en-v1.5) with 140M parameters fine-tuned on the following training data:
1. [WebOrganizer/FormatAnnotations-Llama-3.1-8B](https://huggingface.co/datasets/WebOrganizer/FormatAnnotations-Llama-3.1-8B): 1M documents annotated by Llama-3.1-8B (first-stage training)
2. [WebOrganizer/FormatAnnotations-Llama-3.1-405B-FP8](https://huggingface.co/datasets/WebOrganizer/FormatAnnotations-Llama-3.1-405B-FP8): 100K documents annotated by Llama-3.1-405B-FP8 (second-stage training)

#### All Domain Classifiers
- [WebOrganizer/FormatClassifier](https://huggingface.co/WebOrganizer/FormatClassifier)
- [WebOrganizer/FormatClassifier-NoURL](https://huggingface.co/WebOrganizer/FormatClassifier-NoURL) *← you are here!*
- [WebOrganizer/TopicClassifier](https://huggingface.co/WebOrganizer/TopicClassifier)
- [WebOrganizer/TopicClassifier-NoURL](https://huggingface.co/WebOrganizer/TopicClassifier-NoURL)

## Usage

This classifier expects only the text of the web page as input (no URL), in the following format:
```
{text}
```

Example:
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("WebOrganizer/FormatClassifier-NoURL")
model = AutoModelForSequenceClassification.from_pretrained(
    "WebOrganizer/FormatClassifier-NoURL",
    trust_remote_code=True,
    use_memory_efficient_attention=False)

web_page = """How to make a good sandwich? [Click here to read article]"""

inputs = tokenizer([web_page], return_tensors="pt")
outputs = model(**inputs)

probs = outputs.logits.softmax(dim=-1)
print(probs.argmax(dim=-1))
# -> 6 ("Truncated" format, which covers incomplete content)
```

Applying a softmax to the model's `logits` yields a probability distribution over the following 24 categories (in label order; see also `id2label` and `label2id` in the model config):
1. Academic Writing
2. Content Listing
3. Creative Writing
4. Customer Support
5. Comment Section
6. FAQ
7. Truncated
8. Knowledge Article
9. Legal Notices
10. Listicle
11. News Article
12. Nonfiction Writing
13. About (Org.)
14. News (Org.)
15. About (Pers.)
16. Personal Blog
17. Product Page
18. Q&A Forum
19. Spam / Ads
20. Structured Data
21. Documentation
22. Audio Transcript
23. Tutorial
24. User Review

The full definitions of the categories can be found in the [taxonomy config](https://github.com/CodeCreator/WebOrganizer/blob/main/define_domains/taxonomies/formats.yaml).
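To turn a predicted index back into one of the category names above, read the label mapping from the model config (`model.config.id2label`). A minimal self-contained sketch — the hard-coded list below simply mirrors the 24 labels in order and stands in for the config mapping, and the probabilities are made up for illustration:

```python
labels = [
    "Academic Writing", "Content Listing", "Creative Writing", "Customer Support",
    "Comment Section", "FAQ", "Truncated", "Knowledge Article", "Legal Notices",
    "Listicle", "News Article", "Nonfiction Writing", "About (Org.)", "News (Org.)",
    "About (Pers.)", "Personal Blog", "Product Page", "Q&A Forum", "Spam / Ads",
    "Structured Data", "Documentation", "Audio Transcript", "Tutorial", "User Review",
]

# Stand-in probabilities; in practice use outputs.logits.softmax(dim=-1)
probs = [0.0] * 24
probs[6] = 1.0  # pretend the model is confident in class 6

# Index of the most probable class, then its human-readable name
top = max(range(len(probs)), key=probs.__getitem__)
print(labels[top])  # -> Truncated
```

In practice, prefer `model.config.id2label[top]` over a hand-maintained list so the mapping always matches the checkpoint.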

#### Efficient Inference
We recommend that you use the efficient gte-base-en-v1.5 implementation by enabling unpadding and memory efficient attention. This __requires installing `xformers`__ (see more [here](https://huggingface.co/Alibaba-NLP/new-impl#recommendation-enable-unpadding-and-acceleration-with-xformers)) and loading the model like:
```python
import torch
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "WebOrganizer/FormatClassifier-NoURL",
    trust_remote_code=True,
    unpad_inputs=True,
    use_memory_efficient_attention=True,
    torch_dtype=torch.bfloat16
)
```

## Citation
```bibtex
@article{wettig2025organize,
  title={Organize the Web: Constructing Domains Enhances Pre-Training Data Curation},
  author={Alexander Wettig and Kyle Lo and Sewon Min and Hannaneh Hajishirzi and Danqi Chen and Luca Soldaini},
  journal={arXiv preprint arXiv:2502.10341},
  year={2025}
}
```