---
library_name: transformers
datasets:
- WebOrganizer/FormatAnnotations-Llama-3.1-8B
- WebOrganizer/FormatAnnotations-Llama-3.1-405B-FP8
base_model:
- Alibaba-NLP/gte-base-en-v1.5
---
# WebOrganizer/FormatClassifier-NoURL
[[Paper](https://arxiv.org/abs/2502.10341)] [[Website](https://weborganizer.allenai.org)] [[GitHub](https://github.com/CodeCreator/WebOrganizer)]
The FormatClassifier-NoURL organizes web content into 24 categories based on the text contents of web pages (without using URL information).
The model is a [gte-base-en-v1.5](https://huggingface.co/Alibaba-NLP/gte-base-en-v1.5) model with 140M parameters, fine-tuned on the following training data:
1. [WebOrganizer/FormatAnnotations-Llama-3.1-8B](https://huggingface.co/datasets/WebOrganizer/FormatAnnotations-Llama-3.1-8B): 1M documents annotated by Llama-3.1-8B (first-stage training)
2. [WebOrganizer/FormatAnnotations-Llama-3.1-405B-FP8](https://huggingface.co/datasets/WebOrganizer/FormatAnnotations-Llama-3.1-405B-FP8): 100K documents annotated by Llama-3.1-405B-FP8 (second-stage training)
#### All Domain Classifiers
- [WebOrganizer/FormatClassifier](https://huggingface.co/WebOrganizer/FormatClassifier)
- [WebOrganizer/FormatClassifier-NoURL](https://huggingface.co/WebOrganizer/FormatClassifier-NoURL) *← you are here!*
- [WebOrganizer/TopicClassifier](https://huggingface.co/WebOrganizer/TopicClassifier)
- [WebOrganizer/TopicClassifier-NoURL](https://huggingface.co/WebOrganizer/TopicClassifier-NoURL)
## Usage
This classifier expects input in the following format:
```
{text}
```
Example:
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("WebOrganizer/FormatClassifier-NoURL")
model = AutoModelForSequenceClassification.from_pretrained(
    "WebOrganizer/FormatClassifier-NoURL",
    trust_remote_code=True,
    use_memory_efficient_attention=False)

web_page = """How to make a good sandwich? [Click here to read article]"""

# Tokenize the page text and predict a distribution over the 24 format categories
inputs = tokenizer([web_page], return_tensors="pt")
outputs = model(**inputs)
probs = outputs.logits.softmax(dim=-1)
print(probs.argmax(dim=-1))
# -> 6 ("Truncated" format, which covers incomplete content)
```
Applying a softmax to the model's `logits` yields a probability distribution over the following 24 categories (listed in label order; see also `id2label` and `label2id` in the model config, and the sketch after the list for mapping predictions back to category names):
1. Academic Writing
2. Content Listing
3. Creative Writing
4. Customer Support
5. Comment Section
6. FAQ
7. Truncated
8. Knowledge Article
9. Legal Notices
10. Listicle
11. News Article
12. Nonfiction Writing
13. About (Org.)
14. News (Org.)
15. About (Pers.)
16. Personal Blog
17. Product Page
18. Q&A Forum
19. Spam / Ads
20. Structured Data
21. Documentation
22. Audio Transcript
23. Tutorial
24. User Review
The full definitions of the categories can be found in the [taxonomy config](https://github.com/CodeCreator/WebOrganizer/blob/main/define_domains/taxonomies/formats.yaml).
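Building on the usage example above, the following sketch (assuming the same `model`, `tokenizer`, and `inputs` objects are still in scope) shows one way to recover the category names from `id2label` and inspect the top predictions:

```python
import torch

# Re-run the forward pass without gradients and map predicted indices
# back to category names via the id2label mapping in the model config.
with torch.no_grad():
    outputs = model(**inputs)
probs = outputs.logits.softmax(dim=-1)

# Top-3 categories for the first (and only) document in the batch
top_probs, top_ids = probs[0].topk(3)
for p, i in zip(top_probs.tolist(), top_ids.tolist()):
    print(f"{model.config.id2label[i]}: {p:.3f}")
```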
#### Efficient Inference
We recommend that you use the efficient gte-base-en-v1.5 implementation by enabling unpadding and memory efficient attention. This __requires installing `xformers`__ (see more [here](https://huggingface.co/Alibaba-NLP/new-impl#recommendation-enable-unpadding-and-acceleration-with-xformers)) and loading the model like:
```python
import torch
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "WebOrganizer/FormatClassifier-NoURL",
    trust_remote_code=True,
    unpad_inputs=True,
    use_memory_efficient_attention=True,
    torch_dtype=torch.bfloat16
)
```
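With those settings, batched inference might look like the minimal sketch below; it assumes a CUDA GPU with `xformers` installed and the same `tokenizer` as above, and the documents and batch size are purely illustrative:

```python
import torch

device = "cuda"
model = model.to(device).eval()

# Illustrative inputs; replace with your own documents and batch size.
documents = ["How to make a good sandwich? [Click here to read article]"] * 8
batch_size = 4

predicted_labels = []
with torch.no_grad():
    for start in range(0, len(documents), batch_size):
        batch = documents[start:start + batch_size]
        inputs = tokenizer(batch, padding=True, truncation=True,
                           return_tensors="pt").to(device)
        probs = model(**inputs).logits.softmax(dim=-1)
        predicted_labels.extend(
            model.config.id2label[i] for i in probs.argmax(dim=-1).tolist())

print(predicted_labels)
```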
## Citation
```bibtex
@article{wettig2025organize,
title={Organize the Web: Constructing Domains Enhances Pre-Training Data Curation},
author={Alexander Wettig and Kyle Lo and Sewon Min and Hannaneh Hajishirzi and Danqi Chen and Luca Soldaini},
journal={arXiv preprint arXiv:2502.10341},
year={2025}
}
``` |