---
library_name: transformers
datasets:
  - WebOrganizer/FormatAnnotations-Llama-3.1-8B
  - WebOrganizer/FormatAnnotations-Llama-3.1-405B-FP8
base_model:
  - Alibaba-NLP/gte-base-en-v1.5
---

WebOrganizer/FormatClassifier-NoURL

[Paper] [Website] [GitHub]

The FormatClassifier-NoURL organizes web content into 24 format categories based solely on the text contents of web pages, without using URL information. The model is a 140M-parameter gte-base-en-v1.5 fine-tuned in two stages on the following training data:

  1. WebOrganizer/FormatAnnotations-Llama-3.1-8B: 1M documents annotated by Llama-3.1-8B (first-stage training)
  2. WebOrganizer/FormatAnnotations-Llama-3.1-405B-FP8: 100K documents annotated by Llama-3.1-405B-FP8 (second-stage training)

All Domain Classifiers

Usage

This classifier expects input in the following format:

{text}

Example:

from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("WebOrganizer/FormatClassifier-NoURL")
model = AutoModelForSequenceClassification.from_pretrained(
    "WebOrganizer/FormatClassifier-NoURL",
    trust_remote_code=True,  # the model ships custom gte modeling code
    use_memory_efficient_attention=False)  # see the Efficient Inference section for the fast path

web_page = """How to make a good sandwich? [Click here to read article]"""

inputs = tokenizer([web_page], return_tensors="pt")
outputs = model(**inputs)

probs = outputs.logits.softmax(dim=-1)
print(probs.argmax(dim=-1))
# -> tensor([6]): zero-indexed label 6 is the "Truncated" format, which covers incomplete content

Applying a softmax to the model's logits yields a probability distribution over the following 24 categories (listed in label order; see also id2label and label2id in the model config):

  1. Academic Writing
  2. Content Listing
  3. Creative Writing
  4. Customer Support
  5. Comment Section
  6. FAQ
  7. Truncated
  8. Knowledge Article
  9. Legal Notices
  10. Listicle
  11. News Article
  12. Nonfiction Writing
  13. About (Org.)
  14. News (Org.)
  15. About (Pers.)
  16. Personal Blog
  17. Product Page
  18. Q&A Forum
  19. Spam / Ads
  20. Structured Data
  21. Documentation
  22. Audio Transcript
  23. Tutorial
  24. User Review

The full definitions of the categories can be found in the taxonomy config.
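The softmax-and-lookup step above can be sketched in plain Python. The id2label dictionary below is a hand-written stand-in covering only the first four categories for illustration; the real mapping in the model config covers all 24, and the logit values are made up:

```python
import math

# Hypothetical logits for illustration (the real model emits 24 of them).
logits = [0.5, 2.0, -1.0, 3.2]

# Truncated stand-in for the id2label mapping in the model config.
id2label = {0: "Academic Writing", 1: "Content Listing",
            2: "Creative Writing", 3: "Customer Support"}

# Softmax: exponentiate each logit and normalize so the scores sum to 1.
exps = [math.exp(x) for x in logits]
total = sum(exps)
probs = [e / total for e in exps]

# Argmax over the probabilities, then map the index to its category name.
best = max(range(len(probs)), key=probs.__getitem__)
print(id2label[best])  # -> Customer Support (the largest logit wins)
```

With real model outputs, the same lookup is `model.config.id2label[outputs.logits.argmax(dim=-1).item()]`.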

Efficient Inference

We recommend using the efficient gte-base-en-v1.5 implementation by enabling unpadding and memory-efficient attention. This requires installing xformers (see more here) and loading the model like:

import torch

model = AutoModelForSequenceClassification.from_pretrained(
    "WebOrganizer/FormatClassifier-NoURL",
    trust_remote_code=True,
    unpad_inputs=True,
    use_memory_efficient_attention=True,
    torch_dtype=torch.bfloat16,  # bf16 halves memory use; requires the torch import above
)

Citation

@article{wettig2025organize,
  title={Organize the Web: Constructing Domains Enhances Pre-Training Data Curation},
  author={Alexander Wettig and Kyle Lo and Sewon Min and Hannaneh Hajishirzi and Danqi Chen and Luca Soldaini},
  journal={arXiv preprint arXiv:2502.10341},
  year={2025}
}