File size: 3,596 Bytes
3a87eca
d13a333
 
 
 
 
 
3a87eca
 
d13a333
3a87eca
d13a333
 
 
3a87eca
d13a333
 
 
3a87eca
52961e4
3a87eca
d13a333
3a87eca
d13a333
3a87eca
d13a333
3a87eca
d13a333
 
 
 
 
3a87eca
d5cfd8e
3a87eca
d5cfd8e
 
 
3a87eca
d5cfd8e
3a87eca
d13a333
3a87eca
d13a333
 
 
3a87eca
d13a333
3a87eca
d13a333
 
 
2358172
3a87eca
d13a333
 
 
3a87eca
d13a333
 
 
 
 
 
 
 
 
 
 
 
 
52961e4
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
---
pipeline_tag: text-generation
language:
- multilingual
inference: false
license: cc-by-nc-4.0
library_name: transformers
---

<br><br>

<p align="center">
<img src="https://aeiljuispo.cloudimg.io/v7/https://cdn-uploads.huggingface.co/production/uploads/603763514de52ff951d89793/AFoybzd5lpBQXEBrQHuTt.png?w=200&h=200&f=face" alt="Finetuner logo: Finetuner helps you to create experiments in order to improve embeddings on search tasks. It accompanies you to deliver the last mile of performance-tuning for neural search applications." width="150px">
</p>

<p align="center">
<b>Trained by <a href="https://jina.ai/"><b>Jina AI</b></a>.</b>
</p>

[Blog](https://jina.ai/news/reader-lm-small-language-models-for-cleaning-and-converting-html-to-markdown) | [Colab](https://colab.research.google.com/drive/1wXWyj5hOxEHY6WeHbOwEzYAC0WB1I5uA)

# Intro

Jina Reader-LM is a series of models that convert HTML content to Markdown content, which is useful for content conversion tasks. The model is trained on a curated collection of HTML content and its corresponding Markdown content.

# Models

| Name            |  Context Length   | Download                                                                                                                                          |
|-----------------|-------------------|-----------------------------------------------------------------------|
| reader-lm-0.5b  |  256K             | [🤗 Hugging Face](https://huggingface.co/jinaai/reader-lm-0.5b)       |
| reader-lm-1.5b  |  256K             | [🤗 Hugging Face](https://huggingface.co/jinaai/reader-lm-1.5b)       |
|                 |

# Get Started

## On Google Colab
The easiest way to experience reader-lm is by running [our Colab notebook](https://colab.research.google.com/drive/1wXWyj5hOxEHY6WeHbOwEzYAC0WB1I5uA), 
where we demonstrate how to use reader-lm-1.5b to convert the HackerNews website into markdown. The notebook is optimized to run smoothly on Google Colab’s free T4 GPU tier. You can also load reader-lm-0.5b or change the URL to any website and explore the output. Note that the input (i.e., the prompt) to the model is the raw HTML—no prefix instruction is required.

## Local

To use this model, you need to install `transformers`:

```bash
pip install transformers<=4.43.4
```

Then, you can use the model as follows:

```python
# pip install transformers
from transformers import AutoModelForCausalLM, AutoTokenizer
checkpoint = "jinaai/reader-lm-1.5b"

device = "cuda" # for GPU usage or "cpu" for CPU usage
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint).to(device)

# example html content
html_content = "<html><body><h1>Hello, world!</h1></body></html>"

messages = [{"role": "user", "content": html_content}]
input_text=tokenizer.apply_chat_template(messages, tokenize=False)

print(input_text)

inputs = tokenizer.encode(input_text, return_tensors="pt").to(device)
outputs = model.generate(inputs, max_new_tokens=1024, temperature=0, do_sample=False, repetition_penalty=1.08)

print(tokenizer.decode(outputs[0]))
```

## AWS Sagemaker & Azure Marketplace
[AWS 0.5b](https://aws.amazon.com/marketplace/pp/prodview-nli7b6dueo424?sr=0-1&ref_=beagle&applicationId=AWSMPContessa)
[AWS 1.5b](https://aws.amazon.com/marketplace/pp/prodview-ms27ixcwq3wjk?sr=0-2&ref_=beagle&applicationId=AWSMPContessa)
[Azure 0.5b](https://azuremarketplace.microsoft.com/en-us/marketplace/apps/jinaai.reader-lm-500m)
[Azure 1.5b](https://azuremarketplace.microsoft.com/en-us/marketplace/apps/jinaai.reader-lm-1500m)