Built with Axolotl

See axolotl config

axolotl version: 0.4.1

base_model: meta-llama/Meta-Llama-3-8B
model_type: AutoModelForCausalLM
tokenizer_type: AutoTokenizer

load_in_8bit: false
load_in_4bit: true
strict: false

lora_fan_in_fan_out: false
data_seed: 49
seed: 49

datasets:
  - path: ft_data/alpaca_data.jsonl
    type: alpaca
dataset_prepared_path: last_run_prepared
val_set_size: 0.1
output_dir: ./qlora-alpaca-out
hub_model_id: pbevan11/llama-3-8b-ocr-correction

adapter: qlora
lora_model_dir:

sequence_len: 4096
sample_packing: true
pad_to_sequence_len: true

lora_r: 32
lora_alpha: 16
lora_dropout: 0.05
lora_target_linear: true
lora_fan_in_fan_out:
lora_target_modules:

wandb_project: ocr-ft
wandb_entity: sncds
wandb_name: test

gradient_accumulation_steps: 4
micro_batch_size: 2 # was 16
eval_batch_size: 2 # was 16
num_epochs: 3
optimizer: paged_adamw_32bit
lr_scheduler: cosine
learning_rate: 0.0002

train_on_inputs: false
group_by_length: false
bf16: auto
fp16:
tf32: false

gradient_checkpointing: true
early_stopping_patience:
resume_from_checkpoint:
local_rank:
logging_steps: 1
xformers_attention:
flash_attention: true

loss_watchdog_threshold: 5.0
loss_watchdog_patience: 3

warmup_steps: 10
evals_per_epoch: 4
eval_table_size:
saves_per_epoch: 1
debug:
deepspeed:
weight_decay: 0.0
fsdp:
fsdp_config:
special_tokens:
  pad_token: "<|end_of_text|>"

llama-3-8b-ocr-correction

This model is a QLoRA fine-tuned adapter for meta-llama/Meta-Llama-3-8B, trained on the pbevan11/synthetic-ocr-correction-gpt4o dataset. It achieves the following results on the evaluation set:

  • Loss: 0.1778

Usage

First, load the model and tokenizer:

from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer

model_id = 'pbevan11/llama-3-8b-ocr-correction'
# Loads the base model and applies the adapter weights; .cuda() requires a GPU.
model = AutoPeftModelForCausalLM.from_pretrained(model_id).cuda()
tokenizer = AutoTokenizer.from_pretrained(model_id)
# Reuse the end-of-sequence token for padding.
tokenizer.pad_token = tokenizer.eos_token

Then, construct the prompt template like so:

def prompt(instruction, inp):
    return f"""Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{instruction}

### Input:
{inp}

### Response:
"""

def prompt_tok(instruction, inp, return_ids=False):
    _p = prompt(instruction, inp)
    input_ids = tokenizer(_p, return_tensors="pt", truncation=True).input_ids.cuda()
    # Greedy decoding; max_new_tokens is an upper bound on the response length.
    out_ids = model.generate(input_ids=input_ids, max_new_tokens=5000,
                             do_sample=False)
    if return_ids:
        return out_ids

    # Decode prompt + completion, then keep only the text after the response marker.
    ids = out_ids.detach().cpu().numpy()
    full_output = tokenizer.batch_decode(ids, skip_special_tokens=True)[0]
    response_start = full_output.find("### Response:")
    if response_start != -1:
        return full_output[response_start + len("### Response:"):]
    # Fallback: drop the prompt prefix if the marker is missing.
    return full_output[len(_p):]
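
The response-extraction step in prompt_tok can be checked without loading the model. The sketch below isolates that logic in a hypothetical helper, extract_response (not part of the original code), and runs it on a hand-written string that stands in for decoded model output.

```python
RESPONSE_MARKER = "### Response:"


def extract_response(full_output: str, prompt_text: str) -> str:
    """Return the text after the response marker, else drop the prompt prefix."""
    start = full_output.find(RESPONSE_MARKER)
    if start != -1:
        return full_output[start + len(RESPONSE_MARKER):]
    return full_output[len(prompt_text):]


# Hand-written stand-in for a decoded generation (assumption, not real model output).
prompt_text = (
    "### Instruction:\nFix the OCR errors.\n\n"
    "### Input:\nHe11o wor1d\n\n"
    "### Response:\n"
)
decoded = prompt_text + "Hello world"

print(extract_response(decoded, prompt_text).strip())
```

Because find returns the first match, an input text that itself contains the marker string would cause the extraction to start too early; that edge case may be worth guarding against for untrusted inputs.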

Finally, you can get predictions like this:

# model inputs
instruction = "You are an assistant that takes a piece of text that has been corrupted during OCR digitisation, and produce a corrected version of the same text."
inp = "Do Not Kule Oi't hy.er-l'rieed AjijqIi: imac - Analyst (fteuiers) Hcuiers - A | ) | ilf, <;/) in |) nter |iic . conic! deeiilf. l.o sell n lower-|)rieofl wersinn oi its Macintosh cornutor to nttinct ronsnnu-rs already euami'red ot its iPod music jiayo-r untl annoyoil. by sccnrit.y problems ivitJi Willtlows PCs , Piper.iaffray analyst. (Jcne Muster <aid on Tlinrtiday."

# print prediction
out = prompt_tok(instruction, inp)
print(out.replace('\\', ' ').strip())

This will give you a prediction that looks like this:

"Do Not Rule Out Lower-Priced Mac - Analyst (Reuters) Reuters - Apple Inc.  may be considering a lower-priced version of its Macintosh computer to attract consumers already enamored of its iPod music player and annoyed by security problems with Windows PCs, PiperJaffray analyst Gene Munster said on Thursday."

Alternatively, you can play with this model on Replicate: https://replicate.com/pbevan1/llama-3.1-8b-ocr-correction

Intended uses & limitations

This model is intended for restoring historical documents that were imperfectly digitised using OCR.

Reconstructions should not be taken as ground truth. The model is likely to invent text to fill gaps, so its output may not be perfectly historically accurate.
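
One way to act on this limitation is to flag corrections that diverge heavily from the OCR input for manual review. The following is a minimal sketch using Python's standard difflib module; the function names and the 0.5 threshold are illustrative assumptions, not part of the model or its training setup. A low similarity score only suggests, rather than proves, that text was invented.

```python
from difflib import SequenceMatcher


def similarity(ocr_text: str, corrected: str) -> float:
    """Character-level similarity in [0, 1] between OCR input and model output."""
    return SequenceMatcher(None, ocr_text.lower(), corrected.lower()).ratio()


def flag_for_review(ocr_text: str, corrected: str, threshold: float = 0.5) -> bool:
    """True when the correction shares little text with its OCR source."""
    return similarity(ocr_text, corrected) < threshold


ocr = "Tlie qnick brovvn fox"
plausible = "The quick brown fox"
suspicious = ("Completely unrelated sentence about economics, weather, "
              "and nineteenth-century railways")

print(flag_for_review(ocr, plausible))
print(flag_for_review(ocr, suspicious))
```

Heavily corrupted inputs legitimately score low against a faithful correction, so the threshold would need tuning on representative documents before being relied on.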

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 0.0002
  • train_batch_size: 2
  • eval_batch_size: 2
  • seed: 49
  • gradient_accumulation_steps: 4
  • total_train_batch_size: 8
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: cosine
  • lr_scheduler_warmup_steps: 10
  • num_epochs: 3
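
As a rough illustration of how these values interact, the sketch below recomputes the total train batch size and evaluates a linear-warmup, cosine-decay learning-rate curve. The single-device count and the total of 172 optimizer steps are assumptions (the latter estimated from the logged epoch and step values), and the schedule formula is a common formulation rather than a copy of the trainer's code.

```python
import math

MICRO_BATCH_SIZE = 2   # train_batch_size (per device)
GRAD_ACCUM_STEPS = 4   # gradient_accumulation_steps
NUM_DEVICES = 1        # assumption: consistent with total_train_batch_size = 8

PEAK_LR = 2e-4         # learning_rate
WARMUP_STEPS = 10      # lr_scheduler_warmup_steps
TOTAL_STEPS = 172      # assumption: estimated from the logged epoch/step values


def total_train_batch_size() -> int:
    """Sequences contributing to each optimizer step."""
    return MICRO_BATCH_SIZE * GRAD_ACCUM_STEPS * NUM_DEVICES


def lr_at(step: int) -> float:
    """Linear warmup to PEAK_LR, then half-cosine decay to zero."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))


print("total_train_batch_size:", total_train_batch_size())
for step in (0, 5, 10, 91, 172):
    print(step, f"{lr_at(step):.2e}")
```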

Training results

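| Training Loss | Epoch  | Step | Validation Loss |
|---------------|--------|------|-----------------|
| 0.5646        | 0.0174 | 1    | 0.6286          |
| 0.3257        | 0.2609 | 15   | 0.2889          |
| 0.2285        | 0.5217 | 30   | 0.2171          |
| 0.1727        | 0.7826 | 45   | 0.1910          |
| 0.1497        | 1.0174 | 60   | 0.1792          |
| 0.1545        | 1.2783 | 75   | 0.1758          |
| 0.1317        | 1.5391 | 90   | 0.1738          |
| 0.1256        | 1.8    | 105  | 0.1699          |
| 0.0941        | 2.0348 | 120  | 0.1676          |
| 0.0723        | 2.2957 | 135  | 0.1783          |
| 0.07          | 2.5565 | 150  | 0.1779          |
| 0.073         | 2.8174 | 165  | 0.1778          |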
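
In these results, validation loss reaches its minimum partway through training and then rises slightly while training loss keeps falling. The sketch below finds the evaluation point with the lowest validation loss from the logged rows. The variable names are illustrative, and this card does not state whether a checkpoint was saved at that step.

```python
# (training_loss, epoch, step, validation_loss) rows copied from the training results
ROWS = [
    (0.5646, 0.0174, 1, 0.6286),
    (0.3257, 0.2609, 15, 0.2889),
    (0.2285, 0.5217, 30, 0.2171),
    (0.1727, 0.7826, 45, 0.1910),
    (0.1497, 1.0174, 60, 0.1792),
    (0.1545, 1.2783, 75, 0.1758),
    (0.1317, 1.5391, 90, 0.1738),
    (0.1256, 1.8, 105, 0.1699),
    (0.0941, 2.0348, 120, 0.1676),
    (0.0723, 2.2957, 135, 0.1783),
    (0.07, 2.5565, 150, 0.1779),
    (0.073, 2.8174, 165, 0.1778),
]

best = min(ROWS, key=lambda row: row[3])   # lowest validation loss
final = ROWS[-1]                           # last logged evaluation

print(f"best:  step {best[2]} (epoch {best[1]}), validation loss {best[3]}")
print(f"final: step {final[2]} (epoch {final[1]}), validation loss {final[3]}")
```

The final logged value matches the evaluation loss reported at the top of this card, which is therefore the last evaluation rather than the lowest one.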

Framework versions

  • PEFT 0.11.1
  • Transformers 4.42.3
  • Pytorch 2.1.2+cu118
  • Datasets 2.19.1
  • Tokenizers 0.19.1

Citation:

@misc {peter_j._bevan_2024,
    author       = { {Peter J. Bevan} },
    title        = { llama-3-8b-ocr-correction (Revision d4e6e75) },
    year         = 2024,
    url          = { https://huggingface.co/pbevan11/llama-3-8b-ocr-correction },
    doi          = { 10.57967/hf/2790 },
    publisher    = { Hugging Face }
}