## Introduction

This model is a quantized version of Universal-NER/UniNER-7B-all.

## Quantization

The quantization was applied using LLM Compressor with 512 random examples from the Universal-NER/Pile-NER-definition dataset.

The recipe for quantization:

```python
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.modifiers.smoothquant import SmoothQuantModifier

recipe = [
    SmoothQuantModifier(smoothing_strength=0.8),
    GPTQModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"]),
]
```
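
For reference, below is a minimal sketch of how such a recipe can be applied with LLM Compressor's `oneshot` entry point. The import path and argument names follow current LLM Compressor examples and may differ between versions; `max_seq_length` and `output_dir` here are illustrative assumptions, not the exact values we used.

```python
from llmcompressor.transformers import oneshot

# One-shot calibration + quantization run (sketch; exact API may vary by version)
oneshot(
    model="Universal-NER/UniNER-7B-all",
    dataset="Universal-NER/Pile-NER-definition",  # calibration dataset
    recipe=recipe,
    num_calibration_samples=512,  # 512 random examples, as noted above
    max_seq_length=2048,          # assumed value
    output_dir="UniNER-W4A16",    # assumed output location
)
```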

## Inference

We added a chat template to the tokenizer, so the model can be used directly with vLLM without any extra preprocessing compared to the original model.

Example:

```python
import json

from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

# Load the quantized model and its tokenizer
llm = LLM(model="daisd-ai/UniNER-W4A16")
tokenizer = AutoTokenizer.from_pretrained("daisd-ai/UniNER-W4A16")
sampling_params = SamplingParams(temperature=0, max_tokens=256)

# Define the text and the entity types to extract
text = "Some long text with multiple entities"
entities_types = ["entity type 1", "entity type 2"]

# Build one prompt per entity type using the chat template
prompts = []
for entity_type in entities_types:
    messages = [
        {
            "role": "user",
            "content": f"Text: {text}",
        },
        {"role": "assistant", "content": "I've read this text."},
        {"role": "user", "content": f"What describes {entity_type} in the text?"},
    ]
    prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    prompts.append(prompt)

# Run inference
outputs = llm.generate(prompts, sampling_params)
outputs = [output.outputs[0].text for output in outputs]

# Results are returned as JSON arrays; parse each into a Python list
results = []
for lst in outputs:
    try:
        entities = list(set(json.loads(lst)))
    except Exception:
        entities = []

    results.append(entities)
```
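
Each entry in `results` lines up with the entity type at the same index in `entities_types`, so the two lists can be zipped into a mapping if that is more convenient:

```python
# Map each entity type to the entities extracted for it
extracted = dict(zip(entities_types, results))
```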