Bonito (Chinese-language version)

Bonito is an open-source model for conditional task generation: the task of converting unannotated text into task-specific training datasets for instruction tuning. This repo is a lightweight library for Bonito to easily create synthetic datasets built on top of the Hugging Face transformers and vllm libraries.

This version supports the Chinese language

Because of limitations in the training data, this version supports only three task types:

  • 🐠 1. Question generation
  • 🐡 2. Multiple-choice question answering
  • 🐟 3. Question answering without choices
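
Since only these three task types are supported, it can help to validate the requested type before starting a generation run. The sketch below is a hypothetical helper, not part of the Bonito library. The short code "mcqa" appears in the usage example in this card; the codes "qg" and "qa" are assumed from the upstream Bonito library and should be checked against the version you install.

```python
# Short task-type codes. "mcqa" appears in this model card's usage example;
# "qg" and "qa" are assumptions based on the upstream Bonito library.
SUPPORTED_TASK_TYPES = {
    "qg": "question generation",
    "mcqa": "multiple-choice question answering",
    "qa": "question answering without choices",
}


def resolve_task_type(name: str) -> str:
    """Return the short code for a supported task type, or raise ValueError."""
    key = name.strip().lower()
    # Accept the short code directly.
    if key in SUPPORTED_TASK_TYPES:
        return key
    # Otherwise accept the full task name.
    for code, full_name in SUPPORTED_TASK_TYPES.items():
        if key == full_name:
            return code
    supported = ", ".join(sorted(SUPPORTED_TASK_TYPES))
    raise ValueError(f"Unsupported task type {name!r}; choose one of: {supported}")
```

The returned code could then be passed as the task_type argument, so an unsupported type fails early instead of after the model has loaded.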

Google Colab: Demo

Basic Usage

To generate a synthetic instruction-tuning dataset with Bonito, first install the library:
pip3 install bonito-llm

Then run the following code:

from pprint import pprint

from datasets import Dataset
from transformers import set_seed
from vllm import SamplingParams

from bonito import Bonito

# Unannotated Chinese paragraph to convert into a training task.
unannotated_paragraph = """灌区以往的闸门控制系统在实际应用过程中普遍以人工操作为主,容易受到多种因素的影响,不可避免出现较多缺陷。如操作人员自身的综合能力、业务水平、工作态度等对工作质量和效率产生较大影响;工作人员实践操作中遇到极端气候、工作环境恶劣等问题,大大增加了工作难度,并存在较多安全隐患。"""
pprint(unannotated_paragraph)

# Load the Chinese Bonito model, then fix the random seed for reproducibility.
bonito = Bonito("kitsdk/bonito-chinese-v1")
set_seed(2)


def convert_to_dataset(text):
    # Wrap the raw text in a one-row dataset with an "input" column.
    return Dataset.from_list([{"input": text}])


# Sampling settings: one completion of at most 256 tokens.
sampling_params = SamplingParams(max_tokens=256, top_p=0.95, temperature=0.5, n=1)

# Generate a multiple-choice question answering ("mcqa") task from the paragraph.
synthetic_dataset = bonito.generate_tasks(
    convert_to_dataset(unannotated_paragraph),
    context_col="input",
    task_type="mcqa",
    sampling_params=sampling_params,
)
pprint("----Generated Instructions----")
pprint(f'Input: {synthetic_dataset[0]["input"]}')
pprint(f'Output: {synthetic_dataset[0]["output"]}')
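
The generated rows expose "input" and "output" fields, as the example above shows. To use them for instruction tuning, you might want to write them out as JSON Lines. The following is a minimal sketch under stated assumptions: the helper name to_jsonl_lines and the record keys "instruction" and "response" are hypothetical, and the example rows are made up for illustration rather than real model output.

```python
import json


def to_jsonl_lines(rows):
    """Convert rows with "input" and "output" keys into JSONL strings."""
    lines = []
    for row in rows:
        record = {
            "instruction": row["input"].strip(),
            "response": row["output"].strip(),
        }
        # Skip rows where either field is empty after stripping whitespace.
        if not record["instruction"] or not record["response"]:
            continue
        # ensure_ascii=False keeps Chinese characters readable in the output.
        lines.append(json.dumps(record, ensure_ascii=False))
    return lines


# Illustrative rows only (not real model output).
example_rows = [
    {"input": "问题:闸门控制系统以什么操作为主?", "output": "人工操作"},
    {"input": "空输出示例", "output": "  "},
]
for line in to_jsonl_lines(example_rows):
    print(line)
```

Because iterating over a Hugging Face Dataset yields one dict per row, calling to_jsonl_lines(synthetic_dataset) should work without further conversion.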
Model size: 3.09B params
Tensor type: FP16
Format: Safetensors

Model tree for kitsdk/bonito-chinese-v1

Base model: Qwen/Qwen2.5-3B
Finetuned: this model