1 7 15

Yiyang Nan

nanyy1025

AI & ML interests

None yet

Recent Activity

liked a dataset about 2 months ago

C4AI-Community/multilingual-reward-bench

liked a dataset about 2 months ago

CohereForAI/include-base-44

liked a dataset about 2 months ago

CohereForAI/Global-MMLU

View all activity

Organizations

nanyy1025's activity

liked 3 datasets about 2 months ago

liked 2 models 3 months ago

CohereForAI/aya-expanse-32b

Text Generation • Updated Dec 6, 2024 • 23.1k • 205

CohereForAI/aya-expanse-8b

Text Generation • Updated Dec 6, 2024 • 49.9k • 328

reacted to Taylor658's post with 🚀❤️🔥👀 4 months ago

Post

2521

Spent the weekend testing out some prompts with 🕵️‍♂️Mystery Bot🕵️‍♂️ on my mobile... exciting things are coming soon for the following languages:

🌐Arabic, Chinese, Czech, Dutch, English French, German, Greek, Hebrew, Hindi, Indonesian, Italian, Japanese, Korean, Persian, Polish, Portuguese, Romanian, Russian, Spanish, Turkish, Ukrainian, and Vietnamese!🌐

liked a model 6 months ago

BatsResearch/Llama-3.1-8B-bonito-v1

Text Generation • Updated Aug 13, 2024 • 209 • 5

upvoted a collection 6 months ago

Bonito

Collection

Models and datasets from the Bonito paper (https://arxiv.org/abs/2402.18334) • 8 items • Updated Oct 1, 2024 • 1

upvoted an article 7 months ago

Article

How NuminaMath Won the 1st AIMO Progress Prize

Jul 11, 2024

• 112

upvoted a paper 7 months ago

Planetarium: A Rigorous Benchmark for Translating Text to Structured Planning Languages

Paper • 2407.03321 • Published Jul 3, 2024 • 16

liked a Space 7 months ago

Instruction Synthesizer

🐠

Generate instruction-response pairs from text

upvoted a paper 7 months ago

Preference Tuning For Toxicity Mitigation Generalizes Across Languages

Paper • 2406.16235 • Published Jun 23, 2024 • 11

liked a Space 8 months ago

Bonito

💬

Generate task-specific instructions and responses from text

reacted to davanstrien's post with ❤️ 8 months ago

Post

2318

Several methods/models have recently been shared to generate synthetic data from minimal or no initial seeds, essentially creating data directly from raw text.

IMO, these approaches that rely on smaller models for synthetic data generation are quite valuable for scaling up synthetic data and democratizing access to creating domain-specific synthetic datasets.

I've compiled a collection of Gradio demos showcasing some of these methods here: davanstrien/synthetic-data-generation-demos-667573f248b97360ff3668a5

5 replies

liked a dataset 8 months ago

tals/vitaminc

Viewer • Updated Jul 1, 2022 • 489k • 254 • 8

reacted to macadeliccc's post with ❤️ 8 months ago

Post

Create synthetic instruction datasets using open source LLM's and bonito🐟!

With Bonito, you can generate synthetic datasets for a wide variety of supported tasks.

The Bonito model introduces a novel approach for conditional task generation, transforming unannotated text into task-specific training datasets to facilitate zero-shot adaptation of large language models on specialized data.

This methodology not only improves the adaptability of LLMs to new domains but also showcases the effectiveness of synthetic instruction tuning datasets in achieving substantial performance gains.

AutoBonito🐟: https://colab.research.google.com/drive/1l9zh_VX0X4ylbzpGckCjH5yEflFsLW04?usp=sharing
Original Repo: https://github.com/BatsResearch/bonito?tab=readme-ov-file
Paper: Learning to Generate Instruction Tuning Datasets for Zero-Shot Task Adaptation (2402.18334)

2 replies

upvoted a paper 9 months ago

LexC-Gen: Generating Data for Extremely Low-Resource Languages with Large Language Models and Bilingual Lexicons

Paper • 2402.14086 • Published Feb 21, 2024 • 9