Introducing 📐𝐅𝐢𝐧𝐞𝐌𝐚𝐭𝐡: the best public math pre-training dataset with 50B+ tokens! HuggingFaceTB/finemath
Math remains challenging for LLMs, and by training on FineMath we see considerable gains over other math datasets, especially on GSM8K and MATH.
We build the dataset by: 🛠️ carefully extracting math data from Common Crawl; 🔎 iteratively filtering and recalling high-quality math pages using a classifier trained on synthetic annotations to identify mathematical reasoning and deduction.
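The filter-and-recall idea can be sketched as a loop that scores pages and keeps the ones above a threshold. The `score_math_quality` heuristic and the 0–5 scale below are stand-ins for the actual classifier, which isn't specified here:

```python
# Sketch of an iterative filter-and-recall loop over crawled pages.
# `score_math_quality` stands in for the real classifier trained on
# synthetic annotations; here it is a trivial keyword heuristic.

def score_math_quality(text: str) -> int:
    """Toy stand-in: return a 0-5 'math reasoning' score."""
    signals = ("prove", "theorem", "equation", "solve", "=")
    return min(5, sum(s in text.lower() for s in signals))

def filter_and_recall(pages, threshold=3, rounds=2):
    """Keep pages scoring >= threshold. In a real pipeline, each round
    would retrain the classifier on the kept pages and re-score the
    remaining crawl to recall math pages missed earlier."""
    kept = [p for p in pages if score_math_quality(p) >= threshold]
    for _ in range(rounds - 1):
        # Placeholder for retraining + re-scoring the full crawl.
        kept = [p for p in pages if score_math_quality(p) >= threshold]
    return kept

pages = [
    "Solve the equation 2x + 3 = 7, then prove the result is unique.",
    "Top 10 celebrity diets this summer.",
]
print(filter_and_recall(pages))  # keeps only the math page
```

The real pipeline's value comes from the retraining step each round, which the stub above only marks with a comment.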
We conducted a series of ablations comparing the performance of Llama-3.2-3B-Base after continued pre-training on FineMath and observed notable gains over the baseline model and other public math datasets.
We hope this helps advance the performance of LLMs on math and reasoning! 🚀 We’re also releasing all the ablation models as well as the evaluation code.
🌟 Progress in the German FineWeb edu reproduction 🌟
We're delighted to share the launch of our new Data Quality Classification Model, designed specifically for evaluating educational content in German. This tool uses advanced machine learning techniques to assess texts across all educational levels, from primary school to university.
🔍 Inspired by Hugging Face's FineWeb-Edu dataset, we've worked hard to refine our data classification methods, ensuring educators and learners can access top-quality resources. We're excited about the future as we continue improving our models and expanding our datasets.
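A FineWeb-Edu-style classifier emits an educational-quality score that is then rounded and thresholded to decide which documents to keep. A minimal sketch of that post-processing step (the 0–5 scale and threshold of 3 follow the original FineWeb-Edu recipe; the raw regression output is simulated here):

```python
# Post-processing for a FineWeb-Edu-style educational-quality classifier.
# In the real pipeline the raw score comes from a regression head on a
# transformer encoder; here we only show the rounding + thresholding.

def to_int_score(raw: float) -> int:
    """Clamp a raw regression output to the 0-5 integer scale."""
    return max(0, min(5, round(raw)))

def keep_for_pretraining(raw: float, threshold: int = 3) -> bool:
    """FineWeb-Edu keeps documents whose integer score >= threshold."""
    return to_int_score(raw) >= threshold

print(to_int_score(3.7), keep_for_pretraining(3.7))  # 4 True
print(to_int_score(1.2), keep_for_pretraining(1.2))  # 1 False
```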
🙏 A huge thank you to David and Daryoush from Vago Solutions; Björn and Jan from Ellamind / DiscoResearch for their expert insights throughout this project. Your support has been crucial. This project was made possible by the support of PrimeLine AI.
The model was converted from Mistral Instruct v0.2 using a novel technique called trans-tokenization. As a result, the model uses a brand-new tokenizer, fully tailored for the Tatar language.
We also release a model that can be fine-tuned for translating English or Russian into Tatar, achieving performance similar to commercial offerings:
Unlocking the Power of locally running Llama-3 8B Model Agents with Chat-UI! 🔥🚀✨
I'm thrilled to share my hackathon-style side project:
1. Fine-tuned Llama-3 8B for function calling using PEFT QLoRA, since the instruct Llama-3 model doesn't support it. The Colab notebook is here: https://lnkd.in/ggJMzqh2 🛠️
2. Released the fine-tuned model along with the 4-bit quants here: https://lnkd.in/gNpFKY6V ✨
3. Cloned Hugging Face Chat-UI (https://lnkd.in/gKBKuUBQ) and made it compatible with function calling by building on the PR https://lnkd.in/gnqFuAd4, for my model and a local-inference use case using Ollama. This was a steep learning curve; I stayed awake the whole night to get it working. 💪🏽
4. Used SerpAPI for web browsing and the MongoDB Atlas free tier to persist conversations and assistant configs. 🔎
5. More work is needed on switching between using tools and responding directly; this is where I see the model break. 🧐
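Function calling hinges on a chat format the model is fine-tuned to emit: a tool schema in context, and a structured tool call as the assistant turn. A hedged sketch of what one training example might look like — the exact schema, role names, and JSON layout depend on the fine-tune, so everything below is illustrative, not the actual format used:

```python
import json

# Illustrative function-calling training example: a tool schema, a user
# turn, and the structured tool call the model learns to emit.

tool_schema = {
    "name": "web_search",
    "description": "Search the web via SerpAPI.",
    "parameters": {
        "type": "object",
        "properties": {"query": {"type": "string"}},
        "required": ["query"],
    },
}

example = [
    {"role": "system", "content": "Tools: " + json.dumps([tool_schema])},
    {"role": "user", "content": "What's the weather in Paris today?"},
    {"role": "assistant",
     "content": json.dumps({"tool": "web_search",
                            "arguments": {"query": "weather Paris today"}})},
]

# The agent loop parses the assistant turn back into an executable call:
call = json.loads(example[-1]["content"])
print(call["tool"], call["arguments"]["query"])
# web_search weather Paris today
```

The hard part noted in point 5 is exactly this parse step: the model must reliably decide between emitting valid tool-call JSON and answering in plain text.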
How cool is it that we're approaching a ChatGPT-like experience with a locally hosted agent model running on your own laptop! 💻