Parallia/Fairly-Multilingual-ModernBERT-Token-Alignment
TL;DR:
- public storage is free and, barring blatant abuse, unlimited. We do ask that you consider upgrading to PRO and/or Enterprise Hub if possible
- private storage is paid above a significant free tier (1TB if you have a paid account, 100GB otherwise)
docs: https://huggingface.co/docs/hub/storage-limits
We continuously optimize our infrastructure to scale our storage for the coming years of growth in machine learning, to the benefit of the community 🔥
cc: @reach-vb @pierric @victor and the HF team
Not from scratch, as our technique preserves most of the model weights. But you do have to continue pre-training to get most of the benefits, yes. You can read more about it in our preview paper.
We are in the process of releasing a library to replicate this easily, but it is not ready to share yet.
To celebrate, we release our new language model, Tweety Tatar 🐣
https://huggingface.co/Tweeties/tweety-tatar-base-7b-2024-v1
The model was converted from Mistral Instruct v0.2 using a novel technique called trans-tokenization. As a result, the model uses a brand-new tokenizer, fully tailored for the Tatar language.
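For reference, here is a minimal loading sketch with 🤗 Transformers; the Tatar prompt and generation settings are purely illustrative, and `device_map="auto"` assumes `accelerate` is installed:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Tweety Tatar ships with its own Tatar-specific tokenizer,
# so both pieces must come from the same repository.
repo = "Tweeties/tweety-tatar-base-7b-2024-v1"
tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(repo, device_map="auto")

inputs = tokenizer("Сәлам!", return_tensors="pt").to(model.device)  # "Hello!"
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```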
We also release a model that can be finetuned for translation of English or Russian into Tatar, and achieves performance similar to commercial offerings:
https://huggingface.co/Tweeties/tweety-tatar-hydra-base-7b-2024-v1
More details in our upcoming paper 👀
François REMY, Pieter Delobelle, Alfiya Khabibullina
Татар теле көне белән! (Happy Tatar Language Day!)
A hard part of building AI applications is choosing which model to use. What if we don’t have to? What if we can predict the best model for any prompt?
Predictive human preference aims to predict which model users might prefer for a specific query.
https://huyenchip.com/2024/02/28/predictive-human-preference.html
One use case is model routing. If we know in advance that for a prompt, users will prefer Claude Instant’s response over GPT-4, and Claude Instant is cheaper/faster than GPT-4, we can route this prompt to Claude Instant. Model routing has the potential to increase response quality while reducing costs and latency.
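As a rough illustration of this routing idea, here is a minimal sketch; `predict_strong_preference` stands in for whatever preference predictor you train, and the model names and threshold are placeholders, not part of the article:

```python
# Minimal model-routing sketch: send a prompt to the cheaper model
# unless the preference predictor expects users to favor the stronger one.

STRONG_MODEL = "gpt-4"           # placeholder names
CHEAP_MODEL = "claude-instant"
THRESHOLD = 0.6                  # tune on held-out preference data

def predict_strong_preference(prompt: str) -> float:
    """Hypothetical predictor: probability that users prefer the strong
    model's answer over the cheap model's answer for this prompt."""
    raise NotImplementedError  # e.g. a small classifier over prompt features

def route(prompt: str) -> str:
    p = predict_strong_preference(prompt)
    return STRONG_MODEL if p >= THRESHOLD else CHEAP_MODEL

# route("hello, how are you?")         -> likely the cheap model
# route("Explain why Planck length …") -> likely the strong model
```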
One pattern is that for simple prompts, weak models can do (nearly) as well as strong models. For more challenging prompts, however, users are more likely to prefer stronger models. Here’s a visualization of predicted human preference for an easy prompt (“hello, how are you?”) and a challenging prompt (“Explain why Planck length …”).
Preference predictors make it possible to create leaderboards unique to any prompt and domain.
📰 Read our article in the Journal of the American Medical Informatics Association:
https://academic.oup.com/jamia/advance-article/doi/10.1093/jamia/ocae029/7614965
📝 TL;DR:
BioLORD-2023 is a series of semantic language models for the biomedical domain, capable of representing clinical concepts and sentences in a semantic space aligned with human preferences. Our new multilingual version supports 50+ languages and is further finetuned on 7 European languages.
These models were trained contrastively and through distillation, using a corpus that unifies the names of biomedical concepts and their descriptions in the same latent space. For concepts that didn't have a human-written description in UMLS, we used the information contained in the SnomedCT knowledge graph and the capabilities of ChatGPT to generate synthetic data and improve our results.
🤗 Access our models from the Hugging Face hub, including the new 2023-C and 2023-S variants (a short usage sketch follows the list):
FremyCompany/BioLORD-2023
FremyCompany/BioLORD-2023-M
FremyCompany/BioLORD-2023-S
FremyCompany/BioLORD-2023-C
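A minimal usage sketch with `sentence-transformers`, assuming the checkpoints load as standard SentenceTransformer models (check each model card for the exact recommended usage):

```python
from sentence_transformers import SentenceTransformer, util

# Load one of the BioLORD-2023 checkpoints from the Hugging Face hub.
model = SentenceTransformer("FremyCompany/BioLORD-2023")

# Embed a few clinical expressions and compare them in the semantic space.
sentences = ["heart attack", "myocardial infarction", "fractured wrist"]
embeddings = model.encode(sentences, normalize_embeddings=True)

print(util.cos_sim(embeddings[0], embeddings[1]))  # synonyms -> high similarity
print(util.cos_sim(embeddings[0], embeddings[2]))  # unrelated -> lower similarity
```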
Internally labeled as X+EN, where X represents the target language and EN stays fixed, these models specialize in both monolingual tasks and cross-lingual retrieval tasks, crossing from X to EN. You can find these models on Hugging Face (a minimal loading sketch follows the list):
1. German-English bilingual embedding: jinaai/jina-embeddings-v2-base-de
2. Chinese-English bilingual embedding: jinaai/jina-embeddings-v2-base-zh
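A minimal loading sketch, assuming the bilingual checkpoints follow the same remote-code `encode` interface as the earlier jina-embeddings-v2 releases (see the model cards for the authoritative usage):

```python
import numpy as np
from transformers import AutoModel

# trust_remote_code=True is needed because the jina-embeddings-v2 models
# ship their own modeling code.
model = AutoModel.from_pretrained(
    "jinaai/jina-embeddings-v2-base-de", trust_remote_code=True
)

# Cross-lingual retrieval: German query vs. English documents.
query = "Wie ist das Wetter heute?"
docs = ["The weather is lovely today.", "The stock market dropped sharply."]
embeddings = model.encode([query] + docs)  # one vector per input

# Cosine similarity after normalizing the vectors.
vecs = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
print(vecs[0] @ vecs[1:].T)  # higher score for the semantically matching document
```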
We're also excited to announce that a Spanish bilingual embedding will be released in approximately two weeks.
Our evaluation across various MLM tasks has demonstrated that the bilingual backbone consistently outperforms state-of-the-art multilingual backbones like XLM-RoBERTa, thanks to its focus on just two languages.
Despite being three times smaller than the leading multilingual model (e5-multilingual-large), our released bilingual embedding models outperform it in both monolingual and cross-lingual search tasks.
Currently, we're putting the finishing touches on the technical report, which should be available on arXiv by next week.
Looking ahead, the embedding team is gearing up for jina-embeddings-v3, with some initial groundwork already underway. Stay tuned for more updates!
(My other thought is that you should increase the KL-divergence penalty if your DPO model diverges too much from your initial model, but I think making the negative examples better is a stronger first step to take.)
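If you happen to be using TRL's DPO trainer (not mentioned in the original thread, just a common choice), that penalty is controlled by the `beta` parameter; a minimal sketch, where 0.3 is an illustrative value rather than a recommendation:

```python
from trl import DPOConfig

# Larger beta = stronger KL-style penalty, keeping the DPO policy
# closer to the reference model (TRL's default is 0.1).
config = DPOConfig(output_dir="dpo-out", beta=0.3)
# ... then pass `args=config` to DPOTrainer as usual.
```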
Not an expert, but I think you should create your negative examples in such a way that the first few tokens are not enough to differentiate between good and bad.
One easy way to do this would be to first sample the GPT-4 examples, then keep the first n tokens (with n sampled uniformly between 0 and the length of the answer), then generate the rest of the answer with the other (worse) model.
That way, the DPO model cannot just ignore every token after the first few, because the branching can happen at any point.
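A minimal sketch of that construction, assuming you already have the prompt, the strong (e.g. GPT-4) answer as text, and a weaker causal LM loaded with 🤗 Transformers; all names here are illustrative:

```python
import random

def make_rejected(prompt, chosen_answer, weak_model, tokenizer, max_new_tokens=256):
    """Build a 'rejected' answer that shares a random-length prefix with the
    'chosen' answer, then diverges via the weaker model's continuation."""
    answer_ids = tokenizer(chosen_answer, add_special_tokens=False).input_ids

    # Keep the first n tokens of the good answer, n sampled from 0..len(answer).
    n = random.randint(0, len(answer_ids))
    shared_prefix = tokenizer.decode(answer_ids[:n])

    # Let the weaker model continue from prompt + shared prefix.
    inputs = tokenizer(prompt + shared_prefix, return_tensors="pt").to(weak_model.device)
    output = weak_model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=True)
    continuation = tokenizer.decode(
        output[0, inputs.input_ids.shape[1]:], skip_special_tokens=True
    )

    # The DPO pair becomes (chosen_answer, shared_prefix + continuation),
    # so the divergence point varies across the dataset.
    return shared_prefix + continuation
```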