François Remy

FremyCompany

AI & ML interests

NLP; Clinical NLP; Medical NLP; EHR; Web development;

Organizations

AZ Delta R&D (RADar), Spaces-explorers, Speech Recognition Community Event Version 2, Tweeties in a Tweety World, Social Post Explorers, Hugging Face Discord Community, Parallia

FremyCompany's activity

posted an update about 22 hours ago
reacted to julien-c's post with 🤗 about 1 month ago
After some heated discussion 🔥, we clarify our intent re. storage limits on the Hub

TL;DR:
- public storage is free, and (unless blatant abuse) unlimited. We do ask that you consider upgrading to PRO and/or Enterprise Hub if possible
- private storage is paid above a significant free tier (1TB if you have a paid account, 100GB otherwise)

docs: https://huggingface.co/docs/hub/storage-limits

We optimize our infrastructure continuously to scale our storage for the coming years of growth in Machine learning, to the benefit of the community 🔥

cc: @reach-vb @pierric @victor and the HF team
replied to their post 8 months ago

Not from scratch, as our technique preserves most of the model weights. But you do have to continue pre-training to get most of the benefits, yes. You can read more about it in our preview paper.

We are in the process of releasing a library for replicating this easily, but are not ready to share this yet.

posted an update 9 months ago
Today, April 26, is the Day of the Tatar Language! 🌟
To celebrate, we release our new language model, Tweety Tatar 🐣

https://huggingface.co/Tweeties/tweety-tatar-base-7b-2024-v1

The model was converted from Mistral Instruct v0.2 using a novel technique called trans-tokenization. As a result, the model uses a brand-new tokenizer, fully tailored for the Tatar language.
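
For illustration, here is a naive sketch of the general idea behind giving a model a brand-new tokenizer: initialize each new token's embedding from the source-model embeddings of its re-tokenization under the old tokenizer. This is not the exact trans-tokenization procedure (see the paper for that); it only shows the kind of embedding remapping involved.

```python
# Naive illustration only: initialize embeddings for a new tokenizer by
# averaging the source-model embeddings of each new token's re-tokenization.
# (The actual trans-tokenization method is more involved; see the paper.)
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

src_name = "mistralai/Mistral-7B-Instruct-v0.2"
src_tok = AutoTokenizer.from_pretrained(src_name)
new_tok = AutoTokenizer.from_pretrained("Tweeties/tweety-tatar-base-7b-2024-v1")

model = AutoModelForCausalLM.from_pretrained(src_name, torch_dtype=torch.bfloat16)
src_emb = model.get_input_embeddings().weight.data

new_emb = torch.zeros(len(new_tok), src_emb.shape[1], dtype=src_emb.dtype)
for new_id in range(len(new_tok)):
    text = new_tok.decode([new_id])
    old_ids = src_tok.encode(text, add_special_tokens=False)
    if old_ids:  # average the source embeddings of the overlapping subwords
        new_emb[new_id] = src_emb[old_ids].mean(dim=0)

model.resize_token_embeddings(len(new_tok))
model.get_input_embeddings().weight.data.copy_(new_emb)
# (The output projection / lm_head would need the same treatment, followed by
# continued pre-training on the target language.)
```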

We also release a model which can be finetuned for translation of English or Russian into Tatar, and achieves a performance similar to commercial offerings:

https://huggingface.co/Tweeties/tweety-tatar-hydra-base-7b-2024-v1

More details in our upcoming paper 👀
François REMY, Pieter Delobelle, Alfiya Khabibullina

Татар теле көне белән! (Happy Tatar Language Day!)
reacted to chiphuyen's post with 👍 10 months ago
It feels awkward having my first post sharing my stuff, but this is a weekend project that I really enjoyed working on. I'd love to meet more people interested in random ideas like this.

A hard part of building AI applications is choosing which model to use. What if we don’t have to? What if we can predict the best model for any prompt?

Predictive human preference aims to predict which model users might prefer for a specific query.

https://huyenchip.com/2024/02/28/predictive-human-preference.html

One use case is model routing. If we know in advance that for a prompt, users will prefer Claude Instant’s response over GPT-4, and Claude Instant is cheaper/faster than GPT-4, we can route this prompt to Claude Instant. Model routing has the potential to increase response quality while reducing costs and latency.
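
As a rough illustration, a router of that kind could look like the sketch below; `predict_preference` is a hypothetical preference predictor (not an API from the post), and the cost figures are placeholders.

```python
# Minimal model-routing sketch: pick the cheapest model that is predicted
# to be "good enough" for this prompt. Everything here is illustrative.
from dataclasses import dataclass

@dataclass
class Candidate:
    name: str
    cost_per_1k_tokens: float  # placeholder prices

CANDIDATES = [
    Candidate("claude-instant", 0.0008),
    Candidate("gpt-4", 0.03),
]

def predict_preference(prompt: str, model_name: str) -> float:
    """Hypothetical predictor: probability that users prefer this model's
    answer for this prompt (e.g. a small classifier trained on preference data)."""
    raise NotImplementedError

def route(prompt: str, quality_floor: float = 0.45) -> str:
    scored = [(c, predict_preference(prompt, c.name)) for c in CANDIDATES]
    good_enough = [c for c, p in scored if p >= quality_floor]
    pool = good_enough or [max(scored, key=lambda x: x[1])[0]]
    return min(pool, key=lambda c: c.cost_per_1k_tokens).name
```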

One pattern is that for simple prompts, weak models can do (nearly) as well as strong models. For more challenging prompts, however, users are more likely to prefer stronger models. Here’s a visualization of predicted human preference for an easy prompt (“hello, how are you?”) and a challenging prompt (“Explain why Planck length …”).

Preference predictors make it possible to create leaderboards unique to any prompt and domain.
posted an update 11 months ago
🔥 What's that biomedical model that got 170,763 downloads last month on HuggingFace?! Well, the paper is finally published! #BioLORD

📰 Read our article in the Journal of the American Medical Informatics Association:
https://academic.oup.com/jamia/advance-article/doi/10.1093/jamia/ocae029/7614965

📝TLDR: BioLORD-2023 is a series of semantic language models for the biomedical domain, capable of representing clinical concepts and sentences in a semantic space aligned with human preferences. Our new multilingual version supports 50+ languages and is further finetuned on 7 European languages. These models were trained contrastively and through distillation, using a corpus that unifies biomedical concept names and their descriptions in the same latent space. For concepts that don't have a human-written description in UMLS, we used information from the SnomedCT knowledge graph and ChatGPT to generate synthetic data and improve our results.

🤗 Access our models from the HuggingFace hub, including the new 2023-C and 2023-S variants:
FremyCompany/BioLORD-2023
FremyCompany/BioLORD-2023-M
FremyCompany/BioLORD-2023-S
FremyCompany/BioLORD-2023-C
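
As a quick usage sketch (following the SentenceTransformer usage shown on the model cards), the models embed clinical concept names so that synonyms land close together:

```python
# Sketch: embedding clinical concepts with BioLORD-2023 via sentence-transformers.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("FremyCompany/BioLORD-2023")

concepts = [
    "myocardial infarction",
    "heart attack",
    "type 2 diabetes mellitus",
]
embeddings = model.encode(concepts, normalize_embeddings=True)

# Cosine similarity of the first concept to the others:
print(util.cos_sim(embeddings[0], embeddings[1:]))
# the synonym "heart attack" should score much higher than the diabetes concept
```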
reacted to bwang0911's post with 👍 11 months ago
We've been busy cooking up some interesting models at @jinaai, with a recent highlight being the release of our first batch of bilingual embedding models.

Internally labeled as X+EN, where X represents the target language and EN stays fixed, these models specialize in both monolingual tasks and cross-lingual retrieval tasks, crossing from X to EN.

You can find these models available on Huggingface:
1. German-English bilingual embedding: jinaai/jina-embeddings-v2-base-de
2. Chinese-English bilingual embedding: jinaai/jina-embeddings-v2-base-zh
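
As a usage sketch (based on the encode helper shown on the model cards, loaded with trust_remote_code), English-to-German cross-lingual similarity looks like this:

```python
# Sketch: cross-lingual similarity with the German-English bilingual model.
import numpy as np
from transformers import AutoModel

model = AutoModel.from_pretrained(
    "jinaai/jina-embeddings-v2-base-de", trust_remote_code=True
)

texts = [
    "How is the weather today?",      # English query
    "Wie ist das Wetter heute?",      # German paraphrase
    "Das Buch liegt auf dem Tisch.",  # unrelated German sentence
]
emb = model.encode(texts)

def cos(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cos(emb[0], emb[1]))  # cross-lingual paraphrase: high similarity
print(cos(emb[0], emb[2]))  # unrelated sentence: lower similarity
```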

We're also excited to announce that a Spanish bilingual embedding will be released in approximately two weeks.

Our evaluation across various MLM tasks has demonstrated that the Bilingual Backbone consistently outperforms state-of-the-art Multilingual Backbones like XLM-Roberta (given its focus on just two languages).

Despite being three times smaller than the leading multilingual model (e5-multilingual-large), our released bilingual embedding models have shown superior performance, excelling in both monolingual and cross-lingual search tasks.

Currently, we're putting the finishing touches on the technical report, which should be available on arXiv by next week.

Looking ahead, the embedding team is gearing up for jina-embeddings-v3, with some initial groundwork already underway. Stay tuned for more updates!
replied to BramVanroy's post 12 months ago

(My other thought is that you should increase the KL divergence penalty if your DPO model diverges too much from your initial model, but I think making the negative examples better is a stronger first step to take.)
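
For reference, in TRL's DPO implementation that penalty corresponds to the beta parameter; a minimal sketch, assuming a recent TRL version:

```python
# Sketch (assumes TRL's DPO API): beta acts as the KL penalty, so raising it
# keeps the fine-tuned policy closer to the initial (reference) model.
from trl import DPOConfig

config = DPOConfig(
    output_dir="dpo-run",
    beta=0.3,  # raised from the common default of 0.1 to reduce divergence
)
# Pass `config` to DPOTrainer together with the policy model, a frozen
# reference model, the tokenizer, and the preference dataset.
```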

replied to BramVanroy's post 12 months ago

Not an expert, but I think you should create your negative examples in a way where the first few tokens are not enough to differentiate between good and bad.

One easy way to do this would be to first sample the GPT-4 examples, keep the first n tokens (with n sampled between 0 and the length of the answer), and then generate the rest of the answer with the other (worse) model.

That way, the DPO model cannot just ignore every token after the first few, because the branching can happen at any point.
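
A minimal sketch of that construction (the weaker generator here is just a placeholder model):

```python
# Keep a random-length prefix of the preferred answer, then let a weaker model
# write the rest, so the chosen/rejected branch point can occur anywhere.
import random
from transformers import AutoModelForCausalLM, AutoTokenizer

weak_name = "gpt2"  # placeholder for the "other (worse)" model
tok = AutoTokenizer.from_pretrained(weak_name)
weak_model = AutoModelForCausalLM.from_pretrained(weak_name)

def make_rejected(prompt: str, chosen_answer: str, max_new_tokens: int = 128) -> str:
    answer_ids = tok.encode(chosen_answer, add_special_tokens=False)
    n = random.randint(0, len(answer_ids))  # branch point anywhere in the answer
    prefix = tok.decode(answer_ids[:n])
    input_ids = tok.encode(prompt + prefix, return_tensors="pt")
    out = weak_model.generate(
        input_ids,
        max_new_tokens=max_new_tokens,
        do_sample=True,
        pad_token_id=tok.eos_token_id,
    )
    continuation = tok.decode(out[0, input_ids.shape[1]:], skip_special_tokens=True)
    return prefix + continuation
```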