Parallia/Fairly-Multilingual-ModernBERT-Token-Alignment
TL;DR:
- public storage is free and, barring blatant abuse, unlimited. We do ask that you consider upgrading to PRO and/or Enterprise Hub if possible
- private storage is paid above a significant free tier (1TB if you have a paid account, 100GB otherwise)
docs: https://huggingface.co/docs/hub/storage-limits
We continuously optimize our infrastructure to scale our storage for the coming years of growth in machine learning, to the benefit of the community 🔥
cc: @reach-vb @pierric @victor and the HF team
Not from scratch, as our technique preserves most of the model weights. But you do have to continue pre-training to get most of the benefits, yes. You can read more about it in our preview paper.
We are in the process of releasing a library to replicate this easily, but it is not ready to share yet.
To celebrate, we release our new language model, Tweety Tatar 🐣
https://huggingface.co/Tweeties/tweety-tatar-base-7b-2024-v1
The model was converted from Mistral Instruct v0.2 using a novel technique called trans-tokenization. As a result, the model uses a brand-new tokenizer, fully tailored for the Tatar language.
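For reference, here is a minimal loading sketch with 🤗 Transformers; the Tatar prompt and generation settings are purely illustrative, and `device_map="auto"` assumes `accelerate` is installed:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Tweety Tatar ships with its own Tatar-specific tokenizer,
# so both pieces must come from the same repository.
repo = "Tweeties/tweety-tatar-base-7b-2024-v1"
tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(repo, device_map="auto")

inputs = tokenizer("Сәлам!", return_tensors="pt").to(model.device)  # "Hello!"
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```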
We also release a model that can be finetuned for translation of English or Russian into Tatar, and achieves performance similar to commercial offerings:
https://huggingface.co/Tweeties/tweety-tatar-hydra-base-7b-2024-v1
More details in our upcoming paper 👀
François REMY, Pieter Delobelle, Alfiya Khabibullina
Татар теле көне белән! (Happy Tatar Language Day!)
A hard part of building AI applications is choosing which model to use. What if we don’t have to? What if we can predict the best model for any prompt?
Predictive human preference aims to predict which model users might prefer for a specific query.
https://huyenchip.com/2024/02/28/predictive-human-preference.html
One use case is model routing. If we know in advance that for a prompt, users will prefer Claude Instant’s response over GPT-4, and Claude Instant is cheaper/faster than GPT-4, we can route this prompt to Claude Instant. Model routing has the potential to increase response quality while reducing costs and latency.
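As a rough illustration of this routing idea, here is a minimal sketch; `predict_strong_preference` stands in for whatever preference predictor you train, and the model names and threshold are placeholders, not part of the article:

```python
# Minimal model-routing sketch: send a prompt to the cheaper model
# unless the preference predictor expects users to favor the stronger one.

STRONG_MODEL = "gpt-4"           # placeholder names
CHEAP_MODEL = "claude-instant"
THRESHOLD = 0.6                  # tune on held-out preference data

def predict_strong_preference(prompt: str) -> float:
    """Hypothetical predictor: probability that users prefer the strong
    model's answer over the cheap model's answer for this prompt."""
    raise NotImplementedError  # e.g. a small classifier over prompt features

def route(prompt: str) -> str:
    p = predict_strong_preference(prompt)
    return STRONG_MODEL if p >= THRESHOLD else CHEAP_MODEL

# route("hello, how are you?")         -> likely the cheap model
# route("Explain why Planck length …") -> likely the strong model
```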
One pattern is that for simple prompts, weak models can do (nearly) as well as strong models. For more challenging prompts, however, users are more likely to prefer stronger models. Here’s a visualization of predicted human preference for an easy prompt (“hello, how are you?”) and a challenging prompt (“Explain why Planck length …”).
Preference predictors make it possible to create leaderboards unique to any prompt and domain.
📰 Read our article in the Journal of the American Medical Informatics Association:
https://academic.oup.com/jamia/advance-article/doi/10.1093/jamia/ocae029/7614965
📝 TL;DR:
BioLORD-2023 is a series of semantic language models for the biomedical domain, capable of representing clinical concepts and sentences in a semantic space aligned with human preferences. Our new multilingual version supports 50+ languages and is further finetuned on 7 European languages.
These models were trained contrastively and through distillation, using a corpus that unifies the names of biomedical concepts and their descriptions in the same latent space. For concepts that didn't have a human-written description in UMLS, we used the information contained in the SnomedCT knowledge graph and the capabilities of ChatGPT to generate synthetic data and improve our results.
🤗 Access our models from the Hugging Face hub, including the new 2023-C and 2023-S variants (a short usage sketch follows the list):
FremyCompany/BioLORD-2023
FremyCompany/BioLORD-2023-M
FremyCompany/BioLORD-2023-S
FremyCompany/BioLORD-2023-C
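A minimal usage sketch with `sentence-transformers`, assuming the checkpoints load as standard SentenceTransformer models (check each model card for the exact recommended usage):

```python
from sentence_transformers import SentenceTransformer, util

# Load one of the BioLORD-2023 checkpoints from the Hugging Face hub.
model = SentenceTransformer("FremyCompany/BioLORD-2023")

# Embed a few clinical expressions and compare them in the semantic space.
sentences = ["heart attack", "myocardial infarction", "fractured wrist"]
embeddings = model.encode(sentences, normalize_embeddings=True)

print(util.cos_sim(embeddings[0], embeddings[1]))  # synonyms -> high similarity
print(util.cos_sim(embeddings[0], embeddings[2]))  # unrelated -> lower similarity
```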
Internally labeled as X+EN, where X represents the target language and EN stays fixed, these models specialize in both monolingual tasks and cross-lingual retrieval tasks, crossing from X to EN. You can find these models on Hugging Face (a minimal loading sketch follows the list):
1. German-English bilingual embedding: jinaai/jina-embeddings-v2-base-de
2. Chinese-English bilingual embedding: jinaai/jina-embeddings-v2-base-zh
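A minimal loading sketch, assuming the bilingual checkpoints follow the same remote-code `encode` interface as the earlier jina-embeddings-v2 releases (see the model cards for the authoritative usage):

```python
import numpy as np
from transformers import AutoModel

# trust_remote_code=True is needed because the jina-embeddings-v2 models
# ship their own modeling code.
model = AutoModel.from_pretrained(
    "jinaai/jina-embeddings-v2-base-de", trust_remote_code=True
)

# Cross-lingual retrieval: German query vs. English documents.
query = "Wie ist das Wetter heute?"
docs = ["The weather is lovely today.", "The stock market dropped sharply."]
embeddings = model.encode([query] + docs)  # one vector per input

# Cosine similarity after normalizing the vectors.
vecs = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
print(vecs[0] @ vecs[1:].T)  # higher score for the semantically matching document
```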
We're also excited to announce that a Spanish bilingual embedding will be released in approximately two weeks.
Our evaluation across various MLM tasks has demonstrated that the bilingual backbone consistently outperforms state-of-the-art multilingual backbones like XLM-RoBERTa, thanks to its focus on just two languages.
Despite being three times smaller than the leading multilingual model (e5-multilingual-large), our released bilingual embedding models outperform it in both monolingual and cross-lingual search tasks.
Currently, we're putting the finishing touches on the technical report, which should be available on arXiv by next week.
Looking ahead, the embedding team is gearing up for jina-embeddings-v3, with some initial groundwork already underway. Stay tuned for more updates!
(My other thought is that you should increase the KL-divergence penalty if your DPO model diverges too much from your initial model, but I think making the negative examples better is a stronger first step to take.)
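If you happen to be using TRL's DPO trainer (not mentioned in the original thread, just a common choice), that penalty is controlled by the `beta` parameter; a minimal sketch, where 0.3 is an illustrative value rather than a recommendation:

```python
from trl import DPOConfig

# Larger beta = stronger KL-style penalty, keeping the DPO policy
# closer to the reference model (TRL's default is 0.1).
config = DPOConfig(output_dir="dpo-out", beta=0.3)
# ... then pass `args=config` to DPOTrainer as usual.
```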
Not an expert, but I think you should create your negative examples in such a way that the first few tokens are not enough to differentiate between good and bad.
One easy way to do this would be to first sample the GPT-4 examples, then keep the first n tokens (with n sampled uniformly between 0 and the length of the answer), then generate the rest of the answer with the other (worse) model.
That way, the DPO model cannot just ignore every token after the first few, because the branching can happen at any point.
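A minimal sketch of that construction, assuming you already have the prompt, the strong (e.g. GPT-4) answer as text, and a weaker causal LM loaded with 🤗 Transformers; all names here are illustrative:

```python
import random

def make_rejected(prompt, chosen_answer, weak_model, tokenizer, max_new_tokens=256):
    """Build a 'rejected' answer that shares a random-length prefix with the
    'chosen' answer, then diverges via the weaker model's continuation."""
    answer_ids = tokenizer(chosen_answer, add_special_tokens=False).input_ids

    # Keep the first n tokens of the good answer, n sampled from 0..len(answer).
    n = random.randint(0, len(answer_ids))
    shared_prefix = tokenizer.decode(answer_ids[:n])

    # Let the weaker model continue from prompt + shared prefix.
    inputs = tokenizer(prompt + shared_prefix, return_tensors="pt").to(weak_model.device)
    output = weak_model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=True)
    continuation = tokenizer.decode(
        output[0, inputs.input_ids.shape[1]:], skip_special_tokens=True
    )

    # The DPO pair becomes (chosen_answer, shared_prefix + continuation),
    # so the divergence point varies across the dataset.
    return shared_prefix + continuation
```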