ONNXConfig for all
non-profit
AI & ML interests
Make all hub models available for conversion to ONNX format.
Recent Activity
OWG's activity
shivi authored 3 papers · about 1 month ago
Aya Expanse: Combining Research Breakthroughs for a New Multilingual Frontier
Paper • 2412.04261 • Published • 1
INCLUDE: Evaluating Multilingual Language Understanding with Regional Knowledge
Paper • 2411.19799 • Published • 11
Global MMLU: Understanding and Addressing Cultural and Linguistic Biases in Multilingual Evaluation
Paper • 2412.03304 • Published • 17
louisbrulenaudet posted an update · about 2 months ago
Post • 1815
I've published a new dataset to simplify model merging 🤗
This dataset facilitates the search for compatible architectures for model merging with @arcee_ai's mergekit, streamlining the automation of high-performance merge searches.
Dataset: louisbrulenaudet/mergekit-configs
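A rough sketch of how the dataset might be browsed before a merge, assuming hypothetical column names such as "architecture" and "model" (the actual schema may differ):

from datasets import load_dataset

# Load the merge-configuration dataset (column names below are assumptions).
ds = load_dataset("louisbrulenaudet/mergekit-configs", split="train")

# Group model IDs by architecture so that mergekit-compatible pairs can be found.
by_architecture = {}
for row in ds:
    by_architecture.setdefault(row["architecture"], []).append(row["model"])

for architecture, models in by_architecture.items():
    print(architecture, len(models))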
louisbrulenaudet posted an update · 3 months ago
Post • 1204
Introducing Lemone-router, a series of classification models designed to produce an optimal multi-agent system for different branches of tax law.
Trained on a base of 49k lines comprising synthetic questions generated by GPT-4 Turbo and Llama 3.1 70B, further refined through evol-instruct tuning, manual curation, and authority documents, these models are based on an 8-category decomposition of the classification scheme derived from the Bulletin officiel des finances publiques - impôts:
# Mapping between the eight BOFiP-derived tax-law categories and class indices.
label2id = {
    "Bénéfices professionnels": 0,
    "Contrôle et contentieux": 1,
    "Dispositifs transversaux": 2,
    "Fiscalité des entreprises": 3,
    "Patrimoine et enregistrement": 4,
    "Revenus particuliers": 5,
    "Revenus patrimoniaux": 6,
    "Taxes sur la consommation": 7,
}

# Inverse mapping, used to decode model predictions back into category names.
id2label = {
    0: "Bénéfices professionnels",
    1: "Contrôle et contentieux",
    2: "Dispositifs transversaux",
    3: "Fiscalité des entreprises",
    4: "Patrimoine et enregistrement",
    5: "Revenus particuliers",
    6: "Revenus patrimoniaux",
    7: "Taxes sur la consommation",
}
It achieves the following results on the evaluation set:
- Loss: 0.4734
- Accuracy: 0.9191
Link to the collection: louisbrulenaudet/lemone-router-671cce21d6410f3570514762
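A minimal usage sketch, assuming a standard sequence-classification checkpoint; the model ID below is illustrative, and any checkpoint from the collection should behave the same way:

from transformers import pipeline

# Hypothetical checkpoint name from the Lemone-router collection.
classifier = pipeline(
    "text-classification",
    model="louisbrulenaudet/lemone-router-l",
)

# "What VAT rate applies to books?" should map to "Taxes sur la consommation".
print(classifier("Quel est le taux de TVA applicable aux livres ?"))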
louisbrulenaudet posted an update · 3 months ago
Post • 3117
🚨 I have $3,500 in Azure credits, including access to an H100 (96 GB), expiring on November 12, 2024.
I won't be able to use it all myself, so I'm reaching out to the @huggingface community: are there any open-source projects with data ready for some compute power?
Let's collaborate and make the most of it together!
louisbrulenaudet posted an update · 3 months ago
Post • 2111
My biggest release of the year: a series of 7 specialized embedding models for information retrieval within tax documents is now available for free on Hugging Face 🤗
These new models aim to offer an open-source alternative for in-domain semantic search over large text corpora and will improve RAG systems and context addition for large language models.
Trained on more than 43 million tax tokens derived from semi-synthetic and raw-synthetic data, enriched by various methods (in particular MSFT's evol-instruct by @intfloat), and corrected by humans, this project is the fruit of hundreds of hours of work and the culmination of a global effort to open up legal technologies that has only just begun.
A big thank you to Microsoft for Startups for giving me access to state-of-the-art infrastructure to train these models, and to @julien-c, @clem 🤗, @thomwolf and the whole HF team for the inference endpoint API and the generous provision of Meta Llama-3.1-70B. Special thanks also to @tomaarsen for his invaluable advice on training embedding models and loss functions ❤️
The models are available on my personal HF page, in the Lemone-embed collection: louisbrulenaudet/lemone-embed-66fdc24000df732b395df29b
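A minimal retrieval sketch with Sentence Transformers; the checkpoint ID is an assumption, so substitute any model from the Lemone-embed collection and check its card for possible instruction prefixes:

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("louisbrulenaudet/lemone-embed-l")  # assumed ID

corpus = [
    "Les livres bénéficient d'un taux réduit de TVA.",  # books get a reduced VAT rate
    "Le contrôle fiscal peut porter sur les trois derniers exercices.",  # tax audits
]
query = "Quel taux de TVA pour les livres ?"  # what VAT rate for books?

corpus_embeddings = model.encode(corpus, convert_to_tensor=True)
query_embedding = model.encode(query, convert_to_tensor=True)

# Rank corpus passages by cosine similarity to the query.
print(util.semantic_search(query_embedding, corpus_embeddings, top_k=1))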
louisbrulenaudet posted an update · 4 months ago
Post • 2602
The Romulus model series has been released on Hugging Face, continually pre-trained on 34,864,949 tokens of French laws and intended to serve as a foundation for fine-tuning on labeled data 🤗
The training code, dataset, and model weights are open and freely available on HF, and training was performed on an H100 provided by Microsoft for Startups, using Unsloth AI by @danielhanchen and @shimmyshimmer 🦥
Link to the base model: louisbrulenaudet/Romulus-cpt-Llama-3.1-8B-v0.1
Link to the instruct model: louisbrulenaudet/Romulus-cpt-Llama-3.1-8B-v0.1-Instruct
Link to the dataset: louisbrulenaudet/Romulus-cpt-fr
Please note that these models have not been aligned to produce usable text as they stand, and will certainly need to be fine-tuned for the desired tasks in order to produce satisfactory results.
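A short sketch of loading the base checkpoint for further fine-tuning, keeping the caveat above in mind (the generation call is only a smoke test):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "louisbrulenaudet/Romulus-cpt-Llama-3.1-8B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

inputs = tokenizer("Article 1er du Code civil :", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))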
louisbrulenaudet posted an update · 4 months ago
Post • 1576
An example application of LegalKit is the production of knowledge graphs; here is a demo Space.
With the update of the French legal code data model uploaded to 🤗 and the introduction of a column dedicated to HTML text, it's now easy to extract links between different articles and produce complex graphs with just a few lines of Python.
This simplified demo highlights the ease of implementation and the creative potential, and enables the generation of complete datasets, although a powerful graphics card is required for display. The framework used for the moment is D3.js, but other solutions may be possible. I'd be delighted to hear your suggestions, and look forward to hearing from the community.
Link to the 🤗 Space: louisbrulenaudet/legalkit-knowledge-graph
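A concept sketch of the extraction step, assuming a legal-code dataset with "id" and "html" columns; the dataset ID, column names, and the citation regex are all illustrative:

import re

import networkx as nx
from datasets import load_dataset

ds = load_dataset("louisbrulenaudet/code-civil", split="train")  # assumed ID

graph = nx.DiGraph()
pattern = re.compile(r"article\s+([0-9]+(?:-[0-9]+)*)", re.IGNORECASE)

for row in ds:
    graph.add_node(row["id"])
    # Every article cited in the HTML body becomes an outgoing edge.
    for target in pattern.findall(row["html"]):
        graph.add_edge(row["id"], target)

print(graph.number_of_nodes(), "nodes,", graph.number_of_edges(), "edges")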
Post • 2612
Plugins in NiansuhAI
Plugin Names:
1. WebSearch: Searches the web using search engines.
2. Calculator: Evaluates mathematical expressions, extending the base Tool class.
3. WebBrowser: Extracts and summarizes information from web pages.
4. Wikipedia: Retrieves information from Wikipedia using its API.
5. Arxiv: Searches and fetches article information from Arxiv.
6. WolframAlphaTool: Provides answers on math, science, technology, culture, society, and everyday life.
These plugins currently support the GPT-4o-2024-08-06 model, which also supports image analysis.
Try it now: https://huggingface.co/spaces/NiansuhAI/chat
Similar to: https://hf.co/chat
louisbrulenaudet posted an update · 4 months ago
Post • 2002
Understanding the JSON response format with HF's Serverless Inference API 🤗
As it stands, there seems to be an inconsistency with the OpenAI documentation regarding how to implement the JSON response format using the InferenceClient chat-completion API.
After investigating the InferenceClient source code, I share the official solution using a JSON Schema. This consolidates the structure of the response and simplifies parsing as part of an automated process for extracting metadata and information:
from huggingface_hub import InferenceClient

client = InferenceClient("meta-llama/Meta-Llama-3-70B-Instruct")

messages = [
    {
        "role": "user",
        "content": "I saw a puppy, a cat and a raccoon during my bike ride in the park. What did I see, and where?",
    },
]

# With the Serverless Inference API, the JSON Schema is passed under the
# "value" key of a {"type": "json", ...} response_format.
response_format = {
    "type": "json",
    "value": {
        "properties": {
            "location": {"type": "string"},
            "activity": {"type": "string"},
            "animals_seen": {"type": "integer", "minimum": 1, "maximum": 5},
            "animals": {"type": "array", "items": {"type": "string"}},
        },
        "required": ["location", "activity", "animals_seen", "animals"],
    },
}

response = client.chat_completion(
    messages=messages,
    response_format=response_format,
    max_tokens=500,
)

print(response.choices[0].message.content)
As a reminder, JSON mode is activated with the OpenAI client as follows:
response = client.chat.completions.create(
    model="gpt-3.5-turbo-0125",
    messages=[...],
    response_format={"type": "json_object"},
)
One question remains open, however, and may yet be answered by the community: an incompatibility seems to persist when generating lists of dictionaries, and for now, producing simple dictionaries appears to be the only functional option.
louisbrulenaudet posted an update · 5 months ago
Post • 2761
RAGoon is now available on PyPI, GitHub, and as a Space on Hugging Face for batched embeddings generation 🤗
RAGoon is a set of NLP utilities for multi-model embedding production and high-dimensional vector visualization. It aims to improve language model performance by providing contextually relevant information through search-based querying, web scraping, and data augmentation techniques.
At this stage, 5 major classes are available via RAGoon to facilitate:
- the production of chained embeddings across several models, to simplify a continuous deployment process;
- the construction of LLM requests for web querying and content retrieval via the Google API;
- recursive chunking via tokens;
- data visualization and the function to load embeddings from a FAISS index, reduce their dimensionality using PCA and/or t-SNE, and visualize them in an interactive 3D graph;
- the creation of binary indexes for search with scalar (int8) rescoring.
Link to GitHub: https://github.com/louisbrulenaudet/ragoon
Link to the 🤗 Space: louisbrulenaudet/ragoon
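The last point, binary indexing with scalar (int8) rescoring, can be illustrated independently of RAGoon's own API; this is a concept sketch built on sentence-transformers and FAISS, not RAGoon code:

import faiss
import numpy as np
from sentence_transformers import SentenceTransformer
from sentence_transformers.quantization import quantize_embeddings

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
docs = ["Tax law governs levies on transactions.", "OBIS maps marine biodiversity."]

# Keep two quantized views: packed binary for the index, int8 for rescoring.
float_emb = model.encode(docs, normalize_embeddings=True)
binary_emb = quantize_embeddings(float_emb, precision="ubinary")
int8_emb = quantize_embeddings(float_emb, precision="int8")

index = faiss.IndexBinaryFlat(float_emb.shape[1])  # dimension in bits
index.add(binary_emb)

query = model.encode(["What is tax law?"], normalize_embeddings=True)
query_bin = quantize_embeddings(query, precision="ubinary")
_, candidates = index.search(query_bin, k=2)  # fast, coarse binary search

# Rescore the candidates with the higher-precision int8 embeddings.
scores = query @ int8_emb[candidates[0]].T.astype(np.float32)
print(candidates[0][np.argsort(-scores[0])])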
louisbrulenaudet posted an update · 6 months ago
Post • 869
You can now find OBIS, the Ocean Biodiversity Information System, on Hugging Face, with 128M rows streamable via the Datasets package 🤗
The datasets are integrated, allowing seamless search and mapping by species name, higher taxonomic level, geographic area, depth, time, and environmental parameters. OBIS originates from the Census of Marine Life (2000-2010) and was adopted as a project under IOC-UNESCO's International Oceanographic Data and Information Exchange (IODE) programme in 2009.
Collectively, they have provided over 45 million observations of nearly 120,000 marine species, ranging from bacteria to whales, from the surface to 10,900 meters depth, and from the tropics to the poles.
Link to the dataset: louisbrulenaudet/obis
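With 128M rows, streaming is the natural access pattern; a minimal sketch (field names depend on the dataset, so inspect the first records):

from datasets import load_dataset

# Stream the dataset instead of downloading all 128M rows locally.
obis = load_dataset("louisbrulenaudet/obis", split="train", streaming=True)

for i, record in enumerate(obis):
    print(record)  # one occurrence record
    if i == 2:
        break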
Post • 2820
Introducing Plugins in NiansuhAI (on July 20, 2024)
Plugin Names:
1. WebSearch: Tool for searching the web using search engines.
2. Calculator: Helps evaluate mathematical expressions; extends the base Tool class.
3. WebBrowser: Interacts with web pages to extract information or summarize content.
4. Wikipedia: Retrieves data from Wikipedia using its API.
5. Arxiv: Searches and fetches article information from Arxiv.
6. WolframAlphaTool: Answers questions on Math, Science, Technology, Culture, Society, and Everyday Life.
Similar to https://hf.co/chat
wannaphong authored a paper · 6 months ago
Post • 3490
Use GPT-4, GPT-4 Turbo Preview, GPT-3.5 Turbo, BingAI, and other models. The interface is similar to ChatGPT, with a speedy API endpoint.
https://huggingface.co/spaces/NiansuhAI/Copilot
louisbrulenaudet posted an update · 6 months ago
Post • 2120
Introducing the first two projects on the HFforLegal community: the 'Laws' dataset and the associated search tool based on @nreimers and @tomaarsen's Sentence Transformers library 🤗
The objective of these two tools is to centralize, in a single format, a set of rules from different countries and legal systems in order to facilitate NLP in the field of comparative law, enabling more accurate and comprehensive legal analysis across different jurisdictions.
Link to the dataset: HFforLegal/laws
Link to the space: HFforLegal/laws-retrieval
We need your contributions to enrich this new knowledge base, and you will find in the 'Laws' dataset all the information you need to format your data and submit them to the appropriate split.
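A minimal sketch of the underlying search idea, assuming a per-country split and a "text" column (both assumptions about the dataset layout), with a generic multilingual encoder rather than the exact model behind the Space:

from datasets import load_dataset
from sentence_transformers import SentenceTransformer, util

laws = load_dataset("HFforLegal/laws", split="france")  # assumed split name
model = SentenceTransformer("sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")

corpus = laws.select(range(1000))["text"]  # assumed column name
corpus_embeddings = model.encode(corpus, convert_to_tensor=True)
query_embedding = model.encode("rules governing property transfer", convert_to_tensor=True)

print(util.semantic_search(query_embedding, corpus_embeddings, top_k=3))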
louisbrulenaudet posted an update · 7 months ago
Post • 2950
Announcing the creation of the "HF for Legal" organization, an open-source community dedicated to demystifying language models for legal professionals 🤗
Whether you're a practicing attorney, a legal scholar, or a technologist interested in legal applications of AI, HF for Legal may be your hub for exploration, learning, and free innovation ⚖️
On the occasion of this launch, you'll find several notebooks I've been developing over the last few months: TSDAE pre-training of embedding models and the generation of indexes for semantic search, based on the formidable work of @tomaarsen and @nreimers and adapted to the field of French law, as well as the addition of information-retrieval tasks to the MTEB.
Join us in our mission to make AI more accessible and understandable for the legal world, ensuring that the power of language models can be harnessed effectively and ethically.
Link to the org: https://huggingface.co/HFforLegal
Special thanks to @clem for encouraging me to start this organization. Let's hope we can bring together all the enthusiasts who work in this field.
Let's code and share together! ๐๐
louisbrulenaudet posted an update · 7 months ago
Post • 3245
I am delighted to announce the publication of my LegalKit, a French labeled dataset built for legal ML training 🤗
This dataset comprises multiple query-document pairs (+50k) curated for training sentence embedding models within the domain of French law.
The labeling process follows a systematic approach to ensure consistency and relevance:
- Initial Query Generation: Three instances of the Llama-3-70B model independently generate three different queries based on the same document.
- Selection of Optimal Query: A fourth instance of the Llama-3-70B model, using a dedicated selection prompt, evaluates the generated queries and selects the most suitable one.
- Final Label Assignment: The chosen query is used to label the document, aiming to ensure that the label accurately reflects the content and context of the original text.
Dataset: louisbrulenaudet/legalkit
Stay tuned for further updates and release information 🔥
@clem , if we can create an "HF for Legal" organization, similar to what exists for journalists, I am available!
Note: My special thanks to @alvdansen for their illustration models ❤️
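A hedged sketch of how such query-document pairs could be used to train a bi-encoder with in-batch negatives; the "query" and "document" column names are assumptions about the dataset schema:

from datasets import load_dataset
from sentence_transformers import InputExample, SentenceTransformer, losses
from torch.utils.data import DataLoader

ds = load_dataset("louisbrulenaudet/legalkit", split="train")

# Each positive pair serves as a negative for the other pairs in the batch.
examples = [
    InputExample(texts=[row["query"], row["document"]])
    for row in ds.select(range(1000))
]
train_dataloader = DataLoader(examples, batch_size=16, shuffle=True)

model = SentenceTransformer("intfloat/multilingual-e5-base")
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1)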