nicholasKluge committed · verified · Commit 088385f · 1 Parent(s): 0c0673c

Update README.md

Files changed (1):
  1. README.md +16 -6
README.md CHANGED
@@ -1,36 +1,46 @@
  ---
  title: README
- emoji: 🦜
  colorFrom: gray
  colorTo: yellow
  sdk: static
  pinned: true
  license: apache-2.0
- short_description: Description of the Mula project.
  thumbnail: >-
  https://cdn-uploads.huggingface.co/production/uploads/62e1cc43926f4892a4ca2ff9/_1WxGqMpLN0RuX02Dq9Df.png
  ---
  <div align="center">
-
- # Tucano: Advancing Neural Text Generation for Portuguese

  </div>

  <p align="center">
  <img src="./logo.png" alt="An illustration of a Tucano bird showing vibrant colors like yellow, orange, blue, green, and black." height="400">
  </p>

- To stimulate the future of open development of neural text generation in Portuguese, we present both **[GigaVerbo](https://huggingface.co/datasets/TucanoBR/GigaVerbo)**, a concatenation of deduplicated Portuguese text corpora amounting to 200 billion tokens, and **[Tucano](https://huggingface.co/TucanoBR/Tucano-2b4)**, a series of decoder-transformers natively pre-trained in Portuguese. All byproducts of our study, including the source code used for training and evaluation, are openly released on [GitHub](https://github.com/Nkluge-correa/Tucano) and Hugging Face.

- Read our preprint in [arXiv](https://arxiv.org/abs/2411.07854).

  ## News

  - [29/11/2024] Tucano is mentioned on Deutsche Welle: "[Cientistas criam maior banco de dados em português para IA](https://www.dw.com/pt-br/pesquisadores-da-alemanha-criam-maior-banco-de-dados-p%C3%BAblico-em-portugu%C3%AAs-para-ia/a-70917082)".
  - [27/11/2024] Tucano video presentation at the C4AI (USP) [available on [YouTube](https://www.youtube.com/watch?v=BscOHn54ld8)].
  - [12/11/2024] "[Tucano: Advancing Neural Text Generation for Portuguese](https://arxiv.org/abs/2411.07854)" is published as a preprint on ArXiv, with all models and datasets released on [Hugging Face](https://huggingface.co/TucanoBR).
  ## Community Contributions 🤝

  - Demo on how to [run inference on Tucano](https://colab.research.google.com/drive/1Qf2DsFOFDA7RKkamI-tH3OregtOlZ8Cz).
  - Demo on how to create a simple [Chat UI for Tucano](https://colab.research.google.com/drive/1fEW10CXksMfMv1veLr22OESwDs6e-W1b) using Gradio.
  - [Tucano OpenVINO](https://huggingface.co/cabelo/Tucano-2b4-Instruct-fp16-ov) is a ported version of Tucano-2b4-Instruct optimized for Intel OpenVINO inference technology.

  ---
  title: README
+ emoji: 🌐
  colorFrom: gray
  colorTo: yellow
  sdk: static
  pinned: true
  license: apache-2.0
+ short_description: Efficient foundation models for low-resource languages.
  thumbnail: >-
  https://cdn-uploads.huggingface.co/production/uploads/62e1cc43926f4892a4ca2ff9/_1WxGqMpLN0RuX02Dq9Df.png
  ---
  <div align="center">

+ <h1>Polyglot</h1>
+
  </div>

  <p align="center">
  <img src="./logo.png" alt="An illustration of a Tucano bird showing vibrant colors like yellow, orange, blue, green, and black." height="400">
  </p>

+ In recent years, generative AI has seen remarkable advancements, with foundation models emerging as the cornerstone of much of the research and development in the field. However, the prevailing deep learning paradigm demands vast resources in terms of data and computation. This data-intensive approach has inadvertently deepened the divide between high-resource and low-resource languages. High-resource languages benefit from the bulk of development efforts and readily available resources, while low-resource languages face significant challenges in achieving comparable performance and autonomy.
+
+ To foster a more equitable, sustainable, and open ecosystem for AI research and development, we aim to create tools and resources to support the development of foundation models for low-resource languages. This includes developing models, datasets, and open-source code to empower underrepresented linguistic communities.

+ ## Recent Publications
+
+ - **ViTucano: A Portuguese Vision Assistant** | [GitHub](https://github.com/Nkluge-correa/TinyLLaVA_Factory) | [Collection](https://huggingface.co/collections/TucanoBR/vitucano-v1-67804623a92cd2fabcafa0a3)
+ - **Tucano: Advancing Neural Text Generation for Portuguese** | [GitHub](https://github.com/Nkluge-correa/Tucano) | [Collection](https://huggingface.co/collections/TucanoBR/tucano-670565e8c5325fb7f2da4361) | [Paper](https://arxiv.org/abs/2411.07854)
+ - **TeenyTinyLlama: open-source tiny language models trained in Brazilian Portuguese** | [GitHub](https://github.com/Nkluge-correa/TeenyTinyLlama) | [Collection](https://huggingface.co/collections/nicholasKluge/teenytinyllama-6582ea8129e72d1ea4d384f1) | [Paper](https://www.sciencedirect.com/science/article/pii/S2666827024000343)

  ## News

+ - [13/01/2025] We release ViTucano, a pair of vision assistants natively pretrained in Portuguese ([ViTucano-1b5-v1](https://huggingface.co/TucanoBR/ViTucano-1b5-v1), [ViTucano-2b8-v1](https://huggingface.co/TucanoBR/ViTucano-2b8-v1)).
+ - [13/01/2025] We release the datasets used to pretrain and fine-tune the ViTucano models: [ViTucano-Pretrain](https://huggingface.co/datasets/TucanoBR/ViTucano-Pretrain) and [ViTucano-SFT](https://huggingface.co/datasets/TucanoBR/ViTucano-SFT).
  - [29/11/2024] Tucano is mentioned on Deutsche Welle: "[Cientistas criam maior banco de dados em português para IA](https://www.dw.com/pt-br/pesquisadores-da-alemanha-criam-maior-banco-de-dados-p%C3%BAblico-em-portugu%C3%AAs-para-ia/a-70917082)".
  - [27/11/2024] Tucano video presentation at the C4AI (USP) [available on [YouTube](https://www.youtube.com/watch?v=BscOHn54ld8)].
  - [12/11/2024] "[Tucano: Advancing Neural Text Generation for Portuguese](https://arxiv.org/abs/2411.07854)" is published as a preprint on ArXiv, with all models and datasets released on [Hugging Face](https://huggingface.co/TucanoBR).
+
  ## Community Contributions 🤝

+ - Demo on how to [run inference on ViTucano](https://colab.research.google.com/drive/110_Gtjgu4pldRQP864_Y-rSm2VhyW7Li).
  - Demo on how to [run inference on Tucano](https://colab.research.google.com/drive/1Qf2DsFOFDA7RKkamI-tH3OregtOlZ8Cz) (see also the minimal sketch after this list).
  - Demo on how to create a simple [Chat UI for Tucano](https://colab.research.google.com/drive/1fEW10CXksMfMv1veLr22OESwDs6e-W1b) using Gradio.
  - [Tucano OpenVINO](https://huggingface.co/cabelo/Tucano-2b4-Instruct-fp16-ov) is a ported version of Tucano-2b4-Instruct optimized for Intel OpenVINO inference technology.
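
As a quick textual complement to the demo notebooks above, here is a minimal sketch of local text generation with Tucano. It assumes the [Tucano-2b4](https://huggingface.co/TucanoBR/Tucano-2b4) checkpoint works with the standard Hugging Face `transformers` text-generation pipeline; the prompt and generation settings are illustrative, not a configuration recommended by the authors.

```python
# Minimal sketch: local inference with Tucano via the transformers pipeline.
# Assumes `pip install transformers torch` and that TucanoBR/Tucano-2b4 is
# compatible with the standard text-generation pipeline. Generation settings
# below are illustrative only.
from transformers import pipeline

generator = pipeline("text-generation", model="TucanoBR/Tucano-2b4")

prompt = "A floresta amazônica é"  # "The Amazon rainforest is"
outputs = generator(prompt, max_new_tokens=50, do_sample=True, temperature=0.7)

print(outputs[0]["generated_text"])
```

For instruction-following use, the Tucano-2b4-Instruct variant (the base of the OpenVINO port above) is likely the more natural starting point.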