import gradio as gr
gr.Markdown("""
# BigScience BLOOM is a 176B-Parameter Large Language Model
https://www.youtube.com/watch?v=wA8rjKueB3Q
https://www.youtube.com/watch?v=2MBJOuVq380&t=241s
# Big Science Papers and Code - Exciting AI Developments! 🤖💻🔬
https://paperswithcode.com/paper/bloom-a-176b-parameter-open-access
""")
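
# Load the hosted BLOOM model as a callable Gradio interface.
api = gr.Interface.load("models/bigscience/bloom")


def complete_with_gpt(text):
    # Send only the last 100 characters of the text to the model as context;
    # everything before that is kept unchanged and the completion is appended.
    # A shorter 50-character window is an alternative:
    # return text[:-50] + api(text[-50:])
    return text[:-100] + api(text[-100:])
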

with gr.Blocks() as demo:
    with gr.Row():
        textbox = gr.Textbox(placeholder="Type here and click Generate...", lines=14)
        with gr.Column():
            btn = gr.Button("Generate")

    # Replace the textbox contents with the original text plus the completion.
    btn.click(complete_with_gpt, textbox, textbox)

    with gr.Row():
        gr.Markdown("""
# Example on how to prompt.
Create a repeating pattern of text. In this example, each block starts with the same English sentence, followed by a language name as a heading. After adding a heading for another language, click Generate to have the model fill in that line.
English: Hi my name is Aaron. I am a computer scientist and senior principal engineer.
Japanese: 私はアランです。コンピューター科学者とプログラ
English: Hi my name is Aaron. I am a computer scientist and senior principal engineer.
Chinese: 你好,我叫Aaron。我是一个计算机科学家和高级首席工程师。
English: Hi my name is Aaron. I am a computer scientist and senior principal engineer.
Spanish: Hola, me llamo Aaron. Soy un cientifico de la computacion y un ingeniero principal
English: Hi my name is Aaron. I am a computer scientist and senior principal engineer.
Hindi: नमस्ते, मेरा नाम है Aaron. मैं एक कंप्यूटर वैज्ञानिक और वरिष्ठ प्रमुख इंजीनियर हूँ।
French: Bonjour, je m'appelle Aaron. Je suis un scientifique en informatique et un ingénieur senior.
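
The pattern above can also be assembled programmatically. The following is a minimal sketch, not part of this app: `build_prompt_lines` is a hypothetical helper, and the shortened sentences are only for illustration. It repeats the English line before each language heading and ends on an open heading for the model to complete.

```python
# Hypothetical helper (not part of this app): builds the repeating pattern line by line.
def build_prompt_lines(english, translations, next_language):
    lines = []
    for language, translation in translations:
        lines.append("English: " + english)
        lines.append(language + ": " + translation)
    # End on an open heading so the model writes the next translation.
    lines.append("English: " + english)
    lines.append(next_language + ":")
    return lines


prompt_lines = build_prompt_lines(
    "Hi my name is Aaron.",
    [("Spanish", "Hola, me llamo Aaron.")],
    "French",
)
for line in prompt_lines:
    print(line)
```

Because `complete_with_gpt` sends only the last 100 characters of the textbox to the model, short pattern lines leave more of the pattern visible to it.
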
## Language Models 🗣️
🏆 BLOOM is an open-access multilingual language model with 176 billion parameters, built by the BigScience workshop. 🌸
### Comparison of Large Language Models
| Model Name | Model Size (in Parameters) |
| ----------------- | -------------------------- |
| BigScience-tr11-176B | 176 billion |
| GPT-3 | 175 billion |
| OpenAI's DALL-E 2.0 | about 3.5 billion (decoder of an image-generation model) |
| NVIDIA's Megatron | 8.3 billion |
| Transformer-XL | about 257 million (large model) |
| XLNet | about 340 million (XLNet-Large) |
## ChatGPT Datasets 📚
- WebText
- Common Crawl
- BooksCorpus
- English Wikipedia
- Toronto Books Corpus
- OpenWebText
## ChatGPT Datasets - Details 📚
- **WebText:** A dataset of web pages scraped from outbound Reddit links that received at least 3 karma. This dataset was used to pretrain GPT-2.
  - [WebText on Papers with Code](https://paperswithcode.com/dataset/webtext), introduced in "Language Models are Unsupervised Multitask Learners" by Radford et al.
- **Common Crawl:** A regularly updated dataset of web pages from a wide variety of domains. A filtered version of this dataset was used to pretrain GPT-3.
  - [Language Models are Few-Shot Learners](https://paperswithcode.com/dataset/common-crawl) by Brown et al.
- **BooksCorpus:** A dataset of over 11,000 books from a variety of genres.
  - [Aligning Books and Movies: Towards Story-like Visual Explanations by Watching Movies and Reading Books](https://paperswithcode.com/dataset/bookcorpus) by Zhu et al.
- **English Wikipedia:** A dump of the English-language Wikipedia as of 2018, with articles from 2001-2017.
  - [Wikipedia search Space](https://huggingface.co/spaces/awacke1/WikipediaUltimateAISearch?logs=build) for searching Wikipedia.
- **Toronto Books Corpus:** Another name for BooksCorpus, which was collected by researchers at the University of Toronto. The original GPT paper describes it as containing over 7,000 unique unpublished books from a variety of genres.
  - [BookCorpus on Papers with Code](https://paperswithcode.com/dataset/bookcorpus)
- **OpenWebText:** An open-source recreation of WebText, built from web pages linked on Reddit and filtered to remove content that was likely to be low-quality or spammy.
  - [OpenWebText on Papers with Code](https://paperswithcode.com/dataset/openwebtext)
## Big Science Model 🚀
- 📜 Papers:
1. BLOOM: A 176B-Parameter Open-Access Multilingual Language Model [Paper](https://arxiv.org/abs/2211.05100)
2. Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism [Paper](https://arxiv.org/abs/1909.08053)
3. 8-bit Optimizers via Block-wise Quantization [Paper](https://arxiv.org/abs/2110.02861)
4. Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation [Paper](https://arxiv.org/abs/2108.12409)
5. [Other papers related to Big Science](https://huggingface.co/models?other=doi:10.57967/hf/0003)
6. [217 other models optimized for use with Bloom](https://huggingface.co/models?other=bloom)
- 📚 Datasets:
1. **Universal Dependencies:** A collection of annotated corpora for natural language processing in a range of languages, with a focus on dependency parsing.
   - [Universal Dependencies official website.](https://universaldependencies.org/)
2. **WMT 2014:** The ninth Workshop on Statistical Machine Translation, featuring shared tasks on translating between English and various other languages.
   - [WMT14 website.](http://www.statmt.org/wmt14/)
3. **The Pile:** A large English-language corpus of diverse text that EleutherAI assembled from 22 smaller datasets.
   - [The Pile official website.](https://pile.eleuther.ai/)
4. **HumanEval:** A benchmark of 164 hand-written Python programming problems with unit tests, used to evaluate code generation.
   - [Evaluating Large Language Models Trained on Code](https://github.com/openai/human-eval) by Chen et al.
5. **FLORES-101:** An evaluation benchmark of parallel sentences in 101 languages, designed for low-resource and multilingual machine translation.
   - [The FLORES-101 Evaluation Benchmark for Low-Resource and Multilingual Machine Translation](https://github.com/facebookresearch/flores) by Goyal et al.
6. **CrowS-Pairs:** A dataset of sentence pairs, one more stereotypical than the other, designed for measuring social biases in language models.
   - [CrowS-Pairs: A Challenge Dataset for Measuring Social Biases in Masked Language Models](https://github.com/nyu-mll/crows-pairs) by Nangia, Vania, Bhalerao, and Bowman.
7. **WikiLingua:** A multilingual dataset of article and summary pairs in 18 languages, sourced from WikiHow.
   - [WikiLingua: A New Benchmark Dataset for Cross-Lingual Abstractive Summarization](https://arxiv.org/abs/2010.03093) by Ladhak, Durmus, Cardie, and McKeown.
8. **MTEB:** The Massive Text Embedding Benchmark, which evaluates text embedding models across a range of tasks and datasets.
   - [MTEB: Massive Text Embedding Benchmark](https://github.com/embeddings-benchmark/mteb) by Muennighoff, Tazi, Magne, and Reimers.
9. **xP3:** The Crosslingual Public Pool of Prompts, a collection of prompted datasets covering 46 languages, used to fine-tune BLOOMZ and mT0.
   - [Crosslingual Generalization through Multitask Finetuning](https://github.com/bigscience-workshop/xmtf) by Muennighoff et al.
10. **DiaBLa:** A dataset of bilingual English-French written dialogues, designed for evaluating machine translation.
    - [DiaBLa: A Corpus of Bilingual Spontaneous Written Dialogues for Machine Translation](https://github.com/rbawden/DiaBLa-dataset) by Bawden, Bilinski, Lavergne, and Rosset.
- 📚 Dataset Papers with Code
1. [Universal Dependencies](https://paperswithcode.com/dataset/universal-dependencies)
2. [WMT 2014](https://paperswithcode.com/dataset/wmt-2014)
3. [The Pile](https://paperswithcode.com/dataset/the-pile)
4. [HumanEval](https://paperswithcode.com/dataset/humaneval)
5. [FLORES-101](https://paperswithcode.com/dataset/flores-101)
6. [CrowS-Pairs](https://paperswithcode.com/dataset/crows-pairs)
7. [WikiLingua](https://paperswithcode.com/dataset/wikilingua)
8. [MTEB](https://paperswithcode.com/dataset/mteb)
9. [xP3](https://paperswithcode.com/dataset/xp3)
10. [DiaBLa](https://paperswithcode.com/dataset/diabla)
# Deep RL ML Strategy 🧠
The AI strategies are:
- Language model preparation: supervised fine-tuning on human-augmented text 🤖
- Reward model training: a prompts dataset is fed to multiple models, and the generated outputs are ranked 🎁
- Fine-tuning with a reinforcement reward and a distribution-distance (regret) score 🎯
- Proximal Policy Optimization fine-tuning 🤝
- Variations: preference model pretraining 🤔
- Ranking datasets built from sentiment signals: thumbs up/down, distribution 📊
- Online version that collects feedback 💬
- OpenAI InstructGPT: humans generate the LM training text 🔍
- DeepMind: Advantage Actor Critic in Sparrow and GopherCite 🦜
- Reward model trained on human preference feedback 🏆
For more information on specific techniques and implementations, check out the following resources:
- OpenAI's paper on [GPT-3](https://arxiv.org/abs/2005.14165), which details how the base language model was pretrained
- The [SAC](https://arxiv.org/abs/1801.01290) paper by Haarnoja et al., which describes Soft Actor-Critic, an off-policy actor-critic algorithm related to Advantage Actor Critic
- OpenAI's paper on [Reward Learning](https://arxiv.org/abs/1810.06580) which explains their approach to training Reward Models
- OpenAI's blog post on [GPT-3's fine-tuning process](https://openai.com/blog/fine-tuning-gpt-3/)
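
The reward model step listed above is commonly trained on ranked pairs of responses. The sketch below assumes a pairwise ranking loss of the kind described for InstructGPT-style reward models; the function name and the reward scores are made up for illustration, and this page does not state which loss any particular system uses.

```python
import math


def pairwise_ranking_loss(reward_preferred, reward_rejected):
    # Equals -log(sigmoid(reward_preferred - reward_rejected)),
    # computed in a form that avoids overflow for large differences.
    difference = reward_preferred - reward_rejected
    if difference >= 0:
        return math.log1p(math.exp(-difference))
    return -difference + math.log1p(math.exp(difference))


# Made-up reward scores for two responses to the same prompt.
loss_when_ranking_agrees = pairwise_ranking_loss(2.0, 0.5)
loss_when_ranking_disagrees = pairwise_ranking_loss(0.5, 2.0)
print(loss_when_ranking_agrees < loss_when_ranking_disagrees)  # → True
```

The loss is small when the reward model scores the human-preferred response higher, and it grows as the scores contradict the human ranking.
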
""")

demo.launch()