import gradio as gr

# Load the hosted BLOOM model through the Hugging Face Inference API.
api = gr.Interface.load("models/bigscience/bloom")


def complete_with_gpt(text):
    # Use the last 50 characters of the text as context
    # return text[:-50] + api(text[-50:])
    # Use the last 100 characters of the text as context;
    # everything before that is kept unchanged.
    return text[:-100] + api(text[-100:])


with gr.Blocks() as demo:
    # Page header. Components only render when created inside the Blocks context.
    gr.Markdown("""
# Big Science Bloom is a 176B Parameter Large Language ML Model.
# Big Science Papers and Code - Exciting AI Developments! 🤖💻🔬
""")

    with gr.Row():
        textbox = gr.Textbox(placeholder="Type here and press enter...", lines=14)
        with gr.Column():
            btn = gr.Button("Generate")

    # On click, replace the textbox contents with the model's completion.
    btn.click(complete_with_gpt, textbox, textbox)

    with gr.Row():
        gr.Markdown("""
# Example of how to prompt

Create a repeating pattern of text. In this example I label each line with a language name. After each generated line, add another language heading and click Generate to produce the next line.

English: Hi my name is Aaron. I am a computer scientist and senior principal engineer.
Japanese: 私はアランです。コンピューター科学者とプログラ

English: Hi my name is Aaron. I am a computer scientist and senior principal engineer.
Chinese: 你好,我叫Aaron。我是一个计算机科学家和高级首席工程师。

English: Hi my name is Aaron. I am a computer scientist and senior principal engineer.
Spanish: Hola, me llamo Aaron. Soy un científico de la computación y un ingeniero principal.

English: Hi my name is Aaron. I am a computer scientist and senior principal engineer.
Hindi: नमस्ते, मेरा नाम है Aaron. मैं एक कंप्यूटर वैज्ञानिक और वरिष्ठ प्रमुख इंजीनियर हूँ।
French: Bonjour, je m'appelle Aaron. Je suis un scientifique en informatique et un ingénieur senior.

## Language Models 🗣️

🏆 BLOOM sets a new record for the most performant and efficient AI model in science!
🌸

### Comparison of Large Language Models

| Model Name | Model Size (in Parameters) |
| ----------------- | -------------------------- |
| BigScience-tr11-176B | 176 billion |
| GPT-3 | 175 billion |
| OpenAI's DALL-E 2 (text-to-image) | 3.5 billion |
| NVIDIA's Megatron | 8.3 billion |
| Transformer-XL | 250 million |
| XLNet-Large | 340 million |

## ChatGPT Datasets 📚

- WebText
- Common Crawl
- BooksCorpus
- English Wikipedia
- Toronto Books Corpus
- OpenWebText

## ChatGPT Datasets - Details 📚

- **WebText:** A dataset of web pages scraped from outbound Reddit links that received at least 3 karma. This dataset was used to pretrain GPT-2.
    - [Language Models are Unsupervised Multitask Learners]( by Radford et al.
- **Common Crawl:** A dataset of web pages from a variety of domains, which is updated regularly. A filtered version of this dataset was used to pretrain GPT-3.
    - [Language Models are Few-Shot Learners]( by Brown et al.
- **BooksCorpus:** A dataset of over 11,000 books from a variety of genres.
    - [Aligning Books and Movies: Towards Story-like Visual Explanations by Watching Movies and Reading Books]( by Zhu et al.
- **English Wikipedia:** A dump of the English-language Wikipedia as of 2018, with articles from 2001-2017.
    - [Improving Language Understanding by Generative Pre-Training]( Space for Wikipedia Search
- **Toronto Books Corpus:** A dataset of over 7,000 books from a variety of genres, collected by the University of Toronto.
    - [Massively Multilingual Sentence Embeddings for Zero-Shot Cross-Lingual Transfer and Beyond]( by Artetxe and Schwenk.
- **OpenWebText:** An open-source recreation of WebText, built from web pages linked on Reddit and filtered to remove content that was likely to be low-quality or spammy. GPT-3 itself was pretrained on WebText2, OpenAI's expanded version of WebText, rather than on this open reproduction.
    - [Language Models are Few-Shot Learners]( by Brown et al.

## Big Science Model 🚀

- 📜 Papers:
    1. BLOOM: A 176B-Parameter Open-Access Multilingual Language Model [Paper](
    2. Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism [Paper](
    3. 8-bit Optimizers via Block-wise Quantization [Paper](
    4. Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation [Paper](
    5. [Other papers related to Big Science](
    6. [217 other models optimized for use with Bloom](
- 📚 Datasets:
    1. **Universal Dependencies:** A collection of annotated corpora for natural language processing in a range of languages, with a focus on dependency parsing.
        - [Universal Dependencies official website.](
    2. **WMT 2014:** The ninth edition of the Workshop on Statistical Machine Translation, featuring shared tasks on translating between English and various other languages.
        - [WMT14 website.](
    3. **The Pile:** An 825 GiB English-language corpus of diverse text, assembled from 22 smaller datasets.
        - [The Pile official website.](
    4. **HumanEval:** A benchmark of 164 hand-written Python programming problems for evaluating code generation.
        - [Evaluating Large Language Models Trained on Code]( by Chen et al.
    5. **FLORES-101:** A dataset of parallel sentences in 101 languages, designed for evaluating multilingual machine translation.
        - [The FLORES-101 Evaluation Benchmark for Low-Resource and Multilingual Machine Translation]( by Goyal et al.
    6. **CrowS-Pairs:** A dataset of sentence pairs, designed for measuring social biases in masked language models.
        - [CrowS-Pairs: A Challenge Dataset for Measuring Social Biases in Masked Language Models]( by Nikita Nangia, Clara Vania, Rasika Bhalerao, and Samuel R. Bowman.
    7. **WikiLingua:** A multilingual abstractive summarization dataset in 18 languages, built from WikiHow articles.
        - [WikiLingua: A New Benchmark Dataset for Cross-Lingual Abstractive Summarization]( by Faisal Ladhak, Esin Durmus, Claire Cardie, and Kathleen McKeown.
    8. **MTEB:** A benchmark for evaluating text embeddings across a range of tasks such as classification, clustering, and retrieval.
        - [MTEB: Massive Text Embedding Benchmark]( by Niklas Muennighoff, Nouamane Tazi, Loïc Magne, and Nils Reimers.
    9. **xP3:** A multilingual collection of prompted datasets covering 46 languages, used to fine-tune the BLOOMZ and mT0 models.
        - [Crosslingual Generalization through Multitask Finetuning]( by Muennighoff et al.
    10. **DiaBLa:** A dataset of English-French bilingual dialogues, designed for evaluating machine translation.
        - [DiaBLa: A Corpus of Bilingual Spontaneous Written Dialogues for Machine Translation]( by Rachel Bawden, Eric Bilinski, Thomas Lavergne, and Sophie Rosset.
- 📚 Dataset Papers with Code
    1. [Universal Dependencies](
    2. [WMT 2014](
    3. [The Pile](
    4. [HumanEval](
    5. [FLORES-101](
    6. [CrowS-Pairs](
    7. [WikiLingua](
    8. [MTEB](
    9. [xP3](
    10. [DiaBLa](

# Deep RL ML Strategy 🧠

The AI strategies are:

- Language Model Preparation using Human Augmented with Supervised Fine Tuning 🤖
- Reward Model Training with Prompts Dataset Multi-Model Generate Data to Rank 🎁
- Fine Tuning with Reinforcement Reward and Distance Distribution Regret Score 🎯
- Proximal Policy Optimization Fine Tuning 🤝
- Variations - Preference Model Pretraining 🤔
- Use Ranking Datasets Sentiment - Thumbs Up/Down, Distribution 📊
- Online Version Getting Feedback 💬
- OpenAI - InstructGPT - Humans generate LM Training Text 🔍
- DeepMind - Advantage Actor Critic Sparrow, GopherCite 🦜
- Reward Model Human Preference Feedback 🏆

For more information on specific techniques and implementations, check out the following resources:

- OpenAI's paper on [GPT-3]( which details their Language Model Preparation approach
- The paper on [SAC]( which describes the Soft Actor-Critic algorithm, an actor-critic method related to the Advantage Actor Critic approach listed above
- OpenAI's paper on [Reward Learning]( which explains their approach to training Reward Models
- OpenAI's blog post on [GPT-3's fine-tuning process](
""")

demo.launch()