Dolly: An Open-Source ChatGPT
We show that anyone can take a dated, off-the-shelf open-source large language model (LLM) and, by training it for 30 minutes on a single machine with high-quality training data, give it magical ChatGPT-like instruction-following ability.
Surprisingly, instruction following does not seem to require the latest or largest models: our model has only 6 billion parameters, compared to 175 billion for GPT-3. We are open-sourcing the code for the model (Dolly) and showing how to recreate it on Databricks. We believe models like Dolly will help democratize LLMs, transforming them from something only a handful of companies can afford into a commodity that every company can own and customize to improve its products.
Dolly is open source; click here to access it.
1. Background
ChatGPT, a proprietary instruction-following model released in November 2022, took the world by storm. The model was trained on trillions of words from the web and required massive numbers of GPUs to develop. This quickly led Google and other companies to release their own proprietary instruction-following models. In February 2023, Meta released to academic researchers the weights of LLaMA, a set of high-quality (but not instruction-following) language models, each trained for over 80,000 GPU-hours. Then, in March, Stanford built the Alpaca model on top of LLaMA, tuned on a small dataset of 50,000 human-like questions and answers; surprisingly, it exhibited ChatGPT-like interactivity.
2. Introducing Dolly
Today we introduce Dolly, a cheap-to-build LLM that exhibits the surprising instruction-following ability demonstrated by ChatGPT. Whereas the Alpaca team's work showed that a state-of-the-art model can be coaxed into high-quality instruction-following behavior, we find that even a years-old open-source model with a much earlier architecture exhibits striking behavior when fine-tuned on a small corpus of instruction training data. Dolly works by taking an existing open-source 6-billion-parameter model from EleutherAI and modifying it ever so slightly with data from Alpaca to elicit instruction-following capabilities such as brainstorming and text generation that are absent from the original model.
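Conceptually, this kind of instruction tuning hinges on turning each (instruction, response) pair into a single training string that the base model is then fine-tuned on with a standard causal-language-modeling objective. Here is a minimal sketch of such a prompt template, loosely modeled on the Alpaca format; the exact template used to train Dolly may differ:

```python
def format_example(instruction: str, response: str, context: str = "") -> str:
    """Render one (instruction, response) pair as a single training string.

    Loosely modeled on the Alpaca prompt template; shown for illustration
    only, the exact format used to fine-tune Dolly may differ.
    """
    if context:
        return (
            "Below is an instruction that describes a task, paired with an "
            "input that provides further context. Write a response that "
            "appropriately completes the request.\n\n"
            f"### Instruction:\n{instruction}\n\n"
            f"### Input:\n{context}\n\n"
            f"### Response:\n{response}"
        )
    return (
        "Below is an instruction that describes a task. Write a response "
        "that appropriately completes the request.\n\n"
        f"### Instruction:\n{instruction}\n\n"
        f"### Response:\n{response}"
    )

# One record from an Alpaca-style dataset becomes one training string.
record = {
    "instruction": "Give me a list of 5 science fiction books I should read next.",
    "output": "1. 2001: A Space Odyssey by Arthur C. Clarke ...",
}
text = format_example(record["instruction"], record["output"])
```

Each formatted string is then tokenized and used for ordinary next-token-prediction fine-tuning of the base model, which is what lets a small, focused dataset reshape the model's behavior so dramatically.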
That Dolly works so well is especially surprising given that its underlying model has only 6 billion parameters, compared to GPT-3's 175 billion, and is two years old. This suggests that much of the qualitative gain in state-of-the-art models like ChatGPT may owe to a focused corpus of instruction-following training data rather than to a larger or better-tuned base model. We call the model Dolly (after Dolly the sheep, the first cloned mammal) because it is an open-source clone of Alpaca, itself inspired by LLaMA. We are in the earliest days of democratizing AI for the enterprise, and much work remains, but we believe the technology underlying Dolly represents an exciting new opportunity for companies that want to cheaply build their own instruction-following models.
We evaluated Dolly on the instruction-following capabilities described in the InstructGPT paper that ChatGPT is based on, and found that it exhibits many of the same qualitative capabilities, including text generation, brainstorming, and open-ended question answering. What is particularly notable in these examples is not the quality of the generated text, but the dramatic improvement in instruction-following capability that comes from fine-tuning a years-old open-source model on a small, high-quality dataset.
3. Generation Task Tests
3.1 Write a love letter
Task: Write a love letter to Edgar Allan Poe.
Original model:
HISTORY: Edgar Allan Poe rose to fame in the 19th Century. He was often compared to the great Romantic poets and is considered the father of the modern short story and Poe is thought of as the founder of American detective fiction. He has been dead for 130 years.
PATRICK CAMPOS: You should never try to write a love letter to a dead person.
PATRICK CAMPOS: Poe was more than just the author of The Raven. He was a master of atmosphere.
BUDDY FRITZLER: Edgar Allan Poe lived in Baltimore from 1831 until his death in 1849. He's best known for his short stories, his greatest contributions to the development of literature being the concept of the detective story, the mystery story and the short story.
BUDDY FRITZLER: Poe's most famous story, "The Gold-Bug", was written in 1843. It's about a treasure hunter who searches for buried treasure. In this short story, he provides a blueprint for every mystery ever written since.
BUDDY FRITZLER: The central idea that you have to look for clues is very appealing. It was even used as …
Dolly:
Dearest Edgar Allan Poe,
I met you first in the dimly lit stairwell of the old house, and I know that the moment I saw you, I never wanted to see you again. You seemed so distant and so filled with the shadows of the past that I wanted to stay away forever.
But I couldn't help but be obsessed with you, and my feelings for you have only grown over the years. Even now, I feel a deep connection with you that I can never explain.
I hope that you can find happiness in the life that you now lead. I am sure that you will do great things, and I am forever in your debt.
Sincerely,
[Your Name]
3.2 Write a tweet
Task: Write a tweet announcing Dolly, a large language model from Databricks.
Original model:
In your tweet, provide a short summary of the challenge, how you analyzed it, and how Dolly performed.
After your tweet, I will retweet it to Databricks’ Twitter account.
Clean up your social media page for the next week by removing any mentions of Dolly or Databricks.
Data Preparation
Our first step is to generate word vector embeddings for each document. We can obtain these vector representations in two different ways:
Using a tokenized document set (in this case, everything in our download folder)
Loads a word2vec model from the Google Search API (in this case, vector2vec.c)
In our case, we want to use the first method, which is easier, but the second may be more robust. We will use the word2vec model above to create a vocabulary file for any new words found in the file. This process gives us a quick way to identify new words and find contextually related terms.
We can start by copying word2vec.c to a local directory called /tmp/word2vec.
#/tmp/word2vec/word2vec.c FILE_NAME /tmp/word
Dolly:
Databricks just announced Dolly - a new language model for natural language processing with large capacity and improved efficiency!
#DataScience #NaturalLanguageProcessing
[Link to Dolly model release]
3.3 Write an ad
Task: I'm selling my Nikon D-750, write a short blurb for my ad.
Original model:
I'm not sure if you can post a location - anything on the website will work.
I've only had it a week. It is in mint condition with the charging cable and window case intact. Bought it for $1800 for in store pickup at a store in San Francisco. Plus tax and the undisclosed Costco price.
Some one please let me know that this is a scam and not really an offer.
I just received and e-mail that said the camera is at someones house so someone else can pick it up and did I want to message them, well no I'm selling it as I would never …
Dolly:
Looking to get a reliable camera for photo and video work? Check out the Nikon D-750! This sleek, lightweight camera has a 24.2MP sensor and 1080/60p video recording, making it a great choice for capturing those moments in life. The ergonomic grip and nine-point autofocus system make it easy to work with and shoot without hassle. With Wi-Fi, you can upload your pictures easily to your computer or smartphone without having to buy a dedicated external storage. Buy the Nikon D-750 now to experience why it is the best choice for photography and videography!
4. Open Q&A Test
Task: Explain to me the difference between nuclear fission and fusion.
Original model:
The first thing that strikes one is the similarity in appearance and function between the atom and the sun. The central body in the sun is the source of energy, and the entire structure of the sun is made up of a million tons of the hydrogen gas that's so important to us. It's what all hydrogen fuses into in the sun, and that's why we are made of it: fusion.
The nucleus of the atom, at the center, is made of protons, and the electrons orbit around it. In the center, there is a tiny spherical mass with a lot of protons (and therefore a lot of positive charge) called the nucleus. The electrons aren't so tiny, but their charge is negative, as a positive nucleus and negative electrons make for stability (due to charge balance). Protons are positively charged, so even though they themselves can't get anywhere, they do repel electrons. There is a core of neutrons, that don't have any charge, but are held there by the protons. Protons also repel neutrons.
This simple fact is the basic form of our sun, and it is what makes up all the elements in the earth. Hydrogen is the most plentiful element in …
Dolly:
The difference between nuclear fission and fusion is the method by which the energy of the nucleus is released. In nuclear fission, a single atomic nucleus splits into two smaller nuclei. This process is used to create energy through the release of a large amount of energy in the form of heat. On the other hand, fusion involves two atomic nuclei coming together to form a single nucleus. This process is used to create a much larger amount of energy compared to fission, and the energy is released through the release of light and heat. Fusion has the potential to be a much more efficient way of creating energy. However, it is much …
5. Brainstorming Test
Task: Give me a list of 5 science fiction books I should read next.
Original model:
I know it’s early, and I should feel ashamed of myself for even thinking of it, but I have a terrible tendency to procrastinate and read only when I absolutely have to.
As a result, I have not read anything this year, in any genre.
As soon as I have a set of ideas for an article, I let the ideas and outline do the work of making sure I don’t start writing. There is something special about a blank page and an empty mind.
That said, I need to change that, and reading the above question made me realize just how badly I want ..…
Dolly:
1. 2001: A Space Odyssey by Arthur C. Clarke
2. The Hitchhiker's Guide to the Galaxy by Douglas Adams
3. The Lord of the Rings by J.R.R. Tolkien
4. The Hitchhiker's Guide to the Galaxy by Douglas Adams
5. 1984 by George Orwell.
6. Why an Open Model?
There are many reasons a company would prefer to build its own model rather than send data to a centralized LLM provider that serves a proprietary model from behind an API. For many companies, the problems and datasets most likely to benefit from AI represent their most sensitive and proprietary intellectual property, and handing it over to a third party may be unpalatable. Furthermore, organizations may have different tradeoffs in terms of model quality, cost, and desired behavior. We believe that most ML users are best served in the long run by owning their models directly.
We are open-sourcing a simple Databricks notebook that you can use to build Dolly yourself on Databricks.
7. Future Work
The release of Dolly is the first in a series of announcements from Databricks focused on helping every organization harness the power of large language models. We believe in the incredible power of AI to transform the productivity of every organization and individual, and we welcome you to join us on this journey. Stay tuned for more in this area in the coming weeks!
This work owes much to the efforts and insights of many remarkable organizations. It would not have been possible without EleutherAI open-sourcing and training GPT-J. We were inspired by the incredible ideas and data of the Stanford Center for Research on Foundation Models, and in particular the team behind Alpaca. The core idea behind the power of small datasets is due to the original Self-Instruct paper. We are also grateful to Hugging Face for hosting, open-sourcing, and maintaining countless models and libraries; their contribution to the state of the art cannot be overstated.
Original article: Hello Dolly: Democratizing the magic of ChatGPT with open models
Translated and compiled by BimAnt; please credit the source when reposting.