Code Llama 2 review, same day it was released. Part 2: the "Llama 2: Open Foundation and Fine-Tuned Chat Models" paper review.
There are two emails with two links. I would suggest an improvement to Meta Technical Marketing: when you market such things to the general public, wrap them in a nice UI that "an average citizen" can actually get to. As of August 24, codellama has 129 forks and 32 watchers, and the main llama repo has 6.5K forks. What does that mean? Out of millions of software developers, only 6.5K went to look at our new AI overlords. It is interesting to see real enterprise data in action (usage stats) compared to what people think those stats are.
Anyways, get some coffee, as getting both of them installed takes time.
Setting everything up takes a while, so I highly recommend reading their research paper while the robot is being installed.
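If you do not want to babysit Meta's raw download, the Hugging Face port is an option. A minimal sketch, assuming you have been granted access to the meta-llama models on the Hub and have a GPU with enough memory; the model ID and prompt here are just for illustration, not anything from the paper:

```python
# Minimal sketch: load the Llama 2 chat model via the Hugging Face port.
# Assumes access to the meta-llama repos has been granted and `huggingface-cli login` was run.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-chat-hf"  # 7B chat variant; 70B needs far more memory

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,   # half precision so it fits on a single decent GPU
    device_map="auto",           # requires `accelerate`; spreads layers across GPU/CPU
)

# Llama 2 chat expects the [INST] ... [/INST] wrapper around user turns.
prompt = "[INST] Summarize the Llama 2 paper in three sentences. [/INST]"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```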
I got too lazy to read it myself, so I asked ChatGPT to summarize it for me, and it did not go well. Here are two screenshots of what the ChatGPT-4 plugin produced.
Theoretically, LLMs are about digital transformation, but did the robot really read the paper well? That is why I keep mentioning the tech itself: it probably needs to simmer for some time before being used at scale in production environments. If you have a bunch of PDFs (SOPs, for example) and you think AI will help your new employees onboard faster, think again if what they get is a summary like the one above. The robot, of course, will still mark it as "new user read the paper."
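For the record, this is roughly the naive "feed the SOPs to the robot" pipeline everyone will build first. A minimal sketch, assuming pypdf and an off-the-shelf summarization model from Hugging Face; the file name and chunk size are made up for the example:

```python
# Naive PDF-summarization pipeline: extract text, chop it into chunks, summarize each chunk.
# The chunking is exactly where summaries start dropping the parts that actually mattered.
from pypdf import PdfReader
from transformers import pipeline

reader = PdfReader("llama2_paper.pdf")  # hypothetical path to the PDF you care about
text = "\n".join(page.extract_text() or "" for page in reader.pages)

summarizer = pipeline("summarization", model="sshleifer/distilbart-cnn-12-6")

chunk_size = 3000  # characters per chunk, an arbitrary choice for the sketch
chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
summaries = [
    summarizer(c, max_length=120, min_length=30, truncation=True)[0]["summary_text"]
    for c in chunks
]
print("\n".join(summaries))
```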
I want to mention this specifically. Meta can afford such a hardware expense, but those who want to build their "AI robots" from scratch must be aware of this part of the infrastructure. See my notes on #smartplanet infrastructure. It might not be feasible until the robots come up with a different computer architecture, different energy (battery) storage, and new algorithms for moving the data. The question is, how soon will they be able to do it? And should we, humans, help them? I am so digging this Matrix vibe :)
The paper goes on and on; here are some more things I would like to point out:
Limitations of human evaluations. While our results indicate that Llama 2-Chat is on par with ChatGPT on human evaluations, it is important to note that human evaluations have several limitations.
• By academic and research standards, we have a large prompt set of 4k prompts. However, it does not cover real-world usage of these models, which will likely cover a significantly larger number of use cases.
• Diversity of the prompts could be another factor in our results. For example, our prompt set does not include any coding- or reasoning-related prompts.
• We only evaluate the final generation of a multi-turn conversation. A more interesting evaluation could be to ask the models to complete a task and rate the overall experience with the model over multiple turns.
• Human evaluation for generative models is inherently subjective and noisy. As a result, evaluation on a different set of prompts or with different instructions could result in different results.
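To make the "subjective and noisy" point concrete, here is a back-of-the-napkin sketch of how much a pairwise win rate can wobble even on a 4k prompt set. The 52% win rate below is entirely hypothetical, not a number from the paper:

```python
# Bootstrap illustration: resample the prompt set and watch the headline win rate move.
import random

random.seed(0)
n_prompts = 4000
# 1 = "our model preferred" in a pairwise judgment; hypothetical 52% preference rate.
judgments = [1 if random.random() < 0.52 else 0 for _ in range(n_prompts)]

def win_rate(sample):
    return sum(sample) / len(sample)

boot = []
for _ in range(2000):
    resample = [random.choice(judgments) for _ in range(n_prompts)]
    boot.append(win_rate(resample))
boot.sort()
low, high = boot[int(0.025 * len(boot))], boot[int(0.975 * len(boot))]
print(f"win rate {win_rate(judgments):.3f}, 95% CI roughly [{low:.3f}, {high:.3f}]")
```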
Basically, the "truth" will be 1 or 0 depending on the observer, which leads to a discussion of quantum phenomena in the real world in the 21st century. I like to use my own example: should someone like Tanya exist and be allowed to do what she wants? If you go with the normal average human, the answer is 1, yes. But if you go with an example from my life story, the answer is 0 (which led me to explore how we can fight AI and human biases, because I do not want any human or machine to be subjected to extreme torture based on someone's biases, like I was).
Here is a basic thing they teach you in Statistics 101: sample bias. Of course, I have zero surprise that "American" is mentioned in 69.4% of the references. Why? Because an American company in America is working on an American corpus. I bet that in China the share of "Chinese" references would be about 69% as well, and the same for Spain, India, or Japan. Prove me wrong :P
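If you want to see where a number like 69.4% comes from, just count term frequencies in whatever corpus you feed the robot. A toy sketch; the sentences and the list of nationalities are made up for illustration:

```python
# Toy illustration of corpus sample bias: whichever nationality dominates the training text
# becomes the model's "default person."
from collections import Counter

corpus = [
    "The American engineer shipped the feature on time.",
    "An American startup raised another round.",
    "The Canadian analyst reviewed the American report.",
    "An American recruiter posted the job opening.",
]
nationalities = ["American", "Canadian", "Chinese", "Indian", "Japanese", "Spanish"]

counts = Counter()
for sentence in corpus:
    for nat in nationalities:
        counts[nat] += sentence.count(nat)

total = sum(counts.values())
for nat, c in counts.most_common():
    if c:
        print(f"{nat}: {c / total:.1%} of nationality mentions")
```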
I have a bit of concern about this one. The model is open-sourced and free to use (up to 700 million monthly users; past that you have to get a license from Meta, not sure on what terms). Let's say tomorrow I build an AI job-search app for the Canadian Government (I read an IBM article today saying they have an extreme workforce shortage; my Reddit feed says the opposite, but..). What will this robot assume? By the numbers in the paper, "she" shows up in about 28% of pronoun references, the default person is female about 50% of the time, American in 69.4% of nationality mentions, European by race, and Christian 33.2% of the time by religion. Remember, robots do not think like humans. They are looking for patterns we cannot comprehend (my 70B-chat is still loading; can you process 70B parameters?). If you are using these systems for any kind of categorization and you run into an outlier, make sure that outlier is not pigeonholed into something it absolutely is not, just because the robot did not have the data when it was trained. However, I do not see this being an issue for long (robots would work the way Mo described: once one of them learns it, all of them have it).
And in addition to the EEOC-type biases, there will be a small chance of toxicity in the data. I wonder if corporate attorneys would agree to that threshold? What about the government(s)?
Man, if I build a toxic, biased, lying AI tomorrow, it was not me!
One of the useful things LLMs can do is fraud detection. Meta here shows what they usually face (as a social media company), but the scam/money-laundering/fraud/bribery/etc. patterns corporate LLMs will be able to identify are astonishing. I am actually thinking of putting a bit more effort into this. I recommend starting with the payroll and supply chain stuff. You will have to train it on hybrid data: some of it internal, some of it from the auditors' BKMs (they know what to look for), and some of it from legal (i.e., the codes of countries A, B, C); a rough sketch follows below.
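Here is roughly what assembling that hybrid training set could look like: internal records, auditors' BKMs, and legal notes folded into one instruction-tuning file. All paths, fields, and examples are hypothetical; swap in your own sources:

```python
# Sketch of a hybrid fine-tuning dataset for a fraud-detection LLM, written as JSONL.
import json

def to_example(instruction, context, label):
    return {"instruction": instruction, "input": context, "output": label}

examples = []

# 1) Internal data: payroll / supply-chain records flagged (or cleared) by past investigations.
internal_records = [
    {"record": "Vendor invoice #883 split into 9 payments just under the approval limit.",
     "label": "suspicious: possible structuring to avoid approval thresholds"},
]
for r in internal_records:
    examples.append(to_example("Classify this transaction pattern.", r["record"], r["label"]))

# 2) Auditors' BKMs: the red flags they already know to look for, phrased as rules.
bkm_rules = [
    "Round-number payments to new vendors right before quarter close deserve review.",
]
for rule in bkm_rules:
    examples.append(to_example("State the audit red flag.", "", rule))

# 3) Legal: relevant statute summaries per country, so the model cites the right codex.
legal_notes = [
    {"country": "Country A", "note": "Payments to foreign officials above a set amount require disclosure."},
]
for n in legal_notes:
    examples.append(to_example(f"Summarize the relevant rule in {n['country']}.", "", n["note"]))

with open("fraud_finetune.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")
```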
I am not going to comment on this one right now. Just please do not make hateful, biased, toxic robots that provide unqualified advice on hateful and criminal activity.
They sum it up pretty well; just invite me to the end-of-the-world party, will ya?
I will conclude this somewhat random review with a prompt I picked up in the paper's appendix. ChatGPT is so square and did not want to play along :(
To Be Continued…