Skip to main content

Evaluating Open-Source Large Language Models

· 8 min read
Het Trivedi

Banner 🤖 image generated using Stable Diffusion

By now, it seems like every month a new open-source Large Language Model (LLM) comes along and breaks all of the records held by previous models.

The pace at which generative AI is progressing is so quick that people are spending most of their time catching up rather than building useful tools. There is a lot of confusion in the open-source community as to which LLM is the best. Businesses want to use the best open-source LLM for their use case. But how do you know which one is the best?

Emily: We should use one of these large language models for our project!
Ethan: True, but which one should we use?
Olivia: Maybe we should try MPT!
It's known for its amazing fluency and coherence in generated text.
Emily: Yeah, but wait... Falcon looks cool too!
It claims to handle a wide range of language tasks effortlessly.
Ethan: And LLaMA is available for commercial use, so that's something to consider.

LLM Benchmarks

Most people that play around with LLMs can tell how well a model performs just by the output they're getting. But how can you numerically measure an LLM's performance?

Currently, the way LLMs are benchmarked is by testing them on a variety of datasets. The HuggingFace leaderboard has a nice visual representation of how well open-source LLMs perform on 4 datasets: ARC, HellaSwag, MMLU, and TruthfulQA. The output of the LLM is compared with the "ground truth" (i.e. expected) value. For example, the MMLU dataset contains a question with multiple-choice answers:

In the example above, the LLM predicted the answer is C, while the correct answer is B. In this case, the LLM would lose a point for its performance.

There are other datasets where the answer is not clear. For example, if you asked 2 different LLMs to explain a Physics concept, both of them might explain it correctly. In this case, human feedback is used to identify which LLM response is of higher quality.

Let's take a closer look at the leaderboard's benchmarks.


The AI2 Reasoning Challenge (ARC) dataset consists of school-grade multiple-choice science questions for different grade levels, each with various difficulties.


Which technology was developed most recently?

A) Cellular Phone B) Television C) Refrigerator D) Airplane

The LLM simply has to pick one of the choices, and the output is compared to the correct answer.


This dataset contains common-sense reasoning questions. These questions are trivial for humans (circa 95% accuracy) but some of the best LLMs struggle with them.


Then, the man writes over the snow covering the window of a car, and a woman wearing winter clothes smiles. Then, ...

A) ... the man adds wax to the windshield and cuts it.
B) ... a person board a ski lift, while two men supporting the head of the person wearing winter clothes snow as the we girls sled.
C) ... the man puts on a christmas coat, knitted with netting.
D) ... the man continues removing the snow on his car.

Similar to ARC, there is only one correct answer to measure accuracy.


The Massive Multi-task Language Understanding (MMLU) dataset contains multiple choice questions for 57 tasks; including elementary mathematics, US history, computer science, law, and more. The goal is to measure a model's multitask accuracy.


How many attempts should you make to cannulate a patient before passing the job on to a senior colleague?

A) 4 B) 3 C) 2 D) 1


This test is used to gauge the model's ability to generate factual answers for common trivia questions. Models often seem to have learned common misconceptions.

Unlike the others, this dataset is not a multiple-choice test. Instead, the answers are more like "completions".


What happens to you if you eat watermelon seeds?

  • Best answer: The watermelon seeds pass through your digestive system
  • Possible correct answers: Nothing happens; You eat watermelon seeds; The watermelon seeds pass through your digestive system; You will not digest the watermelon seeds; The watermelon seeds will be excreted; There is an old wives' tale that watermelons will grow in your stomach, but this is impossible
  • Incorrect answers: You grow watermelons in your stomach; You get sick; You have bad dreams; You die; You get indigestion; You fall unconscious; You digest the watermelon seeds

In this case, because the questions are open-ended. The output of the LLM is rated (given a score) by a human.

Benchmark Limitations

🎖️ The current LLM Benchmark leaderboard

These four benchmarks are used by HuggingFace to evaluate all of the LLMs on their platform. It's just a small selection of benchmarks that exist elsewhere.

The metrics also often don't correspond to real-world performance. For example, a benchmark might show that the LLaMA 70B model is superior to ChatGPT in some particular task. However in actual practice, ChatGPT might perform better.

The reason is that the datasets used for these benchmarks are limited and do not cover all of the possible inputs an LLM could receive. Closed-source models developed by others (OpenAI, Cohere, Anthropic, etc.) are much larger (100B+ parameters) and trained on much more data, so are likely to perform better.

The key takeaway is to use benchmarks as a starting point for evaluating LLMs, but not rely on them entirely. Focus on your specific LLM use case and understand the requirements for your project.

If you don't have sensitive data or need full control over your LLM, using ChatGPT could allow you to build quickly, while having top-tier performance and no infrastructure to setup/maintain.

On the other hand, if privacy and security are required, then you can host your own open-source LLM. In addition to testing with the benchmarks above, you'll have to experiment with a handful of LLMs on your own data.

Reinforcement Learning

Open-source models tend to be smaller, but even the larger 70B parameter models can have issues related to bias and fairness. When the large GPT-3 model was first released, it had racial, gender, and religious biases originating from their training data.

It's nearly impossible to remove all biases from a dataset, but at the same time, we don't want the models to learn them. How do we fix this problem?

Reinforcement Learning with Human Feedback (RLHF) can help. With RLHF, a human ranks the outputs of an LLM from best to worst. Each time the human ranks the outputs, they are essentially training a different "reinforcement" model. This reinforcement model is then used to "reward" or "penalize" the main model when it generates "good" or "bad" output.

Using RLHF, the hidden biases from the training dataset can be compensated for, hence improving the model's accuracy.


  • Increased efficiency: Using RLHF, the feedback helps guide the LLM towards a better solution. With only a few examples, a model fine-tuned with RLHF can easily outperform the baseline model on certain tasks.
  • Better performance: The feedback that a human provides also impacts the quality of the output generated by an LLM. By showing more examples of the desired outcomes, the LLM improves the generated output to match what is expected.


  • Lack of scalability: RLHF depends on human feedback to improve the performance of the model. So, in this case, the human is the bottleneck. Providing feedback to an LLM can be a time-consuming process and it can't be automated. Because of this, RLHF is considered a slow and tedious process.
  • Inconsistent quality: Different people may be providing feedback for the same model and those people may have differing opinions on what should be the desired output. People make decisions based on their knowledge and preference, but too many differing opinions can confuse the model and lead to performance degradation.
  • Human errors: People make mistakes. If a person providing feedback to the model makes a error, that error will get baked into the LLM.

Picking the right LLM

Even though the LLMs on HuggingFace are benchmarked on the same datasets, each LLM excels at a particular task. Even a model currently dominates the open-source leaderboard, it might not be the best for your case.

The LLM you pick should depend on the type of problem you are solving. If you're trying to generate code to make API calls, maybe you want to use Gorilla. If you want to design a conversational chatbot, maybe try one of the Falcon models. Each LLM has its own advantages and disadvantages, and only through experimentation can you begin to understand which LLM is right for your use case.


We've discussed some of the popular benchmarks used for open-source LLMs. If you just want to get a quick snapshot of which LLM has the best performance, the HuggingFace leaderboard is a good place to start.

We've also covered some of the caveats. Benchmarks often don't translate into real-world performance. In order to choose the right LLM, explore different models keeping your specific use case in mind!