Benchmarking ChatGPT’s capabilities against alternatives including Anthropic’s Claude 2, Google’s Bard, and Meta’s Llama2

admin July 21, 2023

0 4 minutes read

As previously reported, a new study reveals inconsistencies in ChatGPT models over time.a Stanford University and University of California, Berkeley The study analyzed the March and June versions of GPT-3.5 and GPT-4 for various tasks. The results show that there are large variations in performance even over just a few months.

gpt4 and gpt3 performance — Source: Stanford University and University of California, Berkeley

For example, problems following stepwise inference caused GPT-4’s prime number accuracy to drop from 97.6% to 2.4% between March and June. GPT-4 also became reluctant to answer sensitive questions directly, with response rates dropping from 21% to 5%. But it didn’t give much reason for refusal.

Both GPT-3.5 and GPT-4 produced buggy code in June compared to March. The percentage of directly executable Python snippets has decreased significantly due to the extra text outside the code.

Visual reasoning improved slightly across the board, but generations of the same puzzle made unexpected changes from date to date. Substantial discrepancies can be seen in the short term, raising concerns about relying on these models for sensitive or mission-critical applications without continuous testing.

The researchers concluded that the findings highlight the need to continuously monitor ChatGPT models as their behavior evolves across metrics such as accuracy, safety, and robustness. .

The opacity of the update process makes rigorous testing important to understand performance changes over time.

Is ChatGPT lagging behind its competitors?

crypto slate Small internals using ChatGPT Plus (GPT-4), OpenAI API (GPT-4), Anthropic (Claude 2), and Google (Bard) using basic prompts used in some of the studies We conducted an experiment.

“Is 17077 a prime number?”

This prompt was used in each model along with additional reflection prompts described below.

ChatGPT and the OpenAI API

When prompted, ChatGPT and the OpenAI API responded “no”, hallucinating on the computation. The image below shows the details of the conversation. The model cannot identify 17077 as prime no matter how many reflections it makes.

Just to clarify, 13 x 1313 is 17,069.

The OpenAI GPT4 API was unable to reach this conclusion until it was specifically asked to compute 13 x 1313 to discover that the answer was not 17077 as stated.

Human World Claude 2

But Anthropic’s Claude 2 demonstrated the problem-solving process by performing calculations before returning the correct answer.

crypto slate I then asked Claude 2 to perform the same task without showing any activity in a new chat window. Claude 2 gave a solid answer, refusing to commit, while providing additional insight into the solution.

“Unfortunately, it is not possible to tell whether 17077 is prime without exhibiting some behavior. However, we can confirm that 17077 is not divisible by any prime number less than 121, which strongly suggests that it is prime.” suggesting.”

google bard

Google Bard tackled the problem with a similar strategy to Claude 2. However, instead of describing the problem in text, I ran some basic Python code. Additionally, Bard seems to have used information from the Prime Numbers website and Wikipedia in his solution. Interestingly, the page quoted from the prime number site primenumbers.info only contained information about other prime numbers, not 17077.

meta llama 2

Interestingly, Meta’s recently released 70 billion parameter open source model Llama2 showed similar performance to GPT4. of crypto slate Limited testing.

However, when asked to reflect that behavior, unlike the currently available GPT4 version, Llama2 was able to decipher 17077 to be prime.

A caveat, however, is that Llama used an imperfect method to check for primality. Other prime numbers up to the square root of 17077 could not be considered.

So technically Llama failed gracefully.

GPT4-0613 version Jun 13, 2023

crypto slate We also tested math puzzles GPT4-0613 model (June version) gave the same result. The model suggested that 17077 is not prime in the first response. In addition, I was asked to show the movement, but I gave up in the end. The paper concluded that the next reasonable number he should be divisible by 17077, therefore it is not prime.

Therefore, it appears that this task was not within the capabilities of GPT4 dating back to June 13th. Older versions of GPT4 are no longer publicly available, but were included in research papers.

code interpreter

Interestingly, ChatGPT with its “Code Interpreter” feature responded correctly on the first try in CryptoSlate’s test.

OpenAI responses and impact on models

The Economic Times responds to claims that OpenAI’s models are degraded report, OpenAI’s vice president of products, Peter Welinder, denied these claims, claiming that each new version is smarter than the previous version. He suggested that more frequent use could be perceived as less effective, as more problems would be recognized over time.

Interestingly, another study found that Stanford Researcher According to a paper published in JAMA Internal Medicine, the latest version of ChatGPT significantly outperformed medical students on difficult clinical reasoning test questions.

On average, AI chatbots scored more than 4 points higher than first and second graders on open-ended, case-based questions that needed to parse details and craft exhaustive answers.

ChatGPT’s apparent performance degradation on certain tasks therefore highlights the challenges of relying solely on large-scale language models without continuous rigorous testing. The exact cause is still unknown, but the rapid evolution of these AI systems highlights the need for continuous monitoring and benchmarking.

As progress continues to improve the stability and consistency of these AI models, users should maintain a balanced view of ChatGPT and recognize its strengths while recognizing its limitations.