A recent study found that the performance of the popular chatbot ChatGPT fluctuated significantly over time. The study, conducted by researchers at Stanford University, examined how well ChatGPT handled different tasks over several months. These tasks included solving math problems, answering sensitive questions, generating software code, and visual reasoning.
The results were surprising: ChatGPT's abilities were not consistent. The researchers compared two versions of the technology, GPT-3.5 and GPT-4. On math problems, GPT-4 started off strong in March, correctly identifying prime numbers 97.6% of the time. Just three months later, its accuracy had dropped to a mere 2.4%. GPT-3.5, by contrast, improved on the same task, going from 7.4% to 86.8% accuracy.
Similar fluctuations occurred in tasks like writing code and visual reasoning. James Zou, a Stanford computer science professor involved in the study, was surprised by the significant changes in ChatGPT’s performance.
“When we are tuning a large language model to improve its performance on certain tasks, that can actually have a lot of unintended consequences, which might actually hurt this model’s performance on other tasks […]. There’s all sorts of interesting interdependencies in how the model answers things which can lead to some of the worsening behaviors that we observed.”
The shifts in performance are less about the chatbot's accuracy on any specific task than about the unintended consequences of fine-tuning the model. Tweaking the model to improve one task can degrade its performance on others because of complex interconnections within the model.
Unfortunately, because ChatGPT operates like a black box, researchers and the public can’t see how it works. This lack of transparency became more evident when OpenAI decided not to make its code open source. Zou emphasizes the importance of acknowledging these performance shifts and keeping an eye on how the models perform over time.
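Zou's call to keep watching these models over time points to a concrete practice: re-running a fixed set of prompts on a schedule and logging the scores, so that drift between model snapshots shows up in data rather than anecdotes. The sketch below is a minimal, hypothetical illustration of that idea in Python; `query_model` is a stub standing in for whatever chat-model client you actually use, and the prime-number benchmark simply mirrors the kind of task the study measured.

```python
from datetime import date


def is_prime(n: int) -> bool:
    """Ground-truth check used to score the model's answers."""
    if n < 2:
        return False
    return all(n % d for d in range(2, int(n ** 0.5) + 1))


def query_model(prompt: str) -> str:
    """Stub standing in for a hosted chat model; replace with a real client call."""
    return "yes"  # placeholder response so the sketch runs end to end


def run_prime_benchmark(numbers: list[int]) -> float:
    """Ask the model whether each number is prime and return its accuracy."""
    correct = 0
    for n in numbers:
        answer = query_model(f"Is {n} a prime number? Answer yes or no.")
        predicted = answer.strip().lower().startswith("yes")
        correct += predicted == is_prime(n)
    return correct / len(numbers)


if __name__ == "__main__":
    benchmark = [17, 21, 97, 221, 7919]  # fixed test set, reused on every run
    accuracy = run_prime_benchmark(benchmark)
    # Append each run's date and score to a log to make drift visible over time.
    print(f"{date.today()}: accuracy = {accuracy:.1%}")
```

Run against the same benchmark each month, a log like this would surface the kind of swings the Stanford team reported without requiring any access to the model's internals.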
Not only did ChatGPT’s answers become less accurate, but it also stopped explaining its reasoning. That step-by-step explanation is like a student showing their work on a math problem: it helps researchers understand how the AI arrives at its answers. Over time, however, ChatGPT began skipping this step, making its reasoning process harder to study.
In the case of sensitive questions, both GPT-4 and GPT-3.5 initially refused to engage, explaining that the questions were based on discriminatory ideas. By June, however, the models simply declined to answer without explanation, providing less insight into their decision-making.
In short, ChatGPT’s performance can be unpredictable, and understanding its inner workings remains a challenge. The study’s main message is the need to continually monitor, and address, these performance shifts in large language models.