Researchers at Apple have uncovered significant weaknesses in Large Language Models (LLMs) from OpenAI, Meta and other AI developers, raising questions about these models' logical reasoning capabilities.

The study found that minor changes in the phrasing of a question could produce major discrepancies in a model's performance, potentially compromising reliability in scenarios that demand consistent logical reasoning.

The study tested over 20 models, including OpenAI's o1 and GPT-4o, Google's Gemma 2, and Meta's Llama 3, and found that every model's performance declined when the numerical values in a question were changed.

“Our findings reveal that LLMs exhibit noticeable variance when responding to different instantiations of the same question. Specifically, the performance of all models declines when only the numerical values in the question are altered in the GSM-Symbolic benchmark,” the researchers said.

The GSM-Symbolic benchmark was developed by the researchers to overcome the shortcomings of GSM8K, the benchmark previously used to measure the reasoning skills of models. A new benchmark was needed because models may have encountered the older test's questions and answers in their training data, inflating their scores and undermining its accuracy as a measure of reasoning.
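To illustrate the approach the researchers describe, the minimal Python sketch below turns a GSM8K-style word problem into a template and re-samples only its numerical values, so the same underlying question can be posed many times with different numbers. The template text, variable ranges and names here are illustrative assumptions, not the paper's actual data or code.

```python
import random

# Illustrative GSM8K-style template; the numerical slots are re-sampled
# per instance, which is the kind of variation described above.
TEMPLATE = (
    "When {name} watches her nephew, she gets {rate} stickers per hour. "
    "If she watched him for {hours} hours, how many stickers did she get?"
)

def instantiate(seed: int) -> tuple[str, int]:
    """Sample numerical values for one instance and compute its ground truth."""
    rng = random.Random(seed)
    rate = rng.randint(2, 12)   # stickers per hour (assumed range)
    hours = rng.randint(2, 10)  # hours watched (assumed range)
    question = TEMPLATE.format(name="Sofia", rate=rate, hours=hours)
    answer = rate * hours       # ground-truth answer for this instance
    return question, answer

def accuracy(model_answers: list[int], instances: list[tuple[str, int]]) -> float:
    """Fraction of instances a model answered correctly."""
    correct = sum(pred == truth
                  for pred, (_, truth) in zip(model_answers, instances))
    return correct / len(instances)

if __name__ == "__main__":
    # Generate several instantiations of the same underlying question;
    # a model whose reasoning is robust should score similarly on all of them.
    instances = [instantiate(seed) for seed in range(5)]
    for q, a in instances:
        print(q, "->", a)
```

A model that genuinely reasons about the problem should solve every instantiation equally well; the variance the researchers report suggests the models are instead sensitive to the surface details of the question.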

The findings suggest that further research is needed to develop AI models capable of formal reasoning, moving beyond pattern recognition toward more robust and generalised problem-solving skills, the researchers noted.

Published - October 17, 2024 01:47 pm IST