Amazon Surpasses OpenAI's ChatGPT with Multimodal-CoT Generative AI
OpenAI's ChatGPT has been making waves in the AI community for the past two months, with discussions about its potential impact on various fields including business and education. However, tech giants Google and Baidu have since entered the chatbot scene, showcasing their own generative AI technologies. Now, Amazon has entered the race with a new language model that outperforms OpenAI's GPT-3.5 on the ScienceQA benchmark by 16 percentage points, even surpassing human performance.
Recent advancements in AI technology have allowed large language models (LLMs) to perform well on complex reasoning tasks through chain-of-thought (CoT) prompting. However, existing CoT research focuses almost exclusively on the language modality; to elicit CoT reasoning over multiple input modalities such as language and vision, a multimodal-CoT paradigm is needed.
Most existing approaches to multimodal-CoT combine the multiple inputs into a single modality before asking LLMs to perform CoT, but this can lead to information loss and produce hallucinated reasoning patterns. To overcome these limitations, Amazon researchers have developed Multimodal-CoT, which incorporates vision features in a decoupled training framework. The framework separates the reasoning process into two stages: rationale generation and answer inference. By incorporating vision features in both stages, the model is able to produce more convincing rationales and draw more accurate conclusions.
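As a rough illustration of that decoupled two-stage design, the Python sketch below wires rationale generation and answer inference together at inference time. The `ScienceQAModel` class and its `generate(text, vision, target)` interface are hypothetical stand-ins for the fine-tuned model, not the authors' released code.

```python
# A rough sketch of the two-stage Multimodal-CoT pipeline described above.
# The ScienceQAModel interface (a single `generate` method taking text,
# vision features, and a target flag) is a hypothetical stand-in.
from typing import List


class ScienceQAModel:
    """Hypothetical stand-in for the fine-tuned encoder-decoder model."""

    def generate(self, text: str, vision: List[float], target: str) -> str:
        # A real model would decode tokens here; this stub just echoes the task.
        return f"<{target} generated from text + vision>"


def multimodal_cot(model: ScienceQAModel, question: str,
                   vision_features: List[float]) -> str:
    # Stage 1: rationale generation. The model sees the language input
    # (question, context, options) together with the vision features and
    # produces a chain-of-thought rationale.
    rationale = model.generate(text=question, vision=vision_features,
                               target="rationale")

    # Stage 2: answer inference. The generated rationale is appended to the
    # original language input, and the vision features are fed in again, so
    # the final answer is conditioned on both modalities plus the rationale.
    answer = model.generate(text=question + " Rationale: " + rationale,
                            vision=vision_features, target="answer")
    return answer


print(multimodal_cot(ScienceQAModel(),
                     "Which property matches the object?", [0.0] * 8))
```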
The rationale generation and answer inference stages of Multimodal-CoT use the same model architecture but differ in their inputs and outputs. In the rationale generation stage, the model is fed both vision and language inputs, and the generated rationale is then appended to the original language input for the answer inference stage. In both stages, the language input is encoded with a Transformer encoder, the resulting textual representation is fused with the visual representation, and the fused representation is fed into the Transformer decoder.
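For a more concrete picture of that encoder-fusion-decoder flow, here is a simplified PyTorch sketch. The layer sizes, the single cross-attention step, and the gated mixing are illustrative assumptions rather than the paper's exact architecture.

```python
# Simplified sketch of the fusion described above: text is encoded with a
# Transformer encoder, combined with projected vision features, and the fused
# sequence is passed to a Transformer decoder as its memory.
import torch
import torch.nn as nn


class FusionEncoderDecoder(nn.Module):
    def __init__(self, vocab_size=32128, d_model=256, vision_dim=1024, nhead=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.text_encoder = nn.TransformerEncoder(enc_layer, num_layers=2)
        # Project pre-extracted vision features into the text representation space.
        self.vision_proj = nn.Linear(vision_dim, d_model)
        # Text tokens attend over vision patches; a learned gate mixes the
        # attended vision signal back into the text representation.
        self.cross_attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.gate = nn.Linear(2 * d_model, d_model)
        dec_layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers=2)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, text_ids, vision_feats, target_ids):
        h_text = self.text_encoder(self.embed(text_ids))        # (B, T, d)
        h_vis = self.vision_proj(vision_feats)                  # (B, P, d)
        attended, _ = self.cross_attn(h_text, h_vis, h_vis)     # text -> vision
        g = torch.sigmoid(self.gate(torch.cat([h_text, attended], dim=-1)))
        fused = (1 - g) * h_text + g * attended                 # gated fusion
        h_dec = self.decoder(self.embed(target_ids), memory=fused)
        return self.lm_head(h_dec)                              # token logits


# Tiny smoke test with random inputs.
model = FusionEncoderDecoder()
logits = model(
    torch.randint(0, 32128, (2, 16)),   # tokenized question + context
    torch.randn(2, 49, 1024),           # pre-extracted vision features
    torch.randint(0, 32128, (2, 8)),    # shifted rationale/answer tokens
)
print(logits.shape)  # torch.Size([2, 8, 32128])
```

The same network is trained once to emit rationales and once to emit answers; only the inputs and targets change between the two stages.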
In conclusion, Amazon's Multimodal-CoT demonstrates state-of-the-art performance on the ScienceQA benchmark, outperforming GPT-3.5 in accuracy and even surpassing human performance.