Beyond Text and Images: The Era of Multimodal AI

Imagine a world in which technology can not only understand our words but also analyze our pictures, interpret our videos, and even anticipate what we want. As 2024 progresses, generative AI is poised to become a valuable tool for everyday consumers, not just tech-savvy industry insiders, and a growing number of people are likely to experiment with a wide range of AI models. State-of-the-art models like GPT-4 and Gemini are at the forefront of this shift. Unlike their predecessors, which were limited to processing text, these advanced models, known as multimodal AI, can understand and generate not only text but also images, and potentially even video.

Unimodal vs Multimodal

In general, there are two kinds of generative AI models: unimodal and multimodal. Unimodal models take instructions in the same modality as the content they generate, whereas multimodal models accept cross-modal inputs and produce outputs across multiple modalities. Multimodal AI is not limited to a single type of data, such as images or text; instead, it combines several kinds of data, such as pictures, text, audio, code, and video, to build a fuller understanding of a situation. It's similar to how we use our eyes to see, our ears to hear, and our brains to process what's happening around us. A multimodal AI, for example, can look at an image while reading a description, allowing it to grasp what's in the picture far better than if it had the image alone.
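To make the contrast concrete, here is a minimal sketch of a cross-modal request, assuming the OpenAI Python SDK (v1.x) and access to a vision-capable model; the model name, image URL, and prompt are illustrative placeholders rather than a prescription. A single request carries both text and an image, and the model answers in text.

```python
# Minimal sketch of a multimodal (text + image) request.
# Assumes the OpenAI Python SDK (v1.x) and access to a vision-capable model;
# the model name and image URL below are placeholders for illustration.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4-vision-preview",  # illustrative model name
    messages=[
        {
            "role": "user",
            "content": [
                # The text part of the prompt
                {"type": "text", "text": "What is happening in this photo, and what might happen next?"},
                # The image part of the same prompt (the cross-modal input)
                {"type": "image_url", "image_url": {"url": "https://example.com/street-scene.jpg"}},
            ],
        }
    ],
    max_tokens=300,
)

print(response.choices[0].message.content)  # a text answer grounded in both inputs
```

A unimodal, text-only request would look the same except with a single text part; the key difference is that the words and the image arrive in one prompt, so the answer can draw on both.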

The State of Multimodal AI

One of the most substantial recent developments is Google's Gemini model, which was trained on multiple data modalities from the start and then fine-tuned with additional multimodal data. Gemini outperforms older models across a variety of benchmarks:

Performance comparison chart (source: https://blog.google/technology/ai/google-gemini-ai/#performance)

There are three versions, Ultra, Pro, and Nano, each catering to a different set of requirements. The Ultra version exceeds human-expert performance on massive multitask language understanding (MMLU) and leads on 30 of 32 widely used academic benchmarks. Google's chatbot Bard is powered by Gemini Pro, while the Nano version runs on the Pixel 8 Pro phone, powering features like Summarize and Smart Reply.
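For comparison, here is an equally minimal sketch of sending mixed image-and-text input to Gemini, assuming Google's google-generativeai Python package and an API key with access to a vision-capable Gemini model; the model name and file path are placeholders.

```python
# Minimal sketch of a mixed image + text request to Gemini.
# Assumes the google-generativeai package and an API key with access to a
# vision-capable Gemini model; the model name and file path are placeholders.
import google.generativeai as genai
from PIL import Image

genai.configure(api_key="YOUR_API_KEY")

model = genai.GenerativeModel("gemini-pro-vision")  # illustrative model name
photo = Image.open("receipt.jpg")

# One call, two modalities: an image plus a text instruction.
response = model.generate_content([photo, "Summarize the line items and the total on this receipt."])
print(response.text)
```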

OpenAI's GPT-4 with vision, previously available only to a small number of users, has now been made broadly accessible. While it shows promise, limitations and challenges remain: it sometimes struggles to recognize structural relationships in visuals and can make mistakes on tasks such as transcribing mathematical formulae or counting objects in images.

Apple's Ferret: A New Contender

This brings us to Apple's most recent contribution, Ferret. It shines in areas where GPT-4 falls short: in benchmark testing, Ferret outperformed GPT-4, particularly at detecting and summarizing small details in images.

Ferret's launch marks an important milestone on Apple's path toward more sophisticated AI applications, with the potential to reshape areas such as computer vision, particularly AR/VR experiences.

It will be intriguing to see how consumers adapt to these new multimodal AI systems, and equally fascinating to watch how much further their accuracy and usefulness improve. As we enter this new era of artificial intelligence, the possibilities are as boundless as they are thrilling.


Disclaimers:

This is not an offering. This is not financial advice. Always do your own research.

Our discussion may include predictions, estimates or other information that might be considered forward-looking. While these forward-looking statements represent our current judgment on what the future holds, they are subject to risks and uncertainties that could cause actual results to differ materially. You are cautioned not to place undue reliance on these forward-looking statements, which reflect our opinions only as of the date of this presentation. Please keep in mind that we are not obligating ourselves to revise or publicly release the results of any revision to these forward-looking statements in light of new information or future events.
