GPT-4 came out today and set a new milestone in Large Language Model (LLM) performance across the board. It's more creative and more reliable, and it can follow more nuanced instructions. But GPT-4 is especially notable for two things:
- Multimodality. It can understand different types of input; in this case, text and images.
- Greater reasoning ability.
Leveraging multimodality, GPT-4 achieves human-level performance on various professional and academic benchmarks.
What Multimodality Enables
GPT-4 can take a combination of text and image inputs and generate text outputs. It can understand the content in images, diagrams, illustrations, and other visual types, and infer things about them.
Let’s look at an example. Below, a prompt pairs an image with text asking GPT-4 what's “unusual” about the image, and GPT-4 responds with the salient point.
You can ask it to explain a meme step by step. It has even achieved human-like performance on various academic tests involving visual inputs (diagrams, equations, etc.).
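At the API level, a text-plus-image prompt like the ones above can be expressed as a single chat request whose content mixes text and image parts. Here is a minimal sketch that only builds the request payload; the schema mirrors OpenAI's chat completions format, and the model name and image URL are placeholders, not real endpoints:

```python
import json


def build_multimodal_prompt(question: str, image_url: str) -> dict:
    """Build a chat-style request whose content mixes text and image parts."""
    return {
        "model": "gpt-4-vision",  # hypothetical model identifier
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": question},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
    }


payload = build_multimodal_prompt(
    "What is unusual about this image?",
    "https://example.com/unusual-scene.jpg",  # placeholder URL
)
print(json.dumps(payload, indent=2))
```

The key idea is that text and image arrive in one message, so the model can reason over both jointly rather than handling them in separate passes.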
Multimodal Search Has Arrived
While GPT-4 opens a new world of possibilities in generative AI, multimodal LLMs also open new possibilities in search. In the same way that GPT-4 can see images and understand complex text, search systems can now read both text and images to return results that match what you actually mean.
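Under the hood, multimodal search of this kind typically embeds images and text queries into a shared vector space (CLIP-style) and ranks images by similarity to the query. A toy sketch, with hand-picked 3-dimensional vectors standing in for a real encoder's embeddings:

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


# Toy "embeddings": in a real system these come from a multimodal
# encoder (e.g. CLIP) that maps images and text into the same space.
image_index = {
    "street_buskers_bw.jpg": [0.9, 0.1, 0.4],
    "color_landscape.jpg":   [0.1, 0.9, 0.2],
    "crowded_market.jpg":    [0.6, 0.4, 0.5],
}


def search(query_embedding, index, top_k=2):
    """Rank indexed images by similarity to the query embedding."""
    ranked = sorted(
        index.items(),
        key=lambda item: cosine_similarity(query_embedding, item[1]),
        reverse=True,
    )
    return [name for name, _ in ranked[:top_k]]


# A text query like "black and white photo of street buskers" would
# normally pass through the same encoder; this vector is hand-picked.
query = [0.85, 0.15, 0.45]
print(search(query, image_index))
# → ['street_buskers_bw.jpg', 'crowded_market.jpg']
```

Because both modalities live in one vector space, the same index answers text queries, image queries, or a mix of the two.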
At Kailua Labs, we’re building multimodal search to produce astonishing results by matching the way you think and speak. Here's a video of our public demo working on stock images:
As you can see below, a seemingly simple search like "black and white photo of street buskers" can return very different results. (Guess which one uses multimodal search):
In a world that is moving increasingly to non-text media, the possibilities are endless.
We’re reinventing core search experiences for media companies, e-commerce, marketplaces, and much more.
Want to see what multimodal search can do for you? Get in touch.