Search is having its multimodal moment, and that’s going to change how we find and discover products, information, and entertainment. But multimodal search goes beyond just finding things. It can also change how we generate and interact with content.
At the heart of multimodal search systems like ours is a content understanding engine: a system that can ingest media of any kind and automatically “know” what’s in it without any tagging. That includes objects, landmarks, logos, colors, and even qualities such as sentiment (do the people in the image look happy?), actions (mountain biker doing a backflip), and fine details (the particular pattern on a wallpaper).
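To make this concrete, here is a sketch of the kind of structured description such an engine might emit for a single image, with no manual tagging. The field names and `matches` helper are illustrative assumptions, not a real schema or API.

```python
# Hypothetical engine output for one image (field names are assumptions).
analysis = {
    "objects": ["mountain bike", "helmet", "rider"],
    "actions": ["backflip"],
    "landmarks": [],
    "dominant_colors": ["#4a7c59", "#d9d9d9"],
    "sentiment": {"label": "excited", "confidence": 0.91},
}

def matches(analysis, query_terms):
    """Naive check: does any extracted field mention every query term?"""
    haystack = " ".join(
        str(v)
        for field in analysis.values()
        for v in (field if isinstance(field, list) else [field])
    ).lower()
    return all(term.lower() in haystack for term in query_terms)
```

Once media is described this way, queries like "mountain biker doing a backflip" reduce to matching terms against extracted attributes rather than hand-written tags.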
With an engine that understands content in all forms, we can unlock all sorts of new workflows and capabilities. Here are three examples.
Automatically summarize visual and audio content
Imagine a sports marketing team with a folder full of basketball footage. They want to put together a highlight video. Normally, this would entail a person going through the videos, snipping out highlights, then stringing them together. With a multimodal content understanding engine, this entire process can be automated since the system can identify:
- Actions such as dunks and 3-pointers.
- Crowd and team reactions (visual and audio) as cues for surprise and shock.
The content system could ingest raw footage and automatically identify highlight candidates.
One could extend the same workflow to conferences, presentations, or meetings.
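The selection step above can be sketched in a few lines. The detector outputs here are stubbed; in practice the action labels and crowd-excitement scores would come from the content understanding engine, and the scoring weights are arbitrary assumptions.

```python
# A minimal sketch of automated highlight selection over engine-analyzed
# segments. Detector outputs are stubbed; weights are illustrative.
from dataclasses import dataclass

@dataclass
class Segment:
    start: float             # seconds into the footage
    end: float
    actions: list            # e.g. ["dunk", "3-pointer"]
    crowd_excitement: float  # 0..1, from audio/visual reaction cues

HIGHLIGHT_ACTIONS = {"dunk": 1.0, "3-pointer": 0.8, "steal": 0.6}

def score(seg):
    action = max((HIGHLIGHT_ACTIONS.get(a, 0.0) for a in seg.actions), default=0.0)
    return 0.7 * action + 0.3 * seg.crowd_excitement

def pick_highlights(segments, k=3):
    ranked = sorted(segments, key=score, reverse=True)
    # Keep the chosen clips in chronological order for the final reel.
    return sorted(ranked[:k], key=lambda s: s.start)

segments = [
    Segment(12.0, 18.0, ["dunk"], 0.9),
    Segment(40.0, 45.0, [], 0.2),
    Segment(75.0, 81.0, ["3-pointer"], 0.7),
]
reel = pick_highlights(segments, k=2)
```

The hard part, of course, is producing the action labels and excitement scores; once the engine supplies those, assembling the reel is ordinary ranking code.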
Enhance content recommendation and feeds
Recommendation algorithms combine user behavior and content information to help people discover the next great video or song. Today, algorithms get that content information from transcripts, descriptions, and whatever text is available. This means creators still need to do a lot of manual work to get their content recommended (e.g., adding the right keywords and descriptions). If something is missed, then an opportunity to connect with a potential audience is lost.
With a content understanding engine, that whole process gets streamlined. Imagine a recommendation system that automatically identifies key attributes and qualities in a video and recommends it to others consuming similar content. These can be obvious attributes (e.g., motorcycles) or non-obvious ones (e.g., exotic solo adventures).

This helps creators stay focused on content rather than SEO, and helps platforms connect viewers to high-quality information. The same idea could be extended to virtual shopping assistants, customer support, and education.
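As a toy illustration, content-based recommendation can be framed as nearest-neighbor search over engine-extracted attribute vectors. The vectors, dimension meanings, and video names below are all made up for the example; a real system would use learned embeddings.

```python
# Toy content-based recommendation: rank catalog items by cosine
# similarity to what the viewer just watched. All data is illustrative.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

# Hypothetical attribute vectors per video, say
# [motorcycles, solo-travel, cooking].
catalog = {
    "desert-ride": [0.8, 0.9, 0.0],
    "pasta-night": [0.0, 0.1, 0.9],
    "alps-tour":   [0.7, 0.9, 0.1],
}

def recommend(watched_vec, catalog, k=2):
    ranked = sorted(
        catalog.items(),
        key=lambda item: cosine(watched_vec, item[1]),
        reverse=True,
    )
    return [name for name, _ in ranked[:k]]

# Viewer just watched something motorcycle/solo-travel heavy.
recs = recommend([0.8, 0.9, 0.0], catalog, k=2)
```

Because the vectors come from the content itself, a creator gets recommended on the strength of what is actually in the video, not the keywords they remembered to add.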
Enrich metadata and organize content
A system that can search media based on its contents can also organize media for exploration. In the past, making media browsable meant tagging it by hand, a time-consuming chore. Even then, subsequent users could browse only by whatever tags happened to exist.
With a multimodal search system, collections can be generated on the fly and saved. Whether you want to organize your education library by topic or your wallpaper collection by theme, creating a collection is as simple as entering the relevant terms.
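A sketch of that workflow: turn each free-text theme into a query and save the results as a named collection. The `search` function here stands in for a multimodal search API (hypothetical); it matches against pre-extracted labels so the example stays runnable.

```python
# Generate saved collections from free-text themes. The label sets below
# stand in for what a content understanding engine would extract.
library = {
    "img-001": {"sunset", "beach", "orange"},
    "img-002": {"geometry", "pattern", "blue"},
    "img-003": {"beach", "palm", "green"},
}

def search(query_terms, library):
    """Return media IDs whose labels overlap the query terms."""
    terms = set(query_terms)
    return sorted(mid for mid, labels in library.items() if terms & labels)

def build_collections(themes, library):
    return {name: search(terms, library) for name, terms in themes.items()}

collections = build_collections(
    {"Coastal": ["beach", "ocean"], "Geometric": ["pattern", "geometry"]},
    library,
)
```

Each key in `collections` is now a browsable, saveable grouping that took one query to create instead of a tagging pass over the whole library.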
This makes new marketing tactics possible, from rapidly generating collections in response to trends to creating a ton of pre-rendered pages that rank on public search engines. To get a feel for it, check out our demo on stock images.
Search as a content understanding engine
Search is richer and more expansive than most of us think. All it takes is creativity and the right technology. At Kailua Labs, we’ve built a turnkey system that understands your media and helps you with search, discovery, recommendations, enrichment, and building next-generation multimodal apps on top of those capabilities.
Have an idea? Let’s chat about how we can help you make it possible.