Visual SEO used to end in a fairly mechanical task list: reduce file weight, write a decent alt attribute, export to WebP, check that LCP had not fallen through the floor. That list still matters, but it no longer explains what happens when someone points a camera at a shop window, a label, a facade or a spare part and asks Google what is in front of them.
Multimodal search changes the unit of optimization. The page is no longer only a text document. It becomes a scene: objects, relationships, brands, materials, locations, prices, availability, reviews, licenses and editorial context. In April 2025, Google announced that AI Mode could accept a photo or uploaded image, understand the whole scene with Gemini, identify objects through Lens and issue multiple searches with query fan-out. At Google I/O a month later, the company said Lens had more than 1.5 billion monthly users searching what they see.
The uncomfortable conclusion: a beautiful but mute image is not enough. For multimodal search, every important image needs to say what it is, where it fits and why it can be trusted.
What changes when Google can see the scene
AI Mode and Lens do not treat an image as decoration for the article. Google describes a flow where the system understands objects, materials, colors, shapes, arrangement and the relationship between elements. It then runs several searches about the whole image and about specific objects inside it. One photo of a bookshelf can become searches about books, editions, recommendations, shops and follow-up questions.
Robby Stein, VP of Product at Google Search, explains in the AI Mode announcement that the experience combines Lens with a custom version of Gemini to answer complex questions about what the user sees. The important phrase is not “AI”; it is “what the user sees”. In SEO, that forces us to move from abstractions to visible evidence.
Consider a dental clinic in Barcelona. A reception photo with alt="dental clinic" is better than nothing, but it does not disambiguate much. An image with the team visible, real signage, neighborhood cues, nearby copy mentioning “dental clinic in Gracia”, consistent LocalBusiness schema and crawlable photos of the room, treatment area and facade creates a stronger visual entity. If someone uses Lens in front of the location or asks about nearby services, Google has more pieces to connect.
The contrarian point: you do not need more photos everywhere. You need fewer generic images and more images that answer a visual question. A stock photo of a smiling person at a laptop does not build entity strength; an original photo of your product, workshop, menu, showroom or before-and-after can. The camera does not ask for “digital solutions”. It asks what this is, where to buy it, how to use it, whether it is open and whether it matches the need.
Visual entities: products, places, people and attributes
In entity SEO we usually talk about the Knowledge Graph, sameAs, brands and authors. In visual SEO, the logic becomes more physical: which entities appear inside the image. A dress is not only “dress”; it may be a green satin midi dress with a specific neckline, brand, size, price, availability and styling options. A restaurant is not only “restaurant”; it may be terrace, interior, menu, facade, dishes, neighborhood and opening hours.
Google Search Central says SEO best practices remain relevant for AI Overviews and AI Mode, and recommends supporting textual content with high-quality images and videos where useful. It also reminds site owners that structured data should match visible text. That second point prevents a familiar temptation: using schema to claim what the image does not prove.
For products, Google’s Product structured data documentation is direct: product data can appear in rich results, Google Images and Google Lens, with information such as price, availability, ratings, shipping and returns. If you run ecommerce, the main photo, Merchant Center feed, Product schema and page copy need to tell the same story. When they do not, the system has to choose between conflicting signals.
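One way to keep schema and page copy telling the same story is to generate the Product markup from the same record that feeds the template, then compare it against the visible text before publishing. The following is a minimal sketch under that assumption; the function names, the example product and the substring comparison are all illustrative, not part of any Google tooling.

```python
import json


def product_jsonld(name, image_urls, price, currency, in_stock, page_url):
    """Build a schema.org Product object as a plain dict (hypothetical helper)."""
    availability = (
        "https://schema.org/InStock" if in_stock else "https://schema.org/OutOfStock"
    )
    return {
        "@context": "https://schema.org",
        "@type": "Product",
        "name": name,
        "image": list(image_urls),
        "offers": {
            "@type": "Offer",
            "url": page_url,
            "price": f"{price:.2f}",
            "priceCurrency": currency,
            "availability": availability,
        },
    }


def mismatches(jsonld, visible_text):
    """Return schema fields whose values do not appear in the visible page text.

    A plain substring check is an assumption made for brevity; a real audit
    would normalise currency formats and whitespace first.
    """
    checks = {"name": jsonld["name"], "price": jsonld["offers"]["price"]}
    text = visible_text.lower()
    return [field for field, value in checks.items() if value.lower() not in text]


# Example values are invented for illustration.
product = product_jsonld(
    name="White running shoe",
    image_urls=["https://example.com/img/white-running-shoe.jpg"],
    price=89.9,
    currency="EUR",
    in_stock=True,
    page_url="https://example.com/shoes/white-running-shoe",
)
print(json.dumps(product, indent=2))
```

Running `mismatches(product, page_text)` in a publishing pipeline gives an early warning when the copy and the markup drift apart, which is the situation where the system has to choose between conflicting signals.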
For local, LocalBusiness structured data recommends crawlable, indexable images that represent the marked-up content and use formats supported by Google Images. It also recommends multiple high-resolution images in 16:9, 4:3 and 1:1 ratios. That is not just an aesthetic preference; it gives Google several surfaces for local results, maps, images, panels and visual answers.
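A quick way to audit that recommendation is to check which of the three ratios a location's image set already covers. This is only a sketch: the 2% tolerance and the function names are assumptions, and the dimensions would come from your own image inventory.

```python
# The three aspect ratios named in the LocalBusiness guidance.
TARGET_RATIOS = {"16:9": 16 / 9, "4:3": 4 / 3, "1:1": 1.0}


def closest_ratio(width, height, tolerance=0.02):
    """Return the target ratio label an image matches, or None.

    The tolerance is an assumed value that forgives small cropping differences.
    """
    actual = width / height
    for label, ratio in TARGET_RATIOS.items():
        if abs(actual - ratio) / ratio <= tolerance:
            return label
    return None


def missing_ratios(images):
    """Given (width, height) pairs, list the target ratios not yet covered."""
    found = {closest_ratio(w, h) for w, h in images}
    return [label for label in TARGET_RATIOS if label not in found]


# Hypothetical image set for one location page.
print(missing_ratios([(1920, 1080), (800, 800)]))
```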
The useful analogy here comes from inventory, not photography. Each important image is a warehouse item. If the box has no label, location, content description or condition notes, someone has to open it and guess. Multimodal search can look inside the box, yes, but it still needs the inventory system to confirm what it sees.
Surrounding text, captions and accessibility
Google’s Image SEO best practices say it uses alt text together with computer vision algorithms and page content to understand an image’s subject. That “together with” is the part many teams miss. Alt text does not live alone. The previous paragraph, caption, H2, product copy and structured data help define what the image means on that specific page.
Alt text should describe the image’s function in context. For a product image, “shoe” is not enough. Better: “white running shoe with wide sole for overpronation”. For a local page: “facade of the Ighenatt Dental clinic on Carrer de Sants, Barcelona”. For a technical guide: “Search Console screenshot with Image search type filter applied”. Natural, concrete, no keyword stuffing.
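A rough lint can catch the two extremes described above: alt text that is too generic and alt text that repeats a keyword. The thresholds below are assumptions chosen for illustration, and no script can judge whether the description fits the page context; treat it as a first filter before human review.

```python
from collections import Counter


def alt_text_warnings(alt, min_words=3, max_repeats=2):
    """Flag alt text that looks too generic or keyword-stuffed.

    Both thresholds are assumed defaults, not published guidance.
    """
    words = [w.strip(".,;:").lower() for w in alt.split()]
    words = [w for w in words if w]
    warnings = []
    if len(words) < min_words:
        warnings.append("too generic")
    # Ignore short words such as "of" or "the" when counting repeats.
    repeated = [w for w, n in Counter(words).items() if n > max_repeats and len(w) > 3]
    if repeated:
        warnings.append("possible keyword stuffing")
    return warnings


for candidate in ["shoe", "white running shoe with wide sole for overpronation"]:
    print(candidate, "->", alt_text_warnings(candidate))
```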
Accessibility is not decorative. W3C WAI reminds us that images need text alternatives describing the information or function they represent, so people with different disabilities can use them. The interesting SEO angle is that the same discipline improves machine interpretation: if you can explain why the image matters to someone who cannot see it, you are also clarifying its semantic role.
Captions have one advantage alt text does not: users see them. In ecommerce and local SEO, a good caption can solve questions that computer vision cannot confirm: “2026 model in navy blue, available in the Barcelona store”, “covered terrace accessible from the main entrance”, “result after 90 days of treatment, with patient consent”. If that sounds operational, good. Visual search rewards the operational.
ImageObject, licenses and schema that matches the page
ImageObject is not magic. Schema.org defines it as a type for describing image objects with properties such as contentUrl, caption, creator, copyrightNotice, license and usageInfo. Google says image metadata can be provided through structured data or IPTC photo metadata. If you use both and they conflict, Google prioritizes the structured data.
This matters for publishers, image libraries, portfolios, photographers, ecommerce brands with original photography, real estate, tourism, architecture, art, training and any business where image usage rights matter. The “licensable” badge is not everyone’s goal, but traceability is. An image with creator, copyright and license page reduces doubt around origin and reuse.
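As a sketch of what that traceability can look like in markup, the snippet below builds an ImageObject with license and creator fields and wraps it in a JSON-LD script tag. The property names are schema.org properties; the helper functions, URLs and names are hypothetical, and whether the creator is a Person or an Organization depends on your case.

```python
import json


def image_object_jsonld(content_url, license_url, acquire_url, creator_name,
                        copyright_notice, caption=None):
    """Build a schema.org ImageObject dict with license metadata."""
    data = {
        "@context": "https://schema.org",
        "@type": "ImageObject",
        "contentUrl": content_url,
        "license": license_url,
        "acquireLicensePage": acquire_url,
        # Assumes an individual photographer; use "Organization" for a company.
        "creator": {"@type": "Person", "name": creator_name},
        "copyrightNotice": copyright_notice,
    }
    if caption:
        data["caption"] = caption
    return data


def as_script_tag(data):
    """Serialise the dict as a JSON-LD script tag for the page head or body."""
    # Escaping "</" keeps a stray closing tag in the data from ending the script early.
    payload = json.dumps(data, ensure_ascii=False, indent=2).replace("</", "<\\/")
    return f'<script type="application/ld+json">\n{payload}\n</script>'


# All values below are placeholders.
print(as_script_tag(image_object_jsonld(
    content_url="https://example.com/img/showroom-facade.jpg",
    license_url="https://example.com/image-license",
    acquire_url="https://example.com/how-to-license-our-images",
    creator_name="Example Photographer",
    copyright_notice="Example Studio",
    caption="Showroom facade seen from the street",
)))
```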
For articles, Article schema can declare images associated with editorial content. For products, Product schema connects the image to price, availability, ratings, shipping and returns. For local, LocalBusiness schema can include images of a restaurant, clinic, office, shop or branch. The practical rule: do not mark an image as proof of something the page does not visibly show.
A reasonable commercial page setup has four layers. First, an HTML img element with src, width, height, alt and a sensible loading attribute. Second, nearby text or a caption explaining context. Third, page-type schema: Product, LocalBusiness, Article, FAQPage or HowTo. Fourth, ImageObject or IPTC metadata if you need credit, license, author or an acquisition page. If you already have a structured data strategy for AI, this is the natural visual extension, and it connects directly to the guide on Schema.org as an SEO-GEO bridge.
Airbnb shows the commercial side. Its professional photography program reports, using 2024-2025 data, 21% more host earnings and 19% more bookings for listings with professional photography versus comparable listings without it. That is not pure SEO, but it illustrates a point Lens and AI Mode amplify: a better documented visual surface changes discovery and decision behavior.
Practical checklist for local, product and business pages
Use this checklist for images that can generate revenue, leads or physical visits. You do not need it for every icon or decorative asset.
- Choose original, specific images: real product, real facade, real team, real room, real installation, real packaging. Avoid stock on key surfaces.
- Name files with entity and attribute: white-running-shoe-overpronation-brand.jpg, dental-clinic-gracia-facade.jpg, vegan-restaurant-menu-barcelona.jpg.
- Ensure crawlability and indexing: accessible image URLs, not blocked by robots.txt, no login, included in a sitemap or discoverable from HTML.
- Set dimensions: width and height to prevent CLS; useful ratios for rich results and cards include 16:9, 4:3 and 1:1 when the page type needs them.
- Write contextual alt text: describe the image as part of the page, not as a keyword cloud.
- Add captions where they help decisions: material, color, model, location, access, date, condition, availability or legal status.
- Repeat critical text in HTML: prices, model names, opening hours, ingredients, certifications, addresses and steps should not depend only on OCR.
- Connect visible schema: Product for product pages, LocalBusiness for locations, Article for editorial, HowTo for processes, FAQPage for real questions.
- Include license data where relevant: license, acquireLicensePage, creator, copyright or IPTC, especially for publishers and owned photography.
- Check Merchant Center and Business Profile: product and location photos should match feeds, GBP, hours, availability and policies.
- Measure by surface: Google Images, web results, Lens where you have clues, Merchant Center, GBP insights, conversions and calls.
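Two items on this checklist lend themselves to a quick script: file naming and the robots.txt check. The sketch below uses only the Python standard library. The function names are hypothetical, and `urllib.robotparser` does not reproduce every detail of how Google interprets robots.txt (wildcards in particular), so treat a result as a hint to verify, not a verdict.

```python
import re
import unicodedata
from urllib.robotparser import RobotFileParser


def image_filename(*parts, extension="jpg"):
    """Join entity and attribute words into a lowercase, hyphenated file name."""
    text = " ".join(parts)
    # Strip accents so that "Gràcia" becomes "gracia".
    ascii_text = (
        unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    )
    slug = re.sub(r"[^a-z0-9]+", "-", ascii_text.lower()).strip("-")
    return f"{slug}.{extension}"


def image_allowed(robots_txt, image_url, user_agent="Googlebot-Image"):
    """Check an image URL against robots.txt rules supplied as a string."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, image_url)


# Placeholder rules and URLs for illustration.
rules = "User-agent: *\nDisallow: /private/\n"
print(image_filename("Dental clinic", "Gràcia", "facade"))
print(image_allowed(rules, "https://example.com/images/facade.jpg"))
print(image_allowed(rules, "https://example.com/private/facade.jpg"))
```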
On a product page, prioritize main image, variants, details, scale and usage context. On a local page, prioritize facade, entrance, interior, team, accessibility and surroundings. On a B2B or service page, prioritize process screenshots, real deliverables, before-and-after comparisons and diagrams explained in text.
Performance: modern formats, yes, but not as religion
There is a reason this site already has articles on WebP images and SEO performance and WebP/AVIF image optimization: file weight still matters. A visually perfect image that delays LCP loses business before Lens can do anything. Use AVIF or WebP when your pipeline supports it, optimized JPEG when you need simple compatibility and SVG only for graphics that are truly vector-based.
But in 2026 the “WebP vs AVIF” debate should not swallow the strategy. Format is hygiene. The real progress is serving the right image, at the right size, at the right time, with enough context for humans and machines to know what it represents.
Four practical rules are enough to start. Do not lazy-load the LCP image. Use srcset and sizes so you do not send a desktop photo to a mobile screen. Keep width and height. Compress without destroying details a visual search may need: texture, label, shape, color, pattern, model number. I have seen teams crush a product image so hard that the logo becomes a smear. Fast, yes. Also useless.
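The first three rules can be captured in a small template helper. This is a sketch, not a finished component: the `?w=` resize parameter stands in for whatever your image CDN or build pipeline uses, and the helper name and arguments are assumptions.

```python
from html import escape


def responsive_img(base_url, widths, width, height, alt, sizes, is_lcp=False):
    """Render an img tag with srcset, sizes and explicit dimensions.

    The "?w=" query parameter is a hypothetical resize convention.
    """
    widths = sorted(widths)
    attrs = {
        "src": f"{base_url}?w={widths[-1]}",
        "srcset": ", ".join(f"{base_url}?w={w} {w}w" for w in widths),
        "sizes": sizes,
        "width": str(width),
        "height": str(height),
        "alt": alt,
    }
    if is_lcp:
        # The LCP image is never lazy-loaded; hint that it should load early.
        attrs["fetchpriority"] = "high"
    else:
        attrs["loading"] = "lazy"
    rendered = " ".join(
        f'{name}="{escape(value, quote=True)}"' for name, value in attrs.items()
    )
    return f"<img {rendered}>"


print(responsive_img(
    "https://example.com/img/white-running-shoe.jpg",
    widths=[400, 800, 1200],
    width=1200,
    height=800,
    alt="white running shoe with wide sole for overpronation",
    sizes="(max-width: 600px) 100vw, 600px",
    is_lcp=True,
))
```

The fourth rule, compression that preserves labels and textures, still needs a human eye on the output.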
Compression should respect image function. For a facade photo, the sign needs to be readable. For a chart, axes and legends matter. For a product, materials and finishes should survive. Performance and visual citability are not enemies; they break when you optimize only one metric.
Measurement: how to know visual SEO is progressing
Search Console does not yet give you a clean “clicks from Lens to this image” report. Google says AI Overviews and AI Mode are counted within overall Search performance under the Web search type. That means visual measurement requires triangulation, not a single dashboard.
Start with Search Console. Separate Image and Web search types. Look for pages with many image impressions and low CTR. Review queries that include color, material, model, location, “near me”, “how to use”, “photo of”, “price”, “dimensions” or product names. Cross those patterns with image changes, captions and schema. If a local page gains image impressions after adding facade and interior photos, the visual signal is probably improving.
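That first pass can be scripted against exported data. The sketch below assumes you have already loaded page rows and queries into plain Python structures; the dictionary keys, the thresholds and the attribute vocabulary are all assumptions to adapt to your own export and catalogue.

```python
# Literal phrases from the review list above; attributes such as colors,
# materials or model names are site-specific and passed in by the caller.
LITERAL_PHRASES = ("near me", "how to use", "photo of", "price", "dimensions")


def low_ctr_pages(rows, min_impressions=1000, max_ctr=0.01):
    """Return pages with many image impressions and low CTR, busiest first.

    Each row is assumed to be a dict with "page", "clicks" and "impressions".
    """
    flagged = [
        row for row in rows
        if row["impressions"] > 0
        and row["impressions"] >= min_impressions
        and row["clicks"] / row["impressions"] <= max_ctr
    ]
    flagged.sort(key=lambda row: row["impressions"], reverse=True)
    return [row["page"] for row in flagged]


def visual_queries(queries, attribute_terms=()):
    """Keep queries containing a literal phrase or a site-specific attribute."""
    terms = tuple(t.lower() for t in (*LITERAL_PHRASES, *attribute_terms))
    return [q for q in queries if any(t in q.lower() for t in terms)]


# Invented sample data for illustration.
rows = [
    {"page": "/shoes/white-running-shoe", "clicks": 5, "impressions": 2000},
    {"page": "/clinic/gracia", "clicks": 100, "impressions": 2000},
]
print(low_ctr_pages(rows))
print(visual_queries(["green satin midi dress", "dentist near me"], ["satin"]))
```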
Then look at business outcomes. In ecommerce, Merchant Center and GA4 can show whether pages with new images improve clicks, add-to-cart actions or assisted conversions. In local, Google Business Profile can show calls, direction requests, photo interactions and location queries. In editorial content, watch whether articles with explained diagrams, captions and Article schema earn more image clicks or better visual snippets.
Frequently asked questions about multimodal search and visual SEO
Does visual SEO for AI replace classic image SEO?
No. It expands it. Crawling, formats, file weight, dimensions, alt text and image sitemaps still matter, but multimodal search adds an entity layer: which product, place, material, brand or action appears in the image and how it is confirmed by visible text, captions, schema, feeds and business data.
Is ImageObject required to appear in Google Lens or AI Mode?
No. Google says there is no special schema required for AI Overviews or AI Mode. Still, ImageObject, Product, LocalBusiness, Article and license metadata help reduce ambiguity when they faithfully represent the visible content of the page.
Which images should I prioritize first?
Start with images that already influence business: main product photos, location photos, local service images, comparisons, process screenshots, menus, rooms, facilities and any image a user might photograph with Lens to ask what it is, how much it costs, where it is or how it works.
How do you measure whether visual search is working?
Measure Google Images impressions and clicks in Search Console, Product or LocalBusiness rich result appearances, clicks from pages with refreshed images, assisted conversions in GA4, Merchant Center performance and queries mentioning visual attributes such as color, material, shape, model, location or style.
Your task this week: export your 20 highest revenue or lead pages, mark which ones depend on an image to persuade, and review only those. Check whether the photo is original, whether the alt text describes the entity, whether nearby copy explains context, whether schema matches what is visible and whether Search Console separates Image results. It is a small audit. That is exactly why it gets done.