Web Devs, Meet the AI Apps You’ll Build Next

A Google DeepMind research scientist says that the AI interface will change as AI apps go multimodal and engage with our surroundings.

May 13th, 2025 3:00pm by Loraine Lawson

Photo by Alex Schuper via Unsplash

AI isn’t going to be limited to changing how we use the internet and create code. It will also change how applications function, enabling them to interact with the physical world in new ways. Research scientist Stefania Druga of Google DeepMind showed developers what this might look like, including demos of four applications, at last week’s Infobip Shift Miami and Infobip CX Unlocked Americas conferences.

Multimodal AI receives inputs from sensors, cameras, robotic arms and other interactive technologies, she told the audience.

“We can have AIs that can perceive the world the same way we perceive the world,” said Druga, who holds a PhD in AI literacies and applications. “Once we have video input, audio input and images, AI applications are going to have a much richer context and understanding of the tasks for the different environments that we are using them in.”

Multimodal applications also move our interaction with AI beyond mere text, enabling real-time interactions and a speech interface for AI, she added. They also make it possible to have better grounding, which is the process of connecting an AI model’s abstract knowledge to specific, real-world information and context.

“The real-time aspect is very important,” she said. “Let’s say I need to replace [a] tire on my car. I want to be able to have an API that can see the task in real time and give me feedback.”

Developer Lessons in AI From Cognimates

Previously, Druga was part of MIT’s Scratch team. Scratch is a visual programming language used by children worldwide.

She designed a Scratch-based platform called Cognimates, which allows children to learn about and build with AI. Cognimates is free, open source and currently in early preview.

“If you feel intimidated or overwhelmed or feel like things are moving too fast and too scary, I want you to be inspired and encouraged that even the youngest members of our society are actually learning about this technology and building with it,” she told the audience.

There are more than 18 extensions, which are like libraries, available on the Cognimates platform. It can incorporate a sentiment analysis program, a voice assistant similar to Alexa or Siri, and smart lights.

One girl trained the AI to play hide and seek. The app connected to a camera, which could rotate and view the whole room.

“She wanted to be able to run around the room and hide, and the robot would scan and, if the number of people it could see was larger than zero, say, ‘I see you,’” she said. “I love this example, because just in seven blocks, she was able to create this rich interaction.”
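The rule she describes fits in a few lines of code. The Python sketch below only illustrates that rule and is not Cognimates code: `count_people` is a hypothetical stand-in for whatever vision model reports how many people are in view, and each camera frame is simplified to a list of detected labels.

```python
from typing import Iterable, List, Optional


def count_people(frame: List[str]) -> int:
    """Hypothetical stand-in for a vision model: counts 'person' labels in a frame."""
    return sum(1 for label in frame if label == "person")


def scan_room(frames: Iterable[List[str]]) -> Optional[str]:
    """Step through the camera's frames and speak as soon as anyone is visible."""
    for frame in frames:
        if count_people(frame) > 0:
            return "I see you"
    return None  # nobody was spotted during the scan
```

Calling `scan_room` with one frame per camera position mirrors the rotating camera: the scan stops at the first frame in which the count is larger than zero.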

Another game provided a valuable lesson in unintentional bias in AI and how to fix it. Two students created a rock-paper-scissors app, using images of their own hands to train the model. They soon found that their friends couldn’t play because they had different skin tones. So the intrepid coders retrained the model with images of a broader range of hands.

A screenshot of the Scratch platform Cognimates, which is used to develop AI applications. This shows a picture of a hand used to train the models for a game of rock, paper, scissors.

Screenshot from Stefania Druga’s Infobip conference presentation.

“It’s a different way of talking about AI ethics in a way that is not paralyzing,” she said. “How do we fix it? How do we input more diverse data into a training [set]?”
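One common way to surface the kind of problem the students hit is to measure accuracy separately for each group of players instead of as a single overall number. The Python sketch below assumes a made-up record format of (group, true gesture, predicted gesture) and uses toy values; it is not drawn from Cognimates or from the students' project.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def accuracy_by_group(results: List[Tuple[str, str, str]]) -> Dict[str, float]:
    """Compute accuracy per group from (group, true_label, predicted_label) triples."""
    correct: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for group, truth, predicted in results:
        total[group] += 1
        if truth == predicted:
            correct[group] += 1
    return {group: correct[group] / total[group] for group in total}


# Toy, made-up results for illustration only.
toy_results = [
    ("group_a", "rock", "rock"),
    ("group_a", "paper", "paper"),
    ("group_b", "rock", "paper"),
    ("group_b", "scissors", "scissors"),
]
print(accuracy_by_group(toy_results))  # {'group_a': 1.0, 'group_b': 0.5}
```

A large gap between groups suggests the training images do not cover everyone who will use the app, which is the cue to collect more varied examples and retrain.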

Druga also studied how children perceived AI after six weeks of using the platform. Before the children began using it, researchers gauged their understanding of AI by asking whether AI is smarter than they are. All of them said yes. But after six weeks of using the tool, their answers shifted to maybe, sometimes and no.

“They understood that there are people who are creating these data sets, and they understood when these data sets are useful, when they’re not, and what sort of tasks we can delegate to AI and what sort of tasks we cannot delegate to AI,” she said.

Meanwhile, we adults are still struggling with that.

A Natural Language Interface With Google Home Gemini Extension

One project she’s worked on for Google over the past year is using Gemini and smart devices to control home items, such as the heat, A/C or blinds. The Google Home Gemini extension, available in public preview, installs on your phone and uses a voice interface.

The idea behind this multimodal app is to test fuzzy queries that use natural language. Instead of requiring an explicit command to turn on the A/C, it responds to a comment such as “It’s very hot in here” by turning on the air conditioning.

She wasn’t sure how it would respond during the live demo, in which she told the assistant that she wanted to practice yoga upstairs. It responded by turning on the air conditioning. In another scenario, it closed the blinds and turned on spot lighting.

A Multimodal AI Chemistry Assistant

There’s a lot of buzz around the idea of an AI co-scientist, Druga said. For instance, Google has an AI Co-Scientist that uses AI to assist in scientific research by helping professional scientists. It can identify novel research directions and generate new hypotheses, in addition to helping with iterative testing and refinement.

But it’s still very text-based, Druga noted. ChemBuddy, an AI assistant for chemistry students, brings the AI into the lab work by having it observe experiments in real life.

Right now, it’s designed for educational settings, but it’s not hard to see how it could evolve for professional labs. The system can connect to cameras, microscopes, sensors or even a robotic arm. It also has a web speech API to support audio. It uses Imagen for image generation and can create visualizations of the experiment’s reactions.

The architecture for ChemBuddy, an AI assistant for chemistry students.

Screenshot via Stefania Druga’s slides.

From the chat, the user can ask questions. ChemBuddy records everything a student does and creates documentation from the sensor results in real time.

It can also detect a misconception. If the student is asking questions that reveal he or she doesn’t understand the difference between endothermic and exothermic reactions, it’s going to keep track of that and help the student learn.

MathMind and Evaluating Models

MathMind is an app that identifies math misconceptions by analyzing student work via webcam and providing targeted exercises and feedback. It targets 55 algebra misconceptions.

“There [are] so many ways in which we can get math wrong, but it’s very useful to be able to understand what concept is clear and what concept is not clear,” she said, adding that teachers could use the tool to see where students need reteaching. It can also generate custom exercises with visualizations and create reports for teachers, students and parents.

For MathMind, she did something that will be critical for developing multimodal AI apps: She created it with the Gemini 2.5 API but tested it on a variety of models, including open source options, to evaluate how well the models work within the context of the application. She evaluated them on dimensions beyond precision, including whether the models gave age-appropriate answers, had the right tone and were coherent and clear.

“I highly recommend when you think about evaluating multimodal AI systems or AI systems in general, think beyond precision,” she said. “Have all these other dimensions that are focused more on the user experience.”
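For developers who want to try this, a multi-dimensional evaluation can start as a simple scorecard. The Python sketch below is an assumption-laden illustration, not Druga's evaluation harness: the dimension names are taken from the article, while the 0-to-1 rating scale, the dictionary format and the function names are hypothetical. The ratings themselves would come from human reviewers or a grading process of your choice.

```python
from statistics import mean
from typing import Dict, List

# Dimensions named in the talk; the 0-to-1 scale is an assumption.
DIMENSIONS = ["precision", "age_appropriate", "tone", "coherence", "clarity"]


def summarize(ratings: List[Dict[str, float]]) -> Dict[str, float]:
    """Average each dimension over all rated responses for one model."""
    return {dim: mean(r[dim] for r in ratings) for dim in DIMENSIONS}


def weakest_dimension(summary: Dict[str, float]) -> str:
    """Return the dimension with the lowest average score."""
    return min(summary, key=lambda dim: summary[dim])
```

Running `summarize` once per candidate model on the same set of prompts gives a side-by-side view, and `weakest_dimension` points to where a model that looks accurate may still fall short for its users.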

Loraine Lawson is a veteran technology reporter who has covered technology issues from data integration to security for 25 years. Before joining The New Stack, she served as the editor of the banking technology site Bank Automation News. She has...