A Deep Dive Into Rabbit AI’s “Teach Mode”

Jan 11, 2024

The Rabbit R1, the first dedicated LLM agent device, was announced yesterday at CES.

One feature demonstrated was the ability to “teach” the Rabbit AI to take actions on your behalf.

Today I’m going to explore ideas on how they might be implementing this feature, and then next week, if people are interested, I’ll explore how to add similar functionality into ChatGPT via a custom GPT.

Note: This is my own conjecture and not the actual approach taken by Rabbit. It’s an exploration of how to approach the problem, sprinkled with my limited understanding of their implementation. It also glosses over the significant engineering challenges required to make this work at scale.

The Rabbit Demos

Before getting started, let’s review a couple of the demos that Rabbit released.

Both demos rely on what Rabbit calls their Large Action Model (LAM), which they claim is a foundation model for translating user intents into actions.¹

Generating an Image Via Midjourney

During the CES announcement, we saw the AI taught to generate an image via Midjourney using an experimental “teach mode”.

To do this, the user must:

Visit the Rabbit “teach mode” page
Enter the URL for the web application
(https://discord.com)
Press Start Session
Perform the actions required to accomplish the task
(generate an image via Midjourney)
Press Stop when the task is finished
Annotate the recorded tasks to help Rabbit generalize the task
(described, but not shown during the demo)

The start page for teaching Rabbit new tasks

After the user has completed these steps, the recorded task requires processing by the Rabbit servers before it is ready to use in the Rabbit OS.

Once ready, the user can generate images via Midjourney using any prompt simply by asking their Rabbit device.

Booking a Room via AirBnB

Another demo is available on the Research page of the Rabbit website.

In this demo, we see a user booking a room via AirBnB with a side-by-side view showing us what the Large Action Model (LAM) is observing and executing. This gives us a behind-the-scenes view of the LAM.

During the video, we can see the LAM taking the same actions as the user, but highlighting the user interface (UI) elements it has detected the user interacting with.

The Rabbit LAM appears to use hierarchical UI element detection, grouping individual HTML controls into higher-level conceptual controls.

Below we can see that when the user presses the + icon to increase the number of adults, the LAM identifies not only the plus button itself in red, but highlights in green the higher-level “Number of Adults” control that includes the corresponding minus button.

On the left, the user records their actions; on the right, the LAM extracts those into high-level instructions

Overlaid at the bottom of the LAM image, we also see the high-level instruction the LAM detected from the user’s interaction.

The combined instructions from the user’s recording session form the basis of the web automation script that the LAM will use to execute the action later.

In the second half of the video, we see the LAM execute this web automation script.

During the execution, we see the same hierarchical detection of controls used when recording the script, then the LAM using these controls to perform the instructions.

How Rabbit Might Implement “Teach Mode”

If you scroll further down the Research page, the Rabbit team discusses the difficulty of getting LLMs to understand applications from raw HTML.

Likewise, they claim that “teach mode” also works for mobile and desktop applications, which don’t use HTML.

That implies they are using a multimodal model to detect and interact with UI controls.

For instance, if I give ChatGPT a screenshot of the AirBnB web site and ask it where I would search for a city to rent a room in, it immediately tells me how to find the UI control:

While in this simple example, ChatGPT described the control in a block of text, it wouldn’t be hard to use either prompt engineering or fine-tuning to return a structured response that can be transformed into an instruction.
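As a rough sketch of what “a structured response that can be transformed into an instruction” could mean, suppose the model is prompted to reply with a small JSON object. The field names (action, target, value), the Instruction class and the sample reply below are my own assumptions for illustration, not anything Rabbit or OpenAI has documented.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class Instruction:
    action: str            # e.g. "type" or "click" (assumed vocabulary)
    target: str            # human-readable name of the UI control
    value: Optional[str]   # text to enter, if any


def parse_instruction(model_reply: str) -> Instruction:
    """Turn a JSON reply from the model into an Instruction.

    Raises ValueError if the reply is not the JSON object we asked for.
    """
    try:
        data = json.loads(model_reply)
    except json.JSONDecodeError as exc:
        raise ValueError(f"model did not return JSON: {exc}") from exc
    if not isinstance(data, dict) or "action" not in data or "target" not in data:
        raise ValueError("reply is missing 'action' or 'target'")
    return Instruction(data["action"], data["target"], data.get("value"))


# Hypothetical reply; a real model's output would need the same validation.
reply = '{"action": "type", "target": "Where: Search destinations", "value": "Lisbon"}'
instruction = parse_instruction(reply)
print(instruction)
```

Validating the reply before using it matters here, since a model can drift back into free text even when asked for JSON.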

Are They Using a Hierarchical Approach?

On their Research page, Rabbit refers to the paper HeaP: Hierarchical Policies for Web Actions Using LLMs, which describes how to integrate high-level instructions like “Enter a city” with the low-level commands required for a web automation engine to execute that instruction.

Rabbit appears to have innovated on this approach by using a multimodal LLM that can “look” at the graphical elements on a web page directly instead of attempting to infer them from the HTML, though see the next section, where I question whether this is true.

Using a hierarchical approach would allow them to create lists of instructions using far fewer tokens than using the raw HTML, and extend the approach into desktop and mobile applications which don’t use HTML, but whose controls are visually observable.

Are They Using a Multimodal Approach?

For HTML applications, they could create a multimodal LLM that detects UI controls both through images and HTML snippets.

Since this occurs in the context of a user recording their actions, they have access to user interaction information. This is helpful.

Instead of asking an LLM to look at an entire page, they can focus the attention of the LLM on the specific element the user just interacted with, e.g., by highlighting that element.

Likewise, rather than passing the entire HTML in the DOM of the web page, they can pass only the HTML for the specific element and its surrounding context. This addresses a problem they highlighted: the HTML used by most modern web applications is too large for smaller context windows.
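To make the idea of “the element and its surrounding context” concrete, here is a minimal sketch that walks up a fixed number of ancestors from the element the user clicked and returns only that subtree. It assumes well-formed, XHTML-like markup so that Python’s standard library parser can read it; real pages would need a forgiving HTML parser. The function name, the ids and the sample page are made up for illustration.

```python
import xml.etree.ElementTree as ET


def element_context(page: str, element_id: str, levels: int = 2) -> str:
    """Return the markup of the ancestor `levels` above the element with the given id.

    Assumes `page` is well-formed markup; this is a simplification.
    """
    root = ET.fromstring(page.strip())
    # ElementTree has no parent pointers, so build a child -> parent map.
    parents = {child: parent for parent in root.iter() for child in parent}
    target = next((el for el in root.iter() if el.get("id") == element_id), None)
    if target is None:
        raise LookupError(f"no element with id {element_id!r}")
    node = target
    for _ in range(levels):
        if node not in parents:
            break  # reached the document root
        node = parents[node]
    return ET.tostring(node, encoding="unicode")


# Hypothetical page: the user just clicked the button with id "adults-plus".
page = """
<body>
  <header><nav id="menu">Lots of unrelated markup</nav></header>
  <form>
    <div aria-label="Number of Adults">
      <button id="adults-minus">-</button>
      <span>2</span>
      <button id="adults-plus">+</button>
    </div>
  </form>
  <footer>More unrelated markup</footer>
</body>
"""
snippet = element_context(page, "adults-plus", levels=1)
print(snippet)
```

With one level of context, the snippet contains the whole “Number of Adults” group (both buttons and the current value) but none of the header or footer markup.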

The net result would be improved UI control detection and operation.

In fact, it might be possible to get acceptable results using context-based HTML alone, without any image recognition at all, though I’m not sure how this would translate to desktop or mobile applications.

Are They Using Multiple Models?

We can break the problem of detecting and executing tasks within software applications into separate stages, each of which could, in theory, use its own customized model:

UI Control Detection
A method to detect UI controls. This itself might be hierarchical, as noted above where a plus button could be identified as part of a broader “Number of Adults” input control.
Instruction Abstraction
A method to create a high-level instruction from a set of UI actions. For instance, clicking the plus button on “Number of Adults” three times could be abstracted into the instruction, “Set number of adults to [3]”.
Task Reasoning
A method of sequencing instructions together and tying them back to the task goal, especially when the UI varies slightly each time, or some information is missing from the user. Might also be used to group instructions into sub-tasks.

Not all of these models would need to be LLMs—for instance, input control detection from an HTML snippet might be done via an algorithm.
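The instruction abstraction stage is a good example of something a plain algorithm can handle in simple cases. The sketch below collapses consecutive plus/minus clicks on the same control into a single instruction, matching the “Set number of adults to [3]” example above. The Click type and the assumption that UI control detection has already labeled each click with its parent control are mine.

```python
from dataclasses import dataclass
from itertools import groupby


@dataclass(frozen=True)
class Click:
    control: str  # higher-level control, e.g. "Number of Adults"
    button: str   # "plus" or "minus"


def abstract_clicks(clicks: list[Click], start_values: dict[str, int]) -> list[str]:
    """Collapse runs of plus/minus clicks on one control into one instruction."""
    instructions = []
    values = dict(start_values)
    # groupby only merges *consecutive* clicks on the same control.
    for control, run in groupby(clicks, key=lambda c: c.control):
        value = values.get(control, 0)
        for click in run:
            value += 1 if click.button == "plus" else -1
        values[control] = value
        instructions.append(f"Set {control.lower()} to [{value}]")
    return instructions


recorded = [Click("Number of Adults", "plus")] * 3 + [Click("Number of Children", "plus")]
result = abstract_clicks(recorded, start_values={})
print(result)
```

The bracketed value marks the part of the instruction that would become a parameter when the task is generalized.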

Are They Using Open Source Web Automation Software?

Once they have a script, they need to execute it when the user makes a request that requires it.

Side Note: From what I’ve read, it appears they are using ChatGPT as the “master” LLM that coordinates calling out to these scripts based on the user’s request—which is essentially the concept of plugins within GPT.

Popular open source web automation projects include Playwright, Puppeteer and Cypress, all of which have packages that allow you to record your actions in a browser to generate scripts that can later be replayed.

At first glance, extending the script recorders with LLMs to handle UI detection and instruction abstraction appears to be an engineering problem rather than an innovation problem.

A separate LLM (the Large Action Model) could then be trained to generate scripts for one of these extended web automation tools, allowing Rabbit to scale task running via cloud-based virtual machines.

Such tools could also save the session cookies or local storage data required for authentication, allowing them to run the scripts as the same user later on without requiring the credentials (until the session expires, of course).
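To illustrate the shape of such a system without tying it to one tool, here is a small sketch of replaying a recorded script with a saved session. The Driver interface, the FakeDriver that only logs calls, the script format and the cookie name are all hypothetical; a real version would implement Driver on top of an engine such as Playwright.

```python
from dataclasses import dataclass, field
from typing import Protocol


class Driver(Protocol):
    """Hypothetical minimal interface a web automation engine would implement."""
    def load_session(self, cookies: dict[str, str]) -> None: ...
    def goto(self, url: str) -> None: ...
    def fill(self, target: str, text: str) -> None: ...
    def click(self, target: str) -> None: ...


@dataclass
class FakeDriver:
    """Stand-in that records calls instead of driving a browser."""
    log: list[tuple] = field(default_factory=list)

    def load_session(self, cookies: dict[str, str]) -> None:
        self.log.append(("load_session", sorted(cookies)))  # names only, not values

    def goto(self, url: str) -> None:
        self.log.append(("goto", url))

    def fill(self, target: str, text: str) -> None:
        self.log.append(("fill", target, text))

    def click(self, target: str) -> None:
        self.log.append(("click", target))


def replay(script: list[dict], params: dict[str, str],
           driver: Driver, cookies: dict[str, str]) -> None:
    """Restore the saved session, then run each recorded step in order."""
    driver.load_session(cookies)
    for step in script:
        action = step["action"]
        if action == "goto":
            driver.goto(step["url"])
        elif action == "fill":
            # Placeholders like {prompt} are filled from the user's request.
            driver.fill(step["target"], step["text"].format(**params))
        elif action == "click":
            driver.click(step["target"])
        else:
            raise ValueError(f"unknown action: {action}")


# Hypothetical script generalized from the Midjourney recording.
script = [
    {"action": "goto", "url": "https://discord.com"},
    {"action": "fill", "target": "Message box", "text": "/imagine {prompt}"},
    {"action": "click", "target": "Send"},
]
driver = FakeDriver()
replay(script, {"prompt": "a rabbit in a spacesuit"}, driver, {"session_id": "saved-token"})
print(driver.log)
```

Keeping the script as plain data is what would let a separate model generate or edit it, and lets the same script run against different engines.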

However, none of this appears to be the case. According to their Research page:

“To assist with the new model, we designed the technical stack from the ground up, from the data collection platform to a new network architecture that utilizes both transformer-style attention and graph-based message passing, combined with program synthesizers that are demonstration and example-guided.”

I’m still curious what results you could get from modifying existing open source projects, especially if you do pre-processing of the image or HTML data to remove noise before prompting the LLMs.

How Do They Handle Missing Parameters, Variations & Errors?

Frankly, I don’t know. They mention a series of different approaches, and then talk about how they have created a hybrid system that uses both symbolic algorithms and neural networks. That’s beyond my current knowledge of AI.

That said, some things I might explore when building such a system include:

Sub-Goals
Use an LLM to determine sub-goals for the task and group the instructions into sub-tasks that support each sub-goal. For example, when booking an AirBnB, I need to a) set my filters, b) browse the results, c) choose an option, and d) book that option. Each of these can then be broken down further into sub-tasks.
Sequences
Using a combination of algorithms and the LLM, determine which sub-goals have dependencies on each other. For instance, I need to set my filter before I choose an option, but the order I apply my filters may be irrelevant.
Parameter Analysis
Identify which input parameters are required and when, so the LLM can go back to the user for more information. In the Rabbit demo, they didn’t show any conversational capabilities—they provided all the information upfront. In my explorations with custom GPTs, I’ve found it helpful to teach them to ask for missing required parameters before calling the plugin.
Assertions
Use the assertions capabilities of web automation software to assert that actions were successful, and fine-tune and/or train the action LLM which corrective action to take based on certain errors, including what to report back to the user.
Completion Conditions
Similar to assertions, being able to extract and automatically define completion conditions for specific sub-goals can be useful to allow an LLM to know when to move to the next sub-goal.
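As a small illustration of the Sequences and Parameter Analysis ideas, the sketch below orders the AirBnB sub-goals by their dependencies and lists which required parameters the assistant would still need to ask for. The dependency graph and parameter names are invented for the example; in practice they would have to be extracted from the recording.

```python
from graphlib import TopologicalSorter

# Sub-goal -> sub-goals that must be finished first (assumed for illustration).
dependencies = {
    "set filters": set(),
    "browse results": {"set filters"},
    "choose option": {"browse results"},
    "book option": {"choose option"},
}

# Parameters each sub-goal needs from the user (hypothetical names).
required_params = {
    "set filters": ["city", "check_in", "check_out", "adults"],
    "book option": ["payment_method"],
}


def plan(provided: dict[str, str]) -> tuple[list[str], list[str]]:
    """Return sub-goals in a valid order, plus any parameters still missing."""
    order = list(TopologicalSorter(dependencies).static_order())
    missing = [
        name
        for goal in order
        for name in required_params.get(goal, [])
        if name not in provided
    ]
    return order, missing


order, missing = plan({"city": "Lisbon", "adults": "2"})
print(order)
print(missing)
```

Here the assistant would know to ask for the check-in date, check-out date and payment method before starting, rather than failing partway through the booking.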

My gut tells me this is a mix of engineering and invention. The Rabbit team definitely felt strongly enough that they needed an innovative new approach to this problem, and built their system from the ground up to handle these things. We’ll see in the coming months when it reaches the hands of consumers how well they did.

Don’t Believe the Hype

Rabbit has done an excellent job of marketing the launch of their project. If you read the dozens of articles coming out about the product, they mostly repeat the claims from the presentation. In this section, I want to challenge some of those claims.

The “No Software” Claim

During the demo, they claimed that teaching Rabbit required no software.

For web applications, they technically could proxy all interactions through their servers, so they had access to the HTML being sent back and forth, without breaking the security model of the browser. That would also give them access to the session cookies, allowing them to later authenticate as the user without having their credentials—a claim they made.

I don’t see how that would work for desktop and mobile applications, however. They didn’t demo teaching Rabbit how to work with these applications, so it’s unclear how this would work.

I suspect they are stretching the truth. Teaching using a browser extension would be easier and more secure than proxying requests. And automated recording or execution of desktop or mobile applications would require the user to give explicit permissions to installed software.²

The alternative explanation that WIRED gave is that you simply point the camera at your desktop screen and train that way. While Figure did just demonstrate video-based training of a robot, during the Rabbit training the user had a Stop button overlaid on the web page they were training, which wouldn’t be possible if you were purely video recording your screen.

Bottom line: something doesn’t add up, but we’ll find out once the device ships at the end of March.

The “No Apps” Claim

Lots of journalists ran with their “no apps” claim. And while technically true, I find it disingenuous.

First off, they are using apps, they just aren’t installed in the OS. They are web applications that the AI needs to visit to execute tasks.

And from everything they’ve demoed, this is NOT a general purpose AI that can automatically figure out what to do. You have to essentially train it step-by-step. What we appear to have here are glorified automation scripts, integrated into an LLM using a plugin architecture.

Yes, “plugins” aren’t technically “apps”, but they’re pretty close. And the idea that you can “just ask it to do anything” without doing any prep work—which is the implication they are making with the no apps claim—is clearly false.

Secondly, AI assistants like Siri and Google Assistant already have automation scripts—they’re called shortcuts.

Yes, they aren’t as capable or teachable as the functionality demoed by Rabbit, but that’s a far cry from not existing at all.

Bottom line: while technically true, Rabbit is still using an app-like architecture requiring pre-defined plugins. Their ultimate goal may be a general purpose AI that can operate any app without training, but they’re not there yet.

The “No Credentials Stored” Claim

They claim to not store 3rd-party credentials or save your username or password. However, they must store some credential to be able to access the web application at a later date.

It appears from the demo that they might use OAuth, in which case they have an access token they need to save to access the application later on.

Many web applications don’t support OAuth, however, so how do they handle saving the authentication information for those? If they proxy requests, they could store the session token sent back from the server. But again, this is storing a form of a credential.

Either way, they are storing some credential that allows them to access the service later using an automated script. How securely that credential is stored is critical, even if it’s an access or session token that can be more easily expired than a username & password.

Bottom line: they appear to be using a narrow definition of what a “credential” is to gloss over the fact that they are storing authentication information that could be hacked or misused.

Final Thoughts

Overall, I’m impressed by Rabbit. They’ve combined a bunch of existing technologies together in a well-designed package, and appear to have innovated on a few key areas of web automation.

And the latency they showed with ChatGPT is impressive. Getting rid of latency would greatly reduce my editing time for my AI Meets Productivity podcast. Though I wonder what the real-world latency will be at scale.

My impression is that this was rushed to market for CES, and will still have many rough edges when it ships at the end of March.

I also don’t foresee this becoming a mainstream product. Few people will want to carry multiple devices with them and nothing about the hardware looks special. iPhone 15 Pro users can already replace Siri with ChatGPT for one-button access, and both the plugin capabilities of GPT and the GPT store will quickly outpace the innovation shown by Rabbit today.

Still, it’s interesting to see how quickly all of this is developing and how often unexpected developments occur. I totally didn’t see a dedicated AI device like Rabbit being announced at CES.

Clap if you found this article interesting, and comment below if you want me to dive into how to add similar “teaching” functionality into ChatGPT.

¹ It’s unclear whether Rabbit’s Large Action Model is a true generalized foundational model or a task-specific model, or how “large” it truly is. They don’t appear to have raised enough capital to train a foundational LLM. For the purposes of this article, though, I’m going to use their terminology.

² Technically, the user could give the browser application permission for recording the scripts, but I don’t see how this would work to execute those scripts in a desktop OS.

Postscript: My latest contract just ended and I am exploring options within AI. I’m particularly interested in exploring new user interfaces (UI/UX) for LLMs—I already have a list of potential new applications—and/or systemic AI alignment, aka how to prevent bad things from happening when systems of “aligned” AI agents are working together and/or integrated into critical infrastructure. If you have opportunities you want to discuss in either, get in touch.