We got some great feedback from the community, with the most positive response being about our vision-first approach used in our browser agent. However, many wanted to use the underlying agent outside the testing domain. So today, we're releasing our fully featured AI browser automation framework.
You can use it to automate tasks on the web, integrate between apps without APIs, extract data, test your web apps, or as a building block for your own browser agents.
Traditionally, browser automation could only be done via the DOM, even though that’s not how humans use browsers. Most browser agents are still stuck in this paradigm. With a vision-first approach, we avoid relying on flaky DOM navigation and perform better on complex interactions found in a broad variety of sites, for example:
- Drag and drop interactions
- Data visualizations, charts, and tables
- Legacy apps with nested iframes
- Canvas and WebGL-heavy sites (like design tools or photo editing)
- Remote desktops streamed into the browser
To interact accurately with the browser, we use visually grounded models to execute precise actions based on pixel coordinates. The model used by Magnitude must be smart enough to plan out actions but also able to execute them. Not many models are both smart *and* visually grounded. We highly recommend Claude Sonnet 4 for the best performance, but if you prefer open source, we also support Qwen-2.5-VL 72B.
Most browser agents never make it to production. This is because of (1) the flaky DOM navigation mentioned above, and (2) the lack of control most browser agents offer. The dominant paradigm is that you give the agent a high-level task + tools and hope for the best. This quickly falls apart for production automations that need to be reliable and specific. With Magnitude, you have fine-grained control over the agent with our `act()` and `extract()` syntax, and can mix it with your own code as needed. You also have full control of the prompts at both the action and agent level.
```ts
// Magnitude can handle high-level tasks
await agent.act('Create an issue', {
  // Optionally pass data that the agent will use where appropriate
  data: {
    title: 'Use Magnitude',
    description: 'Run "npx create-magnitude-app" and follow the instructions',
  },
});

// It can also handle low-level actions
await agent.act('Drag "Use Magnitude" to the top of the in progress column');

// Intelligently extract data based on the DOM content matching a provided zod schema
const tasks = await agent.extract(
  'List in progress issues',
  z.array(z.object({
    title: z.string(),
    description: z.string(),
    // Agent can extract existing data or new insights
    difficulty: z.number().describe('Rate the difficulty between 1-5'),
  })),
);
```
We have a setup script that makes it trivial to get started with an example: just run "npx create-magnitude-app". We’d love to hear what you think!
One thing I'm wondering is if there's anyone doing this at scale? The issue I see is that with complex workflows which take several dozen steps and have complex control flow, the probability of reaching the end falls off pretty hard, because if each step has a .95 chance of completing successfully, after not very many steps you have a pretty small overall probability of success. These use cases are high value because writing a traditional scraper is a huge pain, but we just don't seem to be there yet.
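To put rough numbers on that compounding, here is a small sketch. It assumes every step succeeds independently with the same probability, which is a simplification (real steps vary and failures can be correlated). The function names are illustrative only.

```ts
// Probability that a workflow of `steps` steps finishes when every step
// independently succeeds with probability `p` (an idealized assumption).
function chainSuccessProbability(p: number, steps: number): number {
  if (p < 0 || p > 1) throw new RangeError('p must be in [0, 1]');
  if (!Number.isInteger(steps) || steps < 0) {
    throw new RangeError('steps must be a non-negative integer');
  }
  return Math.pow(p, steps);
}

// Largest number of steps that keeps overall success at or above `target`.
function maxStepsForTarget(p: number, target: number): number {
  if (p <= 0 || p >= 1) throw new RangeError('p must be strictly between 0 and 1');
  if (target <= 0 || target > 1) throw new RangeError('target must be in (0, 1]');
  // p^n >= target  <=>  n <= ln(target) / ln(p), since ln(p) is negative.
  return Math.floor(Math.log(target) / Math.log(p));
}

for (const n of [10, 20, 50]) {
  const percent = (chainSuccessProbability(0.95, n) * 100).toFixed(1);
  console.log(`${n} steps at 95% per step: ${percent}% overall`);
}
```

Under these assumptions, a 95% per-step rate gives roughly 60% at 10 steps, 36% at 20, and 8% at 50, and only workflows of 13 steps or fewer stay above a coin flip.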
The other side of the coin is simple workflows, but those tend to be the workflows where writing a scraper is pretty trivial. I tried this and it did work: I told it to search for a product at a local store, but the run cost $1.05. So doing it at any scale quickly becomes a little bit silly.
So I guess my question is: who is having luck using these tools, and what are you using them for?
One route I had some success with is writing a DSL for scraping, having the LLM generate code in that DSL, then interpreting it and editing it when it gets stuck. But then there's the "getting stuck detection" part, which is hard, and so on.
We are currently optimizing for reliability and quality, which is why we suggest Claude - but it can get expensive in some cases. Using Qwen 2.5-VL-72B will be significantly cheaper, though it may not always be as reliable.
Most of our usage right now is for running test cases, and people often seem to prefer Qwen for that use case, since it's typically clearer how to execute a test case.
Something that is top of mind for us is figuring out a good way to "cache" the workflows that get taken. This way you can repeat automations either with no LLM or with a smaller/cheaper LLM. This would enable deterministic, repeatable flows that are also very affordable and fast. So even if each step on the first run is only 95% reliable, once it gets through, it could repeat the flow with 100% reliability.
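As a rough illustration of the caching idea (not Magnitude's actual design), the sketch below records the action sequence from a first successful run and replays it later without calling the planner. The `Action`, `Planner`, `Executor`, and `WorkflowCache` names are all hypothetical, and the planner function stands in for the expensive LLM call.

```ts
// Hypothetical action shape for illustration; not Magnitude's real types.
type Action =
  | { kind: 'click'; x: number; y: number }
  | { kind: 'type'; text: string };

// Planner stands in for the expensive LLM call; Executor for the browser.
type Planner = (task: string) => Promise<Action[]>;
type Executor = (action: Action) => Promise<void>;

class WorkflowCache {
  private readonly store = new Map<string, Action[]>();
  private readonly plan: Planner;
  private readonly execute: Executor;

  constructor(plan: Planner, execute: Executor) {
    this.plan = plan;
    this.execute = execute;
  }

  // Runs the task and returns true when it was replayed from the cache.
  async run(task: string): Promise<boolean> {
    const cached = this.store.get(task);
    const actions = cached ?? (await this.plan(task));
    for (const action of actions) {
      await this.execute(action); // a throw here aborts the run
    }
    // Only remember a sequence after every step completed without throwing.
    if (cached === undefined) this.store.set(task, actions);
    return cached !== undefined;
  }
}
```

Replay is only deterministic while the page stays the same, so a real version would probably need a check that the screen still matches what was recorded, with a fallback to the planner when it does not.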
We recently gave the model access to a file system so that it never forgets what it's supposed to do - we already have a ton of users very happy with recent reliability updates!
We also have workflow-use in beta, which is basically what's mentioned in the comments here to "cache" a workflow: https://github.com/browser-use/workflow-use
Let us know what you guys think - we are shipping hard and fast!
browser-use is still strongly coupled to the DOM for interaction because of the set-of-marks approach it uses (for context - those little rainbow boxes you see around the elements). This means it’s very difficult to get it to reliably do interactions outside of straightforward click/type like drag and drop, interacting with canvas, etc.
Since we interact based purely on what we see on the screen using pixel coordinates, those sorts of interactions are a lot more natural to us and perform much more reliably. If you don't believe me, I encourage you to try to get both Magnitude and browser-use to drag and drop cards on a Kanban board :)
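For readers unfamiliar with coordinate-based interaction, here is a minimal sketch of a drag expressed purely in pixel coordinates. It is not Magnitude's implementation. `MouseLike` is a hypothetical structural interface chosen so that Playwright's `page.mouse` (which has `move`, `down`, and `up`) could be passed in, and `dragByCoordinates` is an illustrative name.

```ts
// Minimal structural subset of a mouse API. Playwright's `page.mouse`
// offers move/down/up with compatible shapes, but any object will do.
interface MouseLike {
  move(x: number, y: number, options?: { steps?: number }): Promise<void>;
  down(): Promise<void>;
  up(): Promise<void>;
}

interface Point {
  x: number;
  y: number;
}

// Drag from one screen position to another without touching the DOM.
async function dragByCoordinates(
  mouse: MouseLike,
  from: Point,
  to: Point,
  steps = 10,
): Promise<void> {
  await mouse.move(from.x, from.y); // hover over the drag handle
  await mouse.down(); // press and hold
  // Move in several increments so the page sees a stream of mousemove
  // events, which many drag-and-drop libraries need to register a drag.
  await mouse.move(to.x, to.y, { steps });
  await mouse.up(); // release over the drop target
}
```

With Playwright this could be called as `await dragByCoordinates(page.mouse, { x: 120, y: 340 }, { x: 620, y: 180 })`, where the coordinates would come from the visually grounded model rather than from element selectors.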
Regardless, best of luck!
I've been working on a Chrome extension with a side panel. Think of it like the side panel copilot in VSCode, Cursor, or Windsurf. Currently it automates workflows, but those are hard coded. I've started working on a more generalized automation using langchain. Looking at your code is helpful because, in only a few hundred lines of code, I can recreate a huge portion of Playwright's capabilities in a Chrome extension side panel, so I should be able to port it to the Chrome extension. That is, I'm creating tools like mouse click, type, mouse move, open tab, navigate, wait for element, etc.
Looking at your code, I'm thinking about pulling anything that isn't coupled to node while mapping all the Playwright capabilities to the equivalent in a Chrome extension. It's busy work.
If I do that, why would I prefer using .baml over the equivalent langchain? What's the difference? Am I comparing apples to oranges? I'm not worried about using langgraph because I should be able to get most of the functionality with xstate v5 [0] plus serialized portable JSON state graphs, so I can store custom graphs on a remote server that can be queried by API.
That is my question. I don't see langchain in the dependencies, which is cool, but why .baml? Also, what am I missing going down this thought path?
[0] https://chatgpt.com/share/685dfc60-106c-8004-bbd0-1ba3a33aba...
To answer your question - BAML is a DSL that helps to define prompts, organize context, and get better performance on structured output from the LLM. In theory you should be able to map similar logic over to other clients.
You can do almost anything you can do in Playwright (navigate, open new tabs, scroll) with the added benefit of keeping the human in the loop. Conceptually they are exactly the same; I can go into that more if you want. Most of the limitations are security features. However, for automated workflows, the security features should be heeded for good reason. For example, the ChatGPT console requires isTrusted to be true and rejects synthetic events, so it is impossible to automate the ChatGPT console without workarounds, which they will likely close. That is the biggest limitation. On the other hand, there are 3 billion Chrome users and they can download the extension with a single click. Bypassing security features, like requiring a human button press or mouse click to go fullscreen, play sound, or transfer money on a bank website, shouldn't be allowed. If the use case requires that, use Playwright or a BrowserWindow in an Electron application.

A Chrome extension with a side panel can collect every visible element using stacking context to limit the amount of data processed by an LLM; it can capture all the inner text of a page; it can read every single fetch and XMLHttpRequest, which is a very good way to get data without loading tons of markup; it can make fetch and XMLHttpRequest calls in the MAIN world so they automatically contain all the cookies; and it can use huggingface/transformers.js to transcribe audio or video to text with OpenAI Whisper, or perform OCR (image to text) on WebGPU, if available.
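The isTrusted limitation can be sketched as follows. `isTrusted` is a standard read-only property on DOM events: it is true for events generated by a real user action and false for events a script creates and dispatches. How any particular site actually performs its check is not known here, so `guardTrusted` and `EventLike` below are hypothetical names for one plausible pattern.

```ts
// Structural stand-in for the two DOM Event fields used here, so the
// sketch does not depend on browser typings.
interface EventLike {
  type: string;
  isTrusted: boolean;
}

// Wraps a handler so that script-dispatched (synthetic) events are ignored.
// Returns true when the wrapped handler actually ran.
function guardTrusted<E extends EventLike>(
  handler: (event: E) => void,
): (event: E) => boolean {
  return (event) => {
    if (!event.isTrusted) {
      return false; // created by script via dispatchEvent(): reject
    }
    handler(event); // generated by a real user action: accept
    return true;
  };
}
```

A page that wraps its click or keyboard handlers this way never reacts to events dispatched from an extension's content script, which is why automation running inside the page cannot simply fake the input.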
I can systematically analyze, poke, and prod thousands of websites running with Playwright in the cloud to discover all the capabilities and automatically create workflows with xstate v5, which are sent to the Chrome extension as JSON. For example, I can automatically navigate to a website, find all the inputs, try several ways to inject text, and use image to text to test whether the text was added to the field, then add that to the list of capabilities. So if a user is on the page, I can automate the workflow or notify the user that they need to take a step.
I think the best idea is to have curated workflows and curated data embeddings that target focused industries. It can automate navigating the browser to MLS and zillow.com, collect information, inject it into Google Sheets or Office 365 Excel, export it, navigate to email, write up the information, attach the file to the email, and send it, all with the human in the loop. Moreover, if it does 95% of the work, I don't think humans will mind pressing a button or taking an action when prompted. The question is: will people prefer this over full automation running somewhere in the cloud? How do you feel about using a code assistant? Do you like being in the loop?
This is all experimental. The gif has a good example of a side panel automating stock option trading. I'm going to try and inject your code to see if I can start to develop systematic generalized automation with it. [0] [1]
[0] https://github.com/adam-s/doomberg-terminal
[1] https://github.com/adam-s/doomberg-terminal/tree/main/docs/m...
You can also use cheaper models depending on your needs, for example Qwen 2.5 VL 72B is pretty affordable and works pretty well for most situations.
Best of both worlds. The Playwright script is more of a cache than a test.
It seems like a hybrid approach would scale better and be significantly cheaper.