Here's a quick demo showing how you can use it to scrape leads from an auto dealer directory. What's cool is that it scrapes non-uniform pages, which is quite hard to do with "traditional" scrapers: https://youtu.be/wPbyPSFsqzA
A little background: I've written lots and lots of scrapers over the last 10+ years. They're fun to write when they work, but the internet has changed in ways that make them harder to write. One change has been the increasing complexity of web pages due to SPAs and obfuscated CSS/HTML.
I started experimenting with using ChatGPT to parse pages, and it's surprisingly effective. It can take the raw text and/or HTML of a page, and answer most scraping requests. And in addition to traditional scraping things like pulling out prices, it can extract subjective data, like summarizing the tone of an article.
As an example, I used FetchFox to scrape Hacker News comment threads. I asked it for the number of comments, and also for a summary of the topic and tone of the articles. Here are the results: https://fetchfoxai.com/s/cSXpBs3qBG . You can see the prompt I used for this scrape here: https://imgur.com/uBQRIYv
Right now, the tool does a "two step" scrape. It starts with an initial page (like LinkedIn) and looks for specific types of links on that page (like links to software engineer profiles). It does this using an LLM, which receives a list of links from the page and picks out the relevant ones.
Then, it queues up each link for an individual scrape. It directs Chrome to visit each page, grab the text/HTML, and then analyze it with an LLM.
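To make that first step concrete, here's a rough sketch of what the link-filtering could look like. This isn't FetchFox's actual code; the function name, model choice, and prompt are placeholders, and it assumes a content script with an OpenAI API key:

    // Hypothetical sketch of the "crawl" step: gather the page's links,
    // then ask the OpenAI chat completions API which ones match the
    // user's description. Names and prompt are illustrative only.

    type Link = { href: string; text: string };

    async function findRelevantLinks(description: string, apiKey: string): Promise<string[]> {
      // Collect every link on the current page.
      const links: Link[] = Array.from(document.querySelectorAll("a")).map((a) => ({
        href: a.href,
        text: a.textContent?.trim() ?? "",
      }));

      // Ask the LLM to filter the list down to the relevant hrefs.
      const resp = await fetch("https://api.openai.com/v1/chat/completions", {
        method: "POST",
        headers: { "Content-Type": "application/json", Authorization: `Bearer ${apiKey}` },
        body: JSON.stringify({
          model: "gpt-4o-mini",
          messages: [{
            role: "user",
            content:
              `Here is a JSON list of links from a web page:\n${JSON.stringify(links)}\n\n` +
              `Return a JSON array containing only the hrefs that match this description: ` +
              `"${description}". Respond with the JSON array and nothing else.`,
          }],
        }),
      });
      const data = await resp.json();
      return JSON.parse(data.choices[0].message.content);
    }

The extract step works the same way, except the prompt contains the page's text/HTML instead of the link list.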
There are options for how fast/slow to do the scrape. Some sites (like HN) are friendly, and you can scrape them very fast. For example here's me scraping Amazon with 50 tabs: https://x.com/ortutay/status/1824344168350822434 . Other sites (like LinkedIn) have strong anti-scraping measures, so it's better to use the "1 foreground tab" option. This is slower, but it gives better results on those sites.
The extension is 100% free forever if you use your OpenAI API key. It's also free "for now" with our backend server, but if that gets overloaded or too expensive we'll have to introduce a paid plan.
Last thing, you can check out the code at https://github.com/fetchfox/fetchfox . Contributions welcome :)
I also assume you don't check the robots.txt of websites?
I'm all for automating tedious work, but with all this (mostly AI-related) scraping, things are getting out of hand and creating a lot of headaches for developers maintaining heavily scraped sites.
related:
- "Dear AI Companies, instead of scraping OpenStreetMap, how about a $10k donation?" - https://news.ycombinator.com/item?id=41109926
- "Multiple AI companies bypassing web standard to scrape publisher sites" https://news.ycombinator.com/item?id=40750182
Imo, users should be allowed to use automation tools to access websites and collect data. Most of these sites thrive off of user-generated content anyway; Reddit, for example, is built on UGC. Why shouldn't people be able to scrape it?
https://www.forbes.com/sites/zacharysmith/2022/04/18/scrapin...
These are entirely different things. The upshot of the proceedings is that while the courts ruled there weren't sufficient grounds for an injunction to stop the scraping, it was nonetheless still injurious to the plaintiff and had breached their User Agreement -- thus allowing LinkedIn to compel hiQ towards a settlement.
From Wikipedia:
The 9th Circuit ruled that hiQ had the right to do web scraping.[1][2][3] However, the Supreme Court, based on its Van Buren v. United States decision,[4] vacated the decision and remanded the case for further review in June 2021. In a second ruling in April 2022 the Ninth Circuit affirmed its decision.[5][6] In November 2022 the U.S. District Court for the Northern District of California ruled that hiQ had breached LinkedIn's User Agreement and a settlement agreement was reached between the two parties.[7]
Also wondering how the OP thinks about comparing themselves and standing out in a marketplace of seemingly a bazillion options
More specifically, FetchFox is targeting a particular niche of scraping. It focuses on small-scale scraping, like dozens or a few hundred pages. This is partly because, as a Chrome extension, it can only scrape what the user's internet connection can support. You can't scrape thousands or millions of pages on a residential connection.
But a separate reason is that I think LLMs open up a new market and use case for scraping. FetchFox lets anyone scrape without coding knowledge. Imagine you're doing a research project, and want data from 100 websites. FetchFox makes that easy, whereas with traditional scraping you would have needed coding knowledge to scrape those sites.
As an example, I used FetchFox to research political bias in the media. I was able to get data from hundreds of articles without writing a line of code: https://ortutay.substack.com/p/analyzing-media-bias-with-ai . I think this tool could be used by many non-technical people in the same way.
e.g. Mr John Smith is a journalist; find his ten most recent articles by locating his personal website, news sites, and social media.
so wondering if your tool will be obsolete in a year's time?
Try it out and see if you like it. Curious if you think it’s better or worse than ChatGPT for scraping dozens of pages.
Personally, I am looking into options in this area. Are you planning to offer a cloud-based version of this at some point, and if not, could you tell me which existing ones are good?
I do want to offer a cloud version. If it’s something you’d be interested in, please email me; you might be a good fit as an early user, and you’d get 1:1 attention. Email is on the site and in my profile.
It is not easy to evaluate scrapers unless you have had to deal with lots of poorly written websites in your life. Just using it on a few highly structured well-maintained sites can be impressive but if you are using it to acquire data from many websites or large websites things get hairy fast.
Most scrapers today are some combination of extracting xpaths and reducing them to the loosest common form, parsing semantic (or easy to identify, like links) or highly structured content that has discoverable patterns, and LLMs.
The actual best way to scrape a site is to determine if they are populating the data you want with API calls and replicate those. People are usually more reluctant to completely change back-end code but will make subtle breaking changes (to your scraper) to front-end code all the time, for example small structural or naming changes. This has become more problematic since people have been moving to more SSR and semi-SSR injection. There can also be a problem with discovering all the pages on a site if its paging or search is poorly designed or implemented.
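To illustrate the point about replicating API calls: once you've found the JSON request a page makes in the Network tab, the scraper is often just a plain HTTP call. The endpoint, parameters, and response shape below are invented for illustration:

    // Hedged example of replicating a site's own data API instead of parsing
    // its HTML. The URL, params, and response shape are made up; in practice
    // you copy them from the request the page itself makes in devtools.

    async function fetchListings(page: number): Promise<unknown[]> {
      const resp = await fetch(`https://example.com/api/listings?page=${page}&pageSize=50`, {
        headers: { Accept: "application/json" },
      });
      if (!resp.ok) throw new Error(`Request failed: ${resp.status}`);
      const data = await resp.json();
      return data.items ?? []; // field name depends on the site's JSON
    }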
Some of the worst sites to scrape are large WP sites that have obviously been through a few developers. If you really want to test a scraper, find some of those and they will put it to the test.
Cloudflare is another issue. Not necessarily an issue with this plugin, but because so many sites use it, you typically have to spin up multiple automated headless browsers using residential proxies for any type of large-scale scraping.
Some things that LLMs do shine at related to scraping are interpreting freeform addresses, custom tables (meaning no TR/TD, just divs and CSS to make it look like a table), and lists that are also just styled divs. Often there are no tags, attributes, keywords, or generalized xpath that will help you, depending on how the developer put it together.
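As a sketch of the freeform-address case (not any particular tool's implementation; the model and prompt are illustrative), you basically hand the raw string to the LLM and ask for structured JSON back:

    // Hedged sketch: normalizing a freeform address with an LLM, since no
    // selector or regex generalizes well here.

    async function parseAddress(raw: string, apiKey: string): Promise<{
      street: string; city: string; state: string; zip: string;
    }> {
      const resp = await fetch("https://api.openai.com/v1/chat/completions", {
        method: "POST",
        headers: { "Content-Type": "application/json", Authorization: `Bearer ${apiKey}` },
        body: JSON.stringify({
          model: "gpt-4o-mini",
          messages: [{
            role: "user",
            content:
              `Parse this address into JSON with keys street, city, state, zip. ` +
              `Use an empty string for missing fields. Respond with JSON only.\n\n${raw}`,
          }],
        }),
      });
      const data = await resp.json();
      return JSON.parse(data.choices[0].message.content);
    }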
Surprisingly, there is a pretty old library from Microsoft of all places called PROSE, if you can still find it (they keep updating it but using the same name for different things and trying to inject AI), that is really good at pattern matching and prediction. It's small, fast, and free, and generally great at building a generalized scraper. The only drawback is that, I believe, the only version I could find was .NET at the time.
For one thing, many (most? all?) large sites ban Amazon IPs from accessing their websites. This is not a problem for FetchFox.
Also, with FetchFox, you can scrape a logged in session without exposing any sensitive information. Your login tokens/passwords are never exposed to any 3rd party proxy like they would be with cloud scraping. And if you use your own OpenAI API key, the extension developer (me) never sees any of the activity in your scraping. OpenAI does see it, however.
> And is there any reliable scraping services that can actually do scraping of those large companies' sites at a reasonable cost?
FetchFox :).
But besides that, the gold standard for scraping is proxied mobile IP requests. There are services that let you make requests which appear to come from a mobile IP address. These are very hard for big sites to block, because mobile providers aggregate many customer requests together.
The downside is mainly cost. Also, the providers in this space can be semi-sketchy, depending on how they get the proxy bandwidth. Some employ spyware, or embed proxies into mobile games without user knowledge/consent. Beware what you're getting into.
1. Pay users to install a browser extension that scrapes social media content they browse. Or ask them to install it in exchange for a service, e.g. "remember everything I browse and make it searchable", etc.
2. Ship the data you scrape to your servers.
3. Sell training data to companies at a discount.
This gets past the new rate limiters and blocks that Reddit and others have installed.
What if we gave people some service, say a "browser toolbar", and in exchange we sell their browsing data to third parties?
You just reinvented spyware from first principles. This is basically BonziBuddy.
How does this work? Does it rely on GPT to extract the data or does it actually generate a bunch of selectors? If it's the former, then the results aren't reliable since it can just hallucinate whole results or even just parts.
I haven't put together a good test framework yet, but qualitatively, the results are surprisingly good, and hallucinations are fairly low. The prompt tells GPT to say (not available) if needed, which helps.
I'm going to try the "generate selectors" approach as well. If you'd like to learn more or discuss just reach out via email (marcell.ortutay@gmail.com) or discord (https://discord.gg/mM54bwdu59 @ortutay)
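For anyone curious, a simplified version of the kind of extraction prompt described above, with the "(not available)" fallback, might look like this. It's not the exact prompt FetchFox uses:

    // Simplified sketch of an extraction prompt with a "(not available)"
    // fallback to discourage hallucinated fields. Not FetchFox's exact prompt.

    function buildExtractionPrompt(pageText: string, questions: string[]): string {
      return [
        "Extract the following fields from the page content below.",
        'If a field cannot be found on the page, answer exactly "(not available)".',
        "Do not guess or invent values. Respond as a JSON object keyed by field name.",
        "",
        "Fields:",
        ...questions.map((q, i) => `${i + 1}. ${q}`),
        "",
        "Page content:",
        pageText,
      ].join("\n");
    }

    // Example usage:
    // buildExtractionPrompt(pageText, ["price of the item", "number of comments"]);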
> "By scraping raw text with AI, FetchFox lets you circumvent anti-scraping measures on sites like LinkedIn and Facebook. Even the the complicated HTML structures are possible to parse with FetchFox."
The trickier part is “everything else” needed to make the extension work.
The benefit of this approach is it's very simple and easy, but the downside is it sends a lot of unnecessary tokens to the LLM. That drives up the cost, slows things down, and hurts accuracy.
I'm working on a few things now to improve this:
1. Remove all <style> and <svg> tags. These rarely add value, and can dramatically increase token counts.
2. For the “crawl” step, I exclusively pull out <a> tags and only look at those. The “extract” step looks at the full HTML.
3. For now, it only looks at the first 50k text characters, and the first 120k HTML characters. This is to stay within token limits.
The last part will be what I focus on improving in the next version.
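Roughly, the preprocessing described above could look something like this. It's a sketch of the idea rather than the extension's exact code; the limits are the 50k/120k character caps mentioned:

    // Rough sketch of the preprocessing steps listed above (not the
    // extension's exact code). Limits match the numbers mentioned: 50k text
    // characters and 120k HTML characters.

    const MAX_TEXT_CHARS = 50_000;
    const MAX_HTML_CHARS = 120_000;

    function preprocess(doc: Document) {
      // 1. Drop <style> and <svg> tags; they add tokens but rarely add value.
      doc.querySelectorAll("style, svg").forEach((el) => el.remove());

      // 2. For the "crawl" step, only the <a> tags matter.
      const links = Array.from(doc.querySelectorAll("a")).map((a) => ({
        href: a.href,
        text: a.textContent?.trim() ?? "",
      }));

      // 3. Truncate to stay within token limits for the "extract" step.
      const text = doc.body.innerText.slice(0, MAX_TEXT_CHARS);
      const html = doc.body.innerHTML.slice(0, MAX_HTML_CHARS);

      return { links, text, html };
    }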
They keep throwing it in my URL bar. I refuse to click (big warning: it sends to Google's servers)
So far, I've done some un-scientific testing to compare text vs. HTML. Text is a lot more effective on a per-token basis, and therefore lower cost. However, some data is only available in HTML.
That sounds like a scold, but it's meant as an observation.
Now I will embed some implied scolding in what's to follow, but feel free to ignore that part; I wouldn't expect you to care.
But if you lack even a shred of human decency or morals, perhaps there's one more reason to spread your requests out across sites and time instead of absolutely pushing the abuse to the hilt. If you take a slower, more gentle approach (and I'm appealing to your selfishness here, because clearly that is the only viable way into your head), you will be less likely to provoke countermeasures, and more likely to succeed.
I don’t think any sites are going to get hammered though, even at the fastest rates. The limiting factor is often LLM token rates.