Completely agree with this sentiment.
I just spent the last couple of months developing a Chrome extension, but recently also did an unrelated web scraping project where I looked into all the common tools like Beautiful Soup, Selenium, Playwright, Puppeteer, etc., etc.
All of these tools were needlessly complicated, and I was having a ton of trouble with sites that required authentication. I then realized it would be way easier to write some JavaScript and paste it into my browser to do the scraping. Worked like a charm!
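If it helps anyone, the paste-into-the-console version can be tiny. A minimal sketch, where `.result`, `.title` and `.price` are placeholder selectors for whatever page you're on:

    // Run in the DevTools console on the page you're already logged in to.
    const rows = [...document.querySelectorAll('.result')].map(el => ({
      title: el.querySelector('.title')?.textContent.trim(),
      price: el.querySelector('.price')?.textContent.trim(),
      link: el.querySelector('a')?.href,
    }));
    console.table(rows);
    copy(JSON.stringify(rows, null, 2)); // copy() is a DevTools console helper

Since it runs in your normal session, the authentication problem mostly goes away - you're scraping a page you can already see.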
I feel that once it's been set up, it's very straightforward to use.
Maybe in contrast to other solutions you posted? Not sure about that though; having only brief experience with both, Playwright seems like an improved Cypress to me.
Tampermonkey also works around CORS issues with relative ease.
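For reference, the usual CORS escape hatch in Tampermonkey is GM_xmlhttpRequest with an @connect declaration. A minimal userscript sketch (the URLs are placeholders):

    // ==UserScript==
    // @name     cross-origin fetch sketch
    // @match    https://example.org/*
    // @grant    GM_xmlhttpRequest
    // @connect  example.com
    // ==/UserScript==

    // The request is made by the extension, not the page, so the page's
    // cross-origin restrictions don't apply.
    GM_xmlhttpRequest({
      method: 'GET',
      url: 'https://example.com/api/data',
      onload: (res) => console.log(res.status, res.responseText.slice(0, 200)),
      onerror: (err) => console.error('request failed', err),
    });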
Why?
https://github.com/Tampermonkey/tampermonkey
> This repository contains the source of the Tampermonkey extension up to version 2.9. All newer versions are distributed under a proprietary license.
Rough aspects:
a) It requires a _lot_ of browser permissions to install the extension, and I figured the audience who might be interested in their own search index would likely be put off by intrusive perms.
b) Loading the search index from localStorage on browser startup took 10-15s with a moderate number of sites; not great. Maybe it would be a fit for PouchDB or something else that makes IndexedDB tolerable (or WASM SQLite, if it's mature enough) - see the sketch after this list.
c) A lot of sites didn't like being scraped (even with rate limiting and back-off), and I ended up being served an annoying number of captchas in my regular everyday browsing.
d) Some walled-garden sites seem completely unscrapable (even in the browser) - e.g. LinkedIn.
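(Regarding b, the sketch mentioned above: plain IndexedDB is clunky, but it at least avoids one giant synchronous JSON.parse from localStorage on startup. The database, store, and key names here are made up.)

    function openDb() {
      return new Promise((resolve, reject) => {
        const req = indexedDB.open('my-search-index', 1);
        req.onupgradeneeded = () => req.result.createObjectStore('search-index');
        req.onsuccess = () => resolve(req.result);
        req.onerror = () => reject(req.error);
      });
    }

    async function loadIndex() {
      const db = await openDb();
      return new Promise((resolve, reject) => {
        const req = db.transaction('search-index')
          .objectStore('search-index')
          .get('main');
        req.onsuccess = () => resolve(req.result);
        req.onerror = () => reject(req.error);
      });
    }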
Any examples besides LinkedIn? Tell me which sites you're trying to target and I'll have a look to see what can be done with them. It takes some pretty evil JavaScript obfuscation to block me, and only one site has been able to do that. I doubt the sites you're hitting are anywhere near that evil, lol. I would appreciate it if you have a good example that I could use in a future article.
IIRC I ended up building an iframe-based scraper for sites that didn't yield any content with just a fetch, and I think I built a fallback mechanism so that if fetch didn't work, the page would get queued up in the iframe scraper. The problem with that is that various heavily used security headers (X-Frame-Options, CSP frame-ancestors) prohibit loading a site in an iframe. (And the reason for an iframe rather than loading the page in a tab and injecting my extension's script is that I wanted it to run "in the background" without being super distracting for the user - the tab changing favicon every second or two was pretty annoying.)
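A rough sketch of that fallback idea (not the actual extension code; queueInIframe stands in for whatever the iframe scraper exposes):

    // Try a plain fetch first; if it fails or comes back empty,
    // hand the URL to the iframe-based scraper instead.
    async function scrape(url, queueInIframe) {
      try {
        const res = await fetch(url, { credentials: 'include' });
        const html = await res.text();
        if (res.ok && html.trim().length > 0) {
          return new DOMParser().parseFromString(html, 'text/html');
        }
      } catch (e) {
        // network error, CORS, etc. - fall through to the iframe path
      }
      // Sites sending X-Frame-Options or CSP frame-ancestors will refuse
      // to load in the iframe as well, so this isn't a universal fix.
      return queueInIframe(url);
    }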
This is my project for extracting my (your) webshop order & item data: https://gitlab.com/Kagee/webshop-order-scraper
I went the browser extension route and used Greasemonkey to inject custom JavaScript. I patched window.fetch, and because it was a React page it did most of the work for me, providing me with a slightly convoluted JSON doc every time I scrolled. Getting the data extracted was only a question of getting a Flask API with the correct CORS settings running.
Thanks for posting. Using a local proxy for even more control could be helpful in the future.
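In case it's useful to anyone, the fetch patch is only a few lines. A sketch of the idea (http://localhost:5000/collect is just whatever your Flask app happens to listen on):

    // Injected via Greasemonkey/Tampermonkey: wrap window.fetch so every JSON
    // response the page loads gets forwarded to a local collector.
    const origFetch = window.fetch.bind(window);
    window.fetch = async (...args) => {
      const resp = await origFetch(...args);
      const clone = resp.clone();
      if ((clone.headers.get('content-type') || '').includes('application/json')) {
        clone.json()
          .then((data) => origFetch('http://localhost:5000/collect', {
            method: 'POST',
            headers: { 'Content-Type': 'application/json' },
            body: JSON.stringify({ url: String(args[0]), data }),
          }))
          .catch(() => {}); // never break the page's own request
      }
      return resp;
    };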
Seems like an omission in the spec.
But Firefox extensions expose an API to inspect the response stream.
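That's webRequest.filterResponseData, which is Firefox-only. A minimal background-script sketch (needs the webRequest, webRequestBlocking and host permissions):

    browser.webRequest.onBeforeRequest.addListener((details) => {
      const filter = browser.webRequest.filterResponseData(details.requestId);
      const decoder = new TextDecoder('utf-8');
      let body = '';
      filter.ondata = (event) => {
        body += decoder.decode(event.data, { stream: true });
        filter.write(event.data); // pass the bytes through unchanged
      };
      filter.onstop = () => {
        filter.close();
        console.log(details.url, body.slice(0, 200)); // inspect the response here
      };
    }, { urls: ['<all_urls>'], types: ['main_frame', 'xmlhttprequest'] }, ['blocking']);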
> One of the issues is what is called CORS (Cross-Origin Resource Sharing) which is a set of protocols which may forbid or allow access to a web resource by Javascript. There are two possible workarounds: a browser extension or a proxy server. The first choice is fairly limited since some security restrictions still apply.
I'm doing this for a browser extension that crawls a website from page to page checking for SEO/speed/security problems (https://www.checkbot.io/). It's been flexible enough, and it's nice not to have to maintain and scale servers for the web crawling. https://browserflow.app/ is another extension I know of that I think does scraping within the browser, plus other automation.
It is worrying what this means for the future of web crawlers in general, though, if most sites end up being gated against all bots that aren't from major search engines.
My approach is a step or two more automated (optionally using a userscript and a backend) and runs in the console on the site under automation rather than cross-origin, as shown in OP.
In addition to being simple for one-off scripts and avoiding the learning curve of Selenium, Playwright, or Puppeteer, scraping in-browser avoids a good deal of potential bot-detection issues, and it is useful for constantly polling a site to wait for something to happen (for example, a specific message or article to appear).
You can still use a backend and write to file, trigger an email or SMS, etc. Just have your userscript make requests to a server you're running.
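A tiny sketch of that pattern - poll for something on the page, then ping a server you control (the selector, regex, and endpoint are all placeholders):

    // Userscript body: check every 30s whether the thing we're waiting for
    // has appeared, then notify a local backend and stop polling.
    const timer = setInterval(() => {
      const hit = [...document.querySelectorAll('.titleline a')]
        .find((el) => /price drop|restock/i.test(el.textContent));
      if (!hit) return;
      clearInterval(timer);
      fetch('http://localhost:8080/notify', { // your own server (CORS enabled)
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify({ text: hit.textContent.trim(), at: Date.now() }),
      });
    }, 30000);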
I've posted here about scraping HN, for example, with JavaScript. It's certainly not a new idea.
> Why do you need a proxy or to worry about CORS?
Not sure about OP, but you might want to point to a proxy depending on the site/content you are scraping and your location. For example, if you are in Canada but you want to scrape prices in USD, you might need to use a proxy located in the US to get US prices.
> Why not just point your browser to rumble.com and start from there?
Some endpoints use simple web application firewall rules that will block IPs. In this case, a rotating proxy can help evade the blocks (and prevent your legitimate traffic from being blocked). Some domains use more sophisticated WAFs like Imperva and will do browser fingerprinting, so you'll need even more advanced techniques to scrape successfully.
Source: I work at a startup that does a lot of scraping, and these are issues we've run into. Our entire office network is blocked from some sites due to some early testing without a proxy.
I kind of don't want to use DOMParser because it's browser-only... my web scrapers have to evolve every few years as the underlying web pages change, so I really want CI tests, and it's easiest to have something that works in Node.
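For what it's worth, jsdom gives you roughly the same DOM API in Node, so the same extraction code can run under CI. A quick sketch:

    // npm install jsdom
    const { JSDOM } = require('jsdom');

    function extractTitles(html) {
      const doc = new JSDOM(html).window.document;
      return [...doc.querySelectorAll('h2 a')].map((a) => a.textContent.trim());
    }

    // In a test, feed it a saved HTML fixture:
    // assert.deepEqual(extractTitles(fixtureHtml), ['First post']);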
As for fingerprinting, you can just use a different computer. Most people probably have a bunch of old computers lying around, right? If not, computers are cheap.
https://github.com/acheong08/ChatGPT-API-agent
Worked pretty well, but browsers took up too much memory per tab, so automating thousands of accounts (what I wanted) was infeasible.
My guess would be that some companies are doing it (I worked at a major tech company that is/was), just not publicizing this fact as crawling/scraping is such a gray legal area.
Um... [0]
If you can elaborate, I would very much appreciate it. I'm always interested in doing better.
Why use Puppeteer etc. when you don't have to? What is the argument for using these additional tools versus not using them?
Of course it's always up to the site owner, but most people want people to read what they share.
very funny, both jokes
You could also try zooming in. My apps don't expand to full width because of the video box but you can zoom.
With this setup, many sites work, but a few... a few have a top ad banner, a side banner and a 'cookie acceptance' footer... then add in a 'subscribe to our email' prompt and a Google login prompt... (Game wikis... I game in smaller windows too -- what good is a multitasking computer if you don't use it?)
And Beautiful Soup should be BeautifulSoup. Who makes the rules?
Margins would also be nice on the left and right.
Beautiful Soup is two words. Just look at their website.
I dislike black-on-white and don't understand gray-on-black which seems to be popular now due to gamma settings being cranked up to 11 or something. I try to use some color as an in-between but that may take some time to "perfect".
Since browsers allow users to configure a default font and background color, one possible "happy in-between" would be to set no background color and no font color, thereby allowing each user agent (i.e., browser) to display the site with that user's default background and font colors.
In that case, each viewer should get their preferred colors, all without you doing anything.
If you don't need cross-browser and Chrome is all you need, then something like a simple Chrome extension and/or Chrome DevTools Protocol cuts out a lot of middle-man baggage and at least you will be wrangling the browser behavior directly, without any extra idiosyncrasies of middle layers.
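A minimal sketch of driving Chrome over the DevTools Protocol directly, assuming Chrome was started with --remote-debugging-port=9222 and the chrome-remote-interface package:

    // npm install chrome-remote-interface
    const CDP = require('chrome-remote-interface');

    (async () => {
      const client = await CDP(); // connects to localhost:9222 by default
      const { Page, Runtime } = client;
      try {
        await Page.enable();
        await Page.navigate({ url: 'https://example.com' });
        await Page.loadEventFired();
        const { result } = await Runtime.evaluate({
          expression: 'document.title',
          returnByValue: true,
        });
        console.log('title:', result.value);
      } finally {
        await client.close();
      }
    })();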
Maybe somebody will make a web browser with all of the security locks disabled. Sort of like the Soviet submarine captain in "The Hunt for Red October" who disabled his torpedoes' safety features in order to more effectively target the American sub, but then got blown up by his own torpedo.