Ask HN: Best web scraping toolset for 2024?

robk · on Feb 21, 2024

Three real paths I'd evaluate

0) Tools: Puppeteer and Playwright are the cleanest way I've found to get proper JS page rendering and behavior control. Node is well suited for this, but some prefer Python. Since you sound concerned about blocks, I'm guessing plain HTML scrapers like BeautifulSoup or Cheerio would be insufficient, though of course they hold up better at sheer volume and carry far less overhead.

1) If you have money, fingerprint-avoiding systems like GoLogin are exceptionally good at avoiding detection. They're not cheap, so you would need a reasonable budget to use them well. I've had extremely high success with GoLogin myself, and if budget weren't a concern I'd just default to that.

2) Less expensively, you can use headless Chrome (Browserless.io has an excellent Docker image for this) and route it through 4G/5G proxies. You usually pay by the KB, so you'd want to be savvy about blocking images etc. to manage costs, but with a good proxy and decent tuning this also takes you pretty far, except against some of the more onerous services like Cloudflare and DataDome. There are also very good captcha solving services that integrate seamlessly. I've had very good results with simple proxying this way and well-crafted settings in Browserless, like making sure stealth plug-ins are used and the user agent is set properly.

3) Least expensively, you can simply use Playwright on Browserless (headless Chrome) with stealth plug-ins and a captcha service, running from a decent quality IP. I'm careful to check things like viewport size, user agent, ghost cursor etc.

Data center IPs usually raise flags, and even the server should be selected so that it has no VPN or HTTP ports open, as those are also antibot signals, among others.
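To make the "block images to manage proxy costs" and "check viewport and user agent" points concrete, here is a rough sketch using Playwright's Python bindings against a Browserless container. The endpoint `ws://localhost:3000`, the viewport numbers, and the list of blocked resource types are all assumptions for illustration, not settings from my setup. Proxy and stealth configuration are left out because they depend on your provider.

```python
# Sketch only: the endpoint, viewport, and blocked types are placeholder assumptions.
BLOCKED_TYPES = {"image", "media", "font"}


def should_block(resource_type: str) -> bool:
    """Return True for resource types that cost bandwidth but rarely carry data."""
    return resource_type in BLOCKED_TYPES


def fetch_html(url, ws_endpoint="ws://localhost:3000", user_agent=None):
    """Load a page through a remote headless Chrome and return its rendered HTML."""
    # Imported lazily so should_block() works without Playwright installed.
    from playwright.sync_api import sync_playwright

    def handle(route):
        # Abort heavy requests before they use any proxy bandwidth.
        if should_block(route.request.resource_type):
            route.abort()
        else:
            route.continue_()

    with sync_playwright() as p:
        browser = p.chromium.connect_over_cdp(ws_endpoint)
        context = browser.new_context(
            user_agent=user_agent,  # None keeps the browser default
            viewport={"width": 1366, "height": 768},
        )
        page = context.new_page()
        page.route("**/*", handle)
        page.goto(url, wait_until="domcontentloaded")
        html = page.content()
        browser.close()
        return html
```

Blocking fonts and media as well as images is a judgment call; some sites break or look suspicious without them, so it is worth testing per site.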

--

I do #3 at a fairly decent volume (<1m page views per day) across a few dozen machines (each with a half dozen IP addresses and separate VLANs for each container) to scrape from some private sites and so long as I stay under a sane rate limit I've had years of success on many sites that are fairly strict about blocking browsers.
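For what "staying under a sane rate limit" can look like in code, here is a minimal per-host limiter. It is only a sketch: the class name and the idea of a fixed minimum interval per host are my illustration here, and the right interval is something you would tune for each site.

```python
import time


class HostRateLimiter:
    """Enforce a minimum delay between requests to the same host."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = {}  # host -> earliest time the next request may start

    def wait(self, host):
        """Block until `host` may be hit again; return the seconds waited."""
        now = self._clock()
        ready_at = self._next_allowed.get(host, now)
        delay = max(0.0, ready_at - now)
        if delay:
            self._sleep(delay)
        # Schedule the following request one interval after this one starts.
        self._next_allowed[host] = max(now, ready_at) + self.min_interval
        return delay
```

Calling `limiter.wait("example.com")` before each request is enough for a single worker. Across several machines you would need shared state, which this does not attempt.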

In all cases I parse the key parts of the page and dump them into Mongo for async processing of the data and to allow fixes when sites change. You need to keep an eye on your ETL pipeline and alert when something breaks. I expect that about once a quarter I have to fix a selector change or something equally trivial as sites change.
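A rough sketch of that store-then-alert idea follows. The field names (`title`, `price`), the 20% threshold, and the `scraper`/`pages` database and collection names are hypothetical. Keeping the raw HTML next to the parsed fields is an assumption about how to "allow fixes when sites change", since it lets you re-parse old pages after fixing a selector.

```python
from datetime import datetime, timezone

REQUIRED_FIELDS = ("title", "price")  # hypothetical fields for illustration


def build_record(url, html, parsed):
    """Bundle raw HTML with parsed fields so a broken parse can be redone later."""
    missing = [f for f in REQUIRED_FIELDS if not parsed.get(f)]
    return {
        "url": url,
        "fetched_at": datetime.now(timezone.utc),
        "raw_html": html,
        "parsed": parsed,
        "missing": missing,
    }


def should_alert(records, max_missing_ratio=0.2):
    """True when too many records lack required fields (likely a selector change)."""
    if not records:
        return False
    broken = sum(1 for r in records if r["missing"])
    return broken / len(records) > max_missing_ratio


def save(records, mongo_uri="mongodb://localhost:27017"):
    """Write records to MongoDB; the URI and names here are placeholders."""
    from pymongo import MongoClient  # lazy import; needs pymongo installed

    if records:
        MongoClient(mongo_uri)["scraper"]["pages"].insert_many(records)
```

Running `should_alert` over the last batch for each site is a cheap way to notice a selector break within one crawl cycle rather than weeks later.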

This is also a good substack evaluating the various paid options for the toughest sites. https://substack.thewebscraping.club

dhruvkar · on Feb 21, 2024

Thanks for the info -- this is pretty useful. Typically I've had success sniffing Android app interactions if the website seems too hard to scrape.

This gives me another avenue to explore!

tomcam · on Feb 23, 2024

I’m not criticizing you. I’m not telling you I know better. Quite the opposite, I want to learn: why use Mongo in this case over Postgres? At that magnitude it seems like it would be a more robust database.

redwood · on Feb 23, 2024

Postgres has neither built in high availability auto-failover, nor built-in horizontal scalability -- not quite sure what you mean?

tomcam · on Feb 23, 2024

That's the answer I didn't know I was looking for. Thanks

naiv · on Feb 21, 2024

Thank you for the blog link, it has some very well written in depth content.

dhruvkar · on Feb 20, 2024

Usually JS based websites operate with some internal API under the hood.

Inspect those, and then directly hit those with something like Python or Go.

I prefer Python (requests, lxml, BeautifulSoup).

Mimic the headers that your browser is using.
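Something like the sketch below, using only the standard library so it runs anywhere (with requests the idea is the same). The header values and the example.com URLs are placeholders; copy the real ones from the request your browser makes in the network tab.

```python
import json
import urllib.request

# Placeholder values: copy the real headers from your browser's network tab.
BROWSER_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/121.0.0.0 Safari/537.36"
    ),
    "Accept": "application/json, text/plain, */*",
    "Accept-Language": "en-US,en;q=0.9",
    "Referer": "https://example.com/",
}


def build_api_request(url, extra_headers=None):
    """Build a GET request that carries browser-like headers."""
    headers = {**BROWSER_HEADERS, **(extra_headers or {})}
    return urllib.request.Request(url, headers=headers, method="GET")


def fetch_json(url, timeout=10):
    """Call the internal API endpoint and decode its JSON body."""
    with urllib.request.urlopen(build_api_request(url), timeout=timeout) as resp:
        return json.load(resp)
```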

Have you tried this already and run into issues?

mortallywounded · on Feb 20, 2024

I've done this sort of thing and it does work well. The caveat is you usually need a cookie/token after logging in, so you still need to fake that part as well as manage the session state.

It does work well, though some sites certainly detect the behavior after a while.
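For the session state part, a minimal sketch of what I mean: cache the token, refresh it shortly before it expires, and throw it away when the site rejects it. The `login` callable, the bearer-style `Authorization` header, and the 30 second margin are all assumptions; plenty of sites use cookies or custom headers instead.

```python
import time


class TokenSession:
    """Cache a login token and refresh it shortly before it expires."""

    def __init__(self, login, refresh_margin=30.0, clock=time.time):
        self._login = login  # callable returning (token, lifetime_seconds)
        self._margin = refresh_margin
        self._clock = clock
        self._token = None
        self._expires_at = 0.0

    def token(self):
        """Return a valid token, logging in again if it is missing or near expiry."""
        if self._token is None or self._clock() >= self._expires_at - self._margin:
            self._token, lifetime = self._login()
            self._expires_at = self._clock() + lifetime
        return self._token

    def auth_headers(self):
        """Headers to merge into each API request (bearer scheme assumed)."""
        return {"Authorization": f"Bearer {self.token()}"}

    def invalidate(self):
        """Call after a 401/403 so the next request logs in again."""
        self._token = None
```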

dhruvkar · on Feb 21, 2024

Right..

I'm in an antiquated space (logistics and wholesale), so most websites aren't hard to scrape, and even the auth is easy to mimic.

The other avenue I've had success with is sniffing Android app traffic to reverse engineer the API calls.

nicbou · on Feb 22, 2024

I have used Playwright to write tests for my website. It was so easy and the code is so readable. I can wholeheartedly recommend it if you need to control a browser from your code. I'm now using it to write a crawler for a dozen websites, because anything else would be too tedious.

tuktuktuk · on Feb 21, 2024

I think Puppeteer is the way to go!