Web Scraping news recap - August 2023
A special post to thank all of you who decided to subscribe to The Web Scraping Club.
This month the news recap will be different, and I hope this won’t bother you too much, but the biggest news in August was that The Web Scraping Club turned 1 year old!
So I’ll use this post to share with you the past, present, and future of The Web Scraping Club, ask for some feedback, and reward those who’ll contribute to sharing this newsletter.
Why was The Web Scraping Club born?
I’ve been working in web scraping for 10 years and I’m currently at Databoutique.com, a marketplace for web-scraped data. The industry and its techniques have changed heavily during this period, but one thing that hasn’t changed is the lack of valuable content, all in the same place.
There’s plenty of marketing material for web scraping services, with some hidden gems around, but most of it lacks practical solutions. There are also some hidden blogs written by great hackers, but they are hard to find and hard to keep following.
But since scraping hundreds of websites is my main job, I needed (and still need) access to these practical notions in a simple way, with the option to go deeper through linked sources when needed. That’s why over the years I took notes on web scraping and then, last year, I started sharing them and writing new ones directly online, since I feel that many professionals could benefit from them.
The content you find here now and in the future can be divided into two areas:
Educational, with posts where I code something, write tutorials, and interview people from different areas who share their knowledge with us.
Solutions, with posts where I try new techniques or tools to get sh*t done.
In my opinion, a good web scraping professional should understand what’s happening (for example, the main reasons why they’re getting blocked), but also know the right tools and techniques to tackle these issues in a proper and timely manner.
And I want to highlight the word timely for a reason: if you have customers expecting you to deliver web data and your scraper is getting blocked by a website’s anti-bot, most of the time you don’t have weeks or months to spend reverse engineering the whole anti-bot solution. You need to pick one of the best tools you have on your belt, and it’s your duty to stay constantly up to date on any new ones that become available.
For the same reasons, you won’t find on these pages posts like “My new antibot-X reverse engineering”, since I don’t have the time, and probably not the skills, to write them. And for sure, once written, such a post would be useless within a few weeks because of anti-bot updates.
But what you can expect from this blog is to be constantly updated on the most effective tools and techniques, both free and commercial, to bypass anti-bots. The goal is to give you (and myself) the largest toolbelt available and to make all of us more informed web scraping professionals.
The community around The Web Scraping Club
First of all, thank you for being so many. I could not imagine that, after only one year, I’d be speaking weekly to more than 1600 subscribers.
But I think there is still a huge number of people who could be interested in these topics but have never heard of The Web Scraping Club, because of the poor SEO of the publication.
So, if you know someone interested in these topics who’s not subscribed yet, please share The Web Scraping Club newsletter with them.
As a thank you to the people who contribute to the newsletter’s growth, I’ve revised the referral program. If you bring in 3 new free subscribers, you’ll get one month of paid content. But if you bring in 10, you’ll get a one-year paid subscription and four 1-on-1 web calls of half an hour each to use during the year.
The more of us there are, the more accumulated experience there is in here, and the more we learn from each other.
For this reason, I strongly encourage you to join our communities and share your toolbelt with others:
My profile on Twitter and the newly created The Web Scraping Community, also on Twitter.
As someone who struggles to keep his own Instagram feed alive, I welcome any help in managing and moderating these communities.
Paid subscriptions and sponsors
As a co-founder of a startup, I have a day job that, most of the time, goes well beyond the classic 9-to-5. While it provides me with much of the material I use in the articles, some work is still needed to put it in a readable form and write a story on top of it. I really love writing this Substack, but I’m still spending the most valuable currency we have, our time. This needs to be rewarded somehow.
While there will always be valuable content for free, I have mixed feelings about how to monetize the work I put into this newsletter, given that it’s the first time I’ve faced these questions as a “creator”.
Basing the whole reward system on sponsors and keeping all the content free for everyone would be the easiest choice and probably also the most profitable one. But if the whole newsletter is sponsored, I’m afraid that its content would be perceived as no longer impartial. None of the past sponsors I had on TWSC forced me to change a single line of a post, but I understand that this objection has its merits. It’s difficult to criticize someone who’s contributing money to the blog.
On the other hand, every post I publish should add some value for every reader, free or paid, so it’s difficult to decide where inside an article to place a paywall.
The summer holidays made me think about a hybrid model, and we’ll see how it pays off over the next 12 months.
The weekend articles will always be free, like the Hands On series, where we test commercial tools.
The Lab articles, which cover more advanced techniques with code, will still be paywalled, though I’ll try to give some value to free readers too. The code repository will be accessible only to paid readers.
I’m transitioning from traditional sponsorship with banners on top of the articles to collaborations with companies, trying to add value to the blog. It’s more complicated than placing some advertising here and there but could be a win-win-win situation for the readers, the club, and the companies.
And don’t forget that if you want to support The Web Scraping Club while getting some discounts on the most famous services around, there’s a page dedicated to all the discount codes. If you use them by clicking on the links on that page, The Web Scraping Club will also get a small fee.
Future content
Here’s the part of the article where I need your feedback the most. As you have noticed, we have several recurring formats here, and I’d like to know which one you like the most and which one the least.
I have my own feelings about the results, but I won’t write them here, to avoid influencing the polls.
As for the content, I’ll try some new things in the coming months, from longer-form content to, maybe, different mediums.
As mentioned before, Substack has really poor SEO, so I’m also replicating the content on a WordPress blog, thewebscraping.club. When it is 100% ready, maybe there will be some perks and different content over there, but at the moment I don’t have anything in mind for that.
I think I’ve talked enough about myself and The Web Scraping Club for today. Please answer the polls, because I really appreciate your feedback. If you want to leave more detailed feedback, please write in the comment section below or write to me privately at pier@thewebscraping.club.