This post is sponsored by Bright Data, a provider of award-winning proxy networks, powerful web scrapers, and ready-to-use datasets for download. Sponsorships help keep The Web Scraping Club free, and they are a way to give some value back to readers.
In this case, all The Web Scraping Club readers who use this link get an automatic free top-up: a free $50 credit when you deposit $50 in your Bright Data account.
Welcome to our monthly interview. This time it’s the turn of Aviv Besinsky, the Director of Proxy Products at Bright Data.
Hi Aviv, thanks for joining us at The Web Scraping Club. I’m really happy to have you here.
First of all, tell us a bit about yourself and your company, Bright Data. The proxy industry is quite crowded, and it seems to me that it’s filling up with even more players, each with some specific characteristics. How does Bright Data differentiate itself from the other players?
Firstly, we offer solutions for anyone who needs publicly available web data, regardless of their technical abilities. Anyone, no matter their experience, technical ability, or source of data, can source their public data using our products.
Secondly, our automated solutions are generic and built to handle and support any target site. Unlike other solutions on the market that support specific target sites or verticals, our customers know they can count on us to support any public data source, always.
Thirdly, as a market leader, we take our responsibility seriously and constantly work to put in place industry-first standards that enforce ethical usage on multiple levels. We ensure full transparency for all parties involved. We have a dedicated compliance team that onboards new customers, verifies each respective use case, and closely monitors different aspects of their usage. We also monitor the health of target sites and technically ensure that we never overload or affect their site health.
These aspects, combined with our unique work culture, allow us to always be at the forefront of our industry. Typically, we see a 2-3 year lag from the time we add a new ability or product until our competitors catch up. Ultimately, our commitment to providing quality-first, compliant, transparent, and inclusive public data solutions sets us apart from the rest.
Bright Data also operates the Bright Initiative, a global program that focuses on providing over 700 NGOs, NPOs, academic institutions and public bodies with pro-bono access to leading data technology and expertise to drive positive change on a global scale.
One of your best-known products is the Web Unlocker, which I’ve tested in this article. Can you go into more detail on how it works?
The Web Unlocker is a powerful tool designed to simplify the process of gaining full access to public data. Its main goal is to automate all necessary actions required to bypass any obstacles that may be preventing access to this data. As anyone who deals with collecting large amounts of public data knows, it's common to encounter situations where the data is blocked, and this is where the Web Unlocker comes in handy.
The tool is capable of using the right proxy network, handling fingerprints and retries, captcha solving, and much more. It leverages various machine learning practices to dynamically and automatically use its set of tools.
The Web Unlocker provides its customers with peace of mind by ensuring that they always have access to the publicly available web data they need, regardless of any changes made to their target sites. The Web Unlocker is a powerful and reliable solution for anyone who needs unrestricted access to public data.
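To make the "right proxy network plus retries" idea concrete, here is a minimal sketch of retrying a blocked request while escalating through proxy tiers. It is an illustration under stated assumptions, not how the Web Unlocker is actually implemented: the tier names, the status codes treated as blocks, and the `fetch` callable are all hypothetical.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical proxy tiers, ordered from cheapest to most capable.
PROXY_TIERS = ("datacenter", "residential", "mobile")

# Assumption: these HTTP statuses are treated as "blocked, worth retrying".
BLOCK_STATUSES = {403, 429}

# A fetcher takes (url, tier) and returns (status_code, body).
Fetcher = Callable[[str, str], Tuple[int, str]]


def fetch_with_escalation(
    url: str,
    fetch: Fetcher,
    tiers: Sequence[str] = PROXY_TIERS,
    retries_per_tier: int = 2,
) -> Tuple[str, str]:
    """Try each tier in order, retrying blocked responses on the same tier
    before escalating. Returns (tier_used, body) on the first 200 response."""
    for tier in tiers:
        for _ in range(retries_per_tier):
            status, body = fetch(url, tier)
            if status == 200:
                return tier, body
            if status not in BLOCK_STATUSES:
                # Not a block (for example a 404): retrying will not help.
                raise RuntimeError(f"unexpected status {status} for {url}")
    raise RuntimeError(f"all proxy tiers were blocked for {url}")
```

Passing the fetcher in as a parameter keeps the escalation logic independent of any HTTP client or proxy provider, so it can be exercised with a fake fetcher before wiring in real network calls.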
What are the main challenges of building such a high-performance proxy infrastructure?
Building a high-performance proxy infrastructure comes with a unique set of challenges. Here are the two main challenges that we continually work to address:
First, to ensure that our customers receive the best possible experience, we must stay up to date with the latest industry developments and provide new features and products that deliver value. This requires us to keep our finger on the pulse of the industry, continually learn about new developments, and implement them in our products and services.
Second, the public web data landscape is constantly changing, and collecting public data presents new challenges every day. Our job is to ensure that we continually improve and adapt our different products and infrastructure to keep pace with these changes. This ensures that our customers can always rely on our products to enable and deliver the data they need, regardless of any new challenges that may arise.
You recently launched the Scraping Browser, a browser that we can integrate with Playwright or Puppeteer and that includes various anti-bot techniques. Can you go deeper on this one too, please?
The Scraping Browser is a recently launched product that offers a full solution for customers who need to collect public data that requires interacting with a web page using a browser or needs JavaScript rendering. Traditionally, customers would need to build or use a third-party tool in order to connect with a proxy network to ensure that the browser is using the IP address of a real person. However, for some sites, this isn't enough, and customers would get blocked, requiring them to build their own unblocking solution.
The Scraping Browser solves this problem by offering a complete solution that includes a browser, a proxy, and unblocking capabilities. It is driven through Puppeteer (Node.js) or Playwright (Python), which provide a high-level API to control Chromium or Chrome over CDP (the Chrome DevTools Protocol). Customers can use these commands with their crawler or scraping script like a template. With a few lines of code, customers can connect to the Scraping Browser API from Puppeteer or Playwright, and the product takes care of running those actions in a browser that is already connected to the proxy and unblocking layers. The Scraping Browser also manages scale, ensuring that many requests are handled in parallel.
The benefits of using the Scraping Browser include saving resources due to built-in unblocking abilities and easy integration, as it supports common libraries like Puppeteer or Playwright. It works at any scale, both horizontally (unlimited parallel sessions) and vertically (any load or session length). Customers no longer need to set up and maintain the browser layer, connect the browsers to one or multiple proxy providers, or build and maintain unlocking abilities. Doing all of that in-house requires an investment of R&D time and resources to build, maintain, and stay up to date. With the Scraping Browser, customers only need to write the logic for how to control the browsers while all the rest is handled by the product.
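As a rough illustration of the "few lines of code" mentioned above, here is a minimal sketch of attaching Playwright for Python to a remote browser over CDP. The host name, the port, and the credential format are placeholders I made up for the example, not Bright Data's actual endpoint; the real connection string should come from the product's dashboard or documentation. Only `connect_over_cdp`, `new_page`, `goto`, and `title` are standard Playwright calls.

```python
from urllib.parse import quote


def build_cdp_endpoint(username: str, password: str, host: str, port: int = 9222) -> str:
    """Build a wss:// CDP endpoint URL with URL-encoded credentials.
    The host, port, and credential layout are hypothetical placeholders."""
    user = quote(username, safe="")
    secret = quote(password, safe="")
    return f"wss://{user}:{secret}@{host}:{port}"


def fetch_title(endpoint: str, url: str) -> str:
    """Attach to an already-running remote browser over CDP and return the page title."""
    # Imported lazily so the helper above works even where Playwright is not installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        # Attach to the remote browser instead of launching a local one.
        browser = p.chromium.connect_over_cdp(endpoint)
        try:
            page = browser.new_page()
            page.goto(url)
            return page.title()
        finally:
            browser.close()
```

A call would look like `fetch_title(build_cdp_endpoint("USER", "PASS", "remote-browser.example.com"), "https://example.com")`. The scraping logic stays ordinary Playwright code; only the connection step changes, because the browser, proxy, and unblocking all live on the remote side.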
Do you think that AI will become more helpful in such products? Do you think we’ll soon have a product that, given the desired output data fields, can handle anti-bot measures, pagination, and the actual scraping of a website?
I think AI is already very helpful, and we will keep increasing our use of AI solutions to improve the performance of our products.
The ability to get any public data needed just from a free-text prompt, including unlocking the data, scraping it, and parsing it, is already here, and some features of this nature have already been added to some of our products. I expect this trend to continue, with more and more abilities added.
The end goal is to give access to any needed data at minimal effort, and the growing AI abilities fully support that.
How do you get the IPs you provide in the proxy infrastructure? Do you follow any ethical sourcing standards?
Bright Data sources its peers (Residential and Mobile IPs) through the Bright SDK, which is integrated into applications as a form of app monetization. The app users are presented with an option to opt in to the Bright Data Network and become a peer by sharing their device IP in exchange for an ad-free or free application. Bright Data ensures that all peers sharing resources with the Residential network have personally opted in and can opt out at any time. The company maintains a compliant system based on deterrence, prevention, and enforcement to ensure that the network remains safe and ethical. The infrastructure ensures that the traffic is only routed under strict conditions in a manner that will not affect the device's operation.
Our usual last question: is there any fun fact related to the early days of Bright Data, or to your own?
A fun fact related to the early days of Bright Data is that despite having a $0 marketing budget, the company was still able to gain traction and grow through the sheer quality of its products. This not only serves as a valuable learning experience but also highlights the importance of creating a great product that meets the needs of the customers. Nowadays, Bright Data has an amazing marketing department that helps promote the company's offerings, but the early days are a testament to the power of word-of-mouth and creating a quality product that people want to use.
Thanks, Aviv, for sharing your ideas with us. If anyone wants to know more about the Web Unlocker, you can read the full review in this post.