Interview #6: Aleksandras Šulženko - Oxylabs
On proxies, AI, and web scraping used for a better world
This post is sponsored by Oxylabs, your premium proxy provider. Sponsorships help keep The Web Scraping Club free, and they’re a way to give some value back to the readers.
In this case, all The Web Scraping Club readers can use the discount code WSC25 to save 25% on residential proxy purchases.
The Web Scraping Club is a free weekly newsletter about web scraping. Once every two weeks, I publish The Lab, paid content with deep dives into more technical aspects and solutions to common issues, with accompanying code in a GitHub repository. You can access these articles with a 7-day trial and then subscribe if you find them useful.
Welcome to our monthly interview. This time it’s the turn of Aleksandras Šulženko, the Scraper APIs Product Owner at Oxylabs.
Hi Aleksandras, thanks for joining us at The Web Scraping Club, I’m really happy to have you here.
First of all, tell us a bit about yourself and your company Oxylabs.
I’ll start with the company I work at. Oxylabs is a web intelligence collection solution and premium proxy service provider. We offer both out-of-the-box data acquisition solutions such as the Web Unblocker and Scraper APIs and proxies of various types. While we do focus on providing businesses with web intelligence, we do have quite a variety of use cases for our products.
I began my journey at Oxylabs as an account manager, overseeing the needs of many of our important partners. Throughout my experience in that role, I realized the importance of data to many of these advanced businesses and decided to shift my career toward managing solutions that would be able to fill those needs.
Now I am the Product Owner of our Scraper API solutions that allow companies of all sizes to extract publicly available data in real-time from nearly any publicly available source. These solutions are the backbone of our business, as they have enabled us to continue growing and become one of the most respected data acquisition solution providers in the industry.
What makes Oxylabs different from the other proxy providers?
Well, today Oxylabs is more than a proxy provider, which definitely sets it apart from the smaller competitors. Many of the larger organizations, however, have also moved towards providing data acquisition solutions or data itself. Growth toward such positioning can be done in a multitude of ways, and I think Oxylabs stands out in two areas.
We have committed ourselves to pursuing the path of high business standards and ethics. Oxylabs was the first provider to clearly outline its residential proxy acquisition practices to the industry at large, which allowed us to showcase that all IPs in our infrastructure are acquired ethically. Additionally, we have since become a founding member of the Ethical Web Data Collection Initiative (EWDCI), which is an association of the leading proxy and data acquisition solution providers, pushing for more regulation and the creation of industry-wide standards.
Moreover, Oxylabs heavily invests in R&D initiatives, namely the integration of artificial intelligence and machine learning into our solutions. We have created an advisory board composed of industry and academia experts with extensive experience in creating and managing artificial intelligence.
These efforts have led us to create solutions that use advanced technologies such as our Adaptive Parser and Web Unblocker. These are industry-first in their own right, allowing our customers to achieve better data acquisition results without requiring them to invest any additional resources.
I’ve hosted several interviews with people in the proxy industry and from what I understood, IP quality and ASN differentiation are key to the success of a proxy company. Do you have some other ingredients to add to this recipe?
These two criteria are definitely important in the current state of the proxy industry. I’d question, however, what previous interviewees meant by IP quality. There can be a whole host of definitions for it, so I’d like to unpack it a bit.
IP quality can be, on the one hand, understood as infrastructure reliability. It may be a less pressing issue for smaller providers, especially if your target market is SMBs or individual entrepreneurs. Generally, these businesses might be more resilient to short outages or inaccessibility. Corporations and enterprises, on the other hand, will lose enormous amounts of money even with short outages, so as the target market expands upwards, infrastructure reliability becomes ever more important.
On the other hand, IP quality can be understood as pool width, defined as the number of locations with proxies available, and depth, defined as the number of IPs in general. Global coverage is definitely important, up to a point. Most usage will come from the USA and major countries in Europe and Asia. Covering Antarctica, for example, might not be as useful for a business.
I believe that depth, for depth’s sake, isn’t as important as some may make it out to be. It should be directly correlated with the scale of the operations, as maintaining proxies is expensive, so having an arbitrary number might just be an added cost rather than a benefit.
Finally, I think we’re moving towards a future where proxies will become highly specialized. I remember when there were, basically, residential and datacenter proxies, most of which were dedicated. Now there are mobile proxies, ISP proxies, shared and dedicated proxies of all types, etc. So, they’re becoming more specialized, and I think having a wide variety on offer might be a recipe for success.
I’ve seen you’ve recently launched your Web Unblocker solution, which uses AI to bypass blockers. Can you please give us more details on how it works? Is it effective on every anti-bot solution?
Our Web Unblocker uses several novel technologies that enable it to gather data much more effectively. Some of these exist in our Scraper API solutions; Web Unblocker, however, uses all the advancements we’ve made in recent years to provide results.
There are three major innovations in Web Unblocker. First is a patented solution that uses machine learning models and a central proxy server to evaluate user requests. Their combination lets us ensure that for each user request, the best possible proxy is selected out of our IP pool. Decision-making is based on latency, potential success rate, geolocation, and numerous other factors.
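To make the idea of per-request proxy selection more concrete, here is a minimal sketch. It is not Oxylabs’ patented method: the interview describes machine learning models, while this example swaps in a hand-written weighted score over the factors mentioned (latency, expected success rate, geolocation). The `ProxyCandidate` type, the weights, and the sample pool are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProxyCandidate:
    """Hypothetical description of one proxy in a pool."""
    address: str
    country: str
    latency_ms: float
    recent_success_rate: float  # between 0.0 and 1.0


def score(candidate: ProxyCandidate, target_country: str) -> float:
    """Combine the factors into one number; the weights are arbitrary assumptions."""
    geo_bonus = 1.0 if candidate.country == target_country else 0.0
    # Cap the latency penalty so a very slow proxy cannot dominate the score.
    latency_penalty = min(candidate.latency_ms / 1000.0, 1.0)
    return 0.6 * candidate.recent_success_rate + 0.3 * geo_bonus - 0.1 * latency_penalty


def select_proxy(candidates: list[ProxyCandidate], target_country: str) -> ProxyCandidate:
    """Return the highest-scoring candidate for a request to target_country."""
    if not candidates:
        raise ValueError("no proxy candidates available")
    return max(candidates, key=lambda c: score(c, target_country))


# Illustrative pool using documentation-reserved IP addresses.
POOL = [
    ProxyCandidate("203.0.113.10", "US", 120.0, 0.95),
    ProxyCandidate("203.0.113.20", "DE", 80.0, 0.99),
    ProxyCandidate("203.0.113.30", "US", 900.0, 0.70),
]

best_for_us = select_proxy(POOL, "US")
```

In a real system the score would presumably come from a trained model rather than fixed weights, but the selection step (rank the candidates, take the best) would look similar.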
Second, dynamic fingerprinting automates the selection of request features such as HTTP headers and numerous others. These selections can greatly influence the lifetime and success rate of an IP address, so with our solution, customers do not have to worry about picking the correct combination each time.
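As a rough illustration of automating header selection, the sketch below picks a whole header profile at once, so that the individual headers stay consistent with each other. The profiles, the placeholder `ExampleBrowser` user agents, and the `pick_headers` helper are hypothetical and only show the general idea; the actual fingerprinting logic is not described in the interview.

```python
import random

# Hypothetical header profiles. The User-Agent values are placeholders,
# not real browser strings.
HEADER_PROFILES = [
    {
        "User-Agent": "ExampleBrowser/1.0 (Windows)",
        "Accept-Language": "en-US,en;q=0.9",
        "Accept": "text/html,application/xhtml+xml",
    },
    {
        "User-Agent": "ExampleBrowser/2.0 (Macintosh)",
        "Accept-Language": "de-DE,de;q=0.9,en;q=0.5",
        "Accept": "text/html,application/xhtml+xml",
    },
]


def pick_headers(rng=None):
    """Return a copy of one complete profile instead of mixing headers at random."""
    chooser = rng if rng is not None else random
    return dict(chooser.choice(HEADER_PROFILES))


# A seeded generator makes the choice reproducible for testing.
headers = pick_headers(random.Random(0))
```

Choosing a complete profile, rather than each header independently, avoids combinations that no real client would send.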
Finally, we have implemented a machine learning-based evaluation of responses. Since errors, improper responses, or lower-quality scrapes can still happen, the model evaluates whether the response acquired matches what was expected. If not, the request is repeated with different parameters.
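The evaluate-and-retry loop described here can be sketched as follows. This is an assumption-laden simplification: a few hand-written rules stand in for the machine learning model, and `looks_valid`, `fetch_with_retries`, and the fake fetcher are hypothetical names invented for the example.

```python
def looks_valid(body: str, expected_marker: str) -> bool:
    """Rule-based stand-in for an ML response evaluator (an assumption)."""
    if len(body) < 50:
        return False  # suspiciously short page
    lowered = body.lower()
    if "captcha" in lowered or "access denied" in lowered:
        return False  # typical block-page wording
    return expected_marker.lower() in lowered


def fetch_with_retries(fetch, parameter_sets, expected_marker):
    """Try each parameter set in turn until a response passes evaluation.

    `fetch` is any callable that takes a parameter dict and returns the
    response body as a string. Returns (body, attempts); body is None if
    every attempt failed.
    """
    for attempt, params in enumerate(parameter_sets, start=1):
        body = fetch(params)
        if looks_valid(body, expected_marker):
            return body, attempt
    return None, len(parameter_sets)


def fake_fetch(params):
    """Simulated fetcher: the first profile gets blocked, the second succeeds."""
    if params["profile"] == "a":
        return "<html><body>Access denied. Please solve the CAPTCHA to continue.</body></html>"
    return "<html><body><span class='price'>19.99 EUR</span> for the product</body></html>"


body, attempts = fetch_with_retries(fake_fetch, [{"profile": "a"}, {"profile": "b"}], "price")
```

Here the first response is rejected as a block page, so the request is repeated with the second parameter set, which passes the check.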
Do you think that AI will replace web scrapers in the future? I don’t get that impression. AI can solve the problem of finding values inside HTML code, but I can’t imagine it, and maybe I am wrong, devising the best strategy to tackle anti-bot software, which is the most challenging and time-consuming part. As in other industries, I think a hybrid future where AI facilitates human tasks is more likely.
I don’t see AI replacing web scrapers at all, as I think this is a bit of an apples-and-oranges comparison. What can be replaced by AI are several of the more repetitive parts of web data gathering, such as parsing. We have already partly done that by introducing our Adaptive Parser, which is integrated into several of our solutions.
Another part where I am hopeful for AI is the discovery and task generation part of scraping. There are uncertainties now when defining what data is needed. For example, a business or academic researcher might want to find out the price of a particular product across Europe and compare it to the same product in the Middle East. It’s quite a difficult task as you might have to figure out the potential data sources (e-commerce marketplaces), find the products of interest within them, fetch the source HTML content, and only then extract the pricing data from the HTML.
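The last step in that chain, extracting pricing data from fetched HTML, can be sketched with Python’s standard library alone. The sample pages, the `price` class name, and the helper names are assumptions for illustration; real marketplaces use their own markup, and the discovery, matching, and fetching steps are left out here.

```python
from html.parser import HTMLParser


class PriceParser(HTMLParser):
    """Collect the text of elements whose class attribute is exactly 'price'."""

    def __init__(self):
        super().__init__()
        self._in_price = False
        self.prices = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) tuples.
        if ("class", "price") in attrs:
            self._in_price = True

    def handle_data(self, data):
        if self._in_price and data.strip():
            self.prices.append(data.strip())

    def handle_endtag(self, tag):
        # Simplification: any closing tag ends the price element.
        self._in_price = False


def extract_prices(html: str) -> list[str]:
    parser = PriceParser()
    parser.feed(html)
    parser.close()
    return parser.prices


# Stand-ins for HTML that would already have been fetched from each region.
SAMPLE_PAGES = {
    "Europe": '<div><span class="price">19.99 EUR</span></div>',
    "Middle East": '<div><span class="price">85.00 AED</span></div>',
}


def compare_regions(pages: dict[str, str]) -> dict[str, list[str]]:
    """Map each region to the price strings found on its page."""
    return {region: extract_prices(html) for region, html in pages.items()}


prices_by_region = compare_regions(SAMPLE_PAGES)
```

Even this small example depends on knowing the page’s markup in advance, which is the kind of repetitive, site-specific work the answer suggests AI could take over.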
AI, as such, could be used to accept prompts in a fashion similar to what we’ve seen with ChatGPT. So, instead of the rather complicated task of collecting URLs, matching products, sending requests to particular pages, and extracting the data, users could type in a prompt and get what they need.
So, I think you’re correct that there will be a hybrid future where AI makes data acquisition easier for users. On the other hand, I think we’re at the start of machine learning wars, as I have alluded to in one of the articles I’ve written. There’s already a tug-of-war between scrapers and data sources, and AI is only going to accelerate the process.
I know some companies in the proxy industry use SDKs for mobile apps and reward developers or users of these apps to get the right to use the IP of the phone where the app is installed. Is this how you also gather mobile IPs?
Oxylabs has not maintained any SDKs for quite some time now as we now source IPs ethically from reputable and trusted partners such as Honeygain.
SDKs were a part of the proxy industry for the longest time; however, we saw many ethical flaws in the practice. Some providers would sneakily add these SDKs to free applications and disclose that the application would be using the user’s traffic for proxy purposes only in the Terms and Conditions and other legal documents. I’m sure you know how few people read all the legal documents from start to finish before installing an application.
We saw this as a major flaw in the approach, so we took great efforts to change it. Oxylabs has an extensive residential proxy acquisition handbook where we outline our practices. In short, we created our tier system to showcase how residential proxies should be acquired, with the best practice being that users are consenting, informed, and receive financial rewards for the usage of their traffic. Our residential proxy service is implemented according to the practices we’ve outlined, and we hope that other providers will follow in our footsteps.
On the Oxylabs website, there’s a huge learning section. I really appreciate the effort to help people who are approaching the rabbit hole of web scraping learn from useful content. There are podcasts, webinars, blog posts, and videos from your tech experts. How big is the team dedicated to all this content creation? It seems like a huge amount of work!
Our content production team is very thankful for your question. They have been working extremely hard to educate people around the world about web scraping and proxies, so they’re elated to be noticed.
It is quite an extensive collaborative effort. Many teams, such as Event Marketing, Product Marketing, Content Marketing, and even Public Relations add to our content library regularly. In total, there are more than 20 professionals working on videos and written content, so I do believe that’s one of the larger content production arms in the industry.
Our goal has always been to bring web scraping out of the shadows and allow more people to engage with the practice. I’m glad these efforts are recognized and help people around the world see the benefits of scraping.
I’ve read some time ago about an AI solution created by Oxylabs to detect content related to child trafficking. It’s a great way to use technology for good. Can you tell us more about this solution?
Yes, that project was such a major success that it became the catalyst for our Project 4beta, which is a pro bono initiative, offering our solutions for free to those who want to use scraping or proxies for the greater good.
A few years ago, before the initiative began, we noticed that the Lithuanian government was holding a contest under the banner of GovTechLab wherein they sought the private sector’s help in solving pressing issues. That year, the Communications Regulatory Authority of the Republic of Lithuania (CRA) sought help in detecting illegal content, namely sexual and child abuse imagery, across the Lithuanian IP address space.
We dedicated a small team of our professionals to developing an AI-driven solution that would scrape the IP space and detect such imagery. In just ten weeks, our team developed and deployed the solution, which sends out alerts about any potentially illegal content to CRA. Specialists then evaluate and, if needed, forward it to the relevant authorities.
During the first two months of use, the tool achieved these results:
19 websites were identified as violators of national or EU laws;
8 police reports were filed;
11 complaints to the Inspector of Journalist Ethics were registered;
2 pre-trial investigations started.
Before our solution, everything had to be done manually, and the CRA relied upon regular internet users to send them reports. Now, combining both regular reports and our automated solution, the CRA can find, investigate, and pursue justice better than before.
Our usual last question: is there any fun fact about the early days of Oxylabs or your own?
I think few people know that our data acquisition solutions actually began as a passion project of a single person who sought to help one of our customers with their issues. The customer in question had trouble harnessing the data they needed using our proxies, so our employee developed a basic web scraping tool that did the job correctly.
Soon enough, everyone, from developers to high-level management, noticed the potential of such an application and created an entire project that would bring the solution out of the basic stage into something that could be used by our customers. Now that single project has developed into several distinct Scraper APIs and the Web Unblocker.
The Lab - premium content with real-world cases
THE LAB #8: Using Bezier curves for human-like mouse movements
THE LAB #6: Changing Ciphers in Scrapy to avoid bans by TLS Fingerprinting
THE LAB #4: Scrapyd - how to manage and schedule a fleet of scrapers
THE LAB #2: scraping data from a website with Datadome and xsrf tokens