Interview #1: Neha Setia - Zyte
Let's talk about web scraping, open source, Zyte and Extract Summit
Welcome to the first of our series of interviews. We'll break the ice with Neha Setia (@nehasetianagpal), developer advocate at Zyte, where she conducts workshops and enablement sessions for system integrators and clients at events and conferences. She also spends time identifying AI and ML use cases for various industry domains by talking to system integrators, clients, and partners.
First of all, thank you, Neha, for taking the time to answer these questions.
Let’s start with a brief introduction for the few who don’t know Zyte. The company started in 2010 as a web scraping services company; how has it evolved since then?
N: Zyte was originally founded as Scrapinghub back in 2010. At the time there were no credible data extraction solutions available. Founders Shane Evans and Pablo Hoffman decided to build their own data extraction software to provide customers with a simple way to access open web data.
We put open web data and customers at the heart of what we do.
Our drive to go beyond the web scraping procedure itself, by continuously innovating and optimizing the process of data extraction, has enabled us to serve 2,000 companies and over 1 million developers worldwide. That is also how we came to have the largest team in the industry, with over 100 dedicated developers and extraction experts.
What services are you offering to your customers?
N: Our services are:
Scrapy - an open-source Python framework built specifically for web data extraction (a minimal spider sketch follows this list).
Smart Proxy Manager - enables scalable web scraping by routing your requests through a pool of IP addresses, introducing delays and discarding IPs wherever necessary to maximize the success rate.
Automatic Extraction - instant access to web data with our patented AI-powered automated extraction API, returning quality data quickly and in a structured format.
Scrapy Cloud - removes the need to set up and monitor servers and provides a nice UI to manage spiders and review scraped items, logs, and stats.
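For readers who have never used Scrapy, here is a minimal spider sketch showing what the framework's core loop looks like. It crawls the public practice site quotes.toscrape.com rather than any Zyte service, and the item fields are just an example:

```python
import scrapy


class QuotesSpider(scrapy.Spider):
    """Minimal example spider: crawls quotes.toscrape.com and yields items."""

    name = "quotes"
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        # Turn each quote block on the page into a structured item.
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
                "tags": quote.css("div.tags a.tag::text").getall(),
            }

        # Follow the pagination link, if there is one, and parse it the same way.
        next_page = response.css("li.next a::attr(href)").get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)
```

Running it with `scrapy runspider quotes_spider.py -O quotes.json` writes the yielded items to a JSON file.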
Something exciting is coming soon, which will make the entire data extraction process smoother for developers. Stay tuned for updates to be announced at Extract Summit.
What kind of data do your customers ask for more? E-commerce data? Social media?
N: The most popular is Product Data from eCommerce sites / Marketplaces. That said, we get a wide variety of requests.
For product data insights and analytics in particular, we receive requests for over 3,000,000,000 products per month. Customers use web-extracted product data for endless use cases like price intelligence, market analysis, competitor intelligence, vendor management, compliance, and many more.
Customers rely on us to get data from any e-commerce website by sourcing product data from marketplace websites. We provide custom solutions that deliver the exact data required.
Did you see any change after Covid-19 when some industries were forced to go online?
N: There’s no doubt Covid-19 had an impact on businesses of all industries and sizes, with the majority having to either strengthen their online presence or start investing in digital transformation services in order to keep their business afloat. Many companies felt the need to adapt and develop an e-commerce strategy.
As a result of this, we saw a substantial increase in the demand for eCommerce Data requests.
The web has changed a lot over these years and still does. What are the new trends and changes in data extraction you’re seeing? Do you see any patterns in anti-bot software?
N: When it comes to data extraction, anti-bot software and detection methods are directly related. The entire process of creating and detecting bots is like an endless game of cat and mouse. One side invents new strategies to gain an advantage, while the other side comes up with counter strategies.
In 2009 Google bought the reCAPTCHA research project; at that time many of these tools had very basic functions.
Earlier websites were not as dynamic as they are now; for starters, they were neither interactive nor JavaScript-enabled. Early captchas were simple, and now they have become much more complex too.
What makes it different today is digital fingerprinting. This is made up of browser fingerprinting and TCP/IP fingerprinting, plus signals like mouse movements, speakers, and plugins. All of that information makes up a digital fingerprint, which is why anti-bots have become more complex. Therefore, we need much more complex solutions to tackle them.
All in all, that is one of the biggest differences in how the web and anti-bot software have changed over the years.
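As a small illustration of why a static fingerprint is easy to spot, here is a sketch of a Scrapy downloader middleware that rotates the User-Agent header on every request. The User-Agent pool is a made-up example, and this only varies one header; it does nothing about TLS, TCP/IP, or browser-level signals:

```python
import random

# Illustrative pool; in practice you would maintain a much larger, up-to-date list.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/15.6 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:103.0) Gecko/20100101 Firefox/103.0",
]


class RandomUserAgentMiddleware:
    """Downloader middleware that assigns a different User-Agent to each request."""

    def process_request(self, request, spider):
        request.headers["User-Agent"] = random.choice(USER_AGENTS)
        return None  # let Scrapy continue handling the request normally
```

It would be enabled through the project's DOWNLOADER_MIDDLEWARES setting; real anti-bot systems correlate far more signals than request headers, which is exactly why the arms race keeps escalating.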
Is artificial intelligence coming into this field? Are we ready for production usage of ML in web scraping, at least for some niches?
N: Artificial intelligence is not coming, it’s already here.
It can be applied on the proxy front: you can train a machine learning model to pick the most suitable proxy for the website you are trying to scrape.
It can also be applied on the extraction front: you can train a machine learning model or neural network to classify and identify the different elements of a web page.
We have the Automatic Extraction API, which knows exactly how to classify different parts of the scraped data and eliminate unnecessary parts efficiently. For example, many e-commerce websites have similar layouts to display the product image and other details. A machine learning model (or data parsing algorithm) can be trained to identify the approximate location of a product's image and other details.
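As a toy illustration of that idea (not Zyte's actual model), a simple text classifier can learn to label page fragments as product names, prices, availability notes, or SKUs. The training fragments and labels below are invented for the example; a production system would also use DOM structure, layout position, and visual features:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny made-up training set: text fragments from product pages, labelled with
# the kind of field each fragment represents.
fragments = [
    "Acme Wireless Mouse", "$24.99", "In stock", "SKU: WM-1042",
    "Noise Cancelling Headphones", "€149.00", "Out of stock", "SKU: NC-7781",
]
labels = [
    "name", "price", "availability", "sku",
    "name", "price", "availability", "sku",
]

# Character n-grams pick up cues such as currency symbols and the "SKU:" prefix.
model = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
    LogisticRegression(max_iter=1000),
)
model.fit(fragments, labels)

print(model.predict(["£9.50", "Ergonomic Desk Lamp"]))  # expected: ['price' 'name']
```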
Product Extraction supports pages that contain a single product. Many fields are extracted, such as product name, brand, price, availability, and SKU. This supports use cases such as price monitoring, product intelligence, product analytics, and many others.
Related page types are Product List Extraction, which supports pages with multiple products, and Review Extraction, which supports reviews on single product pages.
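To give a feel for what the structured output enables, a single extracted product record might look roughly like the dictionary below, and a use case such as price monitoring then becomes a trivial check over such records. The exact field names and nesting of the real API response may differ:

```python
# Illustrative shape of one extracted product record; field names follow the
# interview, the real API schema may differ.
product = {
    "name": "Acme Wireless Mouse",
    "brand": "Acme",
    "price": "24.99",
    "currency": "USD",
    "availability": "InStock",
    "sku": "WM-1042",
}

# A minimal price-monitoring use: alert when the price drops below a threshold.
TARGET_PRICE = 20.00
if float(product["price"]) < TARGET_PRICE:
    print(f"Price alert: {product['name']} now costs {product['price']} {product['currency']}")
```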
Web scraping is in a legal gray area; do you also advise your customers in this area? What are the general rules you follow to avoid legal trouble?
N: We hear a lot that web scraping is a legal grey area, but the truth is that scraping itself isn’t illegal. It’s the manner in which you scrape, what you scrape, and how you use the scraped data that might fall into the grey area.
Zyte has a legal and compliance team who have developed web scraping best practices for the various grey areas. Some of the main points that we highlight in our best practices are not to overburden the target sites, to follow copyright laws, to ensure compliance with GDPR and other data protection laws, and to abide by website terms and conditions that you must explicitly accept.
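On the "do not overburden the target sites" point specifically, Scrapy ships settings for polite crawling out of the box. The values below are illustrative, not an official Zyte recommendation:

```python
# settings.py -- illustrative politeness settings for a Scrapy project.

# Respect the robots.txt rules published by the target site.
ROBOTSTXT_OBEY = True

# Keep per-site concurrency low and pause between requests.
CONCURRENT_REQUESTS_PER_DOMAIN = 2
DOWNLOAD_DELAY = 1.0  # seconds; Scrapy randomizes this slightly by default

# Let AutoThrottle back off automatically when the server slows down.
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1.0
AUTOTHROTTLE_MAX_DELAY = 30.0

# Identify the crawler honestly so site owners can reach out.
USER_AGENT = "examplebot/1.0 (+https://example.com/bot-info)"
```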
For further information on ethics and compliance in web data extraction, be sure to tune into the talk by our Chief Legal Officer, Sanaea Daruwalla, at Extract Summit on September 29.
From what I read here, Zyte was born in 2010 and built its foundation on successful open source projects like Scrapy, now with more than 44k stars on GitHub. With so many pull requests and interactions with the community, what does it mean to manage such a successful project?
N: It can be quite an effort to maintain a large project. That's why developers at Zyte have 5 hours per week that they can spend maintaining open source projects, and I believe we have (or wanted to have?) dedicated people to maintain our critical open source projects.
Open source doesn't make you successful as a company, I think, unless the product is really fantastic and it manages to gather a large community. Generally speaking, companies want visibility, and open source projects can become popular pretty fast. But from popularity to success is a long way, in my opinion.
Until some years ago, companies were afraid of open source because it was seen as poorly defensible software, but the mentality seems to have changed now. In your experience, how much do open source projects help a company be successful?
N: Open source projects have helped companies of all sizes be successful. It provides the possibility for anybody to use, study, modify, and distribute your project for any purpose. Many large organizations have open source projects. Yes, even Apple quite recently!! So I guess it has become sort of a trend by now.
Though there are benefits to having open source as a company, at the end of the day, doing open source is hard to manage and does not guarantee success. Open source is powerful because it lowers the barriers to adoption and collaboration, allowing collaborators to work on and improve projects quickly.
That said, you cannot expect open source to just magically change your company. A great open source project requires a good team and commitment from management. So, in a nutshell, working with open source does not guarantee success; it is all about how your company approaches it.
On the 29th of September there will be Extract Summit 2022: what should we expect from it?
N: We expect to have over 200 attendees who will see real-world implementations of the Zyte platform and derive insights from experts and colleagues.
Additionally, Extract Summit will provide a showcase of Zyte’s newest technology release, providing the world with a new standard for reliability and ease of use in open web data extraction.
It will be a good space to meet industry thought leaders and connect with them in person or even virtually.
One talk I will be looking forward to is “Data mining from a bomb shelter in Ukraine”, by Alexander Lebedev. Also can’t wait to hear from Sanaea Daruwalla on “Ethics and Compliance in web data extraction”.
The event will focus on the following themes:
Scaling your web scraping - Learn practical how-tos with technical talks that help you overcome challenges around quality, scalability, and accessibility.
Ethical web data extraction - Learn how to scrape web data respectfully. Hear all about web scraping best practices and ethical use cases.
Innovation in web scraping - Web scraping is the tool that gives companies a competitive edge. Find out the innovative ways in which businesses use web scraped data.
The future of the web scraping industry - AI, machine learning, and what the future holds for data extraction.
The following link has more details on the full Extract Summit speaker agenda.
Thank you, Neha, for your time, and see you in London then!