How to Scrape Open-Source Datasets Ethically
How to collect open data responsibly, without breaking rules or burning bridges
When you need to scrape data from the web, “open data” and “open-source datasets” sound like a green light. No paywall, no login, no restrictions: just data sitting there, ready to be collected. It is a reasonable assumption, right?
Well, not so fast.
Open data does not automatically mean free to use, free to redistribute, or free from privacy obligations. And scraping it without thinking through the implications can land you in legal trouble, get your IP banned from public infrastructure that was never designed to handle aggressive crawlers, or cause you to expose people’s personal information.
In this article, we will build a complete picture of the “open data” world: what the problem actually is, how to approach it correctly, and how to implement responsible open data scrapers in Python.
Let’s dive into it!
Before proceeding, let me thank NetNut, the platinum partner of the month. Their set of solutions covers all your scraping needs.
What “Open Data” Actually Means Legally, Ethically, and Practically
“Open” is one of the most overloaded words in the data world. Depending on the license, the jurisdiction, and the type of data involved, the same publicly accessible dataset can be freely redistributable, commercially restricted, privacy-sensitive, or legally off-limits entirely.
So, before anything else, let’s establish what you are actually dealing with.
What “Open-Source Dataset” Actually Means (and What It Doesn’t)
Where a dataset sits on the licensing spectrum determines everything: whether you can redistribute it, whether you can use it commercially, and whether collecting it at all exposes you to liability. Here is how the spectrum breaks down:
CC0 (Creative Commons Zero): Essentially, it is a public domain dedication. The author waives all rights. You can scrape it, redistribute it, use it commercially, and modify it.
CC-BY (Creative Commons Attribution): It requires you to credit the original source. This means you must clearly state where the data came from, who created it, and link back to the original when you publish or redistribute it. This is the most permissive license after CC0, and it is generally easy to comply with.
CC-BY-SA (Share-Alike): This carries the same attribution requirement as CC-BY, but adds a condition: any derivative work you publish must carry the same license. In practice, this means you cannot fold a CC-BY-SA dataset into a proprietary product and lock it down.
CC-BY-NC (Non-Commercial): It also requires attribution, but restricts commercial use entirely. You can use the data for research, journalism, or personal projects, but the moment money is involved, you need a separate agreement with the data owner.
ODbL (Open Database License), used by OpenStreetMap: It requires both attribution and share-alike, specifically for databases. It is worth noting that ODbL distinguishes between the database itself and the contents. Basically, you can use individual facts freely, but any public use of the database as a whole must comply with the license terms.
And then there is the grey zone, which is where most scraping engineers actually operate: data that is publicly accessible but carries no explicit license. Common cases are government portals, academic repositories, open court records, and municipal datasets. This is a huge portion of what people call “open data”. And here is the thing that matters for scraping professionals: no license does not mean free to use. In most jurisdictions, the absence of a license means that default copyright law applies, so the creator reserves all rights.
So before you write a single line of scraper code, the first question is not “Can I access this?” but “Under what terms am I allowed to use what I access?”
Where the Ethical (and Legal) Risks Hide
Once you have cleared the license question, there are still several risk areas that are easy to overlook:
License violations: This is the most obvious one. If a dataset requires attribution and you redistribute it without crediting the source, you are in breach. If it has a non-commercial clause and you use it in a commercial product, it’s the same story. These are the kinds of things that generate cease-and-desist letters.
PII embedded in “open” datasets: This is a subtler and arguably more dangerous problem than a license violation. Consider open court records: they are public by design, but they contain names, addresses, and sometimes sensitive personal details. Census microdata, even when the published aggregates are anonymized, can contain individual-level records. Likewise, the GitHub commit history is public, but it contains email addresses, which are personal data. So, the fact that data was made public by someone else does not strip it of its privacy implications when you collect, aggregate, and store it.
Jurisdictional complexity: A dataset hosted on a European government portal can carry GDPR obligations even if you are scraping it from the United States. The GDPR’s reach depends largely on where the data subjects are located, not on where the scraper is running. If you are collecting data about people in the EU, assume you may be in GDPR territory regardless of your own geography.
The aggregation problem: This is probably one of the most underappreciated risks in the scraping industry. Individually, a dataset of names, a dataset of addresses, and a dataset of employment records might each be harmless and openly licensed. But combine them, and you have created a detailed profile of real people. This is something that privacy regulations were specifically designed to prevent.
The Infrastructure Problem: Open Data Portals Are Not Built for Scrapers
Many scraping engineers come to open data with habits built on commercial targets. That experience can be misleading, because the infrastructure behind open data portals is completely different.
When you scrape a large e-commerce website or a major social media platform, you are hitting servers that are engineered to handle millions of requests per day, backed by CDNs, load balancers, and dedicated anti-bot teams. In other words, they can take a (hard) hit.
On the other hand, a municipal open data portal, a university’s research repository, or a small NGO’s dataset hosting is an entirely different story: these services typically run on modest, underfunded infrastructure. This means that a scraper that would barely register as noise on Amazon’s servers could genuinely degrade performance for a public data portal serving thousands of researchers.
This is why scraping open data portals aggressively is arguably more unethical than doing the same to a commercial target. You are not fighting a corporation’s anti-bot system. You are potentially taking down a public resource that other people depend on.
A Four-Step Framework for Scraping Open Datasets Without Breaking Rules or Infrastructure
Every risk outlined above has a straightforward mitigation, but only if you apply it at the right point in your workflow. The mistake most scraping engineers make is treating these as afterthoughts: checking the license after the scraper is already built, thinking about PII after the data is already stored. Let’s discuss a framework that inverts this.
License-First Workflow: Read Before You Scrape
The fix for the license problem is simple in principle, even if it requires discipline in practice: make license verification the first step of your workflow.
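Most well-maintained open data portals provide license information in one of three places: a LICENSE file in the dataset’s root directory, a metadata field in the dataset’s API response, or the dataset’s documentation page. Here is a quick reference for what the licenses described above mean for your use case:
CC0: no attribution required, commercial use allowed, no share-alike condition.
CC-BY: attribution required, commercial use allowed, no share-alike condition.
CC-BY-SA: attribution required, commercial use allowed, derivatives must carry the same license.
CC-BY-NC: attribution required, commercial use not allowed without a separate agreement.
ODbL: attribution required, share-alike applies to any public use of the database.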
When there is no license, the safe default is to not scrape and redistribute the data until you have explicit permission from the dataset owner. A short email asking for clarification is a sign of professionalism.
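The license terms described earlier can also be encoded as a small lookup that gates the rest of the pipeline. What follows is a minimal sketch, not legal advice: the `LicenseTerms` class, the `evaluate_license` function, and the lowercase license identifiers are all hypothetical names chosen for illustration, and real portals spell license IDs in many different ways.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LicenseTerms:
    attribution: bool     # the source must be credited
    share_alike: bool     # derivatives must carry the same license
    commercial_use: bool  # commercial use allowed without a separate agreement


# Hypothetical identifiers: normalize whatever the portal reports to these keys.
LICENSE_TERMS = {
    "cc0": LicenseTerms(attribution=False, share_alike=False, commercial_use=True),
    "cc-by": LicenseTerms(attribution=True, share_alike=False, commercial_use=True),
    "cc-by-sa": LicenseTerms(attribution=True, share_alike=True, commercial_use=True),
    "cc-by-nc": LicenseTerms(attribution=True, share_alike=False, commercial_use=False),
    "odbl": LicenseTerms(attribution=True, share_alike=True, commercial_use=True),
}


def evaluate_license(license_id, commercial):
    """Return (allowed, notes) for a planned use of a dataset."""
    terms = LICENSE_TERMS.get((license_id or "").strip().lower())
    if terms is None:
        # Unknown or missing license: default copyright is assumed to apply.
        return False, ["license unknown: ask the dataset owner before redistributing"]
    if commercial and not terms.commercial_use:
        return False, ["non-commercial license: a separate agreement is needed"]
    obligations = []
    if terms.attribution:
        obligations.append("credit the original source")
    if terms.share_alike:
        obligations.append("publish derivatives under the same license")
    return True, obligations


print(evaluate_license("CC-BY-SA", commercial=True))
print(evaluate_license("cc-by-nc", commercial=True))
print(evaluate_license(None, commercial=False))
```

Treating an unknown license as "not allowed" mirrors the safe default above: the pipeline stops until a human has clarified the terms.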
Prefer APIs and Bulk Downloads Over Scraping
This is a rule that experienced scraping engineers sometimes forget because they are so used to reaching for their scraper toolkit: always check for an official API or bulk download endpoint before writing a scraper.
Most serious open data portals expose REST APIs or provide direct bulk download links. Using these is better in every dimension: it is faster, more reliable, more respectful of the server, and often gives you cleaner, more structured data than you would get from parsing HTML.
Your workflow should be:
Check the portal’s documentation for an API.
Check for a Sitemap or structured data endpoint (as discussed in our article on robots.txt and its implications).
Check for bulk download links (CSV, JSON, Parquet).
Only fall back to HTML scraping if none of the above exist.
Scraping should be your last resort, not your first instinct.
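The priority order above can be sketched in a few lines. This is a minimal illustration under stated assumptions: `sitemaps_from_robots` and `choose_access_method` are hypothetical helper names, the robots.txt content and the `data.example.org` URLs are made up, and `RobotFileParser.site_maps()` requires Python 3.8 or later. Discovering whether an API or bulk export exists is left to you, since that usually means reading the portal’s documentation.

```python
from urllib.robotparser import RobotFileParser


def sitemaps_from_robots(robots_txt: str) -> list[str]:
    """Return the Sitemap URLs declared in a robots.txt body (may be empty)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # site_maps() returns None when no Sitemap line is present
    return parser.site_maps() or []


def choose_access_method(api_url, sitemap_urls, bulk_urls):
    """Pick the least invasive access method, in the order listed above."""
    if api_url:
        return ("api", api_url)
    if sitemap_urls:
        return ("sitemap", sitemap_urls[0])
    if bulk_urls:
        return ("bulk_download", bulk_urls[0])
    # Nothing better exists: HTML scraping is the last resort
    return ("html_scraping", None)


# Hypothetical robots.txt content for a portal at data.example.org
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Sitemap: https://data.example.org/sitemap.xml
"""

sitemaps = sitemaps_from_robots(ROBOTS_TXT)
print(choose_access_method(None, sitemaps, ["https://data.example.org/export.csv"]))
```

Parsing the robots.txt text you already fetched, instead of calling `RobotFileParser.read()`, keeps the helper free of network access and easy to test.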
Responsible Scraping Behavior for Open Infrastructure
When scraping is genuinely the only option, the rules of polite scraping apply. But in the case of open data portals, you should apply a higher standard than you would on a commercial target.
As covered in “best practices for ethical web scraping”, respecting rate limits, introducing delays between requests, and using a descriptive User-Agent are baseline requirements. But for open data portals, you should go further because of their weaker infrastructure. Below are additional rules you should consider:
Respect Crawl-delay strictly: Even if major crawlers ignore it, on underfunded infrastructure, that directive is a good signal about server capacity.
Cache responses locally: If you need to re-run your scraper for testing or debugging, you should not be hitting the server again. Cache what you have already fetched.
Scrape during off-peak hours: For public portals serving researchers and government users, off-peak typically means nights and weekends in the portal’s local timezone.
Scrape only what you need: This sounds obvious, but it’s easy to over-collect data “just in case”. For open portals, remember that every unnecessary request is a cost imposed on a public resource that stays online on underfunded infrastructure.
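Three of the rules above (honoring Crawl-delay, caching, and off-peak scheduling) fit into small helpers. The sketch below is illustrative only: the function names, the 5-second floor, the 22:00 to 06:00 night window, and the cache file layout are all assumptions, and the `fetch` argument stands in for whatever HTTP client you actually use.

```python
import hashlib
import tempfile
from datetime import datetime
from pathlib import Path
from urllib.robotparser import RobotFileParser


def effective_delay(robots_txt, user_agent, floor_seconds=5.0):
    """Use the portal's Crawl-delay, but never go below our own floor."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    declared = parser.crawl_delay(user_agent)  # None if not declared
    return max(float(declared or 0), floor_seconds)


def is_off_peak(local_time):
    """True on weekends and at night, in the portal's local timezone."""
    is_weekend = local_time.weekday() >= 5  # Saturday is 5, Sunday is 6
    is_night = local_time.hour >= 22 or local_time.hour < 6
    return is_weekend or is_night


def cached_fetch(url, fetch, cache_dir):
    """Return the cached body for a URL, calling fetch(url) only on a miss."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    path = cache_dir / (hashlib.sha256(url.encode("utf-8")).hexdigest() + ".cache")
    if path.exists():
        return path.read_bytes()
    body = fetch(url)
    path.write_bytes(body)
    return body


# Hypothetical robots.txt content with a declared crawl delay
ROBOTS_WITH_DELAY = "User-agent: *\nCrawl-delay: 10\nDisallow: /private/\n"
print(effective_delay(ROBOTS_WITH_DELAY, "polite-research-bot"))
print(is_off_peak(datetime(2024, 1, 3, 12, 0)))  # a Wednesday at noon

with tempfile.TemporaryDirectory() as tmp:
    fake_fetch = lambda url: b"id,value\n1,42\n"  # stands in for a real HTTP call
    cached_fetch("https://data.example.org/export.csv", fake_fetch, tmp)
    # The second call is served from disk and never reaches the server
    print(cached_fetch("https://data.example.org/export.csv", fake_fetch, tmp))
```

The floor matters because many portals declare no Crawl-delay at all, and the absence of a directive says nothing about how much load the server can take.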
Handling PII in Open Datasets
PII stands for Personally Identifiable Information. This refers to any data that can be used, alone or in combination with other data, to identify a specific individual. Think names, email addresses, phone numbers, but also subtler things like IP addresses or device IDs.
The reality is that most well-maintained open data portals go through a review process before publication, and raw PII in open datasets is not as common as you might think. The most common cases where PII can slip through are quite specific: older government datasets published before modern privacy review processes, improperly anonymized academic research deposits, or crowdsourced datasets where contributors included personal details voluntarily.
In practice, the bigger risk for most scraping engineers is at the aggregation level. A dataset of names, a dataset of ZIP codes, and a dataset of employment records might each be perfectly clean and openly licensed in isolation. But combine them, and you have built a detailed profile of real individuals. This is something that privacy regulations like the GDPR and CPRA were specifically designed to prevent. And once you collect, store, and process that combined data, you become responsible for it, regardless of where it originally came from.
The key principle remains the usual one: identify and handle PII at collection time. Here is a schema you can use to audit the fields that are likely to contain PII:
Direct identifiers: names, email addresses, phone numbers, national ID numbers, passport numbers, and social security numbers. These are the clearest cases as they point to a specific individual on their own, without needing to be combined with anything else. If you see these fields in a dataset, there is no ambiguity: you are dealing with PII.
Quasi-identifiers: dates of birth, ZIP codes, job titles, gender, ethnicity, and salary ranges. None of these identify a person on their own, but they become dangerous in combination. A classic example is combining just three fields, say date of birth, gender, and ZIP code: this is enough to uniquely identify a large share of a country’s population.
Sensitive categories under GDPR: health and medical data, political opinions, religious or philosophical beliefs, biometric data, genetic data, sexual orientation, and trade union membership. This is a legally distinct class that carries stricter obligations regardless of context. In other words, you cannot process this data based on legitimate interest alone. You need explicit consent or another specific legal basis, and the bar is significantly higher than for ordinary PII.
For each PII field, decide upfront: do you need it? If not, drop it at collection time. If you do need it, apply pseudonymization (replacing the identifier with a token that can be linked back to the person only with separately held information, such as a key or lookup table) or anonymization (irreversible removal or generalization) before storage.
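To make the quasi-identifier risk concrete, you can count how many records share each combination of quasi-identifier values: any record that is alone in its group is effectively unique. The sketch below is a simple k-anonymity-style check, not a full privacy assessment; the function names, field names, and sample rows are hypothetical.

```python
from collections import Counter


def group_sizes(records, quasi_fields):
    """Count how many records share each combination of quasi-identifier values."""
    return Counter(tuple(r.get(f) for f in quasi_fields) for r in records)


def smallest_group_size(records, quasi_fields):
    """The k in k-anonymity: size of the smallest group (0 for an empty dataset)."""
    sizes = group_sizes(records, quasi_fields)
    return min(sizes.values()) if sizes else 0


def risky_records(records, quasi_fields, k=2):
    """Records whose quasi-identifier combination is shared by fewer than k rows."""
    sizes = group_sizes(records, quasi_fields)
    return [r for r in records
            if sizes[tuple(r.get(f) for f in quasi_fields)] < k]


# Hypothetical rows after direct identifiers have already been dropped
rows = [
    {"birth_year": 1985, "gender": "F", "zip": "10001"},
    {"birth_year": 1985, "gender": "F", "zip": "10001"},
    {"birth_year": 1973, "gender": "M", "zip": "10001"},
]

print(smallest_group_size(rows, ["zip"]))
print(smallest_group_size(rows, ["birth_year", "gender", "zip"]))
print(risky_records(rows, ["birth_year", "gender", "zip"]))
```

Each field on its own may look harmless, while the combination of all three singles out one row. Running this check before and after joining datasets shows whether the join is what creates the identifiable records.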
Python Implementation: Putting the Full Responsible Scraping Pipeline Into Code
Principles are only useful if they translate into implementation. Below are two concrete components you can adapt for your own pipelines:
Checking a dataset’s license before downloading anything, using CKAN’s metadata API, with a practical fallback strategy for portals that don’t run CKAN.
Running PII detection at collection time, using field-level schema classification, with an honest discussion of where that approach has limits.
Note that the examples below omit an API-first fetch pattern and a polite scraper skeleton, even though they are covered in the framework section above. This is because those are problems with well-known, straightforward solutions that every scraping engineer should be aware of. The idea of the following sections is to provide you with lesser-known solutions, to help you get ideas to apply to your pipelines.
Checking a Dataset’s License Programmatically
Many open data portals are built on CKAN, an open-source data management system used by governments and enterprises. CKAN exposes a REST API that includes license metadata, which makes programmatic license checking straightforward.
Here is how to query a CKAN-based portal and extract license information before proceeding:
import requests


def check_dataset_license(portal_base_url: str, dataset_id: str) -> dict:
    """
    Queries a CKAN portal API to retrieve license information
    for a given dataset before any scraping begins.
    """
    api_url = f"{portal_base_url.rstrip('/')}/api/3/action/package_show"
    params = {"id": dataset_id}

    response = requests.get(api_url, params=params, timeout=10)
    response.raise_for_status()

    data = response.json()
    result = data.get("result", {})

    # "or" also covers fields that are present but empty or null
    license_info = {
        "dataset_name": result.get("title") or "Unknown",
        "license_id": result.get("license_id") or "Not specified",
        "license_title": result.get("license_title") or "Not specified",
        "license_url": result.get("license_url") or "Not specified",
    }
    return license_info


# Example: querying the UK government's open data portal
portal = "https://data.gov.uk"
dataset = "road-accidents-safety-data"

license_info = check_dataset_license(portal, dataset)
print(f"Dataset: {license_info['dataset_name']}")
print(f"License: {license_info['license_title']}")
print(f"License ID: {license_info['license_id']}")
print(f"License URL: {license_info['license_url']}")

This outputs the following:

Dataset: Road Safety Data
License: UK Open Government Licence (OGL)
License ID: uk-ogl
License URL: https://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/

With this information in hand, you can make an informed decision before a single byte of dataset content is downloaded. Specifically, you can open the license URL and read the terms on the government license page directly.
[Image: partial view of the Open Government Licence page]
But what if the portal you need to scrape doesn’t run CKAN? Not all open data portals do… Socrata (used by many US city and state governments), DKAN, and custom-built portals each have different or no metadata APIs. In those cases, your fallback options are the following:
Check for a LICENSE or METADATA file in the dataset’s root directory or bulk download package. Many portals include one.
Look for a <link rel="license"> tag in the dataset’s HTML page, which some portals emit as structured metadata.
Check the portal’s documentation or “About” page, where license terms are often stated globally for all datasets.
If none of the above yield a clear answer, treat the license as unknown and do not redistribute without seeking explicit written permission from the dataset owner. A short email asking for clarification is a professional move.
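The HTML fallback can be checked with the standard library alone. This is a minimal sketch: `LicenseLinkParser` and `find_license_links` are hypothetical names, the sample page is made up, and the code assumes you have already fetched the dataset page politely. It also accepts `<a rel="license">`, since some pages mark up a visible license link that way.

```python
from html.parser import HTMLParser


class LicenseLinkParser(HTMLParser):
    """Collects href values from <link> and <a> tags whose rel includes 'license'."""

    def __init__(self):
        super().__init__()
        self.license_urls = []

    def handle_starttag(self, tag, attrs):
        if tag not in ("link", "a"):
            return
        attr_map = dict(attrs)
        # rel can hold several space-separated values, e.g. "license noopener"
        rel_values = (attr_map.get("rel") or "").lower().split()
        href = attr_map.get("href")
        if "license" in rel_values and href:
            self.license_urls.append(href)


def find_license_links(html: str) -> list[str]:
    parser = LicenseLinkParser()
    parser.feed(html)
    return parser.license_urls


# Hypothetical dataset page
SAMPLE_PAGE = """
<html>
  <head>
    <link rel="stylesheet" href="/static/site.css">
    <link rel="license" href="https://creativecommons.org/publicdomain/zero/1.0/">
  </head>
  <body><a rel="license noopener" href="/terms">Terms of use</a></body>
</html>
"""

print(find_license_links(SAMPLE_PAGE))
```

An empty result does not mean the dataset is unlicensed. It only means this particular signal is absent, so the remaining fallbacks still apply.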
PII Detection at Scrape Time
In this case, the approach depends heavily on what you actually know about the data you need to scrape. You will encounter two situations in practice, each calling for a different strategy:
You know the schema: If you are retrieving structured data, field-level detection is the right approach. You know which fields are likely to carry PII, so you can target them directly. This is faster, more precise, and produces far fewer false positives than running a general NER model over free text.
You have no schema: For unstructured data, NER-based detection is a reasonable starting point, but go in with realistic expectations. A common solution is using spaCy’s en_core_web_sm, which is a small general-purpose English model trained on web text such as blogs, news, and comments, so don’t expect it to work miracles. Another approach, which can give much better results, is using LLMs to give structure to unstructured text.
For the structured case, here is a field-level PII detection pipeline:
import re
import hashlib
from dataclasses import dataclass
from typing import Any

# Fields that are unambiguously PII on their own
DIRECT_IDENTIFIER_FIELDS = {
    "name", "full_name", "first_name", "last_name",
    "email", "email_address",
    "phone", "phone_number", "mobile",
    "ssn", "national_id", "passport_number",
    "ip_address", "device_id"
}

# Fields that are not PII alone but dangerous in combination
QUASI_IDENTIFIER_FIELDS = {
    "date_of_birth", "dob", "birth_date",
    "zip_code", "postcode", "zip",
    "gender", "sex",
    "job_title", "occupation",
    "salary", "income",
    "ethnicity", "race"
}

# Regex patterns for validating suspected PII values at the content level
EMAIL_PATTERN = re.compile(r"[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+")
PHONE_PATTERN = re.compile(r"\b(\+?\d[\d\s\-().]{7,}\d)\b")


@dataclass
class FieldAudit:
    field_name: str
    classification: str   # "direct", "quasi", or "clean"
    original_value: Any
    processed_value: Any  # pseudonymized, generalized, or original
    action_taken: str     # "pseudonymized", "generalized", "dropped", "kept"


def pseudonymize(value: Any) -> str:
    """
    Replaces a PII value with a consistent token.
    Using a hash means the same value always produces the same token,
    which preserves referential integrity across records (e.g., you can
    still count unique users without knowing who they are).
    In production, use an HMAC with a secret key instead of plain SHA-256.
    """
    return hashlib.sha256(str(value).encode()).hexdigest()[:16]


def generalize_date(value: str) -> str:
    """
    Reduces a full date of birth to a birth year only.
    A simple but effective generalization for quasi-identifiers.
    """
    # Handles common formats: YYYY-MM-DD, DD/MM/YYYY, MM/DD/YYYY
    match = re.search(r"\b(19|20)\d{2}\b", str(value))
    return match.group(0) if match else "UNKNOWN_YEAR"


def audit_record(record: dict) -> tuple[dict, list[FieldAudit]]:
    """
    Processes a single structured record field by field.
    Returns a cleaned record and a full audit trail of what was done to each field.

    Strategy:
    - Direct identifiers: pseudonymize (preserve referential integrity)
    - Quasi-identifiers: generalize where possible, pseudonymize otherwise
    - Everything else: pass through unchanged
    """
    clean_record = {}
    audit_trail = []

    for field_name, value in record.items():
        normalized = field_name.lower().strip()

        if normalized in DIRECT_IDENTIFIER_FIELDS:
            processed = pseudonymize(value)
            audit_trail.append(FieldAudit(
                field_name=field_name,
                classification="direct",
                original_value=value,
                processed_value=processed,
                action_taken="pseudonymized"
            ))
            clean_record[field_name] = processed

        elif normalized in QUASI_IDENTIFIER_FIELDS:
            # Apply field-specific generalization where we can
            if normalized in {"date_of_birth", "dob", "birth_date"}:
                processed = generalize_date(value)
                action = "generalized"
            else:
                # For other quasi-identifiers, pseudonymize as a safe default
                processed = pseudonymize(value)
                action = "pseudonymized"
            audit_trail.append(FieldAudit(
                field_name=field_name,
                classification="quasi",
                original_value=value,
                processed_value=processed,
                action_taken=action
            ))
            clean_record[field_name] = processed

        else:
            # Field is not in either PII list: pass through, but still
            # run a regex check on string values as a safety net
            if isinstance(value, str):
                if EMAIL_PATTERN.search(value) or PHONE_PATTERN.search(value):
                    # Unexpected PII in a non-PII field: flag it and pseudonymize
                    processed = pseudonymize(value)
                    audit_trail.append(FieldAudit(
                        field_name=field_name,
                        classification="direct",
                        original_value=value,
                        processed_value=processed,
                        action_taken="pseudonymized (unexpected PII in non-PII field)"
                    ))
                    clean_record[field_name] = processed
                    continue
            audit_trail.append(FieldAudit(
                field_name=field_name,
                classification="clean",
                original_value=value,
                processed_value=value,
                action_taken="kept"
            ))
            clean_record[field_name] = value

    return clean_record, audit_trail


def process_records(records: list[dict]) -> list[dict]:
    """
    Runs field-level PII detection and handling across a list of records.
    Prints an audit summary for any record where PII was found.
    """
    clean_records = []
    for i, record in enumerate(records):
        clean_record, audit_trail = audit_record(record)
        pii_fields = [a for a in audit_trail if a.classification != "clean"]
        if pii_fields:
            print(f"Record {i}: PII detected and handled in {len(pii_fields)} field(s):")
            for audit in pii_fields:
                print(f"  [{audit.classification.upper()}] {audit.field_name} "
                      f"→ {audit.action_taken}")
        clean_records.append(clean_record)
    return clean_records


# Example: a batch of records from a scraped open dataset
records = [
    {
        "record_id": "A001",
        "name": "Jane Doe",
        "date_of_birth": "1985-03-22",
        "zip_code": "SW1A 1AA",
        "incident_type": "Road accident",
        "severity": "Slight"
    },
    {
        "record_id": "A002",
        "name": "John Smith",
        "date_of_birth": "1973-11-04",
        "zip_code": "EC1A 1BB",
        "incident_type": "Road accident",
        "severity": "Serious",
        # An email that slipped into a free-text notes field
        "notes": "Witness contact: witness@example.com"
    }
]

clean = process_records(records)

The output is the following:
Record 0: PII detected and handled in 3 field(s):
  [DIRECT] name → pseudonymized
  [QUASI] date_of_birth → generalized
  [QUASI] zip_code → pseudonymized
Record 1: PII detected and handled in 4 field(s):
  [DIRECT] name → pseudonymized
  [QUASI] date_of_birth → generalized
  [QUASI] zip_code → pseudonymized
  [DIRECT] notes → pseudonymized (unexpected PII in non-PII field)

A few things are worth calling out in this implementation:
Pseudonymization preserves referential integrity: Because the same input always produces the same hash token, you can still count unique individuals, join records, or track entities across datasets, without storing the raw PII. In production, replace the plain SHA-256 with an HMAC keyed on a secret, so that someone who knows the hashing algorithm cannot recover values by hashing likely inputs and comparing the results.
The regex safety net on non-PII fields: This catches the common real-world case where PII slips into a free-text or notes field that your schema classification didn’t anticipate. It is not foolproof, but it catches the obvious cases.
The audit trail is intentional: Every field-level decision is logged. If you are ever asked to demonstrate that your collection process handled PII responsibly, you have a record of exactly what was done to each field in each record.
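The keyed variant recommended for production can be sketched with Python’s standard `hmac` module. This is an illustration under assumptions: `pseudonymize_keyed` is a hypothetical name, the 16-character truncation simply mirrors the pipeline above, and the hard-coded demo key is for the example only. A real key should come from a secret store and never live next to the data.

```python
import hashlib
import hmac


def pseudonymize_keyed(value, secret_key: bytes) -> str:
    """
    Keyed pseudonymization: the same value and key always give the same token,
    but the token cannot be recomputed without the key.
    """
    digest = hmac.new(secret_key, str(value).encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


# Demo key for illustration only: load the real one from a secret store
demo_key = b"demo-only-key"

token_a = pseudonymize_keyed("jane.doe@example.com", demo_key)
token_b = pseudonymize_keyed("jane.doe@example.com", demo_key)
other = pseudonymize_keyed("jane.doe@example.com", b"a-different-key")

print(token_a == token_b)  # referential integrity is preserved
print(token_a == other)    # a different key yields unrelated tokens
```

With plain SHA-256, anyone holding a list of candidate emails can hash them and match tokens. With an HMAC, that attack also requires the key, which is why the key has to be stored separately from the dataset.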
Conclusion
Open data is a shared resource, and how you interact with it says something about you as a professional. In this article, you learned what “open” means in the context of data scraping and how you should treat it if you want to be an ethical scraper.
So, let us know: Did we miss something? What’s your approach to handling open datasets in your scraping projects? Let’s discuss in the comments.





