How to Parse JSON with Python: A Beginner-Friendly Guide
Tips and tricks for handling JSON in your scraping operations
Are you searching for the best way to work with JSON in Python? You’ve come to the right place!
JSON (JavaScript Object Notation) is a lightweight and widely used data-interchange format that every Python developer (and, in general, every person handling data) must master. Usually, in our web scraping projects, we find JSON data in the responses to API calls and in the HTML of our target websites. In fact, many frameworks used for web development use JSON to pass dynamic data from the backend to the front end.
In this mini guide, we’ll see how to parse JSON with Python using some real-world examples.
What is JSON?
JSON is a text-based format used to store and exchange data. It’s lightweight, easy to read, and works seamlessly with Python. Here’s a sample JSON structure:
{
    "name": "Alice",
    "age": 30,
    "skills": ["Python", "Data Analysis"],
    "isEmployed": true
}
This format resembles Python dictionaries, making it intuitive to use. JSON is also widely used in web APIs, configuration files, and data storage because it is:
Lightweight and fast for data exchange.
Readable and easy to understand.
Compatible with Python’s built-in json library, which I find the cleanest way to handle it.
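To make the resemblance to Python dictionaries concrete, here is a minimal sketch of how the json module maps JSON values to Python types. The "manager" key with a null value is not part of the sample above; it is added here only to illustrate how null is handled.

```python
import json

# The sample document from above, plus a hypothetical "manager" key
# added only to show how JSON null is converted.
sample = (
    '{"name": "Alice", "age": 30, '
    '"skills": ["Python", "Data Analysis"], '
    '"isEmployed": true, "manager": null}'
)

data = json.loads(sample)

print(type(data).__name__)            # JSON object -> dict
print(type(data["skills"]).__name__)  # JSON array  -> list
print(type(data["age"]).__name__)     # JSON number -> int (or float)
print(data["isEmployed"])             # JSON true   -> True
print(data["manager"])                # JSON null   -> None
```

Note the small spelling differences: JSON writes true, false, and null, while Python uses True, False, and None.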
Why Extract JSON Instead of HTML in Web Scraping?
When working on web scraping projects, extracting data from JSON is often more efficient and reliable than parsing raw HTML. I can summarize the advantages of using it in the following points:
Clean and Structured Data: JSON provides data in a structured format that is easy to parse and process. Unlike HTML, which often requires complex parsing to extract useful information, JSON data can be directly accessed using Python dictionaries.
Reduced Noise: HTML files typically contain a lot of extraneous information, such as JavaScript, CSS, and advertisements. JSON responses, on the other hand, are tailored to provide only the required data, making your scraping process cleaner and faster.
API Advantages: Many modern websites offer APIs that return data in JSON format. APIs are designed to provide stable and predictable data, reducing the chances of scraping errors due to changes in website structure.
Fewer Parsing Challenges: HTML parsing requires navigating tags, classes, and attributes. JSON data eliminates this overhead, allowing you to access keys and values directly.
Error Handling: JSON data is easier to validate and debug than HTML data. You can quickly check for malformed JSON or missing fields without worrying about misaligned tags or broken HTML structures.
Lower Cost: related to the point on reduced noise, if we’re using proxies when scraping a website, it’s far less expensive to extract data from internal APIs than to download the whole HTML and extract the data we need. Much less bandwidth is used, and your pay-per-GB proxy bill will be lighter.
Stability: APIs are less prone to change. They are usually built to serve several systems, not just a company's website, so their output changes less frequently than the website's HTML. As mentioned in this previous post, pointing our scrapers to the JSON data returned will make them more reliable in the long term.
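The error-handling point above can be sketched in a few lines. This is only an illustration: the function name parse_product and the fields "name" and "price" are hypothetical, chosen to show how malformed JSON and missing fields can be detected.

```python
import json


def parse_product(raw):
    """Return (name, price) from a JSON string, or None if it is unusable.

    parse_product and its field names are hypothetical examples.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as err:
        # The exception reports where parsing failed
        print(f"Malformed JSON at line {err.lineno}, column {err.colno}: {err.msg}")
        return None

    if not isinstance(data, dict):
        print("Expected a JSON object at the top level")
        return None

    # .get() returns None instead of raising KeyError for a missing field
    name = data.get("name")
    price = data.get("price")
    if name is None or price is None:
        print("Missing required field")
        return None
    return name, price


good = parse_product('{"name": "Laptop", "price": 999.0}')
truncated = parse_product('{"name": "Laptop", "price": ')
incomplete = parse_product('{"name": "Laptop"}')
print(good, truncated, incomplete)
```

In a scraper, returning None (or logging and skipping the record) is usually preferable to letting one broken response stop the whole run.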
Online tools for viewing JSON data
Sometimes, we need to handle large JSON structures, and finding the data we need can be challenging without a proper visualization.
For this reason, I find it helpful to copy this data to some online tools that help me visualize it.
Here’s my selection:
JSON formatter: it also highlights errors when you have copied invalid JSON. It is useful when you copy long strings from HTML and don’t notice that you copied several JSON documents together.
JSON Blob: this is particularly helpful when you need to save your JSON and share it with someone else.
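The "several JSON documents copied together" case can also be handled in code. The sketch below uses json.JSONDecoder.raw_decode from the standard library, which parses one document and reports where it ended. The helper name split_concatenated_json and the sample string are assumptions made for illustration.

```python
import json


def split_concatenated_json(text):
    """Parse a string that contains several JSON documents back to back.

    Hypothetical helper: returns the decoded documents as a list.
    """
    decoder = json.JSONDecoder()
    documents = []
    pos = 0
    while pos < len(text):
        # raw_decode does not skip leading whitespace, so skip it here
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos >= len(text):
            break
        # Returns the decoded value and the index where it ended
        obj, pos = decoder.raw_decode(text, pos)
        documents.append(obj)
    return documents


copied = '{"id": 1, "title": "first"} {"id": 2, "title": "second"}\n[1, 2, 3]'
docs = split_concatenated_json(copied)
print(len(docs))
for doc in docs:
    print(doc)
```

Calling json.loads on the same string would raise a JSONDecodeError with the message "Extra data", which is a common symptom of this problem.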
Getting Started: Import Python’s JSON Library
Python includes a built-in json module, so there’s no need to install extra libraries. Just import it:
import json
This module will help you read, write, and manipulate JSON data effortlessly.
Parsing JSON Strings in Python
If you receive a JSON string, you can easily convert it to a Python dictionary using json.loads().
Example:
import json
# JSON string
json_string = '{"name": "Alice", "age": 30, "skills": ["Python", "Data Analysis"]}'
# Convert JSON string to Python dictionary
data = json.loads(json_string)
print(data["name"]) # Output: Alice
Reading JSON Files in Python
JSON is commonly stored in files, and Python makes it simple to read this data using json.load().
Example:
import json
# Open JSON file
with open("data.json", "r") as file:
    data = json.load(file)
# Accessing data
print(data["skills"]) # Output: ['Python', 'Data Analysis']
Writing JSON Data to Files
Want to save data to a JSON file? Use Python’s json.dump() to write Python dictionaries to files in JSON format.
Example:
import json
# Python dictionary
data = {
    "name": "Bob",
    "age": 25,
    "skills": ["Java", "Web Development"]
}
# Write JSON to a file
with open("output.json", "w") as file:
    json.dump(data, file, indent=4) # `indent` makes the JSON readable
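Scraped data often contains non-ASCII characters. As a minimal sketch, the example below writes such data with ensure_ascii=False and an explicit UTF-8 encoding, then reads it back. The sample values and the use of a temporary directory are assumptions, chosen so the example cleans up after itself.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical sample data with accented characters
data = {"name": "José", "city": "São Paulo", "skills": ["Python"]}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "output.json"

    # ensure_ascii=False keeps "José" as-is instead of "Jos\u00e9"
    with open(path, "w", encoding="utf-8") as file:
        json.dump(data, file, indent=4, ensure_ascii=False)

    raw_text = path.read_text(encoding="utf-8")

    # Read the file back to confirm the round trip
    with open(path, "r", encoding="utf-8") as file:
        restored = json.load(file)

print(raw_text)
print(restored == data)
```

By default json.dump escapes non-ASCII characters, which is still valid JSON but harder to read when you inspect the file by hand.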
Pretty-Printing JSON Data
JSON data can be hard to read when it’s compact. Pretty-print your JSON for better readability using json.dumps().
Example:
import json
# JSON string
json_string = '{"name": "Alice", "age": 30, "skills": ["Python", "Data Analysis"]}'
# Convert string to dictionary
data = json.loads(json_string)
# Pretty-print JSON
pretty_json = json.dumps(data, indent=4)
print(pretty_json)
Output:
{
    "name": "Alice",
    "age": 30,
    "skills": [
        "Python",
        "Data Analysis"
    ]
}
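Two other json.dumps options are worth knowing alongside indent. The sketch below, using the same sample data, shows separators for the most compact output and sort_keys for a stable key order.

```python
import json

data = {"name": "Alice", "age": 30, "skills": ["Python", "Data Analysis"]}

# Compact form: no spaces after "," and ":"; handy for storage or transfer
compact = json.dumps(data, separators=(",", ":"))
print(compact)

# Sorted keys make two dumps of similar data easier to compare
sorted_pretty = json.dumps(data, indent=2, sort_keys=True)
print(sorted_pretty)
```

Both strings decode back to the same dictionary; only the text representation differs.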
Handling Nested JSON Data in Python
Real-world JSON data often contains nested structures. Python’s dictionary operations make it easy to extract values from nested JSON.
Example:
import json
# Nested JSON string
nested_json = '{"user": {"name": "Alice", "details": {"age": 30, "skills": ["Python", "SQL"]}}}'
# Parse JSON
data = json.loads(nested_json)
# Access nested values
print(data["user"]["details"]["skills"]) # Output: ['Python', 'SQL']
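Chained lookups like the one above raise KeyError as soon as one level is missing, which happens often with scraped data. A minimal sketch of a safer lookup follows; get_nested is a hypothetical helper, not part of the json module.

```python
import json


def get_nested(data, path, default=None):
    """Follow a list of keys/indexes through nested dicts and lists.

    Hypothetical helper: returns `default` as soon as a step is missing.
    """
    current = data
    for key in path:
        if isinstance(current, dict) and key in current:
            current = current[key]
        elif (
            isinstance(current, list)
            and isinstance(key, int)
            and -len(current) <= key < len(current)
        ):
            current = current[key]
        else:
            return default
    return current


nested_json = '{"user": {"name": "Alice", "details": {"age": 30, "skills": ["Python", "SQL"]}}}'
data = json.loads(nested_json)

first_skill = get_nested(data, ["user", "details", "skills", 0])
city = get_nested(data, ["user", "address", "city"], default="unknown")
print(first_skill)
print(city)
```

For one or two levels, data.get("user", {}).get("name") does the same job without a helper.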
Modifying JSON Data in Python
Python lets you modify JSON data like a regular dictionary. After making changes, you can save the updated JSON.
Example:
import json
# JSON string
json_string = '{"name": "Alice", "age": 30}'
# Parse JSON
data = json.loads(json_string)
# Modify a value
data["age"] = 35
# Convert back to JSON
modified_json = json.dumps(data, indent=4)
print(modified_json)
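The same approach covers adding and removing data, not only changing a value. In the sketch below, the "skills", "isEmployed", and "temp_token" keys are assumptions added for illustration.

```python
import json

# Hypothetical input with a list and a field we want to drop
json_string = '{"name": "Alice", "age": 30, "skills": ["Python"], "temp_token": "abc123"}'
data = json.loads(json_string)

data["skills"].append("SQL")   # add an element to a nested list
data["isEmployed"] = True      # add a new key
data.pop("temp_token", None)   # remove a key; no error if it is absent

modified_json = json.dumps(data, indent=4)
print(modified_json)
```

When serialized, the Python value True becomes JSON true again, so the round trip stays valid JSON.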
Parsing JSON from Scrapy and Playwright
Libraries like Scrapy and Playwright are commonly used for web scraping, and JSON responses are frequent. Here’s how to handle JSON in these libraries:
Scrapy Example:
import scrapy
import json

class ExampleSpider(scrapy.Spider):
    name = 'example'

    def start_requests(self):
        url = 'https://jsonplaceholder.typicode.com/posts'
        yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        # Parse JSON response
        data = json.loads(response.text)
        for item in data:
            print(item["title"])
Playwright Example:
from playwright.sync_api import sync_playwright
import json

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    # Navigate to a page that returns JSON
    response = page.request.get('https://jsonplaceholder.typicode.com/posts')
    # Parse JSON response
    data = json.loads(response.text())
    for item in data:
        print(item["title"])
    browser.close()
Bonus: Parsing JSON with pandas
For data analysis, convert JSON to a DataFrame using pandas. This is especially useful when working with structured data.
Example:
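from io import StringIO
import pandas as pd
# JSON data
json_data = '[{"name": "Alice", "age": 30}, {"name": "Bob", "age": 25}]'
# Convert to DataFrame (wrap the string in StringIO: passing a literal JSON string directly is deprecated in recent pandas versions)
df = pd.read_json(StringIO(json_data))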
print(df)
Output:
    name  age
0  Alice   30
1    Bob   25
Hi Pierluigi, maybe it is worth mentioning that Playwright and Requests response objects have built-in json() functions.
So in your example, instead of doing
data = json.loads(response.text())
You can do
data = response.json()
which, I believe, is a bit more concise.