In this tutorial I will show you how to scrape information from websites with Python using the popular Scrapy library.
Introduction
Python has a number of tools and libraries that can be used for extracting, or "scraping", information from websites. The requests library in combination with the BeautifulSoup parser is often used for relatively simple cases. But for gathering data at a larger scale you need a tool that will help you avoid re-inventing the wheel and violating the DRY principle. This is where Scrapy comes to the rescue.
Scrapy is a framework that simplifies extracting, processing and saving information from websites. It includes all necessary "batteries" for extracting information from websites and is extensible so you can adapt it for your specific needs.
Getting Started
For this tutorial I have chosen the vidsplay.com site, which hosts free stock videos. I will show you how to build a scraper that extracts the list of all videos from the site, with all necessary metadata, and saves it to .json and .xlsx formats.
This tutorial assumes that you know how to work with Python virtual environments. I have used Python 3.6 and Scrapy 1.5.0.
First, create the directory for your vidsplay.com scraper project:
mkdir vidsplay-scraper
Now create and activate your Python virtual environment.
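For example, on Linux or macOS this might look like the following (the venv directory name is arbitrary; you will also need Scrapy itself installed in the environment):
python3 -m venv venv
source venv/bin/activate
pip install scrapy==1.5.0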
Go to our newly created project directory and create a Scrapy project structure:
scrapy startproject vidsplay_scraper .
The Scrapy command-line utility will create all necessary boilerplate files and directories for our project. This loosely resembles creating a new project for the Django web framework (if you are familiar with it).
Scrapy Project Structure
A Scrapy project consists of several files and directories, and each serves a particular purpose:
/vidsplay_scraper/
    /spiders/
        __init__.py
    __init__.py
    items.py
    middlewares.py
    pipelines.py
    settings.py
scrapy.cfg
Let's review them in brief.
scrapy.cfg — the project configuration file. For simple cases like ours we do not need to touch it, so let's leave it as is.
/vidsplay_scraper — a Python package containing all the project's Python components.
/spiders — a Python package containing spiders, that is, Python classes responsible for parsing webpages.
items.py — a module containing data models for the information extracted from webpages. Data models, or items, are subclasses of scrapy.Item, which behaves much like a Python dict. Use custom data models (items) if you need logic for your data that a plain dict does not provide. But in simple cases you can use plain dict items, so in this tutorial I'll leave this module empty.
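Just for illustration, a custom item for our video data could look like the hypothetical class below; we won't actually use it in this tutorial:
import scrapy


class VideoItem(scrapy.Item):
    """Hypothetical item class; this tutorial uses plain dicts instead."""
    category = scrapy.Field()
    title = scrapy.Field()
    thumbnail = scrapy.Field()
    url = scrapy.Field()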
middlewares.py — request/response processing middlewares. In most cases the default auto-generated module is enough, and in our tutorial we'll leave it as is.
pipelines.py — item processing pipelines. Pipelines can be used, for example, for filtering data received from spiders and/or saving data in custom formats. In the 2nd part of my tutorial I'll show you how to use a custom pipeline to save scraped data in .xlsx (Excel spreadsheet) format.
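To give you a feel for the pipeline interface, here is a minimal hypothetical pipeline that drops items without a URL; the class name and the filtering rule are made up for illustration and are not part of this tutorial's project (note that a pipeline also has to be enabled via the ITEM_PIPELINES setting):
from scrapy.exceptions import DropItem


class RequireUrlPipeline(object):
    """Hypothetical example pipeline: discard items that have no 'url' field."""

    def process_item(self, item, spider):
        # Scrapy calls this method for every item yielded by a spider
        if not item.get('url'):
            raise DropItem('Missing URL in item')
        return item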
settings.py — project settings. In this tutorial I'll change the USER_AGENT parameter to that of a Chrome browser and set AUTOTHROTTLE_ENABLED = True so as not to overload the vidsplay.com site with too many requests per second.
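For reference, the changed lines in settings.py will look something like this; the exact User-Agent string below is just one example of a Chrome-like value, not a required one:
# settings.py -- only the two settings we change
# Pretend to be a Chrome browser (example User-Agent string)
USER_AGENT = ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
              'AppleWebKit/537.36 (KHTML, like Gecko) '
              'Chrome/63.0.3239.132 Safari/537.36')

# Let Scrapy throttle the request rate automatically so we do not hammer the site
AUTOTHROTTLE_ENABLED = True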
Creating A Spider
Now let's write a spider that will scrape the list of videos, along with the necessary metadata, from vidsplay.com. A spider is a Python class responsible for making requests to a website and processing responses. Responses are processed asynchronously, which means that after receiving a response from a website Scrapy launches the respective callback method for processing the raw HTML data returned with that response. So the process of parsing a website is a sequence of callbacks that reflects the structure of the website, and in this tutorial we need to create this sequence for the vidsplay.com site.
Let's create a videos_spider.py module in the spiders sub-package with the following content:
import scrapy


class VideosSpider(scrapy.Spider):
    """Parse videos from vidsplay.com"""
    name = 'videos'
    start_url = 'https://www.vidsplay.com'

    def start_requests(self):
        """Entry point for our spider"""
        yield scrapy.Request(self.start_url, callback=self.parse)

    def parse(self, response):
        """Parse vidsplay.com index page"""
        category_urls = response.xpath(
            '/html/body/div[1]/div/div/div/aside/section[3]/div/ul/li/a/@href'
        ).extract()
        print(f'Category URLs: {category_urls}')
Our VideosSpider is a subclass of scrapy.Spider. It must have a name (the name class attribute), so let's name it 'videos'. The start_requests method is the entry point of our spider. It is a Python generator that yields scrapy.Request objects to load the webpages from which we start data scraping. In our case we have only one starting page, the www.vidsplay.com homepage, so our start_requests method yields only one request. A scrapy.Request object takes at least 2 arguments: the URL of the webpage to be loaded and a callback, that is, a method to be called after the webpage is loaded.
Our parse method is the first callback in the sequence; it parses the vidsplay.com homepage. It receives a scrapy.Response object containing the webpage data. From the homepage we need to extract URLs for the "Categories" pages that will be parsed in the next step. To extract data from a webpage we can use either CSS or XPath selectors, and the scrapy.Response class provides the respective .css() and .xpath() methods. In this tutorial I'll use XPath selectors. Teaching you XPath syntax is beyond the scope of this tutorial, and I recommend searching for the necessary information online. However, I can give you a tip: you don't need to construct an XPath string by yourself. In a browser (Firefox or Chrome) right-click on the item you want to extract information from and select "Inspect element" in the context menu, then in the developer tools window right-click on the respective HTML element and select Copy > XPath (or CSS Selector if you are using CSS selectors).
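For example, in a spider callback either selector flavor can be used to pull link URLs out of a page. The expressions below are generic illustrations, not the selectors used in our spider, and the .extract() / .extract_first() calls are explained just below:
# Both calls return a list of matching strings
urls_via_xpath = response.xpath('//a/@href').extract()
urls_via_css = response.css('a::attr(href)').extract()
# .extract_first() returns a single string (or None if nothing matched)
first_url = response.xpath('//a/@href').extract_first()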
The .xpath() (or .css()) method returns a list of selector objects corresponding to the matching HTML elements, and the .extract() method extracts string representations of those elements. If you need only one element matching the selector, you can use the .extract_first() method to get a string instead of a list of strings. Let's run our scraper. Enter the following command in the console (your working virtual environment must be activated):
scrapy crawl videos
Among debug messages from the Scrapy crawler you should see the list of category URLs from the www.vidsplay.com homepage:
Category URLs: ['https://www.vidsplay.com/animals.html', 'https://www.vidsplay.com/architecture.html', 'https://www.vidsplay.com/army.html', 'https://www.vidsplay.com/backgrounds.html', 'https://www.vidsplay.com/business-office.html', 'https://www.vidsplay.com/education.html', 'https://www.vidsplay.com/food.html', 'https://www.vidsplay.com/health.html', 'https://www.vidsplay.com/holidays.html', 'https://www.vidsplay.com/household.html', 'https://www.vidsplay.com/industry.html', 'https://www.vidsplay.com/nature.html', 'https://www.vidsplay.com/people.html', 'https://www.vidsplay.com/places.html', 'https://www.vidsplay.com/production.html', 'https://www.vidsplay.com/religious.html', 'https://www.vidsplay.com/society.html', 'https://www.vidsplay.com/sports.html', 'https://www.vidsplay.com/technology.html', 'https://www.vidsplay.com/time-lapse.html', 'https://www.vidsplay.com/transportation.html']
Now we need to follow those URLs and scrape the category pages. Parsing callbacks can also be Python generators that yield either scrapy.Request objects or data items (dict or scrapy.Item instances). If Scrapy receives a Request instance from a callback, it loads the respective page and launches the next callback to parse its contents. Data items are passed to data processing pipelines. Now we want to parse the webpages for each video category, so our parse() method needs to yield Requests to load and parse category pages. We could create new Request objects directly, but the scrapy.Response class provides the .follow() convenience method for this purpose. Let's modify our VideosSpider class as in the following example:
import scrapy


class VideosSpider(scrapy.Spider):
    """Parse videos from vidsplay.com"""
    name = 'videos'
    start_url = 'https://www.vidsplay.com'

    def start_requests(self):
        """Entry point for our spider"""
        yield scrapy.Request(self.start_url, callback=self.parse)

    def parse(self, response):
        """Parse vidsplay.com index page"""
        category_urls = response.xpath(
            '/html/body/div[1]/div/div/div/aside/section[3]/div/ul/li/a/@href'
        ).extract()
        for url in category_urls[:3]:  # We want to be nice and scrape only 3 categories
            yield response.follow(url, callback=self.parse_category)

    def parse_category(self, response):
        """Parse a video category page"""
        base_selector = response.xpath(
            '/html/body/div[1]/div/div/div/div/main/article'
        )
        category = base_selector.xpath(
            './header/h1/text()'
        ).extract_first()
        video_selectors = base_selector.xpath(
            './div/div[1]/div/div/div/div[@class="pt-cv-ifield"]'
        )
        for selector in video_selectors[:3]:  # We want to be nice and scrape only 3 videos
            url = selector.xpath('./p/a/@href').extract_first()
            # ``meta`` argument can be used to pass data to downstream spider callbacks
            yield response.follow(url,
                                  callback=self.parse_video,
                                  meta={'category': category})

    def parse_video(self, response):
        """Parse a video details page"""
        base_selector = response.xpath(
            '/html/body/div[1]/div/div/div/div/main/article/div'
        )
        title = base_selector.xpath(
            './header/h1/text()'
        ).extract_first()
        thumbnail = base_selector.xpath(
            './div/div[2]/div[1]/meta[@itemprop="thumbnailUrl"]/@content'
        ).extract_first()
        url = base_selector.xpath(
            './div/div[2]/div[1]/meta[@itemprop="contentURL"]/@content'
        ).extract_first()
        yield {
            'category': response.meta['category'],
            'title': title,
            'thumbnail': thumbnail,
            'url': url,
        }
As you can see, the .parse() method now yields requests to load and parse video categories. A category page is parsed with the .parse_category() method, which extracts the category name and the URLs of individual video pages. But now we have a small problem: a video item page does not include the category name for the video, yet we want to have this information. Fortunately, the meta argument of the scrapy.Request class constructor (or of response.follow() for that matter) allows us to pass arbitrary data to downstream callbacks as a Python dict.
And finally the .parse_video() callback method retrieves information about each video item and yields it for Scrapy to process and save to some file format or a database. Our data items do not require complex logic, so we use simple Python dictionaries to store information about each video item. Again, run scrapy crawl videos to make sure that our callback chain is working correctly. Note that we process only 3 categories and 3 videos in each category because we want to play nice and not overload the www.vidsplay.com site with too many requests. However, you can find the full list of videos scraped from www.vidsplay.com, along with the example Scrapy project for this tutorial, in my GitHub repository.
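By the way, even without a custom pipeline, Scrapy's built-in feed exports can already dump the yielded items to a JSON file via the -o option (videos.json here is just an example file name):
scrapy crawl videos -o videos.json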
In the 2nd part of my tutorial I will show you how to store scraped data in .json and Microsoft Excel .xlsx formats.