In the first part of my tutorial I demonstrated how to create a simple Scrapy spider that collects data from a website. Now I will show you how to save the scraped data in JSON and XLSX formats.
Feed Exports
Scrapy includes so-called Feed Exports that allow you to save data in JSON, CSV and XML formats. All you need to do is add the necessary options to your settings.py file. The following example demonstrates a minimal set of options for saving data to a JSON file on the local filesystem:
FEED_URI = 'file:///{}.json'.format(OUT_FILE)
FEED_FORMAT = 'json'
FEED_EXPORT_ENCODING = 'utf-8'
FEED_EXPORT_INDENT = 2
Here the FEED_URI parameter must contain the full path to the file to be saved (OUT_FILE). The next two parameters are self-explanatory. The FEED_EXPORT_INDENT parameter lets you save the JSON data with human-readable indentation; in this example we use 2-space indents.
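Note that OUT_FILE is not a Scrapy setting but a constant defined elsewhere in settings.py. Its exact value depends on your project; a minimal sketch, assuming you want the output next to the project directory, might look like this:
import os

# Hypothetical definition: OUT_FILE holds the output path without an extension.
# It is reused later by the XLSX pipeline, which appends its own '.xlsx'.
OUT_FILE = os.path.join(os.getcwd(), 'vidsplay_videos')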
Item Pipelines
If the built-in formats are not enough or your data require special processing, you can use Item Pipelines. A pipeline is a class with several callback methods that are called at different stages of scraping. A pipeline class must implement at least the process_item method, which receives a scraped item and a spider instance. It must either return the received item, changed or unchanged, for further processing, or raise a DropItem exception to signal that the item is not needed. The following example shows a pipeline that saves scraped data to an Excel spreadsheet using the openpyxl library:
# -*- coding: utf-8 -*-
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
from openpyxl import Workbook
from openpyxl.styles import Font
from .settings import OUT_FILE
class XlsxPipeline(object):
    """Save scraped data to an XLSX spreadsheet"""

    def open_spider(self, spider):
        """Start scraping"""
        # Create an Excel workbook
        self._wb = Workbook()
        # Select the active spreadsheet
        self._ws = self._wb.active
        self._ws.title = 'Videos'
        # Add a header row and make it bold
        self._ws.append(['Category', 'Title', 'Thumbnail', 'Video URL'])
        row = list(self._ws.rows)[0]
        for cell in row:
            cell.font = Font(bold=True)

    def process_item(self, item, spider):
        # Append a row to the spreadsheet
        self._ws.append([
            item['category'],
            item['title'],
            item['thumbnail'],
            item['url']
        ])
        return item

    def close_spider(self, spider):
        """Stop scraping"""
        # Save the Excel workbook
        self._wb.save(OUT_FILE + '.xlsx')
Our pipeline implements three callback methods. The open_spider method is called once when the spider starts working; in our example it is used to create an Excel spreadsheet and add column headers. The process_item method is called for each scraped item and is used to add rows to the spreadsheet. The close_spider method is called before the spider closes and is used to save the spreadsheet. Working with the openpyxl library is beyond the scope of this tutorial; refer to the openpyxl documentation for specific details.
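As mentioned above, a pipeline can also discard unwanted items by raising a DropItem exception. The following sketch, which drops items missing required fields, illustrates this; the field names are assumed for illustration and are not part of the original project:
from scrapy.exceptions import DropItem

class RequiredFieldsPipeline(object):
    """Drop items that lack required fields (illustrative example)"""

    REQUIRED_FIELDS = ('title', 'url')  # assumed field names

    def process_item(self, item, spider):
        for field in self.REQUIRED_FIELDS:
            if not item.get(field):
                # Raising DropItem stops further processing of this item
                raise DropItem('Missing field: {}'.format(field))
        return item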
For pipelines to work, they need to be added to the ITEM_PIPELINES dictionary in settings.py:
ITEM_PIPELINES = {
    'vidsplay_scraper.pipelines.XlsxPipeline': 300,
}
The number represents the relative priority of the pipeline: the lower the value, the higher the priority. In our case we have only one pipeline, so the number does not matter.
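If you later add more pipelines, these numbers determine the order in which each item passes through them. For example, if the hypothetical RequiredFieldsPipeline sketched earlier were enabled alongside XlsxPipeline, the lower number would make it run first:
ITEM_PIPELINES = {
    # Runs first: drops incomplete items before they reach the exporter
    'vidsplay_scraper.pipelines.RequiredFieldsPipeline': 200,
    # Runs second: writes the remaining items to the XLSX file
    'vidsplay_scraper.pipelines.XlsxPipeline': 300,
}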
Conclusion
In this two-part tutorial I demonstrated how to use the Scrapy framework for Python to scrape data from websites and save it for further use. The full code for this tutorial is available in my GitHub repository.