Scrapy Tutorial - Part 2

 Aug. 7, 2018

In the first part of this tutorial I demonstrated how to create a simple Scrapy spider that collects data from a website. Now I will show you how to save the scraped data in JSON and XLSX formats.

Feed Exports

Scrapy includes so-called Feed Exports that allow you to save scraped data in JSON, CSV, and XML formats. All you need to do is add the necessary options to your settings.py file. The following example demonstrates a minimal set of options for saving data to a JSON file on the local filesystem:

FEED_URI = 'file:///{}.json'.format(OUT_FILE)
FEED_FORMAT = 'json'
FEED_EXPORT_ENCODING = 'utf-8'
FEED_EXPORT_INDENT = 2

Here the FEED_URI parameter must include the full path to the output file (OUT_FILE). The next two parameters are self-explanatory. The FEED_EXPORT_INDENT parameter saves the JSON data with human-readable indentation; in this example we use 2-space indents.
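
Note that OUT_FILE is not a built-in Scrapy setting but a custom constant defined in the same settings.py. A minimal sketch of how it might be defined (the base name videos and the project-relative location are just assumptions for illustration):

import os

# Custom project constant (not a built-in Scrapy setting): absolute path of the
# output file without an extension, so the same base name can be reused for
# both the .json feed above and the .xlsx pipeline below.
OUT_FILE = os.path.join(os.path.abspath(os.path.dirname(__file__)), 'videos')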

Item Pipelines

If the built-in formats are not enough or your data requires special processing, you can use Item Pipelines. A pipeline is a class with several callback methods that are called at different stages of scraping. A pipeline class must implement at least the process_item method, which receives a scraped item and a spider instance and must either return the item, changed or unchanged, for further processing, or raise a DropItem exception to signal that the item is not needed. The following example shows a pipeline that saves scraped data to an Excel spreadsheet using the openpyxl library:

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html

from openpyxl import Workbook
from openpyxl.styles import Font
from .settings import OUT_FILE


class XlsxPipeline(object):
    """Save scraped data to an XLSX spreadsheet"""

    def open_spider(self, spider):
        """Start scraping"""
        # Create an Excel workbook
        self._wb = Workbook()
        # Select the active spreadsheet
        self._ws = self._wb.active
        self._ws.title = 'Videos'
        self._ws.append(['Category', 'Title', 'Thumbnail', 'Video URL'])
        # Make the header row bold
        for cell in self._ws[1]:
            cell.font = Font(bold=True)

    def process_item(self, item, spider):
        # Append a row to the spreadsheet
        self._ws.append([
            item['category'],
            item['title'],
            item['thumbnail'],
            item['url']
        ])
        return item

    def close_spider(self, spider):
        """Stop scraping"""
        # Save the Excel workbook
        self._wb.save(OUT_FILE + '.xlsx')

Our pipeline implements three callback methods. The open_spider method is called once when the spider starts working; in our example it creates an Excel spreadsheet and adds column headers. The process_item method is called for each scraped item and appends a row to the spreadsheet. The close_spider method is called before the spider is closed and saves the spreadsheet to disk. Working with the openpyxl library is beyond the scope of this tutorial; refer to the openpyxl documentation for details.
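
As a side note, the DropItem exception mentioned earlier lets a pipeline discard unwanted items before they reach later pipelines or the feed exporter. A minimal, hypothetical sketch (the missing-URL check is just an example condition, not part of the original project):

from scrapy.exceptions import DropItem


class RequireUrlPipeline(object):
    """Hypothetical validation pipeline: discard items without a video URL"""

    def process_item(self, item, spider):
        if not item.get('url'):
            # Dropped items are not passed to later pipelines or feed exports
            raise DropItem('Missing video URL in {!r}'.format(item))
        return item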

For pipelines to work they need to be added to the ITEM_PIPELINES dictionary in settings.py:

ITEM_PIPELINES = {
    'vidsplay_scraper.pipelines.XlsxPipeline': 300,
}

The number represents the relative priority of the pipeline: the lower the value, the higher the priority, so lower-numbered pipelines run first. In our case there is only one pipeline, so the number does not matter.
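
For example, if we also enabled the hypothetical RequireUrlPipeline sketched above, giving it a lower number would make it run before the XLSX pipeline:

ITEM_PIPELINES = {
    'vidsplay_scraper.pipelines.RequireUrlPipeline': 100,  # runs first (lower number)
    'vidsplay_scraper.pipelines.XlsxPipeline': 300,        # runs second
}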

Conclusion

In this two-part tutorial I demonstrated how to use the Python Scrapy framework to scrape data from websites and save it for further use. The full code for this tutorial is available in my GitHub repository.

Tags: Python, Scrapy