Extracting Text from a PDF Using Python

Jan. 6, 2019

Recently I needed to extract text from a PDF file using Python. Some quick googling led me to the PyPDF2 package, but I wasn't able to extract any text from my test PDF with it. The test PDF was created with Google Docs (a very common scenario) and had no fancy formatting, so PyPDF2 was disqualified for my purposes. After further googling I found the pdfminer package and its Python 3 compatible fork, pdfminer.six. Unfortunately, it lacks API documentation, so I had to dig into the code to find out how to use it programmatically rather than from the command line. As it turned out, extracting text from a PDF file with pdfminer.six is very easy, because it provides a high-level function for that purpose: extract_text_to_fp, which reads the PDF from a binary file object and writes the extracted text to an output file object. Everything can be done with this simple code:

from io import StringIO
from pdfminer.high_level import extract_text_to_fp
from typing import BinaryIO


def extract_text_from_pdf(pdf_fo: BinaryIO) -> str:
    """
    Extracts text from a PDF

    :param pdf_fo: a byte file object representing a PDF file
    :return: extracted text
    :raises pdfminer.pdftypes.PDFException: on invalid PDF
    """
    out_fo = StringIO()
    extract_text_to_fp(pdf_fo, out_fo)
    return out_fo.getvalue()

This code fragment uses type annotations that rely on the typing module, which was added in Python 3.5. For earlier Python 3 versions, you can remove the annotations together with the typing import.
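Since the function takes a binary file object rather than a file name, the caller decides where the PDF comes from. Here is a minimal sketch of two small adapters, one for a path on disk and one for bytes already in memory (for example, a downloaded file). The names text_from_path, text_from_bytes and Extractor are my own, not part of pdfminer.six, and the extractor is passed in as an argument, so the adapters themselves depend only on the standard library:

```python
from io import BytesIO
from typing import BinaryIO, Callable

# Hypothetical alias: any function that turns a binary PDF file object
# into text, such as extract_text_from_pdf from this post.
Extractor = Callable[[BinaryIO], str]


def text_from_path(path: str, extractor: Extractor) -> str:
    """Open the file at `path` in binary mode and run `extractor` on it."""
    # A PDF is binary data, so the file must be opened with "rb", not "r".
    with open(path, "rb") as pdf_fo:
        return extractor(pdf_fo)


def text_from_bytes(data: bytes, extractor: Extractor) -> str:
    """Wrap in-memory bytes in a file object and run `extractor` on it."""
    return extractor(BytesIO(data))
```

With the function from this post in scope, a call would look like text_from_path("report.pdf", extract_text_from_pdf), where report.pdf is just a placeholder file name.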
