Testing PDF Extraction Tools

Posted on 7 September 2016

We are now eight weeks into the Official Inquiries project. One of the most time-consuming challenges the project faces is turning PDF files of inquiries into clean, readable text. Converting a PDF file into a text file is not a simple matter of changing the file extension or copying and pasting from one to the other. To accomplish this conversion, we use free, open source tools that read the PDF and rewrite its contents as a text file. Of course, none of these tools can interpret a file perfectly, let alone format the result exactly the way we would like, and each inquiry’s PDF is different. So we have tested and compared a number of tools to see which works best for our purposes. After all, the more work the tools can do, the less tidying we have to do ourselves before we can present the text online!

As an example, let’s take a page of the US Senate Permanent Subcommittee on Investigations (PSI) report into the Financial Crisis:

[Image: a page of the PSI Financial Crisis report]

Our objective is to turn this PDF into a text file, which can then be tidied and put online. When we began the Official Inquiries project, our default tool for this was PDFMiner. Here is what PDFMiner makes of that page of text:

A.  Subcommittee Investigation

important cause of the crisis, it provides new, detailed, and compelling evidence of what

happened.  In so doing, we hope the Report leads to solutions that prevent it from happening

again.

I.  EXECUTIVE SUMMARY

investigation into some of the key causes of the financial crisis.  Since then, the Subcommittee

has engaged in a wide-ranging inquiry, issuing subpoenas, conducting over 150 interviews and

depositions, and consulting with dozens of government, academic, and private sector experts.

The Subcommittee has accumulated and reviewed tens of millions of pages of documents,

including court pleadings, filings with the Securities and Exchange Commission, trustee reports,

prospectuses for public and private offerings, corporate board and committee minutes, mortgage

transactions and analyses, memoranda, marketing materials, correspondence, and emails.  The

Subcommittee has also reviewed documents prepared by or sent to or from banking and

In November 2008, the Permanent Subcommittee on Investigations initiated its

^LB.  Overview

(1)  High Risk Lending:

Case Study of Washington Mutual Bank

securities regulators, including bank examination reports, reviews of securities firms,

enforcement actions, analyses, memoranda, correspondence, and emails.

In April 2010, the Subcommittee held four hearings examining four root causes of the

financial crisis.  Using case studies detailed in thousands of pages of documents released at the

hearings, the Subcommittee presented and examined evidence showing how high risk lending by

U.S. financial institutions; regulatory failures; inflated credit ratings; and high risk, poor quality

financial products designed and sold by some investment banks, contributed to the financial

crisis.  This Report expands on those hearings and the case studies they featured.  The case

studies are Washington Mutual Bank, the largest bank failure in U.S. history; the federal Office

of Thrift Supervision which oversaw Washington Mutual’s demise; Moody’s and Standard &

Poor’s, the country’s two largest credit rating agencies; and Goldman Sachs and Deutsche Bank,

two leaders in the design, marketing, and sale of mortgage related securities.  This Report

devotes a chapter to how each of the four causative factors, as illustrated by the case studies,

fueled the 2008 financial crisis, providing findings of fact, analysis of the issues, and

recommendations for next steps.

2

The first chapter focuses on how high risk mortgage lending contributed to the financial

crisis, using as a case study Washington Mutual Bank (WaMu).  At the time of its failure, WaMu

was the nation’s largest thrift and sixth largest bank, with $300 billion in assets, $188 billion in

deposits, 2,300 branches in 15 states, and over 43,000 employees.  Beginning in 2004, it

embarked upon a lending strategy to pursue higher profits by emphasizing high risk loans.  By

2006, WaMu’s high risk loans began incurring high rates of delinquency and default, and in

2007, its mortgage backed securities began incurring ratings downgrades and losses.  Also in

2007, the bank itself began incurring losses due to a portfolio that contained poor quality and

fraudulent loans and securities.  Its stock price dropped as shareholders lost confidence, and

depositors began withdrawing funds, eventually causing a liquidity crisis at the bank.  On

September 25, 2008, WaMu was seized by its regulator, the Office of Thrift Supervision, placed

in receivership with the Federal Deposit Insurance Corporation (FDIC), and sold to JPMorgan

Chase for $1.9 billion.  Had the sale not gone through, WaMu’s failure might have exhausted the

entire $45 billion Deposit Insurance Fund.

As you can see, there are quite a few problems: there is a lot of extra spacing in the text, artefacts like “^L” (a form feed, or page break, character) have been added and, most noticeably, the text is jumbled and no longer in the order it appears in the PDF. The PSI report into the Financial Crisis is hundreds of pages long, so tidying this up manually would take us a very long time. With that in mind, we decided to test Apache PDFBox to see if it could do a little better. Here’s what it output:

I.  EXECUTIVE SUMMARY

A.  Subcommittee Investigation

In November 2008, the Permanent Subcommittee on Investigations initiated its

investigation into some of the key causes of the financial crisis.  Since then, the Subcommittee

has engaged in a wide-ranging inquiry, issuing subpoenas, conducting over 150 interviews and

depositions, and consulting with dozens of government, academic, and private sector experts.

The Subcommittee has accumulated and reviewed tens of millions of pages of documents,

including court pleadings, filings with the Securities and Exchange Commission, trustee reports,

prospectuses for public and private offerings, corporate board and committee minutes, mortgage

transactions and analyses, memoranda, marketing materials, correspondence, and emails.  The

Subcommittee has also reviewed documents prepared by or sent to or from banking and

2

securities regulators, including bank examination reports, reviews of securities firms,

enforcement actions, analyses, memoranda, correspondence, and emails.

In April 2010, the Subcommittee held four hearings examining four root causes of the

financial crisis.  Using case studies detailed in thousands of pages of documents released at the

hearings, the Subcommittee presented and examined evidence showing how high risk lending by

U.S. financial institutions; regulatory failures; inflated credit ratings; and high risk, poor quality

financial products designed and sold by some investment banks, contributed to the financial

crisis.  This Report expands on those hearings and the case studies they featured.  The case

studies are Washington Mutual Bank, the largest bank failure in U.S. history; the federal Office

of Thrift Supervision which oversaw Washington Mutual’s demise; Moody’s and Standard &

Poor’s, the country’s two largest credit rating agencies; and Goldman Sachs and Deutsche Bank,

two leaders in the design, marketing, and sale of mortgage related securities.  This Report

devotes a chapter to how each of the four causative factors, as illustrated by the case studies,

fueled the 2008 financial crisis, providing findings of fact, analysis of the issues, and

recommendations for next steps.

B.  Overview

(1) High Risk Lending:

Case Study of Washington Mutual Bank

The first chapter focuses on how high risk mortgage lending contributed to the financial

crisis, using as a case study Washington Mutual Bank (WaMu).  At the time of its failure, WaMu

was the nation’s largest thrift and sixth largest bank, with $300 billion in assets, $188 billion in

deposits, 2,300 branches in 15 states, and over 43,000 employees.  Beginning in 2004, it

embarked upon a lending strategy to pursue higher profits by emphasizing high risk loans.  By

2006, WaMu’s high risk loans began incurring high rates of delinquency and default, and in

2007, its mortgage backed securities began incurring ratings downgrades and losses.  Also in

2007, the bank itself began incurring losses due to a portfolio that contained poor quality and

fraudulent loans and securities.  Its stock price dropped as shareholders lost confidence, and

depositors began withdrawing funds, eventually causing a liquidity crisis at the bank.  On

September 25, 2008, WaMu was seized by its regulator, the Office of Thrift Supervision, placed

in receivership with the Federal Deposit Insurance Corporation (FDIC), and sold to JPMorgan

Chase for $1.9 billion.  Had the sale not gone through, WaMu’s failure might have exhausted the

entire $45 billion Deposit Insurance Fund.

This result is much better. We also tested another tool called Poppler, which delivered a similar result to PDFBox, but with some of the extra artefacts of PDFMiner. Since PDFBox had given us such good results with the Financial Crisis report, we also tested it on pages from the Chilcot Report and the Leveson Report. Although the difference was not as dramatic, PDFBox was the most consistent of the three and, going forward, it will be the tool we use most to process PDF inquiries into text.
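
As a rough illustration of how a comparison like this could be scored rather than judged by eye, here is a minimal Python sketch. It is not part of our pipeline: the function names `normalise` and `order_score` are hypothetical, and it assumes you have a hand-checked reference text for the page and that the “^L” artefact is a literal form feed character. It strips the extra spacing and form feeds, then uses the standard library's `difflib` to measure how closely the order of the extracted lines matches the reference.

```python
import difflib

FORM_FEED = "\x0c"  # page-break character, usually displayed as ^L


def normalise(text):
    """Return the non-blank lines of text, without form feeds or extra spaces."""
    lines = []
    for raw in text.replace(FORM_FEED, "").splitlines():
        line = " ".join(raw.split())  # collapse runs of whitespace
        if line:  # drop blank lines
            lines.append(line)
    return lines


def order_score(extracted, reference):
    """Score from 0 to 1: how well the extracted lines match the reference, in order."""
    matcher = difflib.SequenceMatcher(
        None, normalise(extracted), normalise(reference), autojunk=False
    )
    return matcher.ratio()


if __name__ == "__main__":
    # Hypothetical miniature example, not real tool output.
    reference = "I. EXECUTIVE SUMMARY\nA. Subcommittee Investigation\nB. Overview\n"
    jumbled = "A. Subcommittee Investigation\n\nI. EXECUTIVE SUMMARY\n\n\x0cB. Overview\n"
    print(order_score(reference, reference))
    print(order_score(jumbled, reference))
```

A score of 1 means every line appears in the same order as in the reference, and jumbled output scores lower. Because the score works on whole lines, it would need adapting for tools that break lines in different places.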

However, while all of these tools are very useful for processing modern reports like Chilcot and Leveson, which were released as well-made PDF files for public consumption, some reports, particularly older ones, are catalogued as PDFs that contain only simple scans of the paper originals. Computers find it much harder to read text within image files, and none of the three tools we tested is capable of reading these scanned PDFs. If you try to extract from a scanned PDF with one of these tools, it will just give you a list of image file names!
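
When processing many files, it can help to flag scanned PDFs automatically before anyone reads the output. The following is a minimal sketch of one possible heuristic, not something we have run on the inquiries: the function name `looks_scanned`, the list of image extensions and the word threshold are all assumptions chosen for illustration. It treats extracted text as coming from a scan if it is empty, if most of its lines look like image file names, or if it contains very few words.

```python
import re

# Assumed patterns: a "word" is three or more letters, and an image file name
# is a line ending in a common image extension.
WORD = re.compile(r"[A-Za-z]{3,}")
IMAGE_NAME = re.compile(r"\.(?:jpe?g|png|tiff?|bmp|jp2)$", re.IGNORECASE)


def looks_scanned(extracted, min_words=20):
    """Guess whether extracted text came from a scanned (image-only) PDF."""
    lines = [line.strip() for line in extracted.splitlines() if line.strip()]
    if not lines:
        return True  # nothing was extracted at all
    image_lines = sum(1 for line in lines if IMAGE_NAME.search(line))
    if image_lines > len(lines) / 2:
        return True  # mostly a list of image file names
    return len(WORD.findall(extracted)) < min_words  # too little real text


if __name__ == "__main__":
    # Hypothetical inputs for illustration only.
    print(looks_scanned("page-001.jpg\npage-002.jpg\n"))
    print(looks_scanned("The Subcommittee has accumulated and reviewed documents. " * 5))
```

The threshold of 20 words is arbitrary and would need tuning, since a genuine title page or a page of tables may contain very little text.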

While the main PDF for the PSI Financial Crisis report is a modern PDF, the extended report, in four volumes, is scanned. So our next challenge is to work out how we can get text from these volumes. As a preliminary step, we tested Google Cloud Platform’s Vision API to see what it made of a few pages of one of the volumes.

[Image: Google Cloud Vision’s result for pages from one of the scanned volumes]

As you can see, we have our work cut out for us!

If you have any suggestions or would like to help our project, please visit our website or our GitHub repo.