Rick Cable's Tech Blog / RickCable.com

Web Development ~ SEO ~ InfoSec

Recently, I received a request from a team member to find a way to:

  1. Extract a large amount of text from a large PDF file.
  2. Parse the extracted text and write specific elements into an Excel file.
  3. Format the Excel file with a separate tab for each type of report I extract, and add column headers.
  4. Create validation code in an Excel macro that makes an Ajax call to a web service connected to a data warehouse, and validates the data based on an ID in one of the columns.

Pretty cool, right? I just finished the prototype today, 7/31/2018.

In this article I'll cover the first step of this task: using a free tool called Ghostscript to extract text from a PDF file.

What is Ghostscript?

Ghostscript is a high-performance PostScript and PDF interpreter and rendering engine. It supports a comprehensive set of page description languages (PDLs), with conversion capabilities covering the PDF, PostScript, PCL and XPS languages.

Ghostscript has been under active development for over 20 years and offers an extremely versatile feature set. It can be deployed across a wide range of platforms, modules, and end uses: embedded in hardware, as an engine in document management systems, integrated into cloud solutions, and as an engine in leading PDF generators and tools.

How to extract text from a PDF using Ghostscript

Please note that the PDF file must contain real text, not just page images. The txtwrite device used here reads existing text and does not perform OCR on scanned pages.

Ghostscript Steps to Extract Text from a PDF File:

- Download Ghostscript

- Install Ghostscript

- Copy your PDF file to the bin directory where you installed Ghostscript

- Open a command line window in the bin directory (run it as Administrator if you get an access error).

- Command syntax: gswin64 -sDEVICE=txtwrite -o[Output File Name] [Input File Name]

- Sample Ghostscript command: gswin64 -sDEVICE=txtwrite -ooutput2.txt test.pdf
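
If you would rather script this step than type the command by hand, the same call can be wrapped in a few lines of Python. This is only a rough sketch, not part of my prototype. The function names build_gs_command and extract_text are made up for illustration, and it assumes the console executable gswin64c (the non-windowed counterpart of gswin64 on Windows; on Linux or macOS it is usually gs) can be found on your PATH. The -q flag is an addition of mine to suppress the startup banner.

```python
import shutil
import subprocess
from pathlib import Path


def build_gs_command(pdf_path, txt_path, gs_exe="gswin64c"):
    """Return the argument list for a txtwrite run (hypothetical helper)."""
    return [
        gs_exe,               # assumed executable name; use "gs" on Linux/macOS
        "-q",                 # suppress the startup banner
        "-sDEVICE=txtwrite",  # Ghostscript's text-extraction output device
        "-o", str(txt_path),  # output file, same role as -ooutput2.txt above
        str(pdf_path),        # input PDF
    ]


def extract_text(pdf_path, txt_path, gs_exe="gswin64c"):
    """Run Ghostscript and return the extracted text (hypothetical helper)."""
    if shutil.which(gs_exe) is None:
        raise FileNotFoundError(f"{gs_exe} was not found on PATH")
    # check=True raises CalledProcessError if Ghostscript exits with an error
    subprocess.run(build_gs_command(pdf_path, txt_path, gs_exe), check=True)
    # Decode leniently in case the output contains unexpected bytes
    return Path(txt_path).read_text(encoding="utf-8", errors="replace")
```

Calling extract_text("test.pdf", "output2.txt") should then do the same thing as the sample command. Building the argument list as a Python list, rather than one long string, avoids quoting problems when a file name contains spaces.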