Recently, I received a request from a team member to find a way to:
Pretty cool, right? I just finished the prototype today, 7/31/2018.
In this article, I'll cover the first step of this task: using a free tool called Ghostscript to extract text from a PDF file.
What is Ghostscript?
Ghostscript is a high-performance PostScript and PDF interpreter and rendering engine. It supports a broad set of page description languages (PDLs) and can convert between PDF, PostScript, PCL, and XPS.
Ghostscript has been under active development for over 20 years and offers a versatile feature set. It can be deployed across a wide range of platforms and modules, for end uses such as embedding in hardware, serving as the engine in document management systems, integrating with cloud solutions, and powering leading PDF generators and tools.
Please note that the PDF file must contain actual text. A file that consists only of scanned images has no text for the txtwrite device to extract.
Ghostscript Steps to Extract Text from a PDF File:
- Download Ghostscript
- Install Ghostscript
- Copy your PDF file to the bin directory where you installed Ghostscript
- Open a command line window in the bin directory (run it as Administrator if you get an access error when running the command)
- Command syntax: gswin64 -sDEVICE=txtwrite -o[Output File Name] [Input File Name]
- Example command: gswin64 -sDEVICE=txtwrite -ooutput2.txt test.pdf
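
The same command can also be driven from a script instead of typed by hand. The following is a minimal Python sketch, not part of the original prototype. The function names `build_txtwrite_command` and `extract_text` are made up for illustration, the executable defaults to the `gswin64` name used in the steps above, and reading the result as UTF-8 is an assumption about the txtwrite device's default output encoding.

```python
import subprocess


def build_txtwrite_command(pdf_path, txt_path, executable="gswin64"):
    """Return the argument list for the txtwrite command shown in the steps."""
    return [
        executable,
        "-sDEVICE=txtwrite",  # select the text-extraction output device
        f"-o{txt_path}",      # output file, attached to -o as in the sample command
        str(pdf_path),        # input PDF
    ]


def extract_text(pdf_path, txt_path, executable="gswin64"):
    """Run Ghostscript on pdf_path and return the extracted text as a string."""
    cmd = build_txtwrite_command(pdf_path, txt_path, executable)
    # check=True raises CalledProcessError if Ghostscript exits with an error.
    subprocess.run(cmd, check=True)
    # Assumption: the output is UTF-8; undecodable bytes are replaced, not fatal.
    with open(txt_path, encoding="utf-8", errors="replace") as handle:
        return handle.read()


# Hypothetical usage, with the file names from the example command:
#   text = extract_text("test.pdf", "output2.txt")
#   if not text.strip():
#       print("No text found; the PDF may be image only.")
```

Passing the arguments as a list avoids quoting problems with file names that contain spaces. If the executable is not on the PATH, its full path can be passed as `executable`; on Linux and macOS the Ghostscript executable is typically named `gs`.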