Word frequency from PDFs

I recently attended the IEEE Intelligent Vehicles Symposium on the Gold Coast, Australia.

Clearly there were some trending topics, and my feeling was that stereo vision was a big one, along with specific stereo matching algorithms such as semi-global matching (SGM).

Feeling is one thing; actually knowing is another. Looking through the disc provided to attendees, I saw that PDFs of all the presentations were available: 232 PDF files in total.

I thought it would be interesting to rip the text out of the entire set and do a word frequency count. This kind of operation is very simple in Python. I dislike the structure of the language itself, but when you complete a task like this you do have to admire the groundswell of developers and resources available.

I had a look at both the PyPDF2 and PDFMiner libraries. My initial experimentation with PyPDF2 showed that the structure of these PDFs was not going to suit that approach – specifically, there were no actual space characters in the documents, so all the extracted text ran together. This is not unusual for PDF layouts.
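For reference, the sort of quick check that exposed the problem looks something like this (a minimal sketch using the PyPDF2 API of the time; the filename is just an example):

from PyPDF2 import PdfFileReader

# Pull the text from the first page of one of the conference PDFs
# (the filename here is illustrative only).
fp = open('0006.pdf', 'rb')
reader = PdfFileReader(fp)
text = reader.getPage(0).extractText()
fp.close()

# With these PDFs the result had no space characters at all,
# so every word ran together into one long string.
print(text[:200])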

PDFMiner is set up predominantly as a command-line tool, which meant it was quick to test before installing the library and writing a more comprehensive script. The key trick with PDFMiner was to employ the ‘-A’ flag, which applies layout analysis to all text objects and so interprets the word spacing properly. The following command worked properly:

python pdf2txt.py -A -o text.txt 0006.pdf

Once that was sorted I knocked up the following script to loop through each PDF and extract the text into one big text file.

import glob

from pdfminer.pdfinterp import PDFResourceManager, process_pdf
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from cStringIO import StringIO

def convert_pdf(path):
    # Set up the interpreter machinery, writing the extracted
    # text into an in-memory buffer.
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    laparams.all_texts = True  # apply layout analysis to all text objects
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)

    fp = open(path, 'rb')
    process_pdf(rsrcmgr, device, fp)
    fp.close()
    device.close()

    text = retstr.getvalue()
    retstr.close()
    return text

# Raw string so the backslashes in the Windows path are not
# treated as escape sequences.
pdflist = glob.glob(r"C:\Users\Pinky\Desktop\pdf\*.pdf")

fout = open('pdfs.txt', 'a')
for pdf in pdflist:
    print("Working on: " + pdf + '\n')
    fout.write(convert_pdf(pdf))
fout.close()

Note the laparams.all_texts = True, which was the only part of that function I modified – the rest was a straight cut-and-paste from a Stack Overflow post.
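For anyone on the newer pdfminer.six fork of the library, the same extraction collapses to a couple of lines (a sketch of the modern equivalent, not the code used here):

from pdfminer.high_level import extract_text
from pdfminer.layout import LAParams

# all_texts=True mirrors the -A flag used on the command line.
text = extract_text('0006.pdf', laparams=LAParams(all_texts=True))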

The resulting text file containing all the extracted text (pdfs.txt) was of surprisingly high quality – better than I thought it would be.

Next I wrote a quick script to read in all the text (only 4.5 MB in the end) and used Python’s Counter class to build a table of word frequencies.

from collections import Counter
import string

fin = open('pdfs.txt', 'r')
words = fin.read().lower()
fin.close()

# Strip all punctuation (the Python 2 form of translate).
out = words.translate(string.maketrans("", ""), string.punctuation)

word_list = out.split()
cnt = Counter(word_list)

fout = open('counts.txt', 'w')
for k, v in cnt.items():
    fout.write(k + "," + str(v) + '\n')
fout.close()

The translate method was essential to cleaning up the word list to a reasonable level without too much effort.
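Worth noting: that two-argument form of translate with string.maketrans is Python 2 specific. On Python 3 the equivalent punctuation strip would be (a sketch):

import string

# Build a translation table whose third argument lists
# the characters to delete.
table = str.maketrans('', '', string.punctuation)
out = words.translate(table)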

Finally, the counts.txt file was imported into Excel and sorted in descending order of term frequency.
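Counter could also have produced the ranking directly via its most_common method, skipping the Excel step entirely – a sketch:

# Top ten terms, highest frequency first.
for term, freq in cnt.most_common(10):
    print(term, freq)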

No surprises what the top ten words were:

Term    Freq
the     26534
of      11146
and     10037
a        8334
in       8035
to       7234
is       6295
for      3954
on       2861
with     2666

At least I guess it’s proof that the method works well enough!

Moving on to some more interesting terms:

Term          Freq
detection      955
vehicles       952
algorithm      585
intelligent    516
estimation     499
distance       480
pedestrian     439
vision         432
camera         406
tracking       403
parameters     367
stereo         359
performance    356

That is a much more interesting list because it confirms some of the clear topics that were presented.

In principle the method can be applied very quickly and easily to a batch of PDFs. The extraction takes a few minutes to run, but the word count is very quick in comparison.
