I recently attended the IEEE Intelligent Vehicles Symposium on the Gold Coast, Australia.
There were clearly some trending topics, and my feeling was that stereo vision was a big one, along with specific stereo matching algorithms such as semi-global matching (SGM).
Feeling is one thing; actually knowing is another. Looking at the disc provided to attendees, I saw that PDFs were available for all the presentations – 232 PDF files in total.
I thought it would be interesting to rip the text out of the entire set and do a word frequency count. This kind of operation is very simple in Python. I dislike the structure of the language itself, but when you complete a task like this you do have to admire the groundswell of developers and resources available.
I had a look at both the PyPDF2 and PDFMiner libraries. My initial experimentation with PyPDF2 showed that the structure of these PDFs was not going to suit that approach – specifically, there were no actual space characters in the documents, so all the extracted text ran together. This is not unusual for PDF layouts.
PDFMiner is set up predominantly as a command-line tool (pdf2txt.py), which meant it was quick to test before installing the library and writing a more comprehensive script. The key trick with PDFMiner was to employ the '-A' flag, which forces layout analysis on all the text so that word spacing is interpreted properly. The following command worked:
python pdf2txt.py -A -o text.txt 0006.pdf
Once that was sorted, I knocked up the following script to loop over each PDF and extract the text into one big text file.
import glob
from cStringIO import StringIO

from pdfminer.pdfinterp import PDFResourceManager, process_pdf
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams


def convert_pdf(path):
    """Extract the text of one PDF as a UTF-8 string."""
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    laparams = LAParams()
    laparams.all_texts = True  # the script equivalent of the '-A' flag
    device = TextConverter(rsrcmgr, retstr, codec='utf-8', laparams=laparams)
    fp = open(path, 'rb')
    process_pdf(rsrcmgr, device, fp)
    fp.close()
    device.close()
    text = retstr.getvalue()
    retstr.close()
    return text


pdflist = glob.glob(r"C:\Users\Pinky\Desktop\pdf\*.pdf")
fout = open('pdfs.txt', 'a')
for pdf in pdflist:
    print("Working on: " + pdf + '\n')
    fout.write(convert_pdf(pdf))
fout.close()
Note the laparams.all_texts = True, which was the only part of that function I modified – the rest was a straight cut-and-paste from a Stack Overflow post.
The resulting text file containing all the extracted text (pdfs.txt) was of surprisingly high quality – better than I thought it would be.
Next I wrote a quick script to read in all the text (only 4.5 MB in the end) and used Python's Counter class from the collections module to build a table of all word frequencies.
from collections import Counter
import string

fin = open('pdfs.txt', 'r')
words = fin.read().lower()
fin.close()

# Strip all punctuation (Python 2 two-argument form of translate).
out = words.translate(string.maketrans("", ""), string.punctuation)

cnt = Counter(out.split())

fout = open('counts.txt', 'w')
for k, v in cnt.items():
    fout.write(k + "," + str(v) + '\n')
fout.close()
The translate method was essential for cleaning the word list up to a reasonable level without too much effort.
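As an aside, string.maketrans and the two-argument translate above are Python 2 idioms. In Python 3 the same cleanup-and-count step can be sketched roughly as follows (a minimal illustration with a made-up input string, not the original script):

```python
from collections import Counter
import string


def count_words(text):
    """Lower-case the text, strip punctuation, and count word frequencies."""
    # str.maketrans with a third argument maps each punctuation
    # character to None, i.e. deletes it (the Python 3 idiom).
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return Counter(cleaned.split())


counts = count_words("Stereo vision, stereo matching: vision!")
# counts["stereo"] == 2, counts["vision"] == 2, counts["matching"] == 1
```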
Finally, the counts.txt file was imported into Excel and sorted in descending order of term frequency.
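Excel works fine for this, but the same ranking could have been pulled straight out of Counter via its most_common method; a minimal sketch on a toy input:

```python
from collections import Counter

cnt = Counter("the quick the lazy the dog a dog".split())

# most_common(n) returns the n highest-frequency (word, count)
# pairs, already sorted in descending order of count.
top2 = cnt.most_common(2)
# top2 == [('the', 3), ('dog', 2)]
```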
No surprises what the top ten words were:
At least I guess it’s proof that the method works well enough!
Moving on to some more interesting terms:
That is a much more interesting list because it confirms some of the clear topics that were presented.
In principle the method can be applied very quickly and easily to a batch of PDFs. The extraction takes a few minutes to run but the word count is very quick in comparison.