Word frequency from PDFs

I recently attended the IEEE Intelligent Vehicles Symposium on the Gold Coast, Australia.

Clearly there were some trending topics, and my feeling was that stereo vision was a big one, along with specific stereo matching algorithms such as semi-global matching (SGM).

Feeling is one thing, actually knowing is another. Looking at the disc provided to me as an attendee I saw that all the PDFs were available for the presentations. There were 232 PDF files in total.

I thought it would be interesting to rip the text out of the entire set and do a word frequency count. This kind of operation is very simple in Python. I dislike the structure of the language itself but when you complete a task such as this you do have to admire the ground-swell of developers and resources available.

I had a look at both the PyPDF2 and PDFMiner libraries. My initial experimentation with PyPDF2 showed that the structure of the PDFs was not going to suit that approach – specifically, there were no actual space characters in the PDF documents, so all the extracted text ran together. This is not unusual for PDF layouts.

PDFMiner is set up predominantly as a command line tool, which means it was quick to test before installing the library and writing a more comprehensive script. The key trick using PDFMiner was to employ the ‘-A’ flag, which forces layout analysis on all of the text so that word spacing is interpreted properly. The following command worked properly:

python pdf2txt.py -A -o text.txt -0006.pdf

Once that was sorted I knocked up the following script to loop over each PDF and extract the text into one big text file.

import os
import glob

from pdfminer.pdfinterp import PDFResourceManager, process_pdf
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from cStringIO import StringIO

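def convert_pdf(path):
    # Extract all text from one PDF and return it as a UTF-8 string.
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    laparams.all_texts = True
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)

    fp = open(path, 'rb')
    process_pdf(rsrcmgr, device, fp)
    fp.close()
    device.close()

    text = retstr.getvalue()
    retstr.close()
    return text

# Raw string so the backslashes in the Windows path are taken literally.
pdflist = glob.glob(r"C:\Users\Pinky\Desktop\pdf\*.pdf")

# Append the text of every PDF to one big text file.
fout = open('pdfs.txt','a')
for pdf in pdflist:
    print("Working on: " + pdf + '\n')
    fout.write(convert_pdf(pdf))
fout.close()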

Note the laparams.all_texts = True which was the only part of that function I modified – the rest was a straight cut-and-paste from a Stack Overflow post.
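
The script above targets Python 2 and an older PDFMiner release. As a rough sketch of the same extraction loop on Python 3, the version below assumes the pdfminer.six package is installed and uses its extract_text helper; the function names pdfminer_extract and extract_all are made up for this example, and the extractor is passed in as a parameter so the loop itself can be tried without any PDFs.

```python
import glob
import os


def pdfminer_extract(path):
    """Extract text from one PDF using pdfminer.six (assumed to be installed)."""
    # Imported lazily so the rest of this sketch runs without pdfminer.six.
    from pdfminer.high_level import extract_text
    from pdfminer.layout import LAParams
    # all_texts=True mirrors the laparams.all_texts setting used above.
    return extract_text(path, laparams=LAParams(all_texts=True))


def extract_all(pdf_dir, out_path, extract=pdfminer_extract):
    """Write the text of every *.pdf in pdf_dir to out_path; return the file count."""
    pdf_paths = sorted(glob.glob(os.path.join(pdf_dir, "*.pdf")))
    with open(out_path, "w", encoding="utf-8") as fout:
        for pdf_path in pdf_paths:
            fout.write(extract(pdf_path))
            fout.write("\n")
    return len(pdf_paths)
```

Calling extract_all with the PDF folder and 'pdfs.txt' would produce the same kind of combined text file as the original script.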

The resulting text file containing all the extracted text (pdfs.txt) was surprisingly high quality. Better than I thought it would be.

Next I wrote a quick script to read in all the text (only 4.5MB in the end) and used the Counter class from Python's collections module to create a table of all word frequencies.

from collections import Counter
import string

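# Read the extracted text, lower-case it and strip punctuation (Python 2).
fin = open('pdfs.txt','r')
words = fin.read().lower()
fin.close()
out = words.translate(string.maketrans("",""), string.punctuation)

wordss = out.split()

cnt = Counter(wordss)

# Write one "word,count" line per distinct word.
fout = open('counts.txt','w')
for k, v in cnt.items():
    fout.write(k + "," + str(v) + '\n')
fout.close()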

The translate method was essential to cleaning the word list up to a reasonable level without too much effort.
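
The two-argument translate call above only works on Python 2 strings. A minimal sketch of the same clean-up on Python 3 follows, where the deletion table is built with str.maketrans instead; the tokenize function name is just for illustration.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_STRIP_PUNCT = str.maketrans("", "", string.punctuation)


def tokenize(text):
    """Lower-case text, delete ASCII punctuation, then split on whitespace."""
    return text.lower().translate(_STRIP_PUNCT).split()
```

Note that deleting punctuation joins hyphenated words, so "semi-global" is counted as "semiglobal". Typographic quotes and dashes are not in string.punctuation and pass through untouched.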

Finally the counts.txt file was imported into Excel and sorted in descending order based on the term frequencies.
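
The sorting step does not strictly need a spreadsheet, since Counter can rank the words itself. The sketch below is one way to do that, not what was actually used here; write_counts is a hypothetical helper that writes the same "word,count" lines, most frequent first.

```python
from collections import Counter


def write_counts(words, out, n=None):
    """Write "word,count" lines to the file-like object out, most frequent first."""
    ranked = Counter(words).most_common(n)  # n=None returns every word
    for word, count in ranked:
        out.write("%s,%d\n" % (word, count))
    return ranked
```

Passing n=10 limits the output to the ten most frequent words.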

No surprises what the top ten words were:


At least I guess it’s proof that the method works well enough!

Moving onto some more interesting terms:


That is a much more interesting list because it confirms some of the clear topics that were presented.
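
One way to get from the raw counts to the more interesting terms without scanning the list by hand is to drop common function words before ranking. The sketch below assumes that approach; the STOP_WORDS set is a made-up, deliberately tiny list for illustration, and interesting_terms is a hypothetical name.

```python
from collections import Counter

# Hypothetical, deliberately short stop-word list; a real one would be much longer.
STOP_WORDS = {"the", "of", "and", "a", "to", "in", "is", "for", "that", "with"}


def interesting_terms(words, n=10, stop_words=STOP_WORDS):
    """Return the n most frequent words that are not in stop_words."""
    counts = Counter(w for w in words if w not in stop_words)
    return counts.most_common(n)
```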

In principle the method can be applied very quickly and easily to a batch of PDFs. The extraction takes a few minutes to run but the word count is very quick in comparison.

Posted in Automotive, Computer programming

Cheap counterflow wort chiller

This is the pic I found online that I was using for inspiration:

Counterflow wort chiller I found online

I picked up the following items from the local hardware store (total about $70AUD):

  • 3m of 25mm reinforced PVC tube (clear). Reinforced not necessary, just liked the look of it.
  • 3m of 12.7mm copper pipe
  • 2x brass t-junctions (½” thread)
  • 2x 13mm garden watering system barbs
  • 2x stainless steel hose clamps
  • packet of nylon olives
  • 2x ½” compression nipples
  • packet of zip ties

I fitted all that together with some teflon tape I had lying around and voila! Unfortunately I won’t get a chance to try it out for a couple of weeks, but it should work well enough.


Posted in Homebrew

Filtering homebrew

I recycled the yeast from a previous fermentation without properly considering the amount of yeast required to ferment the new batch. As a result the final beer had so much yeast in it you just about had to chew it. Honestly, it was undrinkable, and I don’t mind a bit of yeast in a craft beer.

Filter with corny keg line-out disconnect

I bought the housing off eBay for $30AUD inc. postage, and some 12.5mm PVC tubing from the local hardware store (food safe at cold temperature) and fitted the tubing to a cornelius keg disconnect using a stainless steel hose clamp. The 12.5mm tubing went over the top of the standard 6-8mm tubing connect area after soaking the tubing in hot water. The connectors between the tubing and filter are standard garden irrigation system threaded connectors to 13mm barbs. These should also be fine regarding food safety at the filtering temperature.

The idea for sanitising was that the lid, tubes and disconnect could all be sanitised as one item after removing the poppet from the disconnect with a screwdriver.

Source keg (LEFT) and destination keg (RIGHT) with filter (MIDDLE)

I filtered at roughly 100kPa and it took about 5-10min to filter the entire keg.

Destination keg overflowing from froth

The batch had been stored in the fridge prior to filtering and since I had originally naturally carbonated the keg with 70g of dextrose it was not properly flat. You do not want to filter a fully carbonated keg. For all intents and purposes the beer was flat, but it still frothed a lot. I believe a major factor was that the cold beer holds dissolved CO2 more readily – it would have been preferable to keep the keg to be filtered at room temperature and use the release valve daily for a few days to make sure all dissolved CO2 had been removed from the liquid.

Filtering complete, this was the final state of the overflowing destination keg

You can see from the patch of liquid on the ground that it didn’t really overflow that much. It was a simple job to fit the lid through the froth, then rinse the keg under an outdoor tap before putting it into the refrigerator.

Here is the uncarbonated (but cold) filtered beer

The finished beer was not exactly clear (although there is condensation on this glass from the cold liquid, so it is hard to tell); however, it was a whole lot clearer than the original. I will probably look at a 0.5 micron rather than a 1 micron filter in the future. The filter that I used said that it captures 85% of 1 micron sized particles. My research found that yeast is typically 3-5 microns across, so most if not all of the yeast should have been removed, and I am satisfied that the remaining cloudiness is probably just some heavier proteins.

The 1 micron filter immediately after filtering

The filter actually looked really clean after filtering. I couldn’t see any visible chunks on it. It was rinsed thoroughly and then re-assembled into the filter tank with water and no-rinse sanitiser. The disconnect and its attached tube were removed, and the filter-out tube was filled with water and looped back onto the filter-in 13mm barb so that both the filter and the tube were full of sanitising solution. It will be stored this way until next use and properly sanitised again immediately before use. One advantage of the 1 micron filter is that water passes through very easily, so it is simple to sanitise.

The waste poured out of the filter housing immediately after filtering

The liquid that came out of the centre of the filter (the waste) was quite thick but didn’t really look any different from the original unfiltered beer (yep, it was that bad!). Still, I wouldn’t drink this!

Posted in Homebrew

Dill’s Atlantic Giant – Week #16. Size does matter.

As it turns out, size does matter. This was supposed to be a competition but my mate moved his while it was flowering and it’s struggling along. He did have a fruit though, so anything is possible – but my solar collector leaf array is massive compared to his, and only that sunlight is going to add mass. He can’t win, I dare say.

It’s growing, growing, growing. Every day we go outside and our minds are blown by how much mass has been added to the fruit in such a short amount of time.

I cut off all the flowers hoping to reduce the amount of energy the plant was putting into looking pretty, thinking that the energy could be redirected into growing the fruit. No idea if that makes sense from a horticultural science perspective, but nothing to lose trying…and EVERYTHING to gain!

I decided to bed the fruit on top of some pea straw mulch so that it wasn’t in direct contact with the ground after seeing a similar approach on Gardening Australia. The idea is that any water can drain away easily from the fruit and reduce the risk of the skin rotting underneath. Seems to be working well so far.

Now it’s just a question of how much sunlight remains in the coming weeks before Autumn, when the cooler climate and the sun’s motion through the sky prevent any further growth.

Posted in Gardening

Dill’s Atlantic Giant – Week #14. Home stretch.

Things are very real now. In the space of only one week the female flower opened and was pollinated – no doubt by insects, before I manually pollinated it just in case.

Bees are loving the pumpkin flowers. You can stand there and watch bee after bee after bee land and collect pollen. I don’t think there is any doubt that the female flower was naturally pollinated.

Interestingly it does seem to attract a raft of other small insects as well. Hard to tell how they even get inside the flower – if they fly in or crawl in – but there is always something crawling around in there.

There is only one pumpkin (two little ones dropped off before the flower opened) but that one remaining is going great guns! It is almost doubling in size daily, so I put another bag of premium compost on top of the pile and mulched again with pea straw. I’ve been watering it twice daily, once in the morning and once in the evening, and despite some hot weather (a few 30°C+ days) it looks very happy in the full sun, so things are going well.

Posted in Gardening

Dill’s Atlantic Giant – Week #13. And recovery!

In the last article Little F–k Thong (ฟักทอง “fak thaawng” meaning “pumpkin” in Thai), as I have named the pumpkin, had a major set-back by way of a hail storm. A really, really bad hail storm!

With the only fruit severed, I wasn’t sure if Little F–k Thong was going to survive – but turns out, it’s a fighter! Even though I had to cut swathes of foliage from the original vine because of the hail damage the plant has grown two huge new vine stems and currently has four female flowers which I will eventually pollinate and will become fruit!

The progress is timely as well. It is mid-Summer and there is still plenty of sunlight (mixed with showers) for the next 8 weeks at least. The rain can be a bit on the torrential side through the Melbourne Summer, so I will continue to keep my eye on the plant and make sure it survives anything untoward.

Posted in Gardening

First time-lapse attempt

I’ve always loved the classic style of time-lapse photography that you see in pretty much every nature documentary but I’ve never had the gear to attempt it myself.

I was tempted to hack into an old digital camera and hook it up with an Arduino or something like that to control the shutter with a laptop, but I never managed to find the time to do it.

Meanwhile we shelled out on a Nikon D7000 DSLR which has in-built time-lapse functionality, which is nice!

Needless to say I made an error on the first attempt – I accidentally set a 200 photo limit, so I missed sunset (whoops!). I wasn’t totally happy with the shots regardless, and I want to do a time-lapse of a more interesting scene anyway, as interesting as the backyard is!

The resulting image set was 200 images captured from 1700h to roughly 1815h. The images were combined into a movie file in Linux using the command:

mencoder "mf://*.JPG" -mf fps=10 -ovc lavc -o output.avi

Here is the resulting 20s movie!
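
The numbers above can be checked with a little arithmetic: 200 frames at 10 frames per second gives the 20s running time. The sketch below works that out along with the approximate shutter interval and the speed-up factor, taking the capture window as 75 minutes (the end time is only rough); timelapse_stats is a hypothetical helper name.

```python
def timelapse_stats(frames, fps, capture_seconds):
    """Return (playback seconds, seconds between shots, speed-up factor)."""
    playback_seconds = frames / fps
    # N shots spread over the capture window leave N - 1 gaps between them.
    interval_seconds = capture_seconds / (frames - 1)
    speedup = capture_seconds / playback_seconds
    return playback_seconds, interval_seconds, speedup


# 200 photos, 10 fps, roughly 1700h to 1815h (75 minutes).
playback, interval, speedup = timelapse_stats(200, 10, 75 * 60)
print(playback, round(interval, 1), speedup)
```

So the camera was taking a shot roughly every 22 to 23 seconds, and the movie plays back about 225 times faster than real time.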

Posted in Photography

Dill’s Atlantic Giant – Week #10. And disaster.

Things were going well. Really well. Until Tim Minchin’s song was chosen not to be aired on the Jonathan Ross Show, and then the hail hit.

The hail was the serious kind of hail. Both of our cars were damaged, the local train tracks became a new waterfall and worst…oh worst – the pumpkin was damaged, possibly beyond repair.

I don’t take soldiers fighting in Afghanistan and other war-torn places lightly, those guys do a great job, so this is a poor choice of simile – but it was like a war zone. The pumpkin was peppered by the hail with bullet holes everywhere.

The worst part was that the pumpkin was actually starting to flower. Roughly 2m from the vine root there was actually a single female flower with the fruit just sitting there, waiting to be fertilised! It was not to be, however, for the stem with the fruit was completely severed from the vine by the hail.

So the battle plan is to prune the damaged vine and just see what happens. There are plenty of flowers – all male – and some new growth on the way, but with the engine (leaves) damaged it’s hard to say if there will be any pumpkins on this plant.

Posted in Gardening

Some things about Japan

I was in Japan recently and I made some notes on a scrap piece of paper while on a train. The aim was just to list some things that were interesting/different about Japan compared to Australia. So, here’s the list:

  • Rationing electricity: In the same way that water is rationed in Melbourne, electricity is currently rationed in Japan due to the Fukushima incident. Typical examples are that 25% of office lights are removed, revolving door entrances are disabled and every second elevator is disabled. The Japanese workers I spoke to didn’t mind the rationing at all and said that perhaps when it is over they won’t put their office lights back in because they quite like the darker environment.
  • No hand towels in public toilets. You simply shake your hands or wipe them on your clothes. I guess the premise of washing your hands is that they are clean so wiping them on your clothes shouldn’t really matter. Right?
  • Walk on the right on footpaths. Even though they drive on the left.
  • Unmaintained nature strips and play areas. Some areas are completely unmaintained with grass that is above knee high. This includes children’s play areas. I suppose they don’t have our snakes ;)
  • Pruning trees. The Japanese have a very unique way of pruning trees, ultimately leaving a bush of foliage on the end of a long, trimmed branch. Some trees need to be supported using wooden framework due to this pruning method.
  • Paper ads. In trains and elsewhere many advertisements, including posters, are simply paper held up by clips. In Australia that would be ripped down by some angry youth in an instant – but seems to work fine in Japan.
  • Book sleeves on trains. Japanese people like to be reserved about what they are reading. They cover their books in public places with a book sleeve or with the paper bag the book was sold in. Sometimes you get the feeling the men are reading adult content, so there is some reason for it.
  • Wearing holey socks at Buddhist temples. Don’t make this mistake.
  • No food vending machines. While the Japanese love vending machines for every type of drink you can imagine – there are very few food vending machines. Perhaps because their taste for sweet confectionery is more reserved.
  • Beer girls at baseball. They seem underage but they obviously are not. They run around selling as much beer as possible from backpacks with proper CO2 beer taps and then at a certain point they have disappeared. Must be regulated somehow.
  • Vending machines in food courts. In a typical food court you do not order and pay a person at a register. You order and pay at a vending machine which prints a ticket, then you take the ticket to the shop counter and they give you a wireless beeper. When your meal is ready your beeper goes off and you collect. It’s very good, I really like this system. Much more efficient.
  • Bi-lingual TV. Seems to be much more common there. Most movies are broadcast so that you can switch languages on the remote control. It’s great for tourists at least.
  • Terrible English in pop music. It seems to be popular to inject some English phrases into pop songs there, and it seems to be done without any proof-reading by an English speaker.
  • Driverless monorail. One exists in Tokyo. It’s a simple thing to automate, so I’m surprised there aren’t more. Impressive, nonetheless.

So there’s my basic list. Some cultural things that were interesting to me!

Posted in Uncategorized

Dill’s Atlantic Giant – Week #5

Egypt, Libya, Syria, Israel, Palestine, Afghanistan, Iraq – sheesh, too heavy! I just want to grow a pumpkin!

So it’s week five and things are going well (I assume. I don’t have any previous experience, your Honour).

The pumpkin looks happy. And any interventionist deity knows it should be happy with the amount of chemicals I put in that soil! I actually thought tonight that I should fertilise it, but then I second-guessed myself given how many different fertilisers I put into the original mix – seriously, a lot. Of every type.

So here it is after four and five weeks. It’s not the 20ft vine I’ve been reading about but I assume that will be coming. One thing I found interesting was the different shapes of the leaves. The first two were clearly round type leaves and thereafter it was the more typical vine type leaf. I wonder if the chemistry in each type of leaf is different? Interesting.

Anyway, ado aside, here are some images.

Posted in Gardening