Word frequency from PDFs

I recently attended the IEEE Intelligent Vehicles Symposium on the Gold Coast, Australia.

Clearly there were some trending topics, and my feeling was that stereo vision was a big one, along with specific vision algorithms such as semi-global matching (SGM).

Feeling is one thing, actually knowing is another. Looking at the disc provided to me as an attendee I saw that all the PDFs were available for the presentations. There were 232 PDF files in total.

I thought it would be interesting to rip the text out of the entire set and do a word frequency count. This kind of operation is very simple in Python. I dislike the structure of the language itself but when you complete a task such as this you do have to admire the ground-swell of developers and resources available.

I had a look at both the PyPDF2 and PDFMiner libraries. My initial experimentation with PyPDF2 showed that the structure of the PDFs was not going to suit that approach: there were no actual space characters in the PDF documents, so all the text was extracted running together. This is not unusual for PDF layouts.

PDFMiner is set up predominantly as a command line tool, which meant it was quick to test before installing the library and writing a more comprehensive script. The key trick using PDFMiner was to employ the ‘-A’ flag to automatically detect the PDF layout and interpret word spacing properly. The following command worked properly:

python pdf2txt.py -A -o text.txt -0006.pdf

Once that was sorted I knocked up the following script to loop over each PDF and extract the text into one big text file.

import os
import glob

from pdfminer.pdfinterp import PDFResourceManager, process_pdf
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from cStringIO import StringIO

def convert_pdf(path):

    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    laparams.all_texts = True
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)

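    # Open the PDF in binary mode and run it through the text converter
    fp = file(path, 'rb')
    process_pdf(rsrcmgr, device, fp)
    fp.close()
    device.close()

    # Avoid shadowing the built-in str type
    text = retstr.getvalue()
    retstr.close()
    return text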

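# Raw string so the backslashes in the Windows path are taken literally
pdflist = glob.glob(r"C:\Users\Pinky\Desktop\pdf\*.pdf")

# Open the output file once and append the text of every PDF to it
fout = open('pdfs.txt','a')
for pdf in pdflist:
    print("Working on: " + pdf + '\n')
    fout.write(convert_pdf(pdf))
fout.close()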

Note the laparams.all_texts = True line, which was the only part of that function I modified. The rest was a straight cut-and-paste from a Stack Overflow post.
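For anyone repeating this on Python 3, the script above will not run as written, because cStringIO and file() are Python 2 only. Here is a minimal sketch of the same idea, assuming the pdfminer.six fork is installed and using its extract_text helper. The function names and the "pdf" folder name are hypothetical and not part of the original script.

```python
import glob
import os


def extract_with_pdfminer(path):
    """Extract text from one PDF using pdfminer.six (assumed to be installed)."""
    # Imported lazily so the rest of the sketch works without the library.
    from pdfminer.high_level import extract_text
    from pdfminer.layout import LAParams

    # all_texts=True mirrors the laparams.all_texts setting used above.
    return extract_text(path, laparams=LAParams(all_texts=True))


def combine_pdfs(folder, out_path, extract=extract_with_pdfminer):
    """Append the text of every PDF in folder to out_path; return the PDF count."""
    pdf_paths = sorted(glob.glob(os.path.join(folder, "*.pdf")))
    with open(out_path, "a", encoding="utf-8") as fout:
        for pdf_path in pdf_paths:
            print("Working on: " + pdf_path)
            fout.write(extract(pdf_path))
            fout.write("\n")
    return len(pdf_paths)
```

Calling combine_pdfs("pdf", "pdfs.txt") would then build the same kind of combined text file. The extract argument is only there so a different extractor can be swapped in.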

The resulting text file containing all the extracted text (pdfs.txt) was surprisingly high quality. Better than I thought it would be.

Next I wrote a quick script to read in all the text (only 4.5MB in the end) and used the Counter class from Python’s collections module to create a table of all word frequencies.

from collections import Counter
import string

fin = open('pdfs.txt','r')
words = fin.read().lower()
out = words.translate(string.maketrans("",""), string.punctuation)

wordss = out.split()

cnt = Counter(wordss)

# Write one "word,count" line per distinct word
fout = open('counts.txt','w')
for k, v in cnt.items():
    fout.write(k + "," + str(v) + '\n')
fout.close()

The translate method was essential to cleaning the word list up to a reasonable level without too much effort.
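The two-argument translate call above is Python 2 only. As a rough sketch of the same clean-up step on Python 3 (count_words is a hypothetical helper name, and the sample sentence is made up for illustration):

```python
import string
from collections import Counter


def count_words(text):
    """Lower-case the text, strip ASCII punctuation and count the remaining words."""
    # str.maketrans with three arguments maps every character in the third
    # argument to None, which is the Python 3 way to delete characters.
    table = str.maketrans("", "", string.punctuation)
    cleaned = text.lower().translate(table)
    return Counter(cleaned.split())


counts = count_words("Stereo vision, stereo matching; (SGM) stereo!")
print(counts["stereo"])  # 3
print(counts["sgm"])     # 1
```

Note that string.punctuation only covers ASCII characters, so curly quotes or dashes extracted from a PDF would survive this step.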

Finally the counts.txt file was imported into Excel and sorted in descending order based on the term frequencies.
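The sorting could also be done in Python before the file is written. A small sketch, assuming a Counter like the one built above. The write_sorted_counts name, the counts_sorted.csv file name and the example counts are all made up for illustration:

```python
import csv
from collections import Counter


def write_sorted_counts(counts, out_path, limit=None):
    """Write word,count rows to a CSV file, most frequent word first."""
    # most_common() orders by count, highest first; limit=None keeps every word.
    rows = counts.most_common(limit)
    with open(out_path, "w", newline="", encoding="utf-8") as fout:
        csv.writer(fout).writerows(rows)
    return rows


example = Counter({"the": 5, "stereo": 3, "lidar": 1})
print(write_sorted_counts(example, "counts_sorted.csv", limit=2))
# [('the', 5), ('stereo', 3)]
```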

No surprises what the top ten words were:


At least I guess it’s proof that the method works well enough!

Moving onto some more interesting terms:


That is a much more interesting list because it confirms some of the clear topics that were presented.

In principle the method can be applied very quickly and easily to a batch of PDFs. The extraction takes a few minutes to run but the word count is very quick in comparison.

Posted in Automotive, Computer programming

Cheap counterflow wort chiller

This is the pic I found online that I was using for inspiration:

Counterflow wort chiller I found online

I picked up the following items from the local hardware store (total about $70AUD):

  • 3m of 25mm reinforced PVC tube (clear). Reinforced not necessary, just liked the look of it.
  • 3m of 12.7mm copper pipe
  • 2x brass t-junctions (½” thread)
  • 2x 13mm garden watering system barbs
  • 2x stainless steel hose clamps
  • packet of nylon olives
  • 2x ½” compression nipples
  • packet of zip ties

I fitted all that together with some Teflon tape I had lying around and voila! Unfortunately I won’t get a chance to try it out for a couple of weeks, but it should work well enough.


Posted in Homebrew

Filtering homebrew

I recycled the yeast from a previous fermentation without proper consideration of the amount of yeast required to ferment the new batch. As a result the final beer had so much yeast in it you just about had to chew it. Honestly, it was undrinkable, and I don’t mind a bit of yeast in a craft beer.

Filter with corny keg line-out disconnect


I bought the housing off eBay for $30AUD inc. postage and some 12.5mm PVC tubing from the local hardware store (food safe at cold temperature), then fitted the tubing to a Cornelius keg disconnect using a stainless steel hose clamp. The 12.5mm tubing went over the top of the standard 6-8mm tubing connection area after soaking the tubing in hot water. The connectors between the tubing and filter are standard garden irrigation system threaded connectors with 13mm barbs. These should also be fine regarding food safety at the filtering temperature.

The idea for sanitising was that the lid, tubes and disconnect could all be sanitised as one item after removing the poppet from the disconnect with a screwdriver.

Source keg (LEFT) and destination keg (RIGHT) with filter (MIDDLE)


I filtered at roughly 100kPa and it took about 5-10min to filter the entire keg.

Destination keg overflowing from froth


The batch had been stored in the fridge prior to filtering and since I had originally naturally carbonated the keg with 70g of dextrose it was not properly flat. You do not want to filter a fully carbonated keg. For all intents and purposes the beer was flat, but it still frothed a lot. I believe a major factor was that the cold beer holds dissolved CO2 more readily – it would have been preferable to keep the keg to be filtered at room temperature and use the release valve daily for a few days to make sure all dissolved CO2 had been removed from the liquid.

Filtering complete, this was the final state of the overflowing destination keg


You can see from the patch of liquid on the ground that it didn’t really overflow that much. It was a simple job to fit the lid through the froth, then rinse the keg under an outdoor tap before putting it into the refrigerator.

Here is the uncarbonated (but cold) filtered beer


The finished beer was not exactly clear (although there is condensation on this glass from the cold liquid, so it is hard to tell), but it was a whole lot clearer than the original. I will probably look at a 0.5 micron rather than a 1 micron filter in the future. The filter I used is rated to capture 85% of 1 micron sized particles. My research found that yeast cells are typically 3-5 microns across, so most if not all of the yeast should have been removed, and I am satisfied that the remaining cloudiness is probably just some heavier proteins.

The 1 micron filter immediately after filtering


The filter actually looked really clean after filtering. I couldn’t see any visible chunks on it. It was rinsed thoroughly and then re-assembled into the filter tank with water and no-rinse sanitiser. The disconnect and its attached tube were removed, and the filter-out tube was filled with water and looped back onto the filter-in 13mm barb so that the filter and tube stayed full of sanitising solution. It will be stored this way until next use and properly sanitised again immediately before use. One advantage of the 1 micron filter is that water passes through very easily, so it is simple to sanitise.

The waste poured out of the filter housing immediately after filtering


The liquid that came out of the centre of the filter (the waste) was quite thick but didn’t really look any different from the original unfiltered beer (yep, it was that bad!). Still, I wouldn’t drink this!

Posted in Homebrew

The online media problem

End of NFL in US article image


Sometimes you read an article so void of substance that you shake your head the entire way through. Then you wonder why you wasted the last five minutes of your life with that trash.

And then you realise it is an online article, surrounded by advertisements which are generating the publisher revenue and you just cringe at what online media has become. You’ve just paid them to read that.

It’s not an isolated situation. My preferred printed broadsheet newspaper is a great source of credible news and opinion in that format, but as soon as you type in the URL to their online version…*gasp*…just don’t.

So while news publishers battle with their revenue models for online content, consumers lose out. It’s a lose-lose situation for consumers – and that means it’s a lose-lose for us all.

This ranting post has come about due to this gem: http://es.pn/A39Tns (You know what? Please don’t click through unless you can’t help yourself, I’ll explain).

It’s a piece about the National Football League (NFL) in the United States, and speculates what the future might be like without ‘football’ (can we go with ‘handegg’, the recent Imgurian coined expression which better represents this game?).

To summarise the article content: “Football makes heaps of ad money, people will lose out when it collapses.” I just don’t even know where to start in criticising their reporting.

There are so many logical problems with the article it would be a good example for high school students to play “name that logical fallacy”.

In conclusion the article speculates where football players might go; high jump (one of their perhaps tongue-in-cheek suggestions)? Really guys?


Posted in Criticism

Dill’s Atlantic Giant – Week #16. Size does matter.

As it turns out, size does matter. This was supposed to be a competition but my mate moved his while it was flowering and it’s struggling along. He did have a fruit though, so anything is possible – but my solar collector leaf array is massive compared to his, and only that sunlight is going to add mass. He can’t win, I dare say.

It’s growing, growing, growing. Every day we go outside and our minds are blown by how much mass has been added to the fruit in such a short amount of time.

I cut off all the flowers hoping to reduce the amount of energy the plant was putting into looking pretty, thinking that the energy can be redirected into growing the fruit. No idea if that makes sense from a horticultural science perspective, but nothing to lose trying…and EVERYTHING to gain!

I decided to bed the fruit on top of some pea straw mulch so that it wasn’t in direct contact with the ground after seeing a similar approach on Gardening Australia. The idea is that any water can drain away easily from the fruit and reduce the risk of the skin rotting underneath. Seems to be working well so far.

Now it’s just a question of how much sunlight remains in the coming weeks before Autumn, when the cooler climate and the sun’s motion through the sky prevent any further growth.

Posted in Gardening

Debunking Kennett’s myths on car manufacturing

Former Victorian Premier, Jeff Kennett, had an opinion piece published in The Herald Sun on Federal Government subsidies. I will discuss only his second example – new vehicle manufacturing in Australia.

GMH and Ford’s challenge is not the level of the Australian dollar, but their failure to produce a car the community is clamouring to buy.

No, Jeff. The facts don’t support this opinion. The Roy Morgan market research in Figure 1 shows that those intending to buy a new vehicle in the next four years are very aware of the Holden Astra product, and since the welcome investment into the Holden Cruze product, well, the slope on that line speaks for itself really.

Even SportsBet had a ‘novelty bet’ market recently on the best selling car out of the Mazda 3 and Holden Cruze in 2012!

Figure 1. Model awareness amongst new car intenders for the next 4 years (Source: Roy Morgan Research)

Why is that a good investment by the Federal Government?

Clearly because Australians intend to purchase small cars. Roy Morgan Research shows that almost 25% (the largest category) of new car intenders will purchase a small vehicle, a percentage that is matched by the Federal Chamber of Automotive Industries’ own sales reporting figures.

So the market demands are clearly being met by automotive manufacturers. Not only are the demands being met but the locally produced vehicles are winning awards like the Drive Car of the Year 2011 Best Small Car (Ford Focus).

Kennett wants the Australian taxpayer to share his sentiment that Australian produced vehicles are not at the high standard of other automotive manufacturers in the world. So are they?

When the locally produced Ford Australia engineered Ranger achieved the first ever EuroNCAP 5-star rating for a ute (‘pickup truck’), let’s see what the world thought about that…


N24: http://www.n24.de/news/newsitem_7371765.html

Süddeutsche.de: http://newsticker.sueddeutsche.de/list/id/1223753

Stern.de: http://www.stern.de/auto/news/nur-einer-bleibt-unter-5-sternen-1743784.html

Focus.de: http://www.focus.de/auto/news/euroncap-crashtest-pick-up-als-freund-der-fussgaenger_aid_678353.html

DMM.de: http://dmm.travel/news/artikel/lesen/2011/10/ncap-vergibt-neue-sterne-fuer-sicherheit-39304/

Auto-presse.de: http://auto-presse.de/autonews.php?newsid=111663

Motor-talk.de: http://www.motor-talk.de/news/euro-ncap-5-sterne-fuer-11-von-12-neuen-modellen-t3558105.html

Autonachrichten.de: http://www.autonachrichten.de/2011/10/26/ford-ranger-erhalt-als-erster-pick-up-funf-euro-ncap-sterne.html

Net-tribune.de: http://www.net-tribune.de/nt/node/68830/news/Ford-Ranger-erster-Pick-up-mit-fuenf-EuroNCAP-Sternen

Weser Kurier.de: http://www.weser-kurier.de/Artikel/Ratgeber/Auto/470839/Neuer-Ford-Ranger%3A-1.-Pick-up-mit-Crashtest-Bestnote.html

Mitteldeutsche Zeitung.de: http://www.mz-web.de/servlet/ContentServer?pagename=ksta/page&atype=ksArtikel&aid=1318611100968

Corriere della Sicurezza: http://www.ilcorrieredellasicurezza.it/articolo.asp?idarticolo=per-il-ford-ranger-stelle-ai-crash-test-euro-ncap_5203

Sicurauto: http://www.sicurauto.it/crash-test/news/crash-test-euro-ncap-ottobre-2011.html

Vega editrice: http://www.vegaeditrice.it/asapress/3-prodotto/44697-ford-5-stelle-euroncap-per-il-pick-up-ranger

Autoappassionati: http://www.autoappassionati.it/index.php/news/item/euro-ncap-pioggia-di-stelle-193.html?category_id=23

Veicoli commerciali 24: http://www.veicolicommerciali24.it/articolo/356/nuovo-ford-ranger-ottiene-5-stelle-ai-crash-test-euro-ncap/

Autoblog: http://www.autoblog.it/post/36351/test-euro-ncap-5-stelle-per-11-nuovi-modelli-4-stelle-per-lancia-voyager

Omniauto: http://www.omniauto.it/magazine/17659/5-stelle-euro-ncap-per-lancia-thema-e-fiat-freemont

Quattroruote: http://www.quattroruote.it/notizie/sicurezza/crash-test-euroncap-cinque-stelle-per-11-auto-video



LA VANGUARDIA: www.lavanguardia.com/motor/20111026/54237096607/el-ford-ranger-primer-pick-up-con-la-maxima-puntuacion-en-seguridad.html

LA GACETA: www.intereconomia.com/noticias-gaceta/motor/nuevo-ford-ranger-5-estrellas-los-test-euro-ncap-20111026

EL CORREO – BLOGS : http://blogs.elcorreo.com/plazadegaraje/2011/10/26/el-ford-ranger-primer-pick-up-en-conseguir-las-5-estrellas-euroncap/

AUTOFACIL: http://www.autofacil.es/seguridad/ford-ranger-primer-pick-up-con-cinco-estrellas-euro-ncap

ACTUALIDAD AUTOCASION: http://actualidad.autocasion.com/noticias/91654/cinco-estrellas-euro-ncap-para-un-pick-up/

EL ECONOMISTA: http://www.eleconomista.es/publicidad/opbo11/ecomotor/motor/noticias/3482163/10/11/El-Ford-Ranger-primer-pick-up-que-logra-la-maxima-puntuacion-de-EuroNCAP.html

DIARIO MOTOR: http://www.diariomotor.com/2008/11/26/ultimo-informe-euroncap-muchas-luces-y-alguna-sombra/euroncap-ford-ranger-01/

EUROPA PRESS: http://www.europapress.es/motor/industriales-00642/noticia-ford-ranger-primer-pick-up-liderar-test-seguridad-euroncap-20111026113827.html



PORTAL COCHES: http://www.portalcoches.net/El-Ford-Ranger-primer-pick-up-en-liderar-los-test-de-seguridad-de-EuroNCAP/5798.html

EL MUNDO: http://www.elmundo.es/elmundomotor/2011/10/26/seguridad/1319640616.html
ABC: http://www.abc.es/20111027/motor-novedades/abci-ford-ranger-estrellas-euroncap-201110262246.html

3D Car Shows - http://3d-car-shows.com/2011/all-new-ford-ranger-makes-history/

AM Online (Automotive Management) - http://www.am-online.com/news/2011/10/26/latest-new-car-models-gain-five-star-euro-ncap-safety-rating/29917/

Auto Express - http://www.autoexpress.co.uk/news/autoexpressnews/274279/euro_ncap_results.html

Autoblog - http://www.autoblog.com/2011/10/26/new-ford-ranger-becomes-first-pickup-to-earn-five-star-euro-ncap/

Auto-Media.Info - http://auto-media.info/2011/10/26/euro-ncap-awards-first-5-star-rating-for-a-pick-up-video/

Automotive Industry Digest - http://www.automotiveindustrydigest.com/2011/10/27/new-ford-ranger-becomes-first-pick-up-to-notch-five-star-safety-rating/

Car Keys - http://www.carkeys.co.uk/news/euro-ncap-awards-october-2011

Car and Van News - http://carandvannews.co.uk/2011/10/26/ford%E2%80%99s-ranger-is-first-five-star-pickup/

Car61 - http://www.car61.com/2011/new-ford-ranger-got-5-star-euro-ncap-rating/

Contract Hire and Leasing - http://www.contracthireandleasing.com/car-leasing-news/euro-ncap-results-round-up/

Expert Reviews - http://www.expertreviews.co.uk/car-tech/1288147/ford-confirms-pricing-on-new-ranger

Fleet News - http://www.fleetnews.co.uk/news/2011/10/27/all-new-ford-ranger-seals-euro-ncap-first/41106/

Fleet World - http://www.fleetworldgroup.co.uk/news/2011/Oct/Ford-Ranger-becomes-first-pickup-with-five-star-Euro-NCAP-rating-/0434003799/

In Auto news - http://www.inautonews.com/euro-ncap-first-5-start-rating-for-a-pick-up-the-ford-ranger

Motortorque - http://www.motortorque.com/news/auto-1110/ford-ranger-is-safest-pickup-in-europe.asp

Motorward - http://www.motorward.com/2011/10/ford-ranger-first-pickup-to-earn-5-star-ncap-rating/

Vans A2Z - http://www.vansa2z.com/All-new-Ranger-scores-five-EURO-NCAP-stars

Whatcar?  - http://www.whatcar.com/car-news/top-marks-for-10-cars-in-euro-ncap-test/259736

Which? - http://www.which.co.uk/news/2011/10/12-new-cars-tested-and-rated-by-euro-ncap-269553/

The Autochannel - http://www.theautochannel.com/news/2011/10/26/012198-all-new-ford-ranger-raises-bar-safety.html



Tf1.fr - http://www.tf1.fr/auto-moto/photo/crash-test-euroncap-derniers-resultats-avec-12-nouveautes-6794465.html

Autonews.fr - http://www.autonews.fr/Breves/Ford-Ranger-5-etoiles-a-Euro-NCAP-281955/

Turbo.fr - http://www.turbo.fr/actualite-automobile/454798-euro-ncap-resultats-2011/

Unhomme.fr - http://www.unhomme.fr/page-al-alias6663.html

Leblogauto.com - http://www.leblogauto.com/2011/10/euroncap-derniere-fournee-et-quasi-sans-fautes.html

Autotitre.com - http://www.autotitre.com/a/Euro-NCAP-les-crash-tests-du-Ford-Ranger-en-video-41866.htm

Auto-buzz.com - http://www.auto-buzz.com/tests-euroncap-distribution-des-etoiles-538712.html

Blogautomobile.fr - http://blogautomobile.fr/euroncap-2011-pluie-de-5-etoiles-sauf-129987#axzz1bzrmu8op

Autosource.fr - http://www.autosource.fr/actus/resultats-euro-ncap-5-etoiles-audi-q3-bmw-serie-1-toyota-yaris/

Lyonne.fr - http://www.lyonne.fr/france_monde/automobile/le_ford_ranger_est_le_premier_pick_up_a_obtenir_5_etoiles_au_crash_test_d_euro_ncap@CARGNjFdJSsAFx0BAxk-.html

Lamontagne.fr - http://www.lamontagne.fr/france_monde/automobile/le_ford_ranger_est_le_premier_pick_up_a_obtenir_5_etoiles_au_crash_test_d_euro_ncap@CARGNjFdJSsAFx0BAxk-.html













Automotorsport: http://www.automotorsport.se/artiklar/nyheter/20111026/euro-ncap-inte-bara-femstjarnigt

FRONT PAGE of Teknikens Värld: http://www.teknikensvarld.se/2011/10/26/25040/ford-ranger-forsta-pickup-med-fem-stjarnor/










The Netherlands












Posted in Automotive

Dill’s Atlantic Giant – Week #14. Home stretch.

Things are very real now. In the space of only one week the female flower opened and was pollinated – no doubt by insects, before I manually pollinated it just in case.

Bees are loving the pumpkin flowers. You can stand there and watch bee after bee after bee land and collect pollen. I don’t think there is any doubt that the female flower was naturally pollinated.

Interestingly it does seem to attract a raft of other small insects as well. Hard to tell how they even get inside the flower – if they fly in or crawl in – but there is always something crawling around in there.

There is only one pumpkin (two little ones dropped off before the flower opened) but the one remaining is going great guns! It is almost doubling in size daily, so I put another bag of premium compost on top of the pile and mulched again with pea straw. I’ve been watering it twice daily, once in the morning and once in the evening, and despite some hot weather (a few 30°C+ days) it looks very happy in the full sun, so things are going well.

Posted in Gardening

Dill’s Atlantic Giant – Week #13. And recovery!

In the last article Little F–k Thong (ฟักทอง “fak thaawng” meaning “pumpkin” in Thai), as I have named the pumpkin, had a major set-back by way of a hail storm. A really, really bad hail storm!

With the only fruit severed, I wasn’t sure if Little F–k Thong was going to survive – but turns out, it’s a fighter! Even though I had to cut swathes of foliage from the original vine because of the hail damage the plant has grown two huge new vine stems and currently has four female flowers which I will eventually pollinate and will become fruit!

The progress is timely as well. It is mid-Summer and there is still plenty of sunlight (mixed with showers) for the next 8 weeks at least. The rain can be a bit on the torrential side through the Melbourne Summer, so I will continue to keep my eye on the plant and make sure it survives anything untoward.

Posted in Gardening

First time-lapse attempt

I’ve always loved the classic style of time-lapse photography that you see in pretty much every nature documentary but I’ve never had the gear to attempt it myself.

I was tempted to hack into an old digital camera and hook it up with an Arduino or something like that to control the shutter with a laptop, but it’s something that I never managed to find the time to do myself.

Meanwhile we shelled out on a Nikon D7000 DSLR which has in-built time-lapse functionality, which is nice!

Needless to say I made an error on the first attempt: I accidentally set a 200 photo limit, so I missed sunset (whoops!). I wasn’t totally happy with the shots regardless, and I want to do a time-lapse of a more interesting scene anyway, as interesting as the backyard is!

The resulting image set was 200 images captured from 1700h to roughly 1815h. The images were combined into a movie file in Linux using the command:

mencoder "mf://*.JPG" -mf fps=10 -ovc lavc -o output.avi

Here is the resulting 20s movie!
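The 20s figure follows directly from the frame count and the fps setting, and the same numbers give a rough idea of how far apart the shots were. A quick sketch of the arithmetic (timelapse_stats is a hypothetical helper, and since the 1815h end time is only approximate the interval is an estimate):

```python
def timelapse_stats(frame_count, fps, start_minutes, end_minutes):
    """Return (movie length in seconds, average seconds between shots)."""
    movie_seconds = frame_count / fps
    shoot_seconds = (end_minutes - start_minutes) * 60
    # N frames are separated by N - 1 gaps.
    interval_seconds = shoot_seconds / (frame_count - 1)
    return movie_seconds, interval_seconds


# 200 frames at 10 fps, shot between 17:00 and roughly 18:15.
movie, interval = timelapse_stats(200, 10, 17 * 60, 18 * 60 + 15)
print(movie)               # 20.0
print(round(interval, 1))  # 22.6
```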

Posted in Photography

Dill’s Atlantic Giant – Week #10. And disaster.

Things were going well. Really well. Until Tim Minchin’s song was chosen not to be aired on the Jonathan Ross Show, and then the hail hit.

The hail was the serious kind of hail. Both of our cars were damaged, the local train tracks became a new waterfall and worst…oh worst – the pumpkin was damaged, possibly beyond repair.

I don’t take soldiers fighting in Afghanistan and other war-torn places lightly, those guys do a great job, so this is a poor choice of simile, but it was like a war zone. The pumpkin was peppered by the hail with bullet holes everywhere.

The worst part was that the pumpkin was actually starting to flower. Roughly 2m from the vine root there was actually a single female flower with the fruit just sitting there, waiting to be fertilised! It was not to be, however, for the stem with the fruit was completely severed from the vine by the hail.

So the battle plan is to prune the damaged vine and just see what happens. There are plenty of flowers – all male – and some new growth on the way, but with the engine (leaves) damaged it’s hard to say if there will be any pumpkins on this plant.

Posted in Gardening