tf.idf

4:11 pm Research

I thought it would take me another week, but I managed to put together a Perl script that computes tf.idf scores. OK, Steve helped me, and it wasn’t easy! But our solution with a two-dimensional hash is quite elegant. Now I only need to tweak it so that it performs a proper tokenisation step beforehand and only prints terms above a certain tf.idf threshold, but the worst bit is over :)
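For the curious, here is a minimal sketch of the idea in Python. It is not the actual Perl script, just an illustration under a few assumptions: the nested dictionary `counts[doc_id][term]` stands in for the two-dimensional hash, the scoring uses the common tf × log(N/df) variant (there are others), and the names `tokenise`, `tf_idf` and `threshold`, as well as the toy documents, are made up for this example.

```python
import math
import re
from collections import defaultdict


def tokenise(text):
    """Lowercase the text and split it into runs of letters (a crude tokeniser)."""
    return re.findall(r"[a-z]+", text.lower())


def tf_idf(docs, threshold=0.0):
    """Return {doc_id: {term: score}}, keeping only scores above the threshold.

    docs maps a document id to its raw text. Scoring is tf * log(N / df),
    which is one common variant and an assumption of this sketch.
    """
    # counts[doc_id][term]: nested dicts, analogous to a Perl hash of hashes
    counts = defaultdict(lambda: defaultdict(int))
    for doc_id, text in docs.items():
        term_counts = counts[doc_id]  # create the entry even for empty documents
        for term in tokenise(text):
            term_counts[term] += 1

    # df[term]: number of documents that contain the term at least once
    df = defaultdict(int)
    for term_counts in counts.values():
        for term in term_counts:
            df[term] += 1

    n_docs = len(docs)
    scores = {}
    for doc_id, term_counts in counts.items():
        scores[doc_id] = {}
        for term, tf in term_counts.items():
            score = tf * math.log(n_docs / df[term])
            if score > threshold:
                scores[doc_id][term] = score
    return scores


# Toy collection, invented for illustration only
docs = {
    "d1": "crude oil prices rise",
    "d2": "oil exports fall",
    "d3": "the cat sat on the mat",
}

for doc_id, terms in tf_idf(docs, threshold=0.5).items():
    # Print the highest-scoring terms first
    for term, score in sorted(terms.items(), key=lambda kv: -kv[1]):
        print(f"{doc_id}\t{term}\t{score:.3f}")
```

A term that occurs in every document gets a score of zero with this variant, so it drops out even with the default threshold of 0.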

I really need to hack more often!

4 Responses

  1. Rowan Says:

    Can I see? Are you going to publish code?

  2. merpeltje Says:

    I’ll send it to you tomorrow. I can’t be arsed to retrieve it from my computer at work now, and it’s too messy to be put out in the open…

  3. Van Gils Blogs » Programming Ruby Says:

    [...] It’s been a while since I did some real coding. Actually, I haven’t done any coding in ages! Time to change all that. I saw that Marieke had posted that she had written a script for doing tf.idf calculations in Perl. The idea is simple: you have a couple of documents with a bunch of words in them and you want to calculate per word how often it occurs in a document (term frequency, tf) divided by the number of documents in the total collection that contain it (document frequency, df). A couple of years ago I wrote the same thing in Perl for a course in IR. A while later I rewrote it in Python. Since I’m mostly interested in Ruby these days I rewrote it again. The code can be found here and the nicely formatted version can be found here. [...]

  4. Bas Says:

    Hey merpel! tf.idf calculations remain tricky. I recall the IR classes a couple of years ago where we had to battle Paai’s horrible collection about CRUDE and OIL stuff. I used Perl back then too. Things went fine, but the Suns we had in the lab back then didn’t like the nested hashes (yikes). A couple of years later I implemented it in Python, which was easier. For a little while now I’ve been fiddling with Ruby. Man, awesome lil’ language. Anyway, I blogged about it here.

    take care!