boss-level.com
21 Mar 2013

Absolutely minimal text classification

tl;dr

Bare bones text classifier implemented in Clojure. Nothing to raise pulses.

The next example from Kevin Murphy's tremendous ML book that I've tackled in Clojure is the naive text classifier. It's an old chestnut: figure out whether a given post comes from comp.os.ms-windows or comp.windows.x. The code and data are on Github.

The statistical model is pretty basic: use a Bernoulli variable for the appearance (or not) of a word in a document of a given class. What you want to figure out is the parameter of this variable, theta: the probability of the word's appearance. To be Bayesian, we express the prior over this parameter as a Beta (Beta(1,1), which is uniform), and do MAP estimation.
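As a sketch, the smoothed estimate for a single stem in a single class boils down to one tiny function (the function name and arguments are mine, not from the post's repo; the plus-one/plus-two form is the add-one smoothing the post describes):

```clojure
;; Smoothed estimate of theta for one stem in one class.
;; n-with-stem: number of training docs (in this class) containing the stem.
;; n-docs:      total number of training docs in this class.
;; The +1/+2 is the Laplace "plus-one" the post mentions.
(defn theta [n-with-stem n-docs]
  (/ (inc n-with-stem)
     (+ n-docs 2)))

(theta 3 8) ; => 2/5, i.e. (3 + 1) / (8 + 2)
```

Clojure's ratio type keeps these estimates exact until you need doubles.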

The mechanics are even simpler: for each word-document (stem-document, actually) pair in the training set, record whether the word appeared in the document, then average each word's counts within each document class. You end up with the fraction of docs in which the stem appears per class, and we call that theta. Incorporating the prior slightly wrinkles this, but it amounts to just a Laplace plus-one on the counts.
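The counting step above might look like this in Clojure (a hypothetical sketch, not the post's actual code: documents are represented as sets of stems, already grouped by class):

```clojure
;; Count, per stem, how many docs in this class contain it.
(defn stem-doc-counts [docs]
  (reduce (fn [counts doc]
            (reduce #(update %1 %2 (fnil inc 0)) counts doc))
          {}
          docs))

;; Turn counts into smoothed per-stem thetas for one class.
(defn thetas [docs]
  (let [n (count docs)]
    (into {}
          (for [[stem c] (stem-doc-counts docs)]
            [stem (/ (inc c) (+ n 2))]))))

(thetas [#{"window" "mouse"} #{"window"}])
;; => {"window" 3/4, "mouse" 1/2}
```

Run once per class, this gives exactly the per-class theta maps described above.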

Once we've worked out the thetas (the estimated probability that a given stem appears in a document of a given class), to classify an unseen document we multiply up these probabilities for each word in the document under each of the two classes and pick the maximum (the extension to many classes is obvious). Bare bones MAP.
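A minimal classifier along these lines (my sketch, with two assumptions: summing log-thetas instead of multiplying raw probabilities, the standard numerical-stability move, and defaulting unseen stems to the 1/2 smoothed prior):

```clojure
;; Score a document's stems under one class's theta map.
;; Unseen stems fall back to 1/2 (the smoothed prior) -- an assumption.
(defn log-likelihood [theta-map doc-stems]
  (reduce + (map #(Math/log (double (get theta-map % 1/2))) doc-stems)))

;; Pick the class whose theta map gives the highest score.
(defn classify [class->thetas doc-stems]
  (key (apply max-key
              (fn [[_ thetas]] (log-likelihood thetas doc-stems))
              class->thetas)))

(classify {:ms-windows {"registry" 0.9, "xterm" 0.1}
           :windows-x  {"registry" 0.2, "xterm" 0.8}}
          ["registry" "registry"])
;; => :ms-windows
```

Because `class->thetas` is just a map of maps, the "extension to many classes" really is a non-event: `max-key` already ranges over however many classes you trained.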

The implementation is not especially interesting either. I propagate the calculation as a value, as in the Number's Game post, but this time using Prismatic's Graph library. I used Graph just for an opportunity to play with it; it's really a bigger gun than we need in this fight.
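For flavour, the training pipeline in Graph style might look like this (a hedged sketch, not the repo's code; it assumes Prismatic's `plumbing` library is on the classpath, and the step names are mine):

```clojure
(require '[plumbing.core :refer [fnk]]
         '[plumbing.graph :as graph])

;; Each keyword names a step; fnk declares which other steps it reads.
;; Here :docs is a seq of documents, each a seq of stems, for one class.
(def train-graph
  {:n-docs (fnk [docs] (count docs))
   :counts (fnk [docs] (frequencies (mapcat distinct docs)))
   :thetas (fnk [counts n-docs]
             (into {} (for [[stem c] counts]
                        [stem (/ (inc c) (+ n-docs 2))])))})

(def train (graph/compile train-graph))

(:thetas (train {:docs [["window" "mouse"] ["window"]]}))
;; => {"window" 3/4, "mouse" 1/2}
```

The appeal is that the whole calculation is a plain map value you can inspect, extend, or recompile with different steps at the REPL.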

As I said, nothing very interesting here. I'm just noting that it's trivially easy to do this sort of MAP estimation in Clojure. The extension of this to real problems is interesting: having Lucene at hand for the stemmer, lazy seqs for giant document sets, or being able to drop the initial counting into Cascalog (for Hadoop-sized problems) and then bring the model back into REPL code means this same code steps easily into the real world.
