
Linear Regression 3 Ways in Julia

Introduction

Julia is an upstart numerical language that I've been looking at recently. It's still very rough, not even at 0.2 yet, but it has some impressive chops. As an alternative to R and MATLAB it has a lot going for it. Much has been made in blogs and forums about the advertised speed, and whether this is hype or not. I think this misses the point; what matters is the trade-off of speed for expressiveness. In R you trade away all speed for great expressiveness; in MATLAB it's the other way around (and frequently not even that). In this post I show that Julia has superb expressiveness. Since we know that Julia has the goal and the capacity for speed, this is very exciting for the future.

The Problem

About the simplest thing you can do in stats is a linear regression. So I set out to see how Julia can express this simple operation, and to compare that with how it would be done in R and MATLAB.

Code Walkthrough


# Pull in packages
using Distributions
using Winston
using GLM

#=====================================================================
# Set up the global parameters
N = 20                                       # Training set size
true_w = [-0.3 0.5 -0.9]                     # The answer
σ² = 0.2^2                                   # Variance of noise

# Nothing special here except to note that using unicode symbols seems
# to be idiomatic in Julia and is easily achieved in emacs using
# M-x set-input-method TeX

#=====================================================================
#  Set up the data
# Line generating fns.  See how easy functional programming is, closures
# just fall right out of the syntax.
generating_f = (w) -> x -> w[1] + w[2]*x + w[3]*x.^3  # The main thing
true_line    = generating_f(true_w)          # Bind the true params
noise        = ()->rand(Normal(0, sqrt(σ²)))  # Noise itself; Normal takes the std dev, not the variance
noisy        = (f) -> x -> f(x) + noise()    # Return a noisy function
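# e.g. noisy(true_line) is itself a function: x -> true_line(x) + Gaussian noise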


# Sample random training x and y points.  Note that I have to match
# the columns with the generating function.  Not so awesome from a
# coding perspective.
function generate_data(f, n)
    x = -1 + 2*rand(n)
    y = map(f, x)
    ([ones(n,1) x x.^3], y)
end

# Thanks to Kevin for pointing out how to destructure this!
(X, y) = generate_data(noisy(true_line), N)

#=====================================================================
#  Classic Linear Regression the MATLAB way.
#  This solves the linear system Xw = y using the QR decomposition
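#  (i.e. the least-squares solution: the w minimising ||X*w - y||^2)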
w_matlab = X\y


#=====================================================================
#  Classic Linear Regression the R way
#  This bottoms out to precisely the same QR decomp as the \ operator

#  This only seems contorted compared to the Matlab above because I
#  set the data up specifically for the Matlab operator and now have
#  to remunge it.
df = DataFrame(x  = X[:,2],
               x³ = X[:,3],
               y  = y)
model = lm(:(y ~ x + x³), df)
w_R = coef(model)


#=====================================================================
#  Bayesian Linear Regression the Julia way !

# The prior is a 3D isotropic Normal. Use ₀ to indicate prior params.
μ₀ = [0.0, 0.0, 0.0]               # Prior assumes zero mean on w
Σ₀ = eye(3)*0.8                    # Prior is isotropic and wide
prior = MvNormal(μ₀, Σ₀)

# Now update the prior with the data.  Use ₁ to indicate posterior params.
# I'm cheekily assuming I know the noise variance σ².
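# In equation form (MLAPP eq. 7.55) this update is:
#   Σ₁ = (Σ₀⁻¹ + (1/σ²) XᵀX)⁻¹
#   μ₁ = Σ₁ (Σ₀⁻¹ μ₀ + (1/σ²) Xᵀ y)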
Σ₁ = inv(inv(Σ₀) + 1/σ²*X'*X)         # Equations (7.55) in MLAPP
μ₁ = Σ₁*inv(Σ₀)*μ₀ + 1/σ²*Σ₁*X'*y
posterior = MvNormal(μ₁, Σ₁)

# Sample from the posterior.  Beautiful.
w_samples = rand(posterior, 20*N)

#=====================================================================
# Plot our answers

domain = [-1.0 : 0.05 : 1.0]
p = FramedPlot()
setattr(p, "xrange", (-1,1))
setattr(p, "yrange", (-1,1))
setattr(p, "aspect_ratio", 1)
setattr(p, "title", "Regression Comparison")

# Plot samples from the posterior
so = nothing
for i=1:(20*N)
  so = Curve(domain, generating_f(w_samples[:,i])(domain), "type", "dotted", "color", "grey" )
  add(p, so)
  setattr(so, "label", "Posterior Samples")
end

# Plot the posterior mean
w_postmean = mean(posterior)
s = Curve(domain, generating_f(w_postmean)(domain), "color", "blue" )
setattr(s, "label", "Posterior Mean")
add(p, s)

# Plot the MATLAB and R answers
c = Curve(domain, generating_f(w_matlab)(domain), "color", "orange")
setattr(c, "label", "Classic Linear Regression")
add(p, c)

# Plot the true answer
ct = Curve(domain, true_line(domain), "color", "red")
setattr(ct, "label", "True")
add(p, ct)

# Plot the data we used for inference
pt = Points( X[:,2], y, "type", "filled circle", "color", "red" )
setattr(pt, "label", "Training Data")
add(p, pt)

l = Legend( .1, .9, {so, s, c, ct, pt} )
add(p, l)

file(p, "posterior.png")
Winston.display(p)

Expressiveness

Firstly, the code is very readable (and as a Lisp fanboy I surprise myself there). In setting up the generating functions I could use sensible FP concepts, like a grown-up. I have access to the handy-dandy backslash operator, giving me the same expressiveness for this important op as I'd have in MATLAB. I can also use R's much better declarative formula syntax. It's not quite 100 percent as good: in R I could have written y ~ x + I(x^3) and not had to build the x-cubed column in the dataframe myself. However, I'm sure that functionality is coming. I also really like the Unicode, particularly in the Bayesian inference section, where the code now looks identical to the maths formulae. Lastly, the Distributions package lets me deal with distribution objects as such and interact with them through a sensible interface. All in all, this is very expressive numerical computing.

Conclusion

I really like this, and don't think I could have written this code so succinctly in any other language. Julia is a winner.


Can we do anything with matrices in Clojure, yet ? An example.

TL;DR: Clojure's matrix stuff is becoming useful but still rough.

Introduction

In Ch 4 of "Machine Learning: A Probabilistic Perspective" Murphy demonstrates a very cute interpolation technique.
The MATLAB code is tiny, but super heavy on the matrix DSL, so I wanted to see how the equivalent Clojure would stack up. Here are the results.

The Maths

You have a bunch of observed points and you want to interpolate between them. You can fit polynomials or splines, but another option is to fit something looser. We encode the notion of 'smooth' by saying that each point on the curve is the average of its immediate neighbours plus a bit of Gaussian noise. It turns out that if you do this, all the points on the curve are jointly Gaussian distributed, and you can then use a well-known conditional-probability formula to infer the mean of the hidden points from the observed points.

Nicking the main points from the book (Section 4.3.2.2): if x are your points (seen and unseen), and \epsilon \sim N(0,\frac{1}{\lambda}\mathbb{I}) is the noise, then we can express the smoothness with Lx = \epsilon, where

L = \left( \begin{array}{cccccc}
  -1 &  2 & -1 &        &    &    \\
     & -1 &  2 & -1     &    &    \\
     &    &    & \ddots &    &    \\
     &    &    & -1     &  2 & -1
\end{array} \right)

This allows us to express the prior over all the data points as p(x) = N(x|0,(\lambda^2L^TL)^{-1}). So if we partition the columns of L into L_1 and L_2, corresponding to the unseen (hidden) and seen (observed) points respectively, then, some maths later, we determine that the posterior mean (which is also the MAP) of the hidden points is \mu_{1|2}=-\Lambda_{11}^{-1}\Lambda_{12}x_{obs}, where \Lambda_{11} = L_1^TL_1 and \Lambda_{12} = L_1^TL_2 are blocks of the precision matrix.

The Code

So this is clearly what MATLAB was built to solve. In one line you can create the tridiagonal L matrix, and make it sparse for good measure. Then, solving for the MAP mean is pretty easy. How about Clojure ?

I'm going to use Clatrix (a Clojure wrapper around JBLAS), via the (under construction) core.matrix protocols, to try to crack this.

(defn L-matrix
  "Create the interpolation matrix.  This is GROSS"
  [n]
  (let [m0 (m/mul (m/identity-matrix :clatrix n) -1)
        m1 (reduce (fn [im i] (set-2d im i (inc i) 2)) m0 (range (dec n)))
        m2 (reduce (fn [im i] (set-2d im i (+ i 2) -1)) m1 (range (- n 2)))]
    (m/reshape m2 [(- n 2) n])))

Hmm.. nastiness indeed. Not unexpectedly, there's no tridiagonal constructor in core.matrix yet, and JBLAS doesn't offer a sparse matrix implementation either. Another implementation of core.matrix, based on Parallel Colt, is underway, which will offer the sparseness. Anyway, as you can see, I've used a dense matrix and then manually set the diagonal terms. It's not as gross as for loops, but it's not great.
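As a quick illustrative check (not from the original post), for a small n the construction above should give exactly the banded (n-2)×n matrix from the maths:

;; (L-matrix 5), written out as nested vectors:
;; [[-1  2 -1  0  0]
;;  [ 0 -1  2 -1  0]
;;  [ 0  0 -1  2 -1]]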

Setting up the problem is fine: we just create a map holding the L matrix, then randomly choose some values of x to be observed and give them values. All very neat.

(defn setup
  "Return a map of the problem setup.  Hidden, observed data and values"
  [n n-observed lambda]
  (let [i (shuffle (range n))]
    {:L               (m/mul (L-matrix n) lambda)
     :observed        (take n-observed i)
     :hidden          (drop n-observed i)
     :observed-values (m/matrix :clatrix
                                (repeatedly n-observed rand))}))

Now for solving the problem itself, which again is not bad at all:

(defn solve
  "Return the MAP for each hidden point"
  [{:keys [L observed hidden observed-values] :as m}]
  (let [nc  (m/column-count L)
        nr  (m/row-count L)
        L1  (c/get L (range nr) hidden)
        L2  (c/get L (range nr) observed)
        l11 (m/mul (m/transpose L1) L1)
        l12 (m/mul (m/transpose L1) L2)]
    (assoc m
      :hidden-values
      (m/scale
       (m/mul (m/mul (m/inverse l11) l12) observed-values)
       -1))))

We partition into L_1 and L_2, do some matrix inversions and multiplications per the maths above, and get our answer. It's more verbose than the MATLAB, certainly, but it's quite clear.

Finally, let's have a look at the interpolated curve, here using Incanter's charts:

(let [s (solve (setup 150 10 30))]
  (view
   (add-points
    (xy-plot (concat (:hidden s) (:observed s))
             (concat (:hidden-values s) (:observed-values s)))
    (:observed s) (:observed-values s))))

So that's rather nice, isn't it? The full code is on GitHub.

Summary

In this post we explored a pure-play matrix problem to see how Clojure handled it. Clojure's matrix story, driven by core.matrix (thanks in huge part to @mikera), is improving quickly. It's still got a way to go before you'd actively choose it in preference to something else, but the promise is there, and, as shown here, you can actually do fairly non-trivial things already. Happiness!


Data all the Query: Composition and Abstraction of Datomic Queries

Introduction

Composition and abstraction are the downforce and acceleration of coding. Yet when dealing with databases these operations are usually unavailable due to the limitations of SQL as a DSL. Thus database code is invariably ugly: string manipulations or ORM. Datomic presents a radical change that allows queries to be composed and abstracted naturally, which it achieves through the idea of queries-as-data. In this post I hope to demonstrate the query-as-data idea and how it enables clean composition and abstraction.

We'll demonstrate this by building a small analysis around some US presidential election data. The code and data are available on Github. Load up the core.clj file and go.

The Dataset

We'll work with a tiny dataset of Electoral College Votes for US President by State and Year, courtesy of the good people of Infochimps. This dataset shows, by year and state, how many Electoral College votes the state cast and whether they were for Team Red, Blue or Independent.

We need a database schema. Our entities are parties, states and votes. A vote has a year, a party, and a state. I'm discarding the number of college votes and just storing whether the state plumped for R, D or I (nobody understands the electoral college anyway!). The ingestion of the data file into the db is handled by ingest.clj. It's not especially interesting. The schema is ubersimple:

  [;; ------- States ---------
   {:db/id #db/id[:db.part/db]
    :db/ident :state/name
    :db/valueType :db.type/string
    :db/cardinality :db.cardinality/one
    :db/index true
    :db/unique :db.unique/value
    :db/doc "Name of a state"
    :db.install/_attribute :db.part/db}

   ;; ------- Party ----------
   {:db/id #db/id[:db.part/db]
    :db/ident :party/name
    :db/valueType :db.type/string
    :db/cardinality :db.cardinality/one
    :db/index true
    :db/unique :db.unique/value
    :db/doc "Political Party"
    :db.install/_attribute :db.part/db}

   ;; ------- Vote ----------
   {:db/id #db/id[:db.part/db]
    :db/ident :year
    :db/valueType :db.type/string
    :db/cardinality :db.cardinality/one
    :db/index true
    :db/doc "In which year was the vote"
    :db.install/_attribute :db.part/db}

   {:db/id #db/id[:db.part/db]
    :db/ident :party
    :db/valueType :db.type/ref
    :db/cardinality :db.cardinality/one
    :db/index true
    :db/doc "For which party"
    :db.install/_attribute :db.part/db}

   {:db/id #db/id[:db.part/db]
    :db/ident :state
    :db/valueType :db.type/ref
    :db/cardinality :db.cardinality/one
    :db/index true
    :db/doc "From which state"
    :db.install/_attribute :db.part/db}]

The Base Query

Now that the data is in we want to query it, which we handle in query.clj. The idea is to create a base query which returns the results of a big join across all the entities we care about: votes, states and parties.

So, starting at the bottom, the join is expressed as a Datomic rule.

(def vote-join
  '[[vote-join ?v ?y ?p ?s]
    [?v :year ?y]
    [?v :state ?s-id]
    [?v :party ?p-id]
    [?s-id :state/name ?s]
    [?p-id :party/name ?p]])

This is a list that expresses the join between the entities. The key thing here is that this is one element of a database query and I can pull it out and put a handle on it: I've abstracted the join. We'll return to this in a moment.

Next I have to put the join into a query:

(def base-query
  {:find   ['?v]
   :in     ['$ '%]
   :where  ['(vote-join ?v ?y ?p ?s)]})

This is a map, expressing a database query that returns all the votes from the vote-join. I've stitched the join into the query by using it as a rule in the query (thanks to @tgkristensen who showed me how)...

Finally, I want to execute the query against the database:

(defn votes
  "Return all the votes from the db"
  [query]
  (q query (db @db/conn) [vote-join]))

I can now get all the votes in the database by calling

(votes base-query)


Composition

The base query is not especially interesting. I want to be able to ask things about states and years. Now, because the query is expressed as a data structure (the key point), I can use ordinary data-manipulation functions to modify it, in this case by adding a constraint.

Let's start with years:

(defn year [y]
  [['?v :year y]])

This is a little query fragment that says the :year attribute of the vote must have the value y. To stitch it into the query:

(defn constrain
  "Apply constraints to a query"
  [query constraints]
  (update-in query [:where] (partial apply concat) constraints))

This is the key magic function. I take an input query and, because it's a map, I just add a new clause to the :where value using the familiar update-in! That's pretty amazing.

So the query before constraining is base-query:

{:find [?v], :in [$ %], :where [(vote-join ?v ?y ?p ?s)]}

and after

(constrain base-query [(year "2000")])
{:find [?v], :in [$ %], :where ((vote-join ?v ?y ?p ?s) [?v :year "2000"])}

The output is still just a normal Clojure data structure, so I can

(count (votes (constrain base-query [(year "2000")])))

and have it return 51 (50 states and Washington DC).

Of course I can add more than one constraint in exactly the same way. Here's the constraint fragment for states, joining through to the vote:

(defn state [s]
  [['?v :state '?s-id]
   ['?s-id :state/name s]])

and as you expect

(count (votes (constrain base-query [(state "Georgia") (year "2000")])))

returns 1!

As you can pass arbitrary functions into Datomic queries, you can make rather powerful and useful constraints. Here's the code to constrain the results to votes after a given year:

(defn >year* [y1 y2] (> (Integer. y1) (Integer. y2)))
(defn >year [y]
  [`[(dataquery.query/>year* ~(symbol '?y) ~y)]])

It's a little tricksy because I haven't quite figured out how to properly quote namespace-qualified symbols. Anyway.
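For reference (my expansion, not from the original post), the fragment this produces should look roughly like:

;; (>year 1980) evaluates to something like:
;; [[(dataquery.query/>year* ?y 1980)]]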

So that's how, by representing queries as data structures, Datomic allows you to compose them in a really natural fashion.

Abstraction

Next up the food chain from composition is abstraction. Here I'll give a little example of how to abstract database queries into probabilistic queries. Say we want to answer questions like: what is the probability of a west-coast state voting Republican? This can be expressed mathematically as the conditional probability P(Vote is Republican | State is West Coast). Datomic is beautiful for this sort of work because it deals in "facts", where probability deals in "events"; both are different domain jargon for the same thing! An impedance-matching property, perhaps.

So let's raise the level of abstraction. To do probability queries we don't care what the actual votes are, just the counts. So let's create a function that returns the number of votes that meet a given set of constraints.

(def num-votes (comp count votes (partial constrain base-query)))

The same thing could have been achieved using the count aggregate function in Datomic itself:

(def count-query
  (assoc base-query :find ['(count ?v)]))
(def num-votes (comp ffirst votes (partial constrain count-query)))

OK, now for the probability function, which is now trivial:

(defn p
  ([J]   (p J []))
  ([J C] (/ (num-votes (concat J C))
            (num-votes C))))

Here we are returning P(J | C), the probability of event J given event C, which maps directly onto the probability of the set of events matching constraint J given the set of events matching constraint C. We have just used the fact that P(J | C) = P(J,C) / P(C) = (#{J,C}/N) / (#{C}/N) = #{J,C}/#{C}, so we can just divide out the counts to get the probabilities.
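One constraint used below, party, isn't defined above; it presumably mirrors the state fragment (a sketch only, the real version lives in the repo):

(defn party [p]
  [['?v :party '?p-id]
   ['?p-id :party/name p]])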

So, what's the probability of a state voting independent ?

(p [(party "I")])

=> 2/187. Not so good. How about Texas voting Democrat post-1980 ?

(p [(party "D")] [(state "Texas") (>year 1980)])

=> 0.

Probability Summary

In summary, we can abstract up from database queries to probability queries very cleanly. This is due not only to a nice match between our database and our domain, but also to Datomic's ability to compose queries. I state my probability problem with a data structure of functions:

[(state "Texas")  (>year 1980)]

and so on; these are then composed into database queries. The probability "DSL" is thin and natural. All the query and probability code comes to 75 lines, comments, formatting and the rest included.

Bits and bobs

One potential weakness in this scheme is that the logic variables used in the various query fragments have to match up. This is leaky, and I don't especially like it, but I can't see a way around it right now. Suggestions ?

A cool trick is to memoize the votes function to prevent the query from having to run every time. To get that right, though, you need to parameterise by the database value, which would make this code a bit messy for a demo.

Summary

The advantages of Datomic are often hard to explain because it's more than the sum of its parts, and the parts themselves are quite unfamiliar. I hope this little example explains the idea of query-as-data and shows how it allows great composition of queries, and thus abstraction of the database layer up into the domain layer.


Absolutely minimal text classification

tl;dr

Bare bones text classifier implemented in Clojure. Nothing to raise pulses.

The next example from Kevin Murphy's tremendous ML book that I've tackled in Clojure is the naive Bayes text classifier. It's an old chestnut: figure out whether a given post comes from comp.os.ms-windows or comp.windows.x. The code and data are on Github.

The statistical model is pretty basic: use a Bernoulli variable for the appearance (or not) of a word in a document of a given class. What you want to figure out is the parameter of this variable, theta: the probability of the word's appearance. To be Bayesian, we express the prior over this parameter as a Beta (Beta(1,1), which is uniform) and do MAP estimation.

The mechanics are even simpler: for each word-document (stem-document, actually) pair in the training set, record whether the word appeared in the document or not, and then average each word count across the documents of each class. You end up with the fraction of docs in which the stem appears, per class, and we call that theta. Incorporating the prior slightly wrinkles this, but it amounts to just a Laplace plus-one on the counts.

Once we've worked out the thetas (the posterior probability that a given stem appears in a document of a given class), we classify an unseen document by multiplying up these probabilities for each word in the document under each of the two classes and picking the maximum (the extension to more classes is obvious). Bare-bones MAP.
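To make the arithmetic concrete, here's a minimal sketch of that estimate-and-classify loop. This is not the Graph-based implementation from the repo described below; it assumes documents are simply represented as sets of stems, and that the class priors are equal.

(defn thetas
  "Add-one-smoothed estimate of P(stem appears | class), computed from
   the training docs of a single class."
  [docs]
  (let [n     (count docs)
        stems (reduce into #{} docs)]
    (into {} (for [s stems]
               [s (/ (inc (count (filter #(contains? % s) docs)))
                     (+ n 2))]))))

(defn log-score
  "Log probability of an unseen doc under one class's thetas."
  [theta doc]
  (reduce + (for [[stem p] theta]
              (Math/log (double (if (contains? doc stem) p (- 1 p)))))))

(defn classify
  "Pick the class whose thetas score the unseen doc highest."
  [class->theta doc]
  (key (apply max-key #(log-score (val %) doc) class->theta)))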

The implementation is not especially interesting either. I propagate the calculation as a value, as in the Number's Game post, but this time using Prismatic's Graph library. I used Graph just for an opportunity to play with it; it's really a bigger gun than we need in this fight.

As I said, nothing very interesting here. I'm just noting that it's trivially easy to do this sort of MAP estimation in Clojure. The extension to real problems is the interesting part: having Lucene at hand for the stemmer, or lazy seqs for giant document sets, or being able to drop the initial counting into Cascalog (for Hadoop-sized problems) and then bring the model back into REPL code, means that this same code steps easily into the real world.
