Humphrey Sheil - Blog

A quick run through Vowpal Wabbit

08 Jan 2016

Vowpal Wabbit (abbreviated to VW from now on) came to my attention again at NIPS 2015 - I first saw it at NIPS 2013 and I'd heard good things about it since, so I decided to give it a test run.

I attended the ACM RecSys 2015 conference in Vienna a few months ago, and the challenge dataset is very close to one of the data sets that I encounter regularly in my day job at Eysys.

The submissions that won the competition this year (and the top ten in general) all seemed very hand-crafted / bespoke. In fact, the exact approach used by the winning solution is not clear - it is based on Gradient Boosting Machines (GBM), but the precise algorithm was not documented, presumably for commercial reasons.

I'm interested in how well Vowpal - a readily available, fast and well-understood online learner - can perform on this problem, and that is the focus of this post.

The challenge dataset

Let's use the publicly available RecSys 2015 challenge dataset from YOOCHOOSE and see what we can get VW to do. This specific challenge is two-fold:

  1. Classify visitors into "clickers" and "buyers" (so binary classification).

  2. For "buyers", predict what they will buy (so a classic recommendation / ranking problem).

In this post, I'll focus on applying VW to part one of the problem - classification. We'll tackle part two in a second post.

About Vowpal Wabbit

Briefly, Vowpal Wabbit is a (very) fast online learner - you'll see just how fast on this problem in a bit.

A NIPS 2015 tutorial inspired me to get back into Vowpal, but there's a lot more documentation on the web, and what looks like an O'Reilly book coming soon as well.

How to apply VW

Most of my time with VW was spent first generating the input file and then adding new features to improve classification accuracy. Vowpal is well-written and easy to use. Don't underestimate this ease of use if you've ever had to generate input files in weird formats - I had to use NetCDF once for a recurrent neural network library and I still wince when I think about it!

Feature selection and engineering are more art than science in machine learning, but Vowpal has a very nice tendency to just get on and use whatever you pass to it.

So then:

  1. As with any machine learning problem, select the features of interest and transform your input data set to the format VW wants.

  2. Run VW (either from the command line or using the various language bindings that ship with it - we'll use the command line here for brevity) to solve the problem you want to solve (classification in our case).

  3. Test and measure what we've done with VW - ideally on a separate test set for an unbiased estimate, but at worst on the training set, just to sanity-check what you've done.

YOOCHOOSE dataset transformation

I wrote a parser in Java which generates a valid Vowpal input file from the YOOCHOOSE data set as follows:

  1. Loads all of the click and purchase events into a common Event model, grouped by user and ordered from earliest to latest.

  2. Calculates some features to improve accuracy, i.e. make Vowpal's job of differentiating between clickers and purchasers easier. The parser calculates some of the features from the paper which won the competition.

  3. Enumerates all the events and converts them into Vowpal features, using a separate namespace for each event.

Note: this parser really wants 8 GB RAM to run - I wrote an analyse() method to figure out what data I wanted to include / exclude and to calculate features that roughly approximated item and category popularity. The easiest way to do this is to have a Map<Integer, List> in memory.
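The parser itself is in Java, but the output format it targets can be sketched in a few lines of Python. This is my own illustration (the `to_vw_line` helper and the feature names passed to it are made up for the example, not code from the repo):

```python
# Hypothetical sketch of the transformation (not the repo's Java parser):
# one VW line per visitor - label, importance weight, tag, an aggregate
# namespace, then one namespace per click event.

def to_vw_line(is_buyer, tag, aggregates, events):
    label = "1" if is_buyer else "0"
    namespaces = ["|AggregateFeatures " +
                  " ".join(f"{k}:{v}" for k, v in aggregates.items())]
    for i, ev in enumerate(events):
        namespaces.append(f"|Event{i} " +
                          " ".join(f"{k}:{v}" for k, v in ev.items()))
    # the tag touches the first '|', as in VW's input format
    return f"{label} 1.0 '{tag}" + " ".join(namespaces)

# e.g. a two-click, non-buying visitor
line = to_vw_line(False, "clicker",
                  {"numClicks": 2, "numItems": 2},
                  [{"itemId": 214536502, "dwellTime": 180},
                   {"itemId": 214536500, "dwellTime": 37}])
print(line)
```

The real parser emits one such line per visitor, with the richer feature set you'll see in the example line below.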

The code uses ant to build and ivy to manage the dependencies (there are only three - junit, a CSV parser and a logger).

You can compile and run the code by running this from the console:

ant clean compile test

The code assumes that you have downloaded and unpacked the data set (specifically yoochoose-clicks.dat and yoochoose-buys.dat) into the root directory of the cloned repo.

Here is some output from the parser - it processes the clicks / buys files at about 400k events / second and outputs the Vowpal file at about 30k visitor examples / second.

hsheil@wordsworth:~/code/recsys2015$ ant clean compile test
Buildfile: /home/hsheil/code/recsys2015/build.xml
clean:
[delete] Deleting directory /home/hsheil/code/recsys2015/bin
[delete] Deleting directory /home/hsheil/code/recsys2015/lib
resolve:
[ivy:retrieve] :: Apache Ivy 2.4.0-rc1 - 20140315220245 :: http://ant.apache.org/ivy/ ::
[ivy:retrieve] :: loading settings :: url = jar:file:/home/hsheil/.ant/lib/ivy-2.4.0-rc1.jar!/org/apache/ivy/core/settings/ivysettings.xml
compile:
[mkdir] Created dir: /home/hsheil/code/recsys2015/bin
[javac] Compiling 4 source files to /home/hsheil/code/recsys2015/bin
compile-tests:
[javac] Compiling 1 source file to /home/hsheil/code/recsys2015/bin
test:
[junit] Running data.yoochoose.YoochooseTest
[junit] 12:46:58.634 [main] INFO  data.yoochoose.YoochooseParser - Loading yoochoose-clicks.dat
[junit] 12:47:03.633 [main] INFO  data.yoochoose.YoochooseParser - 1863733 events processed (rate: 372000.0/sec)
[junit] 12:47:08.634 [main] INFO  data.yoochoose.YoochooseParser - 3951705 events processed (rate: 417000.0/sec)
[junit] 12:47:13.635 [main] INFO  data.yoochoose.YoochooseParser - 6006145 events processed (rate: 410000.0/sec)
[junit] 12:47:18.677 [main] INFO  data.yoochoose.YoochooseParser - 7937499 events processed (rate: 386000.0/sec)
[junit] 12:47:23.763 [main] INFO  data.yoochoose.YoochooseParser - 9539240 events processed (rate: 320000.0/sec)
[junit] 12:47:28.764 [main] INFO  data.yoochoose.YoochooseParser - 11421411 events processed (rate: 376000.0/sec)
[junit] 12:47:33.765 [main] INFO  data.yoochoose.YoochooseParser - 13503379 events processed (rate: 416000.0/sec)
[junit] 12:47:38.766 [main] INFO  data.yoochoose.YoochooseParser - 15524833 events processed (rate: 404000.0/sec)
[junit] 12:47:43.767 [main] INFO  data.yoochoose.YoochooseParser - 17610738 events processed (rate: 417000.0/sec)
[junit] 12:47:48.768 [main] INFO  data.yoochoose.YoochooseParser - 19826503 events processed (rate: 443000.0/sec)
[junit] 12:47:53.999 [main] INFO  data.yoochoose.YoochooseParser - 21954328 events processed (rate: 425000.0/sec)
[junit] 12:47:59.000 [main] INFO  data.yoochoose.YoochooseParser - 23957426 events processed (rate: 400000.0/sec)
[junit] 12:48:04.001 [main] INFO  data.yoochoose.YoochooseParser - 26011638 events processed (rate: 410000.0/sec)
[junit] 12:48:09.066 [main] INFO  data.yoochoose.YoochooseParser - 28189572 events processed (rate: 435000.0/sec)
[junit] 12:48:14.358 [main] INFO  data.yoochoose.YoochooseParser - 30487599 events processed (rate: 459000.0/sec)
[junit] 12:48:19.359 [main] INFO  data.yoochoose.YoochooseParser - 32843301 events processed (rate: 471000.0/sec)
[junit] 12:48:20.110 [main] INFO  data.yoochoose.YoochooseParser - 33003944 data lines -> 9249729 visitors, processed in 81 secs
[junit] 12:48:20.110 [main] INFO  data.yoochoose.YoochooseParser - Loading yoochoose-buys.dat
[junit] 12:48:23.101 [main] INFO  data.yoochoose.YoochooseParser - 1150753 data lines -> 9249729 visitors, processed in 2 secs
[junit] 12:48:25.251 [main] INFO  data.yoochoose.YoochooseParser - All max: 245937 secs, purchasers max: 27815 secs, avg: 401.23264 secs, single click 1237348, single purchase 0, unknown 0, max events 262, avg events 3.5587366
<snip/>
[junit] 12:50:55.359 [main] INFO  data.yoochoose.YoochooseParser - 9175441 instance processed (rate: 56000.0/sec)
[junit] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 238.136 sec
BUILD SUCCESSFUL
Total time: 4 minutes 0 seconds

On this machine, the parser runs in 4 minutes.

Validating our file

Ok, now that we have our file, let's sanity-check it using the online validator.

Here's a single line from the file:

0 1.0 'clicker|AggregateFeatures numClicks:4 lifespan:351 sMonth:4 sDay:7 sWeekDay:1 sHour:10 sMin:51 sSec:9 eMonth:4 eDay:7 eWeekDay:1 eHour:10 eMin:57 eSec:0 numItems:4 numCategories:1 viewedPopularItems:0.0 viewedPopularCats:0.0 catSimilarity:0 |Event0 mth:4 day:7 hour:10 itemId:214536502 dwellTime:180 catId:0 special:0.0 |Event1 mth:4 day:7 hour:10 itemId:214536500 dwellTime:37 catId:0 special:0.0 |Event2 mth:4 day:7 hour:10 itemId:214536506 dwellTime:133 catId:0 special:0.0 |Event3 mth:4 day:7 hour:10 itemId:214577561 dwellTime:100 catId:0 special:0.0
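Before pasting it into the validator, we can tease the structure apart with a quick throwaway check (my own illustration - a naive split, not a full VW-format parser, and abridged to two namespaces for readability):

```python
# Quick structural check of a VW input line: split out the header
# (label, importance weight, tag) and count features per namespace.
# Naive - assumes no '|' appears inside a feature value.

line = ("0 1.0 'clicker|AggregateFeatures numClicks:4 lifespan:351 "
        "|Event0 mth:4 day:7 hour:10 itemId:214536502 dwellTime:180")

header, *namespaces = line.split("|")
label, weight, tag = header.split()
features = {ns.split()[0]: len(ns.split()) - 1 for ns in namespaces}

print(label, weight, tag)  # 0 1.0 'clicker
print(features)            # {'AggregateFeatures': 2, 'Event0': 5}
```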

And here is the elided output from the validator (tl;dr - it's good input):

Validation Feedback
Total of 1 examples pasted.
 (example #1) Example “0 1.0 'clicker|A… Id:0 special:0.0”.
(example #1) Found “0 1.0 'clicker” ambiguous prefix with 3 elements.
(example #1) Example label / response / class is “0”.
(example #1) Importance weight is “1.0”.
(example #1) Example has default “0” base.
(example #1) Tag is “clicker”.
(example #1, namespace #1) Namespace found with name “AggregateFeatures”.
(example #1, namespace #1) Using default scale of “1.0” for namespace.
(example #1, namespace #1) Found 19 feature(s).
(example #1, namespace #1, feature #1) Label “numClicks”.
(example #1, namespace #1, feature #1) Value “4”.
(example #1, namespace #1, feature #2) Label “lifespan”.
(example #1, namespace #1, feature #2) Value “351”.
(example #1, namespace #1, feature #3) Label “sMonth”.
(example #1, namespace #1, feature #3) Value “4”.
(example #1, namespace #1, feature #4) Label “sDay”.
(example #1, namespace #1, feature #4) Value “7”.
(example #1, namespace #1, feature #5) Label “sWeekDay”.
(example #1, namespace #1, feature #5) Value “1”.
(example #1, namespace #1, feature #6) Label “sHour”.
(example #1, namespace #1, feature #6) Value “10”.
<snip/>

Our file format is good, so now we can run Vowpal on our data.

Running Vowpal Wabbit

Vowpal can be run in three ways:

  1. Invoked from the command line.

  2. Embedded into your code as a library using the shipped bindings (Java, C#, R, Python, C).

  3. In daemon mode (essentially a server that waits for inbound prediction requests).

We'll be using option 1 in this post.

Vowpal is pretty easy to install on Linux (I use Debian) - ensure you install the Boost libraries first for your platform and then install Vowpal following the installation instructions - it's pretty much git clone and make.

Ok, let's run vw on our generated input file. This is the timed, complete output (Vowpal logging backs off as the amount of data grows which is nice - no console spamming!):

time vw -d yoochoose-train.vw 
Num weight bits = 18
learning rate = 0.5
initial_t = 0
power_t = 0.5
using no cache
Reading datafile = yoochoose-train.vw
num sources = 1
average  since         example        example  current  current  current
loss     last          counter         weight    label  predict features
0.000000 0.000000            1            1.0   0.0000   0.0000       36
0.000000 0.000000            2            2.0   0.0000   0.0000       47
0.000000 0.000000            4            4.0   0.0000   0.0000       27
0.000000 0.000000            8            8.0   0.0000   0.0000       32
0.102080 0.204159           16           16.0   0.0000   0.0646       37
0.128184 0.154289           32           32.0   0.0000   0.0138       26
0.083987 0.039790           64           64.0   0.0000   0.0000       27
0.066823 0.049660          128          128.0   0.0000   0.0000       27
0.072272 0.077722          256          256.0   1.0000   0.0560       41
0.076569 0.080865          512          512.0   0.0000   0.0000       27
0.072003 0.067437         1024         1024.0   0.0000   0.0000       27
0.067769 0.063535         2048         2048.0   0.0000   0.0397       27
0.063839 0.059909         4096         4096.0   0.0000   0.0000       27
0.062026 0.060212         8192         8192.0   0.0000   0.1001       31
0.056768 0.051511        16384        16384.0   0.0000   0.0000       21
0.054861 0.052954        32768        32768.0   0.0000   0.0104       27
0.052077 0.049293        65536        65536.0   1.0000   0.5093       94
0.049467 0.046856       131072       131072.0   0.0000   0.0000       27
0.047585 0.045704       262144       262144.0   0.0000   0.0000       28
0.046592 0.045598       524288       524288.0   0.0000   0.4964      293
0.045002 0.043413      1048576      1048576.0   0.0000   0.0000       26
0.042893 0.040785      2097152      2097152.0   0.0000   0.0924       33
0.041382 0.039870      4194304      4194304.0   0.0000   0.0388       42
0.038183 0.034984      8388608      8388608.0   0.0000   0.1082       33
finished run
number of examples per pass = 9249729
passes used = 1
weighted example sum = 9249729.000000
weighted label sum = 509696.000000
average loss = 0.038001
best constant = 0.055104
best constant's loss = 0.052067
total feature number = 348970047
real    0m51.675s
user    1m18.433s
sys 0m5.552s

The speed is impressive, and so is the scalability. We've done no tuning of Vowpal at all so far - neither hyperparameter search nor optimisation for speed / dataset size.
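One detail in that footer worth decoding: with the default squared loss, vw's "best constant" is just the label mean (weighted label sum / weighted example sum), and the best constant's loss on 0/1 labels is p * (1 - p). We can reproduce both figures from the run above (my own arithmetic check):

```python
# Reproducing vw's "best constant" footer lines for squared loss:
# the best single prediction is the label mean, and its squared loss
# on 0/1 labels is the variance p * (1 - p).
label_sum, example_sum = 509696.0, 9249729.0

best_constant = label_sum / example_sum
best_constant_loss = best_constant * (1 - best_constant)

print(f"{best_constant:.6f} {best_constant_loss:.6f}")  # 0.055104 0.052067
```

Our average loss of 0.038001 beats the best constant's loss of 0.052067, so the model has learned something beyond the base rate.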

Measuring accuracy (precision and recall)

Ok, so how did Vowpal really do? Rich Caruana co-chaired the KDD Cup competition in 2004 and made perf available as part of that exercise. The perf program calculates lots of metrics, including AUC, but for now I just want to see the confusion matrix for our two target labels. By default, perf.c defines MAX_ITEMS as 50000 on line 6, so change that to 5000000 and recompile using make. Now we generate the target and prediction files that perf wants using the cut command on Linux, and then we can generate our confusion matrix (not yet on a separate test set, but on the training set itself - where accuracy will inevitably look much better).

Note: Perf is also used in one of the VW examples, and I've applied it here to our data set as well.

Shell script to run vw and perf together

#!/bin/bash          
echo Running vw
time vw -d yoochoose-train.vw --predictions p_out
echo Processing predictions
cut -d' ' -f1 yoochoose-train.vw > targets
cut -d' ' -f1 p_out > predicted
echo Running perf
../perf/perf.src/perf -confusion -files targets predicted

And here is the relevant part of the perf output:

Predicted_1  Predicted_0   Threshold  0.500000
True_1  70864.0000       438832.0000
True_0  46398.0000       8693635.0000
ACC    0.94326   freq_thresh  0.261766

The ACC score is nice and high, but only because we have such a large number of clickers - the confusion matrix shows that we found only 13.9% of buyers (70864 / (70864 + 438832)), and that is with the training set used for validation.

Evaluation calculation for the RecSys 2015 challenge

Looking at the evaluation metric for the RecSys 2015 challenge, we can only get a partial score because we haven't built the recommendation component yet. So we give ourselves a point for every correctly identified buyer (True Positive or TP) and deduct a point for every clicker we incorrectly classified as a buyer (False Positive or FP) - so 70864 - 46398 = 24466.
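Both numbers are easy to double-check from perf's confusion matrix:

```python
# Double-checking the recall and partial challenge score from perf's
# confusion matrix: tp = buyers found, fn = buyers missed, fp = clickers
# incorrectly flagged as buyers.
tp, fn, fp = 70864, 438832, 46398

recall = tp / (tp + fn)  # share of buyers we actually found
score = tp - fp          # +1 per true positive, -1 per false positive

print(f"recall={recall:.3f}, score={score}")  # recall=0.139, score=24466
```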

So there's more work to do, viz:

  1. Build a recommendation engine using Vowpal (it looks like Vowpal can do this, I "just" need to massage the YOOCHOOSE data into an acceptable format).

  2. Add the similarity score (as measured by the Jaccard score).

  3. Move away from calculating accuracy on the training set - doing this yields inflated numbers due to over-fitting - and use the test set and solution file that the RecSys team made available after the competition ended.

  4. See what our final score is and see how we can improve it (either by tuning Vowpal using hyperparameters, or selecting / engineering better features).
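For item 2, the Jaccard score itself is trivial to compute - here's a minimal sketch (my own illustration, applied to sets of item IDs):

```python
# Jaccard similarity between two item sets: intersection size over
# union size, defined here as 0.0 when both sets are empty.

def jaccard(a, b):
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

# two sessions sharing one of three distinct items -> 1/3
print(jaccard({214536502, 214536500}, {214536500, 214577561}))
```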

So stay tuned for further updates and all comments appreciated!

