A quick run through Vowpal Wabbit
Vowpal Wabbit (abbreviated to VW from now on) came to my attention again at NIPS 2015 - I first saw it at NIPS 2013 and have heard good things about it since, so I decided to give it a test run.
The submissions that won the competition this year (and the top ten in general) all seemed very hand-crafted / bespoke. In fact, for the winning solution the exact approach used is not clear - it is based on Gradient Boosting Machines (GBM), but the precise algorithm was not documented, for commercial reasons I guess.
I'm interested in how well Vowpal - a readily available, fast and well understood online learner - can perform on this problem set, and that is the focus of this post.
The challenge dataset
Classify visitors into "clickers" and "buyers" (so binary classification).
For "buyers", predict what they will buy (so a classic recommendation / ranking problem).
In this post, I'll focus on applying VW to part one of the problem - classification. We'll tackle part two in a second post.
About Vowpal Wabbit
Briefly, Vowpal Wabbit is a (very) fast online learner - just how fast for this problem you'll see in a bit.
How to apply VW
Most of my time with VW was spent first generating the input file and then adding in new features to improve classification accuracy. Vowpal is well written and easy to use. Don't underestimate this ease of use if you've ever had to generate input files in weird formats - I had to use NetCDF once for a recurrent neural network library and I still wince when I think about it!
Feature selection and engineering is more art than science in machine learning, but Vowpal has a very nice tendency to just get on and use whatever you pass into it.
As with any machine learning problem, select the features of interest and transform your input data set to the format VW wants.
Run VW (either from the command line or using the various language bindings that ship with it - we'll use the command line here for brevity) to solve the problem you want to solve (classification in our case).
Test and measure what we've done with VW, ideally on a separate test set for reliable results, but at worst on the training set just to sanity check what you've done.
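To make the transformation step concrete, here is a minimal sketch (illustrative only - not the repo's actual code) of emitting one example in VW's input format: label, importance weight, tag, then pipe-separated namespaces of name:value features.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class VwLineWriter {

    // Build one VW example line: "<label> <weight> '<tag>|<ns> f1:v1 f2:v2 |<ns2> ..."
    static String toVwLine(int label, double weight, String tag,
                           Map<String, Map<String, Double>> namespaces) {
        StringBuilder sb = new StringBuilder();
        sb.append(label).append(' ').append(weight).append(" '").append(tag);
        for (Map.Entry<String, Map<String, Double>> ns : namespaces.entrySet()) {
            sb.append('|').append(ns.getKey());
            for (Map.Entry<String, Double> f : ns.getValue().entrySet()) {
                sb.append(' ').append(f.getKey()).append(':').append(f.getValue());
            }
            sb.append(' ');
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        // Illustrative features only - the real parser emits many more.
        Map<String, Double> agg = new LinkedHashMap<>();
        agg.put("numClicks", 4.0);
        agg.put("lifespan", 351.0);
        Map<String, Map<String, Double>> namespaces = new LinkedHashMap<>();
        namespaces.put("AggregateFeatures", agg);
        // Label 0 (clicker), importance weight 1.0, tag "clicker"
        System.out.println(toVwLine(0, 1.0, "clicker", namespaces));
        // prints: 0 1.0 'clicker|AggregateFeatures numClicks:4.0 lifespan:351.0
    }
}
```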
YOOCHOOSE dataset transformation
Loads in all of the click and purchase events and groups them by user from earliest to latest in a common Event model.
Calculates some features to improve accuracy, i.e. make Vowpal's job of differentiating between clickers and purchasers easier. The parser calculates some of the features from the paper which won the competition.
Enumerates all the events and converts them into Vowpal features, using a separate namespace for each event.
Note: this parser really wants 8 GB RAM to run - I wrote an analyse() method to figure out what data I wanted to include / exclude and to calculate features that roughly approximate item and category popularity. The easiest way to do this is to hold all of the events in memory in a Map<Integer, List<Event>> keyed by visitor ID.
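A minimal sketch of that grouping step - the Event record and its fields here are illustrative stand-ins, not the repo's actual classes:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class SessionGrouper {

    // Illustrative stand-in for the repo's common Event model.
    record Event(int visitorId, String timestamp, int itemId, int categoryId) {}

    // Group events by visitor ID, preserving insertion (i.e. file) order.
    static Map<Integer, List<Event>> groupByVisitor(List<Event> events) {
        Map<Integer, List<Event>> visitors = new LinkedHashMap<>();
        for (Event e : events) {
            visitors.computeIfAbsent(e.visitorId(), k -> new ArrayList<>()).add(e);
        }
        return visitors;
    }

    public static void main(String[] args) {
        List<Event> events = List.of(
            new Event(1, "2014-04-07T10:51:09", 214536502, 0),
            new Event(1, "2014-04-07T10:54:09", 214536500, 0),
            new Event(2, "2014-04-07T13:56:37", 214662742, 0));
        Map<Integer, List<Event>> visitors = groupByVisitor(events);
        System.out.println(visitors.get(1).size()); // two clicks for visitor 1
    }
}
```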
The code uses ant to build and ivy to manage the dependencies (there are only three - junit, a CSV parser and a logger).
You can compile and run the code by running this from the console:
ant clean compile test
The code assumes that you have downloaded the data set and unpacked it (specifically yoochoose-clicks.dat and yoochoose-buys.dat) into the root directory of the cloned repo.
Here is some output from the parser - it processes the clicks / buys file at about 400k / second and outputs the Vowpal file at about 30k visitor examples / second.
hsheil@wordsworth:~/code/recsys2015$ ant clean compile test
Buildfile: /home/hsheil/code/recsys2015/build.xml

clean:
   [delete] Deleting directory /home/hsheil/code/recsys2015/bin
   [delete] Deleting directory /home/hsheil/code/recsys2015/lib

resolve:
[ivy:retrieve] :: Apache Ivy 2.4.0-rc1 - 20140315220245 :: http://ant.apache.org/ivy/ ::
[ivy:retrieve] :: loading settings :: url = jar:file:/home/hsheil/.ant/lib/ivy-2.4.0-rc1.jar!/org/apache/ivy/core/settings/ivysettings.xml

compile:
    [mkdir] Created dir: /home/hsheil/code/recsys2015/bin
    [javac] Compiling 4 source files to /home/hsheil/code/recsys2015/bin

compile-tests:
    [javac] Compiling 1 source file to /home/hsheil/code/recsys2015/bin

test:
    [junit] Running data.yoochoose.YoochooseTest
    [junit] 12:46:58.634 [main] INFO data.yoochoose.YoochooseParser - Loading yoochoose-clicks.dat
    [junit] 12:47:03.633 [main] INFO data.yoochoose.YoochooseParser - 1863733 events processed (rate: 372000.0/sec)
    [junit] 12:47:08.634 [main] INFO data.yoochoose.YoochooseParser - 3951705 events processed (rate: 417000.0/sec)
    [junit] 12:47:13.635 [main] INFO data.yoochoose.YoochooseParser - 6006145 events processed (rate: 410000.0/sec)
    [junit] 12:47:18.677 [main] INFO data.yoochoose.YoochooseParser - 7937499 events processed (rate: 386000.0/sec)
    [junit] 12:47:23.763 [main] INFO data.yoochoose.YoochooseParser - 9539240 events processed (rate: 320000.0/sec)
    [junit] 12:47:28.764 [main] INFO data.yoochoose.YoochooseParser - 11421411 events processed (rate: 376000.0/sec)
    [junit] 12:47:33.765 [main] INFO data.yoochoose.YoochooseParser - 13503379 events processed (rate: 416000.0/sec)
    [junit] 12:47:38.766 [main] INFO data.yoochoose.YoochooseParser - 15524833 events processed (rate: 404000.0/sec)
    [junit] 12:47:43.767 [main] INFO data.yoochoose.YoochooseParser - 17610738 events processed (rate: 417000.0/sec)
    [junit] 12:47:48.768 [main] INFO data.yoochoose.YoochooseParser - 19826503 events processed (rate: 443000.0/sec)
    [junit] 12:47:53.999 [main] INFO data.yoochoose.YoochooseParser - 21954328 events processed (rate: 425000.0/sec)
    [junit] 12:47:59.000 [main] INFO data.yoochoose.YoochooseParser - 23957426 events processed (rate: 400000.0/sec)
    [junit] 12:48:04.001 [main] INFO data.yoochoose.YoochooseParser - 26011638 events processed (rate: 410000.0/sec)
    [junit] 12:48:09.066 [main] INFO data.yoochoose.YoochooseParser - 28189572 events processed (rate: 435000.0/sec)
    [junit] 12:48:14.358 [main] INFO data.yoochoose.YoochooseParser - 30487599 events processed (rate: 459000.0/sec)
    [junit] 12:48:19.359 [main] INFO data.yoochoose.YoochooseParser - 32843301 events processed (rate: 471000.0/sec)
    [junit] 12:48:20.110 [main] INFO data.yoochoose.YoochooseParser - 33003944 data lines -> 9249729 visitors, processed in 81 secs
    [junit] 12:48:20.110 [main] INFO data.yoochoose.YoochooseParser - Loading yoochoose-buys.dat
    [junit] 12:48:23.101 [main] INFO data.yoochoose.YoochooseParser - 1150753 data lines -> 9249729 visitors, processed in 2 secs
    [junit] 12:48:25.251 [main] INFO data.yoochoose.YoochooseParser - All max: 245937 secs, purchasers max: 27815 secs, avg: 401.23264 secs, single click 1237348, single purchase 0, unknown 0, max events 262, avg events 3.5587366
    <snip/>
    [junit] 12:50:55.359 [main] INFO data.yoochoose.YoochooseParser - 9175441 instance processed (rate: 56000.0/sec)
    [junit] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 238.136 sec

BUILD SUCCESSFUL
Total time: 4 minutes 0 seconds
On this machine, the parser runs in 4 minutes.
Validating our file
Ok now we have our file, let's sanity check it using the online validator.
Here's a single line from the file:
0 1.0 'clicker|AggregateFeatures numClicks:4 lifespan:351 sMonth:4 sDay:7 sWeekDay:1 sHour:10 sMin:51 sSec:9 eMonth:4 eDay:7 eWeekDay:1 eHour:10 eMin:57 eSec:0 numItems:4 numCategories:1 viewedPopularItems:0.0 viewedPopularCats:0.0 catSimilarity:0 |Event0 mth:4 day:7 hour:10 itemId:214536502 dwellTime:180 catId:0 special:0.0 |Event1 mth:4 day:7 hour:10 itemId:214536500 dwellTime:37 catId:0 special:0.0 |Event2 mth:4 day:7 hour:10 itemId:214536506 dwellTime:133 catId:0 special:0.0 |Event3 mth:4 day:7 hour:10 itemId:214577561 dwellTime:100 catId:0 special:0.0
And here is the elided output from the validator (tl;dr - it's good input):
Validation Feedback

Total of 1 examples pasted.
(example #1) Example “0 1.0 'clicker|A… Id:0 special:0.0”.
(example #1) Found “0 1.0 'clicker” ambiguous prefix with 3 elements.
(example #1) Example label / response / class is “0”.
(example #1) Importance weight is “1.0”.
(example #1) Example has default “0” base.
(example #1) Tag is “clicker”.
(example #1, namespace #1) Namespace found with name “AggregateFeatures”.
(example #1, namespace #1) Using default scale of “1.0” for namespace.
(example #1, namespace #1) Found 19 feature(s).
(example #1, namespace #1, feature #1) Label “numClicks”.
(example #1, namespace #1, feature #1) Value “4”.
(example #1, namespace #1, feature #2) Label “lifespan”.
(example #1, namespace #1, feature #2) Value “351”.
(example #1, namespace #1, feature #3) Label “sMonth”.
(example #1, namespace #1, feature #3) Value “4”.
(example #1, namespace #1, feature #4) Label “sDay”.
(example #1, namespace #1, feature #4) Value “7”.
(example #1, namespace #1, feature #5) Label “sWeekDay”.
(example #1, namespace #1, feature #5) Value “1”.
(example #1, namespace #1, feature #6) Label “sHour”.
(example #1, namespace #1, feature #6) Value “10”.
<snip/>
Our file format is good, so now we can run Vowpal on our data.
Running Vowpal Wabbit
Vowpal can be run in three ways:
Invoked from the command line.
Embedded into your code as a library using the shipped bindings (Java, C#, R, Python, C).
In daemon mode (essentially a server that waits for inbound prediction requests).
We'll be using option 1 in this post.
Vowpal is pretty easy to install on Linux (I use Debian) - ensure you install the Boost libraries first for your platform and then install Vowpal following the installation instructions - it's pretty much git clone and make.
Ok, let's run vw on our generated input file. This is the timed, complete output (Vowpal logging backs off as the amount of data grows which is nice - no console spamming!):
time vw -d yoochoose-train.vw

Num weight bits = 18
learning rate = 0.5
initial_t = 0
power_t = 0.5
using no cache
Reading datafile = yoochoose-train.vw
num sources = 1
average    since         example     example  current  current  current
loss       last          counter      weight    label  predict features
0.000000   0.000000            1         1.0   0.0000   0.0000       36
0.000000   0.000000            2         2.0   0.0000   0.0000       47
0.000000   0.000000            4         4.0   0.0000   0.0000       27
0.000000   0.000000            8         8.0   0.0000   0.0000       32
0.102080   0.204159           16        16.0   0.0000   0.0646       37
0.128184   0.154289           32        32.0   0.0000   0.0138       26
0.083987   0.039790           64        64.0   0.0000   0.0000       27
0.066823   0.049660          128       128.0   0.0000   0.0000       27
0.072272   0.077722          256       256.0   1.0000   0.0560       41
0.076569   0.080865          512       512.0   0.0000   0.0000       27
0.072003   0.067437         1024      1024.0   0.0000   0.0000       27
0.067769   0.063535         2048      2048.0   0.0000   0.0397       27
0.063839   0.059909         4096      4096.0   0.0000   0.0000       27
0.062026   0.060212         8192      8192.0   0.0000   0.1001       31
0.056768   0.051511        16384     16384.0   0.0000   0.0000       21
0.054861   0.052954        32768     32768.0   0.0000   0.0104       27
0.052077   0.049293        65536     65536.0   1.0000   0.5093       94
0.049467   0.046856       131072    131072.0   0.0000   0.0000       27
0.047585   0.045704       262144    262144.0   0.0000   0.0000       28
0.046592   0.045598       524288    524288.0   0.0000   0.4964      293
0.045002   0.043413      1048576   1048576.0   0.0000   0.0000       26
0.042893   0.040785      2097152   2097152.0   0.0000   0.0924       33
0.041382   0.039870      4194304   4194304.0   0.0000   0.0388       42
0.038183   0.034984      8388608   8388608.0   0.0000   0.1082       33

finished run
number of examples per pass = 9249729
passes used = 1
weighted example sum = 9249729.000000
weighted label sum = 509696.000000
average loss = 0.038001
best constant = 0.055104
best constant's loss = 0.052067
total feature number = 348970047

real    0m51.675s
user    1m18.433s
sys     0m5.552s
The speed is impressive, and so is the scalability. We've done no tuning of Vowpal at all so far - no hyperparameter search and no optimisation for speed / dataset size.
Measuring accuracy (precision and recall)
Ok, so how did Vowpal really do? Rich Caruana co-chaired the KDD Cup competition in 2004 and made perf available as part of that exercise. The perf program calculates many metrics, including AUC, but for now I just want the confusion matrix for our two target labels. By default, perf.c defines MAX_ITEMS to 50000 on line 6, so change that to 5000000 and recompile with make. Now we generate the target and prediction files that perf wants using the cut command on Linux, and then we can generate our confusion matrix (not on a separate test set yet - we're still using the training set, where accuracy will inevitably be much better).
Note: Perf is also used in one of the VW examples, and I've applied it here to our data set as well.
Shell script to run vw and perf together
#!/bin/bash
echo Running vw
time vw -d yoochoose-train.vw --predictions p_out
echo Processing predictions
cut -d' ' -f1 yoochoose-train.vw > targets
cut -d' ' -f1 p_out > predicted
echo Running perf
../perf/perf.src/perf -confusion -files targets predicted
And here is the relevant part of the perf output:
            Predicted_1    Predicted_0    Threshold 0.500000
True_1       70864.0000    438832.0000
True_0       46398.0000   8693635.0000

ACC    0.94326   freq_thresh 0.261766
The ACC score is nice and high, but only because clickers vastly outnumber buyers - the confusion matrix shows that we found just 13.9% of buyers (70864 / (70864 + 438832)), and that's while using the training set for validation.
Looking at the evaluation metric for the RecSys 2015 challenge, we can only get a partial score because we haven't built the recommendation component yet. So we give ourselves a point for every correctly identified buyer (True Positive or TP) and deduct a point for every clicker we incorrectly classified as a buyer (False Positive or FP) - so 70864 - 46398 = 24466.
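Those two numbers fall straight out of the confusion matrix; a quick worked check (counts taken from the perf output above, everything else illustrative):

```java
public class PartialScore {

    // Recall: the fraction of actual buyers we identified.
    static double recall(long tp, long fn) {
        return (double) tp / (tp + fn);
    }

    // Partial challenge score: +1 per true positive, -1 per false positive.
    static long partialScore(long tp, long fp) {
        return tp - fp;
    }

    public static void main(String[] args) {
        long tp = 70_864;   // buyers correctly classified as buyers
        long fn = 438_832;  // buyers we missed
        long fp = 46_398;   // clickers wrongly classified as buyers

        System.out.printf("recall = %.3f%n", recall(tp, fn));          // 0.139
        System.out.println("partial score = " + partialScore(tp, fp)); // 24466
    }
}
```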
So there's more work to do, viz:
Build a recommendation engine using Vowpal (it looks like Vowpal can do this, I "just" need to massage the YOOCHOOSE data into an acceptable format).
Add the similarity score (as measured by the Jaccard score).
Move away from calculating accuracy on the training set (which yields inflated numbers due to over-fitting) and use the test set and solution file that the RecSys team made available after the competition ended.
See what our final score is and see how we can improve it (either by tuning Vowpal using hyperparameters, or selecting / engineering better features).
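The Jaccard score mentioned in the list above is straightforward to compute; here is a minimal sketch over two sessions' item-ID sets (the sets in main are illustrative):

```java
import java.util.HashSet;
import java.util.Set;

public class Jaccard {

    // Jaccard similarity: |A intersect B| / |A union B|, in [0, 1].
    static double jaccard(Set<Integer> a, Set<Integer> b) {
        if (a.isEmpty() && b.isEmpty()) {
            return 0.0; // convention: two empty sets carry no similarity signal
        }
        Set<Integer> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        Set<Integer> union = new HashSet<>(a);
        union.addAll(b);
        return (double) intersection.size() / union.size();
    }

    public static void main(String[] args) {
        // Two sessions viewing overlapping item IDs.
        Set<Integer> s1 = Set.of(214536502, 214536500, 214536506);
        Set<Integer> s2 = Set.of(214536500, 214536506, 214577561);
        System.out.println(jaccard(s1, s2)); // 2 shared of 4 distinct items -> 0.5
    }
}
```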
So stay tuned for further updates and all comments appreciated!