Saturday, November 21, 2015

IDR 11: Results are in for first app

My algorithm looks to have worked with 97% accuracy. Out of 1000 examples classified, there were 28 false positives and 1 false negative. In my case, the false negatives are the more important kind of error. So another way to interpret this is that I only "missed" 1 of the 300 examples my algorithm should have caught, which is great.
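For anyone who wants to check the arithmetic, here's roughly how those two numbers fall out of the counts above (a quick R sketch, not part of my actual analysis script):

total <- 1000        # examples classified
fp <- 28             # false positives
fn <- 1              # false negatives
actual_pos <- 300    # examples my algorithm should have caught

accuracy  <- (total - fp - fn) / total   # 0.971, the ~97% above
miss_rate <- fn / actual_pos             # ~0.0033, i.e. 1 missed out of 300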

Now I need to verify that there isn't anything fishy going on. Which features are having the biggest impact? Are my training and test examples as random as they should be? Am I unintentionally cheating in any way with the data?

I'm finally asking research and analysis questions.

Also, I've moved my external base of operations from a Starbucks to a Harris Teeter grocery store. They have wifi, power, and there's college football on a TV. On top of that, the food and drinks are cheaper and healthier.

IDR Series

IDR 10: Test data initialization

Yesterday, I was humbly reminded just how much I still *don't* know about R.

First, I learned that there is apparently no straightforward way to map a row from one data.frame into the columns of another. This meant that it would be much simpler for my code to store all training and "test" examples for each experiment in a shared data frame, with a new column indicating training vs. test. I finally managed to rewrite that code and now have a single CSV being dumped with the new column and all examples.
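For the curious, the new arrangement is roughly the following (object and column names here are placeholders, not my actual code):

# Mark each example, stack them into one shared data.frame, and dump a single CSV.
train_df$is_test <- FALSE
test_df$is_test  <- TRUE
all_examples <- rbind(train_df, test_df)
write.csv(all_examples, "experiment.csv", row.names = FALSE)

# Splitting back out later is just a filter on the indicator column.
train <- all_examples[!all_examples$is_test, ]
test  <- all_examples[all_examples$is_test, ]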

Then, I finally dealt with the fact that all of my data was being loaded with 0s and NAs instead of 0s and 1s. I was initializing a data.frame from a matrix of 0s, which led the data.frame to think that the only "level" of every variable in my model was "0" (the variables were at least correctly being interpreted as categorical variables, or factors, in R). So when I actually tried to set a new value of "1", it was essentially rejected, and the value was set to NA.

There are apparently some ways around this, but it was easier for me to rewrite my code to set the 0 and 1 values within the matrix instead. Now creation of the data.frame looks to be working correctly (I still have a 2-hour cycle time to run this on a real dataset).
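Here's a tiny reproduction of the behavior, using a small character matrix for illustration (my real matrix comes out of the feature-extraction code):

# Problem: building the data.frame from an all-"0" matrix locks in "0" as the
# only factor level, so assigning a "1" later is rejected and becomes NA.
m  <- matrix("0", nrow = 3, ncol = 2)
df <- data.frame(m, stringsAsFactors = TRUE)
df[1, 1] <- "1"    # warning: invalid factor level, NA generated

# Workaround: set the 0/1 values in the matrix first, then build the data.frame.
m[1, 1] <- "1"
df <- data.frame(m, stringsAsFactors = TRUE)
str(df)            # the first column now has both "0" and "1" as levels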

So this turned out to be another frustrating day by the end, but it did have its moments of hope before these problems crept up. Here's hoping that today is a better one!

P.S. I also got the dental work I mentioned a couple of posts ago fixed. It actually hurt a bit more than the same procedure did last time, but hopefully that will go away.

IDR Series

Thursday, November 19, 2015

IDR 09: Renewed sense of hope

Yesterday sucked. I realized that one of my main sources of feature data isn't going to work correctly with the other main source of features. This required adjusting my experiments pretty significantly.

But today, I solved that problem. I corrected the experiment in a way that uses data I've been collecting for 2 years, so that's a really good thing.

I also realized that there are some things I can do to improve my model's predictive performance. I'm using the glmnet package in R, maintained by Trevor Hastie from Stanford (one of the professors from the StatLearning course and videos I mentioned in a previous post). I realized that I'm doing a couple of dumb things with glmnet, and that there are alternatives that are more appropriate for my dataset. I'm going to redo some of those statistics and see if they improve (they should, and possibly significantly).
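I won't spell out the exact changes until I've rerun things, but for context, the basic shape of a glmnet fit for a binary outcome looks something like this (x, y, and x_test stand in for my real feature matrix, labels, and held-out examples):

library(glmnet)

# x: matrix (ideally sparse) of binary features; rows are examples, columns are features.
# y: factor of class labels for the same examples.
cv_fit <- cv.glmnet(x, y, family = "binomial", alpha = 1)   # cross-validated lasso

# Classify held-out examples at the lambda with the lowest cross-validated error.
pred <- predict(cv_fit, newx = x_test, s = "lambda.min", type = "class")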

So today, there's a lot more hope to go around than there has been the past couple of days. I have a follow-up with my professor in the morning, so I'll be pushing to get a large batch of experimental results today.

On a separate thread, I'm also still digging for better features while the current round of experiments is executing. ML is very sensitive to "garbage in, garbage out", and I may need some additional features to consider if the performance of my technique doesn't match up.

I expect to have an eventful and productive day!

Thanks for reading.

IDR Series

Wednesday, November 18, 2015

IDR 08: Filtering columns of data.table in R

Today, I've resumed my fight with some R "stats code" that's analyzing my important ML features.

This is a long post. Here is the ultimate outcome, as a public snippet on Github:

I wanted to try out a couple of analysis techniques on my local machine before running them in a distributed fashion across all of my datasets. To reduce my cycle time further, I also wanted to work with a smaller subset of the 10K test cases and 50K features in my full models right now.

First, I downloaded one of my full data files (1.6 GB), which is a CSV with a header row. I have two classes in this training data, and I did not construct the file in any random order, so I needed to take some training examples from the top and some from the bottom of the file.

To do this, I used old-school Unix commands on my machine (a MacBook) to cut out a decent subset from the raw data.
bryan$ head -n100 raw.csv > subset.csv
bryan$ tail -n500 raw.csv >> subset.csv
head and tail take the top "n" lines or bottom "n" lines from a file, respectively. The ">" redirects output to a new file, while ">>" appends to an existing one. At the end of these commands, I have a new CSV with the same structure as my original, but only about 600 lines (including the header row) instead of 10K.

But there's another problem with my subset of data. My training examples were built on my original ~50K features, but now I only have 600 of the original examples, and a large percentage of those features never vary within this smaller subset. Those features are unnecessary and can be dropped to make my processing even faster.

This isn't just a problem for me because I'm running locally. If, at any point later, I decide that I want to trim down my feature set, I'll need to perform the same types of "filtering" operations on the raw data rather than fully reprocessing test cases and their data for a subset of the same features. (Reprocessing from artifacts takes over 8 hours, and that time is dominated by fetching large binary artifacts from databases - so my feature processing time is not really important. BIG O strikes again!)

I also took the opportunity to speed up how I was loading data, so I found the data.table package. I successfully loaded data with "fread" from data.table. This puts my data in a matrix-like structure, with my 600 training examples as the rows and my 50K features as the columns.

For my first trick with fread, I needed to treat my features as "categorical" in R, or what R calls factors. In my case, all features are binary (they are either present in a training example, or not), so I should have up to 50K features with 2 "levels" each. fread has a stringsAsFactors option to load strings as factors, so I used this to define my variables as factors from the start.

For my next trick, I had to build a list of all columns from my data.table which needed to be dropped. This turned out to be tricky, because data.table has some crazy syntax for extracting columns. I ultimately used the older "data$colName" style to extract the factor variables as-is from the data.table. Then I was able to apply the "nlevels" function to determine whether each factor actually had more than one level; any column with a single level never varies in this subset, so it went on the drop list.
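In simplified form, the local filtering ended up looking roughly like this (a sketch rather than the exact code in the Gist; the file name is just my local subset):

library(data.table)

# Load the subset; stringsAsFactors turns the character columns into factors up front.
dt <- fread("subset.csv", header = TRUE, stringsAsFactors = TRUE)

# A factor column with fewer than 2 levels never varies in this subset,
# so it can't help the model and goes on the drop list.
single_level <- sapply(dt, function(col) is.factor(col) && nlevels(col) < 2)
drop_cols    <- names(dt)[single_level]

# Remove those columns by reference (no copy of the big table is made).
dt[, (drop_cols) := NULL]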

A full example of how I was doing all of this locally from a "subset" CSV is in this Github Gist:

IDR Series

IDR 07: The human element

I've been working pretty hard, but the other vectors of life don't stop to accommodate my mission!

So far this week (it's only Tuesday):

  • Our car was due for maintenance, and its battery died. Also, front brakes had to be replaced.
  • My dental work is loose, and would take weeks (and thousands!) to fix
  • My wife and I need to (unexpectedly) travel for Thanksgiving next week
All of this is manageable stuff! I'm thankful to have the car, teeth, and family that I do have, and all of these are being sorted out within the next couple of weeks.

IDR Series

Sunday, November 15, 2015

IDR 06: NullPointerExceptions and sleepless nights

Since Friday afternoon, I've been chasing a bug in some old code. The bug that I mentioned in my previous post, the one I had finally managed to fix, was indeed fixed; but I wasn't using the new code correctly, and that led to a new one.

This bug was one that was ultimately pretty simple to fix: something called a NullPointerException.

In programming languages, we have variables. These aren't too unlike variables from Algebra problems - we give the variable a name and use it in the place of whatever actual value it will be set to when our code is eventually called. So we can write a block of code like this (in Java):

public int add(int a, int b) {
  return a + b;
}
We have at least two additional concerns when programming that aren't immediately obvious from Algebra:
  • The variables have a type (int in the snippet above)
  • The variables have an address
[As an aside, variables in Algebra usually do have some kind of "type": Rational, Irrational, Integer, etc. This is somewhat analogous to types in programming languages, in that the type implies some characteristics of a variable's value.]

In Java, the type of a variable is meant to be obvious. You have to specify the type everywhere, you can't "mix" types unless you add some additional overhead, etc. The idea is that the programming language will help you enforce this type so that you're less likely to introduce type-related bugs. It also makes it easier for the language to optimize your programs.

The address of variables is more interesting, though. If we want to add two numbers together out here in real life, we don't need to worry about addresses. Just tell me what numbers you'd like to add (i.e., the values of those numbers, like 2 and 5), and we can add them.

For "simple" values, this might be OK, but the types in modern programming languages can be very complex "Objects" which track multiple values. In order to work with an Object, it's easier to pass around a reference (traditionally called a "pointer") to the Object. When we need to modify or use a value from that object, the program ultimately starts with the reference in order to find it.

In my case, I had written a function and assumed that the objects passed into it would be properly set up before this function was used. So I had to track down who was using this function, where their objects came from, and why those objects might not have been set up correctly.

Ultimately, the problem was that my new feature extraction code requires a new input parameter of a "test suite ID" which I had forgotten to add to the calling script. My code did not complain loudly enough when this argument was missing - it just rolled along with several uninitialized Objects instead!

To defend against this in code, I added a few blocks like this, which cause the program to preemptively exit and print a "stack trace" for debugging when an object is null:
if (object == null) {
  throw new RuntimeException("Unexpected null object");
}
There are a couple of additional patterns/practices that can help avoid "NPEs", but I don't have time to implement them right now :)

IDR Series

Friday, November 13, 2015

IDR 05: R (and cycle times, again)

Cycle times are indeed killing me since Wednesday, but I am making some progress.

I've written an R script which will apply the Lasso method to my candidate features, which should help ID the most important ones.
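Roughly speaking, the call involved looks like this (x, y, and the lambda value are placeholders, not my actual script):

library(glmnet)

# Fit the lasso path (alpha = 1 is the lasso penalty) on the binary features.
fit <- glmnet(x, y, family = "binomial", alpha = 1)

# At a chosen lambda, the features with nonzero coefficients are the ones
# the lasso considers most important.
coefs     <- as.matrix(coef(fit, s = 0.01))
important <- setdiff(rownames(coefs)[coefs[, 1] != 0], "(Intercept)")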

I've had to run multiple cycles of "feature extraction", though, to deal with a bug in an "optimization" I had written in that code about a year ago. Ouch.

Now that I've got that sorted out, another hour to go until I have some feature data. Another 6 hours from that, and I'll have the full data ready for processing by R. To give an idea, I'm working with 10K test cases and 100K features per test case at the moment. I'm currently working with a single "app" (Application under Test, or AUT), but have 3 others ready once the features are finalized.

In the meantime, the ordinary grad student responsibilities don't really stop. I've got to review two papers, both with really bad English, for an international conference ... but at least that will be out of the way by the time my full data set is ready for this app!

Thanks for reading.

IDR Series