November 18, 2015

IDR 08: Filtering columns of data.table in R

Today, I've resumed my fight with some R "stats code" that's analyzing my important ML features.

This is a long post. Here is the ultimate outcome, as a public snippet on Github:

I wanted to try out a couple of analysis techniques on my local machine before running them in a distributed fashion across all of my datasets. To reduce my cycle time further, I also wanted to work with a smaller subset of the 10K test cases and 50K features in my full models right now.

First, I downloaded one of my full data files (1.6 GB), which was in CSV format with a header row as the first row. I have two classes in this training data, and I did not construct the file in any random order. I needed to take some training examples from the top and some from the bottom of the file.

To do this, I used old-school Unix commands on my machine (a Macbook) to cut out a decent subset from the raw data.
bryan$ head -n100 raw.csv > subset.csv
bryan$ tail -n500 raw.csv >> subset.csv
head and tail can take the top "n" lines or bottom "n" lines from a file, respectively. And the ">" redirects to a new file, while ">>" appends to an existing. At the end of these commands, I have a new CSV with the same structure as my original, but only 600L of data (plus header row) instead of 10K.

But there's another problem with my subset of data. My training examples were built on my original ~50K features, but now I only have 600 of the original examples. This means that a large percentage of my features are actually unnecessary, and can be dropped to make my processing even faster.

This isn't just a problem for me because I'm running locally. If, at any point later, I decide that I want to trim down my feature set, I'll need to perform the same types of "filtering" operations on the raw data rather than fully reprocessing test cases and their data for a subset of the same features. (Reprocessing from artifacts takes over 8 hours, and that time is dominated by fetching large binary artifacts from databases - so my feature processing time is not really important. BIG O strikes again!)

I also took the opportunity to speed up how I was loading data, so I found the data.table package. I successfully loaded data with "fread" from data.table. This puts my data in a matrix-like structure, with my 600 training examples as the rows and my 50K features as the columns.

For my first trick with fread, I needed to treat my feature labels as "categorical" in R, or what R calls factors. In my case, all features are binary (they are either present in a training example, or not), so I should have up to 50K features with 2 "levels" each. With fread, there is an option to "load strings as factors", so I used this to define my variables as factors from the start.

For my next trick, I had to build a list of all columns from my data.table which needed to be dropped. This turned out to be tricky, because data.table has some crazy syntax when trying to extract columns. I ultimately used an older representation of "data$colName" to extract factor variables as-is from the data.table. Then I was able to apply the "nlevels" function to determine whether

A full example of how I was doing all of this locally from a "subset" CSV is in this Github Gist:

IDR Series

1 comment: