January 30, 2016

New Side Project: Baseball

I'm starting a side project, developing a tool for Sabermetrics research.

Rather than repeat myself at length, you can read about the project on its website:

And you can read and respond to my "Initial Thoughts" here:

Let me know what you think!

November 30, 2015

IDR 13: Back to work

The intensive part of my final stretch to dissertation defense is now done. I managed to successfully collect and analyze data, and that data should be enough for a successful defense.

Up next, I'll be writing. I have about 5 chapters to write:

  • Intro
  • Background
  • Tools and Infrastructure
  • Experiments
  • Analysis and Conclusions

I'm writing them in a funny order: Chapter 3, then 2, then 4, then 5, then 1. I'll also be able to pull from material in my dissertation proposal, journal publication, and conference paper drafts. I'm hoping to have this all done by the first week of January, which may be ambitious on a nights-and-weekends schedule.

Thanks to everyone who happened to read and reach out over the past three weeks. I look forward to a successful defense very soon!

November 27, 2015

IDR 12: One app to go

Things are moving along pretty well for now. Since I've automated so much over the past 2-3 years, it's been pretty straightforward to gather data for additional applications.

I'm now down to my last app of the 4 in my primary dataset, which is awesome.

If the data continue to look "right" for the apps, I should be able to schedule my defense for some time toward the end of January.

Another thing I'm working on is a list of follow-ups that the research technique itself and the resulting data suggest would be worth researching next. Of course I'm hoping that I won't be the one doing any significant additional work, but tracking these helps me get them off of my mind while I finish up my own analysis. I've found that to be a pretty helpful benefit in general - I sleep much better once I have good TODO lists at the project, daily, and "someday maybe" levels.

I'm thankful.

IDR Series

November 21, 2015

IDR 11: Results are in for first app

My algorithm looks to have worked with 97% accuracy. Out of 1000 examples classified, there were 28 false positives and 1 false negative. In my case, the false negatives are more important. So another way to interpret this is that I only "missed" on 1 example out of 300 that my algorithm should have caught, which is great.
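
The arithmetic behind those numbers can be checked in a few lines of R. This is only a sketch: the post gives the totals and the error counts, so the 700 negatives and the true-positive and true-negative counts are derived here, and the variable names are mine.

```r
# Counts taken from the post; the negatives are derived (1000 - 300).
total     <- 1000
positives <- 300   # examples the algorithm should have caught
fp        <- 28    # false positives
fn        <- 1     # false negatives

negatives <- total - positives
tp <- positives - fn   # true positives
tn <- negatives - fp   # true negatives

accuracy  <- (tp + tn) / total   # share of all examples classified correctly
recall    <- tp / positives      # share of real cases that were caught
miss_rate <- fn / positives      # share of real cases that were missed

cat(sprintf("accuracy = %.3f, recall = %.4f, miss rate = %.4f\n",
            accuracy, recall, miss_rate))
```

Since false negatives matter most here, recall (or the miss rate) is the more telling number than raw accuracy.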

Now I need to verify that there isn't anything fishy going on. Which features are having the biggest impact? Are my training and test examples as random as they should be? Am I unintentionally cheating in any way with the data?

I'm finally asking research and analysis questions.

Also, I've moved my external base of operations from a Starbucks to a Harris Teeter grocery store. They have wifi, power, and there's college football on a TV. On top of that, the food and drinks are cheaper and healthier.

IDR 10: Test data initialization

Yesterday, I was humbly reminded just how much I still *don't* know about R.

First, I learned that there is apparently no straightforward way to map a row from one data.frame into the columns of another. This meant that it would be much simpler for my code to store all training and "test" examples for each experiment in a shared data frame, with a new column indicating training vs. test. I finally managed to rewrite that code and now have a single CSV being dumped with the new column and all examples.
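
A rough sketch of that single-data-frame layout, in base R. The column name "split", the 70/30 ratio, and the toy features are all assumptions for illustration; the post doesn't give the real ones.

```r
set.seed(42)  # arbitrary seed so the split is reproducible

# Hypothetical stand-in for the real feature data: 10 examples, 3 binary features.
examples <- data.frame(
  id = 1:10,
  f1 = rbinom(10, 1, 0.5),
  f2 = rbinom(10, 1, 0.5),
  f3 = rbinom(10, 1, 0.5)
)

# Mark a random 70% of rows as training and the rest as test.
n_train    <- round(0.7 * nrow(examples))
train_rows <- sample(nrow(examples), n_train)
examples$split <- ifelse(seq_len(nrow(examples)) %in% train_rows, "train", "test")

# One CSV holds every example; the split column says which is which.
out_file <- tempfile(fileext = ".csv")
write.csv(examples, out_file, row.names = FALSE)

# Downstream code can recover either subset by filtering on that column.
reloaded <- read.csv(out_file)
train <- reloaded[reloaded$split == "train", ]
test  <- reloaded[reloaded$split == "test", ]
```

Keeping everything in one frame means both subsets always share the same columns, which sidesteps the row-to-column mapping problem entirely.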

Then, I finally dealt with the fact that all of my data was being loaded with 0s and NAs instead of 0s and 1s. I was initializing a data.frame from a matrix of 0s, which led R to conclude that the only "level" in every variable in my model was "0" (the variables were at least being interpreted correctly as categorical variables, or factors in R). So when I tried to assign a new value, the assignment was essentially rejected and the value was set to NA.
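
Here is a small reproduction of that behavior in base R. It is a sketch, not my actual code: the column names are made up, and the explicit conversion to factors is an assumption about how the columns ended up categorical.

```r
# A matrix of zeros, as described above; the column names are hypothetical.
m <- matrix(0, nrow = 3, ncol = 2, dimnames = list(NULL, c("f1", "f2")))

# Convert every column to a factor. Each one ends up with the single level "0".
df <- as.data.frame(m)
df[] <- lapply(df, factor)
levels(df$f1)

# Assigning a value that is not an existing level produces NA,
# along with an "invalid factor level" warning.
df$f1[2] <- 1
df$f1
is.na(df$f1[2])
```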

There are apparently some ways around this, but it was easier for me to rewrite my code to set the 0 and 1 values within the matrix instead. Now creation of the data.frame looks to be working correctly (I still have a 2 hour cycle time to run this on a real dataset).
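
In sketch form, the rewrite looks something like this (again with hypothetical column names). The second half shows one of the alternatives, declaring both levels up front; that part is not what I did, just an illustration of a way around the problem.

```r
# Fill in the 0/1 values while the data is still a plain numeric matrix.
m <- matrix(0, nrow = 3, ncol = 2, dimnames = list(NULL, c("f1", "f2")))
m[2, "f1"] <- 1
m[3, "f2"] <- 1

# Only then build the data.frame and convert to factors,
# so both values are seen when the levels are inferred.
df <- as.data.frame(m)
df[] <- lapply(df, factor)
levels(df$f1)

# Alternative (an assumption, not from the post): declare both levels explicitly,
# so a later assignment of 1 is accepted even in a column that starts all zeros.
zeros <- as.data.frame(matrix(0, nrow = 3, ncol = 1, dimnames = list(NULL, "f1")))
zeros$f1 <- factor(zeros$f1, levels = c(0, 1))
zeros$f1[2] <- 1
zeros$f1
```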

So this turned out to be another frustrating day by the end, but it did have its moments of hope before these problems crept up. Here's hoping that today is a better one!

PS, I also got that dental work fixed that I mentioned a couple of posts ago. It actually hurt a bit more than the same procedure did last time, but hopefully that will go away.

November 19, 2015

IDR 09: Renewed sense of hope

Yesterday sucked. I realized that one of my main sources of feature data isn't going to work correctly with the other main source of features. This required adjusting my experiments pretty significantly.

But today, I solved that problem. I corrected the experiment in a way that uses data I've been collecting for 2 years, so that's a really good thing.

I also realized that there are some things I can do to improve my model's predictive performance. I'm using the glmnet package in R, maintained by Trevor Hastie from Stanford (one of the professors from the StatLearning course and videos I mentioned in a previous post). I realized that I'm doing a couple of dumb things with glmnet, and that there are alternatives that are more appropriate for my dataset. I'm going to redo some of those statistics and see if they improve (they should, and possibly significantly).

So today, there's a lot more hope to go around than there has been the past couple of days. I have a follow-up with my professor in the morning, so I'll be pushing to get a large batch of experimental results today.

On a separate thread, I'm also still digging for better features while the current round of experiments is executing. ML is very sensitive to "garbage in, garbage out", and I may need some additional features to consider if the performance of my technique doesn't match up.

I expect to have an eventful and productive day!

Thanks for reading.

November 18, 2015

IDR 08: Filtering columns of data.table in R

Today, I've resumed my fight with some R "stats code" that's analyzing my important ML features.

This is a long post. Here is the ultimate outcome, as a public snippet on GitHub:

I wanted to try out a couple of analysis techniques on my local machine before running them in a distributed fashion across all of my datasets. To reduce my cycle time further, I also wanted to work with a smaller subset of the 10K test cases and 50K features in my full models right now.

First, I downloaded one of my full data files (1.6 GB), which was in CSV format with a header row as the first row. I have two classes in this training data, and I did not construct the file in any random order. I needed to take some training examples from the top and some from the bottom of the file.

To do this, I used old-school Unix commands on my machine (a MacBook) to cut out a decent subset from the raw data.

    bryan$ head -n100 raw.csv > subset.csv
    bryan$ tail -n500 raw.csv >> subset.csv

head and tail take the top "n" lines or bottom "n" lines from a file, respectively. The ">" redirects output to a new file, while ">>" appends to an existing one. At the end of these commands, I have a new CSV with the same structure as my original, but only 600 lines (the header row plus 599 examples, since head counts the header among its 100 lines) instead of 10K.

But there's another problem with my subset of data. My training examples were built on my original ~50K features, but now I only have about 600 of the original examples. This means that a large percentage of my features no longer vary across the subset, so they are unnecessary and can be dropped to make my processing even faster.

This isn't just a problem for me because I'm running locally. If, at any point later, I decide that I want to trim down my feature set, I'll need to perform the same types of "filtering" operations on the raw data rather than fully reprocessing test cases and their data for a subset of the same features. (Reprocessing from artifacts takes over 8 hours, and that time is dominated by fetching large binary artifacts from databases - so my feature processing time is not really important. BIG O strikes again!)

I also took the opportunity to speed up how I was loading data, which led me to the data.table package. I successfully loaded data with "fread" from data.table. This puts my data in a matrix-like structure, with my roughly 600 training examples as the rows and my 50K features as the columns.

For my first trick with fread, I needed R to treat my features as "categorical", or what R calls factors. In my case, all features are binary (they are either present in a training example, or not), so I should have up to 50K features with 2 "levels" each. fread has an option to load strings as factors (the stringsAsFactors argument), so I used this to define my variables as factors from the start.
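
A minimal sketch of that loading step, using a tiny stand-in CSV. The "present"/"absent" encoding and the column names are assumptions for illustration, not my real file format. Note that stringsAsFactors only affects character columns; features stored as numeric 0/1 are read as integers and need an explicit conversion, shown in the second half.

```r
library(data.table)

# Write a tiny stand-in CSV; the real file has ~50K feature columns.
csv_file <- tempfile(fileext = ".csv")
writeLines(c("label,f1,f2,f3",
             "pos,present,absent,absent",
             "neg,absent,absent,present",
             "pos,present,absent,present"), csv_file)

# stringsAsFactors = TRUE turns character columns into factors while loading.
dt <- fread(csv_file, stringsAsFactors = TRUE)
sapply(dt, class)

# Numeric 0/1 columns come in as integers, so convert them by reference.
dt01 <- data.table(f1 = c(0L, 1L, 0L), f2 = c(0L, 0L, 0L))
cols <- names(dt01)
dt01[, (cols) := lapply(.SD, factor), .SDcols = cols]
sapply(dt01, class)
```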

For my next trick, I had to build a list of all columns from my data.table which needed to be dropped. This turned out to be tricky, because data.table has some crazy syntax when trying to extract columns. I ultimately used an older representation of "data$colName" to extract factor variables as-is from the data.table. Then I was able to apply the "nlevels" function to determine whether each feature still varied in my subset (a factor with only one level can be dropped).
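
One way to build that drop list is sketched below. It assumes the test is "fewer than two levels in the subset"; the toy table and column names are hypothetical, and the full version in the Gist may differ.

```r
library(data.table)

# Hypothetical subset: f2 never varies, so its factor has a single level.
dt <- data.table(
  f1 = factor(c(0, 1, 0, 1)),
  f2 = factor(c(0, 0, 0, 0)),
  f3 = factor(c(1, 0, 0, 1))
)

# The data$colName form returns the factor as-is, so nlevels works directly.
nlevels(dt$f2)

# Count levels for every column; a data.table is a list of columns,
# so sapply visits each one in turn.
level_counts <- sapply(dt, nlevels)

# Columns with fewer than two levels carry no information in this subset.
drop_cols <- names(level_counts)[level_counts < 2]

# Remove them by reference, without copying the table.
if (length(drop_cols) > 0) dt[, (drop_cols) := NULL]
names(dt)
```

Dropping by reference matters at this scale: with tens of thousands of columns, copying the table for every removal would undo the speedup from using data.table in the first place.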

A full example of how I was doing all of this locally from a "subset" CSV is in this GitHub Gist:
