June 17, 2016

HOWTO: CloudFormation and Masterless Puppet on the Baseball Workbench Project

Within days of my successful dissertation defense in February, I started Baseball Workbench, a side project building a self-service tool for advanced baseball analytics, to have some fun and sharpen my development skills.

One area I've put way too much effort into so far is the automated creation and configuration of AWS resources for the project. That process is now completely automated for Baseball Workbench, using a combination of:
  • AWS CloudFormation
  • CloudFormation's support for cloud-init
  • r10k
  • hiera
  • puppet apply (AKA, local Puppet runs, without a Puppet master)
  • Custom "role" and "profile" Puppet classes
  • A custom Puppet module, "superbuilds", for configuring my CI server
  • A custom Puppet module, "aws_ec2_facts", for converting EC2 tags into Puppet facts
In this post, I walk through the details of each of these components in turn. Hopefully, this combination of implementation choices is interesting to you. Even if your stack differs, the concepts should translate to similar approaches.


January 30, 2016

New Side Project: Baseball

I'm starting a side project, developing a tool for Sabermetrics research.

Rather than extensively repeat myself, you can read about the project on its website:
http://bryantrobbins.github.io/baseball/

And you can read and respond to my "Initial Thoughts" here:
https://groups.google.com/d/msg/btr-baseball/4SvqoAMF5zY/LOfYAe0tAgAJ

Let me know what you think!

November 30, 2015

IDR 13: Back to work

The intensive part of my final stretch toward the dissertation defense is now done. I managed to collect and analyze my data, and the results should be enough for a successful defense.

Up next, I'll be writing. I have five chapters to write:

  • Intro
  • Background
  • Tools and Infrastructure
  • Experiments
  • Analysis and Conclusions
I'm writing them in a funny order: Chapter 3, then 2, then 4, then 5, then 1. I'll also be able to pull from material in my dissertation proposal, journal publication, and conference paper drafts. I'm hoping to have this all done by the first week of January, which may be ambitious on a nights-and-weekends schedule.

Thanks to everyone who happened to read and reach out over the past three weeks. I look forward to a successful defense very soon!

November 27, 2015

IDR 12: One app to go

Things are moving along well for now. Since I've automated so much over the past 2-3 years, it's been pretty straightforward to gather data for additional applications.

I'm now down to my last app of the 4 in my primary dataset, which is awesome.

If the data continue to look "right" for the apps, I should be able to schedule my defense for some time toward the end of January.

Another thing I'm working on is a list of follow-ups: things that the research technique itself and the resulting data suggest would be worth investigating next. Of course, I'm hoping that I won't be the one doing any significant additional work, but tracking these helps me get them off my mind while I finish up my own analysis. I've found that to be a pretty helpful benefit in general - I sleep much better once I have good TODO lists at the project, daily, and "someday maybe" levels.

I'm thankful.

November 21, 2015

IDR 11: Results are in for first app

My algorithm looks to have worked with about 97% accuracy: out of 1000 examples classified, there were 28 false positives and 1 false negative, so 971 of the 1000 were labeled correctly. In my case, the false negatives are more important. So another way to interpret this is that I "missed" only 1 of the roughly 300 examples my algorithm should have caught, which is great.
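
For my own bookkeeping, here's that arithmetic in R (the 300 positive examples is a rough count, not an exact figure):

    # Confusion counts from this round of experiments
    total     <- 1000   # examples classified
    fp        <- 28     # false positives
    fn        <- 1      # false negatives
    positives <- 300    # approximate count of examples that should be caught

    accuracy <- (total - fp - fn) / total     # 0.971
    recall   <- (positives - fn) / positives  # ~0.997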

Now I need to verify that there isn't anything fishy going on. Which features are having the biggest impact? Are my training and test examples as random as they should be? Am I unintentionally cheating in any way with the data?

I'm finally asking research and analysis questions.

Also, I've moved my external base of operations from a Starbucks to a Harris Teeter grocery store. They have wifi and power, and there's college football on a TV. On top of that, the food and drinks are cheaper and healthier.

IDR 10: Test data initialization

Yesterday, I was humbly reminded just how much I still *don't* know about R.

First, I learned that there is apparently no straightforward way to map a row from one data.frame into the columns of another. That meant it would be much simpler for my code to store all training and "test" examples for each experiment in a single shared data.frame, with a new column indicating training vs. test. I finally managed to rewrite that code, and I now have a single CSV being dumped with the new column and all examples.
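
In case it helps anyone hitting the same wall, the shape of what I ended up with looks roughly like this; the feature names here are placeholders, not my real ones:

    # Build training and test examples as rows of one shared data.frame,
    # with an indicator column instead of two separate frames
    train <- data.frame(f1 = c(0, 1), f2 = c(1, 1), label = c(0, 1))
    test  <- data.frame(f1 = c(1, 0), f2 = c(0, 1), label = c(1, 0))

    train$set <- "training"
    test$set  <- "test"

    examples <- rbind(train, test)
    write.csv(examples, "examples.csv", row.names = FALSE)

    # Splitting back out later is just a filter on the indicator column
    training <- examples[examples$set == "training", ]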

Then, I finally dealt with the fact that all of my data was being loaded as 0s and NAs instead of 0s and 1s. I was initializing a data.frame from a matrix of 0s, which led the data.frame to think that the only "level" of every variable in my model was "0" (the variables were at least correctly being interpreted as categorical variables, or factors, in R). So when I actually tried to assign a new value, it was rejected as an invalid factor level and set to NA.

There are apparently some ways around this, but it was easier for me to rewrite my code to set the 0 and 1 values within the matrix instead. Creation of the data.frame now looks to be working correctly (though I still have a 2-hour cycle time to run this on a real dataset).
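
Here's a tiny reproduction of the gotcha and the workaround, greatly simplified from my actual setup (and with stringsAsFactors spelled out, since that default is what triggers it):

    # The gotcha: a matrix of "0"s becomes a data.frame whose factor
    # columns only know the single level "0"
    m  <- matrix("0", nrow = 3, ncol = 2)
    df <- data.frame(m, stringsAsFactors = TRUE)

    df[1, 1] <- "1"   # warning: invalid factor level, NA generated

    # The workaround: write the 1s into the matrix first, then convert
    m[1, 1] <- "1"
    df <- data.frame(m, stringsAsFactors = TRUE)  # column 1 now has levels "0" and "1"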

So this turned out to be another frustrating day by the end, but it did have its moments of hope before these problems cropped up. Here's hoping that today is a better one!

PS: I also got the dental work done that I mentioned a couple of posts ago. It actually hurt a bit more than the same procedure did last time, but hopefully that will fade.

November 19, 2015

IDR 09: Renewed sense of hope

Yesterday sucked. I realized that one of my main sources of feature data isn't going to work correctly with the other main source of features. This required adjusting my experiments pretty significantly.

But today, I solved that problem. I corrected the experiment in a way that uses data I've been collecting for two years, so that's a really good thing.

I also realized that there are some things I can do to improve my model's predictive performance. I'm using the glmnet package in R, maintained by Trevor Hastie of Stanford (one of the professors behind the StatLearning course and videos I mentioned in a previous post). It turns out I've been doing a couple of dumb things with glmnet, and there are alternatives that are more appropriate for my dataset. I'm going to redo some of those statistics and see if the results improve (they should, possibly significantly).
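
I won't pretend this is my exact model code, but the general shape of a glmnet classification workflow, with cross-validation picking the penalty instead of a hard-coded value, looks something like this (x and y are random placeholders for my real features and labels):

    library(glmnet)

    # Placeholder data: 100 examples, 20 features, binary labels
    x <- matrix(rnorm(100 * 20), nrow = 100)
    y <- sample(0:1, 100, replace = TRUE)

    # Cross-validated fit chooses lambda by classification error
    cvfit <- cv.glmnet(x, y, family = "binomial", type.measure = "class")

    # Classify new examples at the lambda that minimized CV error
    preds <- predict(cvfit, newx = x, s = "lambda.min", type = "class")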

So today, there's a lot more hope to go around than there has been the past couple of days. I have a follow-up with my professor in the morning, so I'll be pushing to get a large batch of experimental results today.

On a separate thread, I'm also still digging for better features while the current round of experiments executes. ML is very much subject to "garbage in, garbage out", and I may need additional candidate features to consider if my technique's performance doesn't hold up.

I expect to have an eventful and productive day!

Thanks for reading.
