Yesterday I finally figured out two ways to identify and derive better features for my training examples (https://en.wikipedia.org/wiki/Supervised_learning).
I also now have three computers going (one for tunes, one for VPN, and a third one for the IDE).
One of the most difficult parts of this "analysis" phase is the long "cycle time". Extracting the 100K+ features from my 10K test cases takes about 90 minutes. Converting them from a "DB" format to CSV (which is much easier for "R" to read) took 6 hours. Now that they are in R, I can finally dive in and do something with them.
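For the curious, that DB-to-CSV step is conceptually simple, even if it's slow at my scale. Here's a minimal sketch in Python (my actual pipeline and schema differ; the `features` table, its columns, and the sqlite back end are all assumptions for illustration):

```python
import csv
import sqlite3

def export_table_to_csv(db_path, table, csv_path):
    """Dump one table to CSV so R can load it with read.csv()."""
    conn = sqlite3.connect(db_path)
    try:
        cur = conn.execute(f"SELECT * FROM {table}")
        header = [col[0] for col in cur.description]
        with open(csv_path, "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(header)  # column names become R's header row
            writer.writerows(cur)    # stream rows instead of loading them all
    finally:
        conn.close()

# Build a toy features table (hypothetical schema) and export it.
conn = sqlite3.connect("features.db")
conn.execute("DROP TABLE IF EXISTS features")
conn.execute("CREATE TABLE features (case_id INTEGER, f1 REAL, f2 REAL)")
conn.executemany("INSERT INTO features VALUES (?, ?, ?)",
                 [(1, 0.5, 1.2), (2, 0.7, 0.9)])
conn.commit()
conn.close()

export_table_to_csv("features.db", "features", "features.csv")
```

Streaming the cursor row by row keeps memory flat, which matters when the table has 100K+ columns' worth of features; the real cost in my case is just the sheer volume of data.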
But suppose I realize I need some new feature(s), or the same data from 3 more apps. Then I turn my attention to something else for a while (reading, writing) until those runs finally finish. That's a pretty long cycle time, and it makes mistakes on the "critical path" of data collection much more costly.
In the past, I have gotten annoyed enough with some of these cycle times to spend effort reducing them. Usually, I try to find some way to do the work in parallel, and then aggregate it back. I've vowed not to do that anymore unless it's absolutely necessary, because it usually turns out to be some rabbit hole I'll spend way too much time on for relatively little gain.
All of that to say this: I'm finally doing some statistical analysis of my training examples today, and I hope to find the "top N" features and derive some new ones too. Once I have the right features, I'll be running another round of predictive experiments (hopefully with many fewer features, and much shorter cycle times).
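The "top N" step I have in mind is essentially ranking features by some univariate score against the label and keeping the best ones. A minimal sketch in Python (the real analysis happens in R, and correlation-with-label is just one possible scoring choice, not necessarily the one I'll end up using):

```python
from statistics import mean, pstdev

def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    mx, my = mean(xs), mean(ys)
    cov = mean((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx, sy = pstdev(xs), pstdev(ys)
    return cov / (sx * sy) if sx and sy else 0.0

def top_n_features(features, labels, n):
    """Rank features by |correlation with the label| and keep the best n."""
    scored = [(name, abs(pearson(col, labels)))
              for name, col in features.items()]
    scored.sort(key=lambda t: t[1], reverse=True)
    return scored[:n]

# Toy data: f1 tracks the label closely, f2 is mostly noise.
features = {
    "f1": [0.1, 0.4, 0.6, 0.9],
    "f2": [0.5, 0.2, 0.7, 0.1],
}
labels = [0, 0, 1, 1]
print(top_n_features(features, labels, 1))
```

With 100K+ candidate features, a cheap univariate filter like this is mainly a way to shrink the search space before anything more expensive (and before deriving new composite features from the survivors).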