Monday, December 30, 2013

Site News: Movin' on up in 2014!

Last year, I set a New Year's resolution to do more analytics work, including 60 blog posts. I got through 15.

But it's not all bad! Two big pieces of news for 2014:
  • I will be attending the Sloan Sports Analytics Conference again this year.  I submitted an abstract to the research paper competition, which was accepted.  Unfortunately, the results of my research disproved my hypothesis, and the whole thing's come crashing down.  Ordinarily, that means you would see it repurposed here as a blog post but...
  • I've been hired as a contributing writer to Beyond the Box Score, "a saber-slanted baseball community", where I will be writing articles on a regular basis.
I'll keep this blog open for non-baseball stuff, but most of my writing will appear over there.

Best wishes to all my reader(s) for a happy and healthy 2014!

Friday, December 27, 2013

Sorting Through a Million Bags of M&Ms

As a kid, I used to sort bags of M&Ms by color. This was my first sort of data science project, and my parents' first clue that this one was a little off. Every once in a while, I'll revert to that habit (especially with those fun-size bags you get around Halloween), which led to a long, drawn-out discussion with a friend about the probability of getting a fun-size bag of Skittles with no purples.


So when a recent trip to the vending machine produced a free bag of M&Ms, I found myself asking a number of questions about the distribution of the different colors in a bag of M&Ms. A quick Google search produced no official statement from the company, except this one from 2008:


Our color blends were selected by conducting consumer preference tests, which indicate the assortment of colors that pleased the greatest number of people and created the most attractive overall effect.

On average, our mix of colors for M&M'S MILK CHOCOLATE CANDIES is 24% cyan blue, 20% orange, 16% green, 14% bright yellow, 13% red, 13% brown.

Each large production batch is blended to those ratios and mixed thoroughly. However, since the individual packages are filled by weight on high-speed equipment, and not by count, it is possible to have an unusual color distribution.

Well, we have two bags of M&Ms here. We can check to see whether these proportions are still accurate using a chi-square goodness of fit test. Since there are six colors, we will be looking at a distribution with five degrees of freedom.

These tables show the result of the chi-square calculation for each bag.

BAG 1       Red     Orange  Yellow  Green   Blue    Brown   Total
Observed    8       12      4       10      12      8       54
Expected    7.02    10.8    7.56    8.64    12.96   7.02    54
(O-E)^2/E   0.137   0.133   1.676   0.214   0.071   0.137   2.369

BAG 2       Red     Orange  Yellow  Green   Blue    Brown   Total
Observed    8       7       2       14      12      11      54
Expected    7.02    10.8    7.56    8.64    12.96   7.02    54
(O-E)^2/E   0.137   1.337   4.089   3.325   0.071   2.256   11.22

At a significance level of alpha = 0.05, the critical value of chi-square for a distribution with 5 degrees of freedom is 11.0705. Bag 1 (2.369) falls well below that threshold, while Bag 2 (11.22) just clears it. This suggests that we have one normal bag and one outlier. This is not especially conclusive evidence for or against the 2008 distribution, but the noticeable lack of yellows in both bags makes me suspect the distribution has changed.
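The arithmetic in the two tables can be reproduced with a few lines of Python. This is only a sketch: the function and variable names are made up for illustration, and the critical value is simply copied from the text rather than computed.

```python
# 2008 color mix quoted above, as fractions of a bag, in table order.
PROPORTIONS = {
    "red": 0.13, "orange": 0.20, "yellow": 0.14,
    "green": 0.16, "blue": 0.24, "brown": 0.13,
}
CRITICAL_5DF = 11.0705  # chi-square critical value, 5 d.o.f., alpha = 0.05 (from the post)


def chi_square_stat(observed):
    """Return the sum of (O - E)^2 / E over the six colors."""
    total = sum(observed.values())
    stat = 0.0
    for color, share in PROPORTIONS.items():
        expected = share * total
        stat += (observed[color] - expected) ** 2 / expected
    return stat


bag1 = {"red": 8, "orange": 12, "yellow": 4, "green": 10, "blue": 12, "brown": 8}
bag2 = {"red": 8, "orange": 7, "yellow": 2, "green": 14, "blue": 12, "brown": 11}

for name, bag in (("Bag 1", bag1), ("Bag 2", bag2)):
    stat = chi_square_stat(bag)
    verdict = "reject" if stat > CRITICAL_5DF else "fail to reject"
    print(f"{name}: chi-square = {stat:.3f}, {verdict} the 2008 mix at alpha = 0.05")
```

With only 54 candies per bag, several expected counts sit near 7, so the chi-square approximation is on the rough side here.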

We can also ask how unusual it is to get a bag with only two M&Ms of a given color. Unfortunately, this is non-trivial to solve theoretically. But we can estimate these probabilities by simulating a large number of bags of M&Ms. I built a MATLAB script (available on request) to simulate an arbitrary number of 1.69-oz M&M bags. For convenience, I assumed each bag had a consistent number of M&Ms (54). I then drew 54 random numbers uniformly distributed on the interval [0,1], splitting up the number line to match the 2008 proportions (i.e., a random number less than 0.13 meant a red M&M, a number between 0.13 and 0.33 meant orange, and so on). I repeated this process one million times, because "a million bags of M&Ms" sounded cool.
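The procedure described above might look something like this in Python. This is a stand-in sketch, not the MATLAB script itself: the names are invented, the seed is arbitrary, and the default run uses far fewer than a million bags so that it finishes quickly in pure Python.

```python
import bisect
import random
from itertools import accumulate

COLORS = ["red", "orange", "yellow", "green", "blue", "brown"]
SHARES = [0.13, 0.20, 0.14, 0.16, 0.24, 0.13]  # 2008 mix, same order as COLORS
CUTS = list(accumulate(SHARES))                # 0.13, 0.33, 0.47, ... upper cut points
BAG_SIZE = 54                                  # assumed constant, as in the post


def simulate_bag(rng):
    """Draw BAG_SIZE uniform numbers and bin each one by the cut points."""
    counts = [0] * len(COLORS)
    for _ in range(BAG_SIZE):
        u = rng.random()
        # u < 0.13 -> red, 0.13 <= u < 0.33 -> orange, and so on.
        # min() guards against float round-off in the last cut point.
        idx = min(bisect.bisect_right(CUTS, u), len(COLORS) - 1)
        counts[idx] += 1
    return counts


def simulate_bags(n_bags, seed=0):
    """Return a list of per-bag color counts."""
    rng = random.Random(seed)
    return [simulate_bag(rng) for _ in range(n_bags)]


bags = simulate_bags(10_000)  # the post used 1,000,000
for i, color in enumerate(COLORS):
    mean = sum(bag[i] for bag in bags) / len(bags)
    print(f"{color:>6}: mean count per bag = {mean:.2f}")
```

The mean counts should hover around the "Expected" row of the chi-square tables (7.02 reds, 10.8 oranges, and so on).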


"A million bags of M&Ms isn't cool. You know what's cool? A billion bags of M&Ms."

Shut up, Justin.

Anyway, you end up with this graph showing the distribution for each color. It's discretized, of course, because 0.4 of an M&M is nonsensical. But, thanks to the central limit theorem, all of the distributions are approximately normal. Note that the red and brown curves are essentially right on top of each other*.

* - Apologies to those of you, like my advisor, who are red-green colorblind, and thought M&Ms came in blue, yellow, orange, and a couple shades of brown.


Now that we have this data set, we can answer a whole bunch of questions.

What are the odds that I get a bag with fewer than N yellows?
For this, we can make cumulative distribution functions based on that figure. So, if we assume the 2008 distribution is accurate, the probability of getting four or fewer yellows in a bag is approximately 11%. And if each bag represents an independent sample (which might not be true, depending on the manufacturing process), the probability of getting two consecutive bags with four or fewer yellows is 1.2%.
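For a single color, the simulated number can be cross-checked. Under the same assumptions (54 candies per bag, each one independently yellow with probability 0.14), the yellow count follows a binomial distribution. A short sketch, with a helper name chosen just for this example:

```python
from math import comb


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


p_one_bag = binom_cdf(4, 54, 0.14)  # four or fewer yellows in one bag
p_two_bags = p_one_bag ** 2         # two such bags in a row, assuming independence

print(f"One bag:  {p_one_bag:.1%}")
print(f"Two bags: {p_two_bags:.1%}")
```

Both values round to the 11% and 1.2% quoted above.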

What are the odds that I get a bag with fewer than N of any one color?
Here we have to use a different curve. For instance, a bag with only four yellows seems rare from the previous graph, but remember: that just deals with the probability that you have four or fewer yellows, or four or fewer reds. This question deals with the probability you have four or fewer of one color, regardless of which color it is. And now we see that the probability you have a bag with no more than four of one color is about 48%. For two consecutive bags (assuming independence), the probability is a still-reasonable 23%. So, while getting two bags with a small number of yellows is unusual, getting two bags with a small number of any color is pretty common.
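The "any color" version is easiest to get by simulation, since the six counts in a bag are not independent of each other. A rough Python sketch follows; the function name, seed, and bag count are illustrative choices, and `random.choices` stands in for the uniform-number binning described earlier.

```python
import random

SHARES = [0.13, 0.20, 0.14, 0.16, 0.24, 0.13]  # red, orange, yellow, green, blue, brown


def prob_any_color_at_most(limit, n_bags=20_000, bag_size=54, seed=0):
    """Estimate P(at least one color shows up `limit` times or fewer)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_bags):
        counts = [0] * len(SHARES)
        for color in rng.choices(range(len(SHARES)), weights=SHARES, k=bag_size):
            counts[color] += 1
        if min(counts) <= limit:
            hits += 1
    return hits / n_bags


p_any = prob_any_color_at_most(4)
print(f"Any color with four or fewer: {p_any:.1%}")
print(f"Two such bags in a row:       {p_any ** 2:.1%}")
```

With only 20,000 bags the estimate carries a sampling error of a few tenths of a percentage point, so it should not be expected to match a million-bag run to the last digit.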


What are the odds that I get a bag missing a color?
I'm sure the process of setting those percentages involves minimizing this possibility: if you were six, and your favorite color was red, you might get upset if you went through a whole bag of M&Ms with no reds. As a result, this is a pretty uncommon occurrence: in my data set, the odds were approximately 1-in-690.
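This one can also be checked without simulation, using inclusion-exclusion over the sets of colors that might be absent. A sketch under the same assumptions as before (54 independent draws from the 2008 mix); the function name is made up:

```python
from itertools import combinations

SHARES = [0.13, 0.20, 0.14, 0.16, 0.24, 0.13]  # red, orange, yellow, green, blue, brown


def prob_missing_color(shares, bag_size):
    """P(at least one color is absent from the bag), by inclusion-exclusion."""
    total = 0.0
    # r = number of colors forced to be absent; all six absent is impossible.
    for r in range(1, len(shares)):
        sign = 1 if r % 2 == 1 else -1
        for absent in combinations(shares, r):
            total += sign * (1 - sum(absent)) ** bag_size
    return total


p_missing = prob_missing_color(SHARES, 54)
print(f"About 1-in-{1 / p_missing:.0f}")
```

The result lands within about one percent of the simulated 1-in-690.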

What are the odds that I get a bag that's entirely one color?
This never happened in the million trials I ran. In fact, the greatest number of any single color in one bag was 30 blues (out of 54).
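A quick calculation shows why a million trials never turned one up. Assuming 54 independent draws from the 2008 mix, the chance of a single-color bag is the sum of each share raised to the 54th power:

```python
SHARES = [0.13, 0.20, 0.14, 0.16, 0.24, 0.13]  # red, orange, yellow, green, blue, brown
BAG_SIZE = 54

# Every candy in the bag has to come up the same color.
p_single_color = sum(share ** BAG_SIZE for share in SHARES)
print(f"P(single-color bag) = {p_single_color:.2e}")
```

That is on the order of 10^-34, nearly all of it from all-blue bags.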

What are the odds that I get a bag with equal numbers of all colors?
You would think this wouldn't be too crazy, but in fact it's very rare. I estimate that the odds are about 1-in-42,000.
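An exact figure is available from the multinomial formula, since "equal numbers" means exactly nine of each color in a 54-candy bag. A sketch with an illustrative function name:

```python
from math import factorial, prod

SHARES = [0.13, 0.20, 0.14, 0.16, 0.24, 0.13]  # red, orange, yellow, green, blue, brown


def prob_equal_counts(shares, bag_size):
    """Multinomial probability that every color appears equally often."""
    k = len(shares)
    if bag_size % k != 0:
        return 0.0  # an even split is impossible
    each = bag_size // k
    arrangements = factorial(bag_size) // factorial(each) ** k
    return arrangements * prod(share ** each for share in shares)


p_equal = prob_equal_counts(SHARES, 54)
print(f"About 1-in-{1 / p_equal:,.0f}")
```

A million simulated bags would contain only a couple dozen such hits, so a simulated estimate like 1-in-42,000 can easily sit several thousand away from the exact value.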

What are the odds I get more blues than any other color?
Before I present these results, it's important to note that MATLAB's min and max functions don't deal with ties very well. Ideally, you'd have a function such that a two-way tie would count as 0.5 for each color, and a three-way tie as 1/3, but what actually happens is the left-most column gets 1, and everyone else gets zero. This means that red, orange, and yellow will be skewed a little high, and green, blue, and brown will be skewed a little low. But this is already a 1,000-word entry on candy-coated chocolate, so the min/max functions are good enough for me.
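For the curious, the "ideal" tie-splitting rule is only a few lines. This Python sketch (function name invented for illustration) gives each tied leader an equal fraction of the win:

```python
COLORS = ["red", "orange", "yellow", "green", "blue", "brown"]


def share_of_wins(bags):
    """Fraction of bags each color 'wins', splitting ties evenly among leaders."""
    credit = [0.0] * len(bags[0])
    for counts in bags:
        top = max(counts)
        leaders = [i for i, c in enumerate(counts) if c == top]
        for i in leaders:
            credit[i] += 1 / len(leaders)  # 0.5 each for a two-way tie, 1/3 for three-way
    return [c / len(bags) for c in credit]


# Bag 1 from the chi-square table: orange and blue tie at 12.
wins = share_of_wins([[8, 12, 4, 10, 12, 8]])
print(dict(zip(COLORS, wins)))
```

Here orange and blue each get 0.5, whereas a left-most-wins rule would hand the whole bag to orange.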

Thursday, December 19, 2013

Bad Beats: Getting Back on the Bowl Prediction Horse

Last year I used the Sagarin pure points method and found the games with the biggest discrepancy between the spread Sagarin predicted and the actual spread. I charted this for all of last year's bowls and finished around .500.

This year, I'm using The Prediction Tracker, which aggregates predictions from a number of systems. Treat the picks as independent and assume they fall in a normal distribution (warning: this is probably a terrible set of assumptions, but roll with it). Then, we can see how many standard deviations each spread is from the mean prediction. The ones that are furthest away are the best values.

Let's run through the Chick-Fil-A Bowl (Duke-TAMU) as an example. There are 48 picks with a mean of TAMU -6.54 and a standard deviation of 4.2. The current spread at the LVH is TAMU -12.5, approximately 1.42 standard deviations away from the mean prediction. The high Z-score demonstrates that there's value in this line, and that TAMU is overvalued. And that's why I picked Duke.
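The calculation for that example, as a small Python sketch. The function name is made up, and spreads are written as points the favorite is laying, so TAMU -12.5 becomes 12.5:

```python
def line_z_score(spread, mean_pick, sd_pick):
    """Standard deviations between the posted spread and the mean prediction."""
    return (spread - mean_pick) / sd_pick


# Chick-Fil-A Bowl: LVH line TAMU -12.5, mean pick TAMU -6.54, std. dev. 4.2
z = line_z_score(12.5, 6.54, 4.2)
print(f"z = {z:.2f}")  # positive: the line favors TAMU by more than the systems do
```

A positive score means the line asks more of the favorite than the systems predict, which points to the underdog. A negative score points the other way.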

For reference, I put all the scores in a Google Document so you can bet against them as you will.

There haven't been many updates here lately, but that doesn't mean I haven't been working. Expect another update soon with a bunch of news.