Tuesday, May 20, 2014

Tangled in the Rigging 2: Probabilistic Boogaloo

I can't believe I have to do this again.

It was two years ago, and the New Orleans Hornets had just won the NBA Draft Lottery. With the third-worst record in the league, the Hornets had a 13.7 percent chance to win the 2012 Anthony Davis sweepstakes. This is roughly equivalent to a coin coming up heads three times in a row (12.5 percent) -- i.e., unusual but not super rare -- but of course everyone cried foul since the league owned the Hornets at the time and was working to sell them to a private owner.

The draft lottery being rigged is one of the oldest jokes known to man. That joke is so old, creationists have to insist it was created by the Devil to test our faith. That joke went stale so long ago, the mold growing on that joke was used to discover penicillin*. Enough.

* - If you like this construction, here are some more examples.

And even reasonable people -- people who love them some numbers! -- fall into this trap. Here's Grantland's Bill Barnwell on Twitter tonight:



This is technically true:
P(Cavs win in 2011)*P(Cavs win in 2013)*P(Cavs win in 2014) = (2.8%)(15.6%)(1.7%) = .0000743,
or 13,467-to-1. But really, that's disingenuous for a couple of reasons. First off, the Cavs actually had two chances to win in the 2011 lottery: their own chance, based on their own abysmal performance (19.9%), and the Clippers' chance, acquired by trade (2.8%). So their actual probability of winning the first pick in 2011 was 22.7%, improving their odds to 1,661-to-1.
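If you want to check the arithmetic yourself, here is a minimal Python sketch. It multiplies the yearly probabilities quoted above and converts the product to the "N-to-1" form used in this post (taken here as 1/p, rounded). The helper name `one_in` is my own invention, and the calculation assumes the lotteries are independent.

```python
from math import prod

def one_in(probabilities):
    """Return N such that the joint probability is roughly 1-in-N,
    assuming the individual lotteries are independent."""
    return round(1 / prod(probabilities))

# Counting only the Clippers' pick in 2011 (the figures in the equation above)
tweet_odds = one_in([0.028, 0.156, 0.017])

# Counting both Cavs chances in 2011: 19.9% (own pick) + 2.8% (Clippers' pick)
combined_odds = one_in([0.199 + 0.028, 0.156, 0.017])

print(tweet_odds, combined_odds)
```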



Okay, fine. But even then, we're still not talking about the odds of the Cavs winning three of four lotteries. We're talking about the odds of the Cavs winning those three lotteries in that order. If we really wanted to find the odds of the Cavs winning three of the last four lotteries, we'd need to calculate something like this:
P(2011)*P(2012)*P(2013)*P(~2014) + P(2011)*P(2012)*P(~2013)*P(2014) + P(2011)*P(~2012)*P(2013)*P(2014) + P(~2011)*P(2012)*P(2013)*P(2014),
where P(x) is the probability the Cavs won in year x, and P(~x) is the probability the Cavs didn't win in year x.
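That sum is tedious by hand but short in code. The sketch below is one way to do it in Python, assuming the four lotteries are independent; the function name is mine. Note that the 2012 value is a placeholder I made up for illustration, since I haven't quoted the Cavs' 2012 chance anywhere in this post.

```python
from itertools import combinations
from math import prod

def prob_exactly_k_wins(win_probs, k):
    """Probability of winning exactly k of the given independent lotteries."""
    n = len(win_probs)
    total = 0.0
    # Each choice of k winning years contributes one term of the sum
    for winners in combinations(range(n), k):
        total += prod(
            p if i in winners else 1 - p
            for i, p in enumerate(win_probs)
        )
    return total

# 2011 (combined), 2013 and 2014 are the figures quoted above.
# The 2012 value is a PLACEHOLDER, not the Cavs' real 2012 chance.
p_2012_placeholder = 0.10
cavs = [0.227, p_2012_placeholder, 0.156, 0.017]
print(prob_exactly_k_wins(cavs, 3))
```

Swapping the placeholder for the real 2012 figure gives the "three of four, in any order" probability.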

Also, this sequence of lottery selections isn't even the most improbable four-year stretch in the last decade. From 2005-2008, four straight lottery winners had less than a 10% chance of winning. The odds of those four teams winning are 200,194-to-1. But because the picks went to four different teams, no one thought it was that unusual.

That's because every lottery outcome is unlikely until it happens. Here's another example: the Rhode Island lottery has a four-digit numbers game, where you have to pick four single-digit numbers, in order, to win the grand prize. Sunday's winning numbers were 9-2-9-9. Now, the odds of three given digits all coming up nine could be expressed as 1,000-to-1, and that's correct, but are the three nines really so unlikely? No. They're no more or less unlikely than any other three-digit combination*; we're just wired to pick out those patterns.

* - We're assuming it's a fair lottery here, but it is Rhode Island, so who knows.

Monday, April 14, 2014

Rough Translations: Is the traditional OBP formula the most suitable?

Thanks to a recent Sports Illustrated article and Baseball Prospectus interview, I stumbled across the website for the Grupo Independiente para la Investigacion del Beisbol (GIIB), a group interested in applying sabermetric principles to Cuban baseball. Their website is very interesting, but it's in Spanish. In order to spread their very useful approach, I'm putting my loose translation here, in the hopes that (even if it's not 100% accurate) it will at least be an improvement over what you can get for free through something like Google Translate.

DISCLAIMER

  • All content is property of the GIIB and is not my own. I claim zero rights to it. If they get mad about this translation, they just have to contact me and I will absolutely take it down.
  • I also claim zero responsibility for the accuracy of these translations. I am not a native Spanish speaker, but I did take Spanish in high school and am currently a level-17 Duolingo user, for whatever that's worth. Any missing Spanish knowledge (of which there is a lot) will be supplied by Google Translate.
  • Because my understanding of the original is limited, the translations will probably not be exact. I hope to at least capture the spirit of the article, so others who don't know Spanish can still read the GIIB's research. Sentences or phrases I can't get a good handle on will be denoted by italics. You're welcome to leave corrections or other constructive feedback in the comments.

***

¿Es la fórmula tradicional del OBP la más idónea? (Is the traditional OBP formula the most suitable?)

Is the OBP formula the most correct way to measure the probability that a batter reaches base? Is not the sacrifice bunt an opportunity to get on base? Why does the OBP formula ignore when a batter reaches base on an error?

OBP = (H+BB+HBP)/(AB+BB+HBP+SF)

H: Hits
BB: Base on Balls
HBP: Hit by Pitch
AB: At Bats
SF: Sacrifice Flies

The formula for OBP is too focused on the analysis of the individual hitter and not what he really contributes to the team. One piece of evidence for this last statement is the fact that the OBP formula excludes the sacrifice bunt from the denominator. Suppose a batter comes up with runners on first and second with no outs, and grounds out to second, allowing the runners to advance. What is the difference between this at bat and a sacrifice bunt? Are the two actions not worth the same to their team? Then why does OBP make a distinction between them? Defenders of OBP argue that, because the sacrifice bunt is ordered by the bench, it should not be seen as an opportunity to get on base and therefore should be excluded from the denominator. But is this true? Are all sacrifice bunts ordered? Should OBP distinguish between sacrifice bunts put on by the manager and other sacrifices? Of course not; and even if we consider all sacrifices as ordered from the bench, the sacrifice isn't an opportunity to reach base? Really?

Let's return to the previous situation: runners on first and second, no outs. Suppose the batter bunts the ball and reaches first safely, getting credit for a hit; is this bunt not an opportunity to get on base? In other words, if the bunt goes for a hit, it is a positive action, but if it advances the runners (which could also be accomplished by other means as shown above) and the batter is thrown out at first, it doesn't count as an opportunity to get on base? This is incongruous.

Now we analyze another important aspect of OBP: the exclusion of times the batter reached on an error.

Suppose a batter reaches on an error. The batter accomplished one of his goals (to get on base), and has become a runner and an opportunity for his team to score. Any runner represents a great opportunity for a team to manufacture runs, regardless of whether he has reached by error, hit, or walk. But the basic OBP formula doesn't see it this way. According to the basic OBP formula, reaching on an error is a negative action. When a batter reaches first by an error, according to this formula he is not credited with reaching first safely but is credited with an at bat. In other words, each time a batter reaches on an error, it is counted as a failed opportunity to reach first. This is difficult to understand.

Let's analyze this from a different perspective. For some, the error has nothing to do with the offensive player; it is true that the error is a bad defensive action, but does this mean the batter has no influence? What is the difference between a hit and an error? Subjective concepts: first, the positioning of the defensive player, and then whether the scorer considered there to be an opportunity for an out. Is a Texas leaguer to center field worth more than a shot to third that the fielder can't handle? In general, baseball rewards speed and placement and not force, but is this really important? Imagine your favorite team, losing by one in the ninth, with two outs and the tying run on third. The batter reaches on an error and the game is tied. Does it really matter how the game was tied? Would you rather lose?

It is true that errors are sometimes just bad defensive plays, where there is not a hard-hit ball or fast players to rush the defense and force them to make risky throws. But is the HBP not an error by the pitcher? The HBP is in most cases a mistake by the pitcher, a pitch that gets away from him; but it is nevertheless counted in the traditional OBP formula as a positive offensive action. It is hard for us to understand how an HBP is worth more to the batter than reaching on an error.

We now return to the initial question: is the traditional OBP formula the best way to measure the probability of a batter reaching base? Briefly, OBP does not count when a batter reaches on an error and doesn't consider a sacrifice bunt as an opportunity to get on base. We therefore calculate OBP in an alternative way and call it gOBP.

gOBP = (H+BB+HBP+ROE)/(AB+BB+HBP+SF+SAC)

ROE: Reached on error [Trans. note: abbreviated EE in original]
SAC: Sacrifice hit
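
[Trans. note: the following is not part of the original article.] The two formulas can be sketched side by side in Python. The function names and the sample stat line are my own and purely illustrative; they are not GIIB data or StatsPlay code.

```python
def obp(h, bb, hbp, ab, sf):
    """Traditional on-base percentage."""
    return (h + bb + hbp) / (ab + bb + hbp + sf)

def gobp(h, bb, hbp, roe, ab, sf, sac):
    """gOBP: also credits reaching on error (ROE) in the numerator
    and counts sacrifice hits (SAC) as opportunities in the denominator."""
    return (h + bb + hbp + roe) / (ab + bb + hbp + sf + sac)

# Made-up season line, for illustration only
line = dict(h=150, bb=60, hbp=5, ab=500, sf=5)
traditional = obp(**line)
alternative = gobp(roe=8, sac=10, **line)
print(round(traditional, 3), round(alternative, 3))
```

Whether gOBP comes out higher or lower than OBP for a given batter depends on how his times reaching on error compare with his sacrifice hits.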

To capture the reality of the concepts we must move away from the moralistic way of thinking that still exists in baseball analysis. If we want to know the real probability that a batter reaches base, then we have to count all opportunities and all successful actions. Each turn at bat is an opportunity to reach base; and of course, every time the batter reaches first safely, it is a positive result for him. We exclude only interference or obstruction from this analysis, because the probability of these actions occurring in a baseball game is around 0.0027 [~0.27 percent]. That is to say, it is such a small sample that it is negligible.

For now, enough philosophical discussions about baseball. We will concentrate on mathematical tools that allow us to demonstrate which of these two statistics is more useful when analyzing the offensive prowess of a team.

For this we compiled statistics from the last 15 Cuban National Series. We calculated both variants of OBP and found the linear correlation of both with runs scored per game.


This is the scatterplot of the traditional OBP vs. runs per game. These statistics show a linear correlation of 0.93. We are not questioning the proven utility of the classic formula; rather, we are analyzing which formula is the most suitable.

We next show the scatterplot of gOBP with respect to runs per game.


These statistics show a linear correlation of 0.95.
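
[Trans. note: the following is not part of the original article.] For readers who want to run the same kind of comparison, here is a minimal Python sketch of the linear (Pearson) correlation coefficient. The team-season numbers are invented for illustration and are not the GIIB's National Series data; the article also does not say what software the GIIB used.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson linear correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical team-seasons (NOT the GIIB data)
on_base = [0.330, 0.345, 0.360, 0.375, 0.390]
runs_per_game = [4.1, 4.6, 4.9, 5.6, 5.9]
print(round(pearson_r(on_base, runs_per_game), 3))
```

Computing the coefficient once with OBP and once with gOBP against the same runs-per-game column gives the two figures being compared here.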

Although the difference is not very large, it is not possible to deny that gOBP is slightly closer to reality than OBP, at least numerically. We should remember that correlation does not imply causation between variables. But we consider gOBP the optimal indicator to measure the probability that a batter will reach base. This is why STRIKE, as well as other projects that derive from it including StatsPlay, includes gOBP in its reports and recommends it over OBP to measure this concept.

Regards, friends.

Rough Translations: Analyzing Industriales

Analyzing Industriales (Analizando a Industriales)

With the attachment to this article, which you can download [Trans. note: available through thegiib.com], we are making public the first preseason analysis that our group performed for Industriales (dated July 2013). This analysis touches on very controversial issues surrounding the blue team. Below is a summary of what you will find; we hope you enjoy it:

- Regular players?
- Shortstop?
- Catchers?
- Yulieski or Rudy at 3B?
- How to replace Odrisamer?
- Predictions?
- Some players and their characteristics
- Advice
- Graph of the principal Industriales players' wOBA over time.

Rough Translations: Announcement

Convocatoria (Announcement)

This website is the official way the StatsPlay application is distributed, along with the necessary databases and accessories. We believe that this is the best way to make the project available to everyone, without borders, and with the most independence possible. We regret that many users in our country can't download all that they need for the project to work for them. We know the limited connectivity options and Internet access that exist. Nevertheless, many people do have access to a variety of sites on the national network, such as through the intranets of their workplaces.

For that reason:

I call

on all sites hosted on servers on the national territory under the .cu domain, especially Cubadebate, Granma, Juventud Rebelde, Infomed, and BeisbolCubano, to contribute to the dissemination of this project. All national sites have total freedom to host and allow the downloads of everything related to the StatsPlay applications and databases from their servers. We hope that this way, this project will have a very broad scope, and that a greater number of people can enjoy it and use it.

-Camilo Quintas, leader of the GIIB.

Rough Translations: Introduction/The GIIB Web Site

Sitio Web del GIIB (The GIIB Web Site)

With great joy we launch our group's official website today. We want to thank everyone who has kindly offered us their help. Although it is not necessary, it is recommended that all users register to be able to consume all of our site's services without difficulty. We believe that this is the best way to make our first proposal public and official: the launch of the StatsPlay project. In this site you will find everything you need to inform yourself about StatsPlay. We hope that with this project, all those linked to baseball can supplement the existing information needed in our country.

Welcome to all. GIIB.

Monday, December 30, 2013

Site News: Movin' on up in 2014!

Last year, I set a New Year's resolution to do more analytics work, including 60 blog posts. I got through 15.

But it's not all bad! Two big pieces of news for 2014:
  • I will be attending the Sloan Sports Analytics Conference again this year.  I submitted an abstract to the research paper competition, which was accepted.  Unfortunately, the results of my research disproved my hypothesis, and the whole thing's come crashing down.  Ordinarily, that means you would see it repurposed here as a blog post but...
  • I've been hired as a contributing writer to Beyond the Box Score, "a saber-slanted baseball community", where I will be writing articles on a regular basis.
I'll keep this blog open for non-baseball stuff, but most of my writing will appear over there.

Best wishes to all my reader(s) for a happy and healthy 2014!

Friday, December 27, 2013

Sorting Through a Million Bags of M&Ms

As a kid, I used to sort bags of M&Ms by color. This was my first sort of data science project, and my parents' first clue that this one was a little off. Every once in a while, I'll revert to that habit (especially with those fun-size bags you get around Halloween), which led to a long, drawn-out discussion with a friend about the probability of getting a fun-size bag of Skittles with no purples.


So when a recent trip to the vending machine produced a free bag of M&Ms, I found myself asking a number of questions about the distribution of the different colors in a bag of M&Ms. A quick Google search produced no official statement from the company, except this one from 2008:


Our color blends were selected by conducting consumer preference tests, which indicate the assortment of colors that pleased the greatest number of people and created the most attractive overall effect.

On average, our mix of colors for M&M'S MILK CHOCOLATE CANDIES is 24% cyan blue, 20% orange, 16% green, 14% bright yellow, 13% red, 13% brown.

Each large production batch is blended to those ratios and mixed thoroughly. However, since the individual packages are filled by weight on high-speed equipment, and not by count, it is possible to have an unusual color distribution.

Well, we have two bags of M&Ms here. We can check to see whether these proportions are still accurate using a chi-square goodness of fit test. Since there are six colors, we will be looking at a distribution with five degrees of freedom.

These tables show the result of the chi-square calculation for each bag.

BAG 1        Red    Orange  Yellow  Green  Blue   Brown  Total
Observed     8      12      4       10     12     8      54
Expected     7.02   10.8    7.56    8.64   12.96  7.02   54
(O-E)^2/E    0.137  0.133   1.676   0.214  0.071  0.137  2.369

BAG 2        Red    Orange  Yellow  Green  Blue   Brown  Total
Observed     8      7       2       14     12     11     54
Expected     7.02   10.8    7.56    8.64   12.96  7.02   54
(O-E)^2/E    0.137  1.337   4.089   3.325  0.071  2.256  11.22

At the 95% confidence level (alpha = 0.05), the critical chi-square value is 11.0705 for a distribution with 5 degrees of freedom. Bag 1 (2.369) falls well below that threshold, while bag 2 (11.22) lands just above it, which suggests that we have one normal bag and one outlier. This is not especially conclusive evidence for or against the 2008 distribution, but the significant lack of yellows makes me suspect the distribution has changed.
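
For anyone who wants to reproduce the tables, here is a minimal Python sketch of the goodness-of-fit statistic (this is not the script I used; the names are illustrative). It uses the 2008 proportions and the two observed bags from above, and compares each statistic against the 11.0705 critical value.

```python
# 2008 color proportions quoted by the company
PROPORTIONS_2008 = {"red": 0.13, "orange": 0.20, "yellow": 0.14,
                    "green": 0.16, "blue": 0.24, "brown": 0.13}

def chi_square_stat(observed, proportions):
    """Sum of (O - E)^2 / E over all colors, with E = bag size * proportion."""
    total = sum(observed.values())
    stat = 0.0
    for color, share in proportions.items():
        expected = total * share
        stat += (observed[color] - expected) ** 2 / expected
    return stat

bag1 = {"red": 8, "orange": 12, "yellow": 4, "green": 10, "blue": 12, "brown": 8}
bag2 = {"red": 8, "orange": 7, "yellow": 2, "green": 14, "blue": 12, "brown": 11}

CRITICAL_VALUE = 11.0705  # 5 degrees of freedom, alpha = 0.05

for name, bag in (("bag 1", bag1), ("bag 2", bag2)):
    stat = chi_square_stat(bag, PROPORTIONS_2008)
    verdict = "above" if stat > CRITICAL_VALUE else "below"
    print(f"{name}: chi-square = {stat:.3f} ({verdict} the critical value)")
```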

We can also ask questions about how unusual a bag with only two M&Ms of a given color is. Unfortunately, this is non-trivial to solve theoretically. But we can estimate these probabilities by simulating a large number of bags of M&Ms. I built a MATLAB script (available on request) to simulate an arbitrary number of 1.69-oz M&M bags. For convenience, I assumed each bag had a consistent number of M&Ms (54). I then drew 54 random numbers uniformly distributed on the interval [0,1], splitting up the number line to match the 2008 proportions (i.e., a random number less than 0.13 meant a red M&M, a number between 0.13 and 0.33 meant orange, and so on). I repeated this process one million times, because "a million bags of M&Ms" sounded cool.
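
The MATLAB script itself isn't reproduced here, but the procedure just described can be sketched in a few lines of Python. This is a re-implementation under the same assumptions (54 M&Ms per bag, 2008 proportions), not a port of my script; the function names, the seed, and the smaller default number of bags are all choices made for this sketch.

```python
import random
from bisect import bisect_right
from collections import Counter
from itertools import accumulate

COLORS = ["red", "orange", "yellow", "green", "blue", "brown"]
SHARES = [0.13, 0.20, 0.14, 0.16, 0.24, 0.13]
CUTOFFS = list(accumulate(SHARES))  # upper edge of each color's slice of [0, 1]
BAG_SIZE = 54

def simulate_bag(rng):
    """Draw BAG_SIZE uniform numbers and map each one to a color."""
    counts = Counter({color: 0 for color in COLORS})
    for _ in range(BAG_SIZE):
        # min() guards against float round-off in the final cutoff
        idx = min(bisect_right(CUTOFFS, rng.random()), len(COLORS) - 1)
        counts[COLORS[idx]] += 1
    return counts

def simulate_bags(n_bags, seed=0):
    """Simulate n_bags independent bags with a reproducible seed."""
    rng = random.Random(seed)
    return [simulate_bag(rng) for _ in range(n_bags)]

bags = simulate_bags(10_000)
mean_yellow = sum(bag["yellow"] for bag in bags) / len(bags)
print(f"average yellows per bag: {mean_yellow:.2f}")
```

Pure Python is slow at a million bags, so the sketch defaults to 10,000; raising `n_bags` trades run time for smoother estimates.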


"A million bags of M&Ms isn't cool. You know what's cool? A billion bags of M&Ms."

Shut up, Justin.

Anyway, you end up with this graph showing the distribution for each color. It's discretized, of course, because 0.4 of an M&M is nonsensical. But, thanks to the central limit theorem, all of the distributions are approximately normal. Note that the red and brown curves are essentially right on top of each other*.

* - Apologies to those of you, like my advisor, who are red-green colorblind, and thought M&Ms came in blue, yellow, orange, and a couple shades of brown.


Now that we have this data set, we can answer a whole bunch of questions.

What are the odds that I get a bag with fewer than N yellows?
For this, we can make cumulative distribution functions based on that figure. So, if we assume the 2008 distribution is accurate, the probability of getting four or fewer yellows in a bag is approximately 11%. And if each bag represents an independent sample (which might not be true, depending on the manufacturing process), the probability of getting two consecutive bags with four or fewer yellows is 1.2%.
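
For a single color, at least, the simulated number can be cross-checked directly: under the assumptions above (54 independent draws, 14% yellow), the yellow count in a bag follows a binomial distribution. The Python sketch below sums that distribution; the function name is mine.

```python
from math import comb

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))

# Four or fewer yellows in a 54-piece bag, assuming the 2008 share of 14%
p_one_bag = binom_cdf(4, 54, 0.14)
# Two such bags in a row, assuming the bags are independent
p_two_bags = p_one_bag ** 2

print(f"one bag: {p_one_bag:.3f}, two bags: {p_two_bags:.4f}")
```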

What are the odds that I get a bag with fewer than N of one color?
Here we have to use a different curve. For instance, a bag with only four yellows seems rare from the previous graph, but remember: that just deals with the probability that you have four or fewer yellows, or four or fewer reds. This question deals with the probability you have four or fewer of one color, regardless of which color it is. And now we see that the probability you have a bag with no more than four of one color is about 48%. For two consecutive bags (assuming independence), the probability is a still-reasonable 23%. So, while getting two bags with a small number of yellows is unusual, getting two bags with a small number of any color is pretty common.


What are the odds that I get a bag missing a color?
I'm sure the process of setting those percentages involves minimizing this possibility: if you were six, and your favorite color was red, you might get upset if you went through a whole bag of M&Ms with no reds. As a result, this is a pretty uncommon occurrence: in my data set, the odds were approximately 1-in-690.
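
This one can also be checked without simulation. A set of colors is missing from a bag exactly when all 54 draws land on the remaining colors, so inclusion-exclusion over subsets of colors gives the probability that at least one color is absent. The Python sketch below assumes the same 54-piece bag and 2008 shares; the function name is mine.

```python
from itertools import combinations

SHARES = [0.13, 0.20, 0.14, 0.16, 0.24, 0.13]  # red, orange, yellow, green, blue, brown
BAG_SIZE = 54

def prob_missing_any_color(shares, bag_size):
    """P(at least one color absent), by inclusion-exclusion over color subsets."""
    total = 0.0
    # Subsets of size 1..(n-1): all n colors cannot be missing at once
    for r in range(1, len(shares)):
        sign = 1 if r % 2 == 1 else -1
        for missing in combinations(shares, r):
            # All draws must avoid every color in the 'missing' subset
            total += sign * (1 - sum(missing)) ** bag_size
    return total

p_missing = prob_missing_any_color(SHARES, BAG_SIZE)
print(f"about 1-in-{1 / p_missing:.0f}")
```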

What are the odds that I get a bag that's entirely one color?
This never happened in the million trials I ran. In fact, the greatest number of any single color in one bag was 30 blues (out of 54).

What are the odds that I get a bag with equal numbers of all colors?
You would think this wouldn't be too crazy, but in fact it's very rare. I estimate that the odds are about 1-in-42,000.
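
Because there is only one way for this to happen (nine of each color in a 54-piece bag), it can be computed directly from the multinomial distribution instead of being estimated. The Python sketch below does that under the same assumptions as the simulation; the function name is mine.

```python
from math import factorial, prod

SHARES = [0.13, 0.20, 0.14, 0.16, 0.24, 0.13]
BAG_SIZE = 54

def prob_equal_split(shares, bag_size):
    """Multinomial probability that every color appears equally often."""
    per_color, remainder = divmod(bag_size, len(shares))
    if remainder:
        return 0.0  # bag size not divisible by the number of colors
    # Number of distinct orderings of the bag: n! / (k!)^colors
    arrangements = factorial(bag_size) // factorial(per_color) ** len(shares)
    return arrangements * prod(share ** per_color for share in shares)

p_equal = prob_equal_split(SHARES, BAG_SIZE)
print(f"about 1-in-{1 / p_equal:,.0f}")
```

With only a couple dozen such bags expected in a million trials, a simulated estimate will wobble noticeably around the exact value.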

What are the odds I get more blues than any other color?
Before I present these results, it's important to note that MATLAB's min and max functions don't deal with ties very well. Ideally, you'd have a function such that a two-way tie would count as 0.5 for each color, and a three-way tie as 1/3, but what actually happens is the left-most column gets 1, and everyone else gets zero. This means that red, orange, and yellow will be skewed a little high, and green, blue, and brown will be skewed a little low. But this is already a 1,000-word entry on candy-coated chocolate, so the min/max functions are good enough for me.
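
For the record, the "ideal" tie-splitting version is not hard to write. Here is a Python sketch (not what my MATLAB script does) that gives each color tied for the lead an equal fraction of one point. The function names are mine, and the two example bags are the real ones from earlier in this post.

```python
from collections import Counter

def max_credit(counts):
    """Split one unit of credit evenly among the colors tied for the most."""
    top = max(counts.values())
    leaders = [color for color, n in counts.items() if n == top]
    return {color: 1 / len(leaders) for color in leaders}

def tally_leaders(bags):
    """Accumulate fractional 'most common color' credit over many bags."""
    totals = Counter()
    for bag in bags:
        for color, credit in max_credit(bag).items():
            totals[color] += credit
    return totals

bag1 = {"red": 8, "orange": 12, "yellow": 4, "green": 10, "blue": 12, "brown": 8}
bag2 = {"red": 8, "orange": 7, "yellow": 2, "green": 14, "blue": 12, "brown": 11}
leader_totals = tally_leaders([bag1, bag2])
print(dict(leader_totals))
```

Bag 1 has orange and blue tied at 12, so each gets half a point; a first-column-wins rule would hand the whole point to orange.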