Two Envelope Paradox

I’ve known about this paradox for a while (in fact, I was asked about it at an interview I had many years ago at Goldman Sachs), but it has always bugged me. Yesterday I finally sat down to perform a more thorough analysis and was able to construct a super weird example. Check out the pdf below:

Should you play the lottery?

The standard answer to this question is a resounding “No”. This is usually based on the idea that the expected return on any ‘investment’ is negative. You may buy a lottery ticket for $1, but since the chance of winning the lottery is so small, you can only expect to receive, say, $0.50 back on average.

Utility Function

The above argument may not be the full story, since money by itself is not a sufficiently meaningful metric to base decisions on. Instead, we need a utility function. A utility function maps dollars to value; it’s based on the idea that one dollar for you and one dollar for me may mean different things. So, we should really ask ourselves what the expected return of utility is when playing the lottery.

One useful property of utility functions is that they are concave. This means that each incremental dollar is worth less to me than the dollar before it. Put another way, if I’m starving and homeless, a dollar is worth a lot, whereas if I’m a multi-millionaire, that same dollar is worth a lot less. This is called the law of diminishing marginal utility.
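To make the diminishing-marginal-utility idea concrete, here is a tiny sketch using logarithmic utility, a standard textbook choice for a concave utility function (the specific form is an assumption for illustration, not a claim about anyone’s actual utility):

```python
import math

# Assumed concave utility: u(w) = log(w), whose marginal utility
# u'(w) = 1/w shrinks as wealth w grows.
def utility(wealth):
    return math.log(wealth)

def marginal_utility(wealth, dollar=1.0):
    """Extra utility gained from one more dollar at a given wealth level."""
    return utility(wealth + dollar) - utility(wealth)

# The same dollar is worth far more to the poorer agent.
print(marginal_utility(100) > marginal_utility(1_000_000))  # True
```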

These two ideas (that we should look at utility rather than money, and that the marginal utility decreases) seem to strengthen the argument against playing the lottery. We already know that we will, on average, lose money when playing the lottery. Now we’re also saying that the millions of dollars that we may win aren’t even worth as much as we had hoped they were. The expected return of utility is then even worse than for money.

Utility Function for Playing the Lottery

While it is generally the case that utility functions are concave (marginal utility decreases), there seem to be situations where this is not the case. Consider, as a toy example, a universe with precisely two products: chewing gum and sports cars. Suppose that each piece of gum costs $1 and that the sports car costs $100,000. Consider two agents in this universe, one with a net worth of $100 and one with a net worth of $99,999. The traditional argument of decreasing marginal utility would tell us that a dollar is worth a lot more to the person who is worth $100 than to the person who is worth $99,999. However, in this arguably contrived universe, the only thing the person worth $100 can do with that additional dollar is buy one more piece of gum. The rather wealthy individual, who can currently only purchase 99,999 pieces of gum, can, with the additional dollar, now instead purchase the sports car. It would not be too far-fetched to suppose that the wealthy individual would gain more marginal utility for that dollar than the less wealthy individual would. So, in this universe we see that marginal utility is not necessarily decreasing.

It should be pointed out that in the toy example above, the marginal utility is still locally decreasing. To someone who is worth $200, the extra dollar will not be worth as much as it would be to the person worth $100. It’s just at certain transition points that we see discrete jumps.
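The toy universe can be sketched as a piecewise utility function. The car’s worth of 150,000 “gum-equivalents” is a made-up number chosen only so that the jump at $100,000 is visible:

```python
# Hypothetical toy-universe utility: value is measured in gum-equivalents.
GUM_PRICE = 1
CAR_PRICE = 100_000
CAR_VALUE = 150_000  # assumed utility of the sports car, in gum units

def utility(wealth):
    if wealth >= CAR_PRICE:
        # Buy the car, spend the rest on gum.
        return CAR_VALUE + (wealth - CAR_PRICE) // GUM_PRICE
    return wealth // GUM_PRICE

def marginal_utility(wealth):
    return utility(wealth + 1) - utility(wealth)

print(marginal_utility(100))     # 1: one more piece of gum
print(marginal_utility(99_999))  # 50001: the jump to the sports car
```

Away from the transition point the marginal utility of a dollar is just one piece of gum, but at $99,999 there is a large discrete jump.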

Let’s go back to the lottery. Presumably the reason why people play the lottery is not so that they can buy more of the things that they already own. Instead they hope that with the additional money they could do things that were simply not possible before. Perhaps they can quit their job, buy a house, or rise out of poverty. It seems reasonable that these types of transitions are similar to the transition in the toy example above. While each player’s utility function may be locally concave, it may still exhibit large discrete jumps. Taking these into account, it is conceivable that the overall increase in utility from winning the lottery may in fact exceed what would naively be expected based on a purely monetary analysis. The expected return of utility may then be positive, in which case playing the lottery would in fact be a logical thing to do.
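A quick back-of-the-envelope calculation makes the point. All of the numbers here (ticket price, odds, jackpot, and especially the utility values) are hypothetical, chosen only to show how a utility jump can flip the sign of the expectation:

```python
# Hypothetical lottery: $1 ticket, 1-in-10,000,000 odds, $5M jackpot.
TICKET = 1.0
P_WIN = 1e-7
JACKPOT = 5_000_000

expected_money = P_WIN * JACKPOT - TICKET
print(expected_money)  # -0.5: monetarily you lose 50 cents on average

# Assumed utilities (arbitrary units): losing a dollar costs little,
# while winning crosses a large discrete jump (quit your job, buy a house).
U_LOSE_TICKET = -1.0
U_WIN = 20_000_000  # made-up jackpot utility, including the jump

expected_utility = P_WIN * U_WIN + (1 - P_WIN) * U_LOSE_TICKET
print(expected_utility > 0)  # True under these assumed numbers
```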

p-values

This has got to be one of the most misunderstood concepts in statistics. In this pdf I try to explain what a p-value is and what it is not. I give three examples to highlight subtleties, the third of which demonstrates how a lower p-value can sometimes actually bolster your belief in the null hypothesis!

pValues

Clustering people based on location and time

A while back I decided to start a machine learning driven messaging app. The idea was to allow people in a crowd – people who didn’t necessarily know each other – to communicate while they were gathered. Suppose that you find yourself watching a street performer. You start taking photos and see that everyone else is too. However, when you leave that event you only get to keep your own photos and not those taken by everyone else. Or suppose you’re at a baseball game and you see a foul ball being caught on the other side of the stadium. It’s so far away that you resort to squinting or watching the big screen. Wouldn’t it be neat if people right by that section could post their photos and comments into a stream for you to see? The app (with working title clustr) automatically created events to accomplish this based on the GPS coordinates and timestamp of each message and photo.

Since we wanted to classify both large events like baseball games and small events like street performers, it’s not quite as simple as just putting a fixed-size radius around each event. Instead the size of the cluster has to grow with the group. Furthermore, you don’t want to confuse an event that takes place today with an event that took place in the same spot a week ago. In other words, time has to factor in as well. Just like with spatial clustering, events occupy a range of timescales – some are over in 10 minutes and some can take hours. Again, a fixed time cutoff won’t work.

What we needed was a malleable model that updated its beliefs in real time as more data was revealed (i.e. Bayesian). Sometimes such a model could mistakenly decide that two parts of the same event were actually two different events. It became clear that our system would have to be able to fix its own mistakes by occasionally merging two events into one.
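To give a flavor of the problem, here is a minimal greedy sketch, emphatically not the actual clustr model: each message carries a position and a timestamp, and joins the nearest existing event under a combined space-time distance, or starts a new one. The scale constants are made-up illustration values; a fuller version would also periodically merge events whose centroids drift together:

```python
import math

# Made-up scales: how many meters / seconds count as "one unit" of distance.
SPACE_SCALE_M = 50.0
TIME_SCALE_S = 600.0
JOIN_THRESHOLD = 1.0

def spacetime_distance(a, b):
    """a, b = (x_meters, y_meters, t_seconds); combined dimensionless distance."""
    dx = (a[0] - b[0]) / SPACE_SCALE_M
    dy = (a[1] - b[1]) / SPACE_SCALE_M
    dt = (a[2] - b[2]) / TIME_SCALE_S
    return math.sqrt(dx * dx + dy * dy + dt * dt)

def cluster(messages):
    events = []  # each event is a list of messages
    for m in messages:
        best, best_d = None, JOIN_THRESHOLD
        for ev in events:
            # Compare against the event's centroid in (x, y, t).
            c = tuple(sum(p[i] for p in ev) / len(ev) for i in range(3))
            d = spacetime_distance(m, c)
            if d < best_d:
                best, best_d = ev, d
        if best is not None:
            best.append(m)
        else:
            events.append([m])
    return events

# Two bursts at the same street corner, one week apart,
# should come out as two separate events.
now = [(0, 0, 0), (10, 5, 60), (3, 8, 120)]
week_ago = [(0, 0, -604800), (5, 5, -604700)]
events = cluster(now + week_ago)
print(len(events))  # 2
```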

Back when I was working on this, I wrote up my thoughts on this product. I thought it might be worth posting them, so please enjoy the write-up here: Mathematical Analysis of Clustr – iPhone App

Mode, Median, and Mean

These are three so-called summary statistics meant to somehow convey the nature of a larger data set in terms of a single number. Now, we’re often taught these as separate concepts, but there’s a nice unifying framework for these that I’d like to revisit here.

Suppose we’re given a data set with elements \{x_i \hspace{2mm} | \hspace{2mm} i=1, 2, \ldots, n\}. Then suppose that we want to represent this data set with a single number \alpha. How should we choose it? Any single number that we pick to represent an entire data set will naturally not do the data set full justice. Put another way, it will only represent the set up to some error. Presumably we’d want to choose \alpha in such a way as to minimize the error, which brings us to the error function.

Error function

Let’s define the amount of error we incur in reducing the entire data set to a single number as

E_p(\alpha)= \sum_{i=1}^{n} |\alpha - x_i|^p

Depending on what we choose for the power p, we’ll end up with the mode, median, or mean.
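The error function is easy to play with directly. Here is a small sketch (the data set is made up for illustration; note the 0^0 = 0 convention has to be handled explicitly, since Python’s `0 ** 0` is 1):

```python
def error(alpha, xs, p):
    """E_p(alpha) = sum_i |alpha - x_i|**p, with the convention 0**0 == 0."""
    total = 0.0
    for x in xs:
        d = abs(alpha - x)
        total += 0.0 if (p == 0 and d == 0) else d ** p
    return total

data = [1, 2, 2, 3, 10]
# Same data, three different error functions:
print(error(3.6, data, 2))  # squared error, minimized by the mean
print(error(2.0, data, 1))  # 10.0: total distance, minimized by the median
print(error(2.0, data, 0))  # 3.0: miss count, minimized by the mode
```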

Mean

Let’s start with p=2. We want to choose \alpha in order to minimize the error above. In other words, let’s take the derivative and set it to zero

\frac{d}{d\alpha}E_2(\alpha)= 2\sum_{i=1}^{n}(\alpha - x_i) = 0 \rightarrow \alpha = \frac{1}{n}\sum_{i=1}^{n}x_i = \mbox{mean}
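The same result can be checked numerically with a brute-force search over candidate values of \alpha (the data and grid are arbitrary illustration choices):

```python
def error(alpha, xs, p=2):
    """E_2(alpha): sum of squared distances to the data points."""
    return sum(abs(alpha - x) ** p for x in xs)

data = [1, 2, 2, 3, 10]
candidates = [i / 100 for i in range(0, 1101)]  # grid from 0.00 to 11.00
best = min(candidates, key=lambda a: error(a, data))

print(best)                   # 3.6
print(sum(data) / len(data))  # 3.6, the mean
```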

Median

Next, let’s set p=1. In this case we’re just measuring the total distance away from all the data points. The simplest way of understanding how to choose \alpha is to first recognize that it should be less than the maximum data point and greater than the minimum data point. Then, as long as it’s between these two endpoints, the error incurred from those two terms is fixed. The next step is to minimize the error incurred from the second-largest and second-smallest points. Again, as long as it’s between these two extremes we’re good to go. Continuing in this fashion, we realize that \alpha should be chosen to be equal to the middle element in the case of an odd number of data points, and anywhere in between the middle two elements in the case of an even number of points. From this perspective the prescription to average the middle two points is simply a convention. We can also derive this more directly as above:

\frac{d}{d\alpha}E_1(\alpha) = \sum_{i=1}^{n}\mbox{sign}(\alpha-x_i)=0 \rightarrow \alpha = \mbox{median}
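Again we can check this numerically; the even-count case also confirms that every point between the two middle elements ties, so averaging them really is just a convention (data sets made up for illustration):

```python
def error(alpha, xs):
    """E_1(alpha): total absolute distance to the data points."""
    return sum(abs(alpha - x) for x in xs)

data = [1, 2, 5, 7, 100]  # odd count; the middle element is 5
candidates = [i / 10 for i in range(0, 1201)]
best = min(candidates, key=lambda a: error(a, data))
print(best)  # 5.0, the median

# Even count: E_1 is flat anywhere between the two middle elements.
data2 = [1, 2, 8, 9]
vals = {a: error(a, data2) for a in [2.0, 4.0, 5.0, 8.0]}
print(len(set(vals.values())) == 1)  # True: every point in [2, 8] ties
```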

Mode

Finally, let’s take a look at what happens when we set p=0. Here we need to use the convention that 0^0 = 0. If you’re uncomfortable with this, just imagine that we’re smoothly tuning the power parameter to zero, p\rightarrow 0. In this case the error turns out to be the number of misses, i.e. the number of times \alpha does not coincide with one of the x_i. In order to minimize this we’ll then simply pick \alpha to correspond to the most common value for x_i, i.e. the mode.
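With the 0^0 = 0 convention, E_0 just counts misses, so minimizing it is a one-liner (example data made up for illustration):

```python
def miss_count(alpha, xs):
    """E_0 with the convention 0**0 == 0: how many points alpha misses."""
    return sum(1 for x in xs if x != alpha)

data = [1, 2, 2, 2, 3, 3, 10]
best = min(set(data), key=lambda a: miss_count(a, data))
print(best)  # 2, the most common value, i.e. the mode
```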

Summary

In other words, the only difference between mode, median, and mean is which error function we’re using to quantify how poorly the summary statistic is summarizing the data set. Using E_2 gives us the mean, E_1 gives us the median, while E_0 produces the mode. That’s it.