Misapplied Math

# Dynasty Meets the Central Limit TheoremNov 20

There's a classic probability problem/brainteaser popularized by google that states:

In a country in which people only want boys, every family continues to have children until they have a boy. If they have a girl, they have another child. If they have a boy, they stop. What is the proportion of boys to girls in the country?

It seems reasonable that such a country would succeed in their goal of skewing the population demographics. However, google's solution to the brainteaser goes on to justify how the population will still end up 50% male/50% female.

Interestingly enough, google's "official" solution of 50%/50% is incorrect, depending on how you interpret their wording. Assuming that they're asking for:

$\E\left[\frac{B}{B + G}\right]$

(what's the expected ratio of boys to girls?) there's a problem with their reasoning. That's not the only twist. While for a large enough population their answer is very close to correct, for any one family the expected percentage of boys is closer to 70%.

The crux of the issue stems from an invalid application of the expectation operator. We're interested in the random variable $R(n) = \frac{n}{X(n)}$, where $n$ is fixed and $X(n)$ is itself a random variable: the total number of children in the population. Note that because each family has exactly one boy, $n$ is both the number of families and the number of boys. If we assume that boys and girls are born with $p = .5$, the expected number of children that any one family will have before producing a boy (inclusive) is given by the geometric distribution with $p = .5$:

$\E[X(1)] = \frac{1}{p} = 2$

From there, it seems reasonable to assume (as the google argument does) that:

$\E\left[\frac{n}{X(n)}\right] = \frac{n}{E[X(n)]} = \frac{n}{2n} = .5$

However, expectation only commutes with linear operators, so the equality above does not hold. Taking things one step further we can find a bound showing that ratio is greater than 1/2 for all finite populations. Jensen's inequality (the Swiss Army Knife of mathematical sanity checks) gives that for a non-degenerate probability distribution and a convex function $\varphi$, $\E[\varphi(X)] > \varphi(\E[X])$. Letting $\varphi = \frac{n}{x}$ gives:

$\E\left[\frac{n}{X(n)}\right] > \frac{n}{E[X(n)]}$

One of the most interesting things to come out of this analysis is the observation that the expected ratio of boys to girls in one family is a biased estimator of the population mean. To understand why, remember that 50% of the time we’ll observe a family with one child (a boy, making for a ratio of 100% boys), and 50% of the time we’ll observe a family with at least one girl. Working with ratios instead of sums underweights the contribution coming from families with at least one girl. Individual families will, on average, produce a ratio of boys to girls close to 70%. However, as families can have at most one boy and potentially many girls, the population ratio will approach 50% from above.

We can calculate the single family distribution empirically or explicitly. Here's what it looks like: The red dashed line denotes the mean value – .69 for my run of the simulation. Using the same set of sample data and treating it as one simulated population of 100,000 instead of 100,000 simulated populations of one family:

gives "Population mean: 0.50." To see why the population will tend towards 50% we'll need to appeal to the Central Limit Theorem (CLT). For a rigorous explanation of the math see the excellent post by Ben Golub here. In short, by the CLT, as the number of families $n$ becomes large, the total number of children in the population $X(n)$ will tend towards $X(n) \approx \E[X(1)]n = 2n$. We'll have $n$ boys for our $\approx 2n$ children, leading to a ratio of $\approx \frac{1}{2}$ with tight error bounds given by the CLT.

The applicability of the CLT depends, loosely speaking, on "how poorly behaved" the distribution that you're sampling from is, and the size of your sample. Lucky for us, the geometric distribution is well behaved (finite variance and mean – both assumptions of the CLT), and our samples are definitely independent. We're not always that lucky though – some fat tailed distribution such as the Cauchy distribution for which neither mean nor variance is defined can prove problematic.

So how well does the CLT handle our family planning problem? The expected ratio of boys to girls for a population of $k$ individuals is given by:

$\E[R(n)] = 1 - \frac{k}{2}\left(\psi\left(\frac{k + 2}{2}\right) - \psi\left(\frac{k + 1}{2}\right)\right)$

where $\psi$ is the Digamma function (the derivation is here). Plotting this function we see that it rapidly converges to a ratio of .5: We can run a quick simulation to confirm our results: We see that our simulations are within a few percent of the theoretical and converge towards the true value as $n$ becomes large. So far we're only looking at one simulated population. How do our results look if we average across many simulated populations? The graph above depicts the empirical and exact population ratios, along with bands denoting variance between means for the simulated populations. As you can see, averaging across multiple simulated populations gives much faster convergence. The Central Limit Theorem works its magic once again, this time by smoothing out variation between our simulated populations. We can see that even for a single family population, with enough simulation passes the empirical result is almost indistinguishable from the analytical one. Pretty cool if you ask me.

# The Geometry of Signal ExtractionOct 9

## Teasing out the Signal

There's a classic signal extraction problem stated as follows: you observe a random variable $Z$ as the sum of two normal distributions, $N(\mu_1, \sigma_1) \sim X$ and $N(\mu_2, \sigma_2) \sim Y$ such that $Z = X + Y$. Given an observation of $Z = c$, what is the conditional expectation of $X$?

The problem asks us to find $\E[X | X + Y = c]$. There are a number of reasons why we might want to do so. For starters, if we're interested in the value of some Gaussian distribution $X$, but we can only observe $X + Y$, the conditional expectation given above is exactly what we're looking for. In the past I've seen it derived hammer and tongs via the definition of conditional expectation:

$\E[X | Y ] = \int\frac{xf(x,y)}{f(x)}dx$

If $X$ and $Y$ are statistically independent we can express the joint distribution as a product of marginal distributions, fix $X + Y = c$, and end up with the expression that we're looking for:

$\E[X | X + Y = c] = \frac{\int_0^c xf_x(x)f_y(c-x)dx}{\int_0^c f_x(x)f_y(c-x)dx}$

Ouch. It works, but it's not pretty. Last night I came up with a geometric interpretation for the normal case that I wanted to share. Googling around there are similar derivations but I figured that one more writeup with some deeper explanation wouldn't hurt.

## Regression as an Operator

To start we note a general propriety of conditional expectation: $\E[X | Y] = f(Y)$, where $f$ is some measurable function. We also need a simple decomposition lemma: any random variable $Y = y$ can be written as: $y = \E[y | x] + \epsilon$, where $\epsilon$ is a RV s.t. $\E[\epsilon | x] = 0$ and $\E[f(x)\epsilon] = 0\ \forall \ f(\cdot)$. The intuition here is that almost by definition any variable can be expressed as a conditional expectation and an error term. The proof is simple:

$\E[\epsilon | x] = \E[y - \E[y | x] | x] = \E[y | x] - \E[y | x] = 0$ $\E[f(x)\epsilon] = \E[f(x)\E[\epsilon | x]] = 0$

We need this to prove the following result:

$\E[y | x] = {\tiny\begin{matrix} \\ {\normalsize argmax} \\ ^{\scriptsize f(x)}\end{matrix}} E[(y - f(x))^2]$

Proof:

\begin{aligned} (y - f(x))^2 &= ((y-\E[y | x]) + (\E[y | x] - f(x)))^2 \\ &= (y-\E[y | x])^2 + 2(y-\E[y | x])(\E[y | x] - f(x))) + (\E[y | x] - f(x)))^2 \end{aligned}

From the decomposition property that we proved above $y = \E[y | x] + \epsilon$ so the second term simplifies to $2\epsilon(\E[y | x] - f(x))$. Now let $g(x) \equiv 2 (\E[y | x] - f(x)) = 0$ in expectation by the second decomposition property. Thus, we're left with the first term (not a function of $f(x))$, and the third term, which vanishes when $\E[y | x] = f(x)$, thus minimizing the function. QED.

## A Geometric Interpretation

If the joint distribution of $X$ and $Y$ is normal (and it is for for our example – the joint distribution of a sum of normal distributions is always normal, even if they're not independent) $f(x) = \E[x | y] = \alpha + \beta x$. I won't prove this as it's repeated many times over in the derivation of linear regression. Why do we care? At the end of the day $Z$ is just another random variable. We can set aside the fact that $Z = X + Y$ and note that $\E[X | Z = c]$ is actually just regression. Our nasty integral formula for conditional expectation has a beautiful geometric interpretation: $x = \alpha + \beta z$. We can work out our original signal extraction problem using the formula for simple linear regression:

$\alpha = \bar{y} - \beta\bar{x}$

and:

$\beta = \frac{Cov(x, y)}{Var(x)}$

Applying this to our problem:

\begin{aligned} Cov(X, Z) & = \E(XZ) - E[X]E[Z] \\ &= \E[X^2] + E[X]E[Y] - E[X]^2 - E[X]E[Y] \\ &= \E[X^2] - E[X]^2 = \sigma_x^2 \end{aligned}

The fact that $\E[XY] = E[X]E[Y]$ results from our earlier stipulation that the distributions are independent. We also have that:

$Var(Z) = Var(X + Y) = \sigma_x^2 + \sigma_y^2$

Thus,

$\beta = \frac{\sigma_x^2}{\sigma_x^2 + \sigma_y^2}$

and:

$\alpha = \mu_x - \beta (\mu_x + \mu_y)$

Putting it all together and simplifying gives us our final form:

$\E[X | X + Y = c] = \mu_x\frac{\sigma_y^2}{\sigma_x^2 + \sigma_y^2} + (c - \mu_y)\frac{\sigma_x^2}{\sigma_x^2 + \sigma_y^2}$

Note that when the means are zero the formula above becomes a simple ratio of variances. That's a pretty satisfying result – when $X$ accounts for most of the variance it's reasonable to expect that, when $Y$ has a zero mean, the bulk of whatever we observe comes from variance in $X$. This is very closely related to how principal component analysis works.

## Visualizing the Result

Let's start by taking a look at the density of our 2D Gaussian: The density above resulted from the sum of two independent normal distributions $N(1, 1)$ and $N(2, 2)$. As such, it's centered at $(1, 2)$, and elongated along the $y$ axis. Plugging in the parameters of our distributions gives $\alpha = .4$ and $\beta = .2$. Fitting the simulation data we find that our equation holds:

Note that $R^2 = .2$, the same as our value of $\beta$. For a simple linear regression $R^2$ is simply the squared correlation coefficient, or in our case:

\begin{aligned} R^2 &= \left(\frac{Cov(x, y)}{\sigma_x\sigma_y}\right)^2 \\ &= \frac{\sigma_x^2}{\sigma_x^2 + \sigma_y^2} = \beta \end{aligned}

That gives us a cool new interpretation of $\beta$ as the proportion of variance explained. It does however hint at a shortcoming that $R^2$ has as a goodness of fit measure – it explicitly depends on how our regressors are distributed.

At this point we have a simple formula to calculate a very useful conditional expectation. We also have a nice geometric interpretation of the solution, and an understanding that both regression and our original signal extraction problem distill down to a ratio of variances. Fantastic. However, we're assuming that we know the proper, fixed parametrization of our model: $(\mu_1, \mu_2, \sigma_1, \sigma_2)$, and that's pretty unlikely. How do we estimate these parameters for a time-variant system given that we can only observe a series of $Z \sim z$? There are a myriad of approaches, each with pros, cons, and assumptions. Mixture models and machine learning are growing in favor for many applications. The CDS approach usually involves posing the model as a state space and estimating the parameters online. There's no easy way out when faced with a partially observed, non-stationary process. So sorry folks – when it comes to denoising your signal, the equation above is the easy part

# Visualization is EverythingOct 5

## And with that in mind…

I've introduced a new visualization gallery to showcase my graphics and interactive data, some of which will be post specific, some will stand on their own. Exploratory data analysis helps me see features that don't show up well in summary statistics alone – fat tails, outliers, skew, clustering, fit issues, bias, and patterns. I use R, Python, and Matlab for static visualizations, and I've started playing around with d3.js as a means of generating interactive graphs.

Until now I always wrote tools for interactive data visualization using java and C++. I need a few new toys, and given the progress made towards closing the gap between native and browser performance, web based development seems like the way to go. Javascript got fast; it takes a fraction of the time to develop, the frameworks are awesome, and the results are pretty. I'm setting out to rewrite one of my most data intensive visualizations (a L2/L3 limit order book viewer) using pure HTML5 and javascript. At some point I'll put up a screenshot of the result, or a post mortem on the project if I, like Zuckerberg, put too much faith in HTML5.

As an initial experiment with d3 I decided to write a tool that shows a quick overview of volume and open interest on the CME. You can check out the end-of-day snapshot here. The link given is a simple adaptation of mbostock's tree map visualization. Tree maps are a great way to visualize hierarchical data, and given the nested structure of product sectors and sub sectors, they're a perfect fit.

The version that I use is geared towards real-time surveillance across a subset of markets, but the general idea is the same. One notes that e-mini (ES) volume is remarkably high relative to open interest…a pattern made obvious by switching between the two modes and keeping an eye on ratios.

If you want to play around on your own, or create daily updates, the CME offers publicly available end-of-day data on volume and open interest. Feel free to take my code (here's the javascript) and use an adaptation of this quick and dirty parser to convert your data to JSON. Note that the parse_row function in the code below isn't implemented – it's simple but depends on your input format.