
There's a classic probability problem/brainteaser popularized by Google that states:

In a country in which people only want boys, every family continues to have children until they have a boy. If they have a girl, they have another child. If they have a boy, they stop. What is the proportion of boys to girls in the country?

It seems reasonable that such a country would succeed in its goal of skewing the population demographics. However, Google's solution to the brainteaser goes on to justify how the population will still end up 50% male/50% female.

Interestingly enough, Google's "official" solution of 50%/50% is incorrect, depending on how you interpret their wording. Assuming that they're asking for:

\E\left[\frac{B}{B + G}\right]

(the expected proportion of boys in the population), there's a problem with their reasoning. That's not the only twist. While for a large enough population their answer is very close to correct, for any one family the expected percentage of boys is closer to 70%.

The crux of the issue stems from an invalid application of the expectation operator. We're interested in the random variable R(n) = \frac{n}{X(n)}, where n is fixed and X(n) is itself a random variable: the total number of children in the population. Note that because each family has exactly one boy, n is both the number of families and the number of boys. If we assume that boys and girls are born with probability p = .5, the expected number of children that any one family will have before producing a boy (inclusive) is given by the geometric distribution with p = .5:

\E[X(1)] = \frac{1}{p} = 2

From there, it seems reasonable to assume (as the Google argument does) that:

\E\left[\frac{n}{X(n)}\right] = \frac{n}{\E[X(n)]} = \frac{n}{2n} = .5

However, expectation is a linear operator: it commutes with linear transformations, not with a nonlinear function like x \mapsto \frac{n}{x}, so the equality above does not hold. Taking things one step further, we can find a bound showing that the ratio is greater than 1/2 for any finite population. Jensen's inequality (the Swiss Army Knife of mathematical sanity checks) gives that for a non-degenerate probability distribution and a strictly convex function \varphi, \E[\varphi(X)] > \varphi(\E[X]). Letting \varphi(x) = \frac{n}{x} gives:

\E\left[\frac{n}{X(n)}\right] > \frac{n}{\E[X(n)]}

One of the most interesting things to come out of this analysis is the observation that the expected ratio of boys to girls in one family is a biased estimator of the population mean. To understand why, remember that 50% of the time we’ll observe a family with one child (a boy, making for a ratio of 100% boys), and 50% of the time we’ll observe a family with at least one girl. Working with ratios instead of sums underweights the contribution coming from families with at least one girl. Individual families will, on average, produce a ratio of boys to girls close to 70%. However, as families can have at most one boy and potentially many girls, the population ratio will approach 50% from above.
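To pin down that 70% figure: a single family has one boy out of X(1) children, where X(1) is geometric with p = .5, so the expected proportion of boys in a single family is:

\E\left[\frac{1}{X(1)}\right] = \sum_{k=1}^{\infty}\frac{1}{k}\left(\frac{1}{2}\right)^k = \ln 2 \approx .69

The sum is just the Taylor series for -\ln(1 - x) evaluated at x = \frac{1}{2}, and it matches the simulated value below.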

We can calculate the single family distribution empirically or explicitly. Here's what it looks like:

population_sim.R
library(ggplot2)

# Simulates a single family instance in which the parents will continue
# to have kids until a boy is born. The probability of a boy and a girl is
# taken as .5.
sim.family <- function() {
	# R hates loops, so let's use a tail recursive call to a vectorized 
	# function for performance reasons.
	sim.count <- 10
	sim.family <- function(n) {
		# Conduct sim.count Bernoulli trials and see when the first heads 
		# (a boy) comes up. If there aren't any boys in our vector, recurse, 
		# adding sim.count to the total number of kids in our simulated 
		# family.
		result <- which(as.logical(rbinom(sim.count, 1, .5)))
		if(length(result) > 0)
		{
			return(n + result[1])
		} else {
			return(sim.family(n + sim.count))
		}
	}
	return(sim.family(0))
}

# Simulates n populations of k individuals
sim.population <- function(n, k) {
	matrix(nrow = n, ncol = k, data = apply(as.matrix(1:(n * k)), 1, 
		function(x) { sim.family() }))
}

# Computes the mean ratio of boys to girls for a simulated population.
population.stats <- function(x) {
	mu.vec <- dim(x)[2] / apply(x, 1, sum)
	return(mean(mu.vec))
}

single.sample <- sim.population(100000, 1)
df <- data.frame(Ratio = 1 / single.sample)

ggplot(df, aes(x = Ratio, fill = ..count..)) + 
	geom_histogram() + 
	geom_vline(xintercept = mean(df$Ratio), color = "red", linetype = "longdash") +
	theme(plot.background = element_rect(fill = '#F1F1F1'), 
	legend.background = element_rect(fill = '#F1F1F1')) +
	xlab("Ratio of Boys to Girls") + 
	ylab("Frequency") +
	ggtitle("Distribution of Family Sizes for a Simulated Single Family Population")

Distribution of family sizes for a simulated single family population

The red dashed line denotes the mean value – .69 for my run of the simulation. Using the same set of sample data and treating it as one simulated population of 100,000 instead of 100,000 simulated populations of one family:

population_sim.R
cat(sprintf("Population mean: %.2f\n", 
	dim(single.sample)[1] / sum(single.sample)))

gives "Population mean: 0.50." To see why the population will tend towards 50% we'll need to appeal to the Central Limit Theorem (CLT). For a rigorous explanation of the math see the excellent post by Ben Golub here. In short, by the CLT, as the number of families nn becomes large, the total number of children in the population X(n)X(n) will tend towards X(n)E[X(1)]n=2nX(n) \approx \E[X(1)]n = 2n. We'll have nn boys for our 2n\approx 2n children, leading to a ratio of 12\approx \frac{1}{2} with tight error bounds given by the CLT.

The applicability of the CLT depends, loosely speaking, on "how poorly behaved" the distribution that you're sampling from is, and on the size of your sample. Lucky for us, the geometric distribution is well behaved (finite mean and variance – both assumptions of the CLT), and our samples are definitely independent. We're not always that lucky though – fat-tailed distributions such as the Cauchy distribution, for which neither the mean nor the variance is defined, can prove problematic.
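As a quick illustration of that failure mode (a toy aside, not needed for the boy/girl problem), sample means of Cauchy draws typically keep bouncing around no matter how large the sample gets:

# The Cauchy distribution has no mean, so neither the CLT nor the law of
# large numbers applies -- sample means don't settle down as n grows.
set.seed(42)
sapply(c(1e2, 1e4, 1e6), function(n) mean(rcauchy(n)))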

So how well does the CLT handle our family planning problem? The expected ratio of boys to girls for a population of k individuals is given by:

\E[R(k)] = 1 - \frac{k}{2}\left(\psi\left(\frac{k + 2}{2}\right) - \psi\left(\frac{k + 1}{2}\right)\right)

where ψ\psi is the Digamma function (the derivation is here). Plotting this function we see that it rapidly converges to a ratio of .5:

Expected ratio of boys to girls for a population of k individuals
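The simulations below compare against this exact value via a helper, prop.boys(), that isn't shown in the original listings; a minimal implementation of the digamma formula above (using base R's digamma) would be:

# Exact expected ratio of boys to girls for a population of k individuals,
# per the digamma expression above. Vectorized over k.
prop.boys <- function(k) {
	1 - (k / 2) * (digamma((k + 2) / 2) - digamma((k + 1) / 2))
}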

We can run a quick simulation to confirm our results:

population_sim.R
pop.size <- c(20, 50, 100, 1000, 10000)
results <- NULL
for(k in pop.size)
{
	results <- c(results, population.stats(sim.population(1, k)))
}

pop.size.count <- length(pop.size)
df <- data.frame(
	type = factor(c(rep("Exact", pop.size.count), 
		rep("Simulated", pop.size.count)), levels = c("Exact", "Simulated")),
	mean = c(prop.boys(pop.size), results),
	k = rep(pop.size, 2)
)

ggplot(df, aes(x = k, y = mean, shape = type, color = type)) + 
geom_line() + 
geom_point() + 
theme(plot.background = element_rect(fill = '#F1F1F1'), 
	legend.background = element_rect(fill = '#F1F1F1')) +
	scale_x_log10("Population Size") + 
	ylab("Ratio of Boys to Girls") +
	ggtitle("Simulated and Exact Boy/Girl Ratio vs Population Size")

Simulated ratio of boys to girls for a population of k individuals

We see that our simulations are within a few percent of the theoretical values and converge towards the true value as n becomes large. So far we're only looking at one simulated population. How do our results look if we average across many simulated populations?

population_sim.R
# Computes the mean ratio of boys to girls for a simulated population as a 
# vector in the form (mean, sd).
population.stats <- function(x) {
	mu.vec <- dim(x)[2] / apply(x, 1, sum)
	return(c(mean(mu.vec), sd(mu.vec)))
}

samples <- 1000000
pop.size <- c(1, 5, 10, 20, 50, 100, 500, 1000)

results <- NULL

for(k in pop.size)
{
	# Hold the total number of samples constant across trials to keep things
	# computationally tractable.
	results <- rbind(results, population.stats(
		sim.population(samples / k, k)))
}

pop.size.count <- length(pop.size)
true.prop <- prop.boys(pop.size)
sample.prop <- as.vector(results[, 1])
sample.sd <- as.vector(results[, 2]) / sqrt(pop.size)
df <- data.frame(
	type = factor(c(rep("Exact", pop.size.count), 
		rep("Simulated", pop.size.count)), levels = c("Exact", "Simulated")),
	mean = c(true.prop, sample.prop),
	k = rep(pop.size, 2),
	sigma.lower = c(true.prop, sample.prop - sample.sd),
	sigma.upper = c(true.prop, sample.prop + sample.sd)
)

ggplot(df, aes(x = k, y = mean, shape = type, color = type)) + 
geom_line() + 
geom_point() + 
geom_ribbon(aes(ymax = sigma.upper, ymin = sigma.lower), fill = "blue", 
	alpha = .3) + 
theme(plot.background = element_rect(fill = '#F1F1F1'), 
	legend.background = element_rect(fill = '#F1F1F1')) +
	xlab("Population Size") + 
	ylab("Ratio of Boys to Girls") +
	ggtitle("Simulated and Exact Boy/Girl Ratio vs Population Size")

Distribution of family sizes for a multiple family population

The graph above depicts the empirical and exact population ratios, along with bands denoting variance between means for the simulated populations. As you can see, averaging across multiple simulated populations gives much faster convergence. The Central Limit Theorem works its magic once again, this time by smoothing out variation between our simulated populations. We can see that even for a single family population, with enough simulation passes the empirical result is almost indistinguishable from the analytical one. Pretty cool if you ask me.


Teasing out the Signal

There's a classic signal extraction problem stated as follows: you observe a random variable Z as the sum of two normal random variables, X \sim N(\mu_1, \sigma_1) and Y \sim N(\mu_2, \sigma_2), such that Z = X + Y. Given an observation of Z = c, what is the conditional expectation of X?

The problem asks us to find \E[X | X + Y = c]. There are a number of reasons why we might want to do so. For starters, if we're interested in the value of some Gaussian distribution X, but we can only observe X + Y, the conditional expectation given above is exactly what we're looking for. In the past I've seen it derived hammer and tongs via the definition of conditional expectation:

\E[X | Y = y] = \int\frac{xf(x,y)}{f_Y(y)}dx

If X and Y are statistically independent we can express the joint distribution as a product of marginal distributions, fix X + Y = c, and end up with the expression that we're looking for:

\E[X | X + Y = c] = \frac{\int_{-\infty}^{\infty} xf_x(x)f_y(c-x)\,dx}{\int_{-\infty}^{\infty} f_x(x)f_y(c-x)\,dx}

Ouch. It works, but it's not pretty. Last night I came up with a geometric interpretation for the normal case that I wanted to share. Googling around there are similar derivations but I figured that one more writeup with some deeper explanation wouldn't hurt.
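As a quick numerical sanity check on that formula, we can evaluate both integrals with base R's integrate and compare against the closed form derived at the end of this section. The parameters here (\mu_x = 1, \sigma_x = 1, \mu_y = 2, \sigma_y = 2, c = 4) are simply the ones used in the simulation later in this post:

# Numerically evaluate the conditional expectation integral and compare it
# to the closed-form answer derived below. The two should agree.
mu_x <- 1; sigma_x <- 1
mu_y <- 2; sigma_y <- 2
c0 <- 4

num <- integrate(function(x) x * dnorm(x, mu_x, sigma_x) * dnorm(c0 - x, mu_y, sigma_y),
	-Inf, Inf)$value
den <- integrate(function(x) dnorm(x, mu_x, sigma_x) * dnorm(c0 - x, mu_y, sigma_y),
	-Inf, Inf)$value

num / den
# Closed form for E[X | X + Y = c]:
mu_x * sigma_y^2 / (sigma_x^2 + sigma_y^2) + (c0 - mu_y) * sigma_x^2 / (sigma_x^2 + sigma_y^2)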

Regression as an Operator

To start we note a general property of conditional expectation: \E[X | Y] = f(Y), where f is some measurable function. We also need a simple decomposition lemma: any random variable y can be written as y = \E[y | x] + \epsilon, where \epsilon is a random variable such that \E[\epsilon | x] = 0 and \E[f(x)\epsilon] = 0 for all f(\cdot). The intuition here is that, almost by definition, any variable can be expressed as a conditional expectation plus an error term. The proof is simple:

\E[\epsilon | x] = \E[y - \E[y | x] \mid x] = \E[y | x] - \E[y | x] = 0

\E[f(x)\epsilon] = \E[\E[f(x)\epsilon \mid x]] = \E[f(x)\E[\epsilon | x]] = 0

We need this to prove the following result:

\E[y | x] = \underset{f(x)}{\operatorname{argmin}}\ \E\left[(y - f(x))^2\right]

Proof:

\begin{aligned} (y - f(x))^2 &= ((y-\E[y | x]) + (\E[y | x] - f(x)))^2 \\ &= (y-\E[y | x])^2 + 2(y-\E[y | x])(\E[y | x] - f(x)) + (\E[y | x] - f(x))^2 \end{aligned}

From the decomposition property that we proved above, y - \E[y | x] = \epsilon, so the second term simplifies to 2\epsilon(\E[y | x] - f(x)). Letting g(x) \equiv 2(\E[y | x] - f(x)), the second term is \epsilon g(x), which has zero expectation by the second decomposition property. Thus we're left with the first term (not a function of f(x)) and the third term, which vanishes when \E[y | x] = f(x), minimizing the expectation. QED.
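A tiny toy simulation (separate from our signal extraction setup) makes the point concrete: take v = u + noise, so that \E[v | u] = u, and the fitted conditional mean beats any other candidate predictor in mean squared error:

# E[v | u] = u by construction; the conditional mean minimizes squared error.
set.seed(1)
u <- rnorm(1e5)
v <- u + rnorm(1e5)
cond.mean <- fitted(lm(v ~ u))   # recovers approximately u
mean((v - cond.mean)^2)          # close to the irreducible noise variance, 1
mean((v - 0.5 * u)^2)            # any other f(u) does worse
mean((v - (u + 0.1))^2)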

A Geometric Interpretation

If the joint distribution of X and Z is normal (and it is for our example – X and Y are independent Gaussians, so X and Z = X + Y are jointly normal; the same holds whenever X and Y are jointly normal, even if they're not independent), the conditional expectation is linear: \E[x | z] = \alpha + \beta z. I won't prove this as it's repeated many times over in the derivation of linear regression. Why do we care? At the end of the day Z is just another random variable. We can set aside the fact that Z = X + Y and note that \E[X | Z = c] is actually just regression. Our nasty integral formula for conditional expectation has a beautiful geometric interpretation: x = \alpha + \beta z. We can work out our original signal extraction problem using the formulas for simple linear regression, here regressing x on z:

\alpha = \bar{x} - \beta\bar{z}

and:

\beta = \frac{Cov(x, z)}{Var(z)}

Applying this to our problem:

\begin{aligned} Cov(X, Z) &= \E[XZ] - \E[X]\E[Z] \\ &= \E[X^2] + \E[X]\E[Y] - \E[X]^2 - \E[X]\E[Y] \\ &= \E[X^2] - \E[X]^2 = \sigma_x^2 \end{aligned}

The fact that \E[XY] = \E[X]\E[Y] results from our earlier stipulation that the distributions are independent. We also have that:

Var(Z) = Var(X + Y) = \sigma_x^2 + \sigma_y^2

Thus,

\beta = \frac{\sigma_x^2}{\sigma_x^2 + \sigma_y^2}

and:

\alpha = \mu_x - \beta(\mu_x + \mu_y)

Putting it all together and simplifying gives us our final form:

\E[X | X + Y = c] = \mu_x\frac{\sigma_y^2}{\sigma_x^2 + \sigma_y^2} + (c - \mu_y)\frac{\sigma_x^2}{\sigma_x^2 + \sigma_y^2}

Note that when both means are zero the formula above reduces to c scaled by a simple ratio of variances. That's a pretty satisfying result – when X accounts for most of the variance, and Y has zero mean, it's reasonable to expect that the bulk of whatever we observe comes from variance in X. This is very closely related to how principal component analysis works.
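Explicitly, setting \mu_x = \mu_y = 0 in the formula above gives:

\E[X | X + Y = c] = c\,\frac{\sigma_x^2}{\sigma_x^2 + \sigma_y^2}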

Visualizing the Result

Let's start by taking a look at the density of our 2D Gaussian:

gaussian_simulation.R
library(ggplot2)

n <- 100000
mu_x <- 1
mu_y <- 2
sigma_x <- 1
sigma_y <- 2

x <- rnorm(n, mean = mu_x, sd = sigma_x)
y <- rnorm(n, mean = mu_y, sd = sigma_y)
# The observed signal is the sum of the two components.
z <- x + y

var_z <- sigma_x^2 + sigma_y^2

beta <- sigma_x^2 / var_z
alpha <- mu_x - beta * (mu_x + mu_y)

df <- data.frame(x, y)
ggplot(df, aes(x = x, y = y)) +
  stat_density2d(aes(fill = ..level..), geom="polygon") + 
  geom_abline(intercept = alpha, slope = beta, color = "red", 
  	linetype = "longdash") + 
  geom_vline(xintercept = mu_x, color = "red", linetype = "dotted") + 
  geom_hline(yintercept = mu_y, color = "red", linetype = "dotted") + 
  theme(plot.background = element_rect(fill = '#F1F1F1'), 
  	legend.background = element_rect(fill = '#F1F1F1')) +
	ggtitle("2D Normal Density")

2D Gaussian Density

The density above resulted from the sum of two independent normal distributions, N(1, 1) and N(2, 2). As such, it's centered at (1, 2), and elongated along the y axis. Plugging in the parameters of our distributions gives \alpha = .4 and \beta = .2. Fitting the simulation data we find that our equation holds:

gaussian_simulation.R
summary(lm(x ~ z))

Call:
lm(formula = x ~ z)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.5625 -0.6019 -0.0021  0.6016  3.9475 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 0.400942   0.004733   84.71   <2e-16 ***
z           0.199886   0.001263  158.29   <2e-16 ***
---
Signif. codes:  0 *** 0.001 ** 0.01 * 0.05 . 0.1   1 

Residual standard error: 0.8946 on 99998 degrees of freedom
Multiple R-squared: 0.2004,	Adjusted R-squared: 0.2004 
F-statistic: 2.506e+04 on 1 and 99998 DF,  p-value: < 2.2e-16 

Note that R^2 = .2, the same as our value of \beta. For a simple linear regression R^2 is simply the squared correlation coefficient, or in our case:

\begin{aligned} R^2 &= \left(\frac{Cov(x, z)}{\sigma_x\sigma_z}\right)^2 \\ &= \frac{\sigma_x^2}{\sigma_x^2 + \sigma_y^2} = \beta \end{aligned}

That gives us a cool new interpretation of \beta as the proportion of variance explained. It does however hint at a shortcoming that R^2 has as a goodness of fit measure – it explicitly depends on how our regressors are distributed.
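As a quick check against the simulated data (reusing the x, y, and z vectors defined in gaussian_simulation.R above), the empirical slope and R^2 should both land near \sigma_x^2 / (\sigma_x^2 + \sigma_y^2) = .2:

# Empirical beta and R^2 for the regression of x on z.
cov(x, z) / var(z)
cor(x, z)^2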

At this point we have a simple formula to calculate a very useful conditional expectation. We also have a nice geometric interpretation of the solution, and an understanding that both regression and our original signal extraction problem distill down to a ratio of variances. Fantastic. However, we're assuming that we know the proper, fixed parametrization of our model: (\mu_1, \mu_2, \sigma_1, \sigma_2), and that's pretty unlikely. How do we estimate these parameters for a time-varying system given that we can only observe a series of draws of Z? There are myriad approaches, each with pros, cons, and assumptions. Mixture models and machine learning are growing in favor for many applications. The CDS approach usually involves posing the model as a state space and estimating the parameters online. There's no easy way out when faced with a partially observed, non-stationary process. So sorry folks – when it comes to denoising your signal, the equation above is the easy part.


And with that in mind…

I've introduced a new visualization gallery to showcase my graphics and interactive data, some of which will be post-specific, while others will stand on their own. Exploratory data analysis helps me see features that don't show up well in summary statistics alone – fat tails, outliers, skew, clustering, fit issues, bias, and patterns. I use R, Python, and Matlab for static visualizations, and I've started playing around with d3.js as a means of generating interactive graphs.

Until now I always wrote tools for interactive data visualization using Java and C++. I need a few new toys, and given the progress made towards closing the gap between native and browser performance, web based development seems like the way to go. JavaScript got fast; it takes a fraction of the time to develop, the frameworks are awesome, and the results are pretty. I'm setting out to rewrite one of my most data intensive visualizations (an L2/L3 limit order book viewer) using pure HTML5 and JavaScript. At some point I'll put up a screenshot of the result, or a post mortem on the project if I, like Zuckerberg, put too much faith in HTML5.

As an initial experiment with d3 I decided to write a tool that shows a quick overview of volume and open interest on the CME. You can check out the end-of-day snapshot here. The link given is a simple adaptation of mbostock's tree map visualization. Tree maps are a great way to visualize hierarchical data, and given the nested structure of product sectors and sub sectors, they're a perfect fit.

The version that I use is geared towards real-time surveillance across a subset of markets, but the general idea is the same. One notes that e-mini (ES) volume is remarkably high relative to open interest…a pattern made obvious by switching between the two modes and keeping an eye on ratios.

If you want to play around on your own, or create daily updates, the CME offers publicly available end-of-day data on volume and open interest. Feel free to take my code (here's the JavaScript) and use an adaptation of this quick and dirty parser to convert your data to JSON. Note that the parse_row function in the code below isn't implemented – it's simple but depends on your input format.

encode_json.py
#!/usr/bin/env python

import sys
import json

class Node(object):
    def __init__(self, parent, name):
        self.parent = parent
        self.name = name
        self.children = []        

class Leaf(Node):
    def __init__(self, parent, name, full_name, volume, open_interest):
        super(Leaf, self).__init__(parent, None)
        self.name = name
        self.full_name = full_name
        self.volume = volume
        self.open_interest = open_interest

class NodeJSONEncoder(json.JSONEncoder):
    def default(self, node):
        if type(node) == Node:
            return { "name":node.name, "children":node.children }
        elif type(node) == Leaf:
            return { "name":node.name, "full_name":node.full_name, 
                "volume":node.volume, "open_interest":node.open_interest }
        raise TypeError("{} is not an instance of Node".format(node))

def get_or_create_node(node_dictionary, parent, name):
    if name in node_dictionary:
        return node_dictionary[name]
    else:
        node = Node(parent, name)
        node_dictionary[name] = node
        parent.children.append(node)
        return node


if __name__ == "__main__":
    
    sectors = dict()
    subsectors = dict()

    root = Node(None, "CME Products")

    with open(sys.argv[1]) as f:
        for row in f.readlines():           
            p = parse_row(row)        
            sector = get_or_create_node(sectors, root, p.sector_name)
            subsector = get_or_create_node(subsectors, sector, p.subsector_name)
            subsector.children.append(Leaf(subsector, p.name, p.full_name, 
                p.volume, p.open_interest))

    print(NodeJSONEncoder().encode(root))