
The Geometry of Signal Extraction

Teasing out the Signal

There's a classic signal extraction problem stated as follows: you observe a random variable $Z$, the sum of two independent normal random variables $X \sim N(\mu_1, \sigma_1)$ and $Y \sim N(\mu_2, \sigma_2)$, so that $Z = X + Y$. Given an observation $Z = c$, what is the conditional expectation of $X$?

The problem asks us to find $\E[X | X + Y = c]$. There are a number of reasons why we might want to do so. For starters, if we're interested in the value of some Gaussian random variable $X$ but we can only observe $X + Y$, the conditional expectation above is exactly what we're looking for. In the past I've seen it derived hammer and tongs via the definition of conditional expectation:

\E[X | Y = y] = \int \frac{x f(x, y)}{f_y(y)}\, dx

If $X$ and $Y$ are statistically independent we can express the joint density as a product of the marginal densities, fix $X + Y = c$, and end up with the expression that we're looking for:

\E[X | X + Y = c] = \frac{\int_{-\infty}^{\infty} x f_x(x) f_y(c - x)\, dx}{\int_{-\infty}^{\infty} f_x(x) f_y(c - x)\, dx}

Ouch. It works, but it's not pretty. Last night I came up with a geometric interpretation for the normal case that I wanted to share. Googling around turns up similar derivations, but I figured one more writeup with some deeper explanation wouldn't hurt.
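Before diving in, a quick sanity check: the ratio of integrals above is easy to evaluate numerically. Here's a minimal R sketch; the parameters and the observed value c_obs are arbitrary choices for illustration, and the closed form it's compared against is the one derived at the end of this post.

# Numerically evaluate E[X | X + Y = c] via the ratio-of-integrals formula.
# mu_x, sigma_x, mu_y, sigma_y, and c_obs are arbitrary illustrative values.
mu_x <- 1; sigma_x <- 1
mu_y <- 2; sigma_y <- 2
c_obs <- 4

num <- integrate(function(x) x * dnorm(x, mu_x, sigma_x) * dnorm(c_obs - x, mu_y, sigma_y),
                 lower = -Inf, upper = Inf)$value
den <- integrate(function(x) dnorm(x, mu_x, sigma_x) * dnorm(c_obs - x, mu_y, sigma_y),
                 lower = -Inf, upper = Inf)$value
num / den  # ~1.2

# Closed form derived below, for comparison:
mu_x * sigma_y^2 / (sigma_x^2 + sigma_y^2) +
  (c_obs - mu_y) * sigma_x^2 / (sigma_x^2 + sigma_y^2)  # 1.2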

Regression as an Operator

To start we note a general property of conditional expectation: $\E[X | Y] = f(Y)$, where $f$ is some measurable function. We also need a simple decomposition lemma: any random variable $y$ can be written as $y = \E[y | x] + \epsilon$, where $\epsilon$ is a random variable such that $\E[\epsilon | x] = 0$ and $\E[f(x)\epsilon] = 0 \ \forall \ f(\cdot)$. The intuition here is that, almost by definition, any variable can be expressed as a conditional expectation plus an error term. The proof is simple:

\E[\epsilon | x] = \E[y - \E[y | x] \,|\, x] = \E[y | x] - \E[y | x] = 0

\E[f(x)\epsilon] = \E[\E[f(x)\epsilon \,|\, x]] = \E[f(x)\E[\epsilon | x]] = 0

We need this to prove the following result:

\E[y | x] = \underset{f(x)}{\operatorname{argmin}}\ \E[(y - f(x))^2]

Proof:

\begin{aligned} (y - f(x))^2 &= ((y - \E[y | x]) + (\E[y | x] - f(x)))^2 \\ &= (y - \E[y | x])^2 + 2(y - \E[y | x])(\E[y | x] - f(x)) + (\E[y | x] - f(x))^2 \end{aligned}

From the decomposition property that we proved above, $y - \E[y | x] = \epsilon$, so the second term is $\epsilon \, g(x)$ with $g(x) \equiv 2(\E[y | x] - f(x))$, and $\E[\epsilon \, g(x)] = 0$ by the second decomposition property. Taking expectations, we're left with the first term, which doesn't depend on $f(x)$, and the third term, which vanishes exactly when $f(x) = \E[y | x]$, thus minimizing the expression. QED.
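To make this concrete, here's a small simulation in the same spirit: a toy jointly normal pair where $\E[y | x] = 0.5x$ by construction, so the conditional mean should beat any other candidate function of $x$ on mean squared error. The slopes and noise level are arbitrary choices for illustration.

# Toy example: y = 0.5 * x + noise, so E[y | x] = 0.5 * x by construction.
# The conditional mean should achieve the smallest mean squared error.
n <- 100000
x <- rnorm(n)
y <- 0.5 * x + rnorm(n, sd = 0.5)

mse <- function(f) mean((y - f(x))^2)
mse(function(x) 0.5 * x)  # the conditional mean: ~0.25
mse(function(x) 0.4 * x)  # a mis-specified slope does slightly worse
mse(function(x) x^2)      # an arbitrary nonlinear guess does much worse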

A Geometric Interpretation

If $X$ and $Z$ are jointly normal (and they are for our example – $X$ and $Y$ are independent normals, so any linear combination of them, and in particular the pair $(X, Z)$, is jointly normal), the conditional expectation is linear: $f(z) = \E[X | Z = z] = \alpha + \beta z$. I won't prove this as it's repeated many times over in derivations of linear regression. Why do we care? At the end of the day $Z$ is just another random variable. We can set aside the fact that $Z = X + Y$ and note that $\E[X | Z = c]$ is really just a regression. Our nasty integral formula for conditional expectation has a beautiful geometric interpretation: a line, $x = \alpha + \beta z$. We can work out our original signal extraction problem using the formulas for simple linear regression, with $x$ regressed on $z$:

\alpha = \mu_x - \beta\mu_z

and:

\beta = \frac{Cov(X, Z)}{Var(Z)}

Applying this to our problem:

\begin{aligned} Cov(X, Z) &= \E[XZ] - \E[X]\E[Z] \\ &= \E[X^2] + \E[X]\E[Y] - \E[X]^2 - \E[X]\E[Y] \\ &= \E[X^2] - \E[X]^2 = \sigma_x^2 \end{aligned}

The fact that $\E[XY] = \E[X]\E[Y]$ results from our earlier stipulation that the distributions are independent. We also have that:

Var(Z) = Var(X + Y) = \sigma_x^2 + \sigma_y^2

Thus,

\beta = \frac{\sigma_x^2}{\sigma_x^2 + \sigma_y^2}

and:

\alpha = \mu_x - \beta(\mu_x + \mu_y)
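Substituting $\alpha$ and $\beta$ into the regression line $\E[X | Z = c] = \alpha + \beta c$, and noting that $1 - \beta = \sigma_y^2 / (\sigma_x^2 + \sigma_y^2)$:

\begin{aligned} \E[X | Z = c] &= \alpha + \beta c = \mu_x - \beta(\mu_x + \mu_y) + \beta c \\ &= \mu_x(1 - \beta) + \beta(c - \mu_y) \end{aligned}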

Putting it all together and simplifying gives us our final form:

\E[X | X + Y = c] = \mu_x\frac{\sigma_y^2}{\sigma_x^2 + \sigma_y^2} + (c - \mu_y)\frac{\sigma_x^2}{\sigma_x^2 + \sigma_y^2}

Note that when the means are zero the formula above reduces to $c$ times a simple ratio of variances. That's a pretty satisfying result – when $X$ accounts for most of the variance, it's reasonable to expect that the bulk of whatever we observe comes from variance in $X$. This is very closely related to how principal component analysis works.
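As a brute-force numerical check (using the same parameters as the simulation in the next section, with an arbitrary observation c_obs = 4), we can average $x$ over draws where $z$ lands in a narrow band around $c$ and compare against the closed form:

# Brute-force check of E[X | Z = c]: average x over draws with z near c_obs.
# Parameters match the simulation below; c_obs is an arbitrary choice.
n <- 1e6
mu_x <- 1; sigma_x <- 1
mu_y <- 2; sigma_y <- 2
c_obs <- 4

x <- rnorm(n, mu_x, sigma_x)
z <- x + rnorm(n, mu_y, sigma_y)

mean(x[abs(z - c_obs) < 0.01])  # empirical conditional mean, ~1.2
mu_x * sigma_y^2 / (sigma_x^2 + sigma_y^2) +
  (c_obs - mu_y) * sigma_x^2 / (sigma_x^2 + sigma_y^2)  # closed form: 1.2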

Visualizing the Result

Let's start by taking a look at the density of our 2D Gaussian:

gaussian_simulation.R
library(ggplot2)

n <- 100000
mu_x <- 1
mu_y <- 2
sigma_x <- 1
sigma_y <- 2

# Simulate the signal x, the noise y, and the observed sum z.
x <- rnorm(n, mean = mu_x, sd = sigma_x)
y <- rnorm(n, mean = mu_y, sd = sigma_y)
z <- x + y

# Theoretical regression coefficients for E[X | Z] = alpha + beta * z.
var_z <- sigma_x^2 + sigma_y^2
beta <- sigma_x^2 / var_z
alpha <- mu_x - beta * (mu_x + mu_y)

df <- data.frame(x, y)
ggplot(df, aes(x = x, y = y)) +
  stat_density2d(aes(fill = ..level..), geom = "polygon") +
  geom_abline(intercept = alpha, slope = beta, color = "red",
              linetype = "longdash") +
  geom_vline(xintercept = mu_x, color = "red", linetype = "dotted") +
  geom_hline(yintercept = mu_y, color = "red", linetype = "dotted") +
  theme(plot.background = element_rect(fill = '#F1F1F1'),
        legend.background = element_rect(fill = '#F1F1F1')) +
  ggtitle("2D Normal Density")

2D Gaussian Density

The density above is the joint density of our two independent normal distributions, $N(1, 1)$ and $N(2, 2)$. As such, it's centered at $(1, 2)$ and elongated along the $y$ axis. Plugging the parameters of our distributions into the formulas above gives $\alpha = .4$ and $\beta = .2$. Fitting the simulated data, we find that our equation holds:

gaussian_simulation.R
summary(lm(x ~ z))

Call:
lm(formula = x ~ z)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.5625 -0.6019 -0.0021  0.6016  3.9475 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 0.400942   0.004733   84.71   <2e-16 ***
z           0.199886   0.001263  158.29   <2e-16 ***
---
Signif. codes:  0 *** 0.001 ** 0.01 * 0.05 . 0.1   1 

Residual standard error: 0.8946 on 99998 degrees of freedom
Multiple R-squared: 0.2004,	Adjusted R-squared: 0.2004 
F-statistic: 2.506e+04 on 1 and 99998 DF,  p-value: < 2.2e-16 

Note that $R^2 = .2$, the same as our value of $\beta$. For a simple linear regression, $R^2$ is simply the squared correlation coefficient, or in our case:

\begin{aligned} R^2 &= \left(\frac{Cov(X, Z)}{\sigma_x\sigma_z}\right)^2 = \frac{\sigma_x^4}{\sigma_x^2(\sigma_x^2 + \sigma_y^2)} \\ &= \frac{\sigma_x^2}{\sigma_x^2 + \sigma_y^2} = \beta \end{aligned}

That gives us a cool new interpretation of $\beta$ as the proportion of variance explained. It does, however, hint at a shortcoming of $R^2$ as a goodness-of-fit measure – it depends explicitly on how our regressors are distributed.
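A one-line check of that identity on the simulated data, using the x, z, and beta already defined in gaussian_simulation.R:

# The squared sample correlation between x and z should land on beta = 0.2.
cor(x, z)^2
beta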

At this point we have a simple formula to compute a very useful conditional expectation. We also have a nice geometric interpretation of the solution, and an understanding that both regression and our original signal extraction problem distill down to a ratio of variances. Fantastic. However, we're assuming that we know the proper, fixed parametrization of our model, $(\mu_1, \mu_2, \sigma_1, \sigma_2)$, and that's pretty unlikely. How do we estimate these parameters for a time-variant system given that we can only observe a series of draws of $Z$? There are a myriad of approaches, each with pros, cons, and assumptions. Mixture models and machine learning are growing in favor for many applications. The CDS approach usually involves posing the model as a state space and estimating the parameters online. There's no easy way out when faced with a partially observed, non-stationary process. So sorry folks – when it comes to denoising your signal, the equation above is the easy part.
