
Smoothing

Overview

Teaching: 30 min
Exercises: 30 min
Questions
  • Can a model be fitted to a dataset whose shape is unknown but assumed to be smooth?

Objectives
  • Fit a smooth regression model to data whose behavior depends conditionally on a set of predictors

  • Predict the expected value of a smooth model given the value of the predictors

Smoothing

Smoothing is a very powerful technique used all across data analysis. It is designed to estimate $f(x)$ when the shape is unknown, but assumed to be smooth. The general idea is to group data points that are expected to have similar expectations and compute the average, or fit a simple parametric model. We illustrate two smoothing techniques using a gene expression example.

The following data are gene expression measurements from replicated RNA samples.

## The following three packages are available from Bioconductor
library(Biobase)
library(SpikeIn)
library(hgu95acdf)
data(SpikeIn95)

We consider the data used in an MA-plot comparing two replicated samples ($Y$ = log ratios and $X$ = averages) and down-sample in a way that balances the number of points across the different strata of $X$ (code not shown):
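
Although the exact code is not shown, a rough sketch of how such $X$ and $Y$ values can be constructed is given below; the array indices, the use of pm() from the affy package, and the down-sampling rule are assumptions for illustration only.

## Sketch only: assumes columns 9 and 10 of the perfect-match intensity
## matrix are the two replicate arrays; pm() comes from the affy package,
## which is loaded along with SpikeIn.
pms <- pm(SpikeIn95)
Y <- log2(pms[,10]) - log2(pms[,9])       # M: log ratios
X <- (log2(pms[,10]) + log2(pms[,9]))/2   # A: average log intensities
## Down-sample so that each stratum of X contributes a similar number of points
set.seed(1)
strata <- split(seq_along(X), round(X))
keep <- unlist(lapply(strata, function(idx) idx[sample.int(length(idx), min(length(idx), 50))]))
X <- X[keep]
Y <- Y[keep]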

library(rafalib)
mypar()
plot(X,Y)

MA-plot comparing gene expression from two arrays.

In the MA plot we see that $Y$ depends on $X$. This dependence must be a bias because these are based on replicates, which means $Y$ should be 0 on average regardless of $X$. We want to predict $f(x)=\mbox{E}(Y \mid X=x)$ so that we can remove this bias. Linear regression does not capture the apparent curvature in $f(x)$:

mypar()
plot(X,Y)
fit <- lm(Y~X)
## color the points by the sign of their residual
points(X,Y,pch=21,bg=ifelse(Y>fit$fitted,1,3))
abline(fit,col=2,lwd=4,lty=2)

MA-plot comparing gene expression from two arrays with fitted regression line. The two colors represent positive and negative residuals.

The points above the fitted line (green) and those below (purple) are not evenly distributed. We therefore need an alternative, more flexible approach.

Bin Smoothing

Instead of fitting a line, let’s go back to the idea of stratifying and computing the mean. This is referred to as bin smoothing. The general idea is that the underlying curve is “smooth” enough so that, in small bins, the curve is approximately constant. If we assume the curve is constant, then all the $Y$ in that bin have the same expected value. For example, in the plot below, using bins of size 1, we highlight the points in a bin centered at 8.6, as well as the points in a bin centered at 12.1. We also show the fitted mean values for the $Y$ in those bins with dashed lines (code not shown):

MA-plot comparing gene expression from two arrays with bin smoother fit shown for two points.

By computing this mean for bins around every point, we form an estimate of the underlying curve $f(x)$. Below we show the procedure happening as we move from the smallest value of $x$ to the largest. We show 10 intermediate cases as well (code not shown):

Illustration of how bin smoothing estimates a curve. Showing 12 steps of the process.

The final result looks like this (code not shown):

MA-plot with the bin-smoothed curve shown.
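
Although the code used for the figures is not shown, a minimal sketch of the procedure, assuming bins of size 1 centered at each observed value of $X$, could look like this:

## Minimal bin smoother sketch: for each point, average the Y values of all
## points whose X falls within a bin of size 1 centered at that point.
binwidth <- 1
fhat <- sapply(X, function(x0){
  in_bin <- abs(X - x0) <= binwidth/2
  mean(Y[in_bin])
})
mypar()
plot(X,Y,col="darkgrey",pch=16)
o <- order(X)
lines(X[o],fhat[o],col="black",lwd=3)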

There are several functions in R that implement bin smoothers. One example is ksmooth. However, in practice, we typically prefer methods that use slightly more complicated models than fitting a constant. The final result above, for example, is somewhat wiggly. Methods such as loess, which we explain next, improve on this.
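
For example, a bin smoother similar to the one above can be obtained with ksmooth and a box kernel; the bandwidth below is an assumption chosen to roughly match the bins of size 1 used earlier:

## Box-kernel smoother from base R; bandwidth=1 is chosen to roughly match
## the bins of size 1 used above.
fit <- ksmooth(X, Y, kernel="box", bandwidth=1, x.points=sort(X))
mypar()
plot(X,Y,col="darkgrey",pch=16)
lines(fit$x,fit$y,col="black",lwd=3)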

Loess

Local weighted regression (loess) is similar to bin smoothing in principle. The main difference is that we approximate the local behavior with a line or a parabola. This permits us to expand the bin sizes, which stabilizes the estimates. Below we see lines fitted to two bins that are slightly larger than those we used for the bin smoother (code not shown). We can use larger bins because fitting lines provides slightly more flexibility.

MA-plot comparing gene expression from two arrays with local regression fit shown for two points.

As we did for the bin smoother, we show 12 steps of the process that leads to a loess fit (code not shown):

Illustration of how loess estimates a curve. Showing 12 steps of the process.

The final result is a smoother fit than the bin smoother since we use larger sample sizes to estimate our local parameters (code not shown):

MA-plot with curve obtained with loess.

The function loess performs this analysis for us:

fit <- loess(Y~X, degree=1, span=1/3)

newx <- seq(min(X),max(X),len=100) 
smooth <- predict(fit,newdata=data.frame(X=newx))

mypar()
plot(X,Y,col="darkgrey",pch=16)
lines(newx,smooth,col="black",lwd=3)

Loess fitted with the loess function.

There are three other important differences between loess and the typical bin smoother. The first is that rather than keeping the bin size the same, loess keeps the number of points used in the local fit the same. This number is controlled via the span argument, which expects a proportion. For example, if N is the number of data points and span=0.5, then for a given $x$, loess will use the 0.5*N closest points to $x$ for the fit. The second difference is that, when fitting the parametric model to obtain $f(x)$, loess uses weighted least squares, with higher weights for points that are closer to $x$. The third difference is that loess has the option of fitting the local model robustly. An iterative algorithm is implemented in which, after fitting a model in one iteration, outliers are detected and downweighted for the next iteration. To use this option, we use the argument family="symmetric".
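
For example, a smaller span and the robust option can be combined as follows; span=0.25 is an illustrative value, not one prescribed by the lesson:

## Illustrative values: a smaller span uses fewer points in each local fit,
## and family="symmetric" downweights outliers across iterations.
fit_robust <- loess(Y~X, degree=1, span=0.25, family="symmetric")
newx <- seq(min(X),max(X),len=100)
smooth_robust <- predict(fit_robust, newdata=data.frame(X=newx))
mypar()
plot(X,Y,col="darkgrey",pch=16)
lines(newx,smooth_robust,col="black",lwd=3)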

Exercises

  1. Generate the following data:
n = 10000
set.seed(1)
# Generate a sample of heights for a mixed population of men and women
men = rnorm(n,176,7) #height in centimeters
women = rnorm(n,162,7) #height in centimeters
# Assign a class label to each height generated above (0: men, 1:women)
y = c(rep(0,n),rep(1,n))
x = round(c(men,women))
## mix it up
ind = sample(seq(along=y))
y = y[ind]
x = x[ind]

Set the seed at 5, set.seed(5), and take a random sample of 250 individuals:

set.seed(5)
N = 250
# Take a sample of size N=250 individuals from our mixed population
ind = sample(length(y), N)
# Remember that `Y` contains the labels that identify if the individual is a
# man or a woman.
Y = y[ind]
# And `X` contains the heights of those individuals.
X = x[ind]

Use loess to estimate $f(x) = \mbox{E}(Y \mid X = x)$ using the default parameters. What is the predicted $f(168)$?

Solution

# Fit a LOESS model to predict if an individual is a man or a woman using
# its height as predictor.
fit <- loess(Y~X)
# Generate a grid on the height axis to plot the model fitted above
newx <- seq(min(X),max(X),len=45)
# Predict if the individual is a man or a woman according to the heights on
# our `newx` grid
hat <- predict(fit, newdata=data.frame(X=newx))
mypar()
plot(X,Y)
names(hat) <- round(newx,1)
lines(newx,hat)
# Let's check the predicted label for an individual whose height is 168 cm.
# A prediction closer to 0 (< 0.5) suggests that the individual is a man,
# whereas a prediction closer to 1 (> 0.5) suggests a woman.
hat['168']
  2. The loess estimate above is a random variable, so we can compute a standard error for it. Use a Monte Carlo simulation to estimate the standard error of your estimate of $f(168)$.

Set the seed to 5, set.seed(5), perform 10,000 simulations, and report the SE of the loess-based estimate.

Solution

set.seed(5)
B <- 10000
N <- 250
newx <- seq(min(X),max(X),len=45)
res <- replicate(B, {
  ind = sample(length(y),N)
  Y = y[ind]
  X = x[ind]
  # The model fitted by LOESS will be different according to the data used
  # to fit it, so we need to fit it again to each new random sample.
  fit <- loess(Y~X)
  hat <- predict(fit, newdata=data.frame(X=newx))
  names(hat) <- round(newx,1)
  # Because the fitted model changes with each random sample, the prediction
  # for a specific height changes too. We want to know how much that
  # prediction varies across samples.
  return(hat['168'])
})
names(res) <- NULL
# Compute the standard error (SE) of the estimate using popsd() from rafalib
popsd(res)

Key Points

  • Smoothing methods work well within the range of predictor values seen in the training set; however, they are not suitable for extrapolating predictions outside that range.
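
As a quick illustration of this limitation with the loess fit from the exercises, predict with the default settings returns NA for predictor values outside the range of the data used to fit the model:

## loess with the default interpolation surface does not extrapolate:
## predictions outside the range of the training X come back as NA.
fit <- loess(Y~X)
predict(fit, newdata=data.frame(X=max(X)+10))  # returns NA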