Normalization in Spatial Transcriptomics

Last updated on 2024-06-18

Overview

Questions

  • How do we determine the necessity of normalization in spatial transcriptomics?
  • What insights do additional modalities like H&E staining provide in assessing normalization needs?
  • How do specific normalization techniques like SCTransform and log scaling work and when should they be applied?

Objectives

  • Assess the need for normalization using both spatial transcriptomics data and ancillary modalities like H&E staining.
  • Understand the specific applications and mechanisms of normalization techniques such as SCTransform and log scaling.
  • Implement adaptive normalization strategies that accurately reflect both absolute and relative cellular information.

Understanding Normalization in Spatial Transcriptomics


Normalization in spatial transcriptomics must be carefully tailored to each dataset, balancing technical corrections against the preservation of biologically meaningful signals. There are two characteristics of the data that we need to adjust for:

  1. the difference in total counts across spots, and
  2. the difference in variance across genes.

For the first goal, each spot may have a different number of total counts, termed the “library size”. Because the denominator (total spot counts) differs from spot to spot, it is difficult to compare gene expression values between spots in a meaningful way. For example, 5 counts of a gene in a spot with 1,000 total counts represent the same relative expression as 50 counts in a spot with 10,000 total counts. On the other hand, different spots may contain different types of cells, which may express differing numbers of transcripts. So there is a balance between normalizing all spots to have the same total counts and leaving some variation in total counts which may be due to the biology of the tissue.

For the second goal, in order to compare gene expression values between different genes, the within-gene variance should be similar between genes. This is because many statistical tests require that the within-group variance be the same. As you’ll see below, there is a relationship between the mean and variance of genes that will allow us to correct for this difference. Hence, we seek to “stabilize the variance” – expression variance should be independent of mean expression.

Total Counts per Spot are Variable

Let’s first assess the variability in the total counts per spot.

The spots are arranged in columns in the data matrix. We will look at the distribution of total counts per spot by summing the counts in each column and making a histogram.

R

counts <- LayerData(filter_st, layer = 'counts')
# Spots are columns, so column sums give the library size of each spot.
hist(colSums2(counts), breaks = 100, 
     main = "Histogram of Counts per Spot",
     xlab = "Total counts per spot")

As you can see, the total counts per spot range across three orders of magnitude. Some of this may be due to the biology of the tissue, i.e., some cells may express more transcripts. But some may be due to technical issues. Let’s explore each of these two considerations further.
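To put a number on that spread, we can summarize the per-spot totals; the exact values will vary by dataset:

R

# Quantify the spread in library sizes across spots
summary(colSums2(counts))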

Sources of Biological Variation in Total Counts

Hematoxylin and Eosin (H&E) staining is critical for preliminary assessments of tissue sections. It highlights structural and pathological features, guiding the interpretation of transcriptomic data. For example, high RNA counts in a necrotic region, typically characterized by reduced cellular material, might suggest technical artifacts, indicating a need for normalization.

Maynard and colleagues used the information encoded in the H&E, in particular cellular organization, morphology, and density, in conjunction with expression data to annotate the six layers and the white matter of the neocortex. Additionally, they applied standard image processing techniques to the H&E image to segment and count nuclei under each spot. They provide this as metadata. Let’s load that layer annotation and cell count metadata and add it to our Seurat object.

R

spot_metadata <- read.table("./data/spot-meta.tsv", sep="\t")
# Subset to our sample
spot_metadata <- subset(spot_metadata, sample_name == 151673)
rownames(spot_metadata) <- spot_metadata$barcode
stopifnot(all(Cells(filter_st) %in% rownames(spot_metadata)))
spot_metadata <- spot_metadata[Cells(filter_st),]

filter_st <- AddMetaData(object = filter_st, metadata = spot_metadata[, c("layer_guess", "cell_count"), drop=FALSE])

Now, we can plot the layer annotations to understand the structure of the tissue. We will use a simple wrapper, SpatialDimPlotColorSafe, around the Seurat function SpatialDimPlot. This is defined in code/spatial_utils.R and uses a color-blind safe palette.

R

SpatialDimPlotColorSafe(filter_st[, !is.na(filter_st[[]]$layer_guess)], "layer_guess") + labs(fill="Layer")

We noted that the authors used cellular density to aid in discerning layers. Let’s see how those H&E-derived cell counts vary across layers.

R

g <- ggplot(na.omit(filter_st[[]][, c("layer_guess", "cell_count")]), aes(x = layer_guess, y = cell_count))
g <- g + geom_boxplot() + xlab("Layer") + ylab("Cell Count")
g

We see that the white matter (WM) contains more cells per spot, whereas Layer 1 contains fewer.

We can also plot these cell counts spatially.

R

SpatialFeaturePlot(filter_st, "cell_count")

The cell counts partially reflect the banding of the layers.

As a potential surrogate for cell count, let’s plot the number of UMIs per spot as a function of layer.

R

g <- ggplot(na.omit(filter_st[[]][, c("layer_guess", "nCount_Spatial")]), aes(x = layer_guess, y = nCount_Spatial))
g <- g + geom_boxplot() + xlab("Layer") + ylab("UMI Count")
g

Layer 1 has fewer UMIs, consistent with its lower cell count. The white matter, however, does not show a corresponding increase in UMIs to match its higher cell count. Despite this imperfect correlation between UMI and cell counts, we wish to emphasize that UMI (i.e., read) counts, as well as feature (i.e., gene) counts, can encode biological information. That certainly occurs here. As such, we strongly recommend visualizing raw UMI and feature counts prior to normalization.
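For example, both metrics can be viewed spatially in a single call. This snippet assumes the standard nCount_Spatial and nFeature_Spatial metadata columns that Seurat creates when a spatial assay is loaded:

R

# Plot raw UMI counts and detected-gene counts per spot side by side
SpatialFeaturePlot(filter_st, features = c("nCount_Spatial", "nFeature_Spatial"))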

Normalization Techniques to Mitigate Sources of Technical Variation in Total Counts

Mean-variance plots are an essential tool for assessing gene expression variability relative to mean expression levels across different spots. By plotting the variance of gene expression against the mean expression level, researchers can identify genes whose variance deviates significantly from expectation. This is particularly useful for spotting genes that are overly influenced by technical artifacts or biological outliers, and it can guide the choice of normalization method.
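Before normalizing, we can sketch this relationship on the raw counts. This is a minimal illustration; it assumes the counts matrix we extracted earlier and that rowMeans2() and rowVars() (from the same matrixStats family as the colSums2() used above) support the matrix class in use:

R

# Per-gene mean and variance of the raw counts
gene_means <- rowMeans2(counts)
gene_vars  <- rowVars(counts)

# On raw counts the variance grows with the mean;
# normalization aims to flatten this trend.
plot(gene_means, gene_vars, log = "xy", pch = 16, cex = 0.3,
     xlab = "Mean expression", ylab = "Variance")
abline(a = 0, b = 1, col = "red")  # y = x: variance equals mean (Poisson expectation)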

LogNormalize

One common approach that attempts to meet our two objectives above is log-transformation of normalized counts. The resulting values are often ambiguously referred to as log-normalized counts, a name that glosses over the fact that the raw counts are first normalized (scaled) and then log-transformed. Scaling accounts for the differences in spot-specific RNA counts. The log transformation reduces skewness caused by highly expressed genes and stabilizes the variance, at least for certain mean-variance relationships. In practice, the log transformation is applied to 1 + x, where x is the scaled expression value – the so-called log1p transformation.
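Conceptually, the arithmetic is simple. Below is a minimal sketch of the computation, assuming the counts matrix we extracted earlier and a scale factor of one million; Seurat’s NormalizeData, used next, performs the equivalent calculation:

R

# Scale each spot (column) to counts per million, then apply log(1 + x)
spot_totals <- colSums2(counts)
cpm         <- t(t(counts) / spot_totals) * 1e6
log_cpm     <- log1p(cpm)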

In Seurat, we can apply this transformation via the NormalizeData function:

R

lognorm_st <- NormalizeData(filter_st, 
                           assay                = "Spatial", 
                           normalization.method = "LogNormalize", 
                           scale.factor         = 1e6)

OUTPUT

Normalizing layer: counts

This function first normalizes the raw counts by scale.factor before applying the log1p transformation. The “scale.factor” argument has a default of 10,000. Here, we selected 1,000,000 because it made the mean-variance relationship somewhat flatter. The normalized counts, in this case, are CPMs – counts per million.

Log normalization adds a “data” Layer to the Seurat object.

R

Layers(lognorm_st)

OUTPUT

[1] "counts" "data"  

Our goal is that the variance should be stabilized – i.e., independent of the mean. Let’s plot this “mean-variance” relationship with the VariableFeaturePlot function. We aim for a flat trend, indicating no relationship between mean and variance. We can use this diagnostic plot to compare normalization methods on a given dataset. Additionally, let’s highlight highly variable genes on this plot.

R

lognorm_st <- FindVariableFeatures(lognorm_st)

OUTPUT

Finding variable features for layer counts

R

top15        <- head(VariableFeatures(lognorm_st), 15)
plot_lognorm <- VariableFeaturePlot(lognorm_st) + 
                  ggtitle("Variable Features - Log normalization")
plot_lognorm <- LabelPoints(plot = plot_lognorm, points = top15, repel = TRUE)

OUTPUT

When using repel, set xnudge and ynudge to 0 for optimal results

R

plot_lognorm

WARNING

Warning in scale_x_log10(): log-10 transformation introduced infinite values.

As a sanity check that the normalization is doing something sensible, let’s look at the expression of two known layer-restricted marker genes – MOBP and PCP4. MOBP is restricted to the white matter, while PCP4 is expressed in Layer 5.

R

SpatialFeaturePlot(lognorm_st, slot="data", c("MOBP"))

R

SpatialFeaturePlot(lognorm_st, slot="data", c("PCP4"))

Indeed, this is what we observe.

SCTransform

SCTransform is a normalization method for single-cell and spatial transcriptomics that uses regularized negative binomial regression to stabilize variance across expression levels (Choudhary et al.). It selects highly variable genes and corrects for technical noise by modeling gene expression counts and computing Pearson residuals. This approach effectively adjusts for confounding factors such as sequencing depth, facilitating more accurate downstream analyses like clustering.
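To make the idea of a Pearson residual concrete, here is a conceptual sketch – not sctransform’s actual implementation. For a gene with expected count mu in a spot (given that spot’s sequencing depth) and negative binomial dispersion theta, the residual measures how far the observed count deviates from the model in units of standard deviations:

R

# Conceptual Pearson residual under a negative binomial model.
# x: observed count; mu: expected count given sequencing depth;
# theta: the gene's NB dispersion. Under this model the variance
# of the count is mu + mu^2 / theta.
pearson_residual <- function(x, mu, theta) {
  (x - mu) / sqrt(mu + mu^2 / theta)
}

# Example: an observed count of 10 where the model expects 4
pearson_residual(x = 10, mu = 4, theta = 5)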

R

filter_st <- SCTransform(filter_st, assay = "Spatial")

OUTPUT

Running SCTransform on assay: Spatial

OUTPUT

Running SCTransform on layer: counts

OUTPUT

vst.flavor='v2' set. Using model with fixed slope and excluding poisson genes.

WARNING

Warning: replacing previous import 'S4Arrays::makeNindexFromArrayViewport' by
'DelayedArray::makeNindexFromArrayViewport' when loading 'SummarizedExperiment'

OUTPUT

Variance stabilizing transformation of count matrix of size 17996 by 3633

OUTPUT

Model formula is y ~ log_umi

OUTPUT

Get Negative Binomial regression parameters per gene

OUTPUT

Using 2000 genes, 3633 cells

OUTPUT

Found 105 outliers - those will be ignored in fitting/regularization step

OUTPUT

Second step: Get residuals using fitted parameters for 17996 genes

OUTPUT

Computing corrected count matrix for 17996 genes

OUTPUT

Calculating gene attributes

OUTPUT

Wall clock passed: Time difference of 20.09192 secs

OUTPUT

Determine variable features

OUTPUT

Centering data matrix

OUTPUT

Set default assay to SCT

The SCTransform method added a new Assay called “SCT.”

R

Assays(filter_st)

OUTPUT

[1] "Spatial" "SCT"    

It made this new Assay the default. Be aware that Seurat functions often operate on the DefaultAssay.

R

DefaultAssay(filter_st)

OUTPUT

[1] "SCT"

Within this new “SCT” Assay, SCTransform has created three Layers.

R

Layers(filter_st)

OUTPUT

[1] "counts"     "data"       "scale.data"

As you can see by reading its documentation, these new Layers are:

  • “counts” – counts corrected for differences in sequencing depth between cells,
  • “data” – the log1p transformation of the corrected counts, and
  • “scale.data” – scaled Pearson residuals, i.e., the difference between an observed count and its expected value under the SCTransform model, divided by the standard deviation of that count under the model.

R

?SCTransform

Notice, in particular, that the “counts” Layers in the “Spatial” and “SCT” Assays are different. As mentioned above, the latter have been corrected for differences in sequencing depth between cells. As such, the distribution in total counts per cell is much more uniform in the latter case.

R

layout(matrix(1:2, ncol = 1))
raw_counts_spatial <- LayerData(filter_st, layer = "counts", assay = "Spatial")
hist(colSums2(raw_counts_spatial), main = "Raw counts (Spatial)")

corrected_counts_sct <- LayerData(filter_st, layer = "counts", assay = "SCT")
hist(colSums2(corrected_counts_sct), main = "Corrected counts (SCT)")

Let’s plot the mean-variance relationship.

R

top15SCT    <- head(VariableFeatures(filter_st), 15)
plot_sct    <- VariableFeaturePlot(filter_st) + 
                 ggtitle("Variable Features - SCT")
plot_sct    <- LabelPoints(plot = plot_sct, points = top15SCT, repel = TRUE)
plot_sct

The geometric mean (mean of the log counts) is shown on the X-axis and the residual variance is on the Y-axis. Each point shows one gene. By default, Seurat selects a set of 3,000 variable genes which are colored in red. The variance is largely stable across a range of mean expression values.

Let’s again check that the two marker genes show the appropriate layer-restricted expression.

R

SpatialFeaturePlot(filter_st, slot="data", c("MOBP"))

R

SpatialFeaturePlot(filter_st, slot="data", c("PCP4"))

Challenge 1: Compare Mean-Variance Plots

Above, we created mean-variance plots for the SCT and Log normalizations. Which method does a better job of stabilizing the variance across genes? Turn to the person next to you and put the Log-normalized plot on one of your screens and the SCT transform plot on the other person’s screen. Discuss the mean-variance relationship in each plot and decide which one you think stabilizes the variance across genes better.

In the Log-normalization plot, the spread in variance is wider for genes with lower expression and narrower for genes with higher expression. Also, the overall range of variances is quite wide. In the SCT plot, the variance of the lowly expressed genes is smaller and it rises as mean expression increases. The variances become stable for genes with mean expression over about 100. For this reason, we prefer the SCT transform.

Comparing Normalizations

First, let’s look at the total counts per spot after normalization by each method. The normalized values are stored in the “data” Layer of each Assay.

R

layout(matrix(1:2, ncol = 1))
counts_log <- LayerData(lognorm_st, layer = "data")
hist(colSums2(counts_log), main = "Log-norm")

counts_sct <- LayerData(filter_st, layer = "data")
hist(colSums2(counts_sct), main = "SCT")

Notice that under log-normalization the total counts per spot still span several orders of magnitude. The SCT transform yields a much more uniform distribution of total counts, spanning only about a factor of three, from ~1000 to ~2700.
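To put numbers on these distributions, we can compare the ranges of the per-spot totals directly:

R

# Range of per-spot totals under each normalization
range(colSums2(counts_log))
range(colSums2(counts_sct))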

Next, we will compare the mean-variance plots between the two methods.

R

plot_lognorm

WARNING

Warning in scale_x_log10(): log-10 transformation introduced infinite values.

R

plot_sct

No One-Size-Fits-All Approach

Raw read counts provide essential insights into the absolute cell type densities within a sample, which are crucial for mapping cellular distribution. In contrast, normalized data adjusts for technical variations like sequencing depth and RNA capture efficiency, thus revealing the relative proportions of cell types and identifying specific tissue structures, such as epithelium or fibrosis. Saiselet et al. demonstrated that while normalized data effectively identify distinct morphologies, raw counts are vital for detecting areas with unusual cell-type concentrations, such as high epithelial regions expressing vimentin (VIM) (Saiselet, M., et al., Journal of Molecular Cell Biology, 2020). Hence, choosing the right normalization method depends on the specific characteristics of each dataset and the biological questions at hand. Researchers must understand the impact of each normalization strategy on both the biological and technical aspects of their data.

Key Points

  • Normalization is essential but must be selectively applied based on the unique characteristics of each dataset and the specific biological questions at hand.
  • Techniques like SCTransform and log scaling offer ways to balance technical correction with biological integrity.
  • Examining both raw and normalized data can provide comprehensive insights into the absolute and relative characteristics of cellular components in spatial transcriptomics.