Background
Spatial data used in our analyses typically include information about whether cells are positive or negative for specific features. In single-cell spatial datasets, these features often correspond to cell types (e.g., T cell, B cell, malignant cell) or compartments of origin (e.g., lymph node, tumor, stroma). While such data are often simplified to categorical labels, the original measurements are usually continuous—such as pixel intensity, protein abundance, or gene expression. These continuous variables hold important biological meaning and offer a more granular understanding of the system under study.
In this vignette, we demonstrate how to use scSpatialSIM to simulate continuous cellular attributes and apply a spatial autocorrelation method—specifically, Moran’s I from the spdep package. Spatial autocorrelation refers to the degree to which a variable is similar based on geographic or spatial proximity—meaning, whether nearby cells tend to have similar or dissimilar values.
Simulating Data
We will demonstrate the simulation of a protein (CD3) expressed in cells of tissue microarray (TMA) cores. The way that scSpatialSIM does this is by first simulating samples with cell types, or a marked point pattern, and then using the positive/negative assignment of cells to simulate a multimodal distribution of values. In this case, we simulate cells labeled as either CD3-positive or CD3-negative, and assign corresponding high or low levels of CD3 expression.
The same simulation process is shown in
vignette("Using with SpatialTIME")
, which demonstrates
creating a simulation object with a single cell type. In this vignette,
we go further by creating two simulation objects with different spatial
clustering levels. These serve as positive and negative controls—one
with intentional spatial clustering and one with a more random
(non-clustered) distribution—to compare the resulting spatial
autocorrelation metrics.
library(scSpatialSIM)
#> scSpatialSIM Version:
#> 0.1.3.5
#> _____ _ _ _ _____ _____ __ __
#> / ____| | | (_) | |/ ____|_ _| \/ |
#> ___ ___| (___ _ __ __ _| |_ _ __ _| | (___ | | | \ / |
#> / __|/ __|\___ \| '_ \ / _` | __| |/ _` | |\___ \ | | | |\/| |
#> \__ \ (__ ____) | |_) | (_| | |_| | (_| | |____) |_| |_| | | |
#> |___/\___|_____/| .__/ \__,_|\__|_|\__,_|_|_____/|_____|_| |_|
#> | |
#> |_|
#>
#> Fridley Lab Enjoy
set.seed(333)
#create simulation object for tight clusters
sim_object_tight = CreateSimulationObject(sims = 5, cell_types = 1) %>%
#simulate point pattern
GenerateSpatialPattern()
#> No `window` specified - defaulting to x (0, 10); y (0, 10)
#create simulation object for no clusters
sim_object_null = CreateSimulationObject(sims = 5, cell_types = 1) %>%
#simulate point pattern
GenerateSpatialPattern()
#> No `window` specified - defaulting to x (0, 10); y (0, 10)
When using GenerateCellPositivity()
, the
probs
parameter needs to be scaled back for the null case
due to all cells having the same probability of being positive resulting
in much greater abundances of positive cells at a given high abundance
theshold
sim_object_tight = GenerateCellPositivity(sim_object_tight,
k = 4,
sdmin = 1, sdmax = 3,
density_heatmap = F,
probs = c(0.0, 0.9))
#> Computing probability for Cell 1
sim_object_null = GenerateCellPositivity(sim_object_null,
probs = c(0.0, 0.2),
no_kernel = TRUE)
#> random cell assignments without kernels
#> Computing probability for Cell 1
Good quality control is to always look at the data. For a simulation study, thousands of samples would be simulated resulting in better distribution estimates for the abundance of T cells in both the clustered and null scenarios. In these simulation objects we have 5 simulated samples so distributions are rough. Even so, visualizing the abundance will give us an general idea.
library(ggplot2)
#calculate abundance for clustered samples
cluster_abundance = sapply(sim_object_tight@`Spatial Files`, function(x){
sum(x$`Cell 1 Assignment` == "Positive")/nrow(x)
})
#calculate abundance for null/negative control samples
null_abundance = sapply(sim_object_null@`Spatial Files`, function(x){
sum(x$`Cell 1 Assignment` == "Positive")/nrow(x)
})
#create histogram of abundances
data.frame(abundance = c(cluster_abundance, null_abundance),
simulation = c(rep("clustered", 5), rep("null", 5))) %>%
ggplot() +
geom_histogram(aes(x = abundance, fill = simulation))
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
For the 5 samples in the null/negative control and clustered
scenarios, they are pretty similar as an average, maybe slightly higher
T cell abundance in the clustered samples. If running a larger study,
tuning the abundance to be more similar is important. My recommendation
is running the clustered scenarios at an abundance and cluster size that
is satisfactory (changing sdmin
, sdmax
, and
probs
), then setting the high probs
for the
no_kernel = TRUE
scenario at the mean/median of the
clustered abundance. Due to the implementation of
no_kernel = TRUE
, the cell abundance of each sample in the
scenario will be similar (very little variation). If variability is
desired, abundance can be assigned outside scSpatialSIM
by using rnorm
with the mean and standard deviation of the
cell abundance from clustering scenarios, then applying that to each
null scenario sample with lapply
and rbinom
to
slightly adjust the prob
of each.
With each scenarios having T cells simulated, we can extract the
samples from the simulation objects using
CreateSpatialList()
. This creates either a
list
or a data.frame
depending on the
single_df
- here we will keep it FALSE
.
Because we want our T cells in both simulation scenarios to have the
same CD3 protein distribution we will combine then into a single list,
give the list elements appropriate names. Then we will use
GenerateDistributions()
to produce our protein
abundance.
#extract simulated samples to make lists
cluster_list = CreateSpatialList(sim_object_tight, single_df = FALSE)
null_list = CreateSpatialList(sim_object_null, single_df = FALSE)
#combine lists to make a single list
spatial_list = c(cluster_list, null_list)
names(spatial_list) = c(paste0("Clustered Sample ", 1:5),
paste0("Null Sample ", 1:5))
For demonstration purposes, we will use arbitrary values for positive and negative protein abundances. This can also be derived from real world data such as multiplex immunofluorescence for protein or spatial transcriptomics for gene expression.
spat_data_distribution = GenerateDistributions(spatial_list,
positive_mean = 15,
negative_mean = 5,
positive_sd = 3,
negative_sd = 1)
Calculating Moran’s I
Spatial autocorrelation is a measure similar to the common Pearson correlation, but extended to look at spatial relationships. How I like to think of Moran’s I is how similar a value (in our case the expression of the protein CD3) in an anchor cell is to the aggregated value of it’s neighbors. So “how similar is CD3 expression in cells to the CD3 expression in surrounding cells?” Similar to Pearson correlation, a row-standardized weight calculation for Moran’s I will have a range from -1 to 1, where greater positive values indicate strong similarity in a value with the cells/points neighboring cells/points values, negative indicates strong dissimilarity in a value with neighboring cells values, and 0 indicates no spatial relationship.
To assess Moran’s I on in our simulated samples, we will use the
spdep package. There are a couple steps to get to using
the spdep::moran.test()
function like calculating neighbors
and creating a weight list. The number of neighbors used can also
influence results of Moran’s I, but that is beyond the scope of this
vignette Here, we will use each cells 10 nearest neighbors to assess the
autocorrelation. Row-standardized weight means that the weights for the
neighbors sum to 1. What we expect is that the CD3 expression
will show greater spatial autocorrelation in samples where the cells
were simulated as clustered than in the samples that were our null or no
clustering cases.
#identify neighbors
#create weight list
#calculate moran's i
library(dplyr)
library(spdep)
#> Loading required package: spData
#> To access larger datasets in this package, install the spDataLarge
#> package with: `install.packages('spDataLarge',
#> repos='https://nowosad.github.io/drat/', type='source')`
#> Loading required package: sf
#> Linking to GEOS 3.13.0, GDAL 3.8.5, PROJ 9.5.1; sf_use_s2() is TRUE
results = lapply(spat_data_distribution, function(dat){
#convert data frame to an sf object compatable with other spdep functions
sf_dat = st_as_sf(dat, coords = c("x", "y"))
#calculate the 10 nearest neighbors
knn = knearneigh(sf_dat, k = 10)
#convert knn to neighbor list
knn_nb = knn2nb(knn)
#convert neighbor list to weight list
knn_nb_listw = nb2listw(knn_nb,
style = "W")
#calculate moran's I on the simulated protein expression "Cell 1 Var"
res = moran.test(sf_dat$`Cell 1 Var`,
listw = knn_nb_listw)
#convert results to a data frame
data.frame(as.list(res$estimate), check.names = FALSE)
}) %>%
#convert list results to a data frame
bind_rows(.id = "Sample ID")
#look at the results
head(results)
#> Sample ID Moran I statistic Expectation Variance
#> 1 Clustered Sample 1 0.3758399 -0.0004009623 7.410341e-05
#> 2 Clustered Sample 2 0.3402620 -0.0004083299 7.533573e-05
#> 3 Clustered Sample 3 0.3052300 -0.0004001601 7.395509e-05
#> 4 Clustered Sample 4 0.3792493 -0.0003979308 7.338008e-05
#> 5 Clustered Sample 5 0.2748225 -0.0004024145 7.445675e-05
#> 6 Null Sample 1 -0.0101181 -0.0003974563 7.353824e-05
The results in data frame format, thanks to dplyrs
bind_rows()
function, allow us to then visualize the
Moran’s I autocorrelation. We will use ggplot and plot boxplots to show
the distribution of calculated Moran’s I. For the null cases, we expect
values around 0 which indicate no spatial relationship to the simulated
protein expression.
results2 = results %>%
mutate(Group = rep(c("Clustered", "Null"), each = 5))
results2 %>%
ggplot() +
geom_boxplot(aes(x = Group, y = `Moran I statistic`)) +
theme_classic()
Conclusion
In this vignette, we demonstrated the simulation of spatially-resolved continuous cell attributes and demonstrated how to assess spatial structure using a well-established spatial statistic, Moran’s I. We used scSpatialSIM to simulate protein expression (CD3) in TMA cores by assigning continuous values to cells based on their classification and positive or negative. These simulations included two scenarios: one with spatial clustering (positive control) and one without (negative control) allowing us to explore how spatial patterns influence downstream analyses. After simulating the data, we applied Moran’s I to quantify spatial autocorrelation - that is, the degree to which similar CD3 protein expression values cluster in space.
Spatial patterns in molecular data can reveal critical biological insights that are lost when reducing data to binary or categorical labels. Simulating and analyzing continuous attributes in a spatial context allows for better benchmarking of methods and more nuanced interpretation of cell behavior within tissues. The approach outlined here is broadly applicable to real datasets and can be extended using a variety of spatial analysis tools.