vignettes/spatialexperiment.Rmd
spatialexperiment.Rmd
We’ll be using data from the VectraPolarisData Bioconductor package (Wrobel, J and Ghosh, T, 2022). This provides two datasets from the Vectra Polaris imaging system in the form of SpatialExperiment objects. See this vignette for more details.
These are the packages needed to run this tutorial. These will only be installed if they are not found in the installed packages.
if (!("devtools" %in% installed.packages())) {
install.packages("devtools")
}
if (!("spatialTIME" %in% installed.packages())) {
devtools::install_github("FridleyLab/spatialTIME", ref = "feature-alex")
}
if (!("BiocManager" %in% installed.packages())) {
install.packages("BiocManager")
}
if (!("VectraPolarisData" %in% installed.packages())) {
BiocManager::install("VectraPolarisData")
}
if (!("survival" %in% installed.packages())) {
BiocManager::install("survival")
}
if (!("lmtest" %in% installed.packages())) {
install.packages("lmtest")
}
We’ll need to load these packages. This just means these packages will be made readily available to us.
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
## spatialTIME version:
## 1.4.0
## If using for publication, please cite our manuscript:
## https://doi.org/10.1093/bioinformatics/btab757
In this example we will be using data from a human ovarian cancer study (Steinhart et al, 2021).
# Ovarian cancer example
ovarian <- VectraPolarisData::HumanOvarianCancerVP()
## Warning: package 'SingleCellExperiment' was built under R version 4.4.2
## Warning: package 'MatrixGenerics' was built under R version 4.4.2
## Warning: package 'IRanges' was built under R version 4.4.2
## Warning: package 'GenomeInfoDb' was built under R version 4.4.2
## see ?VectraPolarisData and browseVignettes('VectraPolarisData') for documentation
## loading from cache
For more details on the SpatialExperiment
class see this
Bioconductor
Vignette and preprint.
We can inspect what this type of object looks like typically by printing it out below:
# Use this code chunk to explore the ovarian object.
The colData
entry has a lot of our marker information as
well as other annotations. Let’s specifically look at the
sample_id
and some of the markers we will be using.
This is how we look at the colData in this set.
colData(ovarian)[1:5, 1:5]
## DataFrame with 5 rows and 5 columns
## cell_id tissue_category slide_id nucleus_area_square_microns
## <numeric> <character> <character> <numeric>
## 1 1 Stroma 030120 P9HuP6 TMA 1-B 9.4
## 2 2 Tumor 030120 P9HuP6 TMA 1-B 11.1
## 3 3 Stroma 030120 P9HuP6 TMA 1-B 8.9
## 4 4 Stroma 030120 P9HuP6 TMA 1-B 9.8
## 5 5 Stroma 030120 P9HuP6 TMA 1-B 12.1
## nucleus_area_percent
## <numeric>
## 1 0
## 2 0
## 3 0
## 4 0
## 5 0
We can look at the sample IDs by doing this:
# this is how we can extract sample IDs
head(ovarian$sample_id)
## [1] "030120 P9HuP6 TMA 1-B_Core[1,1,D]_[12124,35348].im3"
## [2] "030120 P9HuP6 TMA 1-B_Core[1,1,D]_[12124,35348].im3"
## [3] "030120 P9HuP6 TMA 1-B_Core[1,1,D]_[12124,35348].im3"
## [4] "030120 P9HuP6 TMA 1-B_Core[1,1,D]_[12124,35348].im3"
## [5] "030120 P9HuP6 TMA 1-B_Core[1,1,D]_[12124,35348].im3"
## [6] "030120 P9HuP6 TMA 1-B_Core[1,1,D]_[12124,35348].im3"
There are also marker columns we will need for plotting. They all
start with phenotype_
head(ovarian$phenotype_cd19)
## [1] "" "" "" "CD19-" "CD19-" "CD19-"
Now to use these data with the spatialTIME package for analyses we
will need to convert it to the MIF (multiplex immunofluorescence)
format. For SpatialExperiment objects specifically we can do this using
the spatial_exp_to_mif()
function. For all other types of
data you can use the create_mif()
function.
There are three required arguments in this function to create our mif
object. 1) The SpatialExperiment
object we are using. 2)
The name of the column that links the colData with the spatialCoord
data. In this case it is called “sample_id”. 3) The names of the columns
that contain any markers we’d like to use for the analysis or plotting
that are included in the data as column names.
# These are the genes we will be using as markers that exist in these data.
markers <- c("phenotype_cd68", "phenotype_cd3", "phenotype_cd8")
These objects come with so many columns of data that spatialTIME
doesn’t need, so many will be dropped from the object. But if there is a
particular column you will need for labeling purposes, you can name it
in the cols_to_keep
variable. But using this argument is
optional rather than required.
Now we can pass all of this into the spatial_exp_to_mif
function to create our mif
object for use with
spatialTIME
.
ova_mif <- spatial_exp_to_mif(
spatial_exp = ovarian,
patient_id = "sample_id",
markers = markers,
cols_to_keep = "tissue_category"
)
Now we can use our mif
object to make a set of
immunofluorescence plots, one for each sample in the set.
ova_mif <- plot_immunoflo(ova_mif,
plot_title = "sample_id",
mnames = markers,
xloc = "x",
yloc = "y"
)
We can view individual plots by pulling them out using the indexing
notation. Where [[1]]
would be the first sample in the list
and [[3]]
, the third sample in the list, etc.
ova_mif$derived$spatial_plots[[3]] +
ggplot2::coord_equal()
Now let’s write our own code to take a look at a different sample using the above code as a reference.
ova_mif$derived$spatial_plots[[1]] +
ggplot2::coord_equal()
ova_mif$derived$spatial_plots[[2]] +
ggplot2::coord_equal()
We could build on to this plot using ggplot2
and
tidyverse code.
We can subset our mif object to a certain compartment. Note this is
why when we made our object we used
cols_to_keep = "tissue_category"
because we are using that
labeling now.
# filter the mif object to be only the tumor compartment
ova_mif_tumor <- subset_mif(ova_mif,
classifier = "tissue_category",
level = "Tumor",
markers = markers
)
Now that we have subset to tumor only, let’s check a spatial data
frame for what levels are in tissue_category
# check a spatial data frame for levels in tissue_category
ova_mif_tumor$spatial[[1]]$tissue_category %>% unique()
## [1] "Tumor"
Success! We only have cells in a tumor compartment. Summary of the samples has also been updated to reflect the new compartment.
names(ova_mif_tumor$sample)
## [1] "patient_id" "sample_id"
## [3] "Tumor: phenotype_cd68" "Tumor: phenotype_cd3"
## [5] "Tumor: phenotype_cd8" "Tumor: Total Cells"
## [7] "Tumor: % phenotype_cd68" "Tumor: % phenotype_cd3"
## [9] "Tumor: % phenotype_cd8"
END OF OFFICIAL INSTRUCTION
Now we can use the Ripley’s K stat to quantify our spatial clustering.
# takes time to process all samples
ova_mif_tumor <- ripleys_k(ova_mif_tumor,
mnames = markers,
r_range = 0:100,
edge_correction = "translation", # translational edge correction
permute = FALSE, # allows for exact estimation of complete spatial ransomness, or CSR
workers = 1, # depending on number of spatial samples and ram, may cause crashes if overallocated
big = 50000, # really only needed (keep larger than largest spatial data) when computer has ram limitations otherwise would avoid, causes non-perfect estimation of CSR due to rounding/edge corrections
xloc = "x",
yloc = "y",
overwrite = TRUE
)
## Warning: 1 point was rejected as lying outside the specified window
## Warning: data contain duplicated points
## Warning: 1 point was rejected as lying outside the specified window
## Warning: data contain duplicated points
Now that we’ve run Ripley’s K, let’s plot phenotype_cd3
to view t-cell clustering
ova_mif_tumor$derived$univariate_Count %>%
filter(Marker == "phenotype_cd3") %>%
ggplot() +
geom_line(aes(x = r, y = `Degree of Clustering Exact`,
group = sample_id),
alpha = 0.5
)
## Warning: Removed 202 rows containing missing values or values outside the scale range
## (`geom_line()`).
We can see that most samples have a positive ‘degree of clustering’ meaning that the T cells in the most samples are clustering closer together than expected if they were randomly distributed around the samples. We also see that there are 202 rows that have missing values, i.e. the spatial files didn’t have enough positive cells for a marker.
Alternatively we can use Nearest Neighbor G to examine clustering which identifies the probability of encountering a neighbor cell.
ova_mif_tumor <- NN_G(
mif = ova_mif_tumor,
mnames = markers,
r_range = 0:100,
num_permutations = 10, # NN G(r) requires permutations to estimate CSR
edge_correction = "rs", # reduced sampling edge correction
keep_perm_dis = FALSE, # whether to keep results from each permutation
workers = 1,
overwrite = TRUE,
xloc = "x",
yloc = "y"
)
## Warning: 1 point was rejected as lying outside the specified window
## Warning: data contain duplicated points
## Warning: 1 point was rejected as lying outside the specified window
## Warning: data contain duplicated points
Now let’s plot phenotype_cd3
again but with NNG results
to view t-cell clustering.
ova_mif_tumor$derived$univariate_NN %>%
filter(Marker == "phenotype_cd3") %>%
ggplot() +
geom_line(aes(x = r, y = `Degree of Clustering Permutation`, group = sample_id),
alpha = 0.5
)
## Warning: Removed 242 rows containing missing values or values outside the scale range
## (`geom_line()`).
We can see NN G(r) is much more ‘curvy’ than Ripley’s K. Values above 0 suggest clustering more than expected if the T cells were random. We also can vaguely see that there is a bit of a hill between 0 and 25, indicating tight clusters of CD3, then outside of that they either fall to 0 or maintain their degree of clustering. NN G(r) (observed) tells the proportion of T cells with a neighbor at or below a measured radius.
ova_mif_tumor$derived$univariate_NN %>%
filter(Marker == "phenotype_cd3") %>%
ggplot() +
geom_line(aes(x = r, y = `Observed G`, group = sample_id),
alpha = 0.5
)
## Warning: Removed 242 rows containing missing values or values outside the scale range
## (`geom_line()`).
By plotting the raw NN G(r) value, we see that it grows and approaches 1, at which point all T cells in the sample have another T cell within that radius. Or 75% of T cells have another T cell within whatever radius. The reason we take the difference between observed and permuted (Degree of Clustering Permutation) is to account for sample-specific technical artifacts like holes or even necrotic regions that violate what Ripley’s K and NN G assume of the sample. This difference is then what we compare between samples.
Unless diving into complicated statistics with functional data analysis (analysis of the entire shape of the Ripley’s K or NN G curve), researchers have to pick a radius that they would like to associate with some feature of the samples like patient survival, stage, treatment response, or even disease subtypes.
The best way to do this is to plot all curves for a marker and find a radius that’s small enough to be biologically meaningful while large enough to capture variance in the data.
Let’s plot phenotype_cd3 to view t-cell clustering again to take a look.
ova_mif_tumor$derived$univariate_Count %>%
filter(Marker == "phenotype_cd3") %>%
ggplot() +
geom_line(aes(x = r, y = `Degree of Clustering Exact`, group = sample_id),
alpha = 0.5
)
## Warning: Removed 202 rows containing missing values or values outside the scale range
## (`geom_line()`).
For this data, we might pick a radius of 50 pixels because after that most values don’t drastically change, so we can subset and move forward.
analysis_dat <- ova_mif_tumor$derived$univariate_Count %>%
filter(Marker == "phenotype_cd3", r == 50) %>% # filter ripley's k output to CD3 cells and radius of 50
full_join(
ova_mif_tumor$clinical, # merge with clinical (on left side)
., by = c("patient_id" = "sample_id")) # ripleys k on right
head(analysis_dat)
## patient_id tma
## 1 030120 P9HuP6 TMA 1-B_Core[1,1,D]_[12124,35348].im3 1
## 2 030120 P9HuP6 TMA 1-B_Core[1,1,E]_[14220,35348].im3 1
## 3 030120 P9HuP6 TMA 1-B_Core[1,1,F]_[16379,35348].im3 1
## 4 030120 P9HuP6 TMA 1-B_Core[1,1,G]_[18601,35348].im3 1
## 5 030120 P9HuP6 TMA 1-B_Core[1,1,H]_[20633,35348].im3 1
## 6 030120 P9HuP6 TMA 1-B_Core[1,1,B]_[7934,35665].im3 1
## diagnosis primary
## 1 Recurrent primary peritoneal carcinoma 0
## 2 Papillary serous adenocarcinoma of ovarian surface 1
## 3 serous adenocarcinoma 1
## 4 Primary peritoneal carcinoma 1
## 5 Post-Treatment High grade serous carcinoma of the peritoneum 1
## 6 High grade serous carcinoma 1
## recurrent treatment_effect stage grade survival_time death BRCA_mutation
## 1 1 0 3 3 68 1 1
## 2 0 0 3 3 37 1 0
## 3 0 0 3 3 76 0 0
## 4 0 0 3 3 22 1 NA
## 5 0 1 4 3 70 0 0
## 6 0 0 3 2 46 0 0
## age_at_diagnosis time_to_recurrence parpi_inhibitor debulking iter
## 1 47 25 Y Optimal Estimater
## 2 53 20 Y Optimal Estimater
## 3 53 No_Recurrence Y Interval Estimater
## 4 54 9 N Optimal Estimater
## 5 50 31 N Interval Estimater
## 6 69 39 Y Optimal Estimater
## Marker r Theoretical CSR Observed K Permuted CSR Exact CSR
## 1 phenotype_cd3 50 7853.982 29017.89 NA 10651.873
## 2 phenotype_cd3 50 7853.982 52929.95 NA 10014.605
## 3 phenotype_cd3 50 7853.982 13699.08 NA 8085.531
## 4 phenotype_cd3 50 7853.982 137283.93 NA 12436.604
## 5 phenotype_cd3 50 7853.982 20728.65 NA 11713.179
## 6 phenotype_cd3 50 7853.982 135060.12 NA 11249.827
## Degree of Clustering Permutation Degree of Clustering Theoretical
## 1 NA 21163.906
## 2 NA 45075.972
## 3 NA 5845.099
## 4 NA 129429.946
## 5 NA 12874.672
## 6 NA 127206.141
## Degree of Clustering Exact Run
## 1 18366.015 1
## 2 42915.348 1
## 3 5613.550 1
## 4 124847.324 1
## 5 9015.475 1
## 6 123810.296 1
For simplicity we can dichotomize the clustering values.
analysis_dat <- analysis_dat %>%
mutate(
cluster_group = ifelse(`Degree of Clustering Exact` >= median(`Degree of Clustering Exact`, na.rm = TRUE), # remembering that 2 samples didn't have enough T cells so remove NA
"High Clustering",
"Low Clustering"
), # if above the median, high, else low
cluster_group = factor(cluster_group, levels = c("Low Clustering", "High Clustering"))
) %>% # convert to factor to control reference
filter(!is.na(cluster_group)) # must remove extra rows that do not have clustering group
Create a formula using column names in
analysis_data
.
form1 <- formula(Surv(survival_time, as.numeric(death)) ~ age_at_diagnosis + stage + grade + cluster_group)
Now let’s fit the model using Cox proportional hazards regression
model using the survival
package.
# fit the model with the data
model1 <- coxph(form1, data = analysis_dat)
We can take a look at the summary of this fitting.
# view the fit
summary(model1)
## Call:
## coxph(formula = form1, data = analysis_dat)
##
## n= 126, number of events= 79
##
## coef exp(coef) se(coef) z Pr(>|z|)
## age_at_diagnosis 0.01334 1.01342 0.01211 1.101 0.2709
## stage2 -0.09165 0.91242 1.23729 -0.074 0.9410
## stage3 1.27547 3.58038 1.03245 1.235 0.2167
## stage4 2.04767 7.74980 1.11641 1.834 0.0666 .
## grade -0.37427 0.68779 0.48848 -0.766 0.4436
## cluster_groupHigh Clustering 0.43714 1.54828 0.24070 1.816 0.0694 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## age_at_diagnosis 1.0134 0.9868 0.98965 1.038
## stage2 0.9124 1.0960 0.08073 10.313
## stage3 3.5804 0.2793 0.47327 27.087
## stage4 7.7498 0.1290 0.86897 69.116
## grade 0.6878 1.4539 0.26404 1.792
## cluster_groupHigh Clustering 1.5483 0.6459 0.96597 2.482
##
## Concordance= 0.61 (se = 0.037 )
## Likelihood ratio test= 17.41 on 6 df, p=0.008
## Wald test = 14.9 on 6 df, p=0.02
## Score (logrank) test = 16.07 on 6 df, p=0.01
We see that stage 4 has moderately worse survival than stage 1 (reference) but not significant based on exp(coef) hazard ratio (exp[-coef]) = 7.75, 95% confidence interval (0.869, 69.116) We also see that high clustering has a moderately worse survival than the low clustering hazard ratio (exp[-coef]) = 1.55, 95% confidence interval (0.966, 2.482)
We can also see whether or not adding the clustering to the model
formula significantly improves the model fit to the data. This must be
done with a nested formula, i.e. removing or adding a column to the
formula to assess impact of the addition or removal. Here we just remove
the clustering_group
and fit another model.
Again, create a formula using column names in
analysis_data
, just remove the
clustering_group
.
form2 <- formula(Surv(survival_time, as.numeric(death)) ~ age_at_diagnosis + stage + grade)
Now we can fit the model with the data in the same way as we did previously.
# fit the model with the data
model2 <- coxph(form2, data = analysis_dat)
To identify the significance of including clustering, we can perform a likelihood ratio test comparing the two models.
lrt <- lmtest::lrtest(
model2,
model1
)
# Let's print it out
lrt
## Likelihood ratio test
##
## Model 1: Surv(survival_time, as.numeric(death)) ~ age_at_diagnosis + stage +
## grade
## Model 2: Surv(survival_time, as.numeric(death)) ~ age_at_diagnosis + stage +
## grade + cluster_group
## #Df LogLik Df Chisq Pr(>Chisq)
## 1 5 -314.67
## 2 6 -312.99 1 3.3613 0.06674 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Based on this output, we can see probability between these two models is 0.07, so it doesn’t appear that adding in the spatial information significantly improves the survival model fit for the data.
## R version 4.4.1 (2024-06-14)
## Platform: aarch64-apple-darwin20
## Running under: macOS Sonoma 14.7.2
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.0
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## time zone: America/New_York
## tzcode source: internal
##
## attached base packages:
## [1] stats4 stats graphics grDevices datasets utils methods
## [8] base
##
## other attached packages:
## [1] VectraPolarisData_1.10.0 SpatialExperiment_1.16.0
## [3] SingleCellExperiment_1.28.1 SummarizedExperiment_1.36.0
## [5] Biobase_2.66.0 GenomicRanges_1.58.0
## [7] GenomeInfoDb_1.42.1 IRanges_2.40.1
## [9] S4Vectors_0.44.0 MatrixGenerics_1.18.1
## [11] matrixStats_1.5.0 ExperimentHub_2.14.0
## [13] AnnotationHub_3.14.0 BiocFileCache_2.14.0
## [15] dbplyr_2.5.0 BiocGenerics_0.52.0
## [17] survival_3.7-0 spatialTIME_1.4.0
## [19] dplyr_1.1.4 ggplot2_3.5.1
##
## loaded via a namespace (and not attached):
## [1] DBI_1.2.3 deldir_2.0-4 rlang_1.1.5
## [4] magrittr_2.0.3 compiler_4.4.1 RSQLite_2.3.9
## [7] spatstat.geom_3.3-5 png_0.1-8 systemfonts_1.2.1
## [10] pbmcapply_1.5.1 vctrs_0.6.5 pkgconfig_2.0.3
## [13] crayon_1.5.3 fastmap_1.2.0 magick_2.8.5
## [16] XVector_0.46.0 labeling_0.4.3 rmarkdown_2.29
## [19] UCSC.utils_1.2.0 ragg_1.3.3 purrr_1.0.2
## [22] bit_4.5.0.1 xfun_0.50 zlibbioc_1.52.0
## [25] cachem_1.1.0 jsonlite_1.8.9 goftest_1.2-3
## [28] blob_1.2.4 DelayedArray_0.32.0 spatstat.utils_3.1-2
## [31] parallel_4.4.1 R6_2.5.1 bslib_0.8.0
## [34] RColorBrewer_1.1-3 spatstat.data_3.1-4 spatstat.univar_3.1-1
## [37] lmtest_0.9-40 jquerylib_0.1.4 Rcpp_1.0.14
## [40] knitr_1.49 tensor_1.5 zoo_1.8-12
## [43] Matrix_1.7-2 splines_4.4.1 tidyselect_1.2.1
## [46] rstudioapi_0.17.1 abind_1.4-8 yaml_2.3.10
## [49] spatstat.random_3.3-2 spatstat.explore_3.3-4 curl_6.2.0
## [52] lattice_0.22-6 tibble_3.2.1 withr_3.0.2
## [55] KEGGREST_1.46.0 evaluate_1.0.3 desc_1.4.3
## [58] polyclip_1.10-7 Biostrings_2.74.1 pillar_1.10.1
## [61] BiocManager_1.30.25 filelock_1.0.3 renv_1.0.11
## [64] generics_0.1.3 BiocVersion_3.20.0 munsell_0.5.1
## [67] scales_1.3.0 glue_1.8.0 tools_4.4.1
## [70] fs_1.6.5 grid_4.4.1 tidyr_1.3.1
## [73] AnnotationDbi_1.68.0 colorspace_2.1-1 nlme_3.1-166
## [76] GenomeInfoDbData_1.2.13 cli_3.6.3 spatstat.sparse_3.1-0
## [79] rappdirs_0.3.3 textshaping_1.0.0 S4Arrays_1.6.0
## [82] gtable_0.3.6 sass_0.4.9 digest_0.6.37
## [85] SparseArray_1.6.1 rjson_0.2.23 htmlwidgets_1.6.4
## [88] farver_2.1.2 memoise_2.0.1 htmltools_0.5.8.1
## [91] pkgdown_2.1.1 lifecycle_1.0.4 httr_1.4.7
## [94] mime_0.12 bit64_4.6.0-1