Loading Personalis data
PersonalisIO.Rmd
This vignette demonstrates how to use the {PersonalisIO} package to load Personalis data into a {MultiAssayExperiment} object.
suppressPackageStartupMessages({
library(PersonalisIO)
library(purrr)
library(conflicted)
conflicts_prefer(base::setdiff)
})
#> [conflicted] Will prefer base::setdiff over any
#> other package.
Personalis ship their data in a (more or less) standardized directory structure within one folder for each patient. The PersonalisIO loader functions traverse this directory structure and extract the different modalities. To load data from multiple patients into a single object, we need a list of patient directories:
path <- Sys.getenv("PERSONALIS_PATH")
sample_list <- list.files(path, full.names = TRUE) |> keep(dir.exists)
sample_list
#> [1] "/tmp/personalis_example_data/sample1"
#> [2] "/tmp/personalis_example_data/sample2"
#> [3] "/tmp/personalis_example_data/sample3"
#> [4] "/tmp/personalis_example_data/sample4"
#> [5] "/tmp/personalis_example_data/sample5"
Reading single modalities into a SummarizedExperiment
We can now read a single modality into a {SummarizedExperiment}. For instance, to read gene expression data only.
personalis_gex <- PersonalisIO::read_personalis_gene_expression(sample_list)
By inspecting the object, we can see, this imported gene expression
data from 5 patients and ~23k genes. There are three different assays:
counts
, tpm
and normCounts
:
personalis_gex
#> class: SummarizedExperiment
#> dim: 22956 5
#> metadata(0):
#> assays(3): counts tpm normCounts
#> rownames(22956): 1 503538 ... 23140 26009
#> rowData names(1): Gene Symbol
#> colnames(5): sample1 sample2 sample3 sample4 sample5
#> colData names(0):
The following functions to load individual modalities exist, and they all return a {SummarizedExperiment}:
read_personalis_gene_expression
read_personalis_small_variant_reports
read_personalis_cnv_reports
read_personalis_hla_reports
read_personalis_tcr_reports
read_personalis_msi_reports
Reading multiple modalities into a MultiAssayExperiment
To import all modalities at once, we can use the
read_personalis
function. Not all modalities may be present
for all samples. Some may not be present in any sample of the dataset.
This is shown as warnings during the import process. It is your
repsonsibility to check the warnings carefully to make sure that all
data that you expect to be there is successfully imported.
mae <- PersonalisIO::read_personalis(sample_list)
#> >>> Reading gene expression...
#> >>> Reading DNA small variant data...
#> Reading DNA small variant data (somatic)...
#> Reading DNA small variant data (tumor)...
#> Reading DNA small variant data (normal)...
#> ⚠️ No gene DNA small variant (normal) found for sample1. Sample ignored.
#> ⚠️ No gene DNA small variant (normal) found for sample2. Sample ignored.
#> ⚠️ No gene DNA small variant (normal) found for sample3. Sample ignored.
#> ⚠️ No gene DNA small variant (normal) found for sample4. Sample ignored.
#> ⚠️ No gene DNA small variant (normal) found for sample5. Sample ignored.
#> Warning in warn_missing_samples(sample_list, sample_paths, description): 5 DNA
#> small variant (normal) sample(s) were not found. Please check the log for
#> details.
#> >>> Reading RNA small variant data...
#> Reading RNA small variant data (somatic)...
#> Reading RNA small variant data (tumor)...
#> Reading RNA small variant data (normal)...
#> ⚠️ No gene RNA small variant (normal) found for sample1. Sample ignored.
#> ⚠️ No gene RNA small variant (normal) found for sample2. Sample ignored.
#> ⚠️ No gene RNA small variant (normal) found for sample3. Sample ignored.
#> ⚠️ No gene RNA small variant (normal) found for sample4. Sample ignored.
#> ⚠️ No gene RNA small variant (normal) found for sample5. Sample ignored.
#> Warning in warn_missing_samples(sample_list, sample_paths, description): 5 RNA
#> small variant (normal) sample(s) were not found. Please check the log for
#> details.
#> >>> Reading microsatellite stability (MSI) data...
#> >>> Reading copy number variation (CNV) data...
#> >>> Reading HLA types...
#> ⚠️ No gene HLA found for sample1. Sample ignored.
#> ⚠️ No gene HLA found for sample2. Sample ignored.
#> ⚠️ No gene HLA found for sample3. Sample ignored.
#> ⚠️ No gene HLA found for sample4. Sample ignored.
#> ⚠️ No gene HLA found for sample5. Sample ignored.
#> Warning in warn_missing_samples(sample_list, sample_paths, description): 5 HLA
#> sample(s) were not found. Please check the log for details.
#> >>> Reading TCR clone reports...
Here’s an overview of the {MultiAssayExperiment} object with all available modalities:
mae
#> A MultiAssayExperiment object of 8 listed
#> experiments with user-defined names and respective classes.
#> Containing an ExperimentList class object of length 8:
#> [1] gene_expression: SummarizedExperiment with 22956 rows and 5 columns
#> [2] dna_small_variants_somatic: SummarizedExperiment with 6097 rows and 5 columns
#> [3] dna_small_variants_tumor: SummarizedExperiment with 376654 rows and 5 columns
#> [4] rna_small_variants_somatic: SummarizedExperiment with 5302 rows and 5 columns
#> [5] rna_small_variants_tumor: SummarizedExperiment with 201347 rows and 5 columns
#> [6] msi: SummarizedExperiment with 5 rows and 5 columns
#> [7] cnv: SummarizedExperiment with 145 rows and 5 columns
#> [8] tcr: SummarizedExperiment with 2 rows and 5 columns
#> Functionality:
#> experiments() - obtain the ExperimentList instance
#> colData() - the primary/phenotype DataFrame
#> sampleMap() - the sample coordination DataFrame
#> `$`, `[`, `[[` - extract colData columns, subset, or experiment
#> *Format() - convert into a long or wide DataFrame
#> assays() - convert ExperimentList to a SimpleList of matrices
#> exportClass() - save data to flat files
We can retrieve an individual modality by subsetting with double brackets:
mae[["gene_expression"]]
#> class: SummarizedExperiment
#> dim: 22956 5
#> metadata(0):
#> assays(3): counts tpm normCounts
#> rownames(22956): 1 503538 ... 23140 26009
#> rowData names(1): Gene Symbol
#> colnames(5): sample1 sample2 sample3 sample4 sample5
#> colData names(0):
For more information about advanced subsetting, please refer to the MultiAssayExperiment vignette.
Summary statistics
For some modalities, patient-level summary statistics are provided by
Personalis in their HTML reports. These are gathered automatically by
the respective reader function, and stored in colData
. For
instance, tumor mutational burden is available from the
dna_small_variants_somatic
modality:
colnames(colData(mae[["dna_small_variants_somatic"]]))
#> [1] "Somatic variants (SNVs)"
#> [2] "Somatic variants (Indels)"
#> [3] "Somatic variants (Total)"
#> [4] "Somatic variants per Mb (SNVs)"
#> [5] "Somatic variants per Mb (Indels)"
#> [6] "Somatic variants per Mb (Total)"
#> [7] "Ti/Tv ratio (SNVs)"
#> [8] "Ti/Tv ratio (Indels)"
#> [9] "Ti/Tv ratio (Total)"
#> [10] "Non-synonymous Somatic Variants (SNVs)"
#> [11] "Non-synonymous Somatic Variants (Indels)"
#> [12] "Non-synonymous Somatic Variants (Total)"
#> [13] "Non-synonymous Somatic Variants per Mb (SNVs)"
#> [14] "Non-synonymous Somatic Variants per Mb (Indels)"
#> [15] "Non-synonymous Somatic Variants per Mb (Total)"
Let’s take a closer look at the different mutation rates
colData(mae[["dna_small_variants_somatic"]]) |>
as_tibble(rownames = "sample") |>
select(sample, starts_with("Somatic.variants.per.Mb")) |>
head() |>
knitr::kable()
sample | Somatic.variants.per.Mb..SNVs. | Somatic.variants.per.Mb..Indels. | Somatic.variants.per.Mb..Total. |
---|---|---|---|
sample1 | 18.82 | 0.70 | 19.52 |
sample2 | 13.14 | 0.39 | 13.53 |
sample3 | 12.95 | 0.58 | 13.53 |
sample4 | 10.68 | 0.48 | 11.16 |
sample5 | 26.62 | 1.05 | 27.67 |
Working with BumpyMatrices
Not all Personalis data is strictly rectangular like a gene expression matrix and nicely fits into a SummarizedExperiment. For instance, TCR clones or small variant reports are a long-form data frame with many entries for each patient. The {BumpyMatrix} package comes to the rescue. It allows to fit long-form data frames into a {SummarizedExperiment} by building a matrix that is rectangular in the first two dimensions, but is ragged (i.e. arbitrary length) in the third dimension.
In practice, bumpy matrices are a bit cumbersome to work with. They
are to be seen as an intermediate container that allows to store
particular kinds of data in the MultiAssayExperiment object and (after
subsetting to the patients of interest) can be converted back to a data
frame using BumpyMatrix::unsplitAsDataFrame
.
Example: TCR data stored as BumpyMatrix
The tcr
assay of our mae
object is such a
BumpyMatrix. Here, the colnames correspond to our patient identifiers,
and the row names to the TCR chain (TRA = alpha, TRB = beta). Each cell
of the matrix is a data frame with all clones for the respective patient
and chain.
tcr <- assay(mae, "tcr")
class(tcr)
#> [1] "BumpyDataFrameMatrix"
#> attr(,"package")
#> [1] "BumpyMatrix"
colnames(tcr)
#> [1] "sample1" "sample2" "sample3" "sample4" "sample5"
rownames(tcr)
#> [1] "TRA" "TRB"
It is possible to retrieve this data frame for a single chain and patient:
tcr["TRA", "sample1"] |>
as_tibble() |>
select(-Clonal.Sequence)
#> # A tibble: 90 × 10
#> group group_name Clone.Count Clone.Frequency Top.V.Hits Top.J.Hits
#> <int> <chr> <dbl> <dbl> <chr> <chr>
#> 1 1 NA 193 0.0554 TRAV8-6, TRAV8-4 TRAJ5
#> 2 1 NA 193 0.0554 TRAV26-1 TRAJ44
#> 3 1 NA 158 0.0453 TRAV22 TRAJ28
#> 4 1 NA 138 0.0396 TRAV6 TRAJ41
#> 5 1 NA 137 0.0393 TRDV1 TRAJ23
#> 6 1 NA 136 0.0390 TRAV17 TRAJ12
#> 7 1 NA 130 0.0373 TRAV13-1 TRAJ12
#> 8 1 NA 129 0.0370 TRAV19 TRAJ13
#> 9 1 NA 121 0.0347 TRAV13-1 TRAJ57
#> 10 1 NA 118 0.0339 TRAV21 TRAJ12
#> # ℹ 80 more rows
#> # ℹ 4 more variables: CDR3.Amino.Acid.Sequence <chr>, locus <chr>,
#> # sample <chr>, Top.D.Hits <chr>
We can also slice the matrix for a single variable, e.g. we can use this to retrieve only the CDR3 amino acid sequence for all clones. This results in another bumpy matrix, except that now every cell of the matrix is a character vector instead of a data frame.
tcr[, , "Top V Hits"]
#> 2 x 5 BumpyCharacterMatrix
#> rownames: TRA TRB
#> colnames: sample1 sample2 sample3 sample4 sample5
#> preview [1,1]:
#> chr [1:90] "TRAV8-6, TRAV8-4" "TRAV26-1" "TRAV22" "TRAV6" "TRDV1" "TRAV17" ...
Finally, and arguably most useful for us, it is also possible to convert the bumpy matrix back into a long-form data frame. Here, the column and rownames are added as additional columns to the data frame:
BumpyMatrix::unsplitAsDataFrame(tcr) |>
as_tibble() |>
select(-Clonal.Sequence)
#> # A tibble: 4,266 × 10
#> row column Clone.Count Clone.Frequency Top.V.Hits Top.J.Hits
#> <chr> <chr> <dbl> <dbl> <chr> <chr>
#> 1 TRA sample1 193 0.0554 TRAV8-6, TRAV8-4 TRAJ5
#> 2 TRA sample1 193 0.0554 TRAV26-1 TRAJ44
#> 3 TRA sample1 158 0.0453 TRAV22 TRAJ28
#> 4 TRA sample1 138 0.0396 TRAV6 TRAJ41
#> 5 TRA sample1 137 0.0393 TRDV1 TRAJ23
#> 6 TRA sample1 136 0.0390 TRAV17 TRAJ12
#> 7 TRA sample1 130 0.0373 TRAV13-1 TRAJ12
#> 8 TRA sample1 129 0.0370 TRAV19 TRAJ13
#> 9 TRA sample1 121 0.0347 TRAV13-1 TRAJ57
#> 10 TRA sample1 118 0.0339 TRAV21 TRAJ12
#> # ℹ 4,256 more rows
#> # ℹ 4 more variables: CDR3.Amino.Acid.Sequence <chr>, locus <chr>,
#> # sample <chr>, Top.D.Hits <chr>