Loading Personalis data • PersonalisIO

This vignette demonstrates how to use the {PersonalisIO} package to load Personalis data into a {MultiAssayExperiment} object.

suppressPackageStartupMessages({
  library(PersonalisIO)
  library(purrr)
  library(conflicted)
  conflicts_prefer(base::setdiff)
})
#> [conflicted] Will prefer base::setdiff over any
#> other package.

Personalis ship their data in a (more or less) standardized directory structure within one folder for each patient. The PersonalisIO loader functions traverse this directory structure and extract the different modalities. To load data from multiple patients into a single object, we need a list of patient directories:

path <- Sys.getenv("PERSONALIS_PATH")
sample_list <- list.files(path, full.names = TRUE) |> keep(dir.exists)
sample_list
#> [1] "/tmp/personalis_example_data/sample1"
#> [2] "/tmp/personalis_example_data/sample2"
#> [3] "/tmp/personalis_example_data/sample3"
#> [4] "/tmp/personalis_example_data/sample4"
#> [5] "/tmp/personalis_example_data/sample5"

Reading single modalities into a SummarizedExperiment

We can now read a single modality into a {SummarizedExperiment}. For instance, to read gene expression data only.

personalis_gex <- PersonalisIO::read_personalis_gene_expression(sample_list)

By inspecting the object, we can see, this imported gene expression data from 5 patients and ~23k genes. There are three different assays: counts, tpm and normCounts:

personalis_gex
#> class: SummarizedExperiment 
#> dim: 22956 5 
#> metadata(0):
#> assays(3): counts tpm normCounts
#> rownames(22956): 1 503538 ... 23140 26009
#> rowData names(1): Gene Symbol
#> colnames(5): sample1 sample2 sample3 sample4 sample5
#> colData names(0):

The following functions to load individual modalities exist, and they all return a {SummarizedExperiment}:

read_personalis_gene_expression
read_personalis_small_variant_reports
read_personalis_cnv_reports
read_personalis_hla_reports
read_personalis_tcr_reports
read_personalis_msi_reports

Reading multiple modalities into a MultiAssayExperiment

To import all modalities at once, we can use the read_personalis function. Not all modalities may be present for all samples. Some may not be present in any sample of the dataset. This is shown as warnings during the import process. It is your repsonsibility to check the warnings carefully to make sure that all data that you expect to be there is successfully imported.

mae <- PersonalisIO::read_personalis(sample_list)
#> >>> Reading gene expression...
#> >>> Reading DNA small variant data...
#> Reading DNA small variant data (somatic)...
#> Reading DNA small variant data (tumor)...
#> Reading DNA small variant data (normal)...
#> ⚠️ No gene DNA small variant (normal) found for sample1. Sample ignored.
#> ⚠️ No gene DNA small variant (normal) found for sample2. Sample ignored.
#> ⚠️ No gene DNA small variant (normal) found for sample3. Sample ignored.
#> ⚠️ No gene DNA small variant (normal) found for sample4. Sample ignored.
#> ⚠️ No gene DNA small variant (normal) found for sample5. Sample ignored.
#> Warning in warn_missing_samples(sample_list, sample_paths, description): 5 DNA
#> small variant (normal) sample(s) were not found. Please check the log for
#> details.
#> >>> Reading RNA small variant data...
#> Reading RNA small variant data (somatic)...
#> Reading RNA small variant data (tumor)...
#> Reading RNA small variant data (normal)...
#> ⚠️ No gene RNA small variant (normal) found for sample1. Sample ignored.
#> ⚠️ No gene RNA small variant (normal) found for sample2. Sample ignored.
#> ⚠️ No gene RNA small variant (normal) found for sample3. Sample ignored.
#> ⚠️ No gene RNA small variant (normal) found for sample4. Sample ignored.
#> ⚠️ No gene RNA small variant (normal) found for sample5. Sample ignored.
#> Warning in warn_missing_samples(sample_list, sample_paths, description): 5 RNA
#> small variant (normal) sample(s) were not found. Please check the log for
#> details.
#> >>> Reading microsatellite stability (MSI) data...
#> >>> Reading copy number variation (CNV) data...
#> >>> Reading HLA types...
#> ⚠️ No gene HLA found for sample1. Sample ignored.
#> ⚠️ No gene HLA found for sample2. Sample ignored.
#> ⚠️ No gene HLA found for sample3. Sample ignored.
#> ⚠️ No gene HLA found for sample4. Sample ignored.
#> ⚠️ No gene HLA found for sample5. Sample ignored.
#> Warning in warn_missing_samples(sample_list, sample_paths, description): 5 HLA
#> sample(s) were not found. Please check the log for details.
#> >>> Reading TCR clone reports...

Here’s an overview of the {MultiAssayExperiment} object with all available modalities:

mae
#> A MultiAssayExperiment object of 8 listed
#>  experiments with user-defined names and respective classes.
#>  Containing an ExperimentList class object of length 8:
#>  [1] gene_expression: SummarizedExperiment with 22956 rows and 5 columns
#>  [2] dna_small_variants_somatic: SummarizedExperiment with 6097 rows and 5 columns
#>  [3] dna_small_variants_tumor: SummarizedExperiment with 376654 rows and 5 columns
#>  [4] rna_small_variants_somatic: SummarizedExperiment with 5302 rows and 5 columns
#>  [5] rna_small_variants_tumor: SummarizedExperiment with 201347 rows and 5 columns
#>  [6] msi: SummarizedExperiment with 5 rows and 5 columns
#>  [7] cnv: SummarizedExperiment with 145 rows and 5 columns
#>  [8] tcr: SummarizedExperiment with 2 rows and 5 columns
#> Functionality:
#>  experiments() - obtain the ExperimentList instance
#>  colData() - the primary/phenotype DataFrame
#>  sampleMap() - the sample coordination DataFrame
#>  `$`, `[`, `[[` - extract colData columns, subset, or experiment
#>  *Format() - convert into a long or wide DataFrame
#>  assays() - convert ExperimentList to a SimpleList of matrices
#>  exportClass() - save data to flat files

We can retrieve an individual modality by subsetting with double brackets:

mae[["gene_expression"]]
#> class: SummarizedExperiment 
#> dim: 22956 5 
#> metadata(0):
#> assays(3): counts tpm normCounts
#> rownames(22956): 1 503538 ... 23140 26009
#> rowData names(1): Gene Symbol
#> colnames(5): sample1 sample2 sample3 sample4 sample5
#> colData names(0):

For more information about advanced subsetting, please refer to the MultiAssayExperiment vignette.

Summary statistics

For some modalities, patient-level summary statistics are provided by Personalis in their HTML reports. These are gathered automatically by the respective reader function, and stored in colData. For instance, tumor mutational burden is available from the dna_small_variants_somatic modality:

colnames(colData(mae[["dna_small_variants_somatic"]]))
#>  [1] "Somatic variants (SNVs)"                        
#>  [2] "Somatic variants (Indels)"                      
#>  [3] "Somatic variants (Total)"                       
#>  [4] "Somatic variants per Mb (SNVs)"                 
#>  [5] "Somatic variants per Mb (Indels)"               
#>  [6] "Somatic variants per Mb (Total)"                
#>  [7] "Ti/Tv ratio (SNVs)"                             
#>  [8] "Ti/Tv ratio (Indels)"                           
#>  [9] "Ti/Tv ratio (Total)"                            
#> [10] "Non-synonymous Somatic Variants (SNVs)"         
#> [11] "Non-synonymous Somatic Variants (Indels)"       
#> [12] "Non-synonymous Somatic Variants (Total)"        
#> [13] "Non-synonymous Somatic Variants per Mb (SNVs)"  
#> [14] "Non-synonymous Somatic Variants per Mb (Indels)"
#> [15] "Non-synonymous Somatic Variants per Mb (Total)"

Let’s take a closer look at the different mutation rates

colData(mae[["dna_small_variants_somatic"]]) |>
  as_tibble(rownames = "sample") |>
  select(sample, starts_with("Somatic.variants.per.Mb")) |>
  head() |>
  knitr::kable()

sample	Somatic.variants.per.Mb..SNVs.	Somatic.variants.per.Mb..Indels.	Somatic.variants.per.Mb..Total.
sample1	18.82	0.70	19.52
sample2	13.14	0.39	13.53
sample3	12.95	0.58	13.53
sample4	10.68	0.48	11.16
sample5	26.62	1.05	27.67

Working with BumpyMatrices

Not all Personalis data is strictly rectangular like a gene expression matrix and nicely fits into a SummarizedExperiment. For instance, TCR clones or small variant reports are a long-form data frame with many entries for each patient. The {BumpyMatrix} package comes to the rescue. It allows to fit long-form data frames into a {SummarizedExperiment} by building a matrix that is rectangular in the first two dimensions, but is ragged (i.e. arbitrary length) in the third dimension.

In practice, bumpy matrices are a bit cumbersome to work with. They are to be seen as an intermediate container that allows to store particular kinds of data in the MultiAssayExperiment object and (after subsetting to the patients of interest) can be converted back to a data frame using BumpyMatrix::unsplitAsDataFrame.

Example: TCR data stored as BumpyMatrix

The tcr assay of our mae object is such a BumpyMatrix. Here, the colnames correspond to our patient identifiers, and the row names to the TCR chain (TRA = alpha, TRB = beta). Each cell of the matrix is a data frame with all clones for the respective patient and chain.

tcr <- assay(mae, "tcr")
class(tcr)
#> [1] "BumpyDataFrameMatrix"
#> attr(,"package")
#> [1] "BumpyMatrix"
colnames(tcr)
#> [1] "sample1" "sample2" "sample3" "sample4" "sample5"
rownames(tcr)
#> [1] "TRA" "TRB"

It is possible to retrieve this data frame for a single chain and patient:

tcr["TRA", "sample1"] |>
  as_tibble() |>
  select(-Clonal.Sequence)
#> # A tibble: 90 × 10
#>    group group_name Clone.Count Clone.Frequency Top.V.Hits       Top.J.Hits
#>    <int> <chr>            <dbl>           <dbl> <chr>            <chr>     
#>  1     1 NA                 193          0.0554 TRAV8-6, TRAV8-4 TRAJ5     
#>  2     1 NA                 193          0.0554 TRAV26-1         TRAJ44    
#>  3     1 NA                 158          0.0453 TRAV22           TRAJ28    
#>  4     1 NA                 138          0.0396 TRAV6            TRAJ41    
#>  5     1 NA                 137          0.0393 TRDV1            TRAJ23    
#>  6     1 NA                 136          0.0390 TRAV17           TRAJ12    
#>  7     1 NA                 130          0.0373 TRAV13-1         TRAJ12    
#>  8     1 NA                 129          0.0370 TRAV19           TRAJ13    
#>  9     1 NA                 121          0.0347 TRAV13-1         TRAJ57    
#> 10     1 NA                 118          0.0339 TRAV21           TRAJ12    
#> # ℹ 80 more rows
#> # ℹ 4 more variables: CDR3.Amino.Acid.Sequence <chr>, locus <chr>,
#> #   sample <chr>, Top.D.Hits <chr>

We can also slice the matrix for a single variable, e.g. we can use this to retrieve only the CDR3 amino acid sequence for all clones. This results in another bumpy matrix, except that now every cell of the matrix is a character vector instead of a data frame.

tcr[, , "Top V Hits"]
#> 2 x 5 BumpyCharacterMatrix
#> rownames: TRA TRB 
#> colnames: sample1 sample2 sample3 sample4 sample5 
#> preview [1,1]:
#>   chr [1:90] "TRAV8-6, TRAV8-4" "TRAV26-1" "TRAV22" "TRAV6" "TRDV1" "TRAV17" ...

Finally, and arguably most useful for us, it is also possible to convert the bumpy matrix back into a long-form data frame. Here, the column and rownames are added as additional columns to the data frame:

BumpyMatrix::unsplitAsDataFrame(tcr) |>
  as_tibble() |>
  select(-Clonal.Sequence)
#> # A tibble: 4,266 × 10
#>    row   column  Clone.Count Clone.Frequency Top.V.Hits       Top.J.Hits
#>    <chr> <chr>         <dbl>           <dbl> <chr>            <chr>     
#>  1 TRA   sample1         193          0.0554 TRAV8-6, TRAV8-4 TRAJ5     
#>  2 TRA   sample1         193          0.0554 TRAV26-1         TRAJ44    
#>  3 TRA   sample1         158          0.0453 TRAV22           TRAJ28    
#>  4 TRA   sample1         138          0.0396 TRAV6            TRAJ41    
#>  5 TRA   sample1         137          0.0393 TRDV1            TRAJ23    
#>  6 TRA   sample1         136          0.0390 TRAV17           TRAJ12    
#>  7 TRA   sample1         130          0.0373 TRAV13-1         TRAJ12    
#>  8 TRA   sample1         129          0.0370 TRAV19           TRAJ13    
#>  9 TRA   sample1         121          0.0347 TRAV13-1         TRAJ57    
#> 10 TRA   sample1         118          0.0339 TRAV21           TRAJ12    
#> # ℹ 4,256 more rows
#> # ℹ 4 more variables: CDR3.Amino.Acid.Sequence <chr>, locus <chr>,
#> #   sample <chr>, Top.D.Hits <chr>