Peturbation-based benchmark

Introduction

This benchmark is based on experiments where specific kinases are expected to have an increased or decreased activity after being perturbed in various cell line experiments.

We provide the log fold changes of phosphorylation sites after the perturbation which can be used to infer changes in kinase activities using different methods or kinase-substrate libraries.

The methods are then evaluated based on whether they are able to recapitulate the peturbed kinases based on their inferred activities.

Getting set up

We first load the required packages to run the benchmark.

library(benchmarKIN)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

Since we call the get_performances function from the decoupler-py package we also need to set up reticulate.

library(reticulate)

# Use reticulate to install the Python package
py_install("git+https://github.com/saezlab/decoupler-py.git", pip = TRUE)

# If you are using a conda environment make sure to use the correct path for your
# python version and install the decoupler-py package directly into your environment

# use_python("path-to-python", required = TRUE)  #you can find the correct path by activating your environment and typing "which python" into the terminal
# py_install("git+https://github.com/saezlab/decoupler-py.git", pip = TRUE, envname = "path-to-environment")

Perturbation data

The perturbation data can be extracted using the load_perturbData() function. This data frame contains the logFC for the perturbation collection previously published by Hernandez-Armenta et al. and the perturbation experiments from Hijazi et al.

mat <- load_perturbData()
mat[1:5, 20:23]
#>                            273_75      276_75     279_75     282_75
#> ARF5_S103|ARF5|S103            NA  0.04380698         NA         NA
#> M6PR_S267|M6PR|S267   -0.05630667 -0.01278283 0.01527238 -0.3346162
#> FKBP4_S118|FKBP4|S118          NA  0.45777031         NA  0.1405713
#> FKBP4_S15|FKBP4|S15            NA          NA         NA         NA
#> FKBP4_S258|FKBP4|S258          NA          NA         NA         NA

Additional information about each experiment can be found in the meta data. This contains information about the targeted kinase(s), the PMID, etc.

meta <- load_meta()
head(meta)
#>        id                     Description              Control Time_min
#> 1 102_117                          RG7356              Vehicle       30
#> 2 105_117                          RG7356              Vehicle       90
#> 3  111_45 BI 4834 (on Mitosis Nocodazole) Mitosis (Nocodazole)       15
#> 4  114_48                         Starved              Insulin       NA
#> 5  117_48                          Torin1              Insulin       NA
#> 6  120_48                       Rapamycin              Insulin       NA
#>       PMID target sign            class cell_line treatment
#> 1 22777824   AKT1    1 Serine/Threonine      <NA>      <NA>
#> 2 22777824   AKT1    1 Serine/Threonine      <NA>      <NA>
#> 3 21857030   PLK1   -1 Serine/Threonine      <NA>      <NA>
#> 4 21659604   MTOR   -1 Serine/Threonine      <NA>      <NA>
#> 5 21659604   MTOR   -1 Serine/Threonine      <NA>      <NA>
#> 6 21659604   MTOR   -1 Serine/Threonine      <NA>      <NA>

Kinase activity inference

This data can be used to test any method for kinase activity inference.

We will use the z-score (as implemented by RoKAI) and the curated kinase- substrate library as in example.

Kinase-substrate library

For that we have already processed a version of a combination of curated libraries, namely PhosphoSitePlus, PTMsigDB (excluding iKiP-DB) and the gold standard set of GPS 5.0, that can be mapped to our phosphorylation site ids.

head(curated)
#>    source      target target_protein position mor        sequence
#> 1 EIF2AK1  EIF2S1_S52         EIF2S1      S52   1 MILLSELSRRRIRSI
#> 2 EIF2AK1  EIF2S1_S49         EIF2S1      S49   1 IEGMILLSELSRRRI
#> 3   PRKCD  HDAC5_S259          HDAC5     S259   1 FPLRKTASEPNLKVR
#> 4   PRKCD  PTPRA_S204          PTPRA     S204   1 PLLARSPSTNRKYPP
#> 5   PRKCD    BCL2_S70           BCL2      S70   1 RDPVARTSPLQTPAA
#> 6   PRKCD HNRNPK_S302         HNRNPK     S302   1 GRGGRGGSRARNLPL
#>           ENSEMBL
#> 1 ENSG00000134001
#> 2 ENSG00000134001
#> 3 ENSG00000108840
#> 4 ENSG00000132670
#> 5 ENSG00000171791
#> 6 ENSG00000165119

After mapping the phosphorylation sites we can bring it into the right format to run the run_zscore function.

curated$target <- paste(curated$target, curated$target_protein, curated$position, sep = "|")

curatedLib <- curated %>%
  dplyr::select(source, target, mor) %>%
  dplyr::distinct()

head(curatedLib)
#>    source                  target mor
#> 1 EIF2AK1   EIF2S1_S52|EIF2S1|S52   1
#> 2 EIF2AK1   EIF2S1_S49|EIF2S1|S49   1
#> 3   PRKCD   HDAC5_S259|HDAC5|S259   1
#> 4   PRKCD   PTPRA_S204|PTPRA|S204   1
#> 5   PRKCD       BCL2_S70|BCL2|S70   1
#> 6   PRKCD HNRNPK_S302|HNRNPK|S302   1

Activity inference using the z-score

The z-score calculates a score for each kinase by aggregating the change in abundance of the direct targets in relation to changes in the non-targets.

act_scores <- run_zscore(mat = mat, network = curatedLib)

act_scores[1:5, 1:5]
#>               15_3       18_3        21_3       24_3      96_36
#> PRKCD    3.1802463  2.3143338  3.42871525  1.2961965 -4.2716360
#> CAMK2A   0.6594898  0.7286016  0.07022767  1.9630066         NA
#> CSNK2A1  1.2340623  1.2400087  1.38745758  2.1008802 -0.1351479
#> LATS1   -1.6980139 -2.0950547 -0.86222145  0.3599908         NA
#> CDK7     0.1237668 -0.6063455 -0.92708849 -2.0542065  0.4655647

Perturbation benchmark

Area under the receiver operator curve

The inferred kinase activities can now be evaluated using the run_perturbBench() function. Hereby, it is important to note that the kinase symbol needs to match with the meta data (here gene symbols).

performance <- run_perturbBench(act = act_scores, meta = meta, method_id = "zscore_ppsp")

We can then plot the areas under the receiver operator curve (AUROCs) and can compare this with other methods or prior knowledge resources.

auroc_p <- performance %>%
  ggplot2::ggplot(ggplot2::aes(x=method, y=auroc)) +
  ggplot2::geom_boxplot(outlier.size=1, lwd=0.6) +
  ggplot2::geom_hline(yintercept = 0.5)

auroc_p

Area under the receiver operator curve

Additionally we can calculate the rank and scaled rank of the perturbed kinase(s) in their respective experiments. A lower rank/scaled rank indicates a better performance of the method

rank <- run_rank(act = act_scores, meta = meta)
median(rank$scaled_rank)
#> [1] 0.2354167