PIMKL tutorial

Data retrieval

You can download the data from the tutorial from here.

In the following we assume you placed files in a folder called data with the following structure:

data
├── data.csv
├── gene_sets.gmt
├── interactions.csv
└── labels.csv

0 directories, 4 files

Please, check carefully the data format in case you want to run pimkl on your data.

Installation

For the installation of pimkl we suggest to follow the description reported here.

Run pimkl

The pimkl script reproduces the output that can be obtained from the PIMKL web service.

Pipeline execution

You can run the full pimkl pipeline (supervised) by executing:

pimkl -fd data/data.csv -nd tutorial --model_name EasyMKL data/interactions.csv network data/gene_sets.gmt genes data/preprocess data/output data/labels.csv

You can change the parameters (e.g., regularization, number of folds) by providing them to the script:

pimkl --help
Usage: pimkl [OPTIONS] NETWORK_CSV_FILE NETWORK_NAME GENE_SETS_GMT_FILE
             GENE_SETS_NAME PREPROCESS_DIR OUTPUT_DIR CLASS_LABEL_FILE [LAM]
             [K] [NUMBER_OF_FOLDS] [MAX_PER_CLASS] [SEED] [MAX_PROCESSES]
             [FOLD]

  Console script for a complete pimkl pipeline, including preprocessing and
  analysis. For more details consult the following console scripts, which
  are here executed in this order. `pimkl-preprocess --help` `pimkl-analyse
  run-performance-analysis --help`

Options:
  -fd, --data_csv_file PATH       [required]
  -nd, --data_name TEXT           [required]
  --model_name [EasyMKL|UMKLKNN|AverageMKL]
  --help                          Show this message and exit.

For example:

pimkl -fd data/data.csv -nd tutorial --model_name EasyMKL data/interactions.csv network data/gene_sets.gmt genes data/preprocess data/output data/labels.csv 0.2 5 50

Tip: increasing the number of folds is useful to have better estimates of the significant gene sets/pathways.

Stepwise execution

pimkl pipeline can be also run in a stepwise fashion.

Preprocessing

pimkl-preprocess prepares the pathway-specific inducers and the data for the subsequent analysis:

pimkl-preprocess --help
Usage: pimkl-preprocess [OPTIONS] NETWORK_CSV_FILE NETWORK_NAME
                        GENE_SETS_GMT_FILE GENE_SETS_NAME PREPROCESS_DIR

  Compute incuding Laplacian matrices and preprocess data matrices for
  matching features.

  Multiple data_csv_files may be passed. Each data_csv_file should readable
  as pandas.DataFrames `pd.read_csv(filename, sep=',', index_col=0)` where
  index are features (rows) and columns a are samples.

  The `network_csv_file` is an edge list readable with
  `pd.read_csv(filename)` where the 3rd columns is a numeric value.

  The `gene_sets_gmt_file` should follow the gmt specification. See http://s
  oftware.broadinstitute.org/cancer/software/gsea/wiki/index.php/Data_format
  s

  For each file, a name has to be passed. Names cannot contain "_" or "-".

  Results are written to `preprocess_dir`.

Options:
  -fd, --data_csv_file PATH  [required]
  -nd, --data_name TEXT      [required]
  --help                     Show this message and exit.

Execute it on the tutorial data by running:

pimkl-preprocess -fd data/data.csv -nd tutorial data/interactions.csv network data/gene_sets.gmt genes data/preprocess

Analysis

pimkl-analyse is responsible of analysing the preprocessed data.

pimkl-analyse --help
Usage: pimkl-analyse [OPTIONS] COMMAND [ARGS]...

Options:
  --help  Show this message and exit.

Commands:
  run-kpca                  KernelPCA with given pathway weights
  run-performance-analysis  train and test many folds

Here we focus on the component pimkl-analyse run-performance-analysis, to obtain prediction performance and an estimate of the most significant gene sets/pathways:

pimkl-analyse run-performance-analysis --help
Usage: pimkl-analyse run-performance-analysis [OPTIONS] NETWORK_NAME
                                              GENE_SETS_NAME PREPROCESS_DIR
                                              OUTPUT_DIR CLASS_LABEL_FILE
                                              [LAM] [K] [NUMBER_OF_FOLDS]
                                              [MAX_PER_CLASS] [SEED]
                                              [MAX_PROCESSES]

  Run classifications using pathway induced multiple kernel learning on
  preprocessed data and inducers on a number of train/test splits and
  analyse the resulting classification performance and learned pathway
  weights.

  The `class_label_file` should be readable with `pd.read_csv(
  class_label_file, index_col=0, header=None, squeeze=True)`

Options:
  -nd, --data_name TEXT           [required]
  --model_name [EasyMKL|UMKLKNN|AverageMKL]
  --help                          Show this message and exit.

Run the analysis on the tutorial data by executing:

pimkl-analyse run-performance-analysis -nd tutorial --model_name EasyMKL network genes data/preprocess data/output data/labels.csv

If you want to have more info/examples on how to use pimkl feel free to open an issue with the tag tutorial on the repo.