Welcome to pimkl’s documentation!¶
pimkl¶
pathway induced multiple kernel learning for computational biology
Free software: MIT license
Documentation: https://pimkl.readthedocs.io.
Features¶
The pimkl command:
Usage: pimkl [OPTIONS] NETWORK_CSV_FILE NETWORK_NAME GENE_SETS_GMT_FILE
GENE_SETS_NAME PREPROCESS_DIR OUTPUT_DIR CLASS_LABEL_FILE [LAM]
[K] [NUMBER_OF_FOLDS] [MAX_PER_CLASS] [SEED] [MAX_PROCESSES]
[FOLD]
Console script for a complete pimkl pipeline, including preprocessing and
analysis. For more details consult the following console scripts, which
are here executed in this order. `pimkl-preprocess --help` `pimkl-analyse
run-performance-analysis --help`
Options:
-fd, --data_csv_file PATH [required]
-nd, --data_name TEXT [required]
--model_name [EasyMKL|UMKLKNN|AverageMKL]
--help Show this message and exit.
Requirements¶
C++14 capable C++ compiler
cmake (>3.0.2)
Python
Installation¶
Install the dependencies
pip install -r requirements.txt
Install the package
pip install .
Tutorial¶
You can find a brief tutorial in the dedicated folder.
Web service deprecation¶
The PIMKL web service has been deprecated in favour of the Python package hosted in this repository. Please check the examples and the tutorials to use PIMKL in your research.
Credits¶
This package was created with Cookiecutter and the audreyr/cookiecutter-pypackage project template.
Installation¶
Stable release¶
To install pimkl, run this command in your terminal:
$ pip install pimkl
This is the preferred method to install pimkl, as it will always install the most recent stable release.
If you don’t have pip installed, this Python installation guide can guide you through the process.
From sources¶
The sources for pimkl can be downloaded from the GitHub repo.
You can either clone the public repository:
$ git clone https://github.com/PhosphorylatedRabbits/pimkl
Or download the tarball:
$ curl -OJL https://github.com/PhosphorylatedRabbits/pimkl/tarball/master
Once you have a copy of the source, you can install it with:
$ pip install .
PIMKL tutorial¶
Data retrieval¶
You can download the tutorial data from here.
In the following we assume you placed files in a folder called data with the following structure:
data
├── data.csv
├── gene_sets.gmt
├── interactions.csv
└── labels.csv
0 directories, 4 files
Please check the data format carefully if you want to run pimkl on your own data.
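As a rough sanity check of the expected formats, the sketch below loads a data matrix and an edge list the way the console help of `pimkl-preprocess` describes (features as rows, samples as columns, numeric third column in the edge list). The file contents, the column names and the helper `describe_inputs` are made up for illustration, and treating the first two edge-list columns as node names is an assumption.

```python
import io

import pandas as pd

# Tiny in-memory stand-ins for data.csv and interactions.csv (made-up values).
data_csv = io.StringIO(
    ",s1,s2,s3\nTP53,0.1,0.4,0.3\nEGFR,1.2,0.9,1.1\nMYC,0.5,0.2,0.7\n"
)
interactions_csv = io.StringIO(
    "source,target,weight\nTP53,EGFR,0.8\nEGFR,KRAS,0.5\n"
)

# Features (genes) are rows, samples are columns.
data = pd.read_csv(data_csv, sep=",", index_col=0)
# Edge list; the third column is expected to be numeric.
network = pd.read_csv(interactions_csv)


def describe_inputs(data, network):
    """Summarise the inputs (hypothetical helper, not part of pimkl)."""
    # Assumption: the first two columns of the edge list hold node names.
    nodes = set(network.iloc[:, 0]) | set(network.iloc[:, 1])
    return {
        "n_features": data.shape[0],
        "n_samples": data.shape[1],
        "weights_numeric": pd.api.types.is_numeric_dtype(network.iloc[:, 2]),
        "shared_genes": sorted(set(data.index) & nodes),
    }


summary = describe_inputs(data, network)
print(summary)
```

An empty `shared_genes` list would suggest that the gene identifiers in the data and in the network do not use the same naming scheme.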
Installation¶
To install pimkl, we suggest following the instructions reported here.
Run pimkl¶
The pimkl script reproduces the output that can be obtained from the PIMKL web service.
Pipeline execution¶
You can run the full pimkl pipeline (supervised) by executing:
pimkl -fd data/data.csv -nd tutorial --model_name EasyMKL data/interactions.csv network data/gene_sets.gmt genes data/preprocess data/output data/labels.csv
You can change the parameters (e.g., regularization, number of folds) by providing them to the script:
pimkl --help
Usage: pimkl [OPTIONS] NETWORK_CSV_FILE NETWORK_NAME GENE_SETS_GMT_FILE
GENE_SETS_NAME PREPROCESS_DIR OUTPUT_DIR CLASS_LABEL_FILE [LAM]
[K] [NUMBER_OF_FOLDS] [MAX_PER_CLASS] [SEED] [MAX_PROCESSES]
[FOLD]
Console script for a complete pimkl pipeline, including preprocessing and
analysis. For more details consult the following console scripts, which
are here executed in this order. `pimkl-preprocess --help` `pimkl-analyse
run-performance-analysis --help`
Options:
-fd, --data_csv_file PATH [required]
-nd, --data_name TEXT [required]
--model_name [EasyMKL|UMKLKNN|AverageMKL]
--help Show this message and exit.
For example:
pimkl -fd data/data.csv -nd tutorial --model_name EasyMKL data/interactions.csv network data/gene_sets.gmt genes data/preprocess data/output data/labels.csv 0.2 5 50
Tip: increasing the number of folds is useful to have better estimates of the significant gene sets/pathways.
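When the pipeline is launched from a Python script, for instance to scan several values of the regularization parameter, the command line can be assembled programmatically. The following is a minimal sketch: `build_pimkl_command` is a hypothetical helper, not part of pimkl, and it only mirrors the argument order shown in the usage message above.

```python
import shlex


def build_pimkl_command(data_csv, data_name, network_csv, network_name,
                        gene_sets_gmt, gene_sets_name, preprocess_dir,
                        output_dir, class_label_file, model_name="EasyMKL",
                        optional=()):
    """Assemble the argument list for the `pimkl` console script."""
    command = [
        "pimkl",
        "-fd", data_csv,
        "-nd", data_name,
        "--model_name", model_name,
        network_csv, network_name,
        gene_sets_gmt, gene_sets_name,
        preprocess_dir, output_dir, class_label_file,
    ]
    # Optional positionals follow in the order LAM, K, NUMBER_OF_FOLDS, ...
    command.extend(str(value) for value in optional)
    return command


command = build_pimkl_command(
    "data/data.csv", "tutorial", "data/interactions.csv", "network",
    "data/gene_sets.gmt", "genes", "data/preprocess", "data/output",
    "data/labels.csv", optional=(0.2, 5, 50),
)
print(shlex.join(command))
# To actually run it (requires pimkl to be installed):
# import subprocess; subprocess.run(command, check=True)
```

Passing the arguments as a list avoids shell quoting issues when paths contain spaces.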
Stepwise execution¶
The pimkl pipeline can also be run in a stepwise fashion.
Preprocessing¶
pimkl-preprocess prepares the pathway-specific inducers and the data for the subsequent analysis:
pimkl-preprocess --help
Usage: pimkl-preprocess [OPTIONS] NETWORK_CSV_FILE NETWORK_NAME
GENE_SETS_GMT_FILE GENE_SETS_NAME PREPROCESS_DIR
Compute inducing Laplacian matrices and preprocess data matrices for
matching features.
Multiple data_csv_files may be passed. Each data_csv_file should be readable
as a pandas.DataFrame with `pd.read_csv(filename, sep=',', index_col=0)`, where
the index holds features (rows) and the columns are samples.
The `network_csv_file` is an edge list readable with
`pd.read_csv(filename)` where the 3rd column is a numeric value.
The `gene_sets_gmt_file` should follow the gmt specification. See
http://software.broadinstitute.org/cancer/software/gsea/wiki/index.php/Data_formats
For each file, a name has to be passed. Names cannot contain "_" or "-".
Results are written to `preprocess_dir`.
Options:
-fd, --data_csv_file PATH [required]
-nd, --data_name TEXT [required]
--help Show this message and exit.
Execute it on the tutorial data by running:
pimkl-preprocess -fd data/data.csv -nd tutorial data/interactions.csv network data/gene_sets.gmt genes data/preprocess
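To inspect a gene set file before preprocessing, a GMT file can be read with a few lines of Python. This is a minimal sketch based on the GMT layout (tab-separated fields: set name, description, then the genes); `parse_gmt` and `is_valid_name` are hypothetical helpers, not functions from pimkl, and the example lines are made up.

```python
def parse_gmt(lines):
    """Parse GMT lines into {gene_set_name: set_of_genes}."""
    gene_sets = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        name, _description, *genes = fields
        gene_sets[name] = {gene for gene in genes if gene}
    return gene_sets


def is_valid_name(name):
    """Names passed on the command line cannot contain "_" or "-"."""
    return "_" not in name and "-" not in name


example = [
    "PATHWAY_A\thttp://example.org/a\tTP53\tEGFR\tKRAS\n",
    "PATHWAY_B\tna\tMYC\tTP53\n",
]
gene_sets = parse_gmt(example)
print(sorted(gene_sets["PATHWAY_A"]))
print(is_valid_name("tutorial"), is_valid_name("my_data"))
```

With a real file, `parse_gmt(open("data/gene_sets.gmt"))` would work the same way. The name check applies to the data, network and gene set names given to the script, not to the pathway names inside the GMT file.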
Analysis¶
pimkl-analyse is responsible for analysing the preprocessed data:
pimkl-analyse --help
Usage: pimkl-analyse [OPTIONS] COMMAND [ARGS]...
Options:
--help Show this message and exit.
Commands:
run-kpca KernelPCA with given pathway weights
run-performance-analysis train and test many folds
Here we focus on the component pimkl-analyse run-performance-analysis, to obtain prediction performance and an estimate of the most significant gene sets/pathways:
pimkl-analyse run-performance-analysis --help
Usage: pimkl-analyse run-performance-analysis [OPTIONS] NETWORK_NAME
GENE_SETS_NAME PREPROCESS_DIR
OUTPUT_DIR CLASS_LABEL_FILE
[LAM] [K] [NUMBER_OF_FOLDS]
[MAX_PER_CLASS] [SEED]
[MAX_PROCESSES]
Run classifications using pathway induced multiple kernel learning on
preprocessed data and inducers on a number of train/test splits and
analyse the resulting classification performance and learned pathway
weights.
The `class_label_file` should be readable with `pd.read_csv(
class_label_file, index_col=0, header=None, squeeze=True)`
Options:
-nd, --data_name TEXT [required]
--model_name [EasyMKL|UMKLKNN|AverageMKL]
--help Show this message and exit.
Run the analysis on the tutorial data by executing:
pimkl-analyse run-performance-analysis -nd tutorial --model_name EasyMKL network genes data/preprocess data/output data/labels.csv
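The help text above reads the class labels with `squeeze=True`, a keyword that was deprecated in pandas 1.4 and removed in pandas 2.0. To preview a label file with a recent pandas, an equivalent call is sketched below; the sample ids and class names are made up, and this only illustrates the file layout, not what pimkl does internally.

```python
import io

import pandas as pd

# Stand-in for labels.csv: sample id, class label, no header row.
labels_csv = io.StringIO("s1,tumor\ns2,normal\ns3,tumor\ns4,normal\n")

# .squeeze("columns") turns the single-column frame into a Series,
# matching the behaviour of the former `squeeze=True` keyword.
labels = pd.read_csv(labels_csv, index_col=0, header=None).squeeze("columns")

# Number of samples per class, useful when choosing MAX_PER_CLASS.
class_counts = labels.value_counts().to_dict()
print(class_counts)
```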
If you want more information or examples on how to use pimkl, feel free to open an issue with the tag tutorial on the repo.
pimkl API¶
pimkl package¶
Subpackages¶
pimkl.cli package¶
Submodules¶
pimkl.cli.analyse module¶
- pimkl.cli.analyse.analyse(data_names, network_name, gene_sets_name, preprocess_dir, output_dir, class_label_file, model_name='EasyMKL', lam=0.2, k=5, number_of_folds=2, max_per_class=20, seed=0, max_processes=2)[source]¶
pimkl.cli.cli module¶
Console script for pimkl.
pimkl.cli.preprocess module¶
Main module.
- pimkl.cli.preprocess.preprocess_data_and_inducers(data_csv_files, data_names, network_csv_file, network_name, gene_sets_gmt_file, gene_sets_name, preprocess_dir, match_samples)[source]¶
Inducers, that is Laplacian matrices for geneset subnetworks, and data are preprocessed and written to file. Data and inducers are filtered for genes (per dataset) available in the data and the network and the union of genesets. Conditionally, also the data is filtered for matching samples over all datasets.
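The gene filtering described above can be pictured as a set intersection. The sketch below is an illustration under simplifying assumptions (a single dataset, edges given as tuples); `filter_genes` is a hypothetical helper and not the function used by pimkl.

```python
def filter_genes(data_genes, network_edges, gene_sets):
    """Keep genes found in the data, the network and at least one gene set."""
    network_genes = {node for edge in network_edges for node in edge[:2]}
    gene_set_union = set().union(*gene_sets.values())
    kept = set(data_genes) & network_genes & gene_set_union
    # Restrict every gene set and every edge to the retained genes.
    filtered_sets = {name: genes & kept for name, genes in gene_sets.items()}
    filtered_edges = [
        edge for edge in network_edges if edge[0] in kept and edge[1] in kept
    ]
    return kept, filtered_sets, filtered_edges


data_genes = ["TP53", "EGFR", "MYC", "BRCA1"]
network_edges = [("TP53", "EGFR", 0.8), ("EGFR", "KRAS", 0.5), ("MYC", "TP53", 0.3)]
gene_sets = {"A": {"TP53", "EGFR", "KRAS"}, "B": {"MYC", "BRCA1"}}

kept, filtered_sets, filtered_edges = filter_genes(
    data_genes, network_edges, gene_sets
)
print(sorted(kept))
```

Here KRAS is dropped because it is missing from the data, and BRCA1 because it is missing from the network.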
Module contents¶
pimkl.factories package¶
Submodules¶
pimkl.factories.estimator_factory module¶
- estimator_factory.ESTIMATOR_FACTORY = {'EasyMKL': pymimkl.EasyMKL, 'SVC': <class 'sklearn.svm._classes.SVC'>}¶
pimkl.factories.induction_factory module¶
- induction_factory.INDUCTION_FACTORY = {'induce_gaussian_kernel': pymimkl.induce_gaussian_kernel, 'induce_linear_kernel': pymimkl.induce_linear_kernel, 'induce_polynomial_kernel': pymimkl.induce_polynomial_kernel, 'induce_sigmoidal_kernel': pymimkl.induce_sigmoidal_kernel}¶
pimkl.factories.mkl_factory module¶
- class pimkl.factories.mkl_factory.WeightedAverageMKL(*args: Any, **kwargs: Any)[source]¶
Bases: AverageMKL
Small wrapper around AverageMKL in which the additional constructor parameter kernels_weights is used to predict a weighted final kernel rather than the plain average.
The applied weights are corrected to sum up to one.
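The idea of a weighted kernel average with weights rescaled to sum to one can be sketched in NumPy as follows. This is an illustration of the concept only; `weighted_average_kernel` is a hypothetical function and does not reproduce the pymimkl implementation.

```python
import numpy as np


def weighted_average_kernel(kernels, kernels_weights):
    """Combine kernel matrices using weights rescaled to sum to one."""
    weights = np.asarray(kernels_weights, dtype=float)
    weights = weights / weights.sum()
    stacked = np.stack(kernels)  # shape: (n_kernels, n_samples, n_samples)
    # Contract the weight vector with the first axis of the stack.
    return np.tensordot(weights, stacked, axes=1)


k1 = np.eye(2)
k2 = np.ones((2, 2))
combined = weighted_average_kernel([k1, k2], kernels_weights=[3.0, 1.0])
print(combined)
```

With equal weights the result reduces to the plain average of the kernels.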
- mkl_factory.MKL_FACTORY = {'AverageMKL': pymimkl.AverageMKL, 'EasyMKL': pymimkl.EasyMKL, 'UMKLKNN': pymimkl.UMKLKNN, 'WeightedAverageMKL': <class 'pimkl.factories.mkl_factory.WeightedAverageMKL'>}¶
Module contents¶
pimkl.models package¶
Submodules¶
pimkl.models.pimkl module¶
Pathway Induced Multiple Kernel Learning.
- class pimkl.models.pimkl.PIMKL(inducers, induction='induce_linear_kernel', mkl='UMKLKNN', estimator='EasyMKL', induction_parameters={}, mkl_parameters={'epsilon': 0.0001, 'k': 5, 'kernel_normalization': True, 'maxiter_qp': 100000, 'precompute': True}, estimator_parameters={'epsilon': 1e-05, 'kernel_normalization': False, 'lam': 0.2, 'precompute': True, 'regularization_factor': False})[source]¶
Bases: BaseEstimator, ClassifierMixin
Pathway Induced Multiple Kernel Learning with choice of MKL and estimator algorithm. Estimator is only trained when MKL is not an estimator itself.
Module contents¶
pimkl.utils package¶
Subpackages¶
Core data pre-processing utilities.
- pimkl.utils.preprocessing.core.enforce_pandas_dataframe_on_second_argument(function)[source]¶
Decorate to enforce pandas DataFrame argument as input.
Data scaling utilities.
Data standardization utilities.
Data pre-processing utilities.
Submodules¶
pimkl.utils.objects module¶
Module contents¶
Submodules¶
pimkl.analysis module¶
- pimkl.analysis.plot_aucs_to_buffer(df, save=False)[source]¶
Plot AUC for a multi-indexed pandas.DataFrame where df.columns.names = ['data', 'kind'].
- pimkl.analysis.plot_weights_significant_correlations_to_buffer(weights_df, correlation_type, save=False)[source]¶
Plot a heatmap of the correlations between different molecular signatures, showing values only where significant, where weights_df.index.names is ['fold', 'class'].
pimkl.data module¶
Split data into training and test.
- pimkl.data.get_learning_data(X, labels=None, max_per_class=30)[source]¶
Return split test and training data for a single data type.
- pimkl.data.get_learning_data_in_dict_mode(X, labels=None, data_types=None, max_per_class=30)[source]¶
Return split test and training data for multiple data types.
pimkl.evaluation module¶
pimkl.inducers module¶
- pimkl.inducers.get_matching_data_and_network(data, network)[source]¶
Intersect data labels with network node labels.
- pimkl.inducers.get_pathway_inducer(network, gene_set, normed=True)[source]¶
Get a Laplacian-based pathway inducer.
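As an illustration of what a Laplacian-based inducer may look like, the sketch below builds the graph Laplacian of the subnetwork spanned by one gene set, optionally in its symmetric normalized form. It assumes an undirected, weighted edge list; `pathway_laplacian` is a hypothetical function and may differ from pimkl's own computation.

```python
import numpy as np


def pathway_laplacian(edges, gene_set, normed=True):
    """Laplacian of the subnetwork restricted to the genes in gene_set."""
    genes = sorted(gene_set)
    index = {gene: i for i, gene in enumerate(genes)}
    adjacency = np.zeros((len(genes), len(genes)))
    for source, target, weight in edges:
        # Keep only edges with both endpoints inside the gene set.
        if source in index and target in index:
            adjacency[index[source], index[target]] = weight
            adjacency[index[target], index[source]] = weight
    degree = adjacency.sum(axis=1)
    laplacian = np.diag(degree) - adjacency
    if normed:
        # D^{-1/2} L D^{-1/2}; isolated nodes (degree 0) keep zero rows.
        scale = np.zeros_like(degree)
        scale[degree > 0] = 1.0 / np.sqrt(degree[degree > 0])
        laplacian = scale[:, None] * laplacian * scale[None, :]
    return genes, laplacian


edges = [("TP53", "EGFR", 1.0), ("EGFR", "KRAS", 1.0), ("MYC", "TP53", 1.0)]
genes, laplacian = pathway_laplacian(edges, {"TP53", "EGFR", "KRAS"})
print(genes)
print(laplacian.round(3))
```

The edge between MYC and TP53 is ignored because MYC is not part of the gene set.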
- pimkl.inducers.read_inducer(filename, size, header=None, sep=',')[source]¶
Read inducer in CSC format.
pimkl.network module¶
- pimkl.network.generate_random_sets(number_of_sets, max_nodes, nodes_labels, number_of_nodes=None)[source]¶
pimkl.pimkl module¶
Main module.
pimkl.run module¶
- pimkl.run.fold_generator(number_of_folds, data, labels, max_per_class, transformer_class=<class 'pimkl.utils.preprocessing.standardizer.Standardizer'>)[source]¶
Generate class-balanced splits of data and labels.
- pimkl.run.run_model(inducers, induction_name, mkl_name, estimator_name, mkl_parameters, estimator_parameters, induction_parameters, inducers_extended_names, fold_parameters)[source]¶
Run a single fold of the model with data splits from fold_generator.
The arguments are those of PIMKL, followed by inducers_extended_names and a dict containing the fold-specific arguments. In conjunction with partial and fold_generator, it can be used to run folds in parallel:
`list(pool.imap(run_fold, fold_generator(...)))`
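The pattern of combining `functools.partial` with a fold generator and `pool.imap` can be sketched with toy stand-ins. In the example below `balanced_folds` and `run_fold` are hypothetical placeholders for `fold_generator` and the fold runner; they only shuffle sample ids and report split sizes, and a thread-based pool is used to keep the sketch self-contained.

```python
import random
from functools import partial
from multiprocessing.dummy import Pool  # thread-based pool, same API as Pool


def balanced_folds(number_of_folds, labels, max_per_class, seed=0):
    """Yield train/test sample ids with equal class sizes in the train set."""
    rng = random.Random(seed)
    by_class = {}
    for sample, label in labels.items():
        by_class.setdefault(label, []).append(sample)
    # Leave at least one sample per class for testing.
    smallest = min(len(samples) for samples in by_class.values())
    per_class = min(max_per_class, smallest - 1)
    for fold in range(number_of_folds):
        train = []
        for samples in by_class.values():
            train.extend(rng.sample(sorted(samples), per_class))
        test = sorted(set(labels) - set(train))
        yield {"fold": fold, "train": sorted(train), "test": test}


def run_fold(model_name, fold_parameters):
    """Stand-in for training one fold; only reports the split sizes."""
    return (model_name, fold_parameters["fold"],
            len(fold_parameters["train"]), len(fold_parameters["test"]))


labels = {f"s{i}": ("tumor" if i % 2 else "normal") for i in range(10)}
with Pool(2) as pool:
    # partial fixes the fold-independent arguments; imap feeds one fold each.
    results = list(pool.imap(partial(run_fold, "EasyMKL"),
                             balanced_folds(3, labels, max_per_class=3)))
print(results)
```

Because `imap` returns results in submission order, each result can be matched to its fold index even when folds finish at different times.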
Module contents¶
Top-level package for pimkl.
Credits¶
Development Lead¶
Joris Cadow <joriscadow@gmail.com>
Matteo Manica <drugilsberg@gmail.com>
Contributors¶
None yet. Why not be the first?
History¶
0.1.1 (2020-11-05)¶
Release on PyPI.
Setup tox tests on travis.
Setup readthedocs.
0.1.0 (2019-10-01)¶
First release.