Guidelines and best practices

The FAIR for Research Software (FAIR4RS) Working Group has adapted the FAIR Guiding Principles to create the FAIR Principles for Research Software (FAIR4RS Principles). The contents and context of the FAIR4RS Principles are summarised here to provide the basis for discussion of their adoption. Examples of implementation by organisations are provided to share information on how to maximise the value of research outputs, and to encourage others to amplify the importance and impact of this work.



Computing Platforms

A web-based platform for biomedical data analysis. Galaxy provides a collection of data analysis tools and pipelines; analyses can be run on public Galaxy servers or on locally deployed instances.
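
Galaxy servers can also be scripted against. As a hedged illustration, the sketch below uses the BioBlend Python client (a separate library, not mentioned above) to connect to a server and list tools and workflows; the server URL and API key are placeholders.

```python
# Sketch: querying a Galaxy server with the BioBlend client (pip install bioblend).
# The URL and API key are placeholders for your own server and credentials.
from bioblend.galaxy import GalaxyInstance

gi = GalaxyInstance(url="https://usegalaxy.org", key="YOUR_API_KEY")

# List the first few tools visible to this account.
for tool in gi.tools.get_tools()[:5]:
    print(tool["id"], "-", tool["name"])

# List workflows owned by or shared with this account.
for wf in gi.workflows.get_workflows():
    print("workflow:", wf["name"])
```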



Software Applications and Workflows

A set of ten quick tips, drafted by experienced workflow developers, that help researchers apply the FAIR4RS principles to workflows. The tips are arranged according to the FAIR acronym, clarifying the purpose of each tip with respect to the FAIR4RS principles. Together, they serve as practical guidelines for workflow developers who want to contribute to more reproducible and sustainable computational science, and to positively impact the open science and FAIR communities.



This document presents the first version of the FAIR Principles for Research Software (FAIR4RS Principles), and includes explanatory text to aid adoption. It is an outcome of the FAIR for Research Software Working Group (FAIR4RS WG) based on community consultations that started in 2019.



Along with the packaged multi-omics data analysis workflow, the paper shares experiences of adopting various FAIR practices and creating a FAIR Digital Object. These experiences can help other researchers who develop omics data analysis workflows to turn the FAIR principles into practice.



A freely available, open-source Windows client application for building Selected Reaction Monitoring (SRM)/Multiple Reaction Monitoring (MRM), Parallel Reaction Monitoring (PRM), Data-Independent Acquisition (DIA/SWATH), and Data-Dependent Acquisition (DDA) with MS1 quantitative methods, and for analysing the resulting mass spectrometer data. It aims to employ cutting-edge technologies for creating and iteratively refining targeted methods for large-scale quantitative mass spectrometry studies in the life sciences.



The MaxQuant software provides a suite of advanced algorithms and statistical methods for accurate protein identification, quantification, and statistical analysis. It incorporates state-of-the-art computational tools for data preprocessing, normalisation, and downstream analysis, enabling researchers to extract meaningful insights from their experiments. MaxQuant supports a wide range of data types and offers customisable workflows to suit specific research needs. With its user-friendly interface and robust analytical capabilities, it is an essential tool for scientists conducting proteomics research and striving to uncover the intricacies of complex biological systems. For detailed information see https://www.nature.com/articles/nprot.2016.136

Tagged in: proteomics
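
As a small downstream example (not part of MaxQuant itself), the pandas sketch below loads MaxQuant's tab-separated proteinGroups.txt output and drops decoy and contaminant entries. The column names vary between MaxQuant versions, so treat them as assumptions and check your own output.

```python
# Sketch: loading a MaxQuant proteinGroups.txt output table with pandas.
# Column names ("Reverse", "Potential contaminant") vary between MaxQuant
# versions, so verify them against your own output files.
import pandas as pd

df = pd.read_csv("proteinGroups.txt", sep="\t", low_memory=False)

# MaxQuant marks decoy hits and likely contaminants with "+".
for col in ("Reverse", "Potential contaminant"):
    if col in df.columns:
        df = df[df[col] != "+"]

print(f"{len(df)} protein groups after filtering")
```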


WorkflowHub is a registry for scientific computational workflows designed around the FAIR principles. It allows analysis workflows to be shared as Research Object (RO) Crates with direct links to the source code on GitHub. RO-Crates are annotated with metadata about the workflows, which facilitates their findability. WorkflowHub is sponsored by multiple European initiatives, including EOSC-Life and ELIXIR.

Tagged in: registry
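
Every RO-Crate carries its metadata in a ro-crate-metadata.json file (a JSON-LD @graph, per the RO-Crate specification). As a hedged sketch, the snippet below parses that file from an unpacked crate, such as one downloaded from WorkflowHub, and prints the entities describing workflows; the file layout follows the spec, but the path is a placeholder.

```python
# Sketch: inspecting the metadata of an unpacked RO-Crate (e.g. one
# downloaded from WorkflowHub). "my-crate/" is a placeholder path.
import json

with open("my-crate/ro-crate-metadata.json") as fh:
    graph = json.load(fh)["@graph"]

# In the Workflow RO-Crate profile, entities whose @type includes
# "ComputationalWorkflow" describe the workflows in the crate.
for entity in graph:
    types = entity.get("@type", [])
    if "ComputationalWorkflow" in (types if isinstance(types, list) else [types]):
        print(entity.get("@id"), "-", entity.get("name"))
```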


nf-core/methylseq is a bioinformatics analysis pipeline for methylation (bisulfite) sequencing data. It pre-processes raw data from FastQ inputs, aligns the reads, and performs extensive quality control on the results. The pipeline is built using Nextflow, a workflow tool that runs tasks across multiple compute infrastructures in a very portable manner. It uses Docker / Singularity containers, making installation trivial and results highly reproducible. On release, automated continuous integration tests run the pipeline on a full-sized dataset on the AWS cloud infrastructure. This ensures that the pipeline runs on AWS, has sensible resource allocation defaults set for real-world datasets, and permits the persistent storage of results to benchmark between pipeline releases and other analysis sources. The results obtained from the full-sized test can be viewed on the nf-core website. The pipeline allows you to choose between running either Bismark or bwa-meth / MethylDackel.
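
As a hedged illustration of a launch command (here invoked from Python via subprocess), the sketch below follows current nf-core parameter conventions, which may differ between pipeline releases.

```python
# Sketch: launching nf-core/methylseq via Nextflow from Python.
# Requires Nextflow and Docker on PATH; --input/--outdir/--genome follow
# current nf-core conventions but may differ between pipeline releases.
import subprocess

subprocess.run(
    [
        "nextflow", "run", "nf-core/methylseq",
        "-profile", "docker",
        "--input", "samplesheet.csv",   # placeholder sample sheet
        "--outdir", "results",
        "--genome", "GRCh38",
    ],
    check=True,
)
```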



nf-core is a community collection of analysis pipelines built with Nextflow. This curated collection of computational workflows includes many omics preprocessing pipelines that have been widely adopted by the community. Several of these workflows were also applied in the EATRIS-Plus project.

Tagged in: registry


Perseus, available through the MaxQuant software suite, is a powerful computational tool designed for the analysis and visualisation of large-scale proteomics and omics datasets. It offers a comprehensive set of statistical and computational methods, enabling researchers to explore, preprocess, and analyse their data. With an intuitive user interface, Perseus facilitates data quality control, normalisation, and downstream analysis, empowering researchers to uncover valuable insights. It supports a wide range of data types and provides customisable workflows, allowing for flexible data analysis. Utilised extensively in proteomics research, Perseus serves as an invaluable resource for researchers aiming to gain a deeper understanding of complex biological data. For detailed information see https://www.nature.com/articles/nmeth.3901

Tagged in: proteomics


MZmine provides user-friendly, flexible, and easily extendable software with a complete set of modules covering the entire MS data analysis workflow, with the main focus on LC-MS data. It is based on the original MZmine toolbox described in a 2006 Bioinformatics publication (MZmine: toolbox for processing and visualization of mass spectrometry based molecular profile data, https://academic.oup.com/bioinformatics/article/22/5/634/206500?login=false), but has been completely redesigned and rewritten since then.



Recent studies have demonstrated the genetic significance of insertions, deletions, and other more complex structural variants (SVs) in the human population. With the development of next-generation sequencing technologies, high-throughput surveys of SVs at the whole-genome level have become possible. Here we present split-read identification, calibrated (SRiC), a sequence-based method for SV detection. We start by mapping each read to the reference genome in standard fashion using gapped alignment. Then, to identify SVs, we score each of the many initial mappings with an assessment strategy designed to take into account both sequencing and alignment errors (e.g. scoring events gapped in the center of a read more highly). All current SV calling methods have multilevel biases in their identifications due to both experimental and computational limitations (e.g. calling more deletions than insertions). A key aspect of our approach is that we calibrate all our calls against synthetic data sets generated from simulations of high-throughput sequencing (with realistic error models). This allows us to calculate sensitivity and positive predictive value under different parameter-value scenarios and for different classes of events (e.g. long deletions vs. short insertions). We run our calculations on representative data from the 1000 Genomes Project. Coupling the observed numbers of events on chromosome 1 with the calibrations gleaned from the simulations (for different event lengths) allows us to construct a relatively unbiased estimate of the total number of SVs in the human genome across a wide range of length scales. We estimate in particular that an individual genome contains ~670,000 indels/SVs. Compared with the existing read-depth and read-pair approaches for SV identification, our method can pinpoint the exact breakpoints of SV events, reveal the actual sequence content of insertions, and cover the whole size spectrum for deletions. Moreover, with the advent of third-generation sequencing technologies that produce longer reads, we expect our method to be even more useful.
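
The calibration step reduces to standard confusion-matrix arithmetic: comparing calls against the known events in a simulated data set yields true positives, false positives, and false negatives, from which sensitivity and positive predictive value follow. A minimal sketch of that arithmetic (illustrative only, not the SRiC implementation):

```python
# Sketch: the calibration arithmetic behind simulation-based benchmarking.
# Calls are compared against known (simulated) events; this is illustrative,
# not SRiC's actual matching logic, which operates on genomic intervals.
def calibrate(true_events: set, called_events: set) -> tuple[float, float]:
    tp = len(true_events & called_events)   # simulated events we recovered
    fp = len(called_events - true_events)   # calls with no simulated event
    fn = len(true_events - called_events)   # simulated events we missed
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    ppv = tp / (tp + fp) if tp + fp else 0.0
    return sensitivity, ppv

sens, ppv = calibrate({"del:chr1:100", "ins:chr1:900"}, {"del:chr1:100"})
print(f"sensitivity={sens:.2f}, PPV={ppv:.2f}")  # sensitivity=0.50, PPV=1.00
```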



CNVkit is a Python library and command-line software toolkit to infer and visualise copy number from high-throughput DNA sequencing data. It is designed for use with hybrid capture, including both whole-exome and custom target panels, and short-read sequencing platforms such as Illumina and Ion Torrent.
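
As a hedged usage sketch (the batch subcommand and these flags are part of CNVkit's documented command-line interface, but the file names are placeholders):

```python
# Sketch: running CNVkit's batch pipeline from Python. File names are
# placeholders; see the CNVkit documentation for the full set of options.
import subprocess

subprocess.run(
    [
        "cnvkit.py", "batch", "tumor.bam",
        "--normal", "normal.bam",
        "--targets", "capture_targets.bed",
        "--fasta", "reference.fa",
        "--output-dir", "cnvkit_results",
    ],
    check=True,
)
```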



A web server for multidimensional data integration



Different technologies, such as quantitative real-time PCR or microarrays, have been developed to measure microRNA (miRNA) expression levels. Quantification of miRNA transcripts requires data normalization using endogenous and exogenous reference genes for data correction. However, there is no consensus about an optimal normalization strategy. The choice of a reference gene remains problematic and can have a serious impact on the actual available transcript levels and, consequently, on the biological interpretation of the data. In this review article we discuss the reliability of the use of small RNAs, commonly reported in the literature as miRNA expression normalizers, and compare different strategies used for data normalization. A workflow strategy is proposed for normalization of miRNA expression data in an attempt to provide a basis for the establishment of a global standard procedure that will allow comparison across studies.
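
For qPCR data, the common reference-gene strategy is delta-Cq normalization: subtract the quantification cycle of the chosen reference small RNA from that of the target miRNA and express relative abundance as 2^(-ΔCq). A minimal sketch (choosing a stable reference is exactly the open question the review discusses):

```python
# Sketch: reference-gene (delta-Cq) normalization of qPCR miRNA data.
# relative abundance = 2 ** -(Cq_target - Cq_reference); selecting a stable
# reference small RNA is the open problem discussed in the review above.
def relative_expression(cq_target: float, cq_reference: float) -> float:
    delta_cq = cq_target - cq_reference
    return 2.0 ** (-delta_cq)

# Example: a target miRNA at Cq 24.1, normalized to a reference at Cq 20.3.
print(f"{relative_expression(24.1, 20.3):.4f}")  # ~0.0718
```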



An open-source, open-data registry of bioinformatics software, databases, and services, annotated with precise ontology-based descriptions. The bio.tools registry can be searched by keyword to find, for example, methods for multi-omics data analysis. Tools can be added and edited by the community.
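
bio.tools also exposes a public JSON API. A hedged sketch of a keyword query follows; the endpoint and parameters follow the published API documentation, but verify them against the current docs before relying on them.

```python
# Sketch: keyword search against the public bio.tools API (pip install requests).
# Endpoint path and parameters are taken from the published API docs; treat
# them as assumptions and check the current documentation.
import requests

resp = requests.get(
    "https://bio.tools/api/tool/",
    params={"q": "multi-omics", "format": "json"},
    timeout=30,
)
resp.raise_for_status()
for tool in resp.json().get("list", []):
    print(tool.get("biotoolsID"), "-", tool.get("name"))
```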



A protocol for raw data searching using Proteome Discoverer (Thermo), including a detailed description of all parameters.



Missing something? Contribute to the Multi-omics Toolbox!

Submit a resource