Case study: the SIDR data repository

Integration at the knowledge level


Knowledge integration relies on hierarchies of understanding (HoU) that begin with data and end with wisdom/intelligence, as originally described by Ackoff and summarized in the figure below.
In the digital age, knowledge integration is deeply intertwined with data and process interoperability, in the context of coherent meanings (semantics). Nevertheless, it is not limited to interoperability issues between tools because any relevant comprehensive framework, out of the scope of tools, must also take into account at the same time.

In this respect, it is generally accepted that solutions to a number of existing problems will come from new capabilities for combining knowledge from both internal and external sources. In these efforts to face worldwide transformations, knowledge integration is considered of strategic importance and is becoming a central topic that mobilizes a growing number of actors and resources.
As expected, the Life Sciences (including Health) are no exception, and many of the important challenges ahead concern knowledge integration. The case study below draws on multi-omics to address knowledge integration, using Model-Driven Engineering to deal with interoperability across multiple frameworks and domain models.

The "metadata/data" multi-omics challenge and community core frameworks

Problem statement

Omics approaches allow molecular analysis of subcellular compartments and their variations under different internal and/or external environmental conditions. They are often run as high-throughput assays and produce huge amounts of data; examples include Genomics (study of genes), Transcriptomics (study of mRNA), Proteomics (study of proteins), and so on. Each omics field is driven by an international community of experts and has its own methodologies, databases, standards, formats, etc.

Nevertheless, understanding and solving complex problems in living systems is likely to require data aggregation and integration at different levels of granularity, that is to say, multi-omics integration. This is a difficult task because data sets are heterogeneous in many different ways, as mentioned above. In particular, each experiment has to be aligned with those it is to be integrated with, so that consistent items can be bridged at the technical, biological, semantic levels and so on.
In practice, describing data integration approaches means first and foremost referring to metadata, which make experimental data meaningful. Standards have been proposed to help report the minimum information about an experiment, field by field and discipline by discipline; the Biosharing website provides a catalogue of domain standards.

But this is not enough to address complex issues: the "silo" organization of data, grounded on a discipline and/or technology basis, has to be overcome with a data resource that can collate all omics experiments in a standard format and contribute to a worldwide network of interoperable data resources.
This was achieved thanks to multi-omics frameworks.

Two umbrella frameworks for integrating omics data and metadata

An umbrella framework does not contain elements of its own; it contains the elements common to the other frameworks or schemas whose merging it enables.
In functional genomics, FuGE-OM (Functional Genomics Experiment Object Model) was developed to offer a high-level integration model; it provides the major building blocks for constructing the core of a domain-specific language for describing integrated biological data. FuGE was derived from MAGE-OM (MicroArray Gene Expression Object Model), a former OMG standard approved in 2003 (MAGE-OM consists of 17 packages, 132 classes and 223 associations), which allows the definition of all types of DNA arrays.
In parallel, the ISA-TAB data model was developed from the MAGE-TAB standard to provide the scientific community with a user-friendly format; just as FuGE-OM drew upon MAGE-OM expertise, ISA-TAB benefited from the MAGE tabular version.

In spite of these common origins, and while the FuGE and ISA-TAB frameworks largely match in their underlying principles (i.e., regarding Omics integration), each has specific characteristics of its own, and hence they differ. From a syntactic viewpoint, FuGE was designed in UML and transformed into XML, whereas ISA-TAB was presented in a single tabular format. From a semantic viewpoint, the two frameworks show serious differences; for example, the same term Investigation describes a single study in FuGE but a set of studies in ISA-TAB.


Handling frameworks

From Model-Driven Engineering to Model-Driven Interoperability

Model-Driven Engineering (MDE) is an approach to (software) development that focuses on domain models for specifying functional aspects, leaving aside implementation aspects, which are addressed separately; a metamodel (a model of models) is used to define the domain context and to provide domain-specific elements for designing and handling models.
Recently (Bruneliere et al., 2010), the role of models as key enablers of interoperability was highlighted, and the recommended process was detailed as follows:

  1. making explicit the metamodel (i.e., the internal schema) of each system,
  2. aligning metamodels by matching related concepts,
  3. performing model-to-model transformations to exploit this matching information and to export data.
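
The three steps above can be illustrated with a minimal sketch. It is not the tooling described by Bruneliere et al.; each metamodel is reduced to a dictionary from concept name to meaning, and the concept names "Study" and "StudyGroup" are hypothetical placeholders. Only the idea that one term, Investigation, can carry different meanings in two frameworks is taken from the text.

```python
# Step 1: make each metamodel explicit, reduced here to "concept name -> meaning".
# "Study" and "StudyGroup" are hypothetical placeholders, not real schema elements.
TABULAR_METAMODEL = {"Investigation": "set of studies", "Study": "single study"}
OBJECT_METAMODEL = {"StudyGroup": "set of studies", "Investigation": "single study"}


def align(source, target):
    """Step 2: match concepts by meaning rather than by name."""
    by_meaning = {meaning: name for name, meaning in target.items()}
    return {
        name: by_meaning[meaning]
        for name, meaning in source.items()
        if meaning in by_meaning
    }


def transform(elements, alignment):
    """Step 3: rewrite each model element into the target vocabulary."""
    return [
        {**element, "concept": alignment[element["concept"]]}
        for element in elements
        if element["concept"] in alignment
    ]


alignment = align(TABULAR_METAMODEL, OBJECT_METAMODEL)
source_model = [
    {"concept": "Investigation", "id": "inv-1"},
    {"concept": "Study", "id": "study-1"},
]
target_model = transform(source_model, alignment)
print(alignment)
print(target_model)
```

Matching on meaning instead of on name is what keeps the two uses of Investigation from being confused during the transformation.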

The case study below is a first phase in adapting Model-Driven Interoperability methods to data problems in Life Sciences, following a previous paper in which this perspective was discussed (Roux and Schuch da Rosa, 2006).

# Ten Top Reasons for Systems Biology to Get Into Model-Driven Engineering (PDF 220 MB).

Bridging frameworks: the SIDR data model

Accordingly, the SIDR case study was a pilot initiative conducted at CNRS to investigate how MDE methods could be used for the integration, storage and retrieval of multi-omics data. We grounded the approach on our preliminary work on "Metamodeling architectures for complex data integration in biology" (Terrasse and Roux, 2010): taking advantage of the standards currently being developed to achieve consensual representations of technological Omics domains, we presented a model architecture based on these standards, each metamodel being used to describe a consensus shared by several models.
Instead of creating a metamodel of our own, we first tested our idea using the FuGE reference framework, which was developed to provide all the modeling elements needed for Omics model design.

# Metamodelling Architectures for Complex Data Integration in Systems Biology (PDF 520 MB)

The ISA-TAB to FuGE-OM (ISA2FuGE) mapping specifications

Since the two core frameworks are both used by the Omics community and because our key objective is "integration", we decided not to choose between them. Instead, we chose to start directly from their common foundational elements by mapping the FuGE object model (FuGE-OM) to the ISA tabular model (ISA-TAB); we called the resulting model ISA2FuGE, the specifications of which are available below.

In this context, the ISA2FuGE model was a subset of the FuGE framework, since some FuGE concepts and relations were not included in the ISA frame. Conversely, some ISA elements were missing from the FuGE source model, and new elements had to be created, mainly through the FuGE extension mechanisms, to supply the missing links.
As a result, the ISA2FuGE model is the SIDR object model (SIDR-OM); it allows multi-omics data to be associated with FuGE- and ISA-compliant metadata.
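
The bookkeeping behind such a mapping can be sketched as follows. This is only an illustration of the subset/extension idea, not the ISA2FuGE specification: the function `classify_elements` is hypothetical, and the element names are deliberately meaningless placeholders rather than real ISA or FuGE elements.

```python
def classify_elements(isa_elements, fuge_elements, mapping):
    """Split elements into mapped pairs, ISA-only and FuGE-only groups."""
    # ISA elements that have a counterpart among the FuGE elements.
    mapped = {
        isa: mapping[isa]
        for isa in isa_elements
        if mapping.get(isa) in fuge_elements
    }
    # ISA elements without a counterpart: candidates for FuGE extensions.
    needs_extension = sorted(set(isa_elements) - set(mapped))
    # FuGE elements not reached by the mapping: left out of the resulting model.
    left_out = sorted(set(fuge_elements) - set(mapped.values()))
    return mapped, needs_extension, left_out


# Placeholder names, not taken from the ISA2FuGE specifications.
isa_elements = ["isa_a", "isa_b", "isa_c"]
fuge_elements = ["fuge_x", "fuge_y", "fuge_z"]
mapping = {"isa_a": "fuge_x", "isa_b": "fuge_y"}

mapped, needs_extension, left_out = classify_elements(
    isa_elements, fuge_elements, mapping
)
print(mapped, needs_extension, left_out)
```

In this toy case the resulting model would keep the two mapped pairs, add one extension for the unmapped tabular element, and omit the unreached object-model element.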

# ISA2FuGE Document Specifications (PDF 1300 MB)

The SIDR data repository

Data repository or metadata registry?

A data repository is very different from a database because, unlike in database systems, the data is not edited. As a consequence, validation procedures, often misleadingly called "data curation", in practice verify the consistency of the metadata associated with data sets (it is in fact "metadata curation", done by data curators); the quality of the data per se can only be verified upstream, by the data producers.
Accordingly, the SIDR data repository contains curated metadata about omics experiments and their associated data sets; it could be referred to as a "metadata registry" because it gives an extensive metadata description and data sets can only be retrieved through metadata search.
With respect to data storage, some data sets are stored in the SIDR central database, whereas others are kept in remote repositories and are only available via their file URLs. The nature and role of scientific metadata registries are discussed in a paper below (Roux, 2012).
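
A registry of this kind can be pictured with a small sketch. It does not reflect the actual SIDR schema: the `DatasetEntry` class, the `search` function, the metadata keys and the locations are all assumptions made for illustration. The point it shows is that data sets are reached only through their metadata, whether the file sits in the central store or behind a remote URL.

```python
from dataclasses import dataclass


@dataclass
class DatasetEntry:
    """One registry entry: curated metadata plus a pointer to the data."""
    metadata: dict
    location: str  # identifier in the central store, or a remote file URL
    remote: bool


def search(registry, **criteria):
    """Return the locations of entries whose metadata match every criterion."""
    return [
        entry.location
        for entry in registry
        if all(entry.metadata.get(key) == value for key, value in criteria.items())
    ]


# Hypothetical entries; neither the keys nor the locations come from SIDR.
registry = [
    DatasetEntry({"technology": "DNA microarray"}, "central/ds-001", remote=False),
    DatasetEntry(
        {"technology": "Mass spectrometry"},
        "https://example.org/ds-002.raw",
        remote=True,
    ),
]
hits = search(registry, technology="Mass spectrometry")
print(hits)
```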

# Metadata for Search Engines: What can be learned from e-Sciences


Architecture

The metadata/data loading chain

Data resources produced using one or several technological domains were collected during visits to a selection of CNRS public laboratories.

A CNRS data curator collected and organized these resources (data and metadata) according to the corresponding domain standards within the ISA-TAB frame.

For example, data/metadata from genome-wide protein-DNA identification performed using chromatin immunoprecipitation (ChIP) followed by either microarray hybridization (ChIP-chip) or high-throughput sequencing (ChIP-seq) were made compliant with MIAME and MINSEQE, respectively, in the context of the global ISA frame. Metadata resources were then converted into the FuGE format according to the ISA2FuGE document and stored in a PostgreSQL database.

On-line search

Metadata searching was available at http://sidr-dr.inist.fr.
To anticipate performance issues when querying large datasets, a search engine was planned for eliciting and retrieving complex data. In the test version, only simple queries were allowed, even though information on parameters and experimental conditions was stored: the search engine retrieved datasets by browsing straightforward indexed terms, after which the metadata files could be downloaded in both ISA and FuGE formats. Thanks to the data model, complex queries such as "find all Metabolite profiling (MeasurementType) and Transcription profiling (MeasurementType) performed with any organism (Organism) treated for 12 hours (Factor) with Salicylic acid (Factor)" could be considered in the future. A selection of screenshots shows the SIDR home page, a ChIP-on-chip experiment and its corresponding workflow, which visualizes all successive steps with their input/output protocols.
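
The complex query quoted above could be expressed along the following lines. This is a sketch under stated assumptions, not the SIDR search engine: the `matches` function, the record layout and the factor names "duration" and "compound" are hypothetical, and the four records are made up for the example. Only the field names MeasurementType, Organism and Factor, and the query itself, come from the text.

```python
def matches(record, measurement_types, required_factors):
    """True if the record has a wanted measurement type and all factor values."""
    return record["MeasurementType"] in measurement_types and all(
        record["Factor"].get(name) == value
        for name, value in required_factors.items()
    )


# Made-up records for illustration; they do not describe actual SIDR content.
records = [
    {"id": "a1", "MeasurementType": "Metabolite profiling", "Organism": "Cucumis melo",
     "Factor": {"duration": "12 hours", "compound": "Salicylic acid"}},
    {"id": "a2", "MeasurementType": "Transcription profiling", "Organism": "Mus musculus",
     "Factor": {"duration": "12 hours", "compound": "Salicylic acid"}},
    {"id": "a3", "MeasurementType": "Transcription profiling", "Organism": "Homo sapiens",
     "Factor": {"duration": "24 hours", "compound": "Salicylic acid"}},
    {"id": "a4", "MeasurementType": "Cell counting", "Organism": "Escherichia coli",
     "Factor": {"duration": "12 hours", "compound": "Salicylic acid"}},
]

# Any organism is accepted, so Organism is simply not constrained.
hits = [
    record["id"]
    for record in records
    if matches(
        record,
        measurement_types={"Metabolite profiling", "Transcription profiling"},
        required_factors={"duration": "12 hours", "compound": "Salicylic acid"},
    )
]
print(hits)
```

Such a query depends on factors and measurement types being stored as structured fields rather than as free text, which is what the data model provides.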

Content

In the test version, the SIDR repository contains 520 data files related to 24 studies grouped into 13 investigations; the main characteristics are presented in the table below.

Main features of the metadata in SIDR

Feature type | Content
Organism | Homo sapiens, Mus musculus, Caenorhabditis elegans, Escherichia coli, Cucumis melo
Measurement type | Molecular structure, Molecular interaction, Cell counting, Transcription profiling, Transcription binding site identification, Protein-protein interaction, Metabolite profiling
Technology type | Mass spectrometry, NMR spectroscopy, DNA microarray, Flow cytometry, Nucleotide sequencing
Vocabularies/Ontologies | BTO, CHEBI, CL, EFO, FIX, GO, IMR, MI, MS, MSH, NCBITaxon, NCIt, OBI, PO, SO, SwissProt, SWO, UO, WBbt, WBls, WBPhenotype

Concluding remarks

During this work, we found that when researchers submitted metadata themselves, most of the time with insufficient knowledge of standards, annotation and ontologies, the result was discrepancies and inaccuracies in the metadata description that hampered the comparison and reuse of the datasets. This highlighted the importance of providing tools for organizing complex knowledge so that important data can later be retrieved and analyzed.