Data management pilot


Datasets collected in EMPHASIS installations are costly, labor-intensive and impossible to reproduce, because no experiment can replicate the environmental conditions of another. It is therefore essential that datasets can be re-analysed by a wide scientific community, and also by the original users themselves, since numerous datasets have been lost in many labs. Finally, 'open science' is required by most journals, public research organisations and funding institutions, and it involves the FAIR principles (datasets must be Findable, Accessible, Interoperable and Reusable).

Deploying an information system in each European local infrastructure is a necessary step towards 'open science'. Exchanging spreadsheets is not compatible with the 'findable' and 'accessible' principles, and is plagued by problems such as software obsolescence and inaccessible metadata (i.e. the information needed to re-analyse an experiment).

Before such a system can be deployed in a given local infrastructure, several data-organization steps are necessary. These steps, rather than the installation of software, are what limits deployment. For example, sensors, plants and plots need to be identified with persistent, unambiguous identifiers (e.g. URIs), in particular to trace their spatial positions and calibrations. Environmental variables need to be organized so that sensor outputs are related to time courses and/or spatial distributions of well-defined variables with unambiguous units, and mapped to ontologies.
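
To make these data-organization steps more concrete, here is a minimal Python sketch of how objects and variables might be described before they enter an information system. Everything in it is an assumption made for illustration: the namespace `http://example.org/my-installation`, the helper `make_uri`, the classes `Variable` and `Observation`, and the ontology term URI are all hypothetical and do not come from any EMPHASIS tool.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from urllib.parse import quote

# Hypothetical namespace for one local installation (placeholder domain).
BASE_URI = "http://example.org/my-installation"


def make_uri(kind: str, local_id: str) -> str:
    """Build an unambiguous identifier for a sensor, plant or plot."""
    return f"{BASE_URI}/{quote(kind)}/{quote(local_id)}"


@dataclass(frozen=True)
class Variable:
    """A well-defined variable with an explicit unit and ontology mapping."""
    name: str
    unit: str
    ontology_term: str  # URI of an ontology concept (placeholder below)


@dataclass(frozen=True)
class Observation:
    """One sensor output related to a variable and a point in time."""
    sensor_uri: str
    variable: Variable
    timestamp: datetime
    value: float


# Hypothetical variable definition; the ontology URI is a placeholder.
air_temperature = Variable(
    name="air temperature",
    unit="degree Celsius",
    ontology_term="http://example.org/ontology/air_temperature",
)

# The local label "thermo 07" becomes part of a URI-safe identifier.
sensor_uri = make_uri("sensor", "thermo 07")

obs = Observation(
    sensor_uri=sensor_uri,
    variable=air_temperature,
    timestamp=datetime(2024, 5, 1, 12, 0, tzinfo=timezone.utc),
    value=21.4,
)
print(obs.sensor_uri)
```

The point of the sketch is that each measurement carries its own context (which sensor, which variable, which unit, when), rather than relying on column names in a spreadsheet. A real deployment would use the identifier scheme and ontologies chosen by the installation.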

The data pilot helps users bridge these gaps. Training sessions and software tools help users visualize what needs to be done. The objective is not to impose solutions, but to help users formalize their usual practices into machine-readable elements of an information system.

Next, the pilot helps with installing an information system in local infrastructures (we discourage users from attempting to develop their own information system, in view of the workload this requires). Finally, existing information systems will be interconnected through an application currently under development so that they become interoperable, enabling a user to query several information systems at once.

The pilot works in close collaboration with the infrastructures ELIXIR (genomics, via the MIAPPE working group) and AGMIP (modelling), so that the datasets can be used for different purposes by different scientific communities.

Contact: François Tardieu (provisional, until a new contact person is recruited), francois.tardieu@inrae.fr
