Sharing research data: how?

Sharing data is becoming the gold standard in science. It enables others to reproduce your results and helps to prevent fraud and catch honest mistakes in data analysis. Moreover, it enables reuse of your data in new analyses, increasing the impact of your work.

Short guide: When to share what data?

  1. If data are completely anonymous, you can share them publicly in a dedicated repository or a datapaper, see option 1 or 2 under "Ways of sharing data"
  2. If data cannot be completely anonymized, they are personal. You need a legal basis to share these data:
    1. Informed consent: what you can do with the data depends on the contents of the consent form.
      • If participants consented to public data sharing and their data are not very sensitive (e.g., not from children or clinical groups), publish them in a repository or datapaper.
      • If participants consented to sharing with restrictions, use a repository that allows access restrictions or use a data use agreement to share data case by case.
      • If participants did not consent to any personal data sharing, share characteristics or aggregated data.
    2. Public interest: In theory, most research is publicly funded, and therefore we should be able to use this as a legal basis for data sharing. However, it is still unclear when we are allowed to use it. The minimal prerequisites are:
      • the personal data sharing should rely on the principles of lawfulness, fairness and transparency
      • informed consent was impossible to obtain, e.g., because the study took place a long time ago and consent cannot be obtained retroactively. Participants not consenting to data sharing is not a valid reason!
      When sharing personal data on the public interest basis, you are encouraged to share data with access restrictions, especially if your data are sensitive or highly identifiable (e.g., data from minors or clinical groups, or special categories of sensitive data).
    3. If you share data for a purpose similar to that of the original research project (such as collaborating with other researchers on a related topic), a data use agreement suffices (not strictly necessary for EUR collaborators, as they are from the same institution). Such an agreement should lay out the conditions for storing, sharing and publishing the data. This falls under the scope of processing that is "compatible with the original purpose", which does not require a new or separate legal basis (GDPR Articles 5(1)(b), 6(4) and 89(1)).

Ways of sharing data

There are roughly four ways to publish or share data:

1. Publish in a data repository

For example (or find one here):

In all cases, make your data FAIR and take privacy considerations into account.

2. Publish a datapaper

In a datapaper, you describe the data and the methods of collecting them, without the need to analyze them. This will get you a publication out of your data, irrespective of whether or not you publish results. It often requires that you make all described data public, because the aim of such publications is to provide access to high-quality datasets and to facilitate reuse. Also, most journals have a policy on which repository you should use to deposit the data accompanying the datapaper. Note that a datapaper is peer-reviewed just like a regular article. See this link for a list of data journals.

3. Share case-by-case

For data that cannot be shared publicly, you can sometimes still share the data case-by-case. This can be the case:

Please note that this is only a FAIR solution if your metadata and access options are publicly findable and available (e.g., consider creating a metadata-only record in a repository).

4. Share only characteristics of the data

If you do not want to or you can't share any real data, you can still make your data valuable:

Aggregated data

If your data are privacy-sensitive and you cannot share them, you can still share aggregated data, for example:
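As one illustration, the sketch below computes per-group summary statistics from participant-level records and withholds the statistics for groups that are too small. The records, the column names (age_group, score) and the minimum group size of 5 are all assumptions made up for this example; the appropriate threshold depends on your data and your institution's policy.

```python
from collections import defaultdict
from statistics import mean, stdev

MIN_CELL_SIZE = 5  # assumed threshold, not an official rule


def aggregate(records, group_key, value_key, min_cell_size=MIN_CELL_SIZE):
    """Return count, mean and SD per group, withholding small groups."""
    groups = defaultdict(list)
    for row in records:
        groups[row[group_key]].append(row[value_key])

    summary = {}
    for group, values in groups.items():
        if len(values) < min_cell_size:
            # Small groups may make individuals identifiable: report no statistics.
            summary[group] = {"n": f"<{min_cell_size}", "mean": None, "sd": None}
        else:
            summary[group] = {
                "n": len(values),
                "mean": round(mean(values), 2),
                "sd": round(stdev(values), 2),
            }
    return summary


# Hypothetical participant-level records (invented for illustration).
records = [{"age_group": "18-29", "score": s} for s in (12, 15, 14, 10, 13, 16)]
records += [{"age_group": "60+", "score": s} for s in (9, 11)]

summary = aggregate(records, "age_group", "score")
for group, stats in summary.items():
    print(group, stats)
```

Only the summary table would be shared; the participant-level records stay with you. Note that aggregation alone does not guarantee anonymity, so the result still needs to be checked before publication.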

Synthetic data

Creating a synthetic dataset can be useful to capture the statistical idiosyncrasies of your real dataset. This synthetic dataset can be used to reproduce the results of your analysis, without violating any privacy or intellectual property regulations. Read more:
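To give a rough idea of the principle, the sketch below fits a mean and standard deviation to each numeric column of a small dataset and then draws new, artificial rows from those distributions. This is a deliberately simplified assumption: it samples each column independently from a normal distribution, so it does not preserve correlations between variables, which dedicated synthetic data tools do try to model. The example rows and column names are invented.

```python
import random
from statistics import mean, stdev


def synthesize(rows, columns, n, seed=0):
    """Draw n artificial rows from per-column normal distributions."""
    rng = random.Random(seed)  # fixed seed so the example is repeatable

    # Estimate mean and SD for each column from the real data.
    params = {}
    for col in columns:
        values = [row[col] for row in rows]
        params[col] = (mean(values), stdev(values))

    # Sample every column independently (correlations are NOT preserved).
    return [
        {col: rng.gauss(mu, sigma) for col, (mu, sigma) in params.items()}
        for _ in range(n)
    ]


# Hypothetical real data (invented for illustration).
real_rows = [
    {"age": 23, "score": 12},
    {"age": 35, "score": 15},
    {"age": 41, "score": 14},
    {"age": 29, "score": 10},
    {"age": 52, "score": 13},
    {"age": 38, "score": 16},
]

synthetic = synthesize(real_rows, ["age", "score"], n=1000)
print(len(synthetic), "synthetic rows, first row:", synthetic[0])
```

Whether a synthetic dataset is safe to publish still depends on how closely it mirrors individual records, so it is worth checking it for disclosure risk before sharing.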

Federated learning

Federated learning explained. Source: Sheller et al., 2020

Federated learning arises from the field of Artificial Intelligence and relies “on the principle of remote execution—that is, distributing copies of a machine learning algorithm to the sites or devices where the data is kept (nodes), performing training iterations locally, and returning the results of the computation (for example, updated neural network weights) to a central repository to update the main algorithm.” (Kaissis et al., 2020). This means that you do not move your data, while still providing valuable information about it.
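The principle in the quote can be sketched in a few lines. The example below assumes a simple averaging scheme (in the style of federated averaging) and a one-parameter linear model; the nodes, their data and all function names are hypothetical and chosen only to show that weights, not data, travel between the sites and the central server.

```python
def local_update(weight, data, lr=0.01, epochs=5):
    """Train on one node's private data; the data never leave this function."""
    for _ in range(epochs):
        # Gradient of the mean squared error for the model y = weight * x.
        grad = sum(2 * (weight * x - y) * x for x, y in data) / len(data)
        weight -= lr * grad
    return weight


def federated_round(global_weight, nodes):
    """One round: every node trains locally and returns only its updated weight."""
    updates = [(local_update(global_weight, data), len(data)) for data in nodes]
    total = sum(n for _, n in updates)
    # The server averages the weights, weighted by each node's sample count.
    return sum(w * n for w, n in updates) / total


# Hypothetical nodes, each holding (x, y) pairs that follow y = 3x.
nodes = [
    [(1.0, 3.0), (2.0, 6.0)],
    [(3.0, 9.0), (4.0, 12.0), (5.0, 15.0)],
]

weight = 0.0  # initial state of the central model
for _ in range(50):
    weight = federated_round(weight, nodes)

print("learned weight:", weight)
```

The central model ends up close to the true slope of 3 even though the server only ever sees weights. Real federated learning frameworks add secure communication, authentication and often further privacy protections on top of this basic loop.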

Some federated learning tools and projects:

Licensing data

With a license, you specify what others are permitted to do with your product. You can see it as a kind of agreement: if someone violates the license, you have the right to sue them, just as with any other legal agreement. For anonymous data, it is recommended to choose a CC0 (public domain) or CC-BY 4.0 license. Both are open licenses that allow others to reuse the data freely; CC-BY additionally requires attribution. For non-anonymous data, use a more restrictive license (but please don't use no-derivatives (ND) or non-commercial (NC) licenses, read why here) or formulate your own terms of use, for example in a data use agreement.

Don't know which license to choose? Use a license selector!

Resources