Depositing your data into MaveDB

Creating a complete entry in MaveDB requires several pieces of data and metadata. This document includes a checklist of what is required to deposit a study and a description of the required metadata.

For more information on how a dataset in MaveDB is structured, including descriptions of experiment sets, experiments, and score sets, please see Record types.

Metadata formatting

Experiment and score set records contain several different types of required and optional metadata, either free text or accession numbers for other databases. These elements are described in this section.

Free text metadata

Experiments and score sets both have descriptive free text fields. These are the title, short description, abstract, and methods.

The title and short description are plain text. The abstract and methods support Markdown formatting with embedded equations using MathML, converted using Pandoc.

The title is displayed at the top of the record page, and should be quite brief.

The short description is displayed in the search results table and should summarize the entry at a high level in one or two sentences.

The abstract should describe the motivation and approach for the dataset. Some MaveDB abstracts include a summary of the results of the related publications, but many do not. The entry describes the MAVE data rather than a full study, so the submitter should use their judgement when deciding which details are most relevant. Experiments and score sets from the same study commonly share the same abstract text.

The methods section should describe the approach in a condensed form, suitable for a specialist audience of MAVE researchers.

For an experiment the methods section should include:

  • Variant library construction methods

  • Description of the functional assay, including model system and selection type

  • Sequencing strategy and sequencing technology

  • Structure of biological or technical replicates (if applicable)

For a score set the methods section should include:

  • Sequence read filtering approach

  • Description of the statistical model for converting counts to scores, including normalization

  • Description of additional data columns included in the score or count tables, including column naming conventions

  • Details of how replicates were combined (if applicable)

For a meta-analysis score set the methods section should include:

  • Description of the statistical model for converting the linked scores or counts into the scores presented

  • Description of additional data columns included in the score or count tables, including column naming conventions

Score sets can also include an optional free-text data usage policy intended for unpublished data. For example, data producers may wish to assert their right to publish the results of certain analyses first.

Publication details

Publications can be included by entering their PubMed IDs, and they will appear as formatted references. Publications included in an experiment will also be displayed on the pages of its associated score sets.

Preprints or publications that are not indexed by PubMed can be included via the DOI field. Improved support for preprints (including displaying them as formatted references) is planned for a future release.
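As an illustration of the two identifier types described above, the following minimal sketch sorts a submitted string into a PubMed ID or a DOI before it is entered in the matching field. The function name and both patterns are assumptions made for this example; they are not MaveDB's own validation rules.

```python
import re

# Assumption: a PubMed ID is a positive integer written with digits only.
PMID_PATTERN = re.compile(r"^[1-9]\d*$")
# Assumption: a DOI starts with "10.", then a 4-9 digit registrant code,
# a slash, and a non-empty suffix without whitespace.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")


def classify_identifier(text: str) -> str:
    """Return "pmid", "doi", or "unknown" for a publication identifier."""
    value = text.strip()
    if PMID_PATTERN.match(value):
        return "pmid"
    if DOI_PATTERN.match(value):
        return "doi"
    return "unknown"


# Hypothetical identifiers, used only to show the three outcomes.
examples = ["12345678", "10.1234/example.5678", "not-an-id"]
kinds = [classify_identifier(item) for item in examples]
```

A check like this only confirms the shape of the identifier, not that the publication exists.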

Raw data accessions

Experimenters are encouraged to deposit their raw sequence data in a public repository and link it to the relevant experiment record(s).

MaveDB currently supports accession numbers for:

Raw data that is stored elsewhere can be included via the DOI field.

Keywords

Experiments and score sets can be tagged with optional, user-specified keywords.

In a future release, the keyword vocabulary will become restricted and keyword selection will be mandatory. This will make it easier for data modellers to select appropriate MAVE datasets for their studies, and will also allow more sophisticated tracking of the kinds of data being generated by researchers.

Data formatting

Score sets require detailed information about the target, including the target sequence itself, as well as a CSV-formatted file containing the variant scores (and optionally a second CSV-formatted file containing the variant counts).

For more information including how to prepare your data for submission, see Target sequence information and Data table formats.
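As a rough sketch of what preparing a CSV-formatted score file might look like, the example below writes two hypothetical variants with Python's standard `csv` module. The column names (`hgvs_nt`, `hgvs_pro`, `score`) and the variant strings are assumptions made for illustration; the authoritative column requirements are those given in Data table formats.

```python
import csv
import io

# Assumed column names for illustration; confirm them in Data table formats.
COLUMNS = ["hgvs_nt", "hgvs_pro", "score"]

# Hypothetical variants and scores, not taken from any real dataset.
rows = [
    {"hgvs_nt": "c.1A>G", "hgvs_pro": "p.Met1Val", "score": -1.25},
    {"hgvs_nt": "c.4G>A", "hgvs_pro": "p.Ala2Thr", "score": 0.1},
]


def write_score_table(records, handle):
    """Write one header row followed by one row per variant."""
    writer = csv.DictWriter(handle, fieldnames=COLUMNS)
    writer.writeheader()
    writer.writerows(records)


# An in-memory buffer stands in for a file opened with newline="".
buffer = io.StringIO()
write_score_table(rows, buffer)
csv_text = buffer.getvalue()
```

A count table could be written the same way with count columns in place of `score`.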

Score set data must be given a license chosen from those listed on the Data licensing page.

Optional structured metadata

Score sets also support optional JSON-formatted metadata. This can be used to describe features, such as genomic coordinates for a target sequence or score cutoff ranges, that the uploader would like to be more easily machine-readable than they would be in free text.

If optional metadata is included, the uploader should describe it in the score set methods.
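The sketch below shows one way such metadata could be assembled and serialized with Python's standard `json` module. Every key name and value here is hypothetical; this document does not prescribe a schema, which is why the uploader is asked to describe the structure in the score set methods.

```python
import json

# Hypothetical structure: the key names and values are assumptions for illustration.
extra_metadata = {
    "genomic_coordinates": {
        "assembly": "GRCh38",
        "chromosome": "chr1",
        "start": 1000,
        "end": 1300,
    },
    "score_ranges": [
        # None serializes to JSON null, used here to mean "no lower bound".
        {"label": "loss of function", "min": None, "max": -1.0},
        {"label": "wild-type-like", "min": -1.0, "max": 1.0},
    ],
}

# Serialize with stable key order so repeated uploads produce identical text.
metadata_json = json.dumps(extra_metadata, indent=2, sort_keys=True)

# Parsing the text back confirms that it is valid JSON.
round_trip = json.loads(metadata_json)
```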

Required information checklist

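For each experiment and score set:

  • Title

  • Short description

  • Abstract

  • Methods

  • Keywords (optional)

  • PubMed IDs or DOIs for related publications (if applicable)

For each experiment you will also want:

  • Raw data accessions or DOIs (if available)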

For each score set you will also want:

  • Target information
    • Nucleotide sequence for the target

    • The sequence type (coding, regulatory, other non-coding)

    • Organism the sequence is derived from (if applicable)

    • UniProt ID (if applicable)

    • RefSeq ID (if applicable)

    • Ensembl ID (if applicable)

  • Variant score table

  • Variant count table (if available)

  • Choice of data license (see Data licensing)

  • Data usage policy text (if needed)
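Before starting a submission, the score set items above can be checked with a small script. The following is a minimal sketch: the class, field names, and the rule that a target sequence contains only the letters A, C, G, and T are assumptions for this example and do not reflect MaveDB's actual validation.

```python
from dataclasses import dataclass
from typing import List, Optional

# The three sequence types named in the checklist.
SEQUENCE_TYPES = {"coding", "regulatory", "other non-coding"}


@dataclass
class ScoreSetDraft:
    """Hypothetical container for the score set checklist items."""
    target_sequence: str = ""
    sequence_type: str = ""
    score_table_path: str = ""
    license_name: str = ""
    # Items the checklist marks "if applicable", "if available", or "if needed".
    organism: Optional[str] = None
    uniprot_id: Optional[str] = None
    refseq_id: Optional[str] = None
    ensembl_id: Optional[str] = None
    count_table_path: Optional[str] = None
    data_usage_policy: Optional[str] = None


def missing_items(draft: ScoreSetDraft) -> List[str]:
    """List the required checklist items that are absent or malformed."""
    problems = []
    sequence = draft.target_sequence.upper()
    # Assumption: only unambiguous nucleotide letters are accepted.
    if not sequence or set(sequence) - set("ACGT"):
        problems.append("nucleotide sequence for the target")
    if draft.sequence_type not in SEQUENCE_TYPES:
        problems.append("sequence type")
    if not draft.score_table_path:
        problems.append("variant score table")
    if not draft.license_name:
        problems.append("choice of data license")
    return problems


# Hypothetical draft with a placeholder license name and file path.
draft = ScoreSetDraft(
    target_sequence="ATGGCTTAA",
    sequence_type="coding",
    score_table_path="scores.csv",
    license_name="example license",
)
remaining = missing_items(draft)
```

An empty list means every required item is present; optional items are stored but never reported as missing.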