Data organization

MaveDB has three dataset types: Score Set, Experiment, and Experiment Set. All score, count, and target information is stored in score sets. Experiment sets and experiments are used for organization and metadata.

The dataset types are organized hierarchically. Each experiment set can contain multiple experiments and each experiment can contain multiple score sets.

Score sets

Score sets contain variant effect scores, counts, and associated metadata generated by analyzing data from a single functional assay and its biological or technical replicates. MaveDB can contain multiple associated score sets that provide different representations of the same data. For example, scores for nucleotide-level variants and amino acid-level variants for the same underlying assay data would be included as different score sets.

Score information is uploaded in .csv format. The only two columns required in a score set are the variant identifiers (in HGVS format) and the scores (as floating point values). The database supports any number of additional numeric columns with arbitrary names. These additional columns are intended to store scoring method-specific values for each variant, such as the variance or model parameters, and should be described in the score set method description.
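As a rough illustration of the minimum layout described above, the following Python sketch checks that a score file has a variant identifier column and a score column, and that every other column is numeric. The column names hgvs_nt and score, the function name, and the example rows are assumptions made for this sketch; the names MaveDB actually expects may differ.

```python
import csv
import io

# Hypothetical column names; the names MaveDB expects may differ.
VARIANT_COLUMN = "hgvs_nt"
SCORE_COLUMN = "score"


def check_score_file(handle):
    """Return a list of problems found in a score CSV (empty if none)."""
    reader = csv.DictReader(handle)
    header = reader.fieldnames or []
    problems = [
        f"missing required column: {name}"
        for name in (VARIANT_COLUMN, SCORE_COLUMN)
        if name not in header
    ]
    if problems:
        return problems
    # Data rows start on line 2, after the header.
    for line_number, row in enumerate(reader, start=2):
        if not row[VARIANT_COLUMN]:
            problems.append(f"line {line_number}: empty variant identifier")
        # Every column other than the variant identifier must be numeric.
        for name, value in row.items():
            if name == VARIANT_COLUMN:
                continue
            try:
                float(value)
            except (TypeError, ValueError):
                problems.append(
                    f"line {line_number}: {name} is not numeric: {value!r}"
                )
    return problems


# Made-up example rows with one additional numeric column.
example = io.StringIO(
    "hgvs_nt,score,variance\n"
    "c.1A>G,0.52,0.01\n"
    "c.2T>C,-1.30,0.04\n"
)
print(check_score_file(example))
```

The check only looks at file shape. It does not validate HGVS syntax, which would need a dedicated parser.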

Counts can be uploaded as a separate .csv file with arbitrarily named columns. The column naming convention for the counts should be described in the score set method description.

The coordinates in the variant identifiers are always listed with respect to the given target sequence. That is, position 1 in the HGVS string should correspond to the first base in the listed target. Mappings between the listed target sequence and full-length sequences can be specified using the extra target information accessions and corresponding offset values. Arbitrary genomic coordinates are not currently supported as variant identifiers, although they can be included as extra metadata. Support for them is planned for a future version.
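The relationship between target-relative positions and a full-length sequence can be sketched as a simple shift. The function below assumes that the offset is the number of positions that precede the target's first base in the full-length sequence. That interpretation, the function name, and the example numbers are assumptions of this sketch, not a statement of how MaveDB applies offsets.

```python
def to_full_length_position(target_position: int, offset: int) -> int:
    """Map a 1-based target position to a 1-based full-length position.

    Assumes `offset` counts the positions before the target's first base.
    """
    if target_position < 1:
        raise ValueError("positions in the target sequence are 1-based")
    if offset < 0:
        raise ValueError("offset must not be negative")
    return target_position + offset


# The first base of a target that begins 120 positions into the
# full-length sequence (made-up numbers).
print(to_full_length_position(1, 120))  # 121
```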

Required information

  • Title
  • Contributor names
  • Target information
    • Name
    • Species
    • Reference genome and assembly (if applicable)
    • DNA sequence
  • Variant HGVS identifiers (uploaded with score file)
  • Variant scores (uploaded with score file)

Optional information

  • Short description
  • Abstract (Markdown and MathJax supported)
  • Method details (Markdown and MathJax supported)
  • Keywords
  • Target information
    • UniProt accession
    • RefSeq accession
    • Ensembl accession
  • DOIs for additional data sources
  • Publication references (as PubMed IDs)
  • Variant counts (uploaded with counts file)
  • Additional variant scoring columns (uploaded with score file)

Score sets with a UniProt identifier will show a banner that links out to MaveVis, which visualizes variant effect maps and structure information.

Experiments

Experiments describe a single set of assay data. The experiment can include one or more replicate assays, but the assay conditions for all data described by the experiment should be the same. This entry describes the raw data that is used as input to a scoring method to generate a score set. Each experiment can have multiple associated score sets.

Required information

  • Title
  • Contributor names

Optional information

  • Short description
  • Abstract (Markdown and MathJax supported)
  • Method details (Markdown and MathJax supported)
  • Keywords
  • DOIs for additional data sources
  • SRA accessions for raw sequence data
  • Publication references (as PubMed IDs)

Experiment sets

Experiment sets are used to coordinate related experiments. For example, if a single target is assayed under multiple conditions, these would be combined into a single experiment set. All experiments in the same set usually share the same target, but sometimes a single experiment set will contain all datasets for a single publication. Experiment sets contain no data or metadata.

Accession number formats

MaveDB accession numbers use the URN (Uniform Resource Name) format. These accession numbers have a hierarchical structure that reflects the relationship between experiment sets, experiments, score sets, and individual variants in MaveDB.
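This section does not spell out the URN layout, so the pattern in the sketch below is an assumption made for illustration: an eight-digit experiment set number, a lowercase letter suffix for the experiment, a numeric suffix for the score set, and a variant number after a "#". The function name is also hypothetical. Under that assumption, the level of an accession can be read from how many parts it has.

```python
import re

# Assumed layout (not confirmed by this section):
#   urn:mavedb:00000001        experiment set
#   urn:mavedb:00000001-a      experiment
#   urn:mavedb:00000001-a-1    score set
#   urn:mavedb:00000001-a-1#5  variant
URN_PATTERN = re.compile(
    r"^urn:mavedb:(?P<experiment_set>\d{8})"
    r"(?:-(?P<experiment>[a-z]+)"
    r"(?:-(?P<score_set>\d+)"
    r"(?:#(?P<variant>\d+))?)?)?$"
)


def classify_urn(urn: str) -> str:
    """Return the most specific level that the accession identifies."""
    match = URN_PATTERN.match(urn)
    if match is None:
        raise ValueError(f"not a recognized accession: {urn!r}")
    parts = match.groupdict()
    # Check from the most specific level down to the least specific.
    for level in ("variant", "score_set", "experiment"):
        if parts[level] is not None:
            return level
    return "experiment_set"


for urn in ("urn:mavedb:00000001", "urn:mavedb:00000001-a-1"):
    print(urn, "->", classify_urn(urn))
```

Because each level extends the one above it, the parent accession of any entry can be recovered by dropping the last part.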

Permissions and editability

When created, a score set or experiment is private and given a temporary accession number. Private datasets can be viewed or edited only by contributors to that dataset.

After the required information has been provided, a contributor can publish the entry, at which point it becomes publicly viewable. Upon publication, most required fields can no longer be modified, which keeps the database stable over time. Contributors can still modify optional fields, such as the abstract and method details, after publication. These changes are tracked.

Contributor types

Contributors are tracked and authenticated using ORCID. There are three roles for each dataset: Viewer, Editor, and Administrator. A viewer can access the entry while it is private. An editor can change any of the data or metadata, with the exception of adding or removing contributors. An administrator has all of the editor's permissions and can also change the contributor list, change the data license, and publish score sets. Each entry must have at least one administrator at all times.

All types of contributors appear in the contributor list for a given dataset and they are not distinguished visually.
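The role descriptions above can be summarized as a small permission table. The sketch below is illustrative only: the action names, the function names, and the placeholder ORCID iDs are hypothetical and do not reflect MaveDB's internal implementation.

```python
from enum import Enum


class Role(Enum):
    VIEWER = "viewer"
    EDITOR = "editor"
    ADMINISTRATOR = "administrator"


# Hypothetical action names that mirror the prose description of each role.
PERMISSIONS = {
    Role.VIEWER: {"view"},
    Role.EDITOR: {"view", "edit"},
    Role.ADMINISTRATOR: {
        "view",
        "edit",
        "manage_contributors",
        "change_license",
        "publish",
    },
}


def can(role: Role, action: str) -> bool:
    """Return True if the role is allowed to perform the action."""
    return action in PERMISSIONS[role]


def has_administrator(contributors: dict) -> bool:
    """Check the rule that every entry keeps at least one administrator."""
    return Role.ADMINISTRATOR in contributors.values()


# Placeholder ORCID iDs mapped to roles.
contributors = {
    "0000-0000-0000-0001": Role.ADMINISTRATOR,
    "0000-0000-0000-0002": Role.EDITOR,
}
print(can(Role.EDITOR, "manage_contributors"))
print(has_administrator(contributors))
```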

Licensing

When uploading score set information to the database, the user can choose one of three licenses.

By default, new score sets will have the non-commercial license. The license can be changed after publication, but previously downloaded copies of the dataset will retain the old license. The license will be listed in the header of any downloaded data files.

Note that the default non-commercial license is compatible with commercial relicensing by the data owners.

Users also have the option of adding a data usage policy to a score set, such as terms that dictate the use of pre-publication data. The data usage policy will be added to the header of any downloaded data files if one is present.

API

MaveDB implements a GET API that can access all fields of published datasets. To explore the API, visit https://www.mavedb.org/api/
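A request against the API might look like the following Python sketch, which uses only the standard library. Only the root URL comes from the text above. The resource name "scoresets", the path layout, the example accession, and the assumption that responses are JSON are all hypothetical, so check the API page for the real endpoints before relying on them.

```python
import json
import urllib.parse
import urllib.request

API_ROOT = "https://www.mavedb.org/api/"  # root URL given in the text


def build_url(resource, accession=None):
    """Build an API URL. The resource names and path layout are assumed."""
    path = resource.strip("/") + "/"
    if accession is not None:
        # Keep ":" readable; other reserved characters are percent-encoded.
        path += urllib.parse.quote(accession, safe=":") + "/"
    return API_ROOT + path


def fetch_json(url, timeout=30):
    """Send a GET request and decode the body, assuming it is JSON."""
    request = urllib.request.Request(url, headers={"Accept": "application/json"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


# Building the URL needs no network access; call fetch_json(url) to send it.
url = build_url("scoresets", "urn:mavedb:00000001-a-1")
print(url)
```

Only published datasets are exposed, so a request for a private entry should not be expected to return data.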