Data table formats
MaveDB accepts and provides data tables in CSV format. Each row of the data table describes a single variant, and variants are described using MAVE-HGVS format. All other columns are expected to be floating point values.
Variant columns
For both score and count data tables, there are three variant columns:
- hgvs_nt describes variants with respect to the nucleotide target sequence
- hgvs_splice describes variants with respect to a transcript model
- hgvs_pro describes variants with respect to the amino acid target sequence
hgvs_nt and hgvs_pro variants are required to be described in relation to the score set target.
All variants must be described in MAVE-HGVS format, which is a subset of the HGVS format.
This includes using the appropriate prefix (c. for coding DNA, n. for non-coding DNA, p. for protein) and
using brackets for complex variants that affect multiple positions (e.g. c.[123A>G;125T>C]).
If hgvs_nt is present, it will be used as the primary key for distinguishing variants and must be unique.
Otherwise, hgvs_pro will be used as the primary key.
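The primary key rule above can be sketched in a few lines of Python. This is only an illustration, not MaveDB's validator: the helper names are hypothetical, and it assumes that a column counts as "present" when at least one row has a value other than NA or an empty cell. The third row of the sample table is a deliberately repeated variant.

```python
import csv
import io

MISSING = {"", "NA"}  # assumption: cells written as NA (or left empty) hold no value


def pick_primary_key(rows):
    """Return 'hgvs_nt' if any row has a value there, otherwise 'hgvs_pro'."""
    for column in ("hgvs_nt", "hgvs_pro"):
        if any((row.get(column) or "") not in MISSING for row in rows):
            return column
    raise ValueError("no hgvs_nt or hgvs_pro values found")


def duplicate_keys(csv_text):
    """Return the primary key column and a sorted list of repeated key values."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    key = pick_primary_key(rows)
    seen, repeated = set(), set()
    for row in rows:
        value = row.get(key) or ""
        if value in MISSING:
            continue  # missing keys are a separate problem, not a duplicate
        if value in seen:
            repeated.add(value)
        seen.add(value)
    return key, sorted(repeated)


table = """hgvs_nt,hgvs_pro,score
c.38T>C,p.Val13Ala,-0.128
c.186A>T,p.Leu62Phe,-4.132
c.38T>C,p.Val13Ala,0.283
"""
print(duplicate_keys(table))  # ('hgvs_nt', ['c.38T>C'])
```

Running a check like this before upload catches repeated variants early; whether MaveDB reports them in the same way is not specified here.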
Note
Datasets with only hgvs_pro variants can specify nucleotide target sequences.
The target sequence will be translated using the
standard amino acid translation table for validation.
The hgvs_splice variants are not validated against a transcript model or target sequence.
This is a convenience feature for datasets that contain splice variants; most datasets will not use hgvs_splice.
Datasets that use hgvs_splice must also have hgvs_nt, which is used as the primary key,
and the hgvs_nt variants must use the 'g.' prefix.
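The hgvs_splice requirements can be checked with a short Python sketch along these lines. The function name and the example rows are hypothetical, and the sketch reads the rule as "every row needs an hgvs_nt value once any row uses hgvs_splice", which is an interpretation rather than a statement of how MaveDB validates uploads.

```python
MISSING = {"", "NA"}  # assumption: NA or an empty cell means no value


def check_splice_rule(rows):
    """Return a list of problems for tables that use hgvs_splice."""
    uses_splice = any((row.get("hgvs_splice") or "") not in MISSING for row in rows)
    if not uses_splice:
        return []  # the rule only applies when hgvs_splice is used
    problems = []
    for number, row in enumerate(rows, start=1):
        nt = row.get("hgvs_nt") or ""
        if nt in MISSING:
            problems.append(f"row {number}: hgvs_nt is required with hgvs_splice")
        # Ignore any 'prefix:' qualifier before testing the coordinate prefix.
        elif not nt.rsplit(":", 1)[-1].startswith("g."):
            problems.append(f"row {number}: hgvs_nt must use the g. prefix")
    return problems


# Hypothetical rows, for illustration only.
rows = [
    {"hgvs_nt": "g.100A>G", "hgvs_splice": "c.10A>G", "hgvs_pro": "p.Lys4Glu"},
    {"hgvs_nt": "c.12T>C", "hgvs_splice": "c.12T>C", "hgvs_pro": "NA"},
]
print(check_splice_rule(rows))
```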
Fully-qualified variants
‘Fully-qualified variants’ are variants submitted with complete information about the transcript or sequence they are described against.
For accession-based targets, this means variants are submitted with respect to a specific RefSeq or Ensembl transcript (e.g. NM_000546.5:c.215C>G).
When submitting accession-based targets, MaveDB requires fully-qualified variants to ensure unambiguous mapping to the target sequence.
For sequence-based targets, fully-qualified variants include the full sequence context of the variant. When you are submitting variants
with respect to a single sequence-based target, your variants need not be fully-qualified. However, if you are submitting variants
with respect to multiple sequence-based targets, MaveDB requires fully-qualified variants to ensure unambiguous mapping to the correct target sequence.
In that case, use the target label as the prefix for each variant (e.g. target1:c.215C>G).
For more information on targets, see the targets section.
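As a rough illustration of the prefix convention, the Python sketch below splits a fully-qualified variant into its prefix and variant parts and lists variants whose prefix is not one of the expected target labels. The function names and the target labels are hypothetical; this is not how MaveDB parses variants internally.

```python
def split_qualified(variant):
    """Split 'prefix:variant' into (prefix, variant); prefix is None when absent."""
    prefix, separator, rest = variant.rpartition(":")
    if not separator:
        return None, variant
    return prefix, rest


def unqualified_variants(variants, target_labels):
    """Return the variants whose prefix is missing or not a known target label."""
    return [v for v in variants if split_qualified(v)[0] not in target_labels]


print(split_qualified("NM_000546.5:c.215C>G"))  # ('NM_000546.5', 'c.215C>G')
print(split_qualified("c.215C>G"))              # (None, 'c.215C>G')

# Hypothetical score set with two sequence-based targets.
labels = {"target1", "target2"}
print(unqualified_variants(["target1:c.215C>G", "c.10A>T", "target3:c.1A>G"], labels))
```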
Base editor data
If you are submitting data from a base editor experiment, you must also include a guide_sequence column in your score and/or count data table.
This column should contain the guide RNA sequence used to generate each variant. For base editor datasets it replaces the
hgvs_nt column as the primary key for variants, so its values must be unique.
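A pre-upload check for the guide_sequence column might look like the following Python sketch. The function name and the guide sequences are made up for illustration, and treating NA or an empty cell as a missing guide is an assumption.

```python
from collections import Counter

MISSING = {"", "NA"}  # assumption: NA or an empty cell means no value


def check_guide_sequences(rows):
    """Return problems with the guide_sequence column: empty or repeated values."""
    guides = [(row.get("guide_sequence") or "").strip() for row in rows]
    problems = [
        f"row {number}: guide_sequence is empty"
        for number, guide in enumerate(guides, start=1)
        if guide in MISSING
    ]
    counts = Counter(guide for guide in guides if guide not in MISSING)
    for guide, count in sorted(counts.items()):
        if count > 1:
            problems.append(f"guide_sequence {guide} appears {count} times")
    return problems


# Hypothetical guide sequences, for illustration only.
rows = [
    {"guide_sequence": "ACGTACGTACGTACGTACGT", "score": "0.5"},
    {"guide_sequence": "ACGTACGTACGTACGTACGT", "score": "0.7"},
    {"guide_sequence": "", "score": "1.2"},
]
print(check_guide_sequences(rows))
```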
Score table columns
All score tables must have a column named score that describes the score of that variant in the assay.
Score tables may have any number of additional numeric columns.
Suggested numeric columns include a standard deviation or variance,
or some other measure of uncertainty for the score such as a 95% confidence interval
(represented as two columns, e.g. ci_lower and ci_upper).
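The column requirements for a score table can be sketched as a small Python check. It assumes that NA or an empty cell marks a missing value and that the variant columns (plus guide_sequence for base editor data) are the only non-numeric columns; the function name is hypothetical and the sketch is not MaveDB's own validation code.

```python
import csv
import io

KEY_COLUMNS = {"hgvs_nt", "hgvs_splice", "hgvs_pro", "guide_sequence"}
MISSING = {"", "NA"}  # assumption: NA or an empty cell means no value


def check_score_table(csv_text):
    """Return problems: a missing score column or non-numeric data cells."""
    reader = csv.DictReader(io.StringIO(csv_text))
    fieldnames = reader.fieldnames or []
    problems = []
    if "score" not in fieldnames:
        problems.append("missing required column: score")
    # The header is line 1, so the first data row is line 2.
    for line_number, row in enumerate(reader, start=2):
        for column in fieldnames:
            value = row[column]
            if column in KEY_COLUMNS or value is None or value in MISSING:
                continue
            try:
                float(value)
            except ValueError:
                problems.append(
                    f"line {line_number}, column {column}: {value!r} is not numeric"
                )
    return problems


good = "hgvs_nt,hgvs_pro,score,SE\nc.38T>C,p.Val13Ala,-0.128,0.115\nc.285C>G,p.Asp95Glu,NA,NA\n"
print(check_score_table(good))  # []
```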
Warning
Do not submit patient data or anything which could be used to identify individuals to MaveDB.
Score sets that describe experiments with multiple replicates often include the score and standard deviation for each replicate as additional columns.
For datasets with categorical data, we recommend encoding the categories as integers and describing the mapping between integers and categories in the free text methods. Support for additional data columns with string data will be added in a future version to support this use case.
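One way to follow the integer-encoding recommendation is sketched below in Python. The category labels are hypothetical, and assigning codes in alphabetical order is just one possible choice; what matters is that the same mapping appears in the free text methods.

```python
def encode_categories(labels):
    """Map each distinct label to an integer code, in alphabetical order."""
    codes = {label: code for code, label in enumerate(sorted(set(labels)))}
    return [codes[label] for label in labels], codes


def describe_mapping(codes):
    """Format the mapping as a sentence fragment for the free text methods."""
    ordered = sorted(codes.items(), key=lambda item: item[1])
    return "; ".join(f"{code} = {label}" for label, code in ordered)


# Hypothetical categories, for illustration only.
labels = ["normal", "abnormal", "normal", "intermediate"]
encoded, codes = encode_categories(labels)
print(encoded)                  # [2, 0, 2, 1]
print(describe_mapping(codes))  # 0 = abnormal; 1 = intermediate; 2 = normal
```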
Here is a short excerpt from the score table for urn:mavedb:00000003-a-1.
This dataset uses hgvs_nt as the primary variant key.
It has several additional data columns with the scores and error estimates for multiple biological replicates.
Note that some variants do not have a score. This is permitted (and encouraged) as long as there is some data provided for that variant, such as a score in an individual replicate or some counts.
| hgvs_nt | hgvs_splice | hgvs_pro | score | SE | epsilon | SE_PlusE2NewRep3 | score_PlusE2NewRep3 | SE_PlusE2NewRep4 | score_PlusE2NewRep4 | SE_PlusE2NewRep5 | score_PlusE2NewRep5 | SE_PlusE2Rep3 | score_PlusE2Rep3 | SE_PlusE2Rep4 | score_PlusE2Rep4 | SE_PlusE2Rep5 | score_PlusE2Rep5 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| c.38T>C | NA | p.Val13Ala | -0.128 | 0.115 | 0.000 | 0.148 | 0.283 | 0.162 | -0.456 | 0.075 | -0.186 | 0.167 | -0.165 | 0.289 | -0.073 | 0.388 | -0.184 |
| c.186A>T | NA | p.Leu62Phe | -4.132 | 0.396 | 0.000 | 0.289 | -3.752 | 0.433 | -4.166 | 0.549 | -3.456 | 0.341 | -3.166 | 0.499 | -6.079 | 0.204 | -4.309 |
| c.164A>T | NA | p.Lys55Ile | -0.655 | 0.112 | 0.000 | 0.100 | -0.592 | 0.121 | -0.929 | 0.086 | -0.254 | 0.143 | -0.695 | 0.039 | -0.950 | 0.080 | -0.514 |
| c.[439C>A;441G>A;842C>A] | NA | p.[Gln147Lys;Ser281Ter] | NA | NA | NA | 0.668 | -1.612 | NA | NA | NA | NA | NA | NA | NA | NA | 0.464 | -1.273 |
| c.22_23delinsCC | NA | p.Glu8Pro | -0.375 | 0.280 | 0.000 | 0.158 | -1.421 | 0.240 | -0.265 | 0.200 | -0.796 | 0.192 | -0.022 | 0.311 | -0.232 | 0.091 | 0.476 |
| c.598G>A | NA | p.Asp200Asn | 0.271 | 0.170 | 0.000 | 0.103 | -0.337 | 0.094 | 0.830 | 0.092 | 0.408 | 0.163 | 0.051 | 0.243 | 0.278 | 0.172 | 0.382 |
| c.285C>G | NA | p.Asp95Glu | NA | NA | NA | 0.401 | -3.993 | 0.150 | -3.380 | NA | NA | 0.452 | -3.221 | 0.228 | -1.973 | 0.277 | -1.774 |
| c.[64G>C;142C>T] | NA | p.Glu22Gln | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| c.869T>G | NA | p.Leu290Ter | -1.231 | 0.245 | 0.000 | 0.117 | -0.456 | 0.285 | -0.792 | 0.424 | -0.993 | 0.084 | -1.909 | 0.143 | -1.325 | 0.093 | -1.735 |
| c.200T>G | NA | p.Ile67Arg | NA | NA | NA | 0.255 | -3.600 | 0.152 | -3.399 | NA | NA | 0.171 | -4.097 | 0.361 | -3.764 | NA | NA |
| c.[1G>T;97_99delinsGGG] | NA | p.[Asp1Tyr;Pro33Gly] | NA | NA | NA | 0.159 | -0.177 | 0.445 | -0.583 | 0.500 | -0.323 | 0.537 | -0.470 | NA | NA | 0.284 | 0.188 |
| c.476G>T | NA | p.Gly159Val | -1.192 | 0.100 | 0.000 | 0.141 | -1.050 | 0.079 | -1.557 | 0.030 | -0.969 | 0.114 | -1.030 | 0.126 | -1.264 | 0.168 | -1.303 |
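To see which variants in a score table carry no data at all, a sketch like the following can help. It uses three rows and a subset of the columns from the excerpt, written as CSV, and assumes NA marks a missing value; the function name is hypothetical. A variant listed by this check would need data elsewhere, such as counts, to satisfy the rule described before the table.

```python
import csv
import io

VARIANT_COLUMNS = {"hgvs_nt", "hgvs_splice", "hgvs_pro"}
MISSING = {"", "NA"}  # assumption: NA or an empty cell means no value


def variants_without_data(csv_text, key="hgvs_nt"):
    """Return key values of rows where every non-variant column is missing."""
    empty = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        data_cells = [v for column, v in row.items() if column not in VARIANT_COLUMNS]
        if all((cell or "") in MISSING for cell in data_cells):
            empty.append(row[key])
    return empty


# Three rows and six columns taken from the excerpt.
excerpt = """hgvs_nt,hgvs_pro,score,SE,SE_PlusE2NewRep3,score_PlusE2NewRep3
c.38T>C,p.Val13Ala,-0.128,0.115,0.148,0.283
c.[439C>A;441G>A;842C>A],p.[Gln147Lys;Ser281Ter],NA,NA,0.668,-1.612
c.[64G>C;142C>T],p.Glu22Gln,NA,NA,NA,NA
"""
print(variants_without_data(excerpt))  # ['c.[64G>C;142C>T]']
```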
Count table columns
Count data are optional for MaveDB score sets, but are recommended.
There are no required columns for count data, but uploaders should decide on an intuitive naming convention for the column names and describe it in the free text methods.
Metadata files
Users may also upload metadata files in JSON format to provide additional structured metadata about the columns in the score and count tables.
These files should be named score_columns_metadata.json and count_columns_metadata.json respectively. These files are
optional, but recommended.
The metadata files should contain a JSON object where each key is a column name from the corresponding data table, and the value is another JSON object with additional metadata about that column. The additional JSON object may contain the following fields:
- description: A string describing the column.
- details: A string with additional details about the contents of the column (optional).
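As an illustration of the structure described above, the Python sketch below builds a possible score_columns_metadata.json body and checks it against a table's column names. The description texts are placeholders written for this example, the checking function is hypothetical, and treating description as required (because only details is marked optional) is a reading of the field list rather than a documented rule.

```python
import json

# Placeholder descriptions, written for illustration only.
score_columns_metadata = {
    "score": {
        "description": "Variant score from the assay.",
        "details": "Example text only; describe how the score was calculated.",
    },
    "SE": {"description": "Error estimate for the score."},
}


def check_columns_metadata(metadata, table_columns):
    """Return problems with the metadata object for the given table columns."""
    problems = []
    for column, entry in metadata.items():
        if column not in table_columns:
            problems.append(f"{column}: not a column in the data table")
        if not isinstance(entry, dict):
            problems.append(f"{column}: value must be a JSON object")
            continue
        if not isinstance(entry.get("description"), str):
            problems.append(f"{column}: description must be a string")
        if "details" in entry and not isinstance(entry["details"], str):
            problems.append(f"{column}: details must be a string")
    return problems


print(check_columns_metadata(score_columns_metadata, ["hgvs_nt", "hgvs_pro", "score", "SE"]))
print(json.dumps(score_columns_metadata, indent=2))  # text to save in the JSON file
```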