Variants

All samples, reference genomes, and alignments are listed in the Data view (use the toggle switch in the top left to get there). Switching to the Analyze view lets you filter multi-sample variants and view their aligned reads. Use the dropdowns on the left, below the Data/Analyze toggle, to select your reference genome and your alignment, then choose Variants. Once the alignments are complete, you should see a list of all variants identified across all samples.

Cast vs. Melted

There are two ways to view variants.

Cast: Cast displays a summary row for one variant across all samples. You can see how many samples the variant is present in, as well as the variant’s effects.

Melted: ‘Melting’ the view shows one row for every combination of sample and variant. It essentially multiplies the rows by the number of samples, so you can see data specific to individual samples. If a variant is not called in a sample, its Alt column will be blank.
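The relationship between the two views can be illustrated with a small Python sketch. This is not Millstone's implementation; the row layout and the key names (`position`, `ref`, `alts_by_sample`) are hypothetical and chosen only to show how one cast row expands into one melted row per sample.

```python
def melt(cast_rows):
    """Expand each cast row into one row per (sample, variant) pair."""
    melted = []
    for row in cast_rows:
        for sample, alt in row["alts_by_sample"].items():
            melted.append({
                "position": row["position"],
                "ref": row["ref"],
                "sample": sample,
                # Blank Alt when the variant was not called in this sample.
                "alt": alt or "",
            })
    return melted


# Hypothetical cast data: two variants across two samples.
cast_rows = [
    {"position": 76, "ref": "C",
     "alts_by_sample": {"sample_1": "A", "sample_2": None}},
    {"position": 2242, "ref": "G",
     "alts_by_sample": {"sample_1": "A", "sample_2": "A"}},
]
melted_rows = melt(cast_rows)
```

Two cast rows across two samples give four melted rows, and the row for a sample without the call has an empty Alt value.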

Fields and Filtering

Millstone uses a simple query language for filtering variants.

Note: Currently some of the field names can be confusing. A list of all available fields can be found with the Fields… button. The default column names don’t always correspond to the internal field names. There isn’t currently a well-documented list of what each field means, but most of them are documented in the VCF specification. The INFO_EFF_* fields come from SnpEff.

Examples

If you want to look at all variants in a certain gene:

INFO_EFF_GENE = tolC

If you want to look at all variants that have strong or moderate predicted phenotypic effects:

INFO_EFF_IMPACT = HIGH | INFO_EFF_IMPACT = MODERATE

If you want to look in a certain region:

CHROM = NC_000913 & POSITION > 500 & POSITION < 1000
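If you build region filters often, a small helper can generate the filter text for pasting into the filter box. This is only a sketch: `region_filter` is a hypothetical helper, not part of Millstone, and it simply reproduces the syntax of the example above. Note that both comparisons are strict, so the start and end positions themselves are excluded.

```python
def region_filter(chrom, start, end):
    """Build a filter string for variants strictly between start and end."""
    if start >= end:
        raise ValueError("start must be less than end")
    return f"CHROM = {chrom} & POSITION > {start} & POSITION < {end}"


filter_text = region_filter("NC_000913", 500, 1000)
```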

Marginal Calls

We always run variant calling as diploid, even for haploid organisms like E. coli, so that poorly supported variants appear as heterozygous calls. This allows marginal calls to be made when only a portion of the reads show an SNV, in cases of regional duplications, or when reads map to a non-unique region of the genome. Such marginal calls have an orange fraction icon in their ALT column, and you can filter on them using:

IS_HET = TRUE or IS_HET = FALSE

Additionally, the GT_TYPE field is another way to distinguish marginal from strongly called variants. GT_TYPE can take the values 0, 1, or 2 for each sample/variant combination:

  • 0 means the variant was called as reference in the sample
  • 1 means the variant was called as heterozygous (i.e. marginal) in the sample
  • 2 means the variant was called as homozygous (well-supported) in the sample
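As a rough illustration of how these values relate to diploid genotype strings in a VCF file (such as `0/0`, `0/1`, and `1/1`), here is a minimal Python sketch. The function `gt_type` is hypothetical and written for this example only; how Millstone derives the field internally is not described here.

```python
def gt_type(genotype):
    """Map a diploid VCF genotype string to 0, 1, 2, or None if uncalled."""
    # Treat phased ("|") and unphased ("/") genotypes the same way.
    alleles = genotype.replace("|", "/").split("/")
    if "." in alleles:
        return None  # no call in this sample
    if len(set(alleles)) > 1:
        return 1     # heterozygous, i.e. a marginal call
    # Both alleles identical: reference if "0", otherwise homozygous alternate.
    return 0 if alleles[0] == "0" else 2
```

For example, `gt_type("0/1")` returns 1, which corresponds to a marginal call.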

If you’d like to filter on only well-supported variants that have moderate to strong effects on genes, you can use the filter:

GT_TYPE = 2 & (INFO_EFF_IMPACT = HIGH | INFO_EFF_IMPACT = MODERATE)
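To make the grouping in this filter explicit, the same condition can be written as a Python predicate over melted rows. This is a sketch under the assumption that each row is a dictionary keyed by field name; the row data below is made up for illustration.

```python
def is_well_supported_and_impactful(row):
    """Mirror of: GT_TYPE = 2 & (INFO_EFF_IMPACT = HIGH | INFO_EFF_IMPACT = MODERATE)."""
    return row["GT_TYPE"] == 2 and row["INFO_EFF_IMPACT"] in ("HIGH", "MODERATE")


# Hypothetical melted rows.
rows = [
    {"POSITION": 76, "GT_TYPE": 2, "INFO_EFF_IMPACT": "HIGH"},
    {"POSITION": 607, "GT_TYPE": 1, "INFO_EFF_IMPACT": "HIGH"},
    {"POSITION": 951, "GT_TYPE": 2, "INFO_EFF_IMPACT": "LOW"},
]
kept = [row for row in rows if is_well_supported_and_impactful(row)]
```

Only the first row passes: the second is a marginal call and the third has a low predicted impact.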

Variant Sets

Variant sets are a way to group variants after filtering. The sets created by default correspond to regions where the alignment had problems: insufficient coverage, no coverage, too much coverage, or poor mapping quality (perhaps corresponding to regions that are non-unique).

You can also create your own sets to group interesting variants, or those whose alignments you’d like to examine by eye.

Creating a blank set

You can create your own blank sets from the Sets tab in the Data view. After creating a set, you can add variants to it in the Analyze view using the checkboxes and the master checkbox dropdown on the left.

Uploading a set from a VCF file

You can also upload a variant set from a VCF file. Only the first 5 columns of the VCF will be used. The file must be tab delimited. Here is an example:

#CHROM          POS ID  REF ALT
NC_000913   2242    .   G   A
NC_000913   76  .   C   A
NC_000913   3170    .   T   C
NC_000913   1623    .   G   C
NC_000913   3879    .   A   G
NC_000913   3112    .   A   T
NC_000913   1577    .   C   T
NC_000913   5352    .   G   A
NC_000913   4386    .   A   T
NC_000913   1167    .   G   T
NC_000913   5425    .   T   A
NC_000913   951 .   C   A
NC_000913   3993    .   A   G
NC_000913   226 .   G   C
NC_000913   2939    .   T   G
NC_000913   92  .   C   A
NC_000913   5563    .   A   C
NC_000913   4446    .   A   C
NC_000913   607 .   A   G
NC_000913   5088    .   A   T

This way, you can identify variants you expected to be called in your samples, such as alleles targeted by MAGE oligonucleotides.
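Because the file must be tab delimited, it can be safer to generate it with a script than to type it by hand. The following Python sketch writes the five columns shown in the example using the standard `csv` module. The function name `write_variant_set_vcf` and the input tuples are hypothetical, and the sketch emits only the column header line, mirroring the example above; whether additional VCF header lines are needed is not stated here.

```python
import csv
import io


def write_variant_set_vcf(variants, handle):
    """Write (chrom, pos, ref, alt) tuples as tab-delimited VCF rows."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["#CHROM", "POS", "ID", "REF", "ALT"])
    for chrom, pos, ref, alt in variants:
        # "." marks a missing ID, as in the example above.
        writer.writerow([chrom, pos, ".", ref, alt])


# Write to an in-memory buffer; pass an open file handle to write to disk.
buffer = io.StringIO()
write_variant_set_vcf(
    [("NC_000913", 2242, "G", "A"), ("NC_000913", 76, "C", "A")],
    buffer,
)
vcf_text = buffer.getvalue()
```

To save the result, open a file in text mode, for example `open("my_set.vcf", "w", newline="")`, and pass the handle in place of `buffer`.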