Output format

    Variant calling

    Invocations of varlociraptor call variants output variant calls in BCF format (either printed to STDOUT, the default, or saved to a given path). The output follows the established standard, but some fields are different or have a slightly different meaning than what you might be used to from other variant callers.

    INFO fields

    • Varlociraptor uses the usual fields to encode structural variant information (SVTYPE, SVLEN, END, MATEID, EVENT), as defined in the VCF/BCF standard.
    • In addition, it provides one field for each event defined in the calling scenario (PROB_EVENTNAME, where EVENTNAME is the name of the event in the scenario). These fields denote the posterior probability of the event, encoded in PHRED scale (the smaller the value, the higher the probability, with 0 corresponding to an unscaled probability of 1). In addition to the events of the scenario, there are always the two implicit events absent and artifact (meaning that the variant in the record is not present at all or is some kind of artifact, respectively), which are reported in the fields PROB_ABSENT and PROB_ARTIFACT. Together, all these posterior probabilities sum to one (modulo minor numerical glitches caused by the encoding and the available precision in BCF format).
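
    The PHRED encoding described above can be decoded with the usual relation p = 10^(-Q/10). The following is a minimal sketch, not part of Varlociraptor itself; the event names and probabilities in the example record are hypothetical and only serve to illustrate that the decoded values sum to one.

```python
import math


def phred_to_prob(phred: float) -> float:
    """Convert a PHRED-scaled value Q to a plain probability p = 10^(-Q/10)."""
    return 10.0 ** (-phred / 10.0)


def prob_to_phred(prob: float) -> float:
    """Convert a plain probability to PHRED scale; probability 0 maps to infinity."""
    if prob <= 0.0:
        return math.inf
    return -10.0 * math.log10(prob)


# Hypothetical posterior probabilities for one record (assumed values, not real output).
true_probs = {
    "PROB_SOMATIC": 0.90,
    "PROB_GERMLINE": 0.07,
    "PROB_ABSENT": 0.02,
    "PROB_ARTIFACT": 0.01,
}

# What the INFO fields would contain after PHRED encoding.
info = {name: prob_to_phred(p) for name, p in true_probs.items()}

# Decoding the INFO fields recovers probabilities that sum to (approximately) one.
probs = {name: phred_to_prob(q) for name, q in info.items()}
total = sum(probs.values())
```

    With real data, the sum may deviate slightly from one because of the limited precision of the values stored in the BCF file.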

    Sample specific (FORMAT) fields

    • By default, no genotype field (GT) is provided. The reason is that varlociraptor can be used to call much more complex scenarios that cannot be captured via genotypes (e.g. subclonal variants). If you still need a genotype field and your scenario is sufficiently simple, you can use varlociraptor genotype (see varlociraptor genotype --help) to annotate your BCF file with additional, standard-compliant genotype fields.
    • The variant allele frequency in each sample is stored in the AF field. This is Varlociraptor's actual replacement for the genotype that is missing by default. It allows distinguishing heterozygous (AF=0.5) from homozygous (AF=1.0) calls, as well as any subclonal events in between.
    • The read depth is stored in the DP field of each sample. Importantly, this depth is not an ordinary read count, but rather the expected read depth given the mapping qualities of the reads covering the variant. Mathematically speaking, it is the sum of the probabilities that each read maps correctly to the locus of the variant. This kind of depth is more accurate because it takes the mapping uncertainty into account, but it may differ slightly from the read counts you see, e.g., when checking the pileup in IGV.
    • The SAOBS field provides a simplified summary of the observations (e.g. read alignments) favoring the alternative (ALT) allele (it has to be considered together with SROBS). Each entry is encoded as CB, with C being a count and B being a letter that encodes the posterior odds for the ALT allele. The letter denotes an extended Kass-Raftery score: B=barely, P=positive, S=strong, V=very strong (lower case if the probability for correct mapping of the fragment is <95%). Note that we extend Kass-Raftery scores with a term for equality between the evidence of the two alleles (E=equal). Further note that there is no N=none score, as such observations occur with an opposite direction score (odds for the reference or a third allele) in the SROBS field.
    • The SROBS field provides a simplified summary of the observations favoring the reference (REF) or a third allele (it has to be considered together with SAOBS). Each entry is encoded as CB, with C being a count and B being a letter that encodes the posterior odds for the reference or a third allele. The letter denotes an extended Kass-Raftery score: E=equal, B=barely, P=positive, S=strong, V=very strong (lower case if the probability for correct mapping of the fragment is <95%).
    • The OBS field provides a detailed summary of the observations. Each entry is encoded as CBDTASOPXI, with C being a count and B being the posterior odds for the ALT or the reference allele. The odds are given as a two-letter code. The first letter (A or R) defines whether the odds favor the ALT allele (A) or any other allele, including the reference allele (R). The second letter denotes an extended Kass-Raftery score: N=none, E=equal, B=barely, P=positive, S=strong, V=very strong (lower case if the probability for correct mapping of the fragment is <95%). Note that we extend Kass-Raftery scores with a term for equality between the evidence of the two alleles (E=equal). The remaining components are as follows. D denotes the edit distance to the ALT allele in case it is higher than what could be expected from sequencing errors (in that case, Varlociraptor derives a third allele from the read sequence and considers that as an alternative to the ALT allele, instead of the reference allele). T is the type of alignment (s = single end, p = paired end). A denotes whether the observations also map to an alternative locus (# = most found alternative locus, * = other locus, . = no locus). S is the strand that supports the observation (+, -, or * for both). O is the read orientation (> = F1R2, < = F2R1, * = unknown, ! = non-standard, e.g. R1F2). P is the read position (^ = most found read position, * = any other position or position is irrelevant). X denotes whether the respective alignments entail a softclip ($ = softclip, . = no softclip). I denotes indel operations in the respective alignments against the ALT allele (* = some indel, . = no indel or information irrelevant for the variant type).
    • The OOBS field provides the number of omitted observations. For SNVs and MNVs, read pairs are omitted if they have a non-standard read orientation (neither F1R2 nor F2R1) as those can frequently lead to alignment artifacts.
    • The SB field denotes the strand bias estimate: + indicates that the ALT allele is associated with the forward strand, - indicates that the ALT allele is associated with the reverse strand, . indicates no strand bias. Strand bias is indicative of systematic sequencing errors. The probability for strand bias is captured by the ARTIFACT event (PROB_ARTIFACT).
    • The ROB field denotes the read orientation bias estimate: > indicates that the ALT allele is associated with F1R2 orientation, < indicates that the ALT allele is associated with F2R1 orientation, . indicates no read orientation bias. Read orientation bias is indicative of guanine oxidation artifacts. The probability for read orientation bias is captured by the ARTIFACT event (PROB_ARTIFACT).
    • The RPB field denotes the read position bias estimate: ^ indicates that the ALT allele is associated with the most found read position, . indicates that there is no read position bias. Read position bias is indicative of systematic sequencing errors, e.g. in a specific cycle. The probability for read position bias is captured by the ARTIFACT event (PROB_ARTIFACT).
    • The SCB field denotes the softclip bias estimate: $ indicates that the ALT allele is associated with softclips in the same alignment, . indicates that there is no softclip bias. Softclip bias is indicative of systematic alignment errors, caused by a part of the read that does not properly align to the reference (and is thus soft clipped). Note that softclips can also be caused by structural variants. However, structural variants on the same haplotype as e.g. an SNV should not cause a softclip bias, because there will usually still be reads that do not reach the SV, thereby providing evidence against a softclip bias. The probability for softclip bias is captured by the ARTIFACT event (PROB_ARTIFACT).
    • The HE field denotes the homopolymer error estimate: * indicates that the ALT allele is associated with homopolymer indel operations of varying length, . indicates that there is no homopolymer error. Homopolymer error is indicative of systematic PCR amplification errors. The probability for such homopolymer artifacts is captured by the ARTIFACT event (PROB_ARTIFACT).
    • The ALB field denotes the alt locus bias estimate: * indicates that the ALT allele is systematically associated with either MAPQs smaller than the maximum MAPQ or a major alternative alignment (XA tag) reported by the read mapper. This would be indicative of ALT reads actually coming from another locus (e.g. some repeat, a homology, a distant variant allele, or a CNV). The probability for alt locus bias is captured by the ARTIFACT event (PROB_ARTIFACT).
    • The AFD field provides the sampled posterior probability densities of allele frequencies in PHRED scale (the smaller the value, the higher the density, with 0 corresponding to an unscaled value of 1). In the discrete case (no somatic mutation rate or continuous universe in the scenario), these can be seen as posterior probabilities. Note that densities can be greater than one. Further note that the densities of each sample are conditional on the maximum a posteriori allele frequencies of the other called samples.
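
    To illustrate the expected read depth stored in DP, the following sketch sums the probabilities of correct mapping over a set of reads. It assumes that the probability of a wrong mapping is taken directly from the MAPQ value (PHRED-scaled, as defined in the SAM specification); Varlociraptor may adjust mapping qualities internally, so this is a simplified illustration rather than its actual implementation. The MAPQ values are made up.

```python
def expected_depth(mapqs):
    """Sum of the probabilities that each read maps correctly.

    MAPQ is the PHRED-scaled probability that the mapping is wrong,
    so the probability of a correct mapping is 1 - 10^(-MAPQ/10).
    """
    return sum(1.0 - 10.0 ** (-q / 10.0) for q in mapqs)


# Hypothetical MAPQ values of five reads covering a variant.
mapqs = [60, 60, 30, 10, 0]

raw_count = len(mapqs)         # what a plain pileup count would show: 5
depth = expected_depth(mapqs)  # about 3.9, because uncertain reads count less
```

    A read with MAPQ 0 contributes nothing to the expected depth, which is one reason why DP can be lower than the number of reads visible in a pileup.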
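
    The CB encoding used by SAOBS and SROBS can be decoded with a small parser. The sketch below is an illustration under assumptions: the function name and the example value are hypothetical, and it assumes that entries are written one after another in a single string (the regular expression also tolerates separators between entries).

```python
import re

# Extended Kass-Raftery scores as listed for SAOBS and SROBS.
SCORES = {
    "E": "equal",
    "B": "barely",
    "P": "positive",
    "S": "strong",
    "V": "very strong",
}

# One entry: a count followed by a score letter (upper or lower case).
ENTRY = re.compile(r"(\d+)([EBPSVebpsv])")


def parse_obs_summary(value):
    """Parse a SAOBS/SROBS-style string into (count, score, confident_mapping) tuples.

    confident_mapping is True for upper-case letters, i.e. when the probability
    for correct mapping of the fragment is at least 95%.
    """
    entries = []
    for match in ENTRY.finditer(value):
        count = int(match.group(1))
        letter = match.group(2)
        entries.append((count, SCORES[letter.upper()], letter.isupper()))
    return entries


# Hypothetical SAOBS value: 12 very strong, 3 strong (uncertain mapping), 1 barely.
parsed = parse_obs_summary("12V3s1B")
```

    Since SAOBS and SROBS have to be considered together, applying the same function to both fields of a sample gives the counts of observations for and against the ALT allele at each evidence level.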