Vivaldi vignette

This vignette shows how to use the vivaldi package to analyze viral single-nucleotide variants called from next-generation sequencing data (e.g. Illumina). It takes variant data in the form of VCF files, performs several widely used analyses, and generates plots for a preliminary investigation of the variants found. Users are encouraged to continue with their own analyses afterwards in order to address more specific questions about their data.

Set up

Load the vivaldi package and its requirements

library(vivaldi)
library(kableExtra)

Vignette data set

The data used in this vignette are simulated influenza genomes with known mutations generated at random frequencies. (More description here)

Set paths for variant data and metadata

Metadata used for the calculations includes a .csv file containing the length of each viral segment. For non-segmented viruses, simply report the length of the whole genome. If relevant, the user should also supply a .csv file describing the sequencing replicates, with three columns: the filename of each replicate, which replicate that file represents, and the new sample name to use for the merged replicates. Finally, the user should provide a vector of segment names in the desired order for plotting.
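
As a point of reference, below is a minimal sketch of what these two tables might look like, built directly in R rather than read from disk. The column names (segment, SegmentSize, STRAIN, filename, replicate, sample) are inferred from the merging steps later in this vignette; check the bundled SegmentSize.csv and reps.csv for the exact headers.

# Hypothetical example of the two metadata tables (column names are assumptions;
# compare against the packaged SegmentSize.csv and reps.csv)
example_sizes = data.frame(
  segment     = c("H1N1_HA", "H1N1_NA"),
  SegmentSize = c(1701, 1410),
  STRAIN      = "H1N1"
)

example_replicates = data.frame(
  filename  = c("sample1_rep1.ann", "sample1_rep2.ann"),
  replicate = c("rep1", "rep2"),
  sample    = "sample1"
)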

# Set variables that we will use
vardir = system.file("extdata", "vcfs", package="vivaldi") 
seg_sizes = system.file("extdata", "SegmentSize.csv", package="vivaldi")
rep_info = system.file("extdata", "reps.csv", package="vivaldi")

sizes = read.csv(file=seg_sizes,header=T,sep=",",na.strings = c(''))
replicates = read.csv(file = rep_info, header = T, sep = ",", na.strings = c(""))

genome_size = 13133
SEGMENTS = c("H1N1_PB2","H1N1_PB1","H1N1_PA","H1N1_HA","H1N1_NP","H1N1_NA","H1N1_MP","H1N1_NS")

Step 1: Loading in data and arranging

Load VCF files into a dataframe

The first step of the workflow is to load the VCF files into R and extract the important information. There are currently two functions for this step, which we hope to merge into a single function in the future. For VCF files that contain GT information, use the arrange_gt_data() function; for VCF files that do not contain GT information, use the arrange_no_gt_data() function. In both cases, the result is a tidy dataframe containing the sample name (pulled from the file name) and information about the reference and alternative alleles. If the VCF has not previously been annotated with SNPeff (recommended), the user should set annotated = "no" in either function.

VCF_DF = arrange_gt_data(vardir, ref = system.file("extdata", "H1N1.fa", package="vivaldi"), annotated = 'yes')
#> Number of chroms in fasta: 8
#> Length of input files: 24
#> Input vcf files include: c("/tmp/RtmpqQwdxD/temp_libpath13c4765db17151/vivaldi/extdata/vcfs/H1N1-random3_m1_AF_random-a1_frac_0.01_BWA_freebayes_custom.ann.vcf", "/tmp/RtmpqQwdxD/temp_libpath13c4765db17151/vivaldi/extdata/vcfs/H1N1-random3_m1_AF_random-a1_frac_0.01_BWA_ivar_custom.ann.vcf", "/tmp/RtmpqQwdxD/temp_libpath13c4765db17151/vivaldi/extdata/vcfs/H1N1-random3_m1_AF_random-a1_frac_0.01_BWA_varscan_custom.ann.vcf", "/tmp/RtmpqQwdxD/temp_libpath13c4765db17151/vivaldi/extdata/vcfs/H1N1-random3_m1_AF_random-a2_frac_0.01_BWA_freebayes_custom.ann.vcf", 
#> "/tmp/RtmpqQwdxD/temp_libpath13c4765db17151/vivaldi/extdata/vcfs/H1N1-random3_m1_AF_random-a2_frac_0.01_BWA_ivar_custom.ann.vcf", "/tmp/RtmpqQwdxD/temp_libpath13c4765db17151/vivaldi/extdata/vcfs/H1N1-random3_m1_AF_random-a2_frac_0.01_BWA_varscan_custom.ann.vcf", "/tmp/RtmpqQwdxD/temp_libpath13c4765db17151/vivaldi/extdata/vcfs/H1N1-random3_m1_AF_random-a3_frac_0.01_BWA_freebayes_custom.ann.vcf", "/tmp/RtmpqQwdxD/temp_libpath13c4765db17151/vivaldi/extdata/vcfs/H1N1-random3_m1_AF_random-a3_frac_0.01_BWA_ivar_custom.ann.vcf", 
#> "/tmp/RtmpqQwdxD/temp_libpath13c4765db17151/vivaldi/extdata/vcfs/H1N1-random3_m1_AF_random-a3_frac_0.01_BWA_varscan_custom.ann.vcf", "/tmp/RtmpqQwdxD/temp_libpath13c4765db17151/vivaldi/extdata/vcfs/H1N1-random3_m2_AF_random-a1_frac_0.01_BWA_freebayes_custom.ann.vcf", "/tmp/RtmpqQwdxD/temp_libpath13c4765db17151/vivaldi/extdata/vcfs/H1N1-random3_m2_AF_random-a1_frac_0.01_BWA_ivar_custom.ann.vcf", "/tmp/RtmpqQwdxD/temp_libpath13c4765db17151/vivaldi/extdata/vcfs/H1N1-random3_m2_AF_random-a1_frac_0.01_BWA_varscan_custom.ann.vcf", 
#> "/tmp/RtmpqQwdxD/temp_libpath13c4765db17151/vivaldi/extdata/vcfs/H1N1-random3_m2_AF_random-a2_frac_0.01_BWA_freebayes_custom.ann.vcf", "/tmp/RtmpqQwdxD/temp_libpath13c4765db17151/vivaldi/extdata/vcfs/H1N1-random3_m2_AF_random-a2_frac_0.01_BWA_ivar_custom.ann.vcf", "/tmp/RtmpqQwdxD/temp_libpath13c4765db17151/vivaldi/extdata/vcfs/H1N1-random3_m2_AF_random-a2_frac_0.01_BWA_varscan_custom.ann.vcf", "/tmp/RtmpqQwdxD/temp_libpath13c4765db17151/vivaldi/extdata/vcfs/H1N1-random3_m2_AF_random-a3_frac_0.01_BWA_freebayes_custom.ann.vcf", 
#> "/tmp/RtmpqQwdxD/temp_libpath13c4765db17151/vivaldi/extdata/vcfs/H1N1-random3_m2_AF_random-a3_frac_0.01_BWA_ivar_custom.ann.vcf", "/tmp/RtmpqQwdxD/temp_libpath13c4765db17151/vivaldi/extdata/vcfs/H1N1-random3_m2_AF_random-a3_frac_0.01_BWA_varscan_custom.ann.vcf", "/tmp/RtmpqQwdxD/temp_libpath13c4765db17151/vivaldi/extdata/vcfs/H1N1-random4_m1_AF_random-a4_frac_0.01_BWA_freebayes_custom.ann.vcf", "/tmp/RtmpqQwdxD/temp_libpath13c4765db17151/vivaldi/extdata/vcfs/H1N1-random4_m1_AF_random-a4_frac_0.01_BWA_ivar_custom.ann.vcf", 
#> "/tmp/RtmpqQwdxD/temp_libpath13c4765db17151/vivaldi/extdata/vcfs/H1N1-random4_m1_AF_random-a4_frac_0.01_BWA_varscan_custom.ann.vcf", "/tmp/RtmpqQwdxD/temp_libpath13c4765db17151/vivaldi/extdata/vcfs/H1N1-random4_m2_AF_random-a4_frac_0.01_BWA_freebayes_custom.ann.vcf", "/tmp/RtmpqQwdxD/temp_libpath13c4765db17151/vivaldi/extdata/vcfs/H1N1-random4_m2_AF_random-a4_frac_0.01_BWA_ivar_custom.ann.vcf", "/tmp/RtmpqQwdxD/temp_libpath13c4765db17151/vivaldi/extdata/vcfs/H1N1-random4_m2_AF_random-a4_frac_0.01_BWA_varscan_custom.ann.vcf"
#> )
#> Sample name is: H1N1-random3_m1_AF_random-a1_frac_0.01_BWA_freebayes_custom.ann
#> Scanning file to determine attributes.
#> File attributes:
#>   meta lines: 73
#>   header_line: 74
#>   variant count: 107
#>   column count: 10
#> 
Meta line 73 read in.
#> All meta lines processed.
#> gt matrix initialized.
#> Character matrix gt created.
#>   Character matrix gt rows: 107
#>   Character matrix gt cols: 10
#>   skip: 0
#>   nrows: 107
#>   row_num: 0
#> 
Processed variant: 107
#> All variants processed
#> Extracting gt element GT
#> Extracting gt element GQ
#> Extracting gt element GL
#> Extracting gt element DP
#> Extracting gt element AD
#> Extracting gt element RO
#> Extracting gt element QR
#> Extracting gt element AO
#> Extracting gt element QA
#> Extracting gt element MIN_DP
#> Sample name is: H1N1-random3_m1_AF_random-a1_frac_0.01_BWA_ivar_custom.ann
#> Scanning file to determine attributes.
#> File attributes:
#>   meta lines: 12
#>   header_line: 13
#>   variant count: 107
#>   column count: 10
#> 
Meta line 12 read in.
#> All meta lines processed.
#> gt matrix initialized.
#> Character matrix gt created.
#>   Character matrix gt rows: 107
#>   Character matrix gt cols: 10
#>   skip: 0
#>   nrows: 107
#>   row_num: 0
#> 
Processed variant: 107
#> All variants processed
#> Extracting gt element GT
#> Extracting gt element AD
#> Extracting gt element DP
#> Extracting gt element GQ
#> Sample name is: H1N1-random3_m1_AF_random-a1_frac_0.01_BWA_varscan_custom.ann
#> Scanning file to determine attributes.
#> File attributes:
#>   meta lines: 28
#>   header_line: 29
#>   variant count: 104
#>   column count: 10
#> 
Meta line 28 read in.
#> All meta lines processed.
#> gt matrix initialized.
#> Character matrix gt created.
#>   Character matrix gt rows: 104
#>   Character matrix gt cols: 10
#>   skip: 0
#>   nrows: 104
#>   row_num: 0
#> 
Processed variant: 104
#> All variants processed
#> Extracting gt element GT
#> Extracting gt element GQ
#> Extracting gt element SDP
#> Extracting gt element DP
#> Extracting gt element RD
#> Extracting gt element AD
#> Extracting gt element FREQ
#> Extracting gt element PVAL
#> Extracting gt element RBQ
#> Extracting gt element ABQ
#> Extracting gt element RDF
#> Extracting gt element RDR
#> Extracting gt element ADF
#> Extracting gt element ADR
#> Warning: Expected 2 pieces. Missing pieces filled with `NA` in 104 rows [1, 2,
#> 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, ...].
#> Sample name is: H1N1-random3_m1_AF_random-a2_frac_0.01_BWA_freebayes_custom.ann
#> Scanning file to determine attributes.
#> File attributes:
#>   meta lines: 73
#>   header_line: 74
#>   variant count: 108
#>   column count: 10
#> 
Meta line 73 read in.
#> All meta lines processed.
#> gt matrix initialized.
#> Character matrix gt created.
#>   Character matrix gt rows: 108
#>   Character matrix gt cols: 10
#>   skip: 0
#>   nrows: 108
#>   row_num: 0
#> 
Processed variant: 108
#> All variants processed
#> Extracting gt element GT
#> Extracting gt element GQ
#> Extracting gt element GL
#> Extracting gt element DP
#> Extracting gt element AD
#> Extracting gt element RO
#> Extracting gt element QR
#> Extracting gt element AO
#> Extracting gt element QA
#> Extracting gt element MIN_DP
#> Sample name is: H1N1-random3_m1_AF_random-a2_frac_0.01_BWA_ivar_custom.ann
#> Scanning file to determine attributes.
#> File attributes:
#>   meta lines: 12
#>   header_line: 13
#>   variant count: 109
#>   column count: 10
#> 
Meta line 12 read in.
#> All meta lines processed.
#> gt matrix initialized.
#> Character matrix gt created.
#>   Character matrix gt rows: 109
#>   Character matrix gt cols: 10
#>   skip: 0
#>   nrows: 109
#>   row_num: 0
#> 
Processed variant: 109
#> All variants processed
#> Extracting gt element GT
#> Extracting gt element AD
#> Extracting gt element DP
#> Extracting gt element GQ
#> Sample name is: H1N1-random3_m1_AF_random-a2_frac_0.01_BWA_varscan_custom.ann
#> Scanning file to determine attributes.
#> File attributes:
#>   meta lines: 28
#>   header_line: 29
#>   variant count: 105
#>   column count: 10
#> 
Meta line 28 read in.
#> All meta lines processed.
#> gt matrix initialized.
#> Character matrix gt created.
#>   Character matrix gt rows: 105
#>   Character matrix gt cols: 10
#>   skip: 0
#>   nrows: 105
#>   row_num: 0
#> 
Processed variant: 105
#> All variants processed
#> Extracting gt element GT
#> Extracting gt element GQ
#> Extracting gt element SDP
#> Extracting gt element DP
#> Extracting gt element RD
#> Extracting gt element AD
#> Extracting gt element FREQ
#> Extracting gt element PVAL
#> Extracting gt element RBQ
#> Extracting gt element ABQ
#> Extracting gt element RDF
#> Extracting gt element RDR
#> Extracting gt element ADF
#> Extracting gt element ADR
#> Warning: Expected 2 pieces. Missing pieces filled with `NA` in 105 rows [1, 2,
#> 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, ...].
#> Sample name is: H1N1-random3_m1_AF_random-a3_frac_0.01_BWA_freebayes_custom.ann
#> Scanning file to determine attributes.
#> File attributes:
#>   meta lines: 73
#>   header_line: 74
#>   variant count: 110
#>   column count: 10
#> 
Meta line 73 read in.
#> All meta lines processed.
#> gt matrix initialized.
#> Character matrix gt created.
#>   Character matrix gt rows: 110
#>   Character matrix gt cols: 10
#>   skip: 0
#>   nrows: 110
#>   row_num: 0
#> 
Processed variant: 110
#> All variants processed
#> Extracting gt element GT
#> Extracting gt element GQ
#> Extracting gt element GL
#> Extracting gt element DP
#> Extracting gt element AD
#> Extracting gt element RO
#> Extracting gt element QR
#> Extracting gt element AO
#> Extracting gt element QA
#> Extracting gt element MIN_DP
#> Sample name is: H1N1-random3_m1_AF_random-a3_frac_0.01_BWA_ivar_custom.ann
#> Scanning file to determine attributes.
#> File attributes:
#>   meta lines: 12
#>   header_line: 13
#>   variant count: 110
#>   column count: 10
#> 
Meta line 12 read in.
#> All meta lines processed.
#> gt matrix initialized.
#> Character matrix gt created.
#>   Character matrix gt rows: 110
#>   Character matrix gt cols: 10
#>   skip: 0
#>   nrows: 110
#>   row_num: 0
#> 
Processed variant: 110
#> All variants processed
#> Extracting gt element GT
#> Extracting gt element AD
#> Extracting gt element DP
#> Extracting gt element GQ
#> Sample name is: H1N1-random3_m1_AF_random-a3_frac_0.01_BWA_varscan_custom.ann
#> Scanning file to determine attributes.
#> File attributes:
#>   meta lines: 28
#>   header_line: 29
#>   variant count: 108
#>   column count: 10
#> 
Meta line 28 read in.
#> All meta lines processed.
#> gt matrix initialized.
#> Character matrix gt created.
#>   Character matrix gt rows: 108
#>   Character matrix gt cols: 10
#>   skip: 0
#>   nrows: 108
#>   row_num: 0
#> 
Processed variant: 108
#> All variants processed
#> Extracting gt element GT
#> Extracting gt element GQ
#> Extracting gt element SDP
#> Extracting gt element DP
#> Extracting gt element RD
#> Extracting gt element AD
#> Extracting gt element FREQ
#> Extracting gt element PVAL
#> Extracting gt element RBQ
#> Extracting gt element ABQ
#> Extracting gt element RDF
#> Extracting gt element RDR
#> Extracting gt element ADF
#> Extracting gt element ADR
#> Warning: Expected 2 pieces. Missing pieces filled with `NA` in 108 rows [1, 2,
#> 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, ...].
#> Sample name is: H1N1-random3_m2_AF_random-a1_frac_0.01_BWA_freebayes_custom.ann
#> Scanning file to determine attributes.
#> File attributes:
#>   meta lines: 73
#>   header_line: 74
#>   variant count: 105
#>   column count: 10
#> 
Meta line 73 read in.
#> All meta lines processed.
#> gt matrix initialized.
#> Character matrix gt created.
#>   Character matrix gt rows: 105
#>   Character matrix gt cols: 10
#>   skip: 0
#>   nrows: 105
#>   row_num: 0
#> 
Processed variant: 105
#> All variants processed
#> Extracting gt element GT
#> Extracting gt element GQ
#> Extracting gt element GL
#> Extracting gt element DP
#> Extracting gt element AD
#> Extracting gt element RO
#> Extracting gt element QR
#> Extracting gt element AO
#> Extracting gt element QA
#> Extracting gt element MIN_DP
#> Sample name is: H1N1-random3_m2_AF_random-a1_frac_0.01_BWA_ivar_custom.ann
#> Scanning file to determine attributes.
#> File attributes:
#>   meta lines: 12
#>   header_line: 13
#>   variant count: 104
#>   column count: 10
#> 
Meta line 12 read in.
#> All meta lines processed.
#> gt matrix initialized.
#> Character matrix gt created.
#>   Character matrix gt rows: 104
#>   Character matrix gt cols: 10
#>   skip: 0
#>   nrows: 104
#>   row_num: 0
#> 
Processed variant: 104
#> All variants processed
#> Extracting gt element GT
#> Extracting gt element AD
#> Extracting gt element DP
#> Extracting gt element GQ
#> Sample name is: H1N1-random3_m2_AF_random-a1_frac_0.01_BWA_varscan_custom.ann
#> Scanning file to determine attributes.
#> File attributes:
#>   meta lines: 28
#>   header_line: 29
#>   variant count: 103
#>   column count: 10
#> 
Meta line 28 read in.
#> All meta lines processed.
#> gt matrix initialized.
#> Character matrix gt created.
#>   Character matrix gt rows: 103
#>   Character matrix gt cols: 10
#>   skip: 0
#>   nrows: 103
#>   row_num: 0
#> 
Processed variant: 103
#> All variants processed
#> Extracting gt element GT
#> Extracting gt element GQ
#> Extracting gt element SDP
#> Extracting gt element DP
#> Extracting gt element RD
#> Extracting gt element AD
#> Extracting gt element FREQ
#> Extracting gt element PVAL
#> Extracting gt element RBQ
#> Extracting gt element ABQ
#> Extracting gt element RDF
#> Extracting gt element RDR
#> Extracting gt element ADF
#> Extracting gt element ADR
#> Warning: Expected 2 pieces. Missing pieces filled with `NA` in 103 rows [1, 2,
#> 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, ...].
#> Sample name is: H1N1-random3_m2_AF_random-a2_frac_0.01_BWA_freebayes_custom.ann
#> Scanning file to determine attributes.
#> File attributes:
#>   meta lines: 73
#>   header_line: 74
#>   variant count: 105
#>   column count: 10
#> 
Meta line 73 read in.
#> All meta lines processed.
#> gt matrix initialized.
#> Character matrix gt created.
#>   Character matrix gt rows: 105
#>   Character matrix gt cols: 10
#>   skip: 0
#>   nrows: 105
#>   row_num: 0
#> 
Processed variant: 105
#> All variants processed
#> Extracting gt element GT
#> Extracting gt element GQ
#> Extracting gt element GL
#> Extracting gt element DP
#> Extracting gt element AD
#> Extracting gt element RO
#> Extracting gt element QR
#> Extracting gt element AO
#> Extracting gt element QA
#> Extracting gt element MIN_DP
#> Sample name is: H1N1-random3_m2_AF_random-a2_frac_0.01_BWA_ivar_custom.ann
#> Scanning file to determine attributes.
#> File attributes:
#>   meta lines: 12
#>   header_line: 13
#>   variant count: 103
#>   column count: 10
#> 
Meta line 12 read in.
#> All meta lines processed.
#> gt matrix initialized.
#> Character matrix gt created.
#>   Character matrix gt rows: 103
#>   Character matrix gt cols: 10
#>   skip: 0
#>   nrows: 103
#>   row_num: 0
#> 
Processed variant: 103
#> All variants processed
#> Extracting gt element GT
#> Extracting gt element AD
#> Extracting gt element DP
#> Extracting gt element GQ
#> Sample name is: H1N1-random3_m2_AF_random-a2_frac_0.01_BWA_varscan_custom.ann
#> Scanning file to determine attributes.
#> File attributes:
#>   meta lines: 28
#>   header_line: 29
#>   variant count: 102
#>   column count: 10
#> 
Meta line 28 read in.
#> All meta lines processed.
#> gt matrix initialized.
#> Character matrix gt created.
#>   Character matrix gt rows: 102
#>   Character matrix gt cols: 10
#>   skip: 0
#>   nrows: 102
#>   row_num: 0
#> 
Processed variant: 102
#> All variants processed
#> Extracting gt element GT
#> Extracting gt element GQ
#> Extracting gt element SDP
#> Extracting gt element DP
#> Extracting gt element RD
#> Extracting gt element AD
#> Extracting gt element FREQ
#> Extracting gt element PVAL
#> Extracting gt element RBQ
#> Extracting gt element ABQ
#> Extracting gt element RDF
#> Extracting gt element RDR
#> Extracting gt element ADF
#> Extracting gt element ADR
#> Warning: Expected 2 pieces. Missing pieces filled with `NA` in 102 rows [1, 2,
#> 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, ...].
#> Sample name is: H1N1-random3_m2_AF_random-a3_frac_0.01_BWA_freebayes_custom.ann
#> Scanning file to determine attributes.
#> File attributes:
#>   meta lines: 73
#>   header_line: 74
#>   variant count: 107
#>   column count: 10
#> 
Meta line 73 read in.
#> All meta lines processed.
#> gt matrix initialized.
#> Character matrix gt created.
#>   Character matrix gt rows: 107
#>   Character matrix gt cols: 10
#>   skip: 0
#>   nrows: 107
#>   row_num: 0
#> 
Processed variant: 107
#> All variants processed
#> Extracting gt element GT
#> Extracting gt element GQ
#> Extracting gt element GL
#> Extracting gt element DP
#> Extracting gt element AD
#> Extracting gt element RO
#> Extracting gt element QR
#> Extracting gt element AO
#> Extracting gt element QA
#> Extracting gt element MIN_DP
#> Sample name is: H1N1-random3_m2_AF_random-a3_frac_0.01_BWA_ivar_custom.ann
#> Scanning file to determine attributes.
#> File attributes:
#>   meta lines: 12
#>   header_line: 13
#>   variant count: 105
#>   column count: 10
#> 
Meta line 12 read in.
#> All meta lines processed.
#> gt matrix initialized.
#> Character matrix gt created.
#>   Character matrix gt rows: 105
#>   Character matrix gt cols: 10
#>   skip: 0
#>   nrows: 105
#>   row_num: 0
#> 
Processed variant: 105
#> All variants processed
#> Extracting gt element GT
#> Extracting gt element AD
#> Extracting gt element DP
#> Extracting gt element GQ
#> Sample name is: H1N1-random3_m2_AF_random-a3_frac_0.01_BWA_varscan_custom.ann
#> Scanning file to determine attributes.
#> File attributes:
#>   meta lines: 28
#>   header_line: 29
#>   variant count: 101
#>   column count: 10
#> 
Meta line 28 read in.
#> All meta lines processed.
#> gt matrix initialized.
#> Character matrix gt created.
#>   Character matrix gt rows: 101
#>   Character matrix gt cols: 10
#>   skip: 0
#>   nrows: 101
#>   row_num: 0
#> 
Processed variant: 101
#> All variants processed
#> Extracting gt element GT
#> Extracting gt element GQ
#> Extracting gt element SDP
#> Extracting gt element DP
#> Extracting gt element RD
#> Extracting gt element AD
#> Extracting gt element FREQ
#> Extracting gt element PVAL
#> Extracting gt element RBQ
#> Extracting gt element ABQ
#> Extracting gt element RDF
#> Extracting gt element RDR
#> Extracting gt element ADF
#> Extracting gt element ADR
#> Warning: Expected 2 pieces. Missing pieces filled with `NA` in 101 rows [1, 2,
#> 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, ...].
#> Sample name is: H1N1-random4_m1_AF_random-a4_frac_0.01_BWA_freebayes_custom.ann
#> Scanning file to determine attributes.
#> File attributes:
#>   meta lines: 73
#>   header_line: 74
#>   variant count: 99
#>   column count: 10
#> 
Meta line 73 read in.
#> All meta lines processed.
#> gt matrix initialized.
#> Character matrix gt created.
#>   Character matrix gt rows: 99
#>   Character matrix gt cols: 10
#>   skip: 0
#>   nrows: 99
#>   row_num: 0
#> 
Processed variant: 99
#> All variants processed
#> Extracting gt element GT
#> Extracting gt element GQ
#> Extracting gt element GL
#> Extracting gt element DP
#> Extracting gt element AD
#> Extracting gt element RO
#> Extracting gt element QR
#> Extracting gt element AO
#> Extracting gt element QA
#> Extracting gt element MIN_DP
#> Sample name is: H1N1-random4_m1_AF_random-a4_frac_0.01_BWA_ivar_custom.ann
#> Scanning file to determine attributes.
#> File attributes:
#>   meta lines: 12
#>   header_line: 13
#>   variant count: 101
#>   column count: 10
#> 
Meta line 12 read in.
#> All meta lines processed.
#> gt matrix initialized.
#> Character matrix gt created.
#>   Character matrix gt rows: 101
#>   Character matrix gt cols: 10
#>   skip: 0
#>   nrows: 101
#>   row_num: 0
#> 
Processed variant: 101
#> All variants processed
#> Extracting gt element GT
#> Extracting gt element AD
#> Extracting gt element DP
#> Extracting gt element GQ
#> Sample name is: H1N1-random4_m1_AF_random-a4_frac_0.01_BWA_varscan_custom.ann
#> Scanning file to determine attributes.
#> File attributes:
#>   meta lines: 28
#>   header_line: 29
#>   variant count: 100
#>   column count: 10
#> 
Meta line 28 read in.
#> All meta lines processed.
#> gt matrix initialized.
#> Character matrix gt created.
#>   Character matrix gt rows: 100
#>   Character matrix gt cols: 10
#>   skip: 0
#>   nrows: 100
#>   row_num: 0
#> 
Processed variant: 100
#> All variants processed
#> Extracting gt element GT
#> Extracting gt element GQ
#> Extracting gt element SDP
#> Extracting gt element DP
#> Extracting gt element RD
#> Extracting gt element AD
#> Extracting gt element FREQ
#> Extracting gt element PVAL
#> Extracting gt element RBQ
#> Extracting gt element ABQ
#> Extracting gt element RDF
#> Extracting gt element RDR
#> Extracting gt element ADF
#> Extracting gt element ADR
#> Warning: Expected 2 pieces. Missing pieces filled with `NA` in 100 rows [1, 2,
#> 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, ...].
#> Sample name is: H1N1-random4_m2_AF_random-a4_frac_0.01_BWA_freebayes_custom.ann
#> Scanning file to determine attributes.
#> File attributes:
#>   meta lines: 73
#>   header_line: 74
#>   variant count: 104
#>   column count: 10
#> 
Meta line 73 read in.
#> All meta lines processed.
#> gt matrix initialized.
#> Character matrix gt created.
#>   Character matrix gt rows: 104
#>   Character matrix gt cols: 10
#>   skip: 0
#>   nrows: 104
#>   row_num: 0
#> 
Processed variant: 104
#> All variants processed
#> Extracting gt element GT
#> Extracting gt element GQ
#> Extracting gt element GL
#> Extracting gt element DP
#> Extracting gt element AD
#> Extracting gt element RO
#> Extracting gt element QR
#> Extracting gt element AO
#> Extracting gt element QA
#> Extracting gt element MIN_DP
#> Sample name is: H1N1-random4_m2_AF_random-a4_frac_0.01_BWA_ivar_custom.ann
#> Scanning file to determine attributes.
#> File attributes:
#>   meta lines: 12
#>   header_line: 13
#>   variant count: 106
#>   column count: 10
#> 
Meta line 12 read in.
#> All meta lines processed.
#> gt matrix initialized.
#> Character matrix gt created.
#>   Character matrix gt rows: 106
#>   Character matrix gt cols: 10
#>   skip: 0
#>   nrows: 106
#>   row_num: 0
#> 
Processed variant: 106
#> All variants processed
#> Extracting gt element GT
#> Extracting gt element AD
#> Extracting gt element DP
#> Extracting gt element GQ
#> Sample name is: H1N1-random4_m2_AF_random-a4_frac_0.01_BWA_varscan_custom.ann
#> Scanning file to determine attributes.
#> File attributes:
#>   meta lines: 28
#>   header_line: 29
#>   variant count: 103
#>   column count: 10
#> 
Meta line 28 read in.
#> All meta lines processed.
#> gt matrix initialized.
#> Character matrix gt created.
#>   Character matrix gt rows: 103
#>   Character matrix gt cols: 10
#>   skip: 0
#>   nrows: 103
#>   row_num: 0
#> 
Processed variant: 103
#> All variants processed
#> Extracting gt element GT
#> Extracting gt element GQ
#> Extracting gt element SDP
#> Extracting gt element DP
#> Extracting gt element RD
#> Extracting gt element AD
#> Extracting gt element FREQ
#> Extracting gt element PVAL
#> Extracting gt element RBQ
#> Extracting gt element ABQ
#> Extracting gt element RDF
#> Extracting gt element RDR
#> Extracting gt element ADF
#> Extracting gt element ADR
#> Warning: Expected 2 pieces. Missing pieces filled with `NA` in 103 rows [1, 2,
#> 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, ...].
#todo: merge three arrange_* functions into a single function

kable(head(VCF_DF))
ChromKey POS CHROM ID REF ALT ANN REF_COUNT ALT_COUNT gt_DP sample REF_FREQ ALT_FREQ ALT_TYPE majorfreq minorfreq major minor
1 1007 H1N1_HA NA G A A|missense_variant|MODERATE|CDS_H1N1_HA_1_1701|H1N1_HA|transcript|H1N1_HA.1|protein_coding|1/1|c.1007G>A|p.Arg336Lys|1007/1701|1007/1701|336/566|| 796 38 834 H1N1-random3_m1_AF_random-a1_frac_0.01_BWA_freebayes_custom.ann 0.9544365 0.0455635 minor 0.9544365 0.0455635 G A
1 1145 H1N1_HA NA T A A|missense_variant|MODERATE|CDS_H1N1_HA_1_1701|H1N1_HA|transcript|H1N1_HA.1|protein_coding|1/1|c.1145T>A|p.Leu382Gln|1145/1701|1145/1701|382/566|| 740 30 770 H1N1-random3_m1_AF_random-a1_frac_0.01_BWA_freebayes_custom.ann 0.9610390 0.0389610 minor 0.9610390 0.0389610 T A
1 1293 H1N1_HA NA T C C|synonymous_variant|LOW|CDS_H1N1_HA_1_1701|H1N1_HA|transcript|H1N1_HA.1|protein_coding|1/1|c.1293T>C|p.Gly431Gly|1293/1701|1293/1701|431/566|| 554 17 571 H1N1-random3_m1_AF_random-a1_frac_0.01_BWA_freebayes_custom.ann 0.9702277 0.0297723 minor 0.9702277 0.0297723 T C
1 1319 H1N1_HA NA C T T|missense_variant|MODERATE|CDS_H1N1_HA_1_1701|H1N1_HA|transcript|H1N1_HA.1|protein_coding|1/1|c.1319C>T|p.Ala440Val|1319/1701|1319/1701|440/566|| 580 7 587 H1N1-random3_m1_AF_random-a1_frac_0.01_BWA_freebayes_custom.ann 0.9880750 0.0119250 minor 0.9880750 0.0119250 C T
1 1386 H1N1_HA NA A G G|synonymous_variant|LOW|CDS_H1N1_HA_1_1701|H1N1_HA|transcript|H1N1_HA.1|protein_coding|1/1|c.1386A>G|p.Leu462Leu|1386/1701|1386/1701|462/566|| 545 7 552 H1N1-random3_m1_AF_random-a1_frac_0.01_BWA_freebayes_custom.ann 0.9873188 0.0126812 minor 0.9873188 0.0126812 A G
1 1505 H1N1_HA NA A G G|missense_variant|MODERATE|CDS_H1N1_HA_1_1701|H1N1_HA|transcript|H1N1_HA.1|protein_coding|1/1|c.1505A>G|p.Asp502Gly|1505/1701|1505/1701|502/566|| 380 5 386 H1N1-random3_m1_AF_random-a1_frac_0.01_BWA_freebayes_custom.ann 0.9844560 0.0129534 minor 0.9844560 0.0129534 A G
dim(VCF_DF)
#> [1] 2493   18
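
If the VCF files lack GT information, the alternative loader mentioned above should be used instead. The call below is a sketch only: it assumes arrange_no_gt_data() accepts the same vardir and annotated arguments as arrange_gt_data(); consult ?arrange_no_gt_data for the exact signature.

# Not run: sketch of the loader for VCF files without GT information
# VCF_DF = arrange_no_gt_data(vardir, annotated = 'yes')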

Filter out variants based on coverage and/or frequency cutoffs

This function gives the user the opportunity to filter variants by allele frequency and/or coverage of the ALT allele. Based on (our paper), we recommend a 1% allele frequency cutoff and no coverage cutoff for sequencing data with replicates. For sequencing data without replicates, much more stringent filtering is recommended to ensure confidence in the variants called; we suggest a 3% allele frequency cutoff and 200X coverage.

# Use default coverage (300) and frequency (0.02) cutoffs 
# or set custom values
cov_cutoff = 400
freq_cutoff = 0.02
DF_filt = filter_variants(VCF_DF)
#> Total number of SNP filtered out: 1263
#DF_filt = filter_variants(VCF_DF, coverage_cutoff=cov_cutoff, frequency_cutoff=freq_cutoff)

kable(head(DF_filt))
ChromKey POS CHROM ID REF ALT ANN REF_COUNT ALT_COUNT gt_DP sample REF_FREQ ALT_FREQ ALT_TYPE majorfreq minorfreq major minor
1 1007 H1N1_HA NA G A A|missense_variant|MODERATE|CDS_H1N1_HA_1_1701|H1N1_HA|transcript|H1N1_HA.1|protein_coding|1/1|c.1007G>A|p.Arg336Lys|1007/1701|1007/1701|336/566|| 796 38 834 H1N1-random3_m1_AF_random-a1_frac_0.01_BWA_freebayes_custom.ann 0.9544365 0.0455635 minor 0.9544365 0.0455635 G A
1 1145 H1N1_HA NA T A A|missense_variant|MODERATE|CDS_H1N1_HA_1_1701|H1N1_HA|transcript|H1N1_HA.1|protein_coding|1/1|c.1145T>A|p.Leu382Gln|1145/1701|1145/1701|382/566|| 740 30 770 H1N1-random3_m1_AF_random-a1_frac_0.01_BWA_freebayes_custom.ann 0.9610390 0.0389610 minor 0.9610390 0.0389610 T A
1 1293 H1N1_HA NA T C C|synonymous_variant|LOW|CDS_H1N1_HA_1_1701|H1N1_HA|transcript|H1N1_HA.1|protein_coding|1/1|c.1293T>C|p.Gly431Gly|1293/1701|1293/1701|431/566|| 554 17 571 H1N1-random3_m1_AF_random-a1_frac_0.01_BWA_freebayes_custom.ann 0.9702277 0.0297723 minor 0.9702277 0.0297723 T C
1 435 H1N1_HA NA G A A|synonymous_variant|LOW|CDS_H1N1_HA_1_1701|H1N1_HA|transcript|H1N1_HA.1|protein_coding|1/1|c.435G>A|p.Ser145Ser|435/1701|435/1701|145/566|| 726 23 749 H1N1-random3_m1_AF_random-a1_frac_0.01_BWA_freebayes_custom.ann 0.9692924 0.0307076 minor 0.9692924 0.0307076 G A
1 539 H1N1_HA NA A G G|missense_variant|MODERATE|CDS_H1N1_HA_1_1701|H1N1_HA|transcript|H1N1_HA.1|protein_coding|1/1|c.539A>G|p.Lys180Arg|539/1701|539/1701|180/566|| 745 19 764 H1N1-random3_m1_AF_random-a1_frac_0.01_BWA_freebayes_custom.ann 0.9751309 0.0248691 minor 0.9751309 0.0248691 A G
2 500 H1N1_MP NA C T T|missense_variant|MODERATE|CDS_H1N1_MP_1_759|H1N1_M1|transcript|H1N1_M1.1|protein_coding|1/1|c.500C>T|p.Thr167Ile|500/759|500/759|167/252||,T|intron_variant|MODIFIER|CDS_H1N1_MP_1_759|H1N1_M1|transcript|H1N1_M1.2|protein_coding|1/1|c.27-215C>T|||||| 1163 28 1192 H1N1-random3_m1_AF_random-a1_frac_0.01_BWA_freebayes_custom.ann 0.9756711 0.0234899 minor 0.9756711 0.0234899 C T
dim(DF_filt)
#> [1] 1230   18
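
For data sequenced without replicates, the stricter cutoffs recommended above can be passed explicitly. The sketch below reuses the coverage_cutoff and frequency_cutoff arguments shown in the commented line above, with the suggested 200X coverage and 3% allele frequency values.

# Not run: stricter filtering suggested for data without sequencing replicates
# DF_filt_strict = filter_variants(VCF_DF, coverage_cutoff = 200, frequency_cutoff = 0.03)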

Format SNPeff information

This function takes the single large annotation column produced by SNPeff and separates it into individual columns, one attribute per column.

DF_filt = prepare_annotations(DF_filt)
#> >2 annotations: character(0)

kable(head(DF_filt))
ChromKey POS CHROM ID REF ALT allele annotation putative_impact gene_name gene_id feature_type feature_id transcript_biotype rank_total HGVS.c HGVS.p cDNA_position CDS_position protein_position distance_to_feature errors REF_COUNT ALT_COUNT gt_DP sample REF_FREQ ALT_FREQ ALT_TYPE majorfreq minorfreq major minor
1 1007 H1N1_HA NA G A A missense_variant MODERATE CDS_H1N1_HA_1_1701 H1N1_HA transcript H1N1_HA.1 protein_coding 1/1 c.1007G>A p.Arg336Lys 1007/1701 1007/1701 336/566 796 38 834 H1N1-random3_m1_AF_random-a1_frac_0.01_BWA_freebayes_custom.ann 0.9544365 0.0455635 minor 0.9544365 0.0455635 G A
1 1145 H1N1_HA NA T A A missense_variant MODERATE CDS_H1N1_HA_1_1701 H1N1_HA transcript H1N1_HA.1 protein_coding 1/1 c.1145T>A p.Leu382Gln 1145/1701 1145/1701 382/566 740 30 770 H1N1-random3_m1_AF_random-a1_frac_0.01_BWA_freebayes_custom.ann 0.9610390 0.0389610 minor 0.9610390 0.0389610 T A
1 1293 H1N1_HA NA T C C synonymous_variant LOW CDS_H1N1_HA_1_1701 H1N1_HA transcript H1N1_HA.1 protein_coding 1/1 c.1293T>C p.Gly431Gly 1293/1701 1293/1701 431/566 554 17 571 H1N1-random3_m1_AF_random-a1_frac_0.01_BWA_freebayes_custom.ann 0.9702277 0.0297723 minor 0.9702277 0.0297723 T C
1 435 H1N1_HA NA G A A synonymous_variant LOW CDS_H1N1_HA_1_1701 H1N1_HA transcript H1N1_HA.1 protein_coding 1/1 c.435G>A p.Ser145Ser 435/1701 435/1701 145/566 726 23 749 H1N1-random3_m1_AF_random-a1_frac_0.01_BWA_freebayes_custom.ann 0.9692924 0.0307076 minor 0.9692924 0.0307076 G A
1 539 H1N1_HA NA A G G missense_variant MODERATE CDS_H1N1_HA_1_1701 H1N1_HA transcript H1N1_HA.1 protein_coding 1/1 c.539A>G p.Lys180Arg 539/1701 539/1701 180/566 745 19 764 H1N1-random3_m1_AF_random-a1_frac_0.01_BWA_freebayes_custom.ann 0.9751309 0.0248691 minor 0.9751309 0.0248691 A G
3 1042 H1N1_NA NA G C C missense_variant MODERATE CDS_H1N1_NA_1_1410 H1N1_NA transcript H1N1_NA.1 protein_coding 1/1 c.1042G>C p.Gly348Arg 1042/1410 1042/1410 348/469 691 30 721 H1N1-random3_m1_AF_random-a1_frac_0.01_BWA_freebayes_custom.ann 0.9583911 0.0416089 minor 0.9583911 0.0416089 G C
dim(DF_filt)
#> [1] 1390   33

Add metadata (segment and genome size)

Here, we add information about the sizes of the segments, which enables downstream calculations such as Ts/Tv and dN/dS ratios. The user must provide both the variant dataframe and the loaded metadata dataframe, as well as the names of the columns to merge by in each.

DF_filt = add_metadata(DF_filt, sizes, c('CHROM'), c('segment'))

kable(head(DF_filt))
CHROM ChromKey POS ID REF ALT allele annotation putative_impact gene_name gene_id feature_type feature_id transcript_biotype rank_total HGVS.c HGVS.p cDNA_position CDS_position protein_position distance_to_feature errors REF_COUNT ALT_COUNT gt_DP sample REF_FREQ ALT_FREQ ALT_TYPE majorfreq minorfreq major minor SegmentSize STRAIN
H1N1_HA 1 1007 NA G A A missense_variant MODERATE CDS_H1N1_HA_1_1701 H1N1_HA transcript H1N1_HA.1 protein_coding 1/1 c.1007G>A p.Arg336Lys 1007/1701 1007/1701 336/566 796 38 834 H1N1-random3_m1_AF_random-a1_frac_0.01_BWA_freebayes_custom.ann 0.9544365 0.0455635 minor 0.9544365 0.0455635 G A 1701 H1N1
H1N1_HA 1 1145 NA T A A missense_variant MODERATE CDS_H1N1_HA_1_1701 H1N1_HA transcript H1N1_HA.1 protein_coding 1/1 c.1145T>A p.Leu382Gln 1145/1701 1145/1701 382/566 740 30 770 H1N1-random3_m1_AF_random-a1_frac_0.01_BWA_freebayes_custom.ann 0.9610390 0.0389610 minor 0.9610390 0.0389610 T A 1701 H1N1
H1N1_HA 1 1293 NA T C C synonymous_variant LOW CDS_H1N1_HA_1_1701 H1N1_HA transcript H1N1_HA.1 protein_coding 1/1 c.1293T>C p.Gly431Gly 1293/1701 1293/1701 431/566 554 17 571 H1N1-random3_m1_AF_random-a1_frac_0.01_BWA_freebayes_custom.ann 0.9702277 0.0297723 minor 0.9702277 0.0297723 T C 1701 H1N1
H1N1_HA 1 435 NA G A A synonymous_variant LOW CDS_H1N1_HA_1_1701 H1N1_HA transcript H1N1_HA.1 protein_coding 1/1 c.435G>A p.Ser145Ser 435/1701 435/1701 145/566 726 23 749 H1N1-random3_m1_AF_random-a1_frac_0.01_BWA_freebayes_custom.ann 0.9692924 0.0307076 minor 0.9692924 0.0307076 G A 1701 H1N1
H1N1_HA 1 539 NA A G G missense_variant MODERATE CDS_H1N1_HA_1_1701 H1N1_HA transcript H1N1_HA.1 protein_coding 1/1 c.539A>G p.Lys180Arg 539/1701 539/1701 180/566 745 19 764 H1N1-random3_m1_AF_random-a1_frac_0.01_BWA_freebayes_custom.ann 0.9751309 0.0248691 minor 0.9751309 0.0248691 A G 1701 H1N1
H1N1_HA 1 585 NA C T T synonymous_variant LOW CDS_H1N1_HA_1_1701 H1N1_HA transcript H1N1_HA.1 protein_coding 1/1 c.585C>T p.Gly195Gly 585/1701 585/1701 195/566 778 27 806 H1N1-random3_m2_AF_random-a1_frac_0.01_BWA_freebayes_custom.ann 0.9652605 0.0334988 minor 0.9652605 0.0334988 C T 1701 H1N1
dim(DF_filt)
#> [1] 1390   35

Merge replicates

Sequencing in replicate allows us to keep only those variants found in both runs, since sequencing or PCR errors are extremely unlikely to appear in independent experiments. For this function, the user must provide the variant dataframe and the loaded replicates dataframe. In addition, the user must provide, in quotes, the names used for the replicates (e.g. "Rep1", "r1", "rep1"), as well as a list of the columns to merge the two dataframes by. This list should contain any columns expected to be identical between the two replicates, such as the segment and nucleotide position of the variant and the SNPeff annotation information; it should not contain replicate-specific information such as the allele frequency of the variant. The function calculates the average allele frequency of both the ALT and REF alleles, as well as a weighted average that also accounts for the number of reads in each replicate. We expect these to be broadly similar unless one replicate has substantially higher coverage than the other.

cols = c("sample","CHROM","ChromKey","POS","REF","ALT","allele",
         "annotation","putative_impact","gene_name","gene_id","feature_type","feature_id","transcript_biotype",
         "rank_total", "HGVS.c","HGVS.p","cDNA_position","CDS_position","protein_position","distance_to_feature",
         "errors","ALT_TYPE","major","minor","SegmentSize","STRAIN")

DF_filt_reps = merge_replicates(DF_filt,replicates,"rep1","rep2",cols)

# To plot chroms/segments in a particular order, reorder the factor levels before plotting:
DF_filt_reps$CHROM = factor(DF_filt_reps$CHROM, levels = SEGMENTS)
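
To illustrate the difference between the simple and the read-weighted average reported by the merge, the snippet below recalculates both by hand for one variant, using the ALT counts and read depths of the two replicates from the first row of the merged dataframe printed in the next section. This is a standalone illustration, not necessarily the exact internal code of merge_replicates().

# Illustration: simple vs read-weighted averaging of a minor-allele frequency
# across two replicates (ALT counts 38/834 in rep1 and 35/816 in rep2)
alt_counts = c(38, 35)
depths     = c(834, 816)
mean(alt_counts / depths)      # simple average of the two frequencies, ~0.0442279
sum(alt_counts) / sum(depths)  # read-weighted average, ~0.0442424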

Step 2: Calculations

Calculate Shannon entropy

Shannon entropy is a commonly used metric for the amount of genetic diversity in sequencing data. It is calculated from the frequencies of the ALT and REF alleles at every position, and those per-position values are then summed over each segment or over the entire genome. The sums can also be normalized per kilobase to allow comparison across segments or samples. This function therefore adds five new columns to the dataframe, one for each of these calculations. The user must provide the variant dataframe and the length of the genome they are working with.

DF_filt_reps = shannon_entropy(DF_filt_reps,genome_size)

kable(head(DF_filt_reps))
sample CHROM ChromKey POS REF ALT allele annotation putative_impact gene_name gene_id feature_type feature_id transcript_biotype rank_total HGVS.c HGVS.p cDNA_position CDS_position protein_position distance_to_feature errors ALT_TYPE major minor SegmentSize STRAIN filename.x replicate.x ID.x REF_COUNT.x ALT_COUNT.x gt_DP.x REF_FREQ.x ALT_FREQ.x majorfreq.x minorfreq.x filename.y replicate.y ID.y REF_COUNT.y ALT_COUNT.y gt_DP.y REF_FREQ.y ALT_FREQ.y majorfreq.y minorfreq.y minorfreq majorfreq weighted_minorfreq weighted_majorfreq shannon_ntpos chrom_shannon genome_shannon shannon_chrom_perkb genome_shannon_perkb
a_1_fb H1N1_HA 1 1007 G A A missense_variant MODERATE CDS_H1N1_HA_1_1701 H1N1_HA transcript H1N1_HA.1 protein_coding 1/1 c.1007G>A p.Arg336Lys 1007/1701 1007/1701 336/566 minor G A 1701 H1N1 H1N1-random3_m1_AF_random-a1_frac_0.01_BWA_freebayes_custom.ann rep1 NA 796 38 834 0.9544365 0.0455635 0.9544365 0.0455635 H1N1-random3_m2_AF_random-a1_frac_0.01_BWA_freebayes_custom.ann rep2 NA 780 35 816 0.9558824 0.0428922 0.9558824 0.0428922 0.0442279 0.9551594 0.0442424 0.9551515 0.2621955 0.8846285 169.5044 0.5200638 1.29e-05
a_1_fb H1N1_HA 1 1145 T A A missense_variant MODERATE CDS_H1N1_HA_1_1701 H1N1_HA transcript H1N1_HA.1 protein_coding 1/1 c.1145T>A p.Leu382Gln 1145/1701 1145/1701 382/566 minor T A 1701 H1N1 H1N1-random3_m1_AF_random-a1_frac_0.01_BWA_freebayes_custom.ann rep1 NA 740 30 770 0.9610390 0.0389610 0.9610390 0.0389610 H1N1-random3_m2_AF_random-a1_frac_0.01_BWA_freebayes_custom.ann rep2 NA 781 29 810 0.9641975 0.0358025 0.9641975 0.0358025 0.0373818 0.9626182 0.0373418 0.9626582 0.2301561 0.8846285 169.5044 0.5200638 1.29e-05
a_1_fb H1N1_HA 1 435 G A A synonymous_variant LOW CDS_H1N1_HA_1_1701 H1N1_HA transcript H1N1_HA.1 protein_coding 1/1 c.435G>A p.Ser145Ser 435/1701 435/1701 145/566 minor G A 1701 H1N1 H1N1-random3_m1_AF_random-a1_frac_0.01_BWA_freebayes_custom.ann rep1 NA 726 23 749 0.9692924 0.0307076 0.9692924 0.0307076 H1N1-random3_m2_AF_random-a1_frac_0.01_BWA_freebayes_custom.ann rep2 NA 766 22 790 0.9696203 0.0278481 0.9696203 0.0278481 0.0292779 0.9694563 0.0292398 0.9694607 0.1925281 0.8846285 169.5044 0.5200638 1.29e-05
a_1_fb H1N1_HA 1 539 A G G missense_variant MODERATE CDS_H1N1_HA_1_1701 H1N1_HA transcript H1N1_HA.1 protein_coding 1/1 c.539A>G p.Lys180Arg 539/1701 539/1701 180/566 minor A G 1701 H1N1 H1N1-random3_m1_AF_random-a1_frac_0.01_BWA_freebayes_custom.ann rep1 NA 745 19 764 0.9751309 0.0248691 0.9751309 0.0248691 H1N1-random3_m2_AF_random-a1_frac_0.01_BWA_freebayes_custom.ann rep2 NA 762 29 793 0.9609079 0.0365700 0.9609079 0.0365700 0.0307195 0.9680194 0.0308285 0.9678870 0.1997490 0.8846285 169.5044 0.5200638 1.29e-05
a_1_fb H1N1_MP 2 642 T C ,C intron_variant MODIFIER CDS_H1N1_MP_1_759 H1N1_M1 transcript H1N1_M1.2 protein_coding 1/1 c.27-73T>C minor T C 982 H1N1 H1N1-random3_m1_AF_random-a1_frac_0.01_BWA_freebayes_custom.ann rep1 NA 824 29 853 0.9660023 0.0339977 0.9660023 0.0339977 H1N1-random3_m2_AF_random-a1_frac_0.01_BWA_freebayes_custom.ann rep2 NA 783 35 818 0.9572127 0.0427873 0.9572127 0.0427873 0.0383925 0.9616075 0.0383004 0.9616996 0.2348725 0.9888624 169.5044 1.0069882 1.29e-05
a_1_fb H1N1_MP 2 642 T C C synonymous_variant LOW CDS_H1N1_MP_1_759 H1N1_M1 transcript H1N1_M1.1 protein_coding 1/1 c.642T>C p.His214His 642/759 642/759 214/252 minor T C 982 H1N1 H1N1-random3_m1_AF_random-a1_frac_0.01_BWA_freebayes_custom.ann rep1 NA 824 29 853 0.9660023 0.0339977 0.9660023 0.0339977 H1N1-random3_m2_AF_random-a1_frac_0.01_BWA_freebayes_custom.ann rep2 NA 783 35 818 0.9572127 0.0427873 0.9572127 0.0427873 0.0383925 0.9616075 0.0383004 0.9616996 0.2348725 0.9888624 169.5044 1.0069882 1.29e-05
dim(DF_filt_reps)
#> [1] 586  56
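
For reference, the per-position value in the shannon_ntpos column corresponds to the entropy of the major and minor allele frequencies at that site. The sketch below recomputes it for the first row of the table above; the choice of log base 2 is an assumption that happens to reproduce the value shown.

# Sketch of the per-position entropy behind shannon_ntpos (log base assumed to be 2)
p = 0.0442279   # minorfreq, first row above
q = 0.9551594   # majorfreq, first row above
-(p * log2(p) + q * log2(q))   # ~0.2622, cf. shannon_ntpos in the first row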

Count number of SNVs

The tally_it() function counts the number of variants over a given set of grouping variables. These variables should be provided as a character vector and passed to the function along with the name, in quotes, of the new column that will hold the counts. For example, to count the variants on every segment, construct a vector containing the names of the sample and segment columns and pass it to the function.

# Across segments:
group_list_seg = c('sample','CHROM', "SegmentSize")
seg_count = tally_it(DF_filt_reps, group_list_seg, "snv_count")

kable(head(seg_count))
sample CHROM SegmentSize snv_count
a_1_fb H1N1_PB2 2280 13
a_1_fb H1N1_PB1 2274 13
a_1_fb H1N1_PA 2151 7
a_1_fb H1N1_HA 1701 4
a_1_fb H1N1_NP 1497 8
a_1_fb H1N1_NA 1410 4


# Across genome:
group_list_gen = c('sample')
gen_count = tally_it(DF_filt_reps, group_list_gen, "snv_count")

kable(head(gen_count))
sample snv_count
a_1_fb 58
a_1_iv 66
a_2_fb 63
a_2_iv 66
a_3_fb 64
a_3_iv 68
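
The same grouping approach works for any combination of columns. For example, here is a sketch (not run in this vignette) of tallying variants per sample and per SNPeff annotation class, using the annotation column created earlier:

# Not run: count variants per sample and per annotation class
# group_list_ann = c('sample', 'annotation')
# ann_count = tally_it(DF_filt_reps, group_list_ann, "snv_count")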

Calculate transition/transversion (Ts/Tv) ratio

The transition/transversion ratio is commonly used to test for a bias in nucleotide changes; transitions are expected to be more common than transversions. Here, the user must provide the variant dataframe and the length of the genome, as the ratio is calculated both per segment and over the whole genome.

DF_tstv = tstv_ratio(DF_filt_reps,genome_size)
kable(head(DF_tstv))
sample CHROM SegmentSize chrom_or_genome transition transversion tstv_ratio tstv_ratio_perkb
a_1_fb H1N1_PB2 2280 tstv_chrom_count 11 2 5.500000 2.4122807
a_1_fb H1N1_PB2 2280 tstv_genome_count 51 7 7.285714 0.5547639
a_1_fb H1N1_PB1 2274 tstv_chrom_count 11 2 5.500000 2.4186456
a_1_fb H1N1_PB1 2274 tstv_genome_count 51 7 7.285714 0.5547639
a_1_fb H1N1_PA 2151 tstv_chrom_count 6 1 6.000000 2.7894003
a_1_fb H1N1_PA 2151 tstv_genome_count 51 7 7.285714 0.5547639
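
For reference, a transition is a purine-to-purine (A<->G) or pyrimidine-to-pyrimidine (C<->T) change, and every other substitution is a transversion. The helper below is an illustration of that classification for a single REF/ALT pair, not the package's internal code.

# Illustration: classify a single REF/ALT substitution
is_transition = function(ref, alt) {
  purines = c("A", "G")
  pyrimidines = c("C", "T")
  (ref %in% purines & alt %in% purines) |
    (ref %in% pyrimidines & alt %in% pyrimidines)
}
is_transition("G", "A")  # TRUE  (transition)
is_transition("G", "C")  # FALSE (transversion)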

Step 3: Generate plots

Plot location of SNVs across segments

This function takes the variant dataframe and generates a large, interactive plot of all of the variants, faceted by segment for each individual sample. The user can hover over each point to get information about that variant, including the nucleotide change and its position on the segment.

#library(plotly)
snv_location(DF_filt_reps)

Plot number of SNVs per sample and per segment

The snv_genome() function takes the variant dataframe and plots the number of variants per genome, colored by their SNPeff annotation, while the snv_segment() function generates the same plot faceted by segment.

snv_genome(DF_filt_reps)

snv_segment(DF_filt_reps)

Plot Ts/Tv

This function takes the Ts/Tv dataframe generated above and produces four plots: the Ts/Tv ratio across the genome, the ratio for each segment, and both of these normalized per kilobase.

tstv_plot(DF_tstv)

Plot Shannon entropy per sample and per segment

This function takes the variant dataframe and generates three plots: the Shannon entropy value for every variant, the sum of these values over each segment, and the sum over the entire genome. A higher Shannon entropy value indicates more diversity.

# makes 3 plots
plot_shannon(DF_filt_reps)

Determine if variants are shared among samples

This function takes the variant dataframe and generates a plot that annotates each variant by the number of independent samples in which it appears; the darker the point, the more samples share that variant. Depending on the experimental design, a variant found in many samples could indicate convergent evolution and would be a good starting point for deeper investigation.

shared_snv_plot(DF_filt_reps)

Isolate variant of interest and plot AF at that position in all samples

This function allows users to focus on a single variant of interest and track the allele frequencies of the ALT and REF alleles in every sample in which it is found. To run it, the user must provide the variant dataframe along with the segment name and nucleotide position, in quotes, of the variant of interest. This can be run for any variant found in the sequencing data.

# provide the df, segment, and variant position
position_allele_freq(DF_filt_reps,"H1N1_HA", "1007")

Calculate dN/dS ratio and plot per sample and per segment

The dN/dS ratio is commonly used to test for a bias in amino acid changes. dN counts nonsynonymous mutations, nucleotide changes that alter an amino acid in the resulting protein, while dS counts synonymous mutations, which do not change the amino acid and are not expected to be under strong selection. Under neutral evolution the dN/dS ratio is expected to be approximately 1; a ratio > 1 indicates an enrichment of nonsynonymous changes, a hallmark of positive selection, while a ratio < 1 indicates a depletion of nonsynonymous changes, a hallmark of negative (purifying) selection. The user must provide a variant dataframe that contains information about the amino acid changes for all nucleotide variants; otherwise this metric will not accurately reflect the true value.

# dNdS_segment() is not run in this vignette: it requires a dataframe prepared for amino-acid-level calculations, which is not the same as the dataframe used for the SNP-level calculations above.
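
As a rough stand-in while dNdS_segment() is not run, the nonsynonymous and synonymous variant counts can be tallied directly from the annotation column. The sketch below assumes the dplyr package is installed; note that a simple count ratio ignores the number of available nonsynonymous and synonymous sites, so it is only a crude approximation of a true dN/dS.

library(dplyr)

# Crude count ratio of nonsynonymous to synonymous variants per sample and segment
ns_counts = DF_filt_reps %>%
  group_by(sample, CHROM) %>%
  summarise(
    n_nonsyn = sum(annotation == "missense_variant"),
    n_syn    = sum(annotation == "synonymous_variant"),
    ratio    = n_nonsyn / n_syn,   # Inf when a segment has no synonymous variants
    .groups  = "drop"
  )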

Plot distribution of all minor variant frequencies

This function takes the variant dataframe and generates a plot showing the distribution of the minor variants (those appearing in less than 50% of the reads of a sample). For many viruses, we expect the distribution to be skewed toward very low-frequency variants (< 10%).

af_distribution(DF_filt_reps)