This notebook is to show you how to use the package Vivaldi to analyze viral single nucleotide variants from next generation sequencing methods (e.g. Illumina). This vignette takes variant data in the form of VCF files and performs several widely-used analyses and generates plots for a preliminary investigation of the variants found. Users are encouraged to continue their own analyses afterwards in order to address more specific questions with their data.
library(vivaldi)
library(kableExtra)
The data used in this vignette are simulated influenza genomes with known mutations generated at random frequencies. (More description here)
Metadata used for the calculations includes a .csv file containing the lengths of each of the viral segments. For non-segmented viruses, simply report the whole length of the genome. If relevant, the user should also supply a .csv file containing information about sequencing replicates containing 3 columns: the filename of both replicates, which replicate the sample represents, and the new sample name for the merged replicates. The user should also give a string containing the segments in the desired order for plots.
# Set variables that we will use
system.file("extdata", "vcfs", package="vivaldi")
vardir = system.file("extdata", "SegmentSize.csv", package="vivaldi")
seg_sizes = system.file("extdata", "reps.csv", package="vivaldi")
rep_info =
read.csv(file=seg_sizes,header=T,sep=",",na.strings = c(''))
sizes = read.csv(file = rep_info, header = T, sep = ",", na.strings = c(""))
replicates =
13133
genome_size = c("H1N1_PB2","H1N1_PB1","H1N1_PA","H1N1_HA","H1N1_NP","H1N1_NA","H1N1_MP","H1N1_NS") SEGMENTS =
The first step of the work flow is to load the VCF files into R and extract the important information. There are currently two potential functions to use at this stage, we hope to merge them into a single function in the future. For VCF files that contain GT information, use the arrange_gt_data() function. For VCF files that do not conatin GT information, use the arrange_no_gt_data() function. In both cases, the result is a tidy dataframe that contains the sample name, pulled from the file name, and information about the reference and alternative allele. If the VCF has not previously been annotated by SNPeff (recommended), the user should specify annotated = “no” for either function.
arrange_gt_data(vardir, ref = system.file("extdata", "H1N1.fa", package="vivaldi"), annotated = 'yes')
VCF_DF =#> Number of chroms in fasta: 8
#> Length of input files: 24
#> Input vcf files include: c("/tmp/RtmpqQwdxD/temp_libpath13c4765db17151/vivaldi/extdata/vcfs/H1N1-random3_m1_AF_random-a1_frac_0.01_BWA_freebayes_custom.ann.vcf", "/tmp/RtmpqQwdxD/temp_libpath13c4765db17151/vivaldi/extdata/vcfs/H1N1-random3_m1_AF_random-a1_frac_0.01_BWA_ivar_custom.ann.vcf", "/tmp/RtmpqQwdxD/temp_libpath13c4765db17151/vivaldi/extdata/vcfs/H1N1-random3_m1_AF_random-a1_frac_0.01_BWA_varscan_custom.ann.vcf", "/tmp/RtmpqQwdxD/temp_libpath13c4765db17151/vivaldi/extdata/vcfs/H1N1-random3_m1_AF_random-a2_frac_0.01_BWA_freebayes_custom.ann.vcf",
#> "/tmp/RtmpqQwdxD/temp_libpath13c4765db17151/vivaldi/extdata/vcfs/H1N1-random3_m1_AF_random-a2_frac_0.01_BWA_ivar_custom.ann.vcf", "/tmp/RtmpqQwdxD/temp_libpath13c4765db17151/vivaldi/extdata/vcfs/H1N1-random3_m1_AF_random-a2_frac_0.01_BWA_varscan_custom.ann.vcf", "/tmp/RtmpqQwdxD/temp_libpath13c4765db17151/vivaldi/extdata/vcfs/H1N1-random3_m1_AF_random-a3_frac_0.01_BWA_freebayes_custom.ann.vcf", "/tmp/RtmpqQwdxD/temp_libpath13c4765db17151/vivaldi/extdata/vcfs/H1N1-random3_m1_AF_random-a3_frac_0.01_BWA_ivar_custom.ann.vcf",
#> "/tmp/RtmpqQwdxD/temp_libpath13c4765db17151/vivaldi/extdata/vcfs/H1N1-random3_m1_AF_random-a3_frac_0.01_BWA_varscan_custom.ann.vcf", "/tmp/RtmpqQwdxD/temp_libpath13c4765db17151/vivaldi/extdata/vcfs/H1N1-random3_m2_AF_random-a1_frac_0.01_BWA_freebayes_custom.ann.vcf", "/tmp/RtmpqQwdxD/temp_libpath13c4765db17151/vivaldi/extdata/vcfs/H1N1-random3_m2_AF_random-a1_frac_0.01_BWA_ivar_custom.ann.vcf", "/tmp/RtmpqQwdxD/temp_libpath13c4765db17151/vivaldi/extdata/vcfs/H1N1-random3_m2_AF_random-a1_frac_0.01_BWA_varscan_custom.ann.vcf",
#> "/tmp/RtmpqQwdxD/temp_libpath13c4765db17151/vivaldi/extdata/vcfs/H1N1-random3_m2_AF_random-a2_frac_0.01_BWA_freebayes_custom.ann.vcf", "/tmp/RtmpqQwdxD/temp_libpath13c4765db17151/vivaldi/extdata/vcfs/H1N1-random3_m2_AF_random-a2_frac_0.01_BWA_ivar_custom.ann.vcf", "/tmp/RtmpqQwdxD/temp_libpath13c4765db17151/vivaldi/extdata/vcfs/H1N1-random3_m2_AF_random-a2_frac_0.01_BWA_varscan_custom.ann.vcf", "/tmp/RtmpqQwdxD/temp_libpath13c4765db17151/vivaldi/extdata/vcfs/H1N1-random3_m2_AF_random-a3_frac_0.01_BWA_freebayes_custom.ann.vcf",
#> "/tmp/RtmpqQwdxD/temp_libpath13c4765db17151/vivaldi/extdata/vcfs/H1N1-random3_m2_AF_random-a3_frac_0.01_BWA_ivar_custom.ann.vcf", "/tmp/RtmpqQwdxD/temp_libpath13c4765db17151/vivaldi/extdata/vcfs/H1N1-random3_m2_AF_random-a3_frac_0.01_BWA_varscan_custom.ann.vcf", "/tmp/RtmpqQwdxD/temp_libpath13c4765db17151/vivaldi/extdata/vcfs/H1N1-random4_m1_AF_random-a4_frac_0.01_BWA_freebayes_custom.ann.vcf", "/tmp/RtmpqQwdxD/temp_libpath13c4765db17151/vivaldi/extdata/vcfs/H1N1-random4_m1_AF_random-a4_frac_0.01_BWA_ivar_custom.ann.vcf",
#> "/tmp/RtmpqQwdxD/temp_libpath13c4765db17151/vivaldi/extdata/vcfs/H1N1-random4_m1_AF_random-a4_frac_0.01_BWA_varscan_custom.ann.vcf", "/tmp/RtmpqQwdxD/temp_libpath13c4765db17151/vivaldi/extdata/vcfs/H1N1-random4_m2_AF_random-a4_frac_0.01_BWA_freebayes_custom.ann.vcf", "/tmp/RtmpqQwdxD/temp_libpath13c4765db17151/vivaldi/extdata/vcfs/H1N1-random4_m2_AF_random-a4_frac_0.01_BWA_ivar_custom.ann.vcf", "/tmp/RtmpqQwdxD/temp_libpath13c4765db17151/vivaldi/extdata/vcfs/H1N1-random4_m2_AF_random-a4_frac_0.01_BWA_varscan_custom.ann.vcf"
#> )
#> Sample name is: H1N1-random3_m1_AF_random-a1_frac_0.01_BWA_freebayes_custom.ann
#> Scanning file to determine attributes.
#> File attributes:
#> meta lines: 73
#> header_line: 74
#> variant count: 107
#> column count: 10
#>
73 read in.
Meta line #> All meta lines processed.
#> gt matrix initialized.
#> Character matrix gt created.
#> Character matrix gt rows: 107
#> Character matrix gt cols: 10
#> skip: 0
#> nrows: 107
#> row_num: 0
#>
: 107
Processed variant#> All variants processed
#> Extracting gt element GT
#> Extracting gt element GQ
#> Extracting gt element GL
#> Extracting gt element DP
#> Extracting gt element AD
#> Extracting gt element RO
#> Extracting gt element QR
#> Extracting gt element AO
#> Extracting gt element QA
#> Extracting gt element MIN_DP
#> Sample name is: H1N1-random3_m1_AF_random-a1_frac_0.01_BWA_ivar_custom.ann
#> Scanning file to determine attributes.
#> File attributes:
#> meta lines: 12
#> header_line: 13
#> variant count: 107
#> column count: 10
#>
12 read in.
Meta line #> All meta lines processed.
#> gt matrix initialized.
#> Character matrix gt created.
#> Character matrix gt rows: 107
#> Character matrix gt cols: 10
#> skip: 0
#> nrows: 107
#> row_num: 0
#>
: 107
Processed variant#> All variants processed
#> Extracting gt element GT
#> Extracting gt element AD
#> Extracting gt element DP
#> Extracting gt element GQ
#> Sample name is: H1N1-random3_m1_AF_random-a1_frac_0.01_BWA_varscan_custom.ann
#> Scanning file to determine attributes.
#> File attributes:
#> meta lines: 28
#> header_line: 29
#> variant count: 104
#> column count: 10
#>
28 read in.
Meta line #> All meta lines processed.
#> gt matrix initialized.
#> Character matrix gt created.
#> Character matrix gt rows: 104
#> Character matrix gt cols: 10
#> skip: 0
#> nrows: 104
#> row_num: 0
#>
: 104
Processed variant#> All variants processed
#> Extracting gt element GT
#> Extracting gt element GQ
#> Extracting gt element SDP
#> Extracting gt element DP
#> Extracting gt element RD
#> Extracting gt element AD
#> Extracting gt element FREQ
#> Extracting gt element PVAL
#> Extracting gt element RBQ
#> Extracting gt element ABQ
#> Extracting gt element RDF
#> Extracting gt element RDR
#> Extracting gt element ADF
#> Extracting gt element ADR
#> Warning: Expected 2 pieces. Missing pieces filled with `NA` in 104 rows [1, 2,
#> 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, ...].
#> Sample name is: H1N1-random3_m1_AF_random-a2_frac_0.01_BWA_freebayes_custom.ann
#> Scanning file to determine attributes.
#> File attributes:
#> meta lines: 73
#> header_line: 74
#> variant count: 108
#> column count: 10
#>
73 read in.
Meta line #> All meta lines processed.
#> gt matrix initialized.
#> Character matrix gt created.
#> Character matrix gt rows: 108
#> Character matrix gt cols: 10
#> skip: 0
#> nrows: 108
#> row_num: 0
#>
: 108
Processed variant#> All variants processed
#> Extracting gt element GT
#> Extracting gt element GQ
#> Extracting gt element GL
#> Extracting gt element DP
#> Extracting gt element AD
#> Extracting gt element RO
#> Extracting gt element QR
#> Extracting gt element AO
#> Extracting gt element QA
#> Extracting gt element MIN_DP
#> Sample name is: H1N1-random3_m1_AF_random-a2_frac_0.01_BWA_ivar_custom.ann
#> Scanning file to determine attributes.
#> File attributes:
#> meta lines: 12
#> header_line: 13
#> variant count: 109
#> column count: 10
#>
12 read in.
Meta line #> All meta lines processed.
#> gt matrix initialized.
#> Character matrix gt created.
#> Character matrix gt rows: 109
#> Character matrix gt cols: 10
#> skip: 0
#> nrows: 109
#> row_num: 0
#>
: 109
Processed variant#> All variants processed
#> Extracting gt element GT
#> Extracting gt element AD
#> Extracting gt element DP
#> Extracting gt element GQ
#> Sample name is: H1N1-random3_m1_AF_random-a2_frac_0.01_BWA_varscan_custom.ann
#> Scanning file to determine attributes.
#> File attributes:
#> meta lines: 28
#> header_line: 29
#> variant count: 105
#> column count: 10
#>
28 read in.
Meta line #> All meta lines processed.
#> gt matrix initialized.
#> Character matrix gt created.
#> Character matrix gt rows: 105
#> Character matrix gt cols: 10
#> skip: 0
#> nrows: 105
#> row_num: 0
#>
: 105
Processed variant#> All variants processed
#> Extracting gt element GT
#> Extracting gt element GQ
#> Extracting gt element SDP
#> Extracting gt element DP
#> Extracting gt element RD
#> Extracting gt element AD
#> Extracting gt element FREQ
#> Extracting gt element PVAL
#> Extracting gt element RBQ
#> Extracting gt element ABQ
#> Extracting gt element RDF
#> Extracting gt element RDR
#> Extracting gt element ADF
#> Extracting gt element ADR
#> Warning: Expected 2 pieces. Missing pieces filled with `NA` in 105 rows [1, 2,
#> 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, ...].
#> Sample name is: H1N1-random3_m1_AF_random-a3_frac_0.01_BWA_freebayes_custom.ann
#> Scanning file to determine attributes.
#> File attributes:
#> meta lines: 73
#> header_line: 74
#> variant count: 110
#> column count: 10
#>
73 read in.
Meta line #> All meta lines processed.
#> gt matrix initialized.
#> Character matrix gt created.
#> Character matrix gt rows: 110
#> Character matrix gt cols: 10
#> skip: 0
#> nrows: 110
#> row_num: 0
#>
: 110
Processed variant#> All variants processed
#> Extracting gt element GT
#> Extracting gt element GQ
#> Extracting gt element GL
#> Extracting gt element DP
#> Extracting gt element AD
#> Extracting gt element RO
#> Extracting gt element QR
#> Extracting gt element AO
#> Extracting gt element QA
#> Extracting gt element MIN_DP
#> Sample name is: H1N1-random3_m1_AF_random-a3_frac_0.01_BWA_ivar_custom.ann
#> Scanning file to determine attributes.
#> File attributes:
#> meta lines: 12
#> header_line: 13
#> variant count: 110
#> column count: 10
#>
12 read in.
Meta line #> All meta lines processed.
#> gt matrix initialized.
#> Character matrix gt created.
#> Character matrix gt rows: 110
#> Character matrix gt cols: 10
#> skip: 0
#> nrows: 110
#> row_num: 0
#>
: 110
Processed variant#> All variants processed
#> Extracting gt element GT
#> Extracting gt element AD
#> Extracting gt element DP
#> Extracting gt element GQ
#> Sample name is: H1N1-random3_m1_AF_random-a3_frac_0.01_BWA_varscan_custom.ann
#> Scanning file to determine attributes.
#> File attributes:
#> meta lines: 28
#> header_line: 29
#> variant count: 108
#> column count: 10
#>
28 read in.
Meta line #> All meta lines processed.
#> gt matrix initialized.
#> Character matrix gt created.
#> Character matrix gt rows: 108
#> Character matrix gt cols: 10
#> skip: 0
#> nrows: 108
#> row_num: 0
#>
: 108
Processed variant#> All variants processed
#> Extracting gt element GT
#> Extracting gt element GQ
#> Extracting gt element SDP
#> Extracting gt element DP
#> Extracting gt element RD
#> Extracting gt element AD
#> Extracting gt element FREQ
#> Extracting gt element PVAL
#> Extracting gt element RBQ
#> Extracting gt element ABQ
#> Extracting gt element RDF
#> Extracting gt element RDR
#> Extracting gt element ADF
#> Extracting gt element ADR
#> Warning: Expected 2 pieces. Missing pieces filled with `NA` in 108 rows [1, 2,
#> 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, ...].
#> Sample name is: H1N1-random3_m2_AF_random-a1_frac_0.01_BWA_freebayes_custom.ann
#> Scanning file to determine attributes.
#> File attributes:
#> meta lines: 73
#> header_line: 74
#> variant count: 105
#> column count: 10
#>
73 read in.
Meta line #> All meta lines processed.
#> gt matrix initialized.
#> Character matrix gt created.
#> Character matrix gt rows: 105
#> Character matrix gt cols: 10
#> skip: 0
#> nrows: 105
#> row_num: 0
#>
: 105
Processed variant#> All variants processed
#> Extracting gt element GT
#> Extracting gt element GQ
#> Extracting gt element GL
#> Extracting gt element DP
#> Extracting gt element AD
#> Extracting gt element RO
#> Extracting gt element QR
#> Extracting gt element AO
#> Extracting gt element QA
#> Extracting gt element MIN_DP
#> Sample name is: H1N1-random3_m2_AF_random-a1_frac_0.01_BWA_ivar_custom.ann
#> Scanning file to determine attributes.
#> File attributes:
#> meta lines: 12
#> header_line: 13
#> variant count: 104
#> column count: 10
#>
12 read in.
Meta line #> All meta lines processed.
#> gt matrix initialized.
#> Character matrix gt created.
#> Character matrix gt rows: 104
#> Character matrix gt cols: 10
#> skip: 0
#> nrows: 104
#> row_num: 0
#>
: 104
Processed variant#> All variants processed
#> Extracting gt element GT
#> Extracting gt element AD
#> Extracting gt element DP
#> Extracting gt element GQ
#> Sample name is: H1N1-random3_m2_AF_random-a1_frac_0.01_BWA_varscan_custom.ann
#> Scanning file to determine attributes.
#> File attributes:
#> meta lines: 28
#> header_line: 29
#> variant count: 103
#> column count: 10
#>
28 read in.
Meta line #> All meta lines processed.
#> gt matrix initialized.
#> Character matrix gt created.
#> Character matrix gt rows: 103
#> Character matrix gt cols: 10
#> skip: 0
#> nrows: 103
#> row_num: 0
#>
: 103
Processed variant#> All variants processed
#> Extracting gt element GT
#> Extracting gt element GQ
#> Extracting gt element SDP
#> Extracting gt element DP
#> Extracting gt element RD
#> Extracting gt element AD
#> Extracting gt element FREQ
#> Extracting gt element PVAL
#> Extracting gt element RBQ
#> Extracting gt element ABQ
#> Extracting gt element RDF
#> Extracting gt element RDR
#> Extracting gt element ADF
#> Extracting gt element ADR
#> Warning: Expected 2 pieces. Missing pieces filled with `NA` in 103 rows [1, 2,
#> 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, ...].
#> Sample name is: H1N1-random3_m2_AF_random-a2_frac_0.01_BWA_freebayes_custom.ann
#> Scanning file to determine attributes.
#> File attributes:
#> meta lines: 73
#> header_line: 74
#> variant count: 105
#> column count: 10
#>
73 read in.
Meta line #> All meta lines processed.
#> gt matrix initialized.
#> Character matrix gt created.
#> Character matrix gt rows: 105
#> Character matrix gt cols: 10
#> skip: 0
#> nrows: 105
#> row_num: 0
#>
: 105
Processed variant#> All variants processed
#> Extracting gt element GT
#> Extracting gt element GQ
#> Extracting gt element GL
#> Extracting gt element DP
#> Extracting gt element AD
#> Extracting gt element RO
#> Extracting gt element QR
#> Extracting gt element AO
#> Extracting gt element QA
#> Extracting gt element MIN_DP
#> Sample name is: H1N1-random3_m2_AF_random-a2_frac_0.01_BWA_ivar_custom.ann
#> Scanning file to determine attributes.
#> File attributes:
#> meta lines: 12
#> header_line: 13
#> variant count: 103
#> column count: 10
#>
12 read in.
Meta line #> All meta lines processed.
#> gt matrix initialized.
#> Character matrix gt created.
#> Character matrix gt rows: 103
#> Character matrix gt cols: 10
#> skip: 0
#> nrows: 103
#> row_num: 0
#>
: 103
Processed variant#> All variants processed
#> Extracting gt element GT
#> Extracting gt element AD
#> Extracting gt element DP
#> Extracting gt element GQ
#> Sample name is: H1N1-random3_m2_AF_random-a2_frac_0.01_BWA_varscan_custom.ann
#> Scanning file to determine attributes.
#> File attributes:
#> meta lines: 28
#> header_line: 29
#> variant count: 102
#> column count: 10
#>
28 read in.
Meta line #> All meta lines processed.
#> gt matrix initialized.
#> Character matrix gt created.
#> Character matrix gt rows: 102
#> Character matrix gt cols: 10
#> skip: 0
#> nrows: 102
#> row_num: 0
#>
: 102
Processed variant#> All variants processed
#> Extracting gt element GT
#> Extracting gt element GQ
#> Extracting gt element SDP
#> Extracting gt element DP
#> Extracting gt element RD
#> Extracting gt element AD
#> Extracting gt element FREQ
#> Extracting gt element PVAL
#> Extracting gt element RBQ
#> Extracting gt element ABQ
#> Extracting gt element RDF
#> Extracting gt element RDR
#> Extracting gt element ADF
#> Extracting gt element ADR
#> Warning: Expected 2 pieces. Missing pieces filled with `NA` in 102 rows [1, 2,
#> 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, ...].
#> Sample name is: H1N1-random3_m2_AF_random-a3_frac_0.01_BWA_freebayes_custom.ann
#> Scanning file to determine attributes.
#> File attributes:
#> meta lines: 73
#> header_line: 74
#> variant count: 107
#> column count: 10
#>
73 read in.
Meta line #> All meta lines processed.
#> gt matrix initialized.
#> Character matrix gt created.
#> Character matrix gt rows: 107
#> Character matrix gt cols: 10
#> skip: 0
#> nrows: 107
#> row_num: 0
#>
: 107
Processed variant#> All variants processed
#> Extracting gt element GT
#> Extracting gt element GQ
#> Extracting gt element GL
#> Extracting gt element DP
#> Extracting gt element AD
#> Extracting gt element RO
#> Extracting gt element QR
#> Extracting gt element AO
#> Extracting gt element QA
#> Extracting gt element MIN_DP
#> Sample name is: H1N1-random3_m2_AF_random-a3_frac_0.01_BWA_ivar_custom.ann
#> Scanning file to determine attributes.
#> File attributes:
#> meta lines: 12
#> header_line: 13
#> variant count: 105
#> column count: 10
#>
12 read in.
Meta line #> All meta lines processed.
#> gt matrix initialized.
#> Character matrix gt created.
#> Character matrix gt rows: 105
#> Character matrix gt cols: 10
#> skip: 0
#> nrows: 105
#> row_num: 0
#>
: 105
Processed variant#> All variants processed
#> Extracting gt element GT
#> Extracting gt element AD
#> Extracting gt element DP
#> Extracting gt element GQ
#> Sample name is: H1N1-random3_m2_AF_random-a3_frac_0.01_BWA_varscan_custom.ann
#> Scanning file to determine attributes.
#> File attributes:
#> meta lines: 28
#> header_line: 29
#> variant count: 101
#> column count: 10
#>
28 read in.
Meta line #> All meta lines processed.
#> gt matrix initialized.
#> Character matrix gt created.
#> Character matrix gt rows: 101
#> Character matrix gt cols: 10
#> skip: 0
#> nrows: 101
#> row_num: 0
#>
: 101
Processed variant#> All variants processed
#> Extracting gt element GT
#> Extracting gt element GQ
#> Extracting gt element SDP
#> Extracting gt element DP
#> Extracting gt element RD
#> Extracting gt element AD
#> Extracting gt element FREQ
#> Extracting gt element PVAL
#> Extracting gt element RBQ
#> Extracting gt element ABQ
#> Extracting gt element RDF
#> Extracting gt element RDR
#> Extracting gt element ADF
#> Extracting gt element ADR
#> Warning: Expected 2 pieces. Missing pieces filled with `NA` in 101 rows [1, 2,
#> 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, ...].
#> Sample name is: H1N1-random4_m1_AF_random-a4_frac_0.01_BWA_freebayes_custom.ann
#> Scanning file to determine attributes.
#> File attributes:
#> meta lines: 73
#> header_line: 74
#> variant count: 99
#> column count: 10
#>
73 read in.
Meta line #> All meta lines processed.
#> gt matrix initialized.
#> Character matrix gt created.
#> Character matrix gt rows: 99
#> Character matrix gt cols: 10
#> skip: 0
#> nrows: 99
#> row_num: 0
#>
: 99
Processed variant#> All variants processed
#> Extracting gt element GT
#> Extracting gt element GQ
#> Extracting gt element GL
#> Extracting gt element DP
#> Extracting gt element AD
#> Extracting gt element RO
#> Extracting gt element QR
#> Extracting gt element AO
#> Extracting gt element QA
#> Extracting gt element MIN_DP
#> Sample name is: H1N1-random4_m1_AF_random-a4_frac_0.01_BWA_ivar_custom.ann
#> Scanning file to determine attributes.
#> File attributes:
#> meta lines: 12
#> header_line: 13
#> variant count: 101
#> column count: 10
#>
12 read in.
Meta line #> All meta lines processed.
#> gt matrix initialized.
#> Character matrix gt created.
#> Character matrix gt rows: 101
#> Character matrix gt cols: 10
#> skip: 0
#> nrows: 101
#> row_num: 0
#>
: 101
Processed variant#> All variants processed
#> Extracting gt element GT
#> Extracting gt element AD
#> Extracting gt element DP
#> Extracting gt element GQ
#> Sample name is: H1N1-random4_m1_AF_random-a4_frac_0.01_BWA_varscan_custom.ann
#> Scanning file to determine attributes.
#> File attributes:
#> meta lines: 28
#> header_line: 29
#> variant count: 100
#> column count: 10
#>
28 read in.
Meta line #> All meta lines processed.
#> gt matrix initialized.
#> Character matrix gt created.
#> Character matrix gt rows: 100
#> Character matrix gt cols: 10
#> skip: 0
#> nrows: 100
#> row_num: 0
#>
: 100
Processed variant#> All variants processed
#> Extracting gt element GT
#> Extracting gt element GQ
#> Extracting gt element SDP
#> Extracting gt element DP
#> Extracting gt element RD
#> Extracting gt element AD
#> Extracting gt element FREQ
#> Extracting gt element PVAL
#> Extracting gt element RBQ
#> Extracting gt element ABQ
#> Extracting gt element RDF
#> Extracting gt element RDR
#> Extracting gt element ADF
#> Extracting gt element ADR
#> Warning: Expected 2 pieces. Missing pieces filled with `NA` in 100 rows [1, 2,
#> 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, ...].
#> Sample name is: H1N1-random4_m2_AF_random-a4_frac_0.01_BWA_freebayes_custom.ann
#> Scanning file to determine attributes.
#> File attributes:
#> meta lines: 73
#> header_line: 74
#> variant count: 104
#> column count: 10
#>
73 read in.
Meta line #> All meta lines processed.
#> gt matrix initialized.
#> Character matrix gt created.
#> Character matrix gt rows: 104
#> Character matrix gt cols: 10
#> skip: 0
#> nrows: 104
#> row_num: 0
#>
: 104
Processed variant#> All variants processed
#> Extracting gt element GT
#> Extracting gt element GQ
#> Extracting gt element GL
#> Extracting gt element DP
#> Extracting gt element AD
#> Extracting gt element RO
#> Extracting gt element QR
#> Extracting gt element AO
#> Extracting gt element QA
#> Extracting gt element MIN_DP
#> Sample name is: H1N1-random4_m2_AF_random-a4_frac_0.01_BWA_ivar_custom.ann
#> Scanning file to determine attributes.
#> File attributes:
#> meta lines: 12
#> header_line: 13
#> variant count: 106
#> column count: 10
#>
12 read in.
Meta line #> All meta lines processed.
#> gt matrix initialized.
#> Character matrix gt created.
#> Character matrix gt rows: 106
#> Character matrix gt cols: 10
#> skip: 0
#> nrows: 106
#> row_num: 0
#>
: 106
Processed variant#> All variants processed
#> Extracting gt element GT
#> Extracting gt element AD
#> Extracting gt element DP
#> Extracting gt element GQ
#> Sample name is: H1N1-random4_m2_AF_random-a4_frac_0.01_BWA_varscan_custom.ann
#> Scanning file to determine attributes.
#> File attributes:
#> meta lines: 28
#> header_line: 29
#> variant count: 103
#> column count: 10
#>
28 read in.
Meta line #> All meta lines processed.
#> gt matrix initialized.
#> Character matrix gt created.
#> Character matrix gt rows: 103
#> Character matrix gt cols: 10
#> skip: 0
#> nrows: 103
#> row_num: 0
#>
: 103
Processed variant#> All variants processed
#> Extracting gt element GT
#> Extracting gt element GQ
#> Extracting gt element SDP
#> Extracting gt element DP
#> Extracting gt element RD
#> Extracting gt element AD
#> Extracting gt element FREQ
#> Extracting gt element PVAL
#> Extracting gt element RBQ
#> Extracting gt element ABQ
#> Extracting gt element RDF
#> Extracting gt element RDR
#> Extracting gt element ADF
#> Extracting gt element ADR
#> Warning: Expected 2 pieces. Missing pieces filled with `NA` in 103 rows [1, 2,
#> 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, ...].
#todo: merge three arrange_* functions into a single function
kable(head(VCF_DF))
ChromKey | POS | CHROM | ID | REF | ALT | ANN | REF_COUNT | ALT_COUNT | gt_DP | sample | REF_FREQ | ALT_FREQ | ALT_TYPE | majorfreq | minorfreq | major | minor |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 1007 | H1N1_HA | NA | G | A | A|missense_variant|MODERATE|CDS_H1N1_HA_1_1701|H1N1_HA|transcript|H1N1_HA.1|protein_coding|1/1|c.1007G>A|p.Arg336Lys|1007/1701|1007/1701|336/566|| | 796 | 38 | 834 | H1N1-random3_m1_AF_random-a1_frac_0.01_BWA_freebayes_custom.ann | 0.9544365 | 0.0455635 | minor | 0.9544365 | 0.0455635 | G | A |
1 | 1145 | H1N1_HA | NA | T | A | A|missense_variant|MODERATE|CDS_H1N1_HA_1_1701|H1N1_HA|transcript|H1N1_HA.1|protein_coding|1/1|c.1145T>A|p.Leu382Gln|1145/1701|1145/1701|382/566|| | 740 | 30 | 770 | H1N1-random3_m1_AF_random-a1_frac_0.01_BWA_freebayes_custom.ann | 0.9610390 | 0.0389610 | minor | 0.9610390 | 0.0389610 | T | A |
1 | 1293 | H1N1_HA | NA | T | C | C|synonymous_variant|LOW|CDS_H1N1_HA_1_1701|H1N1_HA|transcript|H1N1_HA.1|protein_coding|1/1|c.1293T>C|p.Gly431Gly|1293/1701|1293/1701|431/566|| | 554 | 17 | 571 | H1N1-random3_m1_AF_random-a1_frac_0.01_BWA_freebayes_custom.ann | 0.9702277 | 0.0297723 | minor | 0.9702277 | 0.0297723 | T | C |
1 | 1319 | H1N1_HA | NA | C | T | T|missense_variant|MODERATE|CDS_H1N1_HA_1_1701|H1N1_HA|transcript|H1N1_HA.1|protein_coding|1/1|c.1319C>T|p.Ala440Val|1319/1701|1319/1701|440/566|| | 580 | 7 | 587 | H1N1-random3_m1_AF_random-a1_frac_0.01_BWA_freebayes_custom.ann | 0.9880750 | 0.0119250 | minor | 0.9880750 | 0.0119250 | C | T |
1 | 1386 | H1N1_HA | NA | A | G | G|synonymous_variant|LOW|CDS_H1N1_HA_1_1701|H1N1_HA|transcript|H1N1_HA.1|protein_coding|1/1|c.1386A>G|p.Leu462Leu|1386/1701|1386/1701|462/566|| | 545 | 7 | 552 | H1N1-random3_m1_AF_random-a1_frac_0.01_BWA_freebayes_custom.ann | 0.9873188 | 0.0126812 | minor | 0.9873188 | 0.0126812 | A | G |
1 | 1505 | H1N1_HA | NA | A | G | G|missense_variant|MODERATE|CDS_H1N1_HA_1_1701|H1N1_HA|transcript|H1N1_HA.1|protein_coding|1/1|c.1505A>G|p.Asp502Gly|1505/1701|1505/1701|502/566|| | 380 | 5 | 386 | H1N1-random3_m1_AF_random-a1_frac_0.01_BWA_freebayes_custom.ann | 0.9844560 | 0.0129534 | minor | 0.9844560 | 0.0129534 | A | G |
dim(VCF_DF)
#> [1] 2493 18
This function provides the user with the opportunity to filter the variants by frequency or coverage of the ALT allele. Based on (our paper), we recommend 1% allele frequency and no coverage cutoff for sequencing data with replicates. For sequencing data without replicates, a much more stringent filtering step is recommended to ensure confidence in the variants called. We suggest a 3% allele frequency and 200X coverage for data without replicates.
# Use default coverage (300) and frequency (0.02) cutoffs
# or set custom values
400
cov_cutoff = 0.02
freq_cutoff = filter_variants(VCF_DF)
DF_filt =#> Total number of SNP filtered out: 1263
#DF_filt = filter_variants(VCF_DF, coverage_cutoff=cov_cutoff, frequency_cutoff=freq_cutoff)
kable(head(DF_filt))
ChromKey | POS | CHROM | ID | REF | ALT | ANN | REF_COUNT | ALT_COUNT | gt_DP | sample | REF_FREQ | ALT_FREQ | ALT_TYPE | majorfreq | minorfreq | major | minor |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 1007 | H1N1_HA | NA | G | A | A|missense_variant|MODERATE|CDS_H1N1_HA_1_1701|H1N1_HA|transcript|H1N1_HA.1|protein_coding|1/1|c.1007G>A|p.Arg336Lys|1007/1701|1007/1701|336/566|| | 796 | 38 | 834 | H1N1-random3_m1_AF_random-a1_frac_0.01_BWA_freebayes_custom.ann | 0.9544365 | 0.0455635 | minor | 0.9544365 | 0.0455635 | G | A |
1 | 1145 | H1N1_HA | NA | T | A | A|missense_variant|MODERATE|CDS_H1N1_HA_1_1701|H1N1_HA|transcript|H1N1_HA.1|protein_coding|1/1|c.1145T>A|p.Leu382Gln|1145/1701|1145/1701|382/566|| | 740 | 30 | 770 | H1N1-random3_m1_AF_random-a1_frac_0.01_BWA_freebayes_custom.ann | 0.9610390 | 0.0389610 | minor | 0.9610390 | 0.0389610 | T | A |
1 | 1293 | H1N1_HA | NA | T | C | C|synonymous_variant|LOW|CDS_H1N1_HA_1_1701|H1N1_HA|transcript|H1N1_HA.1|protein_coding|1/1|c.1293T>C|p.Gly431Gly|1293/1701|1293/1701|431/566|| | 554 | 17 | 571 | H1N1-random3_m1_AF_random-a1_frac_0.01_BWA_freebayes_custom.ann | 0.9702277 | 0.0297723 | minor | 0.9702277 | 0.0297723 | T | C |
1 | 435 | H1N1_HA | NA | G | A | A|synonymous_variant|LOW|CDS_H1N1_HA_1_1701|H1N1_HA|transcript|H1N1_HA.1|protein_coding|1/1|c.435G>A|p.Ser145Ser|435/1701|435/1701|145/566|| | 726 | 23 | 749 | H1N1-random3_m1_AF_random-a1_frac_0.01_BWA_freebayes_custom.ann | 0.9692924 | 0.0307076 | minor | 0.9692924 | 0.0307076 | G | A |
1 | 539 | H1N1_HA | NA | A | G | G|missense_variant|MODERATE|CDS_H1N1_HA_1_1701|H1N1_HA|transcript|H1N1_HA.1|protein_coding|1/1|c.539A>G|p.Lys180Arg|539/1701|539/1701|180/566|| | 745 | 19 | 764 | H1N1-random3_m1_AF_random-a1_frac_0.01_BWA_freebayes_custom.ann | 0.9751309 | 0.0248691 | minor | 0.9751309 | 0.0248691 | A | G |
2 | 500 | H1N1_MP | NA | C | T | T|missense_variant|MODERATE|CDS_H1N1_MP_1_759|H1N1_M1|transcript|H1N1_M1.1|protein_coding|1/1|c.500C>T|p.Thr167Ile|500/759|500/759|167/252||,T|intron_variant|MODIFIER|CDS_H1N1_MP_1_759|H1N1_M1|transcript|H1N1_M1.2|protein_coding|1/1|c.27-215C>T|||||| | 1163 | 28 | 1192 | H1N1-random3_m1_AF_random-a1_frac_0.01_BWA_freebayes_custom.ann | 0.9756711 | 0.0234899 | minor | 0.9756711 | 0.0234899 | C | T |
dim(DF_filt)
#> [1] 1230 18
This function takes the one large annotation column containing annotation information from SNPeff and separates them out into individual columns with one attribute each.
prepare_annotations(DF_filt)
DF_filt =#> >2 annotations: character(0)
kable(head(DF_filt))
ChromKey | POS | CHROM | ID | REF | ALT | allele | annotation | putative_impact | gene_name | gene_id | feature_type | feature_id | transcript_biotype | rank_total | HGVS.c | HGVS.p | cDNA_position | CDS_position | protein_position | distance_to_feature | errors | REF_COUNT | ALT_COUNT | gt_DP | sample | REF_FREQ | ALT_FREQ | ALT_TYPE | majorfreq | minorfreq | major | minor |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 1007 | H1N1_HA | NA | G | A | A | missense_variant | MODERATE | CDS_H1N1_HA_1_1701 | H1N1_HA | transcript | H1N1_HA.1 | protein_coding | 1/1 | c.1007G>A | p.Arg336Lys | 1007/1701 | 1007/1701 | 336/566 | 796 | 38 | 834 | H1N1-random3_m1_AF_random-a1_frac_0.01_BWA_freebayes_custom.ann | 0.9544365 | 0.0455635 | minor | 0.9544365 | 0.0455635 | G | A | ||
1 | 1145 | H1N1_HA | NA | T | A | A | missense_variant | MODERATE | CDS_H1N1_HA_1_1701 | H1N1_HA | transcript | H1N1_HA.1 | protein_coding | 1/1 | c.1145T>A | p.Leu382Gln | 1145/1701 | 1145/1701 | 382/566 | 740 | 30 | 770 | H1N1-random3_m1_AF_random-a1_frac_0.01_BWA_freebayes_custom.ann | 0.9610390 | 0.0389610 | minor | 0.9610390 | 0.0389610 | T | A | ||
1 | 1293 | H1N1_HA | NA | T | C | C | synonymous_variant | LOW | CDS_H1N1_HA_1_1701 | H1N1_HA | transcript | H1N1_HA.1 | protein_coding | 1/1 | c.1293T>C | p.Gly431Gly | 1293/1701 | 1293/1701 | 431/566 | 554 | 17 | 571 | H1N1-random3_m1_AF_random-a1_frac_0.01_BWA_freebayes_custom.ann | 0.9702277 | 0.0297723 | minor | 0.9702277 | 0.0297723 | T | C | ||
1 | 435 | H1N1_HA | NA | G | A | A | synonymous_variant | LOW | CDS_H1N1_HA_1_1701 | H1N1_HA | transcript | H1N1_HA.1 | protein_coding | 1/1 | c.435G>A | p.Ser145Ser | 435/1701 | 435/1701 | 145/566 | 726 | 23 | 749 | H1N1-random3_m1_AF_random-a1_frac_0.01_BWA_freebayes_custom.ann | 0.9692924 | 0.0307076 | minor | 0.9692924 | 0.0307076 | G | A | ||
1 | 539 | H1N1_HA | NA | A | G | G | missense_variant | MODERATE | CDS_H1N1_HA_1_1701 | H1N1_HA | transcript | H1N1_HA.1 | protein_coding | 1/1 | c.539A>G | p.Lys180Arg | 539/1701 | 539/1701 | 180/566 | 745 | 19 | 764 | H1N1-random3_m1_AF_random-a1_frac_0.01_BWA_freebayes_custom.ann | 0.9751309 | 0.0248691 | minor | 0.9751309 | 0.0248691 | A | G | ||
3 | 1042 | H1N1_NA | NA | G | C | C | missense_variant | MODERATE | CDS_H1N1_NA_1_1410 | H1N1_NA | transcript | H1N1_NA.1 | protein_coding | 1/1 | c.1042G>C | p.Gly348Arg | 1042/1410 | 1042/1410 | 348/469 | 691 | 30 | 721 | H1N1-random3_m1_AF_random-a1_frac_0.01_BWA_freebayes_custom.ann | 0.9583911 | 0.0416089 | minor | 0.9583911 | 0.0416089 | G | C |
dim(DF_filt)
#> [1] 1390 33
Here, we add in information about the sizes of the segments, which will allow us to do downstream calculations like tstv, dNdS, and others. The user must provide both the variant dataframe and the loaded metadata dataaframe, as well as the names of the columns to merge by in both.
add_metadata(DF_filt, sizes, c('CHROM'), c('segment'))
DF_filt =
kable(head(DF_filt))
CHROM | ChromKey | POS | ID | REF | ALT | allele | annotation | putative_impact | gene_name | gene_id | feature_type | feature_id | transcript_biotype | rank_total | HGVS.c | HGVS.p | cDNA_position | CDS_position | protein_position | distance_to_feature | errors | REF_COUNT | ALT_COUNT | gt_DP | sample | REF_FREQ | ALT_FREQ | ALT_TYPE | majorfreq | minorfreq | major | minor | SegmentSize | STRAIN |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
H1N1_HA | 1 | 1007 | NA | G | A | A | missense_variant | MODERATE | CDS_H1N1_HA_1_1701 | H1N1_HA | transcript | H1N1_HA.1 | protein_coding | 1/1 | c.1007G>A | p.Arg336Lys | 1007/1701 | 1007/1701 | 336/566 | 796 | 38 | 834 | H1N1-random3_m1_AF_random-a1_frac_0.01_BWA_freebayes_custom.ann | 0.9544365 | 0.0455635 | minor | 0.9544365 | 0.0455635 | G | A | 1701 | H1N1 | ||
H1N1_HA | 1 | 1145 | NA | T | A | A | missense_variant | MODERATE | CDS_H1N1_HA_1_1701 | H1N1_HA | transcript | H1N1_HA.1 | protein_coding | 1/1 | c.1145T>A | p.Leu382Gln | 1145/1701 | 1145/1701 | 382/566 | 740 | 30 | 770 | H1N1-random3_m1_AF_random-a1_frac_0.01_BWA_freebayes_custom.ann | 0.9610390 | 0.0389610 | minor | 0.9610390 | 0.0389610 | T | A | 1701 | H1N1 | ||
H1N1_HA | 1 | 1293 | NA | T | C | C | synonymous_variant | LOW | CDS_H1N1_HA_1_1701 | H1N1_HA | transcript | H1N1_HA.1 | protein_coding | 1/1 | c.1293T>C | p.Gly431Gly | 1293/1701 | 1293/1701 | 431/566 | 554 | 17 | 571 | H1N1-random3_m1_AF_random-a1_frac_0.01_BWA_freebayes_custom.ann | 0.9702277 | 0.0297723 | minor | 0.9702277 | 0.0297723 | T | C | 1701 | H1N1 | ||
H1N1_HA | 1 | 435 | NA | G | A | A | synonymous_variant | LOW | CDS_H1N1_HA_1_1701 | H1N1_HA | transcript | H1N1_HA.1 | protein_coding | 1/1 | c.435G>A | p.Ser145Ser | 435/1701 | 435/1701 | 145/566 | 726 | 23 | 749 | H1N1-random3_m1_AF_random-a1_frac_0.01_BWA_freebayes_custom.ann | 0.9692924 | 0.0307076 | minor | 0.9692924 | 0.0307076 | G | A | 1701 | H1N1 | ||
H1N1_HA | 1 | 539 | NA | A | G | G | missense_variant | MODERATE | CDS_H1N1_HA_1_1701 | H1N1_HA | transcript | H1N1_HA.1 | protein_coding | 1/1 | c.539A>G | p.Lys180Arg | 539/1701 | 539/1701 | 180/566 | 745 | 19 | 764 | H1N1-random3_m1_AF_random-a1_frac_0.01_BWA_freebayes_custom.ann | 0.9751309 | 0.0248691 | minor | 0.9751309 | 0.0248691 | A | G | 1701 | H1N1 | ||
H1N1_HA | 1 | 585 | NA | C | T | T | synonymous_variant | LOW | CDS_H1N1_HA_1_1701 | H1N1_HA | transcript | H1N1_HA.1 | protein_coding | 1/1 | c.585C>T | p.Gly195Gly | 585/1701 | 585/1701 | 195/566 | 778 | 27 | 806 | H1N1-random3_m2_AF_random-a1_frac_0.01_BWA_freebayes_custom.ann | 0.9652605 | 0.0334988 | minor | 0.9652605 | 0.0334988 | C | T | 1701 | H1N1 |
dim(DF_filt)
#> [1] 1390 35
Sequencing in replicate allow us to keep only those variants found in both runs, as any sequencing or PCR errors are extremely unlikely to be found in independent experiments. For this function, the user must provide the variant dataframe and the loaded replicates dataframe. In addition, whatever name the user provides for the replicates (ex. “Rep1”, “r1”, “rep1”, etc) must be provided in quotes, as well as a list of columns that should be used to merge the two dataframes. This should contain any columns that are expected to be identical between the two replicates, such as the segment and nucleotide position of the variant as well as any SNPeff annotation information. It should not contain replicate-specific information such as the allele frequency of the variant. We calculate the average allele frequency for both the ALT and REF allele, as well as the weighted average, which considers the number of reads for each replicate as well. We expect these to be generally similar, unless one of the replicates has significantly higher coverage per sample than the other.
c("sample","CHROM","ChromKey","POS","REF","ALT","allele",
cols ="annotation","putative_impact","gene_name","gene_id","feature_type","feature_id","transcript_biotype",
"rank_total", "HGVS.c","HGVS.p","cDNA_position","CDS_position","protein_position","distance_to_feature",
"errors","ALT_TYPE","major","minor","SegmentSize","STRAIN")
merge_replicates(DF_filt,replicates,"rep1","rep2",cols)
DF_filt_reps =
# if you want to have your chrom/segments in certain order must reorder before plotting:
$CHROM = factor(DF_filt_reps$CHROM, levels = SEGMENTS) DF_filt_reps
Shannon entropy is a commonly used metric to describe the amount of genetic diversity in sequencing data. It is calculated by considering the frequency of the ALT and REF allele at every position and then summing those values over either a segment or over the entire genome. These values can then be normalized by kb in order to compare across different segments or samples. As a result, this function outputs a dataframe with five new columns reflecting these different caluclations. The user must provide the variant dataframe and the length of the genome they are working with.
shannon_entropy(DF_filt_reps,genome_size)
DF_filt_reps =
kable(head(DF_filt_reps))
sample | CHROM | ChromKey | POS | REF | ALT | allele | annotation | putative_impact | gene_name | gene_id | feature_type | feature_id | transcript_biotype | rank_total | HGVS.c | HGVS.p | cDNA_position | CDS_position | protein_position | distance_to_feature | errors | ALT_TYPE | major | minor | SegmentSize | STRAIN | filename.x | replicate.x | ID.x | REF_COUNT.x | ALT_COUNT.x | gt_DP.x | REF_FREQ.x | ALT_FREQ.x | majorfreq.x | minorfreq.x | filename.y | replicate.y | ID.y | REF_COUNT.y | ALT_COUNT.y | gt_DP.y | REF_FREQ.y | ALT_FREQ.y | majorfreq.y | minorfreq.y | minorfreq | majorfreq | weighted_minorfreq | weighted_majorfreq | shannon_ntpos | chrom_shannon | genome_shannon | shannon_chrom_perkb | genome_shannon_perkb |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
a_1_fb | H1N1_HA | 1 | 1007 | G | A | A | missense_variant | MODERATE | CDS_H1N1_HA_1_1701 | H1N1_HA | transcript | H1N1_HA.1 | protein_coding | 1/1 | c.1007G>A | p.Arg336Lys | 1007/1701 | 1007/1701 | 336/566 | minor | G | A | 1701 | H1N1 | H1N1-random3_m1_AF_random-a1_frac_0.01_BWA_freebayes_custom.ann | rep1 | NA | 796 | 38 | 834 | 0.9544365 | 0.0455635 | 0.9544365 | 0.0455635 | H1N1-random3_m2_AF_random-a1_frac_0.01_BWA_freebayes_custom.ann | rep2 | NA | 780 | 35 | 816 | 0.9558824 | 0.0428922 | 0.9558824 | 0.0428922 | 0.0442279 | 0.9551594 | 0.0442424 | 0.9551515 | 0.2621955 | 0.8846285 | 169.5044 | 0.5200638 | 1.29e-05 | ||
a_1_fb | H1N1_HA | 1 | 1145 | T | A | A | missense_variant | MODERATE | CDS_H1N1_HA_1_1701 | H1N1_HA | transcript | H1N1_HA.1 | protein_coding | 1/1 | c.1145T>A | p.Leu382Gln | 1145/1701 | 1145/1701 | 382/566 | minor | T | A | 1701 | H1N1 | H1N1-random3_m1_AF_random-a1_frac_0.01_BWA_freebayes_custom.ann | rep1 | NA | 740 | 30 | 770 | 0.9610390 | 0.0389610 | 0.9610390 | 0.0389610 | H1N1-random3_m2_AF_random-a1_frac_0.01_BWA_freebayes_custom.ann | rep2 | NA | 781 | 29 | 810 | 0.9641975 | 0.0358025 | 0.9641975 | 0.0358025 | 0.0373818 | 0.9626182 | 0.0373418 | 0.9626582 | 0.2301561 | 0.8846285 | 169.5044 | 0.5200638 | 1.29e-05 | ||
a_1_fb | H1N1_HA | 1 | 435 | G | A | A | synonymous_variant | LOW | CDS_H1N1_HA_1_1701 | H1N1_HA | transcript | H1N1_HA.1 | protein_coding | 1/1 | c.435G>A | p.Ser145Ser | 435/1701 | 435/1701 | 145/566 | minor | G | A | 1701 | H1N1 | H1N1-random3_m1_AF_random-a1_frac_0.01_BWA_freebayes_custom.ann | rep1 | NA | 726 | 23 | 749 | 0.9692924 | 0.0307076 | 0.9692924 | 0.0307076 | H1N1-random3_m2_AF_random-a1_frac_0.01_BWA_freebayes_custom.ann | rep2 | NA | 766 | 22 | 790 | 0.9696203 | 0.0278481 | 0.9696203 | 0.0278481 | 0.0292779 | 0.9694563 | 0.0292398 | 0.9694607 | 0.1925281 | 0.8846285 | 169.5044 | 0.5200638 | 1.29e-05 | ||
a_1_fb | H1N1_HA | 1 | 539 | A | G | G | missense_variant | MODERATE | CDS_H1N1_HA_1_1701 | H1N1_HA | transcript | H1N1_HA.1 | protein_coding | 1/1 | c.539A>G | p.Lys180Arg | 539/1701 | 539/1701 | 180/566 | minor | A | G | 1701 | H1N1 | H1N1-random3_m1_AF_random-a1_frac_0.01_BWA_freebayes_custom.ann | rep1 | NA | 745 | 19 | 764 | 0.9751309 | 0.0248691 | 0.9751309 | 0.0248691 | H1N1-random3_m2_AF_random-a1_frac_0.01_BWA_freebayes_custom.ann | rep2 | NA | 762 | 29 | 793 | 0.9609079 | 0.0365700 | 0.9609079 | 0.0365700 | 0.0307195 | 0.9680194 | 0.0308285 | 0.9678870 | 0.1997490 | 0.8846285 | 169.5044 | 0.5200638 | 1.29e-05 | ||
a_1_fb | H1N1_MP | 2 | 642 | T | C | ,C | intron_variant | MODIFIER | CDS_H1N1_MP_1_759 | H1N1_M1 | transcript | H1N1_M1.2 | protein_coding | 1/1 | c.27-73T>C | minor | T | C | 982 | H1N1 | H1N1-random3_m1_AF_random-a1_frac_0.01_BWA_freebayes_custom.ann | rep1 | NA | 824 | 29 | 853 | 0.9660023 | 0.0339977 | 0.9660023 | 0.0339977 | H1N1-random3_m2_AF_random-a1_frac_0.01_BWA_freebayes_custom.ann | rep2 | NA | 783 | 35 | 818 | 0.9572127 | 0.0427873 | 0.9572127 | 0.0427873 | 0.0383925 | 0.9616075 | 0.0383004 | 0.9616996 | 0.2348725 | 0.9888624 | 169.5044 | 1.0069882 | 1.29e-05 | ||||||
a_1_fb | H1N1_MP | 2 | 642 | T | C | C | synonymous_variant | LOW | CDS_H1N1_MP_1_759 | H1N1_M1 | transcript | H1N1_M1.1 | protein_coding | 1/1 | c.642T>C | p.His214His | 642/759 | 642/759 | 214/252 | minor | T | C | 982 | H1N1 | H1N1-random3_m1_AF_random-a1_frac_0.01_BWA_freebayes_custom.ann | rep1 | NA | 824 | 29 | 853 | 0.9660023 | 0.0339977 | 0.9660023 | 0.0339977 | H1N1-random3_m2_AF_random-a1_frac_0.01_BWA_freebayes_custom.ann | rep2 | NA | 783 | 35 | 818 | 0.9572127 | 0.0427873 | 0.9572127 | 0.0427873 | 0.0383925 | 0.9616075 | 0.0383004 | 0.9616996 | 0.2348725 | 0.9888624 | 169.5044 | 1.0069882 | 1.29e-05 |
dim(DF_filt_reps)
#> [1] 586 56
The function tally_it() allows the user to count the number of variants over a given set of variables. These variables should be provided as a list and then passed into the function in addition to the name, in quotes, for the new column containing the number of variants. For example, to get the sum of variants on every segment, the user should construct as list containing the name of the sample column and the name of the segment column and pass that list into the function.
# Across segments:
c('sample','CHROM', "SegmentSize")
group_list_seg = tally_it(DF_filt_reps, group_list_seg, "snv_count")
seg_count =
kable(head(seg_count))
sample | CHROM | SegmentSize | snv_count |
---|---|---|---|
a_1_fb | H1N1_PB2 | 2280 | 13 |
a_1_fb | H1N1_PB1 | 2274 | 13 |
a_1_fb | H1N1_PA | 2151 | 7 |
a_1_fb | H1N1_HA | 1701 | 4 |
a_1_fb | H1N1_NP | 1497 | 8 |
a_1_fb | H1N1_NA | 1410 | 4 |
# Across genome:
c('sample')
group_list_gen = tally_it(DF_filt_reps, group_list_gen, "snv_count")
gen_count =
kable(head(gen_count))
sample | snv_count |
---|---|
a_1_fb | 58 |
a_1_iv | 66 |
a_2_fb | 63 |
a_2_iv | 66 |
a_3_fb | 64 |
a_3_iv | 68 |
The transistion/transversion rate is commonly used to test for a bias in nucleotide conversions. Transitions are expected to be more common. Here, the user must provide the variant dataframe and the length of the genome, as this ratio will be caluclated by segment and over the whole genome.
tstv_ratio(DF_filt_reps,genome_size)
DF_tstv =kable(head(DF_tstv))
sample | CHROM | SegmentSize | chrom_or_genome | transition | transversion | tstv_ratio | tstv_ratio_perkb |
---|---|---|---|---|---|---|---|
a_1_fb | H1N1_PB2 | 2280 | tstv_chrom_count | 11 | 2 | 5.500000 | 2.4122807 |
a_1_fb | H1N1_PB2 | 2280 | tstv_genome_count | 51 | 7 | 7.285714 | 0.5547639 |
a_1_fb | H1N1_PB1 | 2274 | tstv_chrom_count | 11 | 2 | 5.500000 | 2.4186456 |
a_1_fb | H1N1_PB1 | 2274 | tstv_genome_count | 51 | 7 | 7.285714 | 0.5547639 |
a_1_fb | H1N1_PA | 2151 | tstv_chrom_count | 6 | 1 | 6.000000 | 2.7894003 |
a_1_fb | H1N1_PA | 2151 | tstv_genome_count | 51 | 7 | 7.285714 | 0.5547639 |
This function takes the variant dataframe and generates a large, interactive plot of all of the variants. The plot will be faceted by each segment for each individual sample. The user can scroll over each point in the plot to get information about that variant, including the nucleotide change and the position on the segment.
#library(plotly)
snv_location(DF_filt_reps)
The snv_genome() function will take the variant dataframe and generate a plot of the number of variants per genome and colors them by their SNPeff annotation, while the snv_segment() function will generate the same plot but faceted by segment.
snv_genome(DF_filt_reps)
snv_segment(DF_filt_reps)
This function will take the variant dataframe and generates 4 plots: the Ts/Tv ratio across the genome, the ratio across each segment, and both plots normalized by kb.
tstv_plot(DF_tstv)
This function takes the variant dataframe and generates three plots: the shannon entropy value for every variant, the sum of these values over each segment, and the sum over the entire genome. A higher Shannon entropy value indicates more diversity.
# makes 3 plots
plot_shannon(DF_filt_reps)
This function allows users to focus on a single variant of interest and track the allele frequency of the ALT and the REF allele in all samples it is found in. To run, the user must provide the variant dataframe and the segment name and nucleotide position, in quotes, of the variant they are interested in. This can be run for any variant found in the sequencing data.
# provide the df, segment, and variant position
position_allele_freq(DF_filt_reps,"H1N1_HA", "1007")
The dN/dS ratio is commonly used to test for a bias in amino acid conversions. dN counts the number of nonsynonymous mutations, or nucleotide changes that result in an amino acid change in the resulting protein product. dS counts the number of synonymous mutations, which don’t result in an amino acid change and are not expected to be under strong selection. Under neutral evolution, the dN/dS is expected to be appox. equal to 1, while a ratio > 1 indicates an enrichment of nonsynonymous changes, a hallmark of positive selection, and a ratio < 1 indicates a depletion of nonsynonsymous changes, a hallmark of negative or purifying selection. The user must provide the variant dataframe that contains information about the amino acid changes for all nucleotide variants, otherwise this metric will not accurately reflect the true value.
##dNdS_segment ## SAYS NEEDS dataframe - must be for amino-acid specific calculations, cannot be the same as the dataframe used for SNP calculations
This function takes the variant dataframe and generates a plot showing the distribution of the the minor variants (appearing in less than 50% of the reads for any sample). For many viruses, we expect the distribution to be scewed to very low frequency variants (< 10%).
af_distribution(DF_filt_reps)