Related Information
Here is some related information on special points about the software involved.
1 SvAnna
(1) How does SvAnna work?
SvAnna introduces the Pathogenicity of Structural Variation (PSV) score to evaluate SV deleteriousness. The score is calculated from a DNA sequence deleteriousness score and a phenotype similarity score.
For each SV, SvAnna determines the extent of overlap with genomic elements, including promoters and transcripts. For each transcript, it determines which exon or exons are affected and whether the transcriptional start site of the coding sequence is disrupted. For each class of variant, SvAnna defines rules to assess a sequence deleteriousness score δ(G) for a set of genes G affected by the variant (as shown below). At the same time, a phenotypic relevance score Φ(Q,D) is calculated based on the similarity of patient phenotypes Q, encoded using Human Phenotype Ontology (HPO) terms, and the approximately 8000 computational disease models D of the HPO project. The candidates are ranked by a PSV score that is calculated as a function of the δ(G) and Φ(Q,D) scores.
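The overlap step described above can be sketched in Python as follows. This only illustrates interval overlap between an SV and the exons of one transcript; it does not reproduce SvAnna's per-class scoring rules, and the half-open coordinate convention, the function names, and the example coordinates are all assumptions made for illustration.

```python
from typing import List, Tuple

# Half-open [start, end) intervals, 0-based (an assumption of this sketch).
Interval = Tuple[int, int]


def overlap_length(a: Interval, b: Interval) -> int:
    """Number of bases shared by two half-open intervals."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))


def affected_exons(sv: Interval, exons: List[Interval]) -> List[int]:
    """Indices of exons that share at least one base with the SV."""
    return [i for i, exon in enumerate(exons) if overlap_length(sv, exon) > 0]


def disrupts_position(sv: Interval, pos: int) -> bool:
    """True if a single position (for example a start site) lies inside the SV."""
    return sv[0] <= pos < sv[1]


exons = [(100, 200), (300, 400), (500, 600)]  # hypothetical transcript
deletion = (150, 350)                          # hypothetical SV
print(affected_exons(deletion, exons))
print(disrupts_position(deletion, 120))
```

A real implementation would also need to account for strand, promoters, and variant class, which this sketch leaves out.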
(2) How is PSV calculated?
The Pathogenicity of Structural Variation (PSV) score is calculated based on the sequence deleteriousness score δ(g) and phenotype similarity score ɸ(Q, D) for all affected genes G. The sequence score δ(g) for each affected gene is weighted by the phenotypic similarity score ɸ(Q, D).
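The formula itself is not reproduced in this section. A form consistent with the description above, assuming that the weighted per-gene terms are summed over all affected genes, would be the following (ɸ is written as \phi; the summation and the exact form are assumptions and should be checked against the SvAnna publication):

```latex
\mathrm{PSV}(Q, G, D) = \sum_{g \in G} \delta(g) \cdot e^{\phi(Q, D)}
```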
Here, the PSV score is calculated as a function of the query HPO terms (Q), the set of affected genes G, and the Mendelian diseases D associated with the genes in G. δ(g) is weighted by the exponentiated phenotypic similarity ɸ(Q, D) of the query terms Q to a computational model of a disease D that is associated with variants in g. SvAnna uses the highest ɸ(Q, D) if more than one disease is associated with variants in g.
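As a minimal sketch of this weighting, the Python function below sums δ(g) multiplied by the exponentiated highest ɸ(Q, D) for each gene. The summation over genes, the treatment of genes without an associated disease (weight e^0 = 1), and all names and values are assumptions for illustration, not SvAnna's implementation.

```python
import math
from typing import Dict, List


def psv(delta: Dict[str, float], phi: Dict[str, List[float]]) -> float:
    """Sum delta(g) * exp(best phi) over the affected genes.

    delta: gene -> sequence deleteriousness score.
    phi:   gene -> phenotype similarity scores, one per associated disease.
    """
    total = 0.0
    for gene, gene_delta in delta.items():
        # Use the highest similarity when several diseases are associated
        # with the gene; fall back to 0.0 when none are (assumption).
        best_phi = max(phi.get(gene, []), default=0.0)
        total += gene_delta * math.exp(best_phi)
    return total


# Hypothetical scores for two affected genes.
delta = {"GENE_A": 1.0, "GENE_B": 0.5}
phi = {"GENE_A": [0.0, 2.0]}  # GENE_B has no associated disease here
print(psv(delta, phi))
```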
(3) How to interpret PSV?
The PSV score is calculated as described above and combines functional (sequence) scoring with phenotypic relevance. The raw PSV score has not been fully validated, so there is no definitive cutoff. However, ranking all SVs in a human sample by PSV helps identify the pathogenic variant among the benign ones, at a reported discovery rate of 87%.
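Because no cutoff is defined, the score is most useful for ordering candidates. A minimal sketch of such a ranking, using hypothetical SV identifiers and scores:

```python
from typing import Dict, List, Tuple


def rank_by_psv(scores: Dict[str, float]) -> List[Tuple[int, str, float]]:
    """Return (rank, sv_id, psv) tuples, highest PSV first."""
    ordered = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [(rank, sv_id, score) for rank, (sv_id, score) in enumerate(ordered, start=1)]


# Hypothetical PSV scores for three SVs in one sample.
scores = {"sv_001": 0.2, "sv_002": 7.9, "sv_003": 1.4}
for rank, sv_id, score in rank_by_psv(scores):
    print(rank, sv_id, score)
```

The top-ranked candidates are then the ones to review first, rather than every SV above a fixed threshold.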
2 Straglr
How does Straglr find tandem repeats?
The genotyping module, given a list of coordinates and motifs either from the genome-scan module or input by the user in BED format to run Straglr in genotype-only mode, iterates through each locus and extracts read sequences sandwiched between the coordinates of the neighboring nucleotides. A similar strategy is used to extract insertion sequence from single or split alignments obtained from Straglr’s genome-scan module. Straglr attempts to rescue missing split alignments in which there are potential TR sub-sequences by aligning a short stretch (80 bp) of reference genomic sequence immediately up- or downstream of the position where the possible sub-sequence lies to the clipped read sequence using BLASTN. A successful unambiguous alignment provides the missing boundary of the repeat sub-sequence with the read and enables Straglr to extract the sequence for TRF inspection. TRF is again used to screen all candidate repeat sequences extracted, as motifs detected are matched against the target motif. When simple string matching using Python’s regular expression fails, such as the cases for long and complicated VNTRs, BLASTN is performed with mismatch allowance to discern matches between target and detected motifs. If none of the detected motifs matches the target motif, no genotyping result will be produced for the locus in question. The final size and motif sequence for each supporting read are extrapolated from TRF results. This process captures potential TRs smaller than the alleles identified from genome-scan mode (if genome-scan mode was run), hence compiling a complete list of alleles for genotype ascertainment.
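The simple string-matching step between a target motif and a detected motif could look like the sketch below. The exact matching rules Straglr applies are not described here, so treating rotations of the motif as equivalent and accepting whole-number tandem copies are assumptions; reverse complements and the BLASTN fallback are not handled.

```python
import re
from typing import Set


def rotations(motif: str) -> Set[str]:
    """All cyclic rotations of a motif, e.g. CAG -> {CAG, AGC, GCA}."""
    return {motif[i:] + motif[:i] for i in range(len(motif))}


def motifs_match(target: str, detected: str) -> bool:
    """True if `detected` is one or more copies of a rotation of `target`."""
    detected = detected.upper()
    return any(
        re.fullmatch(f"(?:{re.escape(rot)})+", detected) is not None
        for rot in rotations(target.upper())
    )


print(motifs_match("CAG", "AGC"))     # rotation of the target
print(motifs_match("CAG", "GCAGCA"))  # two copies of a rotation
print(motifs_match("CAG", "CTG"))     # reverse complement, not handled here
```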
Straglr uses a Gaussian mixture model (GMM, from the Python scikit-learn package) to estimate the genotype from all the repeat sizes identified at each target locus. It tries different numbers of clusters up to a user-specified maximum (default = 2) and takes the number with the smallest Akaike information criterion (AIC) value as the number of alleles. The median of all repeat sizes within each cluster is the size reported for that allele. The final output of Straglr lists the supporting read names together with the copy numbers, sizes, and start location of the TR detected in each read.
3 PanDepth
How does PanDepth work?
PanDepth uses HTSlib for alignment file parsing and accepts binary compressed alignment files in BAM or CRAM format via the ‘-i’ parameter. Reads with alignment quality below a specified threshold can be excluded using the ‘-q’ parameter, and reads with specific flags can be excluded using the ‘-f’ parameter. By default, reads with flags indicating unmapped, secondary alignment, quality control failures and optical duplicates are filtered. Sequencing coverage statistics for specific regions can be obtained by providing GFF/GTF or BED files through the ‘-g’ or ‘-b’ parameters, respectively, or by specifying a window size through the ‘-w’ parameter. The result is reported in an output file with metrics including Covered site, Total depth, Coverage (%) and Mean depth for each chromosome, gene, region or window. GC content analysis can additionally be included in the output by specifying the reference genome sequence with the ‘-r’ parameter and using the ‘-c’ parameter. When the ‘-t’ parameter specifies two or more threads and index files are present, PanDepth performs the coverage calculations in parallel, distributing the workload across multiple threads to improve efficiency.
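To make the reported metrics concrete, the sketch below computes them from a list of per-base depths for one region. The definitions used here are assumptions for illustration (in particular, mean depth is taken as total depth divided by region length); this is not PanDepth's implementation, and the key names are hypothetical.

```python
from typing import Dict, Sequence


def coverage_stats(depths: Sequence[int]) -> Dict[str, float]:
    """Summarize per-base depths for one region (chromosome, gene, or window)."""
    length = len(depths)
    if length == 0:
        raise ValueError("region has no positions")
    covered = sum(1 for d in depths if d > 0)  # positions with at least one read
    total = sum(depths)                        # summed depth over all positions
    return {
        "covered_site": covered,
        "total_depth": total,
        "coverage_pct": 100.0 * covered / length,
        "mean_depth": total / length,
    }


# Hypothetical depths for a 5 bp region.
print(coverage_stats([0, 2, 3, 0, 5]))
```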
Danis, D., Jacobsen, J. O. B., Balachandran, P., Zhu, Q., Yilmaz, F., Reese, J., Haimel, M., Lyon, G. J., Helbig, I., Mungall, C. J., Beck, C. R., Lee, C., Smedley, D., & Robinson, P. N. (2022). SvAnna: efficient and accurate pathogenicity prediction of coding and regulatory structural variants in long-read genome sequencing. Genome Medicine, 14(1), 44. https://doi.org/10.1186/s13073-022-01046-6
Chiu, R., Rajan-Babu, IS., Friedman, J.M. et al. Straglr: discovering and genotyping tandem repeat expansions using whole genome long-read sequences. Genome Biol 22, 224 (2021). https://doi.org/10.1186/s13059-021-02447-3
Huiyang Yu, Chunmei Shi, Weiming He, Feng Li, Bo Ouyang, PanDepth, an ultrafast and efficient genomic tool for coverage calculation, Briefings in Bioinformatics, Volume 25, Issue 3, May 2024, bbae197.