Graph-based genome alignment and genotyping with HISAT2 and HISAT-genotype

Nature

The human reference genome represents only a small number of individuals, which limits its usefulness for genotyping. We present a method named HISAT2 (hierarchical indexing for spliced alignment of transcripts 2) that can align both DNA and RNA sequences using a graph Ferragina Manzini index. We use HISAT2 to represent and search an expanded model of the human reference genome in which over 14.5 million genomic variants in combination with haplotypes are incorporated into the data structure used for searching and alignment. We benchmark HISAT2 using simulated and real datasets to demonstrate that our strategy of representing a population of genomes, together with a fast, memory-efficient search algorithm, provides more detailed and accurate variant analyses than other methods. We apply HISAT2 for HLA typing and DNA fingerprinting; both applications form part of the HISAT-genotype software that enables analysis of haplotype-resolved genes or genomic regions. HISAT-genotype outperforms other computational methods and matches or exceeds the performance of laboratory-based assays.

The CAAPA genome data are available from dbGaP (accession phs001123.v1.p1).

HISAT2 and HISAT-genotype are open-source software freely available at https://github.com/DaehwanKimLab/hisat2. The HISAT2 package includes programs and application programming interfaces for C++, Python and Java that rapidly retrieve genomic locations from repeat alignments for use in downstream analyses such as variant calling, peak calling and differential gene expression analysis.

We would like to express our thanks to K. Barnes and M. Daya for sharing Omixon’s HLA results with us. We would like to thank B. Langmead and J. Pritt for their invaluable contributions to our discussions on HISAT2. We also greatly appreciate the generosity of G. Danuser and D. Reed in providing wet-lab bench space and equipment for us. This work was supported in part by the National Human Genome Research Institute under grants R01-HG006102 and R01-HG006677 to S.L.S. and by the Cancer Prevention Research Institute of Texas under grant RR170068 to D.K. All authors read and approved the final manuscript.

Lyda Hill Department of Bioinformatics, University of Texas Southwestern Medical Center, Dallas, TX, USA

Department of Computer Science, Stanford University, Stanford, CA, USA

Center for Computational Biology, McKusick-Nathans Institute of Genetic Medicine, School of Medicine, Johns Hopkins University, Baltimore, MD, USA

Departments of Biomedical Engineering, Computer Science, and Biostatistics, Johns Hopkins University, Baltimore, MD, USA

D.K. and S.L.S. performed the analysis and discussed the results of HISAT2 and HISAT-genotype. D.K. designed and implemented HISAT2 and HISAT-genotype. J.M.P. optimized the index-building algorithm of HISAT2. D.K. and C.P. implemented the repeat-indexing algorithm of HISAT2. D.K., C.P. and C.B. performed the evaluations of the various programs. D.K. performed the wet-lab experiments. D.K., C.B. and S.L.S. wrote the manuscript.

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

In the two ‘Node rank’ columns on the left, since node ranks are given in consecutive and increasing order, one bit (0 or 1) can be used to represent a node rank instead of the 4 bytes (any number between 0 and 4,294,967,295) needed to manage offsets for a human genome. A 1 indicates a new node rank, and a 0 indicates an additional outgoing or incoming edge of the same node. To retrieve a node rank, sum the 1s up to that row. Since the labels in the ‘First’ column are already sorted, five numbers are enough to represent the column: two for As, three for Cs, four for Gs, three for Ts, and two for Zs. In the ‘Last’ column, two bits are used to represent each label: 00 for A, 01 for C, 10 for G, and 11 for T. 00 is also used to represent Z; HISAT2 internally resolves whether 00 represents A or Z. The right table is the space-efficient representation of the left table after these transformations.
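
The one-bit node-rank encoding described in this caption can be illustrated with a minimal Python sketch. The function names and the example ranks are hypothetical, and the sketch assumes that ranks are numbered from 1 so that the count of 1s equals the rank directly; it is not HISAT2's actual implementation.

```python
def encode_node_ranks(ranks):
    """Encode consecutive, non-decreasing node ranks as one bit per row.

    1 marks the first row of a new node rank; 0 marks an additional
    edge (row) belonging to the same node as the previous row.
    """
    bits = []
    previous = None
    for rank in ranks:
        bits.append(1 if rank != previous else 0)
        previous = rank
    return bits


def rank_at(bits, row):
    """Recover the node rank of a row by summing the 1s up to that row.

    Assumption: ranks start at 1, so the running count is the rank itself.
    """
    return sum(bits[: row + 1])


# Hypothetical column of node ranks (one entry per edge).
ranks = [1, 2, 2, 3, 4, 4, 4, 5]
bits = encode_node_ranks(ranks)
print(bits)                                       # [1, 1, 0, 1, 1, 0, 0, 1]
print([rank_at(bits, i) for i in range(len(bits))])  # [1, 2, 2, 3, 4, 4, 4, 5]
```

Summing from the start of the column on every lookup is linear in the row index; storing accumulated counts at regular intervals is the usual way to make the lookup fast.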

In a and b, a reference consisting of only alleles of genes of interest can introduce significant mapping bias by mapping reads from regions not included in the restricted reference, as illustrated in more detail in Supplementary Fig. 3. In c, an aligner using the current human reference may not be able to map many reads if they originated from alleles that are substantially different from the human reference allele. In d, a reference consisting of the human reference plus numerous alleles of HLA genes enables mapping of reads from even substantially different alleles. Most HLA-typing methods, such as HLA-VBSeq, HLA*PRG, Kourami, and Graphtyper, are based on c, d, or a combination thereof to initially identify HLA reads, after which HLA-VBSeq uses approach a, and HLA*PRG, Kourami, and Graphtyper use a small-scale graph representation as described in b to perform typing. Kourami assembles only exons of HLA genes, while HISAT-genotype is able to assemble full-length sequences of HLA genes including exons and introns.

An illustration of the benefits of using the right reference/index when working with sequencing reads. The figure shows the alignment of reads to the whole genome (upper right) and to one particular genomic region denoted as Region 3 (lower right). When using the whole genome for aligning the six example reads, reads are perfectly aligned to the correct regions (regions 1, 2, 3, 4, 5, and 6). However, if the example reads are aligned using only one particular region (e.g. Region 3), five out of the six reads are incorrectly aligned to that region because alignment programs allow for a few mismatches. For example, in order to identify and extract reads that belong to the HLA-DRB1 gene from whole-genome sequencing reads, one may attempt to align them only to the HLA-DRB1 gene region. In one experiment, we found that this strategy produced 1100 times more reads mapped to HLA-DRB1 than a whole-genome alignment produced, because of the numerous pseudogene copies of DRB1 in the genome.

The size of each block is 128 bytes, consisting of 32 4-byte cells. Each block stores: (1) four 4-byte numbers for the accumulated numbers of occurrences of A, C, G, and T up to that block, (2) one 4-byte number for the accumulated number of 1s up to the block in the Node rank (outgoing edges) column of the right table in Fig. 1a, (3) one 4-byte number for the row number of the Node rank (incoming edges) column corresponding to the accumulated number of 1s indicated in (2), (4) 208 labels (or nucleotides) corresponding to the Last column of the right table in Fig. 1a, and (5) 208 ‘OUT’ bits and 208 ‘IN’ bits corresponding to the Node rank columns of the table.
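
The numbers in this caption add up to exactly one 128-byte block if each label takes two bits, as in the two-bit ‘Last’ column encoding described for the table. The following Python sketch checks that arithmetic; the constant and function names are hypothetical, and the ordering of fields inside a real block is not specified here.

```python
BLOCK_BYTES = 128        # stated block size
CELL_BYTES = 4           # each cell is a 4-byte number
LABELS_PER_BLOCK = 208   # labels (nucleotides) stored per block
BITS_PER_LABEL = 2       # assumption: 2-bit encoding of A, C, G, T


def block_layout():
    """Return the size in bytes of each component of one block."""
    return {
        "acgt_counts": 4 * CELL_BYTES,                            # item (1)
        "out_ones_count": 1 * CELL_BYTES,                         # item (2)
        "in_row_number": 1 * CELL_BYTES,                          # item (3)
        "labels": LABELS_PER_BLOCK * BITS_PER_LABEL // 8,         # item (4)
        "out_bits": LABELS_PER_BLOCK // 8,                        # item (5)
        "in_bits": LABELS_PER_BLOCK // 8,                         # item (5)
    }


layout = block_layout()
print(layout)
print(sum(layout.values()))  # 128
```

The six 4-byte counters take 24 bytes, leaving 104 bytes (832 bits), which is exactly 208 labels at two bits each plus 208 ‘OUT’ bits and 208 ‘IN’ bits.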

How to align a 3-bp query, TAG, whose TG corresponds to the last two nucleotides of the original reference sequence, GAGCTG, and where A is a 1-bp insertion in the query. Searching from the right end of the query to the left, the nodes labeled ‘G’ are first selected (node IDs ‘4’, ‘5’, ‘6’, and ‘7’). Then the incoming edges of those nodes are examined to identify which has a preceding base ‘A’. Nodes ‘5’ and ‘7’ qualify, with preceding nodes ‘1’ and ‘2’. These in turn are examined to determine which of these nodes is preceded by a base ‘T’. Only one of the two nodes, node ‘2’, has a preceding node, ‘8’, whose label corresponds to ‘T’. Node ‘8’ is chosen as a mapped location for the query. This is the final alignment of the query shown in the prefix-sorted graph, and additional algorithms convert it to the corresponding alignment in the original graph.
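
The right-to-left search in this caption can be mimicked on an explicit graph. The Python sketch below is only an illustration: it walks predecessor edges directly rather than using the graph FM index, the function name is hypothetical, and the graph contains only the nodes and edges the caption names (predecessors of all other nodes are left empty as an assumption).

```python
def backward_search(query, labels, preds):
    """Match `query` right to left by following incoming edges.

    labels: node ID -> base label
    preds:  node ID -> list of preceding node IDs
    Returns the set of nodes where the leftmost query base matches.
    """
    # Start with every node whose label equals the last query base.
    current = {node for node, base in labels.items() if base == query[-1]}
    for base in reversed(query[:-1]):
        # Keep only predecessors whose label matches the next base to the left.
        current = {
            p for node in current for p in preds.get(node, ()) if labels[p] == base
        }
        if not current:
            break
    return current


# Partial graph taken from the caption; anything not named there is omitted.
labels = {1: "A", 2: "A", 4: "G", 5: "G", 6: "G", 7: "G", 8: "T"}
preds = {5: [1], 7: [2], 2: [8]}

print(backward_search("TAG", labels, preds))  # {8}
```

The intermediate sets follow the caption: G selects nodes 4 to 7, A narrows them to nodes 1 and 2, and T leaves node 8.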

Given a de Bruijn graph in which k = 3 and each k-mer is present at least C times (e.g. five times) in the genome, a k-mer is chosen (e.g. the most frequently occurring k-mer) and extended in the left and right directions. Note that a de Bruijn graph can be easily constructed from a k-mer table. The extended sequence consisting of k-mers is called a repeat sequence. If there is a branch during extension, one of the k-mers is chosen (e.g. the most frequently occurring k-mer at the branch). In a, for example, a k-mer, TTT, is chosen and shown in yellow, and then extended in both directions until there is no extension possible, resulting in a repeat sequence, CCGTTTAC. In order to find the next repeat sequence, the k-mers belonging to the previously identified repeat sequence are removed as shown in b. A k-mer, CTT, is chosen and not extended in b as it has no k-mers to extend. In c, TAT is initially chosen and extended into TATTGT in orange. Finally, in d, TGC is chosen and not extended. Each repeat sequence consists of sub-sequences that exist in the reference genome, and a sub-sequence consists of one or more consecutive k-mers. We only store one location per sub-sequence, instead of per k-mer.
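
The greedy extension described in this caption can be sketched in Python as follows. This is a simplified illustration, not HISAT2's repeat-indexing code: the function names are hypothetical, k-mers are removed as soon as they are used (so a k-mer cannot be revisited within one repeat), and the k-mer counts are invented for the example rather than taken from the figure.

```python
def extend(seq, remaining, k, direction):
    """Greedily extend `seq` one base at a time in one direction."""
    while True:
        if direction == "right":
            candidates = [seq[-(k - 1):] + base for base in "ACGT"]
        else:
            candidates = [base + seq[: k - 1] for base in "ACGT"]
        candidates = [kmer for kmer in candidates if kmer in remaining]
        if not candidates:
            return seq
        # At a branch, take the most frequent k-mer (ties broken alphabetically).
        best = max(candidates, key=lambda kmer: (remaining[kmer], kmer))
        del remaining[best]
        seq = seq + best[-1] if direction == "right" else best[0] + seq


def extract_repeats(kmer_counts):
    """Build repeat sequences from a k-mer count table."""
    remaining = dict(kmer_counts)
    repeats = []
    while remaining:
        # Seed with the most frequent k-mer still available.
        seed = max(remaining, key=lambda kmer: (remaining[kmer], kmer))
        del remaining[seed]
        k = len(seed)
        seq = extend(seed, remaining, k, "right")
        seq = extend(seq, remaining, k, "left")
        repeats.append(seq)
    return repeats


# Hypothetical counts (all >= 5); not the figure's actual k-mer table.
kmer_counts = {
    "TTT": 20, "GTT": 15, "TTA": 14, "CGT": 12, "CCG": 11, "TAC": 10,
    "TAT": 8, "ATT": 7, "TTG": 7, "TGT": 6, "TGC": 5,
}
print(extract_repeats(kmer_counts))  # ['CCGTTTAC', 'TATTGT', 'TGC']
```

With these invented counts the sketch yields the repeat sequences CCGTTTAC, TATTGT and TGC mentioned in the caption; recording one genomic location per sub-sequence, as the caption describes, is outside the scope of the sketch.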

Supplementary Figs. 1–7, Supplementary Tables 1–9 and Supplementary Note 1

HISAT-genotype’s HLA typing results for 17 PG genomes on HLA-A, HLA-B, HLA-C, HLA-DQA1, HLA-DQB1 and HLA-DRB1

HISAT-genotype’s HLA typing results for 917 CAAPA genomes on HLA-A, HLA-B, HLA-C, HLA-DQA1, HLA-DQB1 and HLA-DRB1

Comparisons of HISAT-genotype and Kourami for HLA typing using simulated reads (see Supplementary Note 1 for description)

HISAT-genotype initial DNA fingerprinting results for 17 PG genomes

PowerPlex Fusion results for 17 PG genomes (raw signal image data)

List of alleles for 13 DNA fingerprinting loci and the amelogenin locus from the NIST short tandem repeat database

List of eight additionally incorporated alleles for four DNA fingerprinting loci D8S1179, D13S317, VWA and D21S11
