Profiling immunoglobulin repertoires across multiple human tissues using RNA sequencing

Nature


ABSTRACT

Profiling immunoglobulin (Ig) receptor repertoires with specialized assays can be cost-ineffective and time-consuming. Here we report ImReP, a computational method for rapid and


accurate profiling of the Ig repertoire, including the complementarity-determining region 3 (CDR3), using regular RNA sequencing data such as those from 8,555 samples across 53 tissue types


from 544 individuals in the Genotype-Tissue Expression (GTEx v6) project. Using ImReP and GTEx v6 data, we generate a collection of 3.6 million Ig sequences, termed the atlas of


immunoglobulin repertoires (TAIR), across a broad range of tissue types that often do not have reported Ig repertoire information. Moreover, the flow of Ig clonotypes and inter-tissue


repertoire similarities across immune-related tissues are also evaluated. In summary, TAIR is one of the largest collections of CDR3 sequences and tissue types, and should serve as an


important resource for studying immunological diseases.

INTRODUCTION

A key function of the adaptive immune system is to mount protective memory responses


to a given antigen. B cells recognize their specific antigens through immunoglobulins (Ig), surface antigen receptors, which are unique to each cell and its progeny. A typical Ig repertoire


is composed of one immunoglobulin heavy chain (IGH) and two light chains, κ (IGK) and λ (IGL). Igs are diversified through somatic recombination, a process that randomly combines variable


(V), diversity (D), and joining (J) gene segments, and inserts or deletes non-templated bases at the recombination junctions1 (Fig. 1a). The resulting DNA sequences are then translated into


antigen receptor proteins. This process enables the Ig repertoire to develop an astonishing diversity of antigen receptors in any given individual, with >10¹³ theoretically possible


distinct Ig receptors1. Ig repertoire diversity is key for an individual’s immune system to confer protection against a wide variety of potential pathogens2. In addition, upon activation of


a B cell, somatic hypermutation further diversifies Ig in their variable region. These changes are mostly single-base substitutions occurring at extremely high rates: somatic hypermutation
introduces 10⁻⁵ to 10⁻³ mutations per base pair per generation3. Isotype switching is another mechanism that contributes to B-cell functional diversity. Here, antigen specificity remains


unchanged, while the heavy chain VDJ regions join with different constant (C) regions, such as IgG, IgA, or IgE isotypes, and alter the immunological properties of Igs. The pairing of heavy


and light chains that occurs in polyclonally activated B cells is another mechanism that increases Ig diversity. High-throughput technologies enable remarkable levels of accuracy when


profiling the Ig repertoires. Commonly used assay-based approaches to RNA sequencing (RNA-Seq) provide a detailed view of the adaptive immune system by leveraging the deep sequencing of


amplified DNA or RNA from the variable region of the Ig locus (BCR-Seq)4,5,6. Those technologies are usually restricted to one chain, with the majority of studies focusing on the heavy chain


of the Ig repertoire. Recent studies2 have successfully applied assay-based approaches to characterize the immune repertoire of peripheral blood. However, little is known about the


immunological repertoires of other human tissues, including barrier tissues such as skin and mucosae. Studies involving assay-based protocols usually have small sample sizes, thus limiting


analysis of intra-individual variation of immunological receptors across diverse human tissues. RNA-Seq traditionally uses the reads mapped onto human genome references to study the


transcriptional landscape of both single cells and entire cellular populations. In contrast to assay-based protocols that produce reads from the amplified variable region of the Ig locus,


RNA-Seq is able to capture the entire cellular population of the sample, including B cells. However, due to the repetitive nature of the Ig locus and the extremely high level of diversity in


Ig transcripts, most mapping tools are ill-equipped to handle Ig sequences. RNA-Seq was successfully used for analysis of highly clonal leukemic repertoires with high relative quantities of


Ig transcripts5. Despite this, Ig transcripts often occur in sufficient numbers within the transcriptome of many tissues to characterize their respective Ig repertoires7. A number of


methods8,9,10 were designed to assemble Ig and T-cell receptor repertoires and have been applied across various public RNA-Seq datasets. Existing methods that are capable of assembling Ig


repertoires from bulk RNA-Seq data typically produce low-accuracy results (F-score < 0.2). In this study, we develop ImReP, an alignment-free computational method for rapid and accurate


profiling of the Ig repertoire from regular RNA-Seq data. ImReP is capable of efficiently extracting receptor-derived reads from RNA-Seq data and accurately assembling Ig clonotypes, defined


as distinct amino acid sequences of complementarity-determining region 3 (CDR3). We demonstrate that bulk RNA-Seq is a suitable technology for measuring the individual adaptive immune


repertoire. In our validation experiments, ImReP is able to capture 69% of the immune repertoire obtained by BCR-Seq. Using ImReP, we create a systematic atlas of Ig sequences across a broad


range of tissue types, most of which were previously unstudied for Ig repertoires. We also examine the compositional similarities of clonal populations between the tissues to track the flow


of Ig clonotypes across immune-related tissues, including secondary lymphoid organs and organs that encompass mucosal, exocrine, and endocrine sites. Our proposed approach is not intended to replace
targeted BCR-Seq; rather, it provides a useful tool for mining large-scale RNA-Seq datasets for the study of Ig receptor repertoires.

RESULTS

EXISTING TOOLS FOR PROFILING THE IG RECEPTOR REPERTOIRE

A number of tools have previously been developed to reconstruct the Ig receptor repertoire. Repertoire analysis from RNA-Seq data typically starts with


mapping the reads to the germline V, D, and J genes that can be obtained from the International ImMunoGeneTics (IMGT) database11. There are three possible read mapping scenarios as follows:


(1) the read is entirely mapped to the V gene; (2) the read is entirely mapped to the J gene; (3) the read is partially mapped to the V and J genes simultaneously. Existing methods consider


only reads from category (3). These methods use different underlying algorithms to map reads to germline genes. For example, MiXCR8 relies on an in-house alignment procedure, IgBlast12


utilizes BLAST with an optimized set of parameters, and IMSEQ13 uses in-house pairwise alignment between the read sequence and the germline V and J segment sequences. Following the


alignment, MiXCR assembles overlapping, previously aligned reads into contigs. The resulting contigs are re-aligned to the V, D, and J genes to verify that a significant portion of
non-template N insertions is covered. In contrast to MiXCR, which simultaneously aligns reads to the V, D, and J genes, IgBlast separately aligns the query read to databases composed of V,
D, and J genes. IgBlast performs these alignments in a fixed order: first, the program finds the best V gene hit. Next, IgBlast masks the aligned read region and performs an
alignment to the J gene database. For heavy chains, IgBlast also queries the D gene database for the best hit. Lastly, the software checks that each component in the obtained


V(D)J rearrangement originates from the same locus and reports CDR3 sequences and corresponding V(D)J recombinations. All methods use the definition of CDR3 to determine the boundaries of


the CDR3 sequence in each of the reads. The final step in repertoire analysis is to correct the assembled clones for PCR and sequencing errors. To correct these errors, which are introduced


during data preparation, MiXCR and IMSEQ cluster the assembled clones and report a consensus sequence for each cluster. IgBlast skips the error correction step and directly outputs inferred


CDR3 sequences. Most methods use alignment or assembly to infer CDR3s and align reads to V and J genes. In contrast, ImReP matches the read prefix and the read
suffix to the suffix of V genes and the prefix of J genes, respectively, without a need for alignment. By avoiding alignment, ImReP is able to significantly decrease running time and minimize


required computational resources. Average CPU (central processing unit) time reported for ImReP is 44 minutes, a runtime substantially shorter than the average 10 hours required for MiXCR.


At peak usage, across all samples, ImReP consumes 3 GB of RAM (random access memory), whereas MiXCR requires 10 GB of RAM.

IMREP IS A METHOD FOR PROFILING OF IG REPERTOIRE

We apply ImReP to


0.6 trillion RNA-Seq reads (92 Tbp) from 8555 samples to assemble CDR3 sequences of Ig receptors (Supplementary Data 1). The RNA-Seq data were generated by the Genotype-Tissue Expression


Consortium (GTEx v6). First, we map RNA-Seq reads to the human reference genome using a short-read aligner (performed by GTEx consortium14) (Fig. 1). Next, we identify reads spanning the


V(D)J junction of the Ig receptors and assemble clonotypes (groups of clones with identical CDR3 amino acid sequences). We define the CDR3 as the sequence of amino acids, starting with


cysteine, which is located on the left side of the junction, and ending with phenylalanine (for IGK or IGL) or tryptophan (for IGH), which is located on the right side of the junction. In


this case, ImReP uses 0.02 trillion high-quality reads that had been successfully mapped to Ig genes or were unmapped reads that had failed to map to the human reference genome (Fig. 1a and


Supplementary Fig. 1). ImReP is a two-stage alignment-free approach to assembling CDR3 sequences and detecting corresponding V(D)J recombinations (Fig. 1b). In the first stage, we prepare


the candidate receptor reads from mapped and unmapped RNA-Seq reads (Supplementary Fig. 1). We then merge partially mapped reads from Ig loci and unmapped reads into a set of candidate


receptor reads, which serve as an input for ImReP. We scan the amino acid sequences of the read and determine the putative CDR3 as a substring of the read starting from cysteine (C) and


ending with phenylalanine (F) (or tryptophan (W) for IGH). A read is separated into three parts as follows: read prefix, CDR3, and read suffix. Reads with putative CDR3s are further examined to assess the overlap of V and J genes.
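The scan for putative CDR3s described above can be illustrated with a minimal Python sketch. It is not taken from the ImReP code base: the function name `putative_cdr3s`, the `Cdr3Candidate` container, and the `min_len` cutoff are hypothetical and chosen only for illustration. The sketch enumerates every substring of a translated read that starts with cysteine and ends with phenylalanine (tryptophan for IGH), and splits the read into prefix, CDR3, and suffix.

```python
from typing import List, NamedTuple


class Cdr3Candidate(NamedTuple):
    prefix: str  # amino acids before the CDR3 (to be checked against V genes)
    cdr3: str    # putative CDR3, from the anchoring C to the closing F or W
    suffix: str  # amino acids after the CDR3 (to be checked against J genes)


def putative_cdr3s(read_aa: str, chain: str, min_len: int = 5) -> List[Cdr3Candidate]:
    """Enumerate putative CDR3s in a translated read.

    chain is "IGH", "IGK", or "IGL". min_len is an assumed, illustrative
    lower bound on CDR3 length; it is not a documented ImReP parameter.
    """
    end_residue = "W" if chain == "IGH" else "F"
    candidates = []
    for start, residue in enumerate(read_aa):
        if residue != "C":
            continue
        # Look for every closing residue far enough from the anchoring C.
        for end in range(start + min_len - 1, len(read_aa)):
            if read_aa[end] == end_residue:
                candidates.append(
                    Cdr3Candidate(
                        prefix=read_aa[:start],
                        cdr3=read_aa[start:end + 1],
                        suffix=read_aa[end + 1:],
                    )
                )
    return candidates


if __name__ == "__main__":
    # Toy translated read, invented for this example.
    for candidate in putative_cdr3s("AVYYCARDYWGQG", "IGH"):
        print(candidate)
```

A read can yield several candidates, which is why the text goes on to check the prefix and suffix against V and J genes before a candidate is accepted.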


Variable Ig receptor genes are imported from IMGT version: 3.1.17. We use C from the beginning of the read and C from the V gene as an anchor to match the read prefix and the V gene.


Similarly, we use F (or W) from the end of the read and F (or W) from the J gene as an anchor to match the read suffix and the J gene. In the second stage, ImReP utilizes reads that contain


a partial CDR3 sequence and overlap a single gene segment (V or J). We use an alignment-free procedure to determine the overlap between the V or J gene and the read prefix or suffix,


respectively. ImReP performs matching with a suffix tree technique; matched reads with an overlap of at least 15 nucleotides are used to assemble full-length CDR3s. We further correct PCR


and sequencing errors in the assembled CDR3s. ImReP groups the assembled CDR3s into clusters using the CAST algorithm10. The clustering procedure is repeated iteratively until the
average inverse edit distance (Levenshtein) inside each cluster is less than the user-defined threshold (ImReP default is 0.2). The consensus sequence of each cluster is reported as the


correct CDR3 sequence. A detailed description of the methodology implemented with ImReP is provided in the Methods section. ImReP is freely available at


https://github.com/Mangul-Lab-USC/imrep. Currently, ImReP supports human and mouse Ig receptor repertoires.

FEASIBILITY OF USING RNA-SEQ TO STUDY THE IG REPERTOIRE

To validate the


feasibility of using RNA-Seq to study the Ig receptor repertoire, we simulate RNA-Seq data as a mixture of transcriptomic reads and reads derived from Ig transcripts (ratio between


Ig-derived reads and transcriptomic reads was on average 1 : 3600) (Supplementary Fig. 2). Ig transcripts are simulated based on random recombination of V, D, and J gene segments (obtained


from IMGT database11) with non-template insertion at the recombination junctions (Supplementary Fig. 3). We assess the ability of ImReP to extract CDR3-derived reads from the RNA-Seq mixture


by applying ImReP to a simulated RNA-Seq mixture. While our simulation approach may not completely summarize the various nuances and eccentricities of actual immune repertoires, it allows


us to assess the accuracy of our tool. ImReP is able to identify 99% of CDR3-derived reads from the RNA-Seq mixture, suggesting it is a powerful tool for profiling RNA-Seq samples of


immune-related tissues. Next, we compare ImReP with other methods designed to assemble Ig receptor repertoires. We also investigate the sequencing depth and read length required to reliably


assemble Ig sequences from RNA-Seq data. Our simulations suggest that both read length and sequencing depth have a major impact on precision-recall rates of CDR3 sequence assembly. ImReP is


able to maintain an 80% precision rate for the majority of simulated scenarios. An average CDR3 coverage higher than eight allows ImReP to achieve a recall rate close to 90% for a read


length above 75 bp (Fig. 2a). Increasing coverage has a positive effect on the number of assembled clonotypes achieved by ImReP. We compare the performance of ImReP with that of MiXCR


(RNA-Seq mode)8, 13, IgBlast-based pipeline15, and IMSEQ13. Except for IMSEQ, these tools were developed to assemble the hypervariable sequences from Ig receptors directly from RNA-Seq data.


Another tool, iSSAKE16, is no longer supported and was not recommended for use. Unfortunately, we obtained empty output after running V’DJer17 and we could not solve the problem by


increasing coverage in the simulated data. We exclude TRUST9 and TraCeR10, as those methods are solely designed for T-cell receptors. We supplied each of those tools with the original


RNA-Seq reads as raw or mapped reads, depending on the software developers’ recommendations. IMSEQ13 cannot be applied directly to RNA-Seq reads because it was originally designed for


targeted sequencing of Ig receptor loci. Thus, to independently assess and compare accuracy with ImReP, we only ran IMSEQ with the simulated reads derived from Ig transcripts (Supplementary


Fig. 2). ImReP consistently outperforms existing methods in both recall and precision rates. The recall is defined as TP/(TP + FN). Precision is defined as TP/(TP + FP). We define TP as the


number of correctly assembled CDR3 sequences (based on the exact match), FN is defined as the number of true CDR3 sequences not assembled by the method and FP is defined as the number of


incorrectly assembled CDR3 sequences. On average, ImReP offers more than threefold higher accuracy (the average F-score of ImReP is 0.78; for other methods the average F-score is < 0.2). F-score is
defined as the harmonic mean of precision and recall. Notably, ImReP is the only method with acceptable performance for 50 bp read length, reconstructing
significantly more CDR3 clonotypes than other methods at a higher precision rate. To further demonstrate the feasibility of applying nonspecific RNA-Seq techniques to profile Ig receptor repertoires, we use 18 tumor


biopsies sequenced by BCR-Seq and RNA-Seq. We acquired biopsies from patients with histologically confirmed Burkitt lymphoma18. We first mapped the reads onto the reference human genome and


transcriptome, then we extracted unmapped reads, which we provided to ImReP for assembly of IGH clonotypes. Based on the recommendation of MiXCR, we provide raw paired-end reads to the tool.
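As an aside, the precision, recall, and F-score definitions used in the simulation comparison above can be written out as a short Python sketch. It assumes, as the text states, that a CDR3 counts as correct only on an exact match. The function name `cdr3_accuracy` and the toy sequences are hypothetical and are not results from the study.

```python
from typing import Dict, Iterable


def cdr3_accuracy(assembled: Iterable[str], truth: Iterable[str]) -> Dict[str, float]:
    """Precision, recall, and F-score for exact-match CDR3 assembly."""
    assembled_set, truth_set = set(assembled), set(truth)
    tp = len(assembled_set & truth_set)   # correctly assembled CDR3s
    fp = len(assembled_set - truth_set)   # assembled but not in the truth set
    fn = len(truth_set - assembled_set)   # true CDR3s that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F-score is the harmonic mean of precision and recall.
    if precision + recall:
        f_score = 2 * precision * recall / (precision + recall)
    else:
        f_score = 0.0
    return {"precision": precision, "recall": recall, "f_score": f_score}


if __name__ == "__main__":
    # Toy sequences, invented for this example.
    assembled = ["CARDYW", "CAKGGW", "CQQYF"]
    truth = ["CARDYW", "CAKGGW", "CARSSW", "CMQALF"]
    print(cdr3_accuracy(assembled, truth))
```

Using sets means duplicate CDR3 strings are counted once, which matches a clonotype-level comparison rather than a read-level one.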


BCR-Seq data were generated by Adaptive Biosystems (https://www.adaptivebiotech.com/) and were analyzed with Adaptive Biosystems’s Immune Analyzer package. One difficulty inherent to using


BCR-Seq as a gold standard for estimating the efficiency of the RNA-Seq method is that BCR-Seq captures DNA clonotypes, whereas RNA-Seq only captures the expressed clonotypes. To account for


the possible discrepancies, we first map RNA-Seq reads onto the major clonotypes with relative frequency at least 90% detected by BCR-Seq. In 5 out of 18 BCR-Seq samples, no RNA-Seq reads


map to BCR-seq-confirmed major clonotypes. We exclude those samples from the analysis. In remaining samples, we consider the set of CDR3s obtained by BCR-Seq to be the total IGH repertoire.


We investigate which portion of the total immune repertoire RNA-Seq is capable of capturing. Using RNA-Seq, ImReP is able to capture on average 53.3% of the IGH repertoire, estimated as
the sum of detected BCR-seq-confirmed clonotypes; MiXCR is able to capture 40.1% (Fig. 2b). In all cases, ImReP is capable of detecting BCR-seq-confirmed clonotypes with a relative


frequency exceeding 90%. In comparison, MiXCR detects these clonotypes in only 83.3% of cases. When the frequency of the major clonotype drops below 10%, ImReP is able to detect the major


clonotype in 60% of the cases, while MiXCR only detects a clonotype in 20% of the cases. Remarkably, both methods are able to detect major clonotypes with a frequency below 1% in one of the


samples (Supplementary Data 2). We also investigate the ability of each method to detect BCR-seq-confirmed minor clonotypes. The average frequency of the minor clonotypes across all samples


is 0.37%. ImReP is able to detect a minor clonotype in 38% of the samples (Fig. 2e). Despite the ability of ImReP and MiXCR to capture a substantial portion of the BCR-seq-confirmed repertoire, both


methods often miss the rare clonotypes due to the limited number of Ig-derived reads in RNA-Seq data. ImReP is able to detect 50% of all BCR-Seq-confirmed clonotypes with the relative


frequency higher than 0.24%. MiXCR is able to detect 50% of all BCR-Seq-confirmed clonotypes with the relative frequency higher than 0.29% (Supplementary Fig. 4). Both methods are able to


accurately estimate the relative frequencies of assembled clonotypes (ImReP: _r_ = 0.97, _p_-value = 4.4 × 10⁻⁴⁰; MiXCR: _r_ = 0.87, _p_-value = 5.1 × 10⁻¹⁵) (Fig. 2c, d). Scripts and


commands utilized to process the data and run all the tools used in this study are available online at https://github.com/Mangul-Lab-USC/ImReP_publication. We also investigate the


possibility of erroneously fusing V and J reads, based on a partial CDR3 overlap, that were not actually derived from the same transcript. We obtained 3129 BCR-Seq-based IGH transcripts from a healthy, naive


B-cell repertoire sequenced using error-corrected B-cell receptor (BCR) sequencing19. We use those transcripts as the reference to simulate reads covering the BCR-Seq-based IGH transcripts


with 16× coverage. The second stage of ImReP increases sensitivity by 16% for 50 bp reads and 4% for 75 bp reads. No improvement is observed for 100 bp reads (Supplementary Table 1). The


decrease in precision for 75 bp reads results in an overall decrease in the F-score. Based on the simulated reads, we recommend applying the second stage of ImReP for 50 bp reads


(implemented as default settings in ImReP). We further validate the ability of ImReP to accurately infer the proportion of immune cells in sampled tissue. We hypothesize that the fraction of


B cells in a sample will be proportional to the fraction of receptor-derived reads in our RNA-Seq data. We use a transcriptome-based computational method, SaVant20, which uses


cell-specific gene signatures (independent of Ig transcripts) to infer the relative abundance of B cells within each tissue sample (Supplementary Table 2). The B-cell signatures used by


SaVant are derived from CD19+ cells and might not represent every B-cell subset21. However, CD19+ cells likely represent the largest populations of B-cell subsets and many of the


CD19-negative B-cell subsets may carry a gene signature similar to the CD19 signatures. We find that B-cell signatures inferred by SaVant show a positive correlation with the size of IGH


repertoire (_r_ = 0.77, _P_ = 1.8 × 10⁻¹⁰) (Fig. 2f). An exception to this correlation is found for tissues that contain the highest density of B cells: spleen, whole blood, small intestine


(terminal ileum), lung, and Epstein–Barr virus (EBV)-transformed lymphocytes (LCLs).

CHARACTERIZING THE IG REPERTOIRE ACROSS 53 GTEX TISSUES

ImReP identifies over 8826 million reads
overlapping 3.6 million distinct CDR3 sequences that originated from diverse human tissues. The largest share of assembled CDR3 sequences was derived from the IGH chain (1.7 million), 0.9 million


were derived from the IGK chain, and 1.0 million were derived from the IGL chain. Ninety-eight percent of CDR3 sequences have a count of less than ten reads and the median CDR3 sequence


count is 1.4. CDR3 sequences derived from IGK are the most abundant across all tissues, accounting on average for 54% of the entire B-cell population (Supplementary Fig. 5). We compare the


length and amino acid composition22 of the assembled CDR3 sequences of Ig receptor chains (Supplementary Fig. 6). Consistent with previous studies, we observe that Ig light chains have


notably shorter and less variable CDR3 lengths when compared with heavy chains23. The tissue type appears to have no effect on the length distribution of CDR3 sequences (Supplementary Fig. 


7). In line with other studies23, 24, both light chains exhibit a reduced amount of sequence diversity (Supplementary Fig. 6). We observe an average of 1331 distinct Ig clonotypes per


sample. To account for various sequencing depths, we further normalized the detected number of clonotypes by the total number of RNA-Seq reads. We refer to this measure as clonotypes per one


million raw RNA-Seq reads (CPM). As the number of distinct clonotypes does not increase linearly with the sequencing depth, a CPM metric should not be used in studies comparing clonotype


diversity across various phenotypes. Instead, CPM is intended to be an informative measure of clonal diversity that is adjusted for sequencing depth. We use per sample α-diversity (Shannon


entropy) to incorporate into a single diversity metric the total number of distinct clonotypes and their relative frequencies. Among all tissues, spleen has the largest B-cell population,


with a median of 1301 Ig-derived reads per one million RNA-Seq reads. Spleen also has the most diverse population of B cells with median per sample α-diversity rate of 7.6, corresponding to


1025 CPM (Fig. 3 and Supplementary Data 1). Organs that possess mucosal, exocrine, and endocrine sites (_n_ = 24) harbor a rich clonotype population with a median of 87 CPM per sample. Minor


salivary glands have the highest Ig diversity rates in the group (_α_ = 7.1) and surpass the diversity rates of the terminal ileum containing Peyer’s patches, which are secondary lymphoid


organs (Supplementary Data 1). Tissues not related to the immune system, including adipose, muscle, and organs from the central nervous system, contain a median of six CPM per sample, which


is most likely due to the blood content of the tissues25. The highest number of distinct CDR3 sequences among non-lymphoid organs is present in the omentum, a membranous double layer of


adipose tissue containing fat-associated lymphoid clusters. As expected26, EBV-transformed lymphocytes (LCLs) harbor a large homogeneous population of Ig clonotypes (Supplementary Data 1 and


Supplementary Fig. 8). The number of reported clonotypes is normalized by the proportion of B cells within each tissue sample (Supplementary Data 3). We have used SaVant to infer the


relative abundance of B cells within each tissue sample based on cell-specific gene signatures (independent of Ig transcripts).

IG CLONOTYPES SPECIFIC TO AN INDIVIDUAL OR A TISSUE TYPE

Amino acid sequences of clonotypes exhibit extreme inter-individual dissimilarity, with 88% of clonotypes unique to a single individual (private) (Fig. 4a). The remaining ~400,000 clonotypes are


shared by at least two individuals (public). The small fraction of B cells present in many tissues limits our ability to capture the entire Ig repertoire in those tissues and leads to


misclassification of some public clonotypes as private. The number of individuals sharing clonotypes varies across Ig chains, with Ig light chains having the highest number of public


clonotypes. Twenty-five percent of all IGK clonotypes are public, and the number of individuals sharing the IGK clonotype sequences can be as high as 471 (Fig. 4b). The limited capacity of


RNA-Seq to cover low-abundant clonotypes may misclassify public clonotypes as private. Consistent with previous studies9, 27, we observe that public clonotypes are significantly shorter in


length than private clonotypes (two-sided two-sample _t_-test: _p_-value < 2 × 10⁻¹⁶). For example, IGH chain public clonotypes have an average length of 13 amino acids and private


clonotypes have an average length of 16 amino acids. We also examine whether public clonotypes are more often shared across tissues than across individuals. We observe a strong correlation


between the number of times receptor sequences are shared across individuals and across tissues of the same individual for both IGK (_r_ = 0.78, _p_-value < 2 × 10⁻¹⁶) and IGL chains (_r_ = 0.77,
_p_-value < 2 × 10⁻¹⁶) (Supplementary Fig. 9). In contrast, public receptors of the IGH chain are unlikely to be shared across tissues (_r_ = 0.15, _p_-value < 2 × 10⁻¹⁶)


(Supplementary Fig. 9). Overall, 14% of the ~240,000 clonotypes from both light and heavy chains shared across tissues are public (Fig. 4c). The full list of public clonotypes is distributed


with the Atlas of Immunoglobulin Repertoires, which is publicly available at https://github.com/Mangul-Lab-USC/TAIR.

THE FLOW OF IG CLONOTYPES ACROSS HUMAN GTEX TISSUES

A large number of


samples available through this study allow us to establish a pairwise relationship between tissues and to track the flow of Ig clonotypes across various human tissues. We observe a


significant increase in the number of CDR3 sequences shared across pairs of tissues obtained from the same individual. Further, we consistently observe this pattern for all chains of Ig


receptors (two-sided Mann–Whitney _U_-test: _p_-value < 2 × 10⁻¹⁶ for each chain) (Fig. 5a). We observe a different amount of shared CDR3 sequences across different types of Ig chains,


with an increase in Ig light chains when compared to Ig heavy chains. The largest difference occurs between Ig heavy chains shared between samples taken from the same individuals (median


Sørensen–Dice similarity index of 0.0158), and samples taken from different individuals (median Sørensen–Dice similarity index of 0.0025) (Fig. 5a). To establish the flow of Ig clonotypes


across various tissues, we compare clonotype populations between and within the same individuals. We limit this analysis to pairs of tissues for which we had at least 10 individuals (870


pairs of tissues out of 1378 possible pairs). We use β-diversity (Sørensen–Dice similarity index) to measure compositional similarities between the tissues in terms of gain or loss of CDR3


sequences (Fig. 5b). For the majority of the 870 available tissue pairs, we observe no commonality between IGH sequences, which corresponds to a β-diversity score of 0.0. We examine the flow


of IGH clonotypes across tissues and present it as a network (Fig. 5b). Among 870 available tissue pairs, we identify 56 tissue pairs with a β-diversity score above 0.001. The spleen is
the most highly connected tissue (17 connections), followed by lung (16 connections). The tissues form a single connected component, meaning that every two nodes are connected either directly


or via other nodes. Clonotype populations of spleen and lung are the most similar (0.02 β-diversity score). Other highly similar pairs are minor salivary gland and esophagus mucosa, as well


as terminal ileum (small intestine) and transverse colon. We observe more than 200 pairs of tissues with a β-diversity score for Ig light chains above 0.001 (Supplementary Figs. 10 and 11).
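The diversity measures used in this and the preceding section (clonotypes per million reads, per-sample Shannon entropy, and the Sørensen–Dice similarity index) can be sketched in a few lines of Python. This is an illustrative sketch, not the code used to build the atlas. The function names are hypothetical, the natural logarithm is an assumption because the text does not state the base, and Sørensen–Dice is computed here on presence or absence of CDR3 sequences.

```python
import math
from typing import Iterable, Mapping


def clonotypes_per_million(n_clonotypes: int, total_reads: int) -> float:
    """CPM: distinct clonotypes per one million raw RNA-Seq reads."""
    return n_clonotypes * 1e6 / total_reads


def shannon_entropy(counts: Mapping[str, int]) -> float:
    """Per-sample alpha-diversity from clonotype read counts.

    Assumption: natural logarithm; the source text does not give the base.
    """
    total = sum(counts.values())
    if total == 0:
        return 0.0
    fractions = (c / total for c in counts.values() if c > 0)
    return sum(-p * math.log(p) for p in fractions)


def sorensen_dice(cdr3s_a: Iterable[str], cdr3s_b: Iterable[str]) -> float:
    """Beta-diversity between two samples: 2|A ∩ B| / (|A| + |B|)."""
    a, b = set(cdr3s_a), set(cdr3s_b)
    if not a and not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))


if __name__ == "__main__":
    # Toy clonotype labels, invented for this example.
    tissue_1 = {"A", "B", "C"}
    tissue_2 = {"B", "C", "D", "E"}
    print(sorensen_dice(tissue_1, tissue_2))
    print(shannon_entropy({"A": 1, "B": 1}))
    print(clonotypes_per_million(50, 2_000_000))
```

A Sørensen–Dice score of 0.0 means that two samples share no CDR3 sequence, which is the situation reported above for most IGH tissue pairs.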


The most similar tissue pairs for the IGK chain are spleen and transverse colon (0.15 β-diversity score).

IMREP IDENTIFIES TISSUE SAMPLES WITH LYMPHOCYTE INFILTRATION

Histological images of


tissue cross-sections and pathologists’ notes are used to validate ImReP’s ability to identify samples with a high lymphocyte content, which often correlates with a disease state. We


examine the IGH clonotype populations from thyroid tissue across individuals. The median number of inferred, distinct CDR3 sequences per sample is 20, although 14.5% of the samples had more


than 500 distinct CDR3 sequences. We observe the highest number of CDR3 sequences among all thyroid samples in an individual with Hashimoto’s thyroiditis, an autoimmune disease characterized


by lymphocyte infiltration and T-cell-mediated cytotoxicity. A text-based analysis of pathologists’ notes corresponding with biological samples indicates that Hashimoto’s disease was


present, with varying degrees of severity, in 12.6% of examined thyroid samples. First, we use pathologists’ notes to annotate samples as derived from individuals who lack a Hashimoto’s


disease diagnosis (_n_ = 180) or were assigned a Hashimoto’s disease diagnosis (_n_ = 26), then we compare the adaptive repertoire diversity between the two groups. We observe a significant


increase in the number of distinct IGH clonotypes in samples from individuals with Hashimoto’s thyroiditis (Mann–Whitney _U_-test: _U_ = 83, _p_-value = 2.1 × 10⁻¹⁴) (Supplementary Fig. 12).
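For readers unfamiliar with the Mann–Whitney _U_ statistic reported above, the following Python sketch shows how it can be computed from per-sample clonotype counts in two groups. It covers only the statistic, not the p-value. The function name `mann_whitney_u` and the toy counts are hypothetical and are not data from this study, and the text does not state which of the two _U_ values is the one reported.

```python
from typing import Sequence, Tuple


def mann_whitney_u(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    """Return (U_x, U_y) using average ranks for tied values."""
    # Pool both groups, remembering which group each value came from.
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):
        j = i
        # Extend j over a run of tied values.
        while j + 1 < len(pooled) and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        average_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[k] = average_rank
        i = j + 1
    rank_sum_x = sum(r for r, (_, group) in zip(ranks, pooled) if group == 0)
    n_x, n_y = len(x), len(y)
    u_x = rank_sum_x - n_x * (n_x + 1) / 2
    u_y = n_x * n_y - u_x
    return u_x, u_y


if __name__ == "__main__":
    # Toy clonotype counts per sample, invented for this example.
    without_diagnosis = [10, 20, 15]
    with_diagnosis = [400, 600, 18, 900]
    print(mann_whitney_u(without_diagnosis, with_diagnosis))
```

A small _U_ for one group means that its values rarely exceed those of the other group, which is the pattern expected when one group has consistently more clonotypes.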


The number of distinct IGH clonotypes also increases significantly with the noted severity of Hashimoto’s thyroiditis (Fig. 6a). In addition, a larger


number of clonotypes in kidney samples correlates with the presence of glomerulosclerosis, and, in lung samples, a larger number of clonotypes correlates with the presence of inflammatory


diseases such as sarcoidosis and bronchopneumonia. We observe no difference in clonal diversity in males and females across the tissue types, except in breast tissues (two-sided Mann–Whitney


_U_-test: _U_ = 376, _p_-value = 4.14 × 10⁻¹⁵). Higher clonotype diversity scores of breast tissue in male individuals correspond to gynecomastia, a common disorder marked by non-cancerous


enlargement of male breast tissue (Fig. 6b).

DISCUSSION

We develop ImReP, a computational approach capable of accurately reconstructing Ig immune repertoires using RNA-Seq data. Our initial


study demonstrates the ability of ImReP to efficiently extract Ig-derived reads from RNA-Seq data and accurately assemble the corresponding hypervariable region sequences. The proposed


algorithm can accurately assemble CDR3 sequences of Ig receptors, despite the presence of sequencing errors and short read length. Simulations generated using various read lengths and


coverage depth show that ImReP consistently outperforms existing methods in terms of precision and recall rates. We demonstrate the feasibility of applying RNA-Seq to study the adaptive


immune repertoire. Although RNA-Seq lacks the sequencing depth of targeted sequencing (i.e., BCR-Seq), the approach compensates for this analytical constraint by examining a larger sample size. Using ImReP, we create the first systematic atlas of immune sequences for Ig receptor repertoires across diverse human tissues. This atlas provides a rich resource for comparative


analysis of a range of tissue types, most of which are currently unstudied. The atlas of immune repertoires, available with the paper, is one of the largest collections of CDR3 sequences and


tissue types. We anticipate that this database will enhance future studies in areas such as immunology and will contribute to the development of diagnostic tools and therapies for human


disease. Using RNA-Seq to study immune repertoires is advantageous when compared to Rep-Seq; RNA-Seq has the ability to simultaneously capture clonotype populations from all chains during a


single run. RNA-Seq also allows simultaneous detection of overall transcriptional responses of the adaptive immune system, which it produces by comparing changes in the number of Ig


transcripts to the much larger transcriptome. Given the increasing number of large-scale RNA-Seq datasets available, we look forward to scaling up the atlas of immune receptors in order to


provide valuable insights into immune responses across various autoimmune diseases, allergies, and cancers. METHODS RNA-SEQ DATA We used RNA-Seq data from the Genotype-Tissue Expression


study (GTEx Consortium v.6) that corresponds to 8,555 samples collected from 53 tissues that were obtained from 544 individuals. The RNA-Seq data were generated using Illumina HiSeq sequencing of


75 bp paired-end reads. The data were derived from 38 solid organ tissues, 11 brain subregions, whole blood, and 3 cell lines of postmortem donors. The samples were collected from adults


matched for age across male and female individuals. Metadata for the GTEx samples used in this paper can be found in Supplementary Data 4. RNA-SEQ DATA PREPROCESSING We downloaded the mapped


and unmapped reads in BAM format from dbGaP (http://www.ncbi.nlm.nih.gov/gap). For each sample, we prepared the candidate receptor-derived reads as the input for the ImReP tool. First, we


extracted reads mapped to the Ig genes. Some high-throughput aligners allow partial mapping (i.e., soft clipping), which trims one or both ends of a read and maps the remainder. Reads containing CDR3 sequences may be found among these reads and can be extracted using ImReP. Second, we filtered out low-quality reads, low-complexity reads, and reads that match rRNA repeats. We then merged the reads mapped to the Ig loci and the prepared unmapped reads; ImReP uses these data to assemble CDR3 sequences and the corresponding V(D)J recombinations. IMREP ALGORITHM ImReP is a


computational approach to assembling CDR3 sequences and detecting corresponding V(D)J recombinations from B- and T-cell receptors. ImReP consists of two stages. In the first stage, ImReP


infers the CDR3 sequences from reads that simultaneously overlap V and J gene segments. We defined the CDR3 as the sequence of amino acids between the cysteine on the right of the junction


and phenylalanine (for IGK or IGL) or tryptophan (for IGH) on the left of the junction. We first converted the read sequences from nucleotides to amino acids. We scanned the amino acid


sequences of the read and determined the putative CDR3 as a sub-sequence of the read, starting from cysteine (C) and ending with phenylalanine (F) for IGK or IGL, or with tryptophan (W) for IGH. The reads


containing the described substring were considered candidate CDR3 reads. We denoted _n_ to be the length of the read. We denoted the coordinates of the putative CDR3 string to be _x_ and


_y_, corresponding with the start and the end of the CDR3 sequence, respectively, in the read coordinates. This way each candidate CDR3 read is composed of three parts. The first part of the


CDR3 read contains a prefix of the read, which potentially overlaps with the suffix of V gene. The prefix contains the amino acids from the read, from position 0 to _x_ − 1. The second part


of the CDR3 read is a substring of the read containing the putative CDR3 sequence. It contains the amino acids from the read, from position _x_ to _y_. The third part of the CDR3 read is a


suffix of the read that potentially overlaps with the prefix of J gene. The suffix contains the amino acids of the read, from position _y_ + 1 to _n_. The amino acid sequences of the V and J


genes of BCRs were imported from the IMGT information system (http://www.imgt.org/vquest/refseqh.html#V-D-J-C-sets). For each V gene, we identified the last conserved cysteine (C) and recorded its position _p_C in the V gene sequence. For each J gene, we identified the first conserved phenylalanine (for IGK or IGL) or tryptophan (for IGH) and recorded its position _p_F in the J gene sequence. For each V gene, we extracted two substrings:

$$V_x = V[0,p_{\mathrm{C}} - 1]\,{\mathrm{and}}\,V_y = V[p_{\mathrm{C}} + 1,n_{\mathrm{V}}]$$ (1)

For each J gene, we recorded two substrings:

$$J_x = J[0,p_{\mathrm{F}} - 1]\,{\mathrm{and}}\,J_y = J[p_{\mathrm{F}} + 1,n_{\mathrm{J}}]$$ (2)

Here, _n_V and _n_J are the lengths of V and J genes, respectively. Given a set of candidate CDR3 reads, we attempted to find the corresponding


V and J genes. We matched a substring of the read _r_[0, _x_ − 1] with the corresponding suffix of _V__x_ for V genes. We also matched the read _r_[_y_ + 1, _n_] with the corresponding


prefix of _J__x_ for J genes. We considered a read to match the V gene if the length of _r_[0, _x_ − 1] is greater than four and the edit distance between _r_[0, _x_ − 1] and _V__x_ is


<2. We considered a read to match the J gene if the length of _r_[_y_ + 1, _n_] is >4, and the edit distance between _r_[_y_ + 1, _n_] and _J__x_ is <2. In cases where a read


overlaps equally (in terms of edit distance) among multiple V genes and J genes, all matching V genes are reported. In the second stage, ImReP utilizes the reads overlapping only with the V


or J gene. Such reads contain a partial CDR3 sequence. ImReP builds a suffix tree _S_ on the reads overlapping any of the V genes. Then, for each read _j_ overlapping a J gene, a V-gene-overlapping read _v_ is retrieved from _S_, if one exists. Reads _v_ and _j_ are concatenated (based on the overlap) and the CDR3 region is extracted. Further, ImReP uses a CAST clustering technique to correct the assembled CDR3s for PCR and sequencing errors. The output of the algorithm is the set of CDR3 partitions, and each of the partitions corresponds to a


clonotype. Specifically, ImReP builds a complete graph _G_ = (_V_, _E_, _w_), where the set of vertices _V_ is represented by the set of assembled CDR3 sequences. The weight of the edge is


determined by the inverse of the edit distance, computed between the two CDR3 sequences _x_ and _y_. The CAST algorithm is executed with the following procedure. A new partition _P_ is


initialized with the max-degree node. Then, the set of “close” vertices is iteratively added to the partition, and the set of “distant” vertices is removed from the partition. A vertex _v_ is deemed to be “close” (“distant”) if the average edge weight (inverse edit distance) between _v_ and the vertices of _P_ is greater (smaller) than a user-defined threshold. The procedure is repeated until either the


set of “close” or the set of “distant” vertices is empty. In such a way, the partition _P_ is based on a max-degree node and extended with the “close” vertices. Vertices belonging to _P_ are


then removed from the graph _G_ and the clustering procedure is repeated until all of the vertices are assigned to a partition. Let {_v_1, _v_2,…,_v__i_,…,_v_n} be a partition output by the


CAST algorithm. Each _v__i_ has an associated weight equal to the number of times the CDR3 _v__i_ was assembled during the first two stages of ImReP. We computed the weighted consensus sequence of _P_ and output the sequence as a final clonotype. Finally, we mapped D genes (for IGH) onto the assembled CDR3 sequences and inferred the corresponding V(D)J recombination. Starting with


release v0.8, ImReP reports the out-of-frame CDR3 sequences. VALIDATION BASED ON SIMULATED RNA-SEQ DATA We performed in-silico simulations to investigate the feasibility of using RNA-Seq to


study the adaptive immune repertoire. We first checked the ability of ImReP to extract the receptor-derived reads from raw RNA-Seq reads. To this end, we simulated Ig transcripts, which are


composed of recombined VDJ segments containing non-template insertion at the V(D)J junction (Supplementary Fig. 2). We used the IMGT database (http://www.imgt.org/vquest/refseqh.html) of V


and J gene segments. We randomly selected V, D, and J segments, and we inserted a sequence of random nucleotides between V and D, and between D and J. The length of the inserted sequence was


sampled from a Gaussian-like distribution with a mean value of 15. We also excluded the simulated transcripts that contain random insertions leading to out-of-frame proteins. We used LymAnalyzer (version 1.2.2) (https://sourceforge.net/projects/lymanalyzer/) to validate the CDR3 sequences of the transcripts. We used simNGS (version 1.6) (https://www.ebi.ac.uk/goldman-srv/simNGS/) to simulate paired-end reads, referred to as receptor-derived reads, from Ig transcripts. Next, we simulated 50 million transcriptomic reads from a


human transcriptome reference (GRCh37). We mixed receptor-derived reads with transcriptomic reads into an RNA-Seq mixture (Supplementary Fig. 3). We then applied ImReP to the simulated RNA-Seq


mixture in order to check the ability of ImReP to extract CDR3-derived reads from the RNA-Seq mixture. Next, we studied the effects of the coverage and read length on the ability to


reconstruct CDR3 sequences. In total, we simulated 1,000 Ig transcripts. We simulated paired-end reads of various read lengths (_l_ = 50, 75, 100). We also simulated different numbers of reads that correspond to different coverage rates of Ig transcripts (_c_ = 1, 2, 4, 8, 16, 32, 64, 128). We used the power law distribution to assign frequencies to simulated Ig transcripts (ref. 28). The CDR3 amino acid sequences assembled by ImReP were compared to simulated transcripts in order to evaluate the recall and precision for various read lengths and coverage


rates. We define recall and precision in the following way:

$${\mathrm{Recall}} = {\mathrm{TP}}/\left( {{\mathrm{TP}} + {\mathrm{FN}}} \right)$$ (3)

$${\mathrm{Precision}} = {\mathrm{TP}}/\left( {{\mathrm{TP}} + {\mathrm{FP}}} \right)$$ (4)

where TP is the number of correctly assembled CDR3 sequences (i.e., an exact match to a simulated CDR3), FN is the number of simulated CDR3 sequences not assembled by the method, and FP is the number of incorrectly assembled CDR3 sequences. Scripts that simulate the reads and Ig transcripts


are available online at: https://github.com/Mangul-Lab-USC/ImReP_publication. VALIDATION BASED ON BCR-SEQ-BASED IGH TRANSCRIPTS Additionally, we used Ig sequences assembled from targeted


BCR-Seq data, which were derived from the IGH locus spanning the region from FR1 to the IGHJ gene. Error-corrected BCR sequencing was used to generate sequences from peripheral blood mononuclear cells that were sampled from a healthy individual, as described in ref. 19. Targeted BCR amplification was performed using a two-step RT-PCR protocol with multiplex IGHV gene primers and a barcoded IGHJ primer. Amplicons were sequenced using MiSeq 300 bp paired-end libraries, following Illumina protocols. Raw sequence reads are available under the EGAN00001419382 accession


number in the European Genome-Phenome Archive. BCR-Seq was approved by the Wellcome Sanger Institute review boards and ethics committees (07/MRE05/44). As the Ig sequences constitute human


data, they are stored under managed data access according to the Wellcome Trust data release policy. Access to these samples must be requested from the Data Access Committee (DAC), whose


contact details can be found on the EGA study page. Data hosted on DAC have an accession number; more information can be obtained by sending an email to [email protected]. The


requester will be required to sign a data access agreement, which is in place to protect the identity of the sample donor via a managed access system. We ran the online version of IgBLAST


using the default alignment options (https://www.ncbi.nlm.nih.gov/igblast/igblast.cgi) to extract the CDR3 sequence from each Ig transcript. The resulting 3129 sequences fall into 435


distinct CDR3 sequences. Next, we used simLibrary (version 1.3) and simNGS (version 1.6) software packages to simulate three Illumina single-end read datasets at 16x coverage with read


lengths of 50 bp, 75 bp, and 100 bp. The commands used for this process can be found on GitHub: https://github.com/Mangul-Lab-USC/ImReP_publication. On each of the datasets, we ran ImReP


with the default options and the “–noOverlap” option. DETERMINE PARAMETERS FOR CLUSTERING USING THE CAST ALGORITHM To cluster the assembled clones with a good balance between sensitivity and


precision, we determined the threshold for inverse edit distance used by CAST. The inverse edit distance is iteratively used by CAST inside each cluster until the average inverse edit


distance (Levenshtein) inside each cluster is less than the user-defined threshold. In general, a decrease in sensitivity shows that many true positive clones are collapsed by CAST. The increase in precision shows the advantage of applying CAST, where false positive clones are collapsed. The CAST threshold was set to maximize the F-score across different coverages and read lengths. Using the simulated data, we set the default CAST threshold for the inverse edit distance to 0.25 (Supplementary Fig. 13). We simulated paired-end reads of length


2 × 75 bp, covering the Ig transcript with an average coverage rate of 8×. After the CAST threshold was determined, we applied ImReP with and without CAST on simulated datasets with various


read lengths and coverages (Supplementary Fig. 14). Across the vast majority of read lengths and coverages, CAST provides an improved clonotype reconstruction accuracy (measured by


F-score). A lower edit distance threshold (<0.1) results in a drop in sensitivity, suggesting that closely related but bona fide distinct mutants are collapsed by the algorithm. The default edit distance threshold chosen by ImReP balances both precision and sensitivity; the default setting avoids collapsing closely related yet bona fide distinct clonotypes. COMPARISON WITH


OTHER METHODS We used simulated and real datasets to compare ImReP to existing methods. We note that IMSEQ cannot be applied to RNA-Seq reads, because it was originally designed for BCR-Seq.
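When methods are compared on simulated data, the recall and precision of Eqs. (3) and (4) can be computed from the sets of assembled and simulated CDR3 sequences. The sketch below is a minimal Python illustration, assuming exact string matching of CDR3 amino acid sequences and an F-score taken as the harmonic mean of precision and recall (the text does not spell out its F-score formula). The function name and the example sequences are hypothetical.

```python
def precision_recall_f(assembled, simulated):
    """Score assembled CDR3s against the simulated ground truth.

    TP: assembled sequences that exactly match a simulated CDR3.
    FP: assembled sequences with no exact match.
    FN: simulated CDR3s that were not assembled.
    """
    assembled, simulated = set(assembled), set(simulated)
    tp = len(assembled & simulated)
    fp = len(assembled - simulated)
    fn = len(simulated - assembled)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # Assumed F-score: harmonic mean of precision and recall (F1).
    f_score = (2 * precision * recall / (precision + recall)
               if precision + recall else 0.0)
    return precision, recall, f_score

# Hypothetical CDR3 amino acid sequences for one method on one dataset.
truth = ["CARDYW", "CQQYF", "CAKDLW", "CMQALF"]
calls = ["CARDYW", "CARGGW", "CQQYF"]
precision, recall, f_score = precision_recall_f(calls, truth)
```

Running the same function on the output of each tool and each simulated dataset yields directly comparable scores across read lengths and coverage rates.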


Scripts and commands used to run the repertoire assembly tools are available online at: https://github.com/Mangul-Lab-USC/ImReP_publication. CELL-TYPE COMPOSITION B-cell signature values per


sample were derived using SaVanT (ref. 20). Cell-specific signature genes are first defined from a set of cells/tissues obtained from the Human Body Atlas (ref. 29) by using the proportional median values.
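The signature procedure detailed in the remainder of this subsection can be sketched in a few lines of Python. This is a minimal illustration and not the SaVanT implementation: the function names, the nested-dictionary input layout, and the toy intensities are assumptions made for the example, and all expression values are assumed to be strictly positive so that ratios and log2 values are defined.

```python
import math
from statistics import mean, median

def proportional_median(expr_by_cell):
    """Divide each gene's intensity in a cell type by its median across cell types.

    expr_by_cell: {cell_type: {gene: intensity}}, same genes in every cell type.
    """
    genes = list(next(iter(expr_by_cell.values())))
    med = {g: median(expr_by_cell[c][g] for c in expr_by_cell) for g in genes}
    return {c: {g: expr_by_cell[c][g] / med[g] for g in genes}
            for c in expr_by_cell}

def top_signature(ratios, cell_type, k=25, exclude=()):
    """Take the k genes with the highest ratio, then drop excluded genes (e.g., Ig genes)."""
    ranked = sorted(ratios[cell_type], key=ratios[cell_type].get, reverse=True)
    return [g for g in ranked[:k] if g not in exclude]

def signature_score(sample_expr, signature):
    """Average of the log2-transformed expression values of the signature genes."""
    return mean(math.log2(sample_expr[g]) for g in signature)

# Toy reference atlas with made-up intensities (k is reduced from 25 to 3).
atlas = {
    "CD19+ B cells": {"CD19": 80, "MS4A1": 45, "ALB": 1, "IGHM": 100},
    "liver": {"CD19": 2, "MS4A1": 3, "ALB": 90, "IGHM": 1},
    "lung": {"CD19": 4, "MS4A1": 3, "ALB": 2, "IGHM": 2},
}
ratios = proportional_median(atlas)
signature = top_signature(ratios, "CD19+ B cells", k=3, exclude={"IGHM"})
score = signature_score({"CD19": 8, "MS4A1": 2, "ALB": 50}, signature)
```

Because Ig genes are removed after the top genes are selected, the final signature can contain fewer than _k_ genes.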


We calculate these values by dividing the intensity of a probe in a particular cell type by its median value across all cells/tissues. The top 25 genes with the highest proportional median


value for CD19+ B cells were defined as the specific signature for that cell type (Supplementary Table 2). All Ig genes were removed from the signature. The signature score is then generated


from the average of the log2-transformed values of the signature genes within each sample. DEFINITION OF CLONOTYPE Clonotypes are defined as clones with identical CDR3 amino acid sequences.
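To make this definition concrete, the following minimal Python sketch scans translated reads for putative CDR3 substrings (starting at C and ending at W for IGH, or at F for IGK and IGL, as described in the IMREP ALGORITHM section) and then collapses identical amino acid sequences into clonotypes. It is an illustration under stated assumptions rather than ImReP's code: the function names, the `min_len` cutoff, and the example reads are hypothetical, and the V and J gene matching that ImReP uses to confirm candidates is omitted.

```python
from collections import Counter

def putative_cdr3s(aa_read, chain="IGH", min_len=5):
    """List (x, y, substring) for every candidate CDR3 in a translated read.

    x and y are 0-based read coordinates of the starting cysteine and of
    the terminal residue (W for IGH, F for IGK/IGL), both inclusive.
    min_len is a hypothetical lower bound on the candidate length.
    """
    end_residue = "W" if chain == "IGH" else "F"
    hits = []
    for x, residue in enumerate(aa_read):
        if residue != "C":
            continue
        for y in range(x + min_len - 1, len(aa_read)):
            if aa_read[y] == end_residue:
                hits.append((x, y, aa_read[x:y + 1]))
    return hits

def count_clonotypes(cdr3_calls):
    """Collapse CDR3 calls with identical amino acid sequences into clonotypes."""
    return Counter(cdr3_calls)

# Hypothetical translated heavy-chain reads; two of them carry the same CDR3.
reads = ["AVYYCARDYWGQG", "VYYCARDYWGQGT", "YYCAKGGSYWGQG"]
calls = [hit[2] for read in reads for hit in putative_cdr3s(read, "IGH")]
clonotypes = count_clonotypes(calls)
```

A read can yield several candidates (for example, a light-chain read with two phenylalanines downstream of the cysteine), which is one reason the flanking prefix and suffix are compared against V and J gene sequences before a CDR3 is accepted.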


HISTOLOGICAL IMAGES AND PATHOLOGIST NOTES We used histological images and pathologists’ notes (available from the GTEx portal, http://www.gtexportal.org/home/histologyPage#data) to validate


the adaptive immune profile of the samples. Although samples were derived from primary tissues, they often have a mixed cell type composition. For example, samples from stomach tissues have


various proportions of lymphocytes as, according to pathologists’ notes, they were derived from mucosal or muscularis areas of the tissue. GTEx samples with inflammation and/or subject to


various diseases are separately investigated. Pathologists’ notes report the percentage of mucosa, and the disease or inflammation status, of the biopsied tissue. STATISTICS AND


REPRODUCIBILITY Statistical tests were performed wherever the sample size was sufficient to support them. DATA REPRESENTATION We used WebLogo3


(http://weblogo.threeplusone.com/manual.html) to visualize the amino acid composition of assembled CDR3 sequences, and Gephi (https://gephi.org/users/) to visualize the flow of clonotypes


across diverse human tissues. REPORTING SUMMARY Further information on research design is available in the Nature Research Reporting Summary linked to this article. DATA AVAILABILITY All


RNA-Seq data discussed in this paper are available as part of the Genotype-Tissue Expression (GTEx) Project under the phs000424.v8.p2 accession number in the database of Genotypes and Phenotypes (dbGaP). The targeted BCR-Seq assemblies used for validation of the ImReP method are available at the Adaptive Biotechnologies webpage


(https://clients.adaptivebiotech.com/pub/lombardo-2017-bloodadvances). RNA-Seq samples used for validation of ImReP are available under the SRP099346 accession number in the Sequence Read


Archive. Raw sequence reads of targeted BCR-Seq data are available under the EGAN00001419382 accession number in the European Genome-Phenome Archive. All data required to produce the figures


and analysis performed in this paper are freely available at https://github.com/Mangul-Lab-USC/ImReP_publication and are available in the Source Data zip file, including the data used to


produce Figs. 2a–f, 3, 4a, c, 5a, b, and 6a, b, and Supplementary Figs. 4, 5, 6d, 7, 8, 9a–c, 10–12, 13a–c, and 14a–c.  CODE AVAILABILITY ImReP is freely available at


https://github.com/Mangul-Lab-USC/imrep. ImReP is distributed under the terms of the General Public License version 3.0 (GPLv3). All code required to produce the figures and analysis


performed in this paper is freely available at https://github.com/Mangul-Lab-USC/ImReP_publication. Source data are provided with this paper. CHANGE HISTORY * _ 04 SEPTEMBER 2020 An


amendment to this paper has been published and can be accessed via a link at the top of the paper. _ REFERENCES

* Georgiou, G. et al. The promise and challenge of high-throughput sequencing of the antibody repertoire. _Nat. Biotechnol._ 32, 158–168 (2014).
* Freeman, J. D., Warren, R. L., Webb, J. R., Nelson, B. H. & Holt, R. A. Profiling the T-cell receptor beta-chain repertoire by massively parallel sequencing. _Genome Res._ 19, 1817–1824 (2009).
* Rajewsky, K., Förster, I. & Cumano, A. Evolutionary and somatic selection of the antibody repertoire in the mouse. _Science_ 238, 1088–1094 (1987).
* Benichou, J., Ben-Hamo, R., Louzoun, Y. & Efroni, S. Rep-Seq: uncovering the immunological repertoire through next-generation sequencing. _Immunology_ 135, 183–191 (2012).
* DeWitt, W. S. et al. A public database of memory and naive B-cell receptor sequences. _PLoS ONE_ 11, e0160853 (2016).
* Putintseva, E. V. et al. Mother and child T cell receptor repertoires: deep profiling study. _Front. Immunol._ 4, 463 (2013).
* Blachly, J. S. et al. Immunoglobulin transcript sequence and somatic hypermutation computation from unselected RNA-seq reads in chronic lymphocytic leukemia. _Proc. Natl Acad. Sci. USA_ 112, 4322–4327 (2015).
* Bolotin, D. A. et al. MiXCR: software for comprehensive adaptive immunity profiling. _Nat. Methods_ 12, 380–381 (2015).
* Li, B. et al. Landscape of tumor-infiltrating T cell repertoire of human cancers. _Nat. Genet._ 48, 725–732 (2016).
* Stubbington, M. J. T. et al. T cell fate and clonality inference from single-cell transcriptomes. _Nat. Methods_ 13, 329–332 (2016).
* Lefranc, M.-P. et al. IMGT®, the International ImMunoGeneTics information system® 25 years on. _Nucleic Acids Res._ 43, D413–D422 (2015).
* Ye, J., Ma, N., Madden, T. L. & Ostell, J. M. IgBLAST: an immunoglobulin variable domain sequence analysis tool. _Nucleic Acids Res._ 41, W34–W40 (2013).
* Kuchenbecker, L. et al. IMSEQ–a fast and error aware approach to immunogenetic sequence analysis. _Bioinformatics_ 31, 2963–2971 (2015).
* GTEx Consortium et al. Genetic effects on gene expression across human tissues. _Nature_ 550, 204–213 (2017).
* Strauli, N. B. & Hernandez, R. D. Statistical inference of a convergent antibody repertoire response to influenza vaccine. _Genome Med._ 8, 60 (2016).
* Warren, R. L., Nelson, B. H. & Holt, R. A. Profiling model T-cell metagenomes with short reads. _Bioinformatics_ 25, 458–464 (2009).
* Mose, L. E. et al. Assembly-based inference of B-cell receptor repertoires from short read RNA sequencing data with V’DJer. _Bioinformatics_ 32, 3729–3734 (2016).
* Lombardo, K. A. et al. High-throughput sequencing of the B-cell receptor in African Burkitt lymphoma reveals clues to pathogenesis. _Blood Adv._ 1, 535–544 (2017).
* Petrova, V. N. et al. Combined Influence of B-Cell Receptor Rearrangement and Somatic Hypermutation on B-Cell Class-Switch Fate in Health and in Chronic Lymphocytic Leukemia. _Front. Immunol._ 9, 1784 (2018).
* Lopez, D. et al. SaVanT: a web-based tool for the sample-level visualization of molecular signatures in gene expression profiles. _BMC Genomics_ 18, 824 (2017).
* Landsverk, O. J. B. et al. Antibody-secreting plasma cells persist for decades in human intestine. _J. Exp. Med._ 214, 309–317 (2017).
* Crooks, G. E., Hon, G., Chandonia, J.-M. & Brenner, S. E. WebLogo: a sequence logo generator. _Genome Res._ 14, 1188–1190 (2004).
* Philibert, P. et al. A focused antibody library for selecting scFvs expressed at high levels in the cytoplasm. _BMC Biotechnol._ 7, 81 (2007).
* Hoi, K. H. & Ippolito, G. C. Intrinsic bias and public rearrangements in the human immunoglobulin Vλ light chain repertoire. _Genes Immun._ 14, 271–276 (2013).
* Yu, H.-P., Chiu, Y.-W., Lin, H.-H., Chang, T.-C. & Shen, Y.-Z. Blood content in guinea-pig tissues: Correction for the study of drug tissue distribution. _Pharmacol. Res._ 23, 337–347 (1991).
* De Rossi, A. et al. Infection of Epstein-Barr virus-transformed lymphoblastoid B cells by the human immunodeficiency virus: evidence for a persistent and productive infection leading to B cell phenotypic changes. _Eur. J. Immunol._ 20, 2041–2049 (1990).
* Warren, R. L. et al. Exhaustive T-cell repertoire sequencing of human peripheral blood samples reveals signatures of antigen selection and a directly measured repertoire size of at least 1 million clonotypes. _Genome Res._ 21, 790–797 (2011).
* Weinstein, J. A., Jiang, N., White, R. A. III, Fisher, D. S. & Quake, S. R. High-throughput sequencing of the zebrafish antibody repertoire. _Science_ 324, 807–810 (2009).
* Su, A. I. et al. A gene atlas of the mouse and human protein-encoding transcriptomes. _Proc. Natl Acad. Sci. USA_ 101, 6062–6067 (2004).

ACKNOWLEDGEMENTS We thank Dr. Lana Martin for


the helpful discussions and comments on the manuscript, and for assistance with producing the figures. AUTHOR INFORMATION Author notes * These authors jointly supervised this work: Sagiv


Shifman, Noah Zaitlen, Maura Rossetti, K. Mark Ansel, Eleazar Eskin, Serghei Mangul. AUTHORS AND AFFILIATIONS * Department of Computer Science, University of California, Los Angeles, 404


Westwood Plaza, Los Angeles, CA, 90095, USA Igor Mandric, Jeremy Rotman, Harry Taegyun Yang, William Van Der Wey, Benjamin Statz, Eleazar Eskin & Serghei Mangul * Department of Clinical


Pharmacy, School of Pharmacy, University of Southern California, 1540 Alcazar Street, Los Angeles, CA, 90033, USA Jeremy Rotman & Serghei Mangul * Bioinformatics Interdepartmental Ph.D.


Program, University of California, Los Angeles, 611 Charles E. Young Drive East, Los Angeles, CA, 90095-1570, USA Harry Taegyun Yang * Biomedical Sciences Graduate Program, University of


California, San Francisco, 1675 Owens Street, Suite 310, San Francisco, CA, 94143-0523, USA Nicolas Strauli * Department of Molecular, Cell, and Developmental Biology, University of


California, Los Angeles, 610 Charles E. Young Drive South, Los Angeles, CA, 90095, USA Dennis J. Montoya & Douglas Yao * Department of Microbiology, Immunology, and Molecular Genetics,


University of California, Los Angeles, 609 Charles E. Young Drive East, Los Angeles, CA, 90095, USA Jiem R. Ronas * Program in Bioinformatics and Integrative Genomics, Harvard Medical


School, 10 Shattuck Street, Suite 514, Boston, MA, 02115, USA Douglas Yao * Wellcome Trust Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SA, UK Velislava Petrova *


Department of Computer Science, Georgia State University, 33 Gilmer Street SE, Atlanta, GA, 30303, USA Alex Zelikovsky * The Laboratory of Bioinformatics, I.M. Sechenov First Moscow State


Medical University, Moscow, 119991, Russia Alex Zelikovsky * Institute for Quantitative and Computational Biosciences, University of California, Los Angeles, 611 Charles E. Young Drive East,


Los Angeles, CA, 90095, USA Roberto Spreafico & Serghei Mangul * Department of Genetics, The Institute of Life Sciences, The Hebrew University of Jerusalem, Jerusalem, 9190401, Israel


Sagiv Shifman * Department of Medicine, University of California, San Francisco, 533 Parnassus Avenue, San Francisco, CA, 94143, USA Noah Zaitlen * Immunogenetics Center, Department of


Pathology and Laboratory Medicine, David Geffen School of Medicine, University of California, Los Angeles, 1000 Veteran Avenue, Los Angeles, CA, 90095-1652, USA Maura Rossetti * Sandler


Asthma Basic Research Center, Department of Microbiology and Immunology, University of California, San Francisco, 513 Parnassus Avenue, San Francisco, CA, 94143-0414, USA K. Mark Ansel *


Department of Human Genetics, David Geffen School of Medicine at UCLA, 695 Charles E. Young Drive South, Box 708822, Los Angeles, CA, 90095, USA Eleazar Eskin * Department of Computational


Medicine, David Geffen School of Medicine at UCLA, 73-235 CHS, Los Angeles, CA, 90095, USA Eleazar Eskin CONTRIBUTIONS S.M. designed the study. I.M. developed the methods and the simulated datasets used in this study.


B.S., D.Y., J.R., J.R.R., H.T.Y., V.P., and W.V.D.W. performed analysis for this paper. B.S., D.Y., H.T.Y., I.M., J.R., V.P., and W.V.D.W. generated figures. A.Z., D.M., E.E., I.M., K.M.A.,


M.R., N.S., N.Z., R.S., S.M., S.S., and V.P. wrote the paper. A.Z., D.M., E.E., K.M.A., M.R., N.S., N.Z., R.S., S.M., S.S., and V.P. reviewed and edited the manuscript. CORRESPONDING AUTHOR


Correspondence to Serghei Mangul. ETHICS DECLARATIONS COMPETING INTERESTS The authors declare no competing interests. ADDITIONAL INFORMATION PEER REVIEW INFORMATION _Nature Communications_


thanks the anonymous reviewer(s) for their contribution to the peer review of this work. Peer reviewer reports are available. PUBLISHER’S NOTE Springer Nature remains neutral with regard to


jurisdictional claims in published maps and institutional affiliations. SUPPLEMENTARY INFORMATION SUPPLEMENTARY INFORMATION PEER REVIEW FILE REPORTING SUMMARY DESCRIPTION OF ADDITIONAL


SUPPLEMENTARY FILES SUPPLEMENTARY DATA 1 SUPPLEMENTARY DATA 2 SUPPLEMENTARY DATA 3 SUPPLEMENTARY DATA 4 SOURCE DATA SOURCE DATA RIGHTS AND PERMISSIONS OPEN ACCESS This article is licensed


under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate


credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article


are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and


your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this


license, visit http://creativecommons.org/licenses/by/4.0/. ABOUT THIS ARTICLE CITE THIS ARTICLE Mandric, I., Rotman, J., Yang, H.T. _et al._ Profiling immunoglobulin repertoires across multiple human tissues using RNA sequencing. _Nat Commun_ 11, 3126 (2020). https://doi.org/10.1038/s41467-020-16857-7 * Received: 27 October 2018 * Accepted: 24 May 2020 * Published: 19 June 2020 * DOI: https://doi.org/10.1038/s41467-020-16857-7

