
Genome-Wide Association Studies Fact Sheet

Genome-wide association studies involve scanning markers across the genomes of many people to find genetic variations associated with a particular disease.

What is a genome-wide association study?

A genome-wide association study is an approach that involves rapidly scanning markers across the complete sets of DNA, or genomes, of many people to find genetic variations associated with a particular disease. Once new genetic associations are identified, researchers can use the information to develop better strategies to detect, treat and prevent the disease. Such studies are particularly useful in finding genetic variations that contribute to common, complex diseases, such as asthma, cancer, diabetes, heart disease and mental illnesses.

Why are such studies possible now?

With the completion of the Human Genome Project in 2003 and the International HapMap Project in 2005, researchers now have a set of research tools that make it possible to find the genetic contributions to common diseases. The tools include computerized databases that contain the reference human genome sequence, a map of human genetic variation and a set of new technologies that can quickly and accurately analyze whole-genome samples for genetic variations that contribute to the onset of a disease.


How will genome-wide association studies benefit human health?

The impact of genome-wide association studies on medical care could be substantial. Such research is laying the groundwork for the era of personalized medicine, in which the current one-size-fits-all approach to medical care will give way to more customized strategies. In the future, after improvements are made in the cost and efficiency of genome-wide scans and other innovative technologies, health professionals will be able to use such tools to provide patients with individualized information about their risks of developing certain diseases. The information will enable health professionals to tailor prevention programs to each person's unique genetic makeup. In addition, if a patient does become ill, the information can be used to select the treatments most likely to be effective and least likely to cause adverse reactions in that particular patient.

What have genome-wide association studies found?

Researchers already have reported considerable success using this new strategy. For example, in 2005, three independent studies found that a common form of blindness is associated with variation in the gene for complement factor H, which produces a protein involved in regulating inflammation. Few previously thought that inflammation might contribute so significantly to this type of blindness, which is called age-related macular degeneration.

Similar successes have been reported using genome-wide association studies to identify genetic variations that contribute to risk of type 2 diabetes, Parkinson's disease, heart disorders, obesity, Crohn's disease and prostate cancer, as well as genetic variations that influence response to anti-depressant medications.

How are genome-wide association studies conducted?

To carry out a genome-wide association study, researchers use two groups of participants: people with the disease being studied and similar people without the disease. Researchers obtain DNA from each participant, usually by drawing a blood sample or by rubbing a cotton swab along the inside of the mouth to harvest cells.

Each person's complete set of DNA, or genome, is then purified from the blood or cells, placed on tiny chips and scanned on automated laboratory machines. The machines quickly survey each participant's genome for strategically selected markers of genetic variation, which are called single nucleotide polymorphisms, or SNPs.

If certain genetic variations are found to be significantly more frequent in people with the disease compared to people without disease, the variations are said to be "associated" with the disease. The associated genetic variations can serve as powerful pointers to the region of the human genome where the disease-causing problem resides.

However, the associated variants themselves may not directly cause the disease. They may just be "tagging along" with the actual causal variants. For this reason, researchers often need to take additional steps, such as sequencing DNA base pairs in that particular region of the genome, to identify the exact genetic change involved in the disease.
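The case-control comparison described above can be illustrated with a small worked example. The sketch below is not the procedure of any particular study: it assumes a simple allelic test (a Pearson chi-square test on a 2x2 table of allele counts), and the function name and the counts are hypothetical.

```python
import math

def allelic_chi_square(case_alt, case_ref, control_alt, control_ref):
    """Pearson chi-square test (1 d.f.) on a 2x2 table of allele counts."""
    table = [[case_alt, case_ref], [control_alt, control_ref]]
    row_totals = [sum(row) for row in table]
    col_totals = [case_alt + control_alt, case_ref + control_ref]
    total = sum(row_totals)
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (table[i][j] - expected) ** 2 / expected
    # For 1 degree of freedom the chi-square survival function
    # equals erfc(sqrt(x / 2)), so no statistics library is needed.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    odds_ratio = (case_alt * control_ref) / (case_ref * control_alt)
    return chi2, p_value, odds_ratio

# Hypothetical counts of one SNP allele in 1,000 cases and 1,000 controls
# (2,000 chromosomes per group).
chi2, p_value, odds_ratio = allelic_chi_square(
    case_alt=600, case_ref=1400, control_alt=450, control_ref=1550
)
print(f"chi2={chi2:.2f}, p={p_value:.2e}, odds ratio={odds_ratio:.2f}")
```

A real scan repeats such a test for every marker on the chip, which is why the significance threshold has to account for the number of tests.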

How can researchers access data from genome-wide association studies?

The National Center for Biotechnology Information (NCBI), a part of NIH's National Library of Medicine, is developing databases for use by the research community. An archive of data from genome-wide association studies on a variety of diseases and conditions already can be accessed through an NCBI Web site, called the Database of Genotype and Phenotype (dbGaP) located at: http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=gap .

What is NIH doing to support genome-wide association studies?

NIH, the Foundation for the National Institutes of Health, Pfizer Global Research & Development and others have formed a public-private partnership, the Genetic Association Information Network (GAIN), to fund genome-wide association studies. After peer-review of applications, GAIN announced its first round of studies in October 2006. The initial studies include bipolar disorder, major depression, kidney disease in type 1 diabetes, attention deficit hyperactivity disorder, schizophrenia and psoriasis. More information about GAIN can be found at: http://www.fnih.org/work/past-programs/genetic-association-information-network-gain .

In addition, individual NIH institutes have started genome-wide association studies. For example, the National Heart, Lung, and Blood Institute (NHLBI) has launched the Framingham Genetic Research Study in collaboration with the Boston University School of Medicine. In that study, 9,000 participants in the long-running Framingham Heart Study will undergo genome-wide association studies to identify the genes underlying cardiovascular and other chronic diseases, such as osteoporosis and diabetes. More information on that study can be found at: http://www.nhlbi.nih.gov/news/press-releases/2006/nhlbi-to-launch-framingham-genetic-research-study.html .

Other NHLBI efforts in this area include genome-wide association studies involving the Women's Health Study, the Women's Health Initiative and the Candidate Gene Association Resource, which pools DNA samples collected from multiple NHLBI cohort studies. NHLBI and the National Institute of General Medical Sciences are also major contributors to the PharmacoGenetics Research Network. Along with many other tools and technologies, this network is using genome-wide association studies to explore the effects of genes on individuals' varying responses to medications.

Some NIH institutes already have completed genome-wide association studies and deposited their data in the NCBI dbGaP database. These studies include research by the National Eye Institute on age-related eye diseases and the National Institute of Neurological Disorders and Stroke on Parkinson's disease.

Last updated: August 17, 2020


Genome-wide association studies have problems due to confounding: Are family-based designs the answer?

Alexander Strudwick Young

Affiliations: UCLA Anderson School of Management, Los Angeles, California, United States of America; Human Genetics Department, UCLA David Geffen School of Medicine, Los Angeles, California, United States of America

Published: April 12, 2024

  • https://doi.org/10.1371/journal.pbio.3002568

Genome-wide association studies (GWASs) can be affected by confounding. Family-based GWAS uses random, within-family genetic variation to avoid this. A study in PLOS Biology details how different sources of confounding affect GWAS and whether family-based designs offer a solution.

Citation: Young AS (2024) Genome-wide association studies have problems due to confounding: Are family-based designs the answer? PLoS Biol 22(4): e3002568. https://doi.org/10.1371/journal.pbio.3002568

Copyright: © 2024 Alexander Strudwick Young. This is an open access article distributed under the terms of the Creative Commons Attribution License , which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Funding: The study was supported by Open Philanthropy (010623-00001 and 2019-198171) and the National Institute on Aging/National Institutes of Health through R01 AG083379-01 (to the University of California, Los Angeles). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing interests: The author has declared that no competing interests exist.

Since the advent of large-scale, genome-wide genotype data, one study design has dominated human genetics: the genome-wide association study (GWAS). A GWAS scans across the genome for loci that are associated with a phenotype. However, GWASs are susceptible to confounding due to gene–environment correlation (environmental confounding) and correlations with other genetic variants across the genome (genetic confounding). Although GWASs use techniques including principal component analysis (PCA) to adjust for population structure, this often fails to eliminate all confounding [ 1 – 3 ]. The confounding present in GWAS has caused issues in downstream applications, including: estimation of heritability and genetic correlation [ 4 , 5 ], estimation of disease causes using mendelian randomization [ 6 ], and inferences of natural selection [ 2 ]. Family-based GWAS (FGWAS) has been proposed as a solution to confounding because genotypes of offspring are randomly assigned given the genotypes of the parents, generating a natural experiment. Recent FGWASs have demonstrated confounding in GWASs of several phenotypes [ 7 , 8 ], including educational attainment, cognitive ability, height, smoking, and age when first giving birth. In this issue of PLOS Biology , Veller and Coop [ 9 ] perform a comprehensive evaluation of how different phenomena can lead to bias in GWAS and FGWAS, finding that FGWAS is free from all environmental confounding and almost all genetic confounding.

Typically, GWASs analyze one focal variant at a time, so GWAS associations include the direct genetic effects (DGEs, causal effects of alleles in an individual on that individual) of the focal variant and variants that are correlated with the focal variant (i.e., in linkage disequilibrium or LD). Variants that are physically close (on the same chromosome) to the focal variant tend to be in strong LD as they tend to be inherited from the same parental haplotype without recombination. This makes it hard to pinpoint the causal variant, leading to the problem of fine-mapping. Since this occurs even under random-mating, it’s not typically thought of as genetic confounding even though it leads to genotype–phenotype association for non-causal variants. Furthermore, many methods are designed to work with this type of local LD that always affects GWAS.

However, phenomena other than DGEs of the focal and nearby variants can contribute to the genotype–phenotype associations picked up by GWAS [ 3 ] ( Fig 1 ). Indirect genetic effects (IGEs, effects of alleles in an individual on another individual mediated through the environment) from relatives (e.g., parents or siblings) contribute to genotype–phenotype associations and cannot, in general, be removed without data on first-degree relatives of the GWAS sample. Although IGEs are a form of gene–environment correlation, they do not require the population to be structured. When there is population structure, this can lead to correlations between alleles and other environmental factors (called population stratification, an example of environmental confounding). For example, two reproductively isolated populations could have different rates of skin cancer due to living at different latitudes: this could lead to a spurious association (in the overall population) between skin cancer risk and alleles that are at different frequencies in these two populations. While some forms of population structure can be corrected for using PCA and other methods, subtle forms of stratification are difficult to detect and remove from GWAS [ 1 ].

Fig 1.

The family on the left has a higher frequency of the A allele and is taller than the family on the right. This could lead to a GWAS finding that the A allele is associated with increased height. However, several distinct phenomena could contribute to the GWAS association. If the A allele exerts a different effect on the bodies of those carrying it compared to the T allele, leading to increased height, this would contribute to genotype–phenotype association and be an example of a DGE. However, if the A allele was inherited from a parent, the A allele could have affected the offspring’s height through another pathway: by affecting the parent’s phenotype (e.g., nurturing behavior) and thereby affecting the offspring’s height through the environment, an example of an IGE. IGEs can be considered as confounding factors when the goal is estimation of DGEs—implicitly the goal of most GWASs. However, genotype–phenotype associations can occur at loci without any causal influence on the phenotype due to population stratification, which occurs when there is a correlation between allele frequencies and environments across genetically distinct subpopulations. For example, the family on the left could be from a subpopulation with abundant calories and a high frequency of the A allele, whereas the family on the right could be from a subpopulation with insufficient calories and a low frequency of the A allele; this could generate a spurious association between the A allele and increased height. Another source of confounding is assortative mating, which results in correlation between mates’ phenotypes, and therefore between the causal alleles in the mother and the father, irrespective of their physical positions, inducing genetic confounding between the focal A allele and all other causal alleles in the genome. Family-based GWAS (FGWAS) exploits the randomization of genetic material during meiosis to remove confounding from estimates of DGEs.

https://doi.org/10.1371/journal.pbio.3002568.g001

Nonlocal genetic confounding—including across chromosomes—can result from nonrandom mating, including population structure, natural selection, and assortative mating. Assortative mating leads to correlations between mates’ phenotypes and therefore between the trait increasing alleles in the mother and the father, irrespective of their physical positions [ 10 ]. Over multiple generations of assortative mating, alleles that have the same direction of effect become positively correlated both within haplotypes from the same parent and across haplotypes from different parents. GWAS will therefore overestimate the DGE of a causal variant for a trait affected by (positive) assortative mating because it picks up part of the effect of all other causal variants, including rare variants and variants on other chromosomes [ 10 ]. The picture is further complicated by the fact that humans do not assort on a single phenotype, but on multiple dimensions involving multiple phenotypes simultaneously, inducing correlations across genetic variants affecting different traits [ 5 ].

A potential solution to the confounding that can affect GWAS is to instead perform FGWAS, which uses within-family genotype variation to estimate DGEs. Within-family variation is generated by random segregations of chromosomes during meiosis, which are independent of environment; thus, FGWAS eliminates confounding due to gene–environment correlation (including IGEs from parents and population stratification).
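The contrast between a population-level estimate and a within-family estimate can be made concrete with a toy simulation. The sketch below is an illustration under stated assumptions, not the method of Veller and Coop or of any published FGWAS: it assumes two subpopulations that differ in allele frequency and in an environmental shift, a true direct effect of 0.5, and a simple sibling-difference regression as the within-family estimator. All function names and parameter values are hypothetical.

```python
import random

def simulate_families(n_families, allele_freq, env_shift, beta, rng):
    """Return, per family, two siblings as (genotype, phenotype) pairs."""
    families = []
    for _ in range(n_families):
        # Parental genotypes: count of the focal allele (0, 1 or 2).
        parents = [sum(rng.random() < allele_freq for _ in range(2))
                   for _ in range(2)]
        sibs = []
        for _ in range(2):
            # Each parent transmits the focal allele with probability genotype / 2.
            g = sum(rng.random() < p / 2 for p in parents)
            y = beta * g + env_shift + rng.gauss(0, 1)
            sibs.append((g, y))
        families.append(sibs)
    return families

def population_slope(families):
    """Ordinary least-squares slope of phenotype on genotype, ignoring families."""
    points = [sib for fam in families for sib in fam]
    n = len(points)
    mean_g = sum(g for g, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    cov = sum((g - mean_g) * (y - mean_y) for g, y in points)
    var = sum((g - mean_g) ** 2 for g, _ in points)
    return cov / var

def sibling_difference_slope(families):
    """Regress sibling phenotype differences on sibling genotype differences."""
    num = sum((g1 - g2) * (y1 - y2) for (g1, y1), (g2, y2) in families)
    den = sum((g1 - g2) ** 2 for (g1, _), (g2, _) in families)
    return num / den

rng = random.Random(1)
true_beta = 0.5
# Subpopulation A: rare allele, no environmental shift.
# Subpopulation B: common allele, phenotype shifted upward by the environment.
families = (simulate_families(5000, 0.2, 0.0, true_beta, rng)
            + simulate_families(5000, 0.8, 2.0, true_beta, rng))
pop_estimate = population_slope(families)
sib_estimate = sibling_difference_slope(families)
print(f"population estimate: {pop_estimate:.2f}")
print(f"sibling-difference estimate: {sib_estimate:.2f} (true effect {true_beta})")
```

Because siblings share the environmental shift in this toy model, it cancels in the differences, so the sibling-difference estimate should land near the true effect while the population estimate is inflated by stratification.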

However, exactly how genetic confounding affects FGWAS had not been thoroughly investigated until the work of Veller and Coop [ 9 ]. Because chromosomes segregate independently during meiosis, genetic confounding from variants on different chromosomes does not affect FGWAS. Within-chromosome, FGWAS still has to contend with the fine-mapping problem due to a lack of recombination between nearby variants. What was not clear until Veller and Coop’s study was how FGWAS is affected by genetic confounding due to variants on the same chromosome but not in the same local region (LD block) as the focal variant. They examine theoretically and in simulations how both GWAS and FGWAS are affected by assortative mating, population structure (including due to phenotype-based migration), admixture, and natural selection (in the form of stabilizing selection). They find that all the sources of genetic confounding can lead to substantial bias in GWAS, but that the confounding in FGWAS is generally minimal. This is because the human genome is split over 23 chromosomes, implying that most pairs of loci are on different chromosomes. If traits are affected by many causal variants spread across the genome, then there is much less potential for genetic confounding due to correlations between variants on the same chromosome than due to all genome-wide variants. Since FGWAS eliminates the influence of cross-chromosome correlations, the vast majority of genetic confounding is eliminated. However, Veller and Coop show that a small amount of (nonlocal, but within chromosome) genetic confounding can remain under certain scenarios.
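The argument that most pairs of loci lie on different chromosomes can be checked with simple arithmetic. The sketch below assumes, purely for illustration, that causal loci are spread across chromosomes in proportion to a weight per chromosome; the equal-weight case is a simplification, and the function name is hypothetical.

```python
def same_chromosome_fraction(chromosome_weights):
    """Probability that two independently placed loci share a chromosome.

    Each locus falls on chromosome c with probability w_c / sum(w),
    so the probability of a shared chromosome is the sum of squared shares.
    """
    total = sum(chromosome_weights)
    return sum((w / total) ** 2 for w in chromosome_weights)

# Simplifying assumption: 23 chromosomes of equal weight.
equal_fraction = same_chromosome_fraction([1.0] * 23)
print(f"same chromosome: {equal_fraction:.3f}, different: {1 - equal_fraction:.3f}")
```

Under the equal-weight assumption only about 1 in 23 locus pairs shares a chromosome. Unequal chromosome sizes raise that fraction somewhat, but the qualitative conclusion is the same.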

Veller and Coop also give warnings about interpreting coefficients on parental genotypes as estimates of IGEs—as they are affected by both genetic and environmental confounding—and on interpreting the results of within-family polygenic prediction analyses. Polygenic predictors (called polygenic scores or PGSs) are weighted sums of genotypes, with weights typically derived from GWAS. Within-family association between a PGS and phenotype can only be due to DGEs (and IGEs between siblings if using a sibling design [ 7 ]) but can be misinterpreted when there is assortative mating on multiple traits. This could lead to a PGS for one trait predicting another trait within-family, despite there being no shared causal variants. This can occur because the PGS for one trait could give nonzero weight to variants that are causal for the other trait due to correlations between variants for both traits induced by assortative mating. This concern would be practically eliminated for within-family associations with PGSs derived from FGWAS, but these are currently far less powerful than PGS derived from GWAS [ 8 ].
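Since a polygenic score is described above as a weighted sum of genotypes, a minimal sketch of that calculation follows. The weights and genotypes are made-up illustrative values, and the function name is hypothetical; real scores use weights estimated in a GWAS or FGWAS.

```python
def polygenic_score(genotypes, weights):
    """Weighted sum of allele counts (0, 1 or 2) across variants."""
    if len(genotypes) != len(weights):
        raise ValueError("genotypes and weights must have the same length")
    return sum(g * w for g, w in zip(genotypes, weights))

# Hypothetical per-allele weights for three variants and one person's genotypes.
weights = [0.10, -0.05, 0.20]
person = [2, 1, 0]
score = polygenic_score(person, weights)
print(f"polygenic score: {score:.2f}")
```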

GWAS has successfully discovered thousands of trait–variant associations, given biological insights, guided drug target discovery, and enabled creation of powerful polygenic predictors. However, GWAS is susceptible to confounding that can lead to biases and erroneous conclusions in downstream analyses. FGWAS presents a principled solution by using the natural experiment of mendelian segregation during meiosis to remove confounding. Veller and Coop examine FGWAS and find that it removes all environmental confounding and almost all genetic confounding. Future studies should examine the consequences of the phenomena investigated by Veller and Coop on downstream applications of both GWAS and FGWAS (e.g., estimating genetic correlations and natural selection). While FGWAS has favorable properties compared to GWAS, it requires samples with genotyped first-degree relatives, which limits the sample size compared to GWAS: existing FGWASs have effective sample sizes in the tens of thousands [ 7 , 8 ], compared to millions for many GWASs. Even with comparable sample sizes, FGWAS is less powerful than GWAS because it only uses within-family genetic variation. Furthermore, issues not examined by Veller and Coop may lead to bias in FGWAS, such as nonrandom sampling of offspring with respect to heritable phenotypes. We should therefore consider building family-based sampling (ideally representative of the population) into the design of future biobanks to enable powerful and robust FGWAS and other family-based methodologies.

  • Published: 22 June 2021

Genome-wide association study and its applications in the non-model crop Sesamum indicum

  • Muez Berhe,
  • Komivi Dossa,
  • Jun You,
  • Pape Adama Mboup,
  • Idrissa Navel Diallo,
  • Diaga Diouf,
  • Xiurong Zhang &
  • Linhai Wang

BMC Plant Biology volume 21, Article number: 283 (2021)


Sesame is a rare example of a non-model and minor crop for which numerous genetic loci and candidate genes underlying features of interest have been disclosed at relatively high resolution. This progress has been achieved thanks to the application of the genome-wide association study (GWAS) approach. GWAS has benefited from the availability of high-quality genomes, re-sequencing data from thousands of genotypes, extensive transcriptome sequencing, and the development of a haplotype map and web-based functional databases in sesame.

In this paper, we reviewed the GWAS methods, the underlying statistical models and the applications for genetic discovery of important traits in sesame. A novel online database SiGeDiD ( http://sigedid.ucad.sn/ ) has been developed to provide access to all genetic and genomic discoveries through GWAS in sesame. We also tested for the first time, applications of various new GWAS multi-locus models in sesame.

Conclusions

Collectively, this work portrays steps and provides guidelines for efficient GWAS implementation in sesame, a non-model crop.

Sesame ( Sesamum indicum L., 2n = 2x = 26), which belongs to the Pedaliaceae family, is one of the most ancient oilseed crops, domesticated from the wild progenitor S. malabaricum in Near East, Asia and Africa over 5,000 years ago [ 1 , 2 ]. Sesame is reputed for its climate resilience, high oil content, and unique antioxidant properties [ 3 ]. It is an important source of high-quality edible oil and protein food. The oil content of sesame seed ranges from 50 to 60%, with a high proportion of natural antioxidants such as sesamolin, sesamin, and sesamol, conferring a long shelf life and stability to the oil [ 4 , 5 ]. Ashakumary et al. [ 6 ] reported that sesame seed contains 19-25% protein and is a good source of iron, magnesium, copper, calcium, vitamins B1 and E, and phytosterols that help to lower the levels of blood cholesterol. Besides, all essential amino acids and fatty acids are present in the sesame seed [ 7 ]. The sesame sector is a billion-dollar industry that supports the livelihoods of millions of farmers throughout the world [ 8 ]. The total production has significantly increased over the last ten years, reaching 6 million tons in 2017 (Food and Agriculture Organization Statistical Database [ 9 ]). Sesame production and productivity, however, face different constraints, including limited numbers of improved varieties, shattering of capsules at maturity, non-synchronous maturity, poor stand establishment, profuse branching, low harvest index, drought stress, waterlogging and diseases [ 10 , 11 , 12 ]. To accelerate sesame improvement, genomics-assisted breeding has been adopted as an efficient approach for developing superior varieties in a short time [ 13 ]. Hence, the reference genome sequence of sesame together with numerous essential genomic resources was delivered to the scientific community [ 14 ].
The haplotype map of the sesame genome was constructed from a re-sequencing project of 705 worldwide diverse cultivars and two representative genomes were further de novo assembled [ 15 ]. These resources are vital to the quick advancement of sesame research, as they expedite the detection of genetic loci that control important agronomic traits using the genome-wide association study (GWAS) approach. Today, hundreds of causative genetic variants associated with important traits such as oil quality, abiotic stress resistance, seed yield have been discovered. These findings facilitate the use of marker-assisted selection and genomic selection to advance genetic improvement and overall productivity of sesame. This makes sesame a rare case of non-model and minor crop for which genomic studies, particularly GWAS, have been very successful.

In this review paper, we first present the GWAS approach and underlying statistical models. Then, the ongoing efforts of genetic discovery through applications of GWAS in sesame are presented in detail. We conclude this paper with important guidelines for better applications of GWAS in sesame.

GWAS approach, underlying statistical models and applications in plants

GWAS approach

Genome-wide association study (GWAS), also known as association mapping or linkage disequilibrium (LD) mapping, takes full advantage of high phenotypic variation within a species and the high number of historical recombination events in the natural population. It has become an alternative to conventional quantitative trait locus (QTL) mapping for identifying the genetic loci underlying traits at a relatively high resolution [ 15 ]. In general, GWAS is used to study the association between single-nucleotide polymorphisms (SNPs) and target phenotypic traits. Nowadays, SNP identification is becoming much easier thanks to advanced high-throughput genotyping techniques. GWAS relies on LD and is carried out by genotyping and phenotyping many individuals in a natural population panel. Unlike the traditional QTL mapping approach, which makes use of bi-parental segregating populations, identification of causal genes for traits of interest in GWAS is performed in natural populations. A key advantage of GWAS is that the same genotyping data and the same population can be used over and over for different traits.

GWAS has been successfully applied to identify associations at a high resolution, detect candidate genes and dissect quantitative traits in humans, animals, and plants [ 16 , 17 ]. GWAS in various economically valuable crops has been used to gain insight into the genetic architecture of important traits, including days to heading, days to flowering, panicle architecture, resistance to rice yellow mottle virus, fertility restoration, and agronomic traits in rice [ 18 , 19 , 20 , 21 ]; pattern of genetic change and evolution [ 22 , 23 ], compositional and pasting properties [ 24 ], stalk biomass [ 25 ], leaf cuticular conductance [ 26 ] and harvest index [ 30 ] in maize; plant height components and inflorescence architecture [ 27 ], grain size [ 28 ] and grain quality [ 29 ] in sorghum; flowering time in canola [ 31 ]; stress tolerance, oil content and seed quality [ 32 ] in brassica; and oil yield and quality [ 15 ], yield-related traits [ 33 , 34 ], drought tolerance [ 35 ] and vitamin E [ 36 ] in sesame.

Statistical models underlying GWAS approach

Single-locus models.

Marker-trait associations in GWAS have mostly been detected using one-dimensional genome scans of the population [ 19 , 37 , 38 , 39 ]. In this method, one SNP is evaluated at a time. The first model used was the general linear model (GLM), described as $Y = \beta_0 + \beta_1 X$ [ 40 ], where $Y$ is the dependent (response) variable, $\beta_0$ is the intercept, $\beta_1$ is a weight or slope (coefficient) and $X$ is the explanatory variable. It was followed by a popular model referred to as the mixed linear model (MLM, or Q+K method), described as $Y = X\beta + Zu + e$ [ 41 ], where $Y$ is the vector of observed phenotypes; $\beta$ is an unknown vector containing fixed effects, including the genetic marker, population structure (Q), and the intercept; $u$ is an unknown vector of random additive genetic effects from multiple background QTL for individuals/lines; $X$ and $Z$ are known design matrices; and $e$ is the unobserved vector of residuals. The MLM was developed to control the multiple testing effects and bias of population stratification in GWAS, and it has been reported to partially improve the accuracy of association mapping [ 17 , 42 , 43 ]. Subsequently, numerous advanced statistical methods based on the MLM have been suggested to resolve certain limitations such as false-positive rates, large computational burdens, and inaccurate predictions [ 44 ]. Efficient mixed model association (EMMA) [ 45 ], compressed mixed linear model (CMLM) and population parameters previously determined (P3D) [ 46 ], and random-SNP-effect mixed linear model (MRMLM) [ 47 ] are some of the latest improved MLM-based approaches for single-locus genome scans proposed so far. Such advanced statistical models are powerful, flexible, and computationally efficient.
EMMA was proposed to minimize the computational load exhibited in the MLM probability functions by considering the quantitative trait nucleotide (QTN) effect as a fixed effect [ 17 , 44 , 45 ]; while CMLM was proposed to control the size of huge genotype data by grouping individuals into groups and, thus, the group kinship matrix is derived from the clustered individuals [ 46 ]. Generally, despite its limitation for efficient estimation of marker effects in complex traits, the single-locus model approach has a good ability to handle several markers [ 47 ], and this is one of its worthy reported features.
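The one-SNP-at-a-time scan can be sketched in a few lines. The example below fits the simple linear model Y = β0 + β1X separately for each SNP by ordinary least squares and reports the slope and its t statistic. It is a bare-bones illustration with made-up data and hypothetical function names, and it deliberately omits the population structure (Q) and kinship (K) terms that the MLM adds.

```python
import math

def fit_single_snp(genotypes, phenotypes):
    """Least-squares fit of phenotype on allele count for one SNP.

    Returns (intercept, slope, t_statistic), or None for a monomorphic SNP.
    """
    n = len(genotypes)
    mean_x = sum(genotypes) / n
    mean_y = sum(phenotypes) / n
    sxx = sum((x - mean_x) ** 2 for x in genotypes)
    if sxx == 0:
        return None  # no genotype variation, slope is undefined
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(genotypes, phenotypes))
    beta1 = sxy / sxx
    beta0 = mean_y - beta1 * mean_x
    rss = sum((y - beta0 - beta1 * x) ** 2 for x, y in zip(genotypes, phenotypes))
    se = math.sqrt(rss / (n - 2) / sxx)
    t_stat = beta1 / se if se > 0 else math.inf
    return beta0, beta1, t_stat

def single_locus_scan(genotype_matrix, phenotypes):
    """genotype_matrix[j] holds the genotypes of all individuals at SNP j."""
    return [fit_single_snp(snp, phenotypes) for snp in genotype_matrix]

# Made-up data: six lines, two SNPs coded as allele counts.
phenotypes = [1.0, 1.9, 3.1, 2.0, 3.0, 1.1]
snp_a = [0, 1, 2, 1, 2, 0]  # tracks the phenotype closely
snp_b = [1, 0, 1, 2, 0, 1]  # largely unrelated to the phenotype
results = single_locus_scan([snp_a, snp_b], phenotypes)
for name, (b0, b1, t) in zip(["snp_a", "snp_b"], results):
    print(f"{name}: slope={b1:.2f}, t={t:.1f}")
```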

Although single-locus analysis has been a common approach for testing the association between each SNP and the phenotype in GWAS, some earlier reports suggested that it has limited ability to resolve potential effects caused by multiple tests, historical genotype effects and pleiotropic effects [ 17 , 48 ]. They reported that the interaction between the available genetic variants throughout the genome is not thoroughly explored when only one SNP is tested at a time. Similarly, the Bonferroni correction employed to control false positives due to multiple testing is very stringent in this approach; hence a significant number of important loci may not be identified by single-locus models, particularly when there are large errors in the phenotypic data or multi-locus effects [ 49 , 50 ]. Thus, it has been suggested that these single-locus genome scan methods are not well suited to quantitative traits regulated by a few genes with large effects and/or many genes with minor effects [ 17 , 49 ]. Besides, the epistatic effects generated between closely located genes cannot be explored with single-locus methods [ 51 ].
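The stringency of the Bonferroni correction is easy to see numerically: the per-test threshold is the overall significance level divided by the number of tests. The sketch below uses made-up p-values and hypothetical function names.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold under the Bonferroni correction."""
    return alpha / n_tests

def significant_snps(p_values, alpha=0.05):
    """Indices of SNPs whose p-value passes the Bonferroni threshold."""
    threshold = bonferroni_threshold(alpha, len(p_values))
    return [i for i, p in enumerate(p_values) if p < threshold]

# Made-up p-values for five SNPs: the per-test threshold is 0.05 / 5 = 0.01.
p_values = [0.001, 0.04, 0.2, 0.012, 0.5]
hits = significant_snps(p_values)
print(f"significant SNP indices: {hits}")

# With one million tests the per-test threshold falls to 5e-8.
genome_wide = bonferroni_threshold(0.05, 1_000_000)
print(f"threshold for 1,000,000 tests: {genome_wide:.1e}")
```

In this toy example the SNP with p = 0.012 would pass an uncorrected 0.05 cut-off but is rejected after correction, which is the kind of loss the text refers to.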

Haplotype-based models

To address some of the limitations of single-locus analysis, haplotype-based models were developed and implemented for some major crops such as wheat, rice, and soybean [ 52 , 53 ]. They are based on a random-SNP-effect mixed linear model (MRMLM) described as $Y = X\beta + Z_k\gamma_k + u + e$, where $Y$ is a vector of estimated genotypic values for all lines; $X$ is an incidence matrix for fixed effects such as population structure; $\beta$ is a vector of fixed effects; $Z_k$ is a vector of genotype indicators for the $k$-th SNP; $\gamma_k$ is the random effect of marker $k$, with $\gamma_k \sim N(0, K\sigma_k^2)$; $u$ is the vector of polygenic effects described by the kinship matrix ($K$), with $u \sim N(0, \sigma_a^2)$; and $e$ is the vector of residual errors, with $e \sim N(0, I\sigma_e^2)$. In this multivariate method, several neighboring markers in high LD are clustered into a single multi-locus haplotype, so that haplotypes rather than individual SNPs are evaluated in a multiple GLM system, and associations between the haplotypes and the traits under selection have been observed [ 48 , 52 , 54 ]. The haplotype-based model is relatively more efficient and reliable than the traditional single-locus models in GWAS, as it helps to accurately capture the allelic diversity, optimize the use of high-density marker data, enhance the power to discover epistatic interactions and reduce the multiple testing burden [ 51 , 52 ].
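The clustering of neighboring markers in high LD can be sketched as follows. This is only an illustration of the grouping idea, not the algorithm used in the cited studies: it assumes LD is measured as the squared correlation of allele counts between adjacent SNPs, and the 0.8 threshold, the function names and the data are hypothetical.

```python
def r_squared(x, y):
    """Squared Pearson correlation between allele counts at two SNPs."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a monomorphic SNP carries no LD information
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return sxy ** 2 / (sxx * syy)

def ld_blocks(genotype_matrix, threshold=0.8):
    """Group consecutive SNPs whose r^2 with the previous SNP is >= threshold."""
    if not genotype_matrix:
        return []
    blocks = [[0]]
    for j in range(1, len(genotype_matrix)):
        if r_squared(genotype_matrix[j - 1], genotype_matrix[j]) >= threshold:
            blocks[-1].append(j)   # extend the current block
        else:
            blocks.append([j])     # LD has decayed, start a new block
    return blocks

# Made-up genotypes for four SNPs (rows, in map order) in six lines (columns).
snps = [
    [0, 1, 2, 1, 0, 2],
    [0, 1, 2, 1, 0, 2],
    [2, 0, 1, 1, 2, 0],
    [2, 0, 1, 1, 2, 0],
]
blocks = ld_blocks(snps)
print(f"LD blocks (SNP indices): {blocks}")
```

Each resulting block would then be treated as one multi-locus haplotype in the association model, which reduces the number of tests compared with scanning every SNP separately.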

Multi-locus models

Multi-locus models are newly developed alternative methods in GWAS involving two-stage algorithms [ 55 , 56 , 57 ] consisting of a single locus scan of the entire genome to detect all possible associated SNPs (QTNs) and then testing all associated SNPs using a multi-locus GWAS model to detect true QTNs. These newly developed multi-locus GWAS models are ideal for testing complex quantitative traits regulated by multiple genes/loci and less influenced by population structure. Some advantages of multi-locus models over single-locus models are for example, the detection of multiple genes governing a given trait with high power and efficiency, low false-positive rate and no need of Bonferroni correction for multiple testing known to potentially exclude important loci [ 17 , 47 , 58 , 59 ]. Multi-locus models have also resulted in substantial improvements of the quality and depth of the association results in GWAS [ 17 , 42 , 53 , 57 , 60 , 61 ]. The models currently largely implemented in GWAS include a multi-locus mixed model (MLMM) [ 57 ], multi-locus random SNP-effect mixed linear model (mrMLM) [ 47 ], integrative sure independence screening expectation-maximization Bayesian least absolute shrinkage and selection operator model (ISIS EM-BLASSO) [ 50 ], fast multi-locus random-SNP-effect efficient mixed model association (FASTmrEMMA) [ 17 ], polygene-background-control-based least angle regression plus Empirical Bayes (pLARmEB) [ 62 ], Kruskal-Wallis test with empirical Bayes under polygenic background control (pKWmEB) [ 58 ] and fast multi-locus random-SNP-effect mixed linear model (FASTmrMLM) [ 59 , 63 ]. Among the numerous multi-locus models recorded to date, Segura et al. 
[ 57 ] proposed an MLMM method that has advantages over other existing multi-locus methods, including penalized logistic regression [ 64 ], stepwise regression [ 65 ] and Bayesian-inspired penalized maximum likelihood, in terms of computational efficiency, false discovery rate control and handling of population structure in GWAS. Similarly, Korte et al. [ 66 ] proposed a mixed model method referred to as the multi-trait mixed model (MTMM), which detects causal loci for multiple correlated phenotypic traits and simultaneously deals with both intra-trait and inter-trait variance components. Likewise, Klasen et al. [ 61 ] suggested the Quantitative Trait Cluster Association Test (QTCAT), which analyzes multi-locus associations without employing population correction techniques; this model showed better results in limiting the false positive/negative associations that arise from correction strategies used to mitigate confounding. Multi-Trait Analysis of GWAS (MTAG) is another specific approach, developed by Turley et al. [ 67 ] to analyze GWAS summary statistics (meta-analysis). Zhan et al. [ 68 ] proposed a further method, named the Dual Kernel Association Test (DKAT), which uses two individual kernel matrices to describe phenotype and genotype similarities. DKAT's advantages over existing methods include the ability to test the relationship between multiple traits and multiple SNPs without making parametric assumptions, control of Type I error rates, high statistical efficiency and computational scalability [ 60 , 68 ].
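
The two-stage logic shared by the multi-locus models described above (a lenient single-locus scan, followed by a joint model over the retained candidates) can be illustrated with a deliberately simplified sketch. It uses plain forward regression on centered genotypes, without the kinship matrix, random SNP effects or empirical Bayes steps of mrMLM or FASTmrEMMA, so it is not an implementation of any cited method. The r² cutoffs stand in for P-value thresholds, and all names and data are hypothetical.

```python
def center(v):
    mean = sum(v) / len(v)
    return [x - mean for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def project_out(v, basis):
    """Remove from v its component along basis."""
    bb = dot(basis, basis)
    if bb < 1e-12:
        return v[:]
    coef = dot(v, basis) / bb
    return [x - coef * y for x, y in zip(v, basis)]


def r2(g, resid):
    """Share of the residual variance explained by one (centered) SNP."""
    gg, rr = dot(g, g), dot(resid, resid)
    if gg < 1e-12 or rr < 1e-12:
        return 0.0
    return dot(g, resid) ** 2 / (gg * rr)


def two_stage_scan(genotypes, phenotype, lenient_r2=0.1, strict_r2=0.3):
    y = center(phenotype)
    snps = {name: center(g) for name, g in genotypes.items()}
    # Stage 1: single-locus scan with a lenient cutoff keeps all plausible SNPs.
    candidates = {n: g for n, g in snps.items() if r2(g, y) >= lenient_r2}
    stage1 = list(candidates)
    # Stage 2: joint model; each pick is adjusted for the SNPs already selected.
    selected, resid = [], y
    while candidates:
        best = max(candidates, key=lambda n: r2(candidates[n], resid))
        if r2(candidates[best], resid) < strict_r2:
            break
        b = candidates.pop(best)
        selected.append(best)
        resid = project_out(resid, b)
        candidates = {n: project_out(g, b) for n, g in candidates.items()}
    return stage1, selected


# Toy data (hypothetical): snp1_tag is in complete LD with snp1.
genotypes = {
    "snp1": [0, 0, 0, 0, 1, 1, 1, 1],
    "snp1_tag": [0, 0, 0, 0, 1, 1, 1, 1],
    "snp2": [0, 0, 1, 1, 0, 0, 1, 1],
    "snp_null": [0, 1, 0, 1, 0, 1, 0, 1],
}
phenotype = [2 * a + b for a, b in zip(genotypes["snp1"], genotypes["snp2"])]
stage1, selected = two_stage_scan(genotypes, phenotype)
print(stage1, selected)
```

In this toy case the first stage retains three SNPs, while the joint second stage drops the redundant tag SNP, because it explains nothing once snp1 is in the model.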

Recently, several comparative studies have assessed the capacity of these different GWAS models to detect marker-trait associations in different plant species. Overall, multi-locus models were found to be more efficient and powerful than single-locus models in detecting highly significant associations for the traits of interest (Table 1 ). However, integrating both single-locus and multi-locus models has been shown to enhance the power and validity of association analysis of complex traits in GWAS, because single-locus models can detect some loci that multi-locus models fail to identify [ 54 , 70 ].

Use of pan-genome vs single reference genome for GWAS

The common approach to studying a given population’s genetic variation relies on the interpretation of genes and variants annotated from the sequence of an existing reference genome [ 74 ]. Currently, reference genome sequences of many crops, including rice [ 75 , 76 , 77 ], sorghum [ 78 ], maize [ 79 ], Brassica rapa [ 80 ], barley [ 81 , 82 ], millet [ 83 ], potato [ 84 ], tomato [ 85 ], and sesame [ 14 ] have been reported. Following the generation of high-quality reference genome sequences, several GWAS have been carried out to discover the natural variation among diverse populations. However, the reference-genome-based GWAS approach may not capture all differences between or within populations, because certain relevant genes may be inactive in the reference genome but expressed in the studied populations [ 86 ].

Since the discovery of the pan-genome in Streptococcus agalactiae [ 87 ], different pan-genomes have been constructed by comparing multiple genomes derived from de novo sequence assemblies of various individuals of the same species, including rice [ 88 , 89 ], maize [ 90 ], soybean [ 91 ], B. napus [ 92 ], wheat [ 93 ] and, recently, sesame [ 94 ] (Table 2 ). Unlike the reference-genome-based GWAS approach, which depends on SNPs among the entire panel under investigation, the pan-genome approach is more inclusive and can detect abundant variation, including structural variation (SV), copy number variation (CNV), presence/absence variation (PAV), inversions and translocations [ 30 , 86 ]. In this regard, Song et al. [ 96 ] reported the direct detection of causal structural variation for the target traits (silique length, seed weight and flowering time) in Brassica napus using a PAV-based genome-wide association study (PAV-GWAS) and a pan-genome assembled from eight high-quality genomes. They also reported that the SNP-GWAS approach based on a single reference genome did not detect causal structural variation in the same population. Their results indicate that the pan-genome-based association study is a powerful approach that can complement the single-reference-genome approach in detecting new marker-trait associations. Likewise, the physical position of the sugarcane mosaic virus resistance gene ( ZmTrxh ) in maize was discovered using a pan-genome assembled from three different genotypes, but not with the single reference genome [ 90 ]. Other pan-genome-based GWAS have been conducted in important crops such as rice and pigeon pea [ 89 , 97 ].

Diversity and development of GWAS populations in sesame

Morphological and genetic diversity.

Sesame is a diploid species and belongs to the division Spermatophyta, subdivision Angiospermae, class Dicotyledoneae, order Tubiflorae, family Pedaliaceae, and genus Sesamum. Pedaliaceae is a small family of 16 genera and 60 species, of which 37 species belong to the genus Sesamum; Sesamum indicum L. is the most commonly cultivated species [ 10 , 39 , 98 , 99 , 100 ]. A high number of varieties and ecotypes, highly adapted to various ecological conditions around the world, have been reported. There are three cytogenetic groups in Sesamum: 2n = 26 consists of the cultivated S. indicum along with S. alatum, S. capense, S. schenckii and S. malabaricum; 2n = 32 consists of S. prostratum, S. laciniatum, S. angolense and S. angustifolium; and S. radiatum, S. occidentale and S. schinzianum belong to 2n = 64 [ 101 , 102 , 103 ]. So far, extensive morphological variation, including plant height, height to the first capsule, height to the first branch, number of branches, flowering period, flower color, number of flowers per axil, number of capsules per axil, capsule edge number, days to maturity, number of seeds per capsule, number of capsules per plant, seed coat color, seed size, seed oil content, seed yield, and branching habit, has been reported in cultivated sesame [ 11 , 14 , 104 , 105 , 106 , 107 ]. Besides the huge phenotypic variation harbored in sesame germplasm, high levels of genetic diversity have also been documented with various molecular markers in many landraces and cultivars collected from different areas around the world (Table 3 ) [ 1 , 14 , 15 , 104 , 106 , 109 , 110 , 115 , 116 , 117 , 118 , 119 , 120 , 121 , 122 , 123 , 124 , 125 , 126 , 127 , 128 , 129 , 130 , 131 , 132 , 133 , 134 ]. Recently, advances in next-generation sequencing technologies have facilitated SNP-based genetic diversity analysis in sesame. Overall, high levels of genetic diversity were reported in diverse sesame germplasm from Asia, Europe, America, and Africa (Table 4 ) [ 14 , 15 , 36 , 135 , 136 ].

Development of GWAS populations

In China, over 8,000 sesame accessions are deposited in the National Mid-term Gene Bank of China, located at the Oil Crops Research Institute of the Chinese Academy of Agricultural Sciences (OCRI-CAAS) [ 14 ]. Similarly, about 4,500 sesame accessions are conserved in the National Long-term Gene Bank in Beijing [ 107 ] (Fig. 1 ). Based on these large collections, efforts to build a sesame core collection began early in the year 2000 using morphological descriptors and, later, molecular tools [ 14 , 15 , 106 , 107 , 137 ]. Ultimately, a sesame core collection encompassing 705 diverse accessions, including 405 landraces, 95 cultivars from China, and 205 accessions from 28 other countries, was established at OCRI [ 15 ]. The entire panel was re-sequenced on the Illumina HiSeq 2000 platform ( http://www.ncgr.ac.cn/SesameHapMap ), and a total of 5,407,981 SNPs were detected in the genome, with an average of 2 SNPs per 50 bp (Fig. 2 ). This panel shows ideal characteristics for the implementation of GWAS, including high phenotypic variability, low population structure and genetic differentiation among groups, and a moderate decline in LD (~88 kb) [ 15 ]. However, most of the accessions (70.1%) included in this panel represent only one country, while the other 28 countries are represented by only 29.9% of the accessions. Furthermore, only a limited number of African sesame accessions (~3%) were included in this study, although Africa is the main source of diverse sesame landraces [ 108 ]. Therefore, to exploit the genetic bases of important agronomic traits and detect potential causative genes, this GWAS panel needs to be updated by including more materials representing diverse agro-ecological origins across the world. 
Another association-mapping panel population was developed by the sesame research group in Henan Sesame Research Center, Henan Academy of Agricultural Sciences (HSRC-HAAS) [ 122 , 136 ] consisting of 366 germplasm accessions representing about 89.9% from China and the rest 10.1% from 11 countries. This population also showed high phenotypic and genetic diversity, relatively good SNP density (1 SNP per 2.6 kb with 42,781 SNPs in total) and moderate decay in LD (~99 kb) [ 122 ]. However, this panel also has limited geographical representation. Further GWAS panel populations have been recently built from Korean core collections. However, the population size and SNP density were very low: 96 accessions and 5,962 SNPs [ 36 ]; 87 accessions and 8,883 SNPs [ 135 ]. Overall, to explore the genetic bases of economically important agronomic traits and identify possible causative genes, these developed GWAS panels need to be updated by providing more materials reflecting diverse agro-ecological backgrounds worldwide.

figure 1

Flow chart showing key steps in GWAS implementation in sesame (prepared based on works at OCRI-CAAS)

figure 2

Single-nucleotide polymorphism distributions on the 16 linkage groups (LGs) of the sesame genome assembly v1. The horizontal axis shows the LG length; the 0 ∼ 27841 legend insert shows the SNP density

Advantages and limitations for GWAS implementation in sesame

Implementing GWAS on the basis of high-quality genome sequences generally results in more accurate prediction and mining of potential causative genes. The high-resolution positioning of SNPs along entire chromosomes can unravel the genetic architecture of target traits; hence, GWAS can detect more significant associations, candidate genes, and genomic locations with high power and efficiency. Since 2014, the development of a high-quality draft genome of the sesame genotype ‘Zhongzhi13’ [ 14 ] has opened the door for genomic research in sesame. Sesame has a small diploid genome estimated at 350 Mb, of which a 274 Mb draft genome was assembled and in which 27,148 protein-coding genes were predicted. Another genome sequence, from the modern cultivar ‘Yuzhi1’, was published during the same period [ 138 ]. Progress in genome sequencing technologies, together with the reduction of sequencing costs, has created opportunities for additional genome sequencing projects in sesame. The reference genome was updated to a higher resolution [ 39 ], and the genome sequences of the sesame landraces ‘Baizhima’ and ‘Mishuozhima’ [ 15 ] and of the modern cultivar ‘Swetha’ [ 139 ] were also published. Furthermore, the assembly of a sesame pan-genome from five different genomes identified 15,890 dispensable genes, providing a rich resource for comprehensive gene discovery and superior allele mining through GWAS [ 94 ]. Similarly, the availability of extensive transcriptome data from diverse sesame tissues, from various growth conditions and from wild Sesamum species such as S. radiatum and S. mulayanum (Table 5 ) ( https://www.ncbi.nlm.nih.gov/bioproject/?term=((sesame)%20AND%20%22Sesamum%20indicum%22[orgn:__txid4182])%20AND%20bioproject_sra[filter]%20NOT%20bioproject_gap[filter] ) facilitates post-GWAS work, particularly pinpointing candidate genes and analyzing their function. 
The availability of several mapping populations [ 11 ] is also very useful for validating or polishing GWAS findings. Besides, the availability of functional genomic databases such as Sinbase ( http://ocri-genomics.org/Sinbase/index.html ), SesameFG ( http://sesame-bioinfo.org/SesameFG/ ) and Sesame HapMap that have been deployed to facilitate genome excavation, comparative genomics, gene expression analysis, are highly useful for post-GWAS investigations [ 15 , 105 , 140 ].

To further facilitate the exploitation of GWAS results as well as all genetic discoveries available in sesame, we have developed a novel database named Sesamum indicum Genetic Discovery Database (SiGeDiD) ( http://sigedid.ucad.sn/ ). SiGeDiD is a flexible online catalog of all genetic and genomic discoveries including, candidate genes, QTLs and functional molecular markers in sesame (Fig. 3 ). It is an essential platform for comparative analysis of GWAS projects in sesame and facilitates gene discovery, particularly the identification of pleiotropic genomic regions/genes that have been identified from different GWAS and other genetic/genomic studies. The website is user-friendly and we integrated a module allowing researchers to upload directly their findings in SiGeDiD. Currently, the BLAST functionality is unavailable but SiGeDiD will be updated to make it more interactive and fully functional.

figure 3

SiGeDiD: an online catalogue of functional genomic discoveries in sesame ( http://sigedid.ucad.sn/ )

Collectively, the availability of enormous genomic resources, the small genome size of sesame, comprehensive GWAS panels, diverse mapping populations, high genetic diversity, low population structure, and relatively low LD are advantageous for GWAS implementation in sesame.

Limitations

While GWAS provides an opportunity to investigate a range of novel genes associated with important agronomic traits, the method does not necessarily identify causal variants and genes [ 141 ]. Once a GWAS is completed, additional steps are often necessary to investigate the functional and causal variants and their target genes, for which transgenic experiments may ultimately be required. Sesame, however, is recalcitrant to genetic transformation, so few GWAS-identified SNPs have been validated using a transgenic approach. Besides, although LD in sesame decays over a shorter distance than in other self-pollinating crops, including rice (~100-350 kb) [ 142 , 143 ], soybean (~574 kb) [ 144 , 145 ] and brassica (~405 kb) [ 146 ], it decays over a longer distance than in cross-pollinating species such as maize (~5.39-15.53 kb) [ 147 ]. Consequently, the moderate LD decay distance (88 kb) reported in sesame suggests that GWAS may not easily resolve to the causative gene unless a high marker density is used. GWAS could therefore have limited efficiency in detecting trait-associated QTL regions or causative genes in the absence of high marker density. Another limitation of GWAS in sesame is that many cultivars are highly photosensitive, which makes field phenotyping and the collection of reliable data in various regions of the world difficult.
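
As a rough illustration of why an 88 kb LD decay distance limits mapping resolution, the figures quoted in this review (a 274 Mb assembled genome with 27,148 predicted genes) can be combined into a back-of-envelope estimate. The sketch assumes that genes are evenly spaced along the genome, which is a simplification introduced here and not a claim from the cited studies.

```python
assembled_bp = 274_000_000  # assembled draft genome size quoted in this review
predicted_genes = 27_148    # predicted protein-coding genes quoted in this review
ld_decay_bp = 88_000        # LD decay distance reported for the OCRI-CAAS panel

# Assumption: genes are uniformly distributed along the assembly.
mean_gene_spacing_bp = assembled_bp / predicted_genes
genes_per_ld_window = ld_decay_bp / mean_gene_spacing_bp

print(f"mean gene spacing: about {mean_gene_spacing_bp:,.0f} bp")
print(f"genes per {ld_decay_bp // 1000} kb window: about {genes_per_ld_window:.1f}")
```

Under this assumption, a single association peak would on average span several genes, which is consistent with the need for follow-up work to pinpoint the causative gene.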

GWAS applications in sesame

Since 2015, several GWAS projects have been successfully implemented in sesame to uncover the genetic bases of key agronomic traits such as oil content, oil nutrient composition, seed yield and yield-related components, seed coat color, morphological characteristics, disease resistance, salt tolerance, waterlogging resistance, drought tolerance, root traits and nutritional values [ 15 , 33 , 34 , 35 , 36 , 135 , 136 , 148 ]. To our knowledge, all GWAS projects conducted so far in sesame were based on a single-locus method (EMMA), and the majority were implemented on the GWAS panel developed at OCRI-CAAS. In this work, we summarize all of the GWAS results reported by different groups of sesame researchers (Table 6 and Fig. 4 ). A large-scale GWAS was conducted by investigating the natural variation of 705 sesame accessions based on 169 sets of phenotypic data, including oil content, nutrient composition, yield components, morphological characteristics, growth cycle, coloration and disease resistance. In total, 1,805,413 SNPs were used. This led to the identification of 446 SNPs significantly associated with the phenotypic variation. Following in-depth analyses of the major loci, a total of 46 causative genes, including genes related to flower lip color ( SiGL3 ), petiole color ( SiMYB113 and SiMYB23 ), oil content ( SiPPO ), fatty acid biosynthesis ( CXE17 and GDSL -like lipase) and yield ( SiACS ), were identified [ 15 ]. Similarly, a GWAS of 39 yield-related traits was conducted [ 34 ] using the same population as the previous study [ 15 ]. In total, 646 loci associated with the traits of interest and 48 potential genes significantly associated with the functional loci were identified. The authors reported several candidate homologous genes involved in seed formation and some novel candidate genes ( SiLPT3 and SiACS8 ) that may control capsule length and capsule number [ 34 ]. 
Likewise, variations in PEG-induced drought stress and salt stress tolerance were investigated by GWAS in 490 diverse sesame accessions (representing 33 countries in Asia, Africa, America and Europe) [ 33 ]. For drought stress, a total of 132 significant SNPs resolved to nine QTLs and 151 genes, of which SiEMF1 , SiGRV2 , SiCYP76C7 , SiGRF5 , SiCCD8 , SiGPAT3 , SiGDH2 and SiRABA1D were detected as potential regulatory genes; for salt tolerance, a total of 120 significant SNPs resolved to 15 QTLs and 241 genes, of which SiLHCB6 , SiMLP31 , SiPOD , SiHSFA1 , SiDUF538 , SiCC-NBS-LRR , SiUDG , SiGPAT3 , SiNAC43 , SiGDH2 , SiCP24 , SiWRKY14 , SiXXT5 , SiXTH15 and SiG6PD1 were detected as potential genes [ 33 ]. Later on, a GWAS was conducted to investigate genetic variants governing drought tolerance in 400 sesame accessions [ 35 ]. A total of 140 reliable and stable QTLs were identified and resolved to 10 QTLs. In addition, 120 genes were identified, of which SiABI4 , SiTTM3 , SiGOLS1 , SiNIMIN1 and SiSAM have high potential to modulate drought tolerance in sesame [ 35 ]. This study was the first to validate the function of a candidate gene from GWAS using a transgenic approach. The authors demonstrated that sesame accessions originating from drought-prone agro-ecological regions have fixed several drought-tolerance alleles, though alleles contributing to high yield under drought conditions are far from being fixed. Hence, sesame is mostly considered a resilient crop because of its long-term adaptation to drought-prone agro-ecological regions. Additional GWAS results were also reported recently [ 36 , 135 , 136 ] (Table 6 ). Using the genotyping-by-sequencing (GBS) method, the authors of [ 36 ] conducted a GWAS on vitamin E and identified eight strongly linked SNPs and 12 genes with various regulatory functions, including a transcription regulator HTH, a zinc ion binding protein, glycosylphosphatidylinositol (GPI)-anchor biosynthesis and a ribosome protein. 
They also identified two loci, LG_03_13104062 containing seven genes ( SIN_1022039 – SIN_1022045 ) and LG_08_6621957 containing five genes ( SIN_1001936 – SIN_1001940 ), on LGs 3 and 8, respectively, which were detected simultaneously by two different models (GLM and MLM). Hence, the authors suggested that these two co-detected loci have high potential to control vitamin E content in sesame. However, because of the limited number of SNPs (5,962) and the small panel size used in this GWAS, potential loci for this important trait may have been missed. Mei et al. [ 136 ] used genotype data from 42,781 SNPs and seed coat color data from an association-mapping panel of 366 sesame germplasm accessions to identify 224 significantly associated SNPs. Based on the four most stable peaks/SNPs significantly associated with sesame seed coat color, they retained 92 candidate genes. Of these genes, SIN_1016759 (encoding a predicted PPO) was also reported in a previous GWAS [ 15 ] and in a QTL mapping study [ 39 ]. Using an association-mapping panel of 87 sesame accessions and 8,883 SNPs, a GWAS on phytophthora blight resistance was conducted [ 135 ]. The results of this study suggested that SIN_1019016 is one of the candidate genes closely associated with phytophthora blight resistance in sesame. The limited number of SNPs called with the GBS approach and the relatively small number of sesame accessions used in this study could have affected the GWAS output for the trait under investigation. More recently, a comprehensive GWAS conducted by Dossa et al. [ 148 ] unraveled the genetic basis of seven root-related traits. They reported 409 significant signals and 19 QTLs containing 32 candidate genes associated with sesame root traits. More importantly, they discovered an orphan gene named ‘ Big Root Biomass ’ ( SIN_1025576 ), which modulates sesame root biomass through the auxin pathway [ 148 ]. 
In addition to the published GWAS findings, the OCRI-CAAS sesame research group also has several unpublished GWAS outputs on various agronomic traits, including waterlogging, chlorophyll and salt stress at the seedling stage; interestingly, a metabolite-based GWAS has also been completed. These results will illuminate the genetic basis of variation in important metabolites such as sesamin and sesamolin in sesame. All candidate genes, QTLs and SNPs will be regularly loaded into SiGeDiD ( http://sigedid.ucad.sn/ ) for further use in sesame breeding projects.

figure 4

GWAS applications in sesame. a Circos plot summarizing genetic findings of important agronomic traits in sesame. (A) Pseudomolecules (LG), (B) gene density, (C) QTL position, (D) -log(p) of the peak SNPs, (E) pleiotropic QTLs; b Schematic diagram showing potential candidate genes discovered so far related to important agronomic traits in sesame. The image of the sesame plant has been specifically designed in this study

Potential of new statistical models to improve the accuracy and power of GWAS in sesame

To our knowledge, multi-locus models have not yet been employed in sesame GWAS research, and no previous study has compared different GWAS models (single-locus and multi-locus models) in sesame. Herein, we tested the application of new GWAS models in sesame using a quantitative trait (root length) and a qualitative trait (seed coat color). Natural variation in root length was recorded for 350 sesame accessions in a field experiment following the methodology developed by Su et al. [ 149 ], and the genotypic data consisted of 1,000,000 common SNPs. For the seed coat color GWAS, 600 sesame accessions and 1,000,000 common SNPs were used [ 15 ]. To investigate natural phenotypic variation in seed coat color, mature seeds from five capsules per genotype were collected and photographed with a high-resolution digital camera, and the seed coat color data, based on red, green, and blue (RGB) values, were recorded following the approach adopted by Zhang et al. [ 150 ]. Subsequently, three GWAS models, comprising two multi-locus models (FASTmrEMMA and mrMLM) and one single-locus model (EMMAX), were selected (mainly because they do not require extensive formatting of phenotypic and genotypic data) and applied to the phenotypic and genotypic data. We then compared the results of these three models to evaluate their potential to reveal a higher number of marker-trait associations and to discover more candidate genes.

Our GWAS results for the two traits showed that a total of 190, 181 and 162 significant SNPs (-log10(p) > 6) associated with root length were detected by FASTmrEMMA, mrMLM and EMMAX, respectively. Similarly, 67, 492 and 143 significant SNPs associated with seed coat color were detected by FASTmrEMMA, mrMLM and EMMAX, respectively (Fig. 5 a-f; Table 7 ; Table S1 ). Of the significant SNPs associated with root length, 163 SNPs were identified simultaneously by all three models; all the SNPs identified by EMMAX were also identified simultaneously by both multi-locus models, while 18 SNPs were simultaneously and only detected by FASTmrEMMA and mrMLM (Fig. 5 g). For the seed coat color associated SNPs, 67 and 27 SNPs were detected by all the three models and by two models (mrMLM and EMMAX), respectively (Fig. 5 h). By considering all SNPs co-clustered with peak SNPs within a window of 200 kb as QTLs [ 35 ], a total of 19 and 34 QTLs were detected for root length and seed coat color, respectively, by all the three models (Table S1 ). Within these QTLs, we retrieved 26 and 47 genes for root length and seed coat color, respectively. Based on the robust QTLs co-detected by different models identified for root length, nine potential candidate genes, including SIN_1017810 , SIN_101781, SIN_1017812 , SIN_1017815 , SIN_1017843 , SIN_1007064 , SIN_1007065, SIN_1020072 and SIN_1017818 are proposed for further functional studies to pinpoint the causative gene (s). Regarding the seed coat color, the potential candidate genes identified in our study include SIN_1007188 , SIN_1007221 , SIN_1023226, SIN_1023227 and SIN_1023228 . Interestingly, three genes detected in this study were previously reported by Mei et al. [ 136 ].
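
The two post-processing steps described above, grouping significant SNPs around peak SNPs into QTLs and intersecting the SNP sets detected by the different models, can be sketched as follows. This is an illustrative sketch under stated assumptions: the 200 kb window is interpreted here as the maximum distance from the peak SNP on the same linkage group, and all SNP names, positions and scores are hypothetical.

```python
def group_into_qtls(hits, window=200_000):
    """Cluster significant SNPs into QTLs around peak SNPs.

    hits: list of (linkage_group, position_bp, neg_log10_p) tuples.
    The most significant unassigned SNP becomes a peak; every unassigned SNP
    on the same linkage group within `window` bp of that peak joins its QTL.
    """
    remaining = sorted(hits, key=lambda h: h[2], reverse=True)
    qtls = []
    while remaining:
        peak = remaining[0]
        members = [h for h in remaining
                   if h[0] == peak[0] and abs(h[1] - peak[1]) <= window]
        qtls.append({"peak": peak, "members": sorted(members, key=lambda h: h[1])})
        remaining = [h for h in remaining if h not in members]
    return qtls


# Hypothetical significant SNPs: (linkage group, position, -log10(p)).
hits = [
    ("LG1", 1_000_000, 9.2),
    ("LG1", 1_150_000, 7.1),
    ("LG1", 2_000_000, 6.5),
    ("LG2", 1_050_000, 8.0),
]
qtls = group_into_qtls(hits)
for qtl in qtls:
    print(qtl["peak"], len(qtl["members"]))

# Hypothetical sets of significant SNPs per model, compared as in a Venn diagram.
detected = {
    "FASTmrEMMA": {"snpA", "snpB", "snpC"},
    "mrMLM": {"snpA", "snpB", "snpD"},
    "EMMAX": {"snpA", "snpD"},
}
shared_by_all = set.intersection(*detected.values())
multi_locus_only = (detected["FASTmrEMMA"] & detected["mrMLM"]) - detected["EMMAX"]
print(shared_by_all, multi_locus_only)
```

Regions supported by the intersection of all models are natural first candidates for follow-up, while model-specific hits can be kept as a secondary list.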

figure 5

Application of new statistical multi-locus models in sesame. a and b Negative log10 P-values for association with root length (Y-axis) plotted against SNP positions (X-axis) using the multi-locus models mrMLM and FASTmrEMMA, respectively; c negative log10 P-values for association with root length (Y-axis) plotted against SNP positions (X-axis) using the single-locus model EMMAX; d and e negative log10 P-values for association with seed coat color (Y-axis) plotted against SNP positions (X-axis) using the multi-locus models mrMLM and FASTmrEMMA, respectively; f negative log10 P-values for association with seed coat color (Y-axis) plotted against SNP positions (X-axis) using the single-locus model EMMAX. For both traits, the horizontal dash-dot line indicates the significance threshold (P = 10⁻⁶), significant SNPs are highlighted in red, and vertical lines indicate the most significant peaks that overlap in at least two models; g Venn diagram showing the shared and uniquely detected significant SNPs of each model for the root length GWAS; h Venn diagram showing the shared and uniquely detected significant SNPs of each model for the seed coat color GWAS. The phenotypic and genotypic data for this analysis were obtained from 350 sesame accessions and 1,000,000 common SNPs for root length, and from 705 sesame accessions and 1,805,413 common SNPs for the seed coat color GWAS

Collectively, the analysis of different GWAS models indicates the potential of an integrated approach (single-locus and multi-locus models) to improve the capacity and power of GWAS in sesame. This will help to detect more, and novel, marker-trait associations and candidate genes, particularly when investigating quantitative traits. It is also important to note that regions detected simultaneously by several GWAS models are more likely to be truly associated with the traits under investigation than regions detected by a single model only. Hence, developing diagnostic markers for the co-detected regions could speed up sesame molecular breeding programmes.

Over the last five years, GWAS has been successfully implemented in sesame and is illuminating the genetic basis of many important agronomic traits. Even though a list of QTLs (~300) and candidate genes (~250) has been identified for qualitative and quantitative traits, more traits, including chlorophyll-yield, metabolite-GWAS, waterlogging and heat tolerance, are under investigation. We envision that all these results will lead to the development of allele-specific diagnostic markers to be used as routine molecular tools in sesame breeding programmes. Although a high-quality sesame reference genome sequence has been developed, it is often difficult to find any candidate gene around the peak SNPs from GWAS. To overcome this limitation, the recently developed sesame pan-genome [ 94 ] should be used in future GWAS implementations. The diversity of the currently available sesame GWAS panels should be improved by integrating more accessions and wild species from different agro-ecological origins, mainly from Africa. This requires international collaboration between sesame researchers. Furthermore, collaboration between researchers to generate comprehensive germplasm characterization data using precise phenotyping platforms and contrasting environments will permit more accurate dissection of the genetic architecture of complex traits in sesame. Efforts towards sharing genetic materials between research institutes are crucial for accelerating gene discovery. For example, the re-sequencing data of the fully sequenced 705-accession GWAS panel generated by OCRI are publicly available; if the germplasm could be shared with partners, at least in part, more GWAS projects could be implemented in sesame, particularly on traits highly affected by the environment. Similarly, developing an SNP chip could be an alternative for quick, low-cost, and easy genotyping of novel sesame collections for future GWAS projects.

The application of new multi-locus GWAS models and the integration of single- and multi-locus models will provide more efficiency and power in future GWAS implementations in sesame. To date, very few studies have validated the numerous GWAS findings in sesame. Therefore, follow-up studies are needed to validate the favorable alleles identified from GWAS in independent populations and with other approaches (classical bi-parental QTL mapping, QTLseq, etc.). Validation of GWAS findings using a transgenic approach has also been instrumental in several plant species. In sesame, genetic transformation protocols using tissue culture techniques have been reported [ 151 ]. More studies on this topic are needed in order to develop a more effective genetic transformation protocol in sesame, for example using the flower dip technique [ 152 ]. Hairy root genetic transformation is also a flexible and rapid technique widely adopted in several recalcitrant plants to study gene functions [ 153 ]. We propose to develop a hairy root genetic transformation protocol in sesame, combined with new genome editing technologies, to confirm some important GWAS findings. Finally, projects aiming to develop diagnostic molecular markers based on GWAS peak SNPs and their favorable alleles should be initiated. This will considerably accelerate sesame molecular breeding.

Availability of data and materials

The data used in this review article are available in the supplementary files and within the manuscript.

Abbreviations

GWAS: Genome-wide association study

LD: Linkage disequilibrium

QTL: Quantitative trait locus

QTN: Quantitative trait nucleotide

SiGeDiD: Sesamum indicum genetic discovery database

SNP: Single nucleotide polymorphism

Bedigian D. History and lore of sesame in Southwest Asia. Econ Bot. 2004;58(3):329–53.

Bedigian D. Systematics and evolution in Sesamum L. (Pedaliaceae), part 1: evidence regarding the origin of sesame and its closest relatives. Webbia. 2015;70(1):1–42.

Ashri A. Sesame breeding. Plant Breed Rev. 1998;16:179–228.

Bedigian D. Sesame: the genus Sesamum. Boca Raton: CRC Press; 2010.

Lee J, Lee Y, Choe E. Effects of sesamol, sesamin, and sesamolin extracted from roasted sesame oil on the thermal oxidation of methyl linoleate. LWT-Food Sci Technol. 2008;41(10):1871–5.

Ashakumary L, Rouyer I, Takahashi Y, Ide T, Fukuda N, Aoyama T, et al. Sesamin, a sesame lignan, is a potent inducer of hepatic fatty acid oxidation in the rat. Metabolism. 1999;48(10):1303–13.

Balasubramaniyan P, Palaniappan S. Field crops: an overview. Principles and practices of agronomy. Agrobios, India, 47; 2001.

Alegbejo M, Iwo G, Abo M, Idowu A. Sesame: a potential industrial and export oilseed crop in Nigeria. J Sustain Agric. 2003;23(1):59–76.

FAOSTAT, F. Statistical databases, fisheries data (2001). Rome: Food and Agriculture Organization of the United Nations; 2018. Available from: http://www.fao.org

Ashri A. Sesame breeding. Plant Breed Rev. 1998;16:179–228.

Dossa K, Diouf D, Wang L, Wei X, Zhang Y, Niang M, et al. The emerging oilseed crop Sesamum indicum enters the “Omics” era. Front Plant Sci. 2017;8:1154.

Weiss E. Castor, sesame and safflower; 1971.

Varshney RK, Ribaut J-M, Buckler ES, Tuberosa R, Rafalski JA, Langridge P. Can genomics boost productivity of orphan crops? Nat Biotechnol. 2012;30(12):1172–6.

Wang L, Yu S, Tong C, Zhao Y, Liu Y, Song C, et al. Genome sequencing of the high oil crop sesame provides insight into oil biosynthesis. Genome Biol. 2014;15(2):1–13.

Wei X, Liu K, Zhang Y, Feng Q, Wang L, Zhao Y, et al. Genetic discovery for oil production and quality in sesame. Nat Commun. 2015;6:8609.

Huang X, Han B. Natural variations and genome-wide association studies in crop plants. Annu Rev Plant Biol. 2014;65:531–51.

Wen Y-J, Zhang H, Ni Y-L, Huang B, Zhang J, Feng J-Y, et al. Methodological implementation of mixed linear models in multi-locus genome-wide association studies. Brief Bioinform. 2018;19(4):700–12.

Cubry P, Pidon H, Ta KN, Tranchant-Dubreuil C, Thuillet A-C, Holzinger M, et al. Genome wide association study pinpoints key agronomic QTLs in African rice Oryza glaberrima. bioRxiv. 2020.

Huang X, Sang T, Zhao Q, Feng Q, Zhao Y, Li C, et al. Genome-wide association studies of 14 agronomic traits in rice landraces. Nat Genet. 2010;42(11):961.

Li P, Zhou H, Yang H, Xia D, Liu R, Sun P, et al. Genome-wide association studies reveal the genetic basis of fertility restoration of CMS-WA and CMS-HL in xian/indica and aus accessions of rice (Oryza sativa L.). Rice. 2020;13(1):11.

Yano K, Yamamoto E, Aya K, Takeuchi H, Lo P-c, Hu L, et al. Genome-wide association study using whole-genome sequencing rapidly identifies new genes influencing agronomic traits in rice. Nat Genet. 2016;48(8):927.

Hufford MB, Xu X, Van Heerwaarden J, Pyhäjärvi T, Chia J-M, Cartwright RA, et al. Comparative population genomics of maize domestication and improvement. Nat Genet. 2012;44(7):808–11.

Jiao Y, Zhao H, Ren L, Song W, Zeng B, Guo J, et al. Genome-wide genetic changes during modern breeding of maize. Nat Genet. 2012;44(7):812–5.

Alves ML, Carbas B, Gaspar D, Paulo M, Brites C, Mendes-Moreira P, et al. Genome-wide association study for kernel composition and flour pasting behavior in wholemeal maize flour. BMC Plant Biol. 2019;19(1):123.

Mazaheri M, Heckwolf M, Vaillancourt B, Gage JL, Burdo B, Heckwolf S, et al. Genome-wide association analysis of stalk biomass and anatomical traits in maize. BMC Plant Biol. 2019;19(1):1–17.

Lin M, Matschi S, Vasquez M, Chamness J, Kaczmar N, Baseggio M, et al. Genome-wide association study for maize leaf cuticular conductance identifies candidate genes involved in the regulation of cuticle development. G3 Genes Genomes Genetics. 2020;10(5):1671–83.

Morris GP, Ramu P, Deshpande SP, Hash CT, Shah T, Upadhyaya HD, et al. Population genomic and genome-wide association studies of agroclimatic traits in sorghum. Proc Natl Acad Sci. 2013;110(2):453–8.

Tao Y, Zhao X, Wang X, Hathorn A, Hunt C, Cruickshank AW, et al. Large-scale GWAS in sorghum reveals common genetic control of grain size among cereals. Plant Biotechnol J. 2020;18(4):1093–105.

Kimani W, Zhang L-M, Wu X-Y, Hao H-Q, Jing H-C. Genome-wide association study reveals that different pathways contribute to grain quality variation in sorghum (Sorghum bicolor). BMC Genomics. 2020;21(1):112.

Lu F, Romay MC, Glaubitz JC, Bradbury PJ, Elshire RJ, Wang T, et al. High-resolution genetic mapping of maize pan-genome sequence anchors. Nat Commun. 2015;6(1):1–8.

Raman H, Raman R, Qiu Y, Yadav AS, Sureshkumar S, Borg L, et al. GWAS hints at pleiotropic roles for FLOWERING LOCUS T in flowering time and yield-related traits in canola. BMC Genomics. 2019;20(1):636.

Lu K, Wei L, Li X, Wang Y, Wu J, Liu M, et al. Whole-genome resequencing reveals Brassica napus origin and genetic loci involved in its improvement. Nat Commun. 2019;10(1):1–12.

Li D, Dossa K, Zhang Y, Wei X, Wang L, Zhang Y, et al. GWAS uncovers differential genetic bases for drought and salt tolerances in sesame at the germination stage. Genes. 2018;9(2):87.

Zhou R, Dossa K, Li D, Yu J, You J, Wei X, et al. Genome-wide association studies of 39 seed yield-related traits in sesame (Sesamum indicum L.). Int J Mol Sci. 2018;19(9):2794.

Dossa K, Li D, Zhou R, Yu J, Wang L, Zhang Y, et al. The genetic basis of drought tolerance in the high oil crop Sesamum indicum. Plant Biotechnol J. 2019;17(9):1788–803.

He Q, Xu F, Min M-H, Chu S-H, Kim K-W, Park Y-J. Genome-wide association study of vitamin E using genotyping by sequencing in sesame (Sesamum indicum). Genes Genomics. 2019;41(9):1085–93.

Challa S, Neelapu NR. Genome-wide association studies (GWAS) for abiotic stress tolerance in plants. In: Biochemical, physiological and molecular avenues for combating abiotic stress tolerance in plants. Amsterdam: Elsevier; 2018. p. 135–50.

Rahaman M, Mamidi S, Rahman M. Genome-wide association study of heat stress-tolerance traits in spring-type Brassica napus L. under controlled conditions. Crop J. 2018;6(2):115–25.

Wang L, Xia Q, Zhang Y, Zhu X, Zhu X, Li D, et al. Updated sesame genome assembly and fine mapping of plant height and seed coat color QTLs using a new high-density genetic map. BMC Genomics. 2016;17(1):31.

Pritchard JK, Stephens M, Donnelly P. Inference of population structure using multilocus genotype data. Genetics. 2000;155(2):945–59.

Yu J, Pressoir G, Briggs WH, Bi IV, Yamasaki M, Doebley JF, et al. A unified mixed-model method for association mapping that accounts for multiple levels of relatedness. Nat Genet. 2006;38(2):203–8.

Gupta PK, Kulwal PL, Jaiswal V. Association mapping in crop plants: opportunities and challenges. In: Advances in genetics. Amsterdam: Elsevier; 2014. p. 109–47.

Widmer C, Lippert C, Weissbrod O, Fusi N, Kadie C, Davidson R, et al. Further improvements to linear mixed models for genome-wide association studies. Sci Rep. 2014;4(1):1–13.

Lipka AE, Tian F, Wang Q, Peiffer J, Li M, Bradbury PJ, et al. GAPIT: genome association and prediction integrated tool. Bioinformatics. 2012;28(18):2397–9.

Kang HM, Zaitlen NA, Wade CM, Kirby A, Heckerman D, Daly MJ, et al. Efficient control of population structure in model organism association mapping. Genetics. 2008;178(3):1709–23.

Zhang Z, Ersoz E, Lai C-Q, Todhunter RJ, Tiwari HK, Gore MA, et al. Mixed linear model approach adapted for genome-wide association studies. Nat Genet. 2010;42(4):355–60.

Wang S-B, Feng J-Y, Ren W-L, Huang B, Zhou L, Wen Y-J, et al. Improving power and accuracy of genome-wide association studies via a multi-locus mixed linear model methodology. Sci Rep. 2016;6:19444.

Buzdugan L, Kalisch M, Navarro A, Schunk D, Fehr E, Bühlmann P. Assessing statistical significance in multivariable genome wide association analysis. Bioinformatics. 2016;32(13):1990–2000.

Bush WS, Moore JH. Genome-wide association studies. PLoS Comput Biol. 2012;8(12):e1002822.

Tamba CL, Ni Y-L, Zhang Y-M. Iterative sure independence screening EM-Bayesian LASSO algorithm for multi-locus genome-wide association studies. PLoS Comput Biol. 2017;13(1):e1005357.

Gawenda I, Thorwarth P, Günther T, Ordon F, Schmid KJ. Genome-wide association studies in elite varieties of German winter barley using single-marker and haplotype-based methods. Plant Breed. 2015;134(1):28–39.

Abed A, Belzile F. Comparing single-SNP, multi-SNP, and haplotype-based approaches in association studies for major traits in Barley. Plant Genome. 2019;12(3):1–14.

Bansal V, Libiger O, Torkamani A, Schork NJ. Statistical analysis strategies for association studies involving rare variants. Nat Rev Genet. 2010;11(11):773–85.

Li C, Fu Y, Sun R, Wang Y, Wang Q. Single-locus and multi-locus genome-wide association studies in the genetic dissection of fiber quality traits in upland cotton (Gossypium hirsutum L.). Front Plant Sci. 2018;9:1083.

Cui Y, Zhang F, Zhou Y. The application of multi-locus GWAS for the detection of salt-tolerance loci in rice. Front Plant Sci. 2018;9:1464.

Li J, Tang W, Zhang Y-W, Chen K-N, Wang C, Liu Y, et al. Genome-wide association studies for five forage quality-related traits in Sorghum (Sorghum bicolor L.). Front Plant Sci. 2018;9:1146.

Segura V, Vilhjálmsson BJ, Platt A, Korte A, Seren Ü, Long Q, et al. An efficient multi-locus mixed-model approach for genome-wide association studies in structured populations. Nat Genet. 2012;44(7):825.

Ren W-L, Wen Y-J, Dunwell JM, Zhang Y-M. pKWmEB: integration of Kruskal–Wallis test with empirical Bayes under polygenic background control for multi-locus genome-wide association study. Heredity. 2018;120(3):208–18.

Zhang Y, Liu P, Zhang X, Zheng Q, Chen M, Ge F, et al. Multi-locus genome-wide association study reveals the genetic architecture of stalk lodging resistance-related traits in maize. Front Plant Sci. 2018;9:611.

Gupta PK, Kulwal PL, Jaiswal V. Association mapping in plants in the post-GWAS genomics era. In: Advances in genetics. Amsterdam: Elsevier; 2019. p. 75–154.

Klasen JR, Barbez E, Meier L, Meinshausen N, Bühlmann P, Koornneef M, et al. A multi-marker association method for genome-wide association studies without the need for population structure correction. Nat Commun. 2016;7(1):1–8.

Zhang J, Feng J, Ni Y, Wen Y, Niu Y, Tamba C, et al. pLARmEB: integration of least angle regression with empirical Bayes for multilocus genome-wide association studies. Heredity. 2017;118(6):517–24.

Tamba CL, Zhang Y-M. A fast mrMLM algorithm for multi-locus genome-wide association studies. bioRxiv. 2018:341784.

Ayers KL, Cordell HJ. SNP selection in genome-wide and candidate gene studies via penalized logistic regression. Genet Epidemiol. 2010;34(8):879–91.

Cordell HJ, Clayton DG. A unified stepwise regression procedure for evaluating the relative effects of polymorphisms within a gene using case/control or family data: application to HLA in type 1 diabetes. Am J Hum Genet. 2002;70(1):124–41.

Korte A, Vilhjálmsson BJ, Segura V, Platt A, Long Q, Nordborg M. A mixed-model approach for genome-wide association studies of correlated traits in structured populations. Nat Genet. 2012;44(9):1066–71.

Turley P, Walters RK, Maghzian O, Okbay A, Lee JJ, Fontana MA, et al. Multi-trait analysis of genome-wide association summary statistics using MTAG. Nat Genet. 2018;50(2):229–37.

Zhan X, Zhao N, Plantinga A, Thornton TA, Conneely KN, Epstein MP, et al. Powerful genetic association analysis for common or rare variants with high-dimensional structured traits. Genetics. 2017;206(4):1779–90.

Ma L, Liu M, Yan Y, Qing C, Zhang X, Zhang Y, et al. Genetic dissection of maize embryonic callus regenerative capacity using multi-locus genome-wide association studies. Front Plant Sci. 2018;9:561.

Xu Y, Yang T, Zhou Y, Yin S, Li P, Liu J, et al. Genome-wide association mapping of starch pasting properties in maize using single-locus and multi-locus models. Front Plant Sci. 2018;9:1311.

Su J, Ma Q, Li M, Hao F, Wang C. Multi-locus genome-wide association studies of fiber-quality related traits in Chinese early-maturity upland cotton. Front Plant Sci. 2018;9:1169.

Chang F, Guo C, Sun F, Zhang J, Wang Z, Kong J, et al. Genome-wide association studies for dynamic plant height and number of nodes on the main stem in summer sowing soybeans. Front Plant Sci. 2018;9:1184.

Peng Y, Liu H, Chen J, Shi T, Zhang C, Sun D, et al. Genome-wide association studies of free amino acid levels by six multi-locus models in bread wheat. Front Plant Sci. 2018;9:1196.

Gan X, Stegle O, Behr J, Steffen JG, Drewe P, Hildebrand KL, et al. Multiple reference genomes and transcriptomes for Arabidopsis thaliana. Nature. 2011;477(7365):419–23.

Goff SA, Ricke D, Lan T-H, Presting G, Wang R, Dunn M, et al. A draft sequence of the rice genome (Oryza sativa L. ssp. japonica). Science. 2002;296(5565):92–100.

International, R.G.S.P. The map-based sequence of the rice genome. Nature. 2005;436(7052):793.

Yu J, Hu S, Wang J, Wong GK-S, Li S, Liu B, et al. A draft sequence of the rice genome (Oryza sativa L. ssp. indica). Science. 2002;296(5565):79–92.

Paterson AH, Bowers JE, Bruggmann R, Dubchak I, Grimwood J, Gundlach H, et al. The Sorghum bicolor genome and the diversification of grasses. Nature. 2009;457(7229):551–6.

Schnable PS, Ware D, Fulton RS, Stein JC, Wei F, Pasternak S, et al. The B73 maize genome: complexity, diversity, and dynamics. Science. 2009;326(5956):1112–5.

Wang X, Wang H, Wang J, Sun R, Wu J, Liu S, et al. The genome of the mesopolyploid crop species Brassica rapa. Nat Genet. 2011;43(10):1035–9.

Consortium, I.B.G.S. A physical, genetic and functional sequence assembly of the barley genome. Nature. 2012;491(7426):711–6.

Mayer KF, Martis M, Hedley PE, Šimková H, Liu H, Morris JA, et al. Unlocking the barley genome by chromosomal and comparative genomics. Plant Cell. 2011;23(4):1249–63.

Zhang G, Liu X, Quan Z, Cheng S, Xu X, Pan S, et al. Genome sequence of foxtail millet (Setaria italica) provides insights into grass evolution and biofuel potential. Nat Biotechnol. 2012;30(6):549–54.

Consortium PGS. Genome sequence and analysis of the tuber crop potato. Nature. 2011;475(7355):189.

Consortium TG. The tomato genome sequence provides insights into fleshy fruit evolution. Nature. 2012;485(7400):635.

Tao Y, Zhao X, Mace E, Henry R, Jordan D. Exploring and exploiting pan-genomics for crop improvement. Mol Plant. 2019;12(2):156–69.

Tettelin H, Masignani V, Cieslewicz MJ, Donati C, Medini D, Ward NL, et al. Genome analysis of multiple pathogenic isolates of Streptococcus agalactiae: implications for the microbial “pan-genome”. Proc Natl Acad Sci. 2005;102(39):13950–5.

Wang W, Mauleon R, Hu Z, Chebotarov D, Tai S, Wu Z, et al. Genomic variation in 3,010 diverse accessions of Asian cultivated rice. Nature. 2018;557(7703):43–9.

Zhao Q, Feng Q, Lu H, Li Y, Wang A, Tian Q, et al. Pan-genome analysis highlights the extent of genomic variation in cultivated and wild rice. Nat Genet. 2018;50(2):278–84.

Gage JL, Vaillancourt B, Hamilton JP, Manrique-Carpintero NC, Gustafson TJ, Barry K, et al. Multiple maize reference genomes impact the identification of variants by genome-wide association study in a diverse inbred panel. Plant Genome. 2019;12(2):1–12.

Li Y-H, Zhou G, Ma J, Jiang W, Jin L-G, Zhang Z, et al. De novo assembly of soybean wild relatives for pan-genome analysis of diversity and agronomic traits. Nat Biotechnol. 2014;32(10):1045–52.

Hurgobin B, Golicz AA, Bayer PE, Chan CKK, Tirnaz S, Dolatabadian A, et al. Homoeologous exchange is a major cause of gene presence/absence variation in the amphidiploid Brassica napus. Plant Biotechnol J. 2018;16(7):1265–74.

Montenegro JD, Golicz AA, Bayer PE, Hurgobin B, Lee H, Chan CKK, et al. The pangenome of hexaploid bread wheat. Plant J. 2017;90(5):1007–13.

Yu J, Golicz AA, Lu K, Dossa K, Zhang Y, Chen J, et al. Insight into the evolution and functional characteristics of the pan-genome assembly from sesame landraces and modern cultivars. Plant Biotechnol J. 2019;17(5):881–92.

Contreras-Moreira B, Cantalapiedra CP, García-Pereira MJ, Gordon SP, Vogel JP, Igartua E, et al. Analysis of plant pan-genomes and transcriptomes with GET_HOMOLOGUES-EST, a clustering solution for sequences of the same species. Front Plant Sci. 2017;8:184.

Song J-M, Guan Z, Hu J, Guo C, Yang Z, Wang S, et al. Eight high-quality genomes reveal pan-genome architecture and ecotype differentiation of Brassica napus. Nat Plants. 2020;6(1):34–45.

Zhao J, Bayer PE, Ruperao P, Saxena RK, Khan AW, Golicz AA, et al. Trait associations in the pangenome of pigeon pea (Cajanus cajan). Plant Biotechnol J. 2020;18(9):1946–54.

Asghar A, Majeed MN. Chemical characterization and fatty acid profile of different sesame verities in Pakistan. Am J Sci Ind Res. 2013;4:540–5.

Baydar H. Breeding for the improvement of the ideal plant type of sesame. Plant Breed. 2005;124(3):263–7.

Kobayashi T, Kinoshita M, Hattori S, Ogawa T, Tsuboi Y, Ishida M, et al. Development of the sesame metallic fuel performance code. Nucl Technol. 1990;89(2):183–93.

Kobayashi T. Cytogenetics of sesame (Sesamum indicum). In: Developments in plant genetics and breeding. Amsterdam: Elsevier; 1991. p. 581–92.

Nayar NM, Mehra K. Sesame: its uses, botany, cytogenetics, and origin. Econ Bot. 1970:20–31.

Pham TD, Thi Nguyen T-D, Carlsson AS, Bui TM. Morphological evaluation of sesame (Sesamum indicum L.) varieties from different origins. Aust J Crop Sci. 2010;4(7):498.

Wei W, Zhang Y, Wang L, Li D, Gao Y, Zhang X. Genetic diversity, population structure, and association mapping of 10 agronomic traits in sesame. Crop Sci. 2016;56(1):331–43.

Wei X, Gong H, Yu J, Liu P, Wang L, Zhang Y, et al. SesameFG: an integrated database for the functional genomics of sesame. Sci Rep. 2017;7(1):1–10.

Zhang Y, Zhang X, Che Z, Wang L, Wei W, Li D. Genetic diversity assessment of sesame core collection in China by phenotype and molecular markers and extraction of a mini-core collection. BMC Genet. 2012;13(1):102.

Zhang Y-X, Zhang X-R, Hua W, Wang L-H, Che Z. Analysis of genetic diversity among indigenous landraces from sesame (Sesamum indicum L.) core collection in China as revealed by SRAP and SSR markers. Genes Genomics. 2010;32(3):207–15.

Dossa K, Wei X, Zhang Y, Fonceka D, Yang W, Diouf D, et al. Analysis of genetic diversity and population structure of sesame accessions from Africa and Asia as major centers of its cultivation. Genes. 2016;7(4):14.

Cho Y-I, Park J-H, Lee C-W, Ra W-H, Chung J-W, Lee J-R, et al. Evaluation of the genetic diversity and population structure of sesame (Sesamum indicum L.) using microsatellite markers. Genes Genomics. 2011;33(2):187–95.

Yepuri V, Surapaneni M, Kola VSR, Vemireddy L, Jyothi B, Dineshkumar V, et al. Assessment of genetic diversity in sesame (Sesamum indicum L.) genotypes, using EST-derived SSR markers. J Crop Sci Biotechnol. 2013;16(2):93–103.

Park J-H, Suresh S, Cho G-T, Choi N-G, Baek H-J, Lee C-W, et al. Assessment of molecular genetic diversity and population structure of sesame (Sesamum indicum L.) core collection accessions using simple sequence repeat markers. Plant Genet Resour. 2014;12(1):112–9.

Yue W, Wei L, Zhang T, Li C, Miao H, Zhang H. Genetic diversity and population structure of germplasm resources in sesame (Sesamum indicum L.) by SSR markers. Acta Agron Sin. 2012;38(12):2286–96.

Wei W, Zhang Y, Lv H, Wang L, Li D, Zhang X. Population structure and association analysis of oil content in a diverse set of Chinese sesame (Sesamum indicum L.) germplasm. Sci Agric Sin. 2012;45(10):1895–903.

Wei W, Zhang Y, Lü H, Li D, Wang L, Zhang X. Association analysis for quality traits in a diverse panel of chinese sesame (Sesamum indicum L.) Germplasm. J Integr Plant Biol. 2013;55(8):745–58.

Wu K, Yang M, Liu H, Tao Y, Mei J, Zhao Y. Genetic analysis and molecular characterization of Chinese sesame (Sesamum indicum L.) cultivars using Insertion-Deletion (InDel) and Simple Sequence Repeat (SSR) markers. BMC Genet. 2014;15(1):35.

Akbar F, Rabbani MA, Masood MS, Shinwari ZK. Genetic diversity of sesame (Sesamum indicum L.) germplasm from Pakistan using RAPD markers. Pak J Bot. 2011;43(4):2153–60.

Al-Somain BHA, Migdadi HM, Al-Faifi SA, Alghamdi SS, Muharram AA, Mohammed NA, et al. Assessment of genetic diversity of sesame accessions collected from different ecological regions using sequence-related amplified polymorphism markers. 3 Biotech. 2017;7(1):82.

Arriel NHC, Di Mauro AO, Arriel EF, Unêda-Trevisoli SH, Costa MM, Bárbaro IM, et al. Genetic divergence in sesame based on morphological and agronomic traits. Crop Breed Appl Biotechnol. 2007:253–61.

Basak M, Uzun B, Yol E. Genetic diversity and population structure of the Mediterranean sesame core collection with use of genome-wide SNPs developed by double digest RAD-Seq. PLoS One. 2019;14(10):e0223757.

Bedigian D. Evolution of sesame revisited: domestication, diversity and prospects. Genet Resour Crop Evol. 2003;50(7):779–87.

Bedigian D, Smyth C, Harlan JR. Patterns of morphological variation in Sesamum indicum. Econ Bot. 1986;40(3):353–65.

Cui C, Mei H, Liu Y, Zhang H, Zheng Y. Genetic diversity, population structure, and linkage disequilibrium of an association-mapping panel revealed by genome-wide SNP markers in sesame. Front Plant Sci. 2017;8:1189.

Dar AA, Mudigunda S, Mittal PK, Arumugam N. Comparative assessment of genetic diversity in Sesamum indicum L. using RAPD and SSR markers. 3 Biotech. 2017;7(1):10.

de Sousa Araújo E, Arriel NHC, dos Santos RC, de Lima LM. Assessment of genetic variability in sesame accessions using SSR markers and morpho-agronomic traits. Aust J Crop Sci. 2019;13(1):45.

Dossa K, Wei X, Li D, Fonceka D, Zhang Y, Wang L, et al. Insight into the AP2/ERF transcription factor superfamily in sesame and expression profiling of DREB subfamily under drought stress. BMC Plant Biol. 2016;16(1):171.

Ercan AG, Taskin M, Turgut K. Analysis of genetic diversity in Turkish sesame (Sesamum indicum L.) populations using RAPD markers. Genet Resour Crop Evol. 2004;51(6):599–607.

Gebremichael DE, Parzies HK. Genetic variability among landraces of sesame in Ethiopia. Afr Crop Sci J. 2011;19(1).

Hika G, Geleta N, Jaleta Z. Genetic variability, heritability and genetic advance for the phenotypic traits in sesame (Sesamum indicum L.) populations from Ethiopia. Sci Technol Arts Res J. 2015;4(1):20–6.

Pandey SK, Das A, Rai P, Dasgupta T. Morphological and genetic diversity assessment of sesame (Sesamum indicum L.) accessions differing in origin. Physiol Mol Biol Plants. 2015;21(4):519–29.

Parsaeian M, Mirlohi A, Saeidi G. Study of genetic variation in sesame (Sesamum indicum L.) using agro-morphological traits and ISSR markers. Russ J Genet. 2011;47(3):314.

Pham TD, Geleta M, Bui TM, Bui TC, Merker A, Carlsson AS. Comparative analysis of genetic diversity of sesame (Sesamum indicum L.) from Vietnam and Cambodia using agro-morphological and molecular markers. Hereditas. 2011;148(1):28–35.

Wei X, Wang L, Zhang Y, Qi X, Wang X, Ding X, et al. Development of simple sequence repeat (SSR) markers of sesame (Sesamum indicum) from a genome survey. Molecules. 2014;19(4):5150–62.

Wei X, Zhu X, Yu J, Wang L, Zhang Y, Li D, et al. Identification of sesame genomic variations from genome comparison of landrace and variety. Front Plant Sci. 2016;7:1169.

Woldesenbet DT, Tesfaye K, Bekele E. Genetic diversity of sesame germplasm collection (Sesamum indicum L.): implication for conservation, improvement and use. Int J Biotechnol Mol Biol Res. 2015;6(2):7–18.

Asekova S, Oh E, Kulkarni KP, Lee MH, Kim JI, Pae S-B, et al. A combinatorial approach of biparental QTL mapping and genome-wide association analysis identifies candidate genes for phytophthora blight resistance in sesame. bioRxiv. 2020; https://doi.org/10.1101/2020.03.18.996637 .

Mei H, Cui C, Liu Y, Liu Y, Cui X, Du Z, et al. Genome-wide association study of seed coat color in sesame (Sesamum indicum L.). PLoS One. 2020. https://doi.org/10.21203/rs.2.18296/v2 .

Xiurong Z, Yingzhong Z, Yong C, Xiangyun F, Qingyuan G, Mingde Z, et al. Establishment of sesame germplasm core collection in China. Genet Resour Crop Evol. 2000;47(3):273–9.

Zhang H, Miao H, Wang L, Qu L, Liu H, Wang Q, et al. Genome sequencing of the important oilseed crop Sesamum indicum L. Genome Biol. 2013;14(1):401.

Kitts PA, Church DM, Thibaud-Nissen F, Choi J, Hem V, Sapojnikov V, et al. Assembly: a resource for assembled genomes at NCBI. Nucleic Acids Res. 2016;44(D1):D73–80.

Wang L, Yu J, Li D, Zhang X. Sinbase: an integrated database to study genomics, genetics and comparative genomics in Sesamum indicum. Plant Cell Physiol. 2015;56(1):e2.

Tam V, Patel N, Turcotte M, Bossé Y, Paré G, Meyre D. Benefits and limitations of genome-wide association studies. Nat Rev Genet. 2019;20(8):467–84.

Li N, Zheng H, Cui J, Wang J, Liu H, Sun J, et al. Genome-wide association study and candidate gene analysis of alkalinity tolerance in japonica rice germplasm at the seedling stage. Rice. 2019;12(1):24.

Zhang P, Zhong K, Zhong Z, Tong H. Genome-wide association study of important agronomic traits within a core collection of rice (Oryza sativa L.). BMC Plant Biol. 2019;19(1):259.

Hyten DL, Choi I-Y, Song Q, Shoemaker RC, Nelson RL, Costa JM, et al. Highly variable patterns of linkage disequilibrium in multiple soybean populations. Genetics. 2007;175(4):1937–44.

Li M, Liu Y, Tao Y, Xu C, Li X, Zhang X, et al. Identification of genetic loci and candidate genes related to soybean flowering through genome wide association study. BMC Genomics. 2019;20(1):987.

Wu Z, Wang B, Chen X, Wu J, King GJ, Xiao Y, et al. Evaluation of linkage disequilibrium pattern and association study on seed oil content in Brassica napus using ddRAD sequencing. PLoS One. 2016;11(1):e0146383.

Rashid Z, Singh PK, Vemuri H, Zaidi PH, Prasanna BM, Nair SK. Genome-wide association study in Asia-adapted tropical maize reveals novel and explored genomic regions for sorghum downy mildew resistance. Sci Rep. 2018;8(1):1–12.

Dossa K, Zhou R, Li D, Liu A, Qin L, Mmadi MA, et al. A novel motif in the 5’-UTR of an orphan gene ‘Big Root Biomass’ modulates root biomass in sesame. Plant Biotechnol J. 2020. https://doi.org/10.1111/pbi.13531.

Su R, Zhou R, Mmadi MA, Li D, Qin L, Liu A, et al. Root diversity in sesame (Sesamum indicum L.): insights into the morphological, anatomical and gene expression profiles. Planta. 2019;250(5):1461–74.

Zhang H, Miao H, Wei L, Li C, Zhao R, Wang C. Genetic analysis and QTL mapping of seed coat color in sesame (Sesamum indicum L.). PLoS One. 2013;8(5):e63898.

Chowdhury S, Basu A, Kundu S. Overexpression of a new osmotin-like protein gene (SindOLP) confers tolerance against biotic and abiotic stresses in sesame. Front Plant Sci. 2017;8:410.

Martins PK, Nakayama TJ, Ribeiro AP, da Cunha BADB, Nepomuceno AL, Harmon FG, et al. Setaria viridis floral-dip: a simple and rapid Agrobacterium-mediated transformation method. Biotechnol Rep. 2015;6:61–3.

Gomes C, Dupas A, Pagano A, Grima-Pettenati J, Paiva JAP. Hairy root transformation: a useful tool to explore gene function and expression in Salix spp. recalcitrant to transformation. Front Plant Sci. 2019;10:1427.

Acknowledgements

The data summarized in this paper were generated through the work of several authors, whom we thank for their continuous efforts towards the emergence of the sesame crop. We are also thankful to Dr Muhammad Amjad Nawaz for his assistance in drawing the sesame plant.

Funding

The study was supported by the Wuhan cutting-edge application technology fund (2018020401011303), the Science and Technology Innovation Project of Hubei province (201620000001048), the Natural Science Foundation of Hubei Province, China (2019CFB574), the Fundamental Research Funds for Central Non-profit Scientific Institution (1610172019004, Y2019XK15-02), the Agricultural Science and Technology Innovation Project of the Chinese Academy of Agricultural Sciences (CAAS-ASTIP-2013-OCRI) and the China Agriculture Research System (CARS-14). The funders had no role in the design of the study, in the collection, analysis and interpretation of data, or in writing the manuscript.

Author information

Muez Berhe and Komivi Dossa contributed equally to this work.

Authors and Affiliations

Oil Crops Research Institute of the Chinese Academy of Agricultural Sciences, Key Laboratory of Biology and Genetic Improvement of Oil Crops, Ministry of Agriculture and Rural Affairs, No.2 Xudong 2nd Road, Wuhan, 430062, China

Muez Berhe, Komivi Dossa, Jun You, Xiurong Zhang & Linhai Wang

Humera Agricultural Research Center of Tigray Agricultural Research Institute, Humera, Tigray, Ethiopia

Laboratoire Campus de Biotechnologies Végétales, Département de Biologie Végétale, Faculté des Sciences et Techniques, Université Cheikh Anta Diop, BP 5005 Dakar-Fann, 10700, Dakar, Senegal

Komivi Dossa, Idrissa Navel Diallo & Diaga Diouf

Laboratory of Genetics, Horticulture and Seed Sciences, Faculty of Agronomic Sciences, University of Abomey-Calavi, 01 BP 526, Cotonou, Republic of Benin

Komivi Dossa

Département de Mathématiques et Informatique, Faculté des Sciences et Techniques, Université Cheikh Anta Diop, BP 5005 Dakar-Fann, 10700, Dakar, Senegal

Pape Adama Mboup & Idrissa Navel Diallo

Contributions

M B, K D and L W conceived and designed the paper; M B, K D, L W, J Y, D D and X Z collected and analyzed the literature; K D and M B conducted multi-locus GWAS analyses; P A M, I N D, K D and D D designed and developed SiGeDiD; M B and K D drafted the paper and prepared the figures; L W, J Y, D D and X Z revised the manuscript. All authors have read and approved the final version of the manuscript.

Corresponding authors

Correspondence to Komivi Dossa or Linhai Wang.

Ethics declarations

Ethics approval and consent to participate.

Not applicable

Consent for publication

Competing interests.

The authors declare no conflict of interest

Additional information

Publisher’s note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Additional file 1: Table S1. Summary list of total QTLs and candidate genes identified in GWAS for root length and seed coat color along the linkage groups in sesame by multi-locus and single-locus models. Table S2. Summary of QTL and candidate genes detected by each GWAS model. Table S3. Candidate genes detected in each LG for each model.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ . The Creative Commons Public Domain Dedication waiver ( http://creativecommons.org/publicdomain/zero/1.0/ ) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

About this article

Cite this article.

Berhe, M., Dossa, K., You, J. et al. Genome-wide association study and its applications in the non-model crop Sesamum indicum . BMC Plant Biol 21 , 283 (2021). https://doi.org/10.1186/s12870-021-03046-x

Received : 25 October 2020

Accepted : 17 May 2021

Published : 22 June 2021

DOI : https://doi.org/10.1186/s12870-021-03046-x

  • Statistical models
  • Genomics assisted breeding

BMC Plant Biology

ISSN: 1471-2229

research paper on genome wide association studies

Advancements and Prospects of Genome-wide Association Studies

About this Research Topic

Genome-wide association studies (GWAS) aim to identify the genetic variants associated with a dichotomous (e.g., type 2 diabetes case/control status) or quantitative (e.g., serum fasting glucose levels) traits. The study involves a high-density scan of the genome, genotyping single nucleotide polymorphisms ...

Keywords : Genome-wide association studies, drug development, trait analysis, Polygenic Risk Score

Integrating both common and rare variants to predict bone mineral density and fracture

Sirui Gai, Yu Qian, Zhenlin Zhang, Hou-Feng Zheng, Integrating both common and rare variants to predict bone mineral density and fracture, Journal of Bone and Mineral Research , Volume 39, Issue 3, March 2024, Pages 193–194, https://doi.org/10.1093/jbmr/zjad022

For more than 10 yr now, genome-wide association studies (GWASs) have identified around 1000 genetic loci associated with BMD, osteoporosis, and fractures in human populations [1]. However, even in large-scale biobank–based GWAS [2], most of the identified genetic variants are common, and previous whole-genome sequencing efforts identified few low-frequency and rare variants associated with these bone traits [3, 4]. In this issue, Lu et al. [5] performed a genome-wide burden test using more than 450 000 whole-exome sequencing (WES) samples from the UK Biobank to identify potentially influential rare variants (with a minor allele frequency [MAF] ≤1‰) that likely confer a strong effect on BMD-related traits.

BMD can be estimated from the speed of sound (SOS) and broadband ultrasound attenuation measured by quantitative ultrasound at the heel. In 2021, Lu et al. [6] developed a genetically predicted SOS (gSOS) for individuals in the UK Biobank from common genetic variants using a genome-wide genetic risk score (GRS). They demonstrated that this score provided moderately better fracture risk prediction than some clinical risk factors, such as smoking and the use of corticosteroids.
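A genetic risk score of this kind is, in its simplest form, a weighted sum of effect-allele counts. The sketch below illustrates only that general idea; the function name, the three variants, and their weights are hypothetical and are not taken from Lu et al., whose gSOS was built from a genome-wide set of variants with its own fitting procedure.

```python
from typing import Sequence


def genetic_risk_score(dosages: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted sum of effect-allele dosages (0, 1 or 2 copies per variant)."""
    if len(dosages) != len(weights):
        raise ValueError("dosages and weights must have the same length")
    return sum(d * w for d, w in zip(dosages, weights))


# Hypothetical per-variant effect sizes (illustrative only, not from the study).
weights = [0.25, -0.5, 0.125]
# One individual's effect-allele counts at the same three variants.
dosages = [2, 0, 1]
score = genetic_risk_score(dosages, weights)  # 2*0.25 + 0*(-0.5) + 1*0.125
```

In practice the weights would come from a GWAS or a penalized regression fitted on a training cohort, and the score would be evaluated in held-out individuals.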

Rare variants, particularly those in regions of low linkage disequilibrium, could contribute additional variance or heritability for some complex traits (such as height) [7]. Moreover, including rare variants in GRS prediction requires a large sample size to ensure sufficient statistical power to detect their effects, because of the low MAF. In view of this, Lu et al. [5] recently tested the association of rare variants with bone-related traits in large-scale UK Biobank WES data and identified dozens of genes harboring influential rare variants that are involved in key regulatory pathways of bone homeostasis. They further developed a generalized GRS for heel SOS (ggSOS) that incorporated both common and rare genetic variants as predictors, and examined whether ggSOS added value to gSOS in predicting bone mineral density, osteoporosis, and fracture risk. In constructing the ggSOS, the authors considered both Bonferroni (stringent strategy) and false discovery rate (lenient strategy) correction to select the representative significant burden mask for each gene. Intriguingly, their results suggested that adding rare variants did not substantially improve predictive performance in either European ancestry or other populations. Because the influential variants were rare, only a small proportion (5.4%) of the population would carry them. When the assessment was restricted to carriers of potentially influential rare variants, ggSOS demonstrated improved predictive performance; however, the improvement was marginal. A similar pattern was observed for obesity risk [8]: although rare variants show a convincing association with BMI, obesity, and extreme obesity, the GRS constructed from rare variants provided limited improvement over the common GRS in the prediction of obesity risk [8].

Therefore, in terms of fracture prediction, is it necessary to call for next-generation sequencing efforts at the clinic level, considering the high cost of sequencing and the low prevalence of influential rare variants? Even for the common GRS, although gSOS might outperform some clinical risk factors in fracture risk prediction, it does not perform as well as BMD measurement itself [6]. Although ggSOS might not be substantially better than gSOS, the use of GRS in fracture prediction could decrease the number of individuals requiring BMD-based FRAX screening [9].
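The difference between the stringent and lenient selection strategies mentioned above can be illustrated with a small sketch. It assumes that the false discovery rate correction is the Benjamini-Hochberg procedure, which the commentary does not state, and the p values are made up for illustration.

```python
def bonferroni_select(p_values, alpha=0.05):
    """Indices of tests passing the Bonferroni threshold alpha / m."""
    m = len(p_values)
    return [i for i, p in enumerate(p_values) if p <= alpha / m]


def benjamini_hochberg_select(p_values, alpha=0.05):
    """Indices of tests passing the Benjamini-Hochberg step-up procedure."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    # Find the largest rank k whose sorted p value satisfies p_(k) <= alpha * k / m.
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= alpha * rank / m:
            cutoff_rank = rank
    return sorted(order[:cutoff_rank])


# Hypothetical burden-test p values for five masks (illustrative only).
p_values = [0.001, 0.012, 0.02, 0.035, 0.3]
stringent = bonferroni_select(p_values)
lenient = benjamini_hochberg_select(p_values)
```

With these made-up values the Bonferroni rule keeps one test and the Benjamini-Hochberg rule keeps four, which is the sense in which the second strategy is more lenient.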

The limitations of the study were listed in its discussion [5]. In addition, SOS measurement was used in both the training and testing datasets [5]; however, SOS does not correlate very well with BMD [10]. Furthermore, providing an analysis pipeline would help generalize the method of integrating common and rare variants to the prediction of other traits.

We would like to extend our sincere gratitude to the High-performance Computing Center at Westlake University for its invaluable support and resources. We also thank Gareth Wetherill, who helped us proofread the manuscript.

This work was supported by the “Pioneer” and “Leading Goose” R&D Program of Zhejiang (#2023C03164), and the Chinese National Key Technology R&D Program, Ministry of Science and Technology (#2021YFC2501702).

Conflicts of interest: None declared.

1. Zhu X, Bai W, Zheng H. Twelve years of GWAS discoveries for osteoporosis and related traits: advances, challenges and applications. Bone Res. 2021;9(1):23. https://doi.org/10.1038/s41413-021-00143-3

2. Morris JA, Kemp JP, Youlten SE, et al. An atlas of genetic influences on osteoporosis in humans and mice. Nat Genet. 2019;51(2):258–266. https://doi.org/10.1038/s41588-018-0302-x

3. Styrkarsdottir U, Thorleifsson G, Sulem P, et al. Nonsense mutation in the LGR4 gene is associated with several human diseases and other traits. Nature. 2013;497(7450):517–520. https://doi.org/10.1038/nature12124

4. Zheng HF, Forgetta V, Hsu YH, et al. Whole-genome sequencing identifies EN1 as a determinant of bone density and fracture. Nature. 2015;526(7571):112–117. https://doi.org/10.1038/nature14878

5. Lu T, Forgetta V, Zhou S, Richards JB, Greenwood CM. Identifying rare genetic determinants for improved polygenic risk prediction of bone mineral density and fracture risk. J Bone Miner Res. 2023;38(12):1771–1781. https://doi.org/10.1002/jbmr.4920

6. Lu T, Forgetta V, Keller-Baruch J, et al. Improved prediction of fracture risk leveraging a genome-wide polygenic risk score. Genome Med. 2021;13(1):16. https://doi.org/10.1186/s13073-021-00838-6

7. Wainschtein P, Jain D, Zheng Z, et al. Assessing the contribution of rare variants to complex trait heritability from whole-genome sequence data. Nat Genet. 2022;54(3):263–273. https://doi.org/10.1038/s41588-021-00997-7

8. Wang Z, Choi SW, Chami N, et al. The value of rare genetic variation in the prediction of common obesity in European ancestry populations. Front Endocrinol. 2022;13:863893. https://doi.org/10.3389/fendo.2022.863893

9. Forgetta V, Keller-Baruch J, Forest M, et al. Development of a polygenic risk score to improve screening for fracture risk: a genetic risk prediction study. PLoS Med. 2020;17(7):e1003152. https://doi.org/10.1371/journal.pmed.1003152

10. Tuna H, Birtane M, Ekuklu G, Cermik F, Tuna F, Kokino S. Does quantitative tibial ultrasound predict low bone mineral density defined by dual energy X-ray absorptiometry? Yonsei Med J. 2008;49(3):436–442. https://doi.org/10.3349/ymj.2008.49.3.436

  • Online ISSN 1523-4681
  • Print ISSN 0884-0431
  • Copyright © 2024 American Society for Bone and Mineral Research

Knowledge UChicago

The authors confirm that all data underlying the findings are fully available without restriction. Simulation code is available at  https://doi.org/10.5281/zenodo.10520811  and  https://github.com/cveller/confoundedGWAS .

© 2024 Veller, Coop.

This is an open access article distributed under the terms of the Creative Commons Attribution License , which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

  • Letter to the Editor
  • Open access
  • Published: 13 August 2018

The impact of genome-wide association studies on biomedical research publications

  • Travis J. Struck 1,
  • Brian K. Mannakee 2 &
  • Ryan N. Gutenkunst 1 (ORCID: orcid.org/0000-0002-8659-0579)

Human Genomics volume 12, Article number: 38 (2018)

The past decade has seen major investment in genome-wide association studies (GWAS). Among the many goals of GWAS, a major one is to identify and motivate research on novel genes involved in complex human disease. To assess whether this goal is being met, we quantified the effect of GWAS on the overall distribution of biomedical research publications and on the subsequent publication history of genes newly associated with complex disease. We found that the historical skew of publications toward genes involved in Mendelian disease has not changed since the advent of GWAS. Genes newly implicated by GWAS in complex disease do experience additional publications compared to control genes, and they are more likely to become exceptionally studied. But the magnitude of both effects has declined over the past decade. Our results suggest that reforms to encourage follow-up studies may be needed for GWAS to most successfully guide biomedical research toward the molecular mechanisms underlying complex human disease.

Since the first successful genome-wide association studies (GWAS) were published over a decade ago [ 1 – 4 ], thousands have been performed [ 5 ]. These studies have identified tens of thousands of statistical associations between genetic variants and human diseases [ 5 ]. The large investment in GWAS has been criticized [ 6 ], perhaps because initial hopes for quick clinical impact were overenthusiastic [ 7 ]. The average time from basic science discovery to clinical practice is 17 years [ 8 ], so it is unsurprising that few GWAS results directly affect patients yet. But direct clinical impact is not the only goal of GWAS.

One major goal of GWAS has been to broadly characterize the genetic basis of human traits and complex disease. GWAS have shown that most traits are highly polygenic and that most common variants exhibit small effect size on phenotype [ 9 , 10 ]. They have also shown that genetic variants associated with disease are strongly enriched in regulatory regions [ 11 ] and that pleiotropy is pervasive [ 12 , 13 ]. They have also enabled polygenic prediction of traits by aggregating the weak effects of many variants [ 14 , 15 ], although not yet with clinical precision [ 16 ]. These insights have motivated a number of large public genomics projects, such as the ENCODE project to identify functional genomic elements [ 17 ], the Epigenome Roadmap project to identify tissue-specific epigenomic regulation [ 18 ], the GTEx project to connect genetic variation with tissue-specific gene expression [ 19 ], and the Human Cell Atlas project to identify and characterize all cell types in the body [ 20 ].

Another major goal of GWAS has been to specifically identify novel genes involved in complex disease and steer research toward them [ 16 , 21 , 22 ]. Identifying the causal genetic variant and the affected gene(s) that drive an association can be challenging [ 23 ], but integrating data from large genomics projects can provide important clues [ 24 ]. Novel connections between genes and diseases can lead to new treatments. For example, an early GWAS unexpectedly found variation in complement factor H to be strongly associated with macular degeneration [ 2 ], spurring the development of complement-based therapeutics [ 25 ]. Similarly, associations between variation in the interleukin-23 receptor and Crohn’s disease [ 26 ] and psoriasis [ 27 ] motivated the development of several treatments that are now in clinical trials [ 28 ]. In both of these classic examples, going from association to therapy demanded substantial follow-up research.

Beyond anecdotal examples, how much follow-up research typically occurs when a gene is newly associated with complex disease via GWAS? To answer this question, we assessed the impact of GWAS on subsequent biomedical research publications. Our motivation was that if there is little follow-up research on associated genes, then important medical innovations are possibly being missed, and reforms may be necessary to encourage follow-up research.

Published GWAS are themselves often highly cited, for example [ 4 , 26 , 29 ]. A systematic comparison also found that GWAS are more highly cited than comparable candidate gene studies [ 30 ]. But a paper that cites a GWAS does not necessarily follow-up on the associations reported by that GWAS. To quantify how much follow-up research is motivated by GWAS, we focused on the subsequent publication record of newly associated genes.

The distribution of biomedical research publications is highly unequal among human genes (Fig.  1 a ; [ 31 ]). Much of this inequality stems from historical momentum, driven by the availability of prior functional information [ 32 ] or research tools [ 33 ]. Consequently, many potentially medically important genes may be understudied [ 34 ]. Because GWAS are largely unbiased by previous knowledge about genes [ 35 ], they provide an opportunity for understudied genes to be brought to the scientific forefront.

Fig. 1 Biomedical scientific publications are highly unequally distributed and strongly skewed toward genes involved in Mendelian disease, even after the advent of GWAS. (a) The distribution of publications among all human genes is highly uneven. Plotted is the number of publications per gene, with genes sorted by number of publications. (The gene with the fewest publications is plotted as rank 1, and the gene with the most publications as rank 20,422.) A few genes are the subject of thousands of publications each, whereas thousands of genes are the subject of fewer than ten publications each. (b) The distribution of publications among all human genes is more uneven in the post-GWAS era (2005 and later) than in the pre-GWAS era (before 2005). Shown in this Gini plot are the cumulative proportions of publications in each category versus gene rank. The further the curve is from the diagonal, the more uneven the distribution. For comparison, the distribution of publications among yeast genes is shown, with the yeast x-axis stretched to match the number of human genes. (c) Highly studied genes tend to be involved in Mendelian disease. Plotted are the distributions of genes among publication rank for genes of each possible type of disease association and for both the pre- and post-GWAS eras. (Distributions are not normalized across types of disease association.) In both eras, genes involved in Mendelian diseases are strongly enriched toward high publication ranks. By contrast, many genes involved only in complex disease rank low in terms of publications.

We evaluated the effect of GWAS on the biomedical research literature in three ways. At a broad scale, we tested whether the distribution of publications among human genes has changed since the advent of GWAS. At a narrower scale, we quantified the effect of being newly associated with complex disease on the subsequent publication histories of human genes. Lastly, we identified outlier genes with exceptional publication activity and tested whether GWAS might play a role in motivating such activity. Overall, we find that genes newly associated with complex disease do experience increases in publication activity, but this effect has declined over the past decade.

We measured research output on genes using scientific publications, as collected in the NCBI Gene database [ 36 ]. We prefer this manually curated database to automatic text mining, because text mining may introduce false positives when a gene is mentioned in passing. In total, we considered 553,184 biomedical research publications that appeared in the annotations for one or more human genes, most of which were published after 1995 (Additional file  1 : Figure S1).

Broad patterns of publications on human genes

We used the Online Mendelian Inheritance in Man (OMIM) database [ 37 ] and the EBI-NCBI GWAS catalog [ 5 ] to classify genes into those associated with Mendelian disease ( N =1126), complex disease ( N =3648), both ( N =595), or no disease ( N =15,043). As expected [ 31 ], we found that the distribution of publications among human genes was highly uneven. A small number of genes were the subject of many thousands of publications, while a large number of genes were the subject of only a few (Fig.  1 a ).

To quantify the unevenness of publications among genes, we used the Gini coefficient, which ranges from 0 (perfectly even distribution) to 1 (perfectly uneven). The Gini coefficient is calculated from the cumulative distribution of publications versus the gene rank (Fig.  1 b ). To quantify the effect of GWAS on the distribution of publications among human genes, we compared that distribution before and after 2005. We chose 2005 as the cutoff between pre- and post-GWAS eras, because that is the year of the first entry in the GWAS catalog [ 5 ]. Other appropriate cutoff years might be 2007, when the first large GWAS were published, or 2009, to give time for publication patterns to change. Using either of these cutoff years does not qualitatively change our results (Additional file  1 : Figure S2). The inequality of publications among human genes is larger in the post-GWAS era than in the pre-GWAS era (Gini coefficient 0.73 vs 0.65; Fig.  1 b ). It is not inevitable that the distribution of publications should be so unequal; the Gini coefficient of publications among yeast genes is much lower at 0.43 (Fig.  1 b ).
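As a rough illustration of how a Gini coefficient follows from the cumulative distribution versus rank, the sketch below integrates the Lorenz curve with the trapezoid rule. It is a minimal version under stated assumptions: the function name and the toy publication counts are hypothetical, and the exact estimator used in the study is not specified in the text.

```python
def gini(counts):
    """Gini coefficient from the cumulative share of publications versus gene rank."""
    xs = sorted(counts)  # rank genes from fewest to most publications
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        raise ValueError("need at least one gene and one publication")
    area = 0.0      # area under the cumulative-share (Lorenz) curve
    cum_prev = 0.0
    running = 0
    for x in xs:
        running += x
        cum = running / total
        area += (cum_prev + cum) / (2 * n)  # trapezoid of width 1/n
        cum_prev = cum
    # A perfectly even distribution has area 1/2, giving a coefficient of 0.
    return 1.0 - 2.0 * area


even = gini([5, 5, 5, 5])        # every gene equally studied
skewed = gini([0, 0, 0, 10])     # one gene receives every publication
```

For a finite set of n genes this estimator reaches at most (n − 1)/n, so the fully skewed toy example with four genes yields 0.75 rather than 1.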

The ultimate goal of most biomedical research is to improve human health, so the distribution of publications is expected to be skewed toward genes involved in human disease. In the pre-GWAS era, genes associated with Mendelian disease were, almost without exception, among the most highly studied human genes (Fig. 1c and Additional file 1: Figure S2). By contrast, many genes that would later be associated with complex disease were among the least studied human genes (Fig. 1c). The advent of GWAS led to the discovery of many genes associated with complex human disease. The focus of biomedical publications on Mendelian disease genes, however, remains strong in the post-GWAS era (Fig. 1c). In particular, many genes associated with complex disease remain among the least studied genes in the human genome (Fig. 1c). The distribution of publication ranks for genes associated only with complex disease has shifted slightly toward higher ranks in the post-GWAS era compared to the pre-GWAS era (Mann-Whitney U test, p ∼ 10⁻⁹, N = 3648), but the distribution has not changed qualitatively. Examining the distributions of publication ranks at higher temporal resolution also does not reveal any qualitative changes (Additional file 1: Figure S3).

Subsequent publications on individual genes

To quantify the immediate effect of GWAS on research into individual newly associated genes, we considered all genes that were first associated with complex disease via GWAS before 2015 (N = 2442), and we focused on the calendar year of the first association and the following 2 years. For each new GWAS gene, we compared the publications over this period with a control non-GWAS gene chosen to have as similar a prior publication history as possible (see the “Materials and methods” section). The variance in an associated gene’s publications is strongly correlated with the number of publications on that gene in the prior 3 years (Fig. 2a). Normalizing the excess in publications relative to the control gene by the square root of the number of recent publications normalizes the variance (Fig. 2b), consistent with a Poisson model for publication output [38]. The normalized excess in publications for a GWAS gene is slightly but significantly shifted (Fig. 2c; one-sample t test, p ∼ 5×10⁻³⁴, N = 2442). The mean normalized excess is 1.24 units, corresponding to a mean excess of 2.95 publications over the 3 years following association.
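The normalization described above can be sketched in a few lines. This is an illustration under assumptions, not the study's code: the function names and the example numbers are hypothetical, and how genes with no recent publications were handled is not stated in the text, so this sketch simply rejects them.

```python
import math
from statistics import mean, stdev


def normalized_excess(gwas_pubs, control_pubs, recent_pubs):
    """Publication excess over the control gene, scaled by sqrt(recent publications)."""
    if recent_pubs <= 0:
        # Assumption: the treatment of genes with zero recent publications is unknown.
        raise ValueError("recent_pubs must be positive in this sketch")
    return (gwas_pubs - control_pubs) / math.sqrt(recent_pubs)


def one_sample_t(values, mu0=0.0):
    """t statistic for testing whether the mean of values differs from mu0."""
    n = len(values)
    return (mean(values) - mu0) / (stdev(values) / math.sqrt(n))


# Hypothetical gene: 12 publications after association, 6 for its matched
# control gene, and 9 publications in the prior 3 years.
example = normalized_excess(12, 6, 9)  # (12 - 6) / sqrt(9)
```

Dividing by the square root is the natural scale if publication counts are roughly Poisson, because the standard deviation of a Poisson count equals the square root of its mean.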

Fig. 2 Effect on subsequent publications for genes newly associated with complex disease via GWAS. To quantify the short-term effects of GWAS association, we considered the publication excess of each newly associated gene compared with its control gene. (a) The variance of the publication excess is strongly correlated with the associated gene’s number of recent publications. (b) Normalizing the publication excess by the square root of the number of recent publications equalizes the variance. It also reveals a trend for the normalized effect of GWAS association to be smaller for more heavily studied genes. (c) The distribution of normalized publication excess is shifted toward positive values, indicating a positive effect of GWAS association on subsequent publications. (d) The normalized publication excess for a newly associated gene is weakly correlated with the p value of the association. (e) It is not statistically significantly correlated with the estimated effect size of the association, as quantified by the reported odds ratio. (f) The normalized publication excess is negatively correlated with the publication date of the association. More recently associated genes experience a smaller increase in subsequent publications. Reported correlations ρ are Spearman rank correlations, and thick black lines in panels d–f are linear regressions.

We next sought to identify the factors that determine how large an effect a GWAS will have on an associated gene’s subsequent publications. For example, the more heavily studied a gene was previously, the smaller the effect of GWAS association (Fig. 2b, Spearman rank correlation, p ∼ 6×10⁻⁸, N = 2442).

The strength of a GWAS association is quantified by its statistical p value and its estimated biological effect size, which is most commonly an odds ratio. The normalized publication excess for a newly associated gene is weakly positively correlated with the p value of its association (Fig. 2d; p ∼ 1×10⁻⁴, N = 2442). By contrast, the normalized publication excess is not significantly correlated with the estimated effect size of the reported association (Fig. 2e; p ∼ 0.14, N = 1327).

The strongest predictor of the effect of a GWAS on future publications for associated genes is the year in which the GWAS was published. The typical normalized publication excess has declined dramatically since the early years of GWAS (Fig. 2f; p ∼ 9×10⁻²³, N = 2442).
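The correlations reported in this section are Spearman rank correlations, that is, Pearson correlations computed on ranks. The sketch below is a plain standard-library version with average ranks for ties; the function names and toy data are hypothetical, and it returns only ρ, not the p value.

```python
import math


def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman rank correlation: Pearson correlation of the two rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have equal length of at least 2")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        raise ValueError("rank correlation is undefined for constant input")
    return cov / math.sqrt(vx * vy)


# Toy data: normalized excess falling steadily with GWAS publication year.
years = [2005, 2007, 2009, 2011]
excess = [3.0, 2.5, 1.0, 0.2]
rho = spearman_rho(years, excess)
```

Because only ranks enter the calculation, any strictly decreasing relationship gives ρ = −1 regardless of how steep or nonlinear the decline is.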

The predictors for the effect of GWAS on subsequent publications that we have studied may themselves be correlated; to disentangle their effects, we built a linear regression model. In that model, the effects of the number of recent publications and GWAS publication date are strong and statistically significant (Table  1 ). By contrast, the quantitative properties of the association itself, the p value and the estimated effect size, have weak effects that are not statistically significant.

The GWAS catalog uses a relatively liberal p value threshold of 10⁻⁵ for inclusion of associations into the catalog, and large p value associations may be statistical noise that subsequent researchers properly ignore. To account for this effect, we repeated our analyses using only genes for which the first reported association had p < 10⁻⁸, the suggested threshold for testing low-frequency variants [39]. When we restricted our analysis to these high-confidence associations (Additional file 1: Figure S4), we found that normalized publication excess was no longer significantly correlated with p value (ρ = 0.044, p ∼ 0.23; N = 724), but it was positively correlated with estimated effect size (ρ = 0.094; p ∼ 0.025; N = 570). The negative correlation between normalized publication excess and GWAS publication date was stronger than in the full data (ρ = −0.33; p ∼ 7×10⁻²⁰). The linear regression model (Additional file 1: Table S1) was similar to that for the full data, with effects that were not statistically significant for p value and estimated effect size but significant for number of recent publications and GWAS publication date. Further restricting our analysis to associations for which the lower bound of the 95% confidence interval on the estimated odds ratio was larger than 1.1 (Additional file 1: Figure S5) yielded qualitatively similar results (Additional file 1: Figure S6 and Table S2).

Association with particular diseases might lead to particularly intense study. To test this possibility, we considered the class of disease that each gene was associated with as an additional predictor in the linear regression model. Of the 20 disease classes tested, only metabolic disease had a significant effect on the normalized publication excess (Additional file  1 : Table S3). Further stratifying among metabolic diseases, we found that this trend is driven by studies on type II diabetes and obesity (Additional file  1 : Table S4).

Genes with exceptional publication records

The typical new GWAS gene experiences a modest increase in subsequent publications, but some exceptional genes may experience large increases, so-called hot genes. To identify such genes, we used the model of Pfeiffer and Hoffmann [ 38 ] to predict the number of publications for each gene in each year, based on that gene’s prior publication history. We trained the model on all genes never implicated in complex disease through GWAS. By comparing the model predictions and publication data, we then identified particular years in which particular genes had unexpectedly large numbers of publications (Additional file  2 ). For example, complement factor H had a significant excess of publications in all 3 years following its association with macular degeneration (Fig.  3 a ).
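The comparison of predicted and observed counts can be sketched as a one-sided test with a Bonferroni correction. This is a hedged illustration only: it assumes the observed count is Poisson distributed around the predicted value, takes the prediction as a given input rather than implementing the model of Pfeiffer and Hoffmann, and uses hypothetical function names and numbers.

```python
import math


def poisson_upper_tail(observed, expected):
    """P(X >= observed) for X ~ Poisson(expected)."""
    if expected <= 0:
        raise ValueError("expected must be positive")
    term = math.exp(-expected)  # P(X = 0)
    cdf_below = 0.0             # accumulates P(X <= observed - 1)
    for i in range(observed):
        cdf_below += term
        term *= expected / (i + 1)  # P(X = i + 1) from P(X = i)
    return max(0.0, 1.0 - cdf_below)


def is_hot(observed, expected, n_tests, alpha=0.05):
    """One-sided test with a Bonferroni correction for n_tests gene-year tests."""
    return poisson_upper_tail(observed, expected) * n_tests < alpha


# Hypothetical gene-year: 5 publications predicted, 30 observed, 1000 tests in total.
flagged = is_hot(30, 5.0, 1000)
not_flagged = is_hot(7, 5.0, 1000)
```

Subtracting the cumulative probability from 1 loses precision for very small tails, which is acceptable here because the result is only compared against a threshold.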

Fig. 3 The effect of GWAS in generating exceptionally studied genes. (a) A significantly elevated number of studies were published on complement factor H following its association with macular degeneration via GWAS in 2005 [2]. Solid line is the predicted publication history from the model of Pfeiffer and Hoffmann [38], points indicate actual publication counts, and starred points indicate years with a statistically significant excess (one-sided Bonferroni-corrected p < 0.05). (b) The total number of genes exhibiting an unusual excess in publications peaked in 2009, as did the number of those genes that were recently newly associated with complex disease via GWAS. (c) The number of genes newly associated with complex disease through GWAS has grown since the inception of GWAS. (d) The proportion of genes exhibiting an unusual excess in publications that were recently identified in GWAS peaked at roughly 20% in 2009 and has since declined.

The total number of hot genes per year has recently fluctuated (Fig.  3 b ). Between 2009 and 2016, on average, 0.3% of genes were hot in any given year. Of the genes that were newly associated with complex disease via GWAS within the past 3 years, the probability of being hot was 1.3%. So, being newly associated with complex disease does increase the probability that a gene will become hot. The total number of hot genes that were recently associated with complex disease via GWAS peaked, however, in 2009 (Fig.  3 b ), even as the number of new GWAS genes each year has grown (Fig.  3 c ). Thus, the proportion of hot genes that were recent GWAS hits has declined (Fig.  3 d ).

To further quantify the role of GWAS in creating hot genes, we used a logistic regression model (Table  2 ). Consistent with the overall probabilities (Fig.  3 ), this model showed that being a recent new GWAS hit was an important factor in determining whether a gene would be hot. The effect of being a GWAS hit, however, had a negative interaction with the year. In other words, the effect of GWAS on creating hot genes with exceptional publication records decreased with time.
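A negative interaction between GWAS status and year means that the odds ratio attached to being a recent GWAS hit shrinks as time passes. The sketch below shows that structure only. The coefficient values are made up for illustration and are not those of Table 2, the function and parameter names are hypothetical, and the fitted model may include other covariates.

```python
import math


def hot_probability(is_recent_gwas_hit, year_offset, b0, b_gwas, b_year, b_interaction):
    """Logistic model with a GWAS-by-year interaction term."""
    g = 1.0 if is_recent_gwas_hit else 0.0
    eta = b0 + b_gwas * g + b_year * year_offset + b_interaction * g * year_offset
    return 1.0 / (1.0 + math.exp(-eta))


def gwas_odds_ratio(year_offset, b_gwas, b_interaction):
    """Odds ratio for hot status, recent GWAS hit versus not, in a given year."""
    return math.exp(b_gwas + b_interaction * year_offset)


# Illustrative coefficients (assumptions): a positive main effect of being a
# recent GWAS hit and a negative GWAS-by-year interaction.
B_GWAS, B_INTERACTION = 1.5, -0.2
early = gwas_odds_ratio(0, B_GWAS, B_INTERACTION)
late = gwas_odds_ratio(5, B_GWAS, B_INTERACTION)
```

With these illustrative values the odds ratio falls from exp(1.5) in the reference year to exp(0.5) five years later, which mirrors the qualitative pattern described in the text.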

We analyzed the biomedical research publications to quantify the effect of genome-wide association studies on published scientific research. We found that even after the advent of GWAS, publications remain highly skewed toward Mendelian disease genes, with many complex disease genes receiving little attention (Fig.  1 c ). New complex disease genes identified by GWAS do receive additional study and subsequent publications (Fig.  2 c ), but that effect has declined (Fig.  2 f , Table  1 ). Being newly associated with complex disease does increase a gene’s chance of becoming a “hot” gene, but this effect has also declined (Fig.  3 d , Table  2 ). Together, our results suggest that GWAS have been successful in bringing research attention to novel genes involved in complex human disease, but this influence is waning.

Considering the overall distribution of biomedical publications, we found that GWAS have not reduced the inequality among human genes. The distribution of publications among human genes is characterized by a Gini coefficient of 0.73 in the post-GWAS era (Fig.  1 a ). By comparison, the Gini coefficient of money income among American households was 0.48 in 2016 [ 40 ] and among global households was 0.625 in 2013 [ 41 ]. The inequality of publications among genes is thus substantially greater than the inequality of income among households.
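As a rough illustration of the inequality measure quoted above, the following sketch computes a Gini coefficient from per-gene publication counts. The function name `gini` and the toy counts are hypothetical, and the text does not state which estimator was used for the reported value of 0.73; this is the common rank-weighted formula, shown under that assumption.

```python
def gini(counts):
    """Gini coefficient of non-negative values (0 = perfectly equal)."""
    values = sorted(counts)
    n = len(values)
    total = sum(values)
    if n == 0 or total == 0:
        raise ValueError("need at least one positive value")
    # Rank-weighted formula on ascending-sorted data, ranks i = 1..n:
    # G = 2 * sum(i * x_i) / (n * sum(x_i)) - (n + 1) / n
    weighted = sum(rank * x for rank, x in enumerate(values, start=1))
    return 2.0 * weighted / (n * total) - (n + 1.0) / n


# Hypothetical toy data: one gene holds every publication.
skewed = gini([0, 0, 0, 10])   # 0.75
equal = gini([1, 1, 1, 1])     # 0.0
```

With only four genes the maximum attainable value is 0.75; the upper bound approaches 1 as the number of genes grows.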

Focusing on individual genes, we found that association with complex disease via GWAS is correlated with an increase in subsequent publications (Fig.  2 ). Interestingly, the p value and estimated effect size of the association play a statistically insignificant role in determining the magnitude of that increase (Table  1 and Additional file  1 : Table S1). We found a stronger effect on the subsequent publications for genes newly associated with metabolic disease (Additional file  1 : Tables S3 and S4), perhaps reflecting its recent emphasis in public health [ 42 ]. We also found that association with complex disease via GWAS does raise the chances of a gene becoming an exceptionally studied “hot” gene (Fig.  3 ). But most dramatically, we found that the effects of new association via GWAS have declined over the past decade (Figs.  2 f and 3 d ).

The direct results of a GWAS are associations of a disease with genetic variants, not with genes. For simplicity, we associated each variant with the closest gene, as long as that gene was within 500 kb. But many variants are regulatory, and gene regulation is complex, so some variants may actually most strongly affect other more distant genes [ 23 ]. Thus, some of the gene associations we study may be spurious. But this issue has existed since the advent of GWAS and has not changed markedly since. So, it cannot explain why the effect of GWAS on subsequent publications has declined over time. When studying the effects of genetic evidence on drug development, Nelson et al. [ 43 ] used a more complex approach for assigning variants to genes. They incorporated linkage disequilibrium and attempted to infer regulatory relationships using expression quantitative trait loci (eQTLs) and DNAse hypersensitivity sites. When we analyzed their collection of association data, we found similar results to our original analysis, although the effects were somewhat weaker (Additional file  1 : Table S5 and Figure S7). In particular, we still found a negative relationship between the publication date of an association and its effect on the subsequent publications.

Our measures of scientific publications do not necessarily capture the full effects of GWAS on biomedical research. We considered studies of specific associated genes, but the broad insights GWAS have given into the genetic basis of human disease have substantially affected biomedical research [10–12, 16]. Motivated by the example of complement factor H (Fig. 3a), we focused on publications in a 3-year window following the GWAS. Some follow-up studies may take longer, but using a 5-year window does not change our qualitative conclusions (Additional file 1: Figure S8 and Tables S6 and S7). GWAS may also promote biomedical research in ways that do not involve new publications. For example, drugs with associated genetic evidence are more likely to progress along the development pipeline [43], suggesting that GWAS promote efficient drug development. More broadly, we focused on associations with complex disease, the most common biomedical application of GWAS. But GWAS for drug response have already provided important guidance for personalized treatment [44]. Lastly, human GWAS have applications beyond health. For an evolutionary example, GWAS data have been used to detect adaptation in the human genome [45].

What explains the declining effect of GWAS on subsequent publications regarding newly associated genes? Perhaps early GWAS captured most genetic variants of large effect, so more recent studies find less compelling associations. But estimated effect size is not a strong predictor of subsequent publications (Table  1 ). Moreover, the typical estimated effect size of new associations has declined only modestly, and the absolute number of large-effect associations has grown (Additional file  1 : Figure S9). Or perhaps journal publication criteria have changed over time, making GWAS less visible or follow-up studies more challenging to publish. The typical impact factor of journals GWAS are published in has declined slightly since the advent of GWAS (Additional file  1 : Figure S10A). But the impact factor of the GWAS publication has only a weak effect on the publication excess of newly associated genes (Additional file  1 : Figure S10B). When we included GWAS publication impact factor in our linear regression model, its effect was statistically significant but insufficient to explain the effect of publication date (Additional file  1 : Table S8). Or perhaps researchers are spreading their effort among newly associated genes, so effects on individual genes have declined. But the summed publication excess over all genes newly associated with complex disease in a given time period has also declined over the past decade (Fig.  4 ). Or perhaps the availability of funding for follow-up studies has declined, as overall biomedical research funding has declined in both North America and Europe [ 46 ]. Or perhaps the capacity and interest to perform follow-up analyses has not kept pace with the “fire hose” of GWAS results [ 47 ]. Our data do not point toward a definitive explanation, and further investigation is needed to understand why recent GWAS promote less follow-up study on associated genes than early GWAS.

Fig. 4 Total publication excess of new GWAS genes. For 6-month periods, plotted is the total publication excess (compared to control genes) of genes newly associated with complex disease via GWAS during each period

Over the past decade, GWAS have undeniably contributed greatly to biomedical knowledge [ 16 ]. The development of large-scale accessible databases of phenotypic and genotypic data, such as the UK Biobank [ 48 ], will fuel further contributions. But few GWAS results are directly medically actionable, so follow-up research is essential to translate novel associations into medical innovations. Our results suggest that the ability of GWAS to motivate published follow-up research on associated genes is declining. To maximize the positive impact of GWAS on human health, this trend must be understood and reversed.

Materials and methods

Publication data

We obtained Entrez GeneIDs for all 20,422 human protein-coding genes from NCBI Gene [36] on December 12, 2017. For all those genes, we collected PubMed identifiers of associated publications from NCBI Gene’s gene2pubmed file, downloaded December 12, 2017. This file contains both associations created manually during the curation of Gene References Into Function (GeneRIFs) and associations collected from organism-specific databases, Gene Ontology, and other curated data sources. We then used Biopython [49] to obtain date information for each publication from PubMed, taking the earlier of the reported year and EYear. We followed a similar procedure for yeast genes. We obtained impact factor data from the 2016 InCites Journal Citation Reports [50].

Disease data

To identify genes associated with Mendelian disease, we downloaded the Online Mendelian Inheritance in Man (OMIM) Gene Map of connections from genes to traits [ 37 ] on January 17, 2018. We filtered to keep only entries with a confidence code of “confirmed” and to ignore entries indicating a potentially spurious mapping or association with a non-disease trait. We further considered only entries with Entrez GeneIDs, to avoid ambiguity among gene names and aliases. This procedure yielded 1878 genes associated with disease traits. Of these, 1543 genes were associated with Mendelian but not complex multifactorial disease, 157 were associated with complex multifactorial but not Mendelian disease, and 178 were associated with both Mendelian and complex multifactorial disease.

To further identify genes associated with complex disease and to gather GWAS data, we used the January 1, 2017, release of NHGRI-EBI’s GWAS Catalog [ 5 ]. We filtered the catalog to remove non-disease traits, by keeping only entries that were children of the term “disease” (EFO0000408) in the Experimental Factor Ontology [ 51 ]. To connect associated variants with genes, we began with the Mapped Genes column in the catalog. We then connected each variant with its closest mapped gene, if that gene was within 500 kb. If a variant was within two overlapping genes, we connected with both genes. This procedure yielded 4069 genes associated with complex disease. To analyze the classes of disease, we used the children of the term “disease” in the Experimental Factor Ontology.
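The variant-to-gene rule described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the `(name, start, end)` tuple layout, and the choice to return every gene tied at the minimum distance are assumptions. A variant lying inside two overlapping genes has distance zero to both, so both are returned.

```python
MAX_DISTANCE_BP = 500_000  # the 500 kb cutoff stated in the text


def distance_to_gene(position, start, end):
    """Distance in bp from a variant to a gene span; 0 if inside it."""
    if start <= position <= end:
        return 0
    return min(abs(position - start), abs(position - end))


def assign_genes(position, mapped_genes, max_distance=MAX_DISTANCE_BP):
    """Return names of the closest mapped gene(s) within max_distance.

    mapped_genes: list of (name, start, end) on the variant's chromosome
    (hypothetical layout). Ties at the minimum distance are all kept.
    """
    distances = [(distance_to_gene(position, s, e), name)
                 for name, s, e in mapped_genes]
    if not distances:
        return []
    best = min(d for d, _ in distances)
    if best > max_distance:
        return []
    return sorted(name for d, name in distances if d == best)


# Hypothetical coordinates: the variant sits inside both A and B.
genes = [("A", 900, 1_100), ("B", 950, 2_000), ("C", 300_000, 310_000)]
hits = assign_genes(1_000, genes)  # ['A', 'B']
```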

Our analysis of OMIM and the GWAS catalog yielded 5369 total disease-associated genes. Considering genes associated with only Mendelian disease in OMIM and not associated with disease through GWAS yielded 1126 Mendelian disease genes. Considering genes associated with only complex multifactorial disease in OMIM or associated with disease through GWAS yielded 3648 complex disease genes. The remaining 595 genes were associated with both Mendelian and complex disease.
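The three-way partition described above amounts to simple set arithmetic. The sketch below assumes genes are represented by identifiers held in Python sets; the function and key names are hypothetical.

```python
def classify_disease_genes(omim_mendelian, omim_complex, gwas_complex):
    """Split disease genes into Mendelian-only, complex-only, and both."""
    mendelian = set(omim_mendelian)
    # Complex disease evidence can come from OMIM or from the GWAS Catalog.
    complex_genes = set(omim_complex) | set(gwas_complex)
    return {
        "mendelian_only": mendelian - complex_genes,
        "complex_only": complex_genes - mendelian,
        "both": mendelian & complex_genes,
    }


# Hypothetical gene identifiers for illustration only.
groups = classify_disease_genes(
    omim_mendelian={1, 2, 3},
    omim_complex={3, 4},
    gwas_complex={2, 5},
)
# groups["mendelian_only"] == {1}; groups["both"] == {2, 3}
```

The reported counts are consistent with this reading: 1543 + 178 = 1721 genes are Mendelian in OMIM, and 1126 + 595 = 1721 genes fall in the final Mendelian-only and both groups.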

Of the disease genes in the GWAS catalog, 2442 were first associated prior to 2015, so we could analyze three full years of publication data. For those genes, we identified odds ratios as reported effect sizes without units for variants that had a reported frequency of the risk allele. For our odds ratio analysis, we analyzed the 1327 genes for which an odds ratio was reported in the first year of GWAS association.

We also analyzed the association data of Nelson et al. [ 43 ]. They connected variants to genes using linkage disequilibrium, expression QTLs, and DNAse hypersensitivity. We filtered their Supplementary Data Set 1 to remove associations from OMIM, which may be Mendelian diseases. We also manually classified traits as disease or non-disease (Additional file  3 ), filtering out the non-disease traits.

Control genes

For each of our 2442 GWAS genes, we identified its control gene as the non-GWAS gene with the closest number of total publications prior to the year the gene was first associated with complex disease. If multiple genes were tied for closest, we compared the previous year as well, continuing either until there was no ambiguity or until we reached 1950. For the 233 GWAS genes with ambiguous control genes, we compared subsequent publications between the GWAS gene and the average of the control genes.
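One possible reading of this matching procedure is sketched below. The data layout (a dict of year to publication count per gene), the function names, and the exact handling of the 1950 boundary are assumptions; the text does not specify them.

```python
def cumulative_before(pubs_by_year, year):
    """Total publications strictly before `year`."""
    return sum(n for y, n in pubs_by_year.items() if y < year)


def find_controls(gwas_pubs, candidates, assoc_year, first_year=1950):
    """Return the control gene name(s) for one GWAS gene.

    candidates maps non-GWAS gene name -> {year: count}. Genes are
    filtered by closeness of cumulative publication count before
    assoc_year; ties are broken by stepping back one year at a time
    until one gene remains or first_year has been checked.
    """
    remaining = sorted(candidates)
    year = assoc_year
    while len(remaining) > 1 and year >= first_year:
        target = cumulative_before(gwas_pubs, year)
        gaps = {g: abs(cumulative_before(candidates[g], year) - target)
                for g in remaining}
        best = min(gaps.values())
        remaining = [g for g in remaining if gaps[g] == best]
        year -= 1
    return remaining


# Hypothetical toy histories: A and B tie in 2002, A wins in 2001.
gwas_gene = {2000: 3, 2001: 2}
pool = {"A": {2000: 5}, "B": {2001: 5}, "C": {1999: 1}}
controls = find_controls(gwas_gene, pool, assoc_year=2002)  # ['A']
```

When more than one name is returned, the comparison would use the average over those genes, as described for the 233 ambiguous cases.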

Publication rate model

We used the model of Pfeiffer and Hoffmann [ 38 ] to predict expected per-gene publication rates:

Here, \(\Delta P_{i,t+1}\) is the predicted number of publications for gene \(i\) in year \(t+1\), and \(P_{i,t}\) and \(P^{*}_{t}\) are the cumulative number of publications in previous years for the gene and the average cumulative number of publications for all genes in the organism, respectively. The term in the denominator models saturation of publication rates. The three rate parameters, \(k_1\), \(k_2\), and \(k_3\), and the saturation parameters, \(P_S\) and \(\alpha\), were assumed to be identical for all genes. To fit the parameters to our data, we constructed a likelihood function by assuming that the number of publications each year for each gene was independently Poisson distributed with mean \(\Delta P_{i,t+1}\) given by Eq. 1. We then maximized that likelihood with respect to the five model parameters, using publication data from 1950 to 2015 for all non-GWAS genes. The maximum-likelihood parameter values were \(k_1 = 0.0214\), \(k_2 = 0.225\), \(k_3 = 0.00288\), \(P_S = 24.1\), and \(\alpha = 1.67\). Five genes each had one publication prior to 1950 that was not included in the data fit.

To identify the years in which genes had significantly elevated publication rates, our null model was that publications were Poisson distributed with mean given by Eq. 1. Significant gene years were defined as those in which the probability of generating at least the observed number of publications was less than the Bonferroni-corrected significance cutoff \(0.05/(N_g N_y)\). Here, \(N_g\) = 20,442 was the total number of genes considered, and \(N_y\) = 67 was the total number of years.
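The significance test for a single gene-year can be sketched as below. The expected publication count is treated as a given input (it would come from Eq. 1); the function names are hypothetical, and the gene and year counts are the values quoted in the paragraph above.

```python
import math

N_GENES = 20_442   # N_g as quoted in the text
N_YEARS = 67       # N_y as quoted in the text
CUTOFF = 0.05 / (N_GENES * N_YEARS)  # Bonferroni-corrected threshold


def poisson_pmf(j, mean):
    """P(X = j) for X ~ Poisson(mean), evaluated in log space."""
    return math.exp(j * math.log(mean) - mean - math.lgamma(j + 1))


def poisson_tail(k, mean):
    """One-sided probability P(X >= k) for X ~ Poisson(mean), mean > 0."""
    if mean <= 0:
        raise ValueError("mean must be positive")
    if k <= 0:
        return 1.0
    if k <= mean:
        # Tail is large here; the complement of the lower sum is accurate.
        return 1.0 - sum(poisson_pmf(j, mean) for j in range(k))
    # Above the mean the terms shrink monotonically, so sum the upper
    # tail directly to keep precision for very small probabilities.
    total, j = 0.0, k
    while True:
        term = poisson_pmf(j, mean)
        total += term
        if term <= total * 1e-16:
            return total
        j += 1


def is_significant_gene_year(observed, expected):
    """True if the observed count is a significant excess over expected."""
    return poisson_tail(observed, expected) < CUTOFF
```

Summing the upper tail directly, rather than subtracting a cumulative sum from one, matters here because the cutoff is on the order of 1e-8 and many gene-years of interest have far smaller tail probabilities.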

Abbreviations

GWAS: Genome-wide association study

Ozaki K, Ohnishi Y, Iida A, Sekine A, Yamada R, Tsunoda T, et al. Functional SNPs in the lymphotoxin-α gene that are associated with susceptibility to myocardial infarction. Nat Genet. 2002; 32(4):650–4.

Klein RJ, Zeiss C, Chew EY, Tsai JY, Sackler RS, Haynes C, et al. Complement factor H polymorphism in age-related macular degeneration. Science. 2005; 308(5720):385–9.

DeWan A, Liu M, Hartman S, Zhang SSM, Liu DTL, Zhao C, et al. HTRA1 promoter polymorphism in wet age-related macular degeneration. Science. 2006; 314:989–92.

Burton PR, Clayton DG, Cardon LR, Craddock N, Deloukas P, Duncanson A, et al. Genome-wide association study of 14,000 cases of seven common diseases and 3000 shared controls. Nature. 2007; 447(7145):661–78.

MacArthur J, Bowler E, Cerezo M, Gil L, Hall P, Hastings E, et al. The new NHGRI-EBI Catalog of published genome-wide association studies (GWAS Catalog). Nucleic Acids Res. 2016; 45:D896–D901.

Visscher PM, Brown MA, McCarthy MI, Yang J. Five years of GWAS discovery. Am J Hum Genet. 2012; 90(1):7–24.

Manolio TA. Bringing genome-wide association findings into clinical use. Nat Rev Genet. 2013; 14(8):549–58.

Balas EA, Boren SA. Managing clinical knowledge for health care improvement In: Bemmel J, McCray AT, editors. Yearbook of Medical Informatics 2000: Patient-Centered Systems. Stuttgart: Schattauer Verlagsgesellschaft mbH: 2000. p. 65–70.

Boyle EA, Li YI, Pritchard JK. An expanded view of complex traits: from polygenic to omnigenic. Cell. 2017; 169(7):1177–86.

Timpson NJ, Greenwood CMT, Soranzo N, Lawson DJ, Richards JB. Genetic architecture: the shape of the genetic contribution to human traits and disease. Nat Rev Genet. 2018; 19(2):110–24.

Maurano MT, Humbert R, Rynes E, Thurman RE, Haugen E, Wang H, et al. Systematic localization of common disease-associated variation in regulatory DNA. Science. 2012; 337(6099):1190–5.

Sivakumaran S, Agakov F, Theodoratou E, Prendergast JG, Zgaga L, Manolio T, et al. Abundant pleiotropy in human complex diseases and traits. Am J Hum Genet. 2011; 89(5):607–18.

Pickrell JK, Berisa T, Liu JZ, Ségurel L, Tung JY, Hinds DA. Detection and interpretation of shared genetic influences on 42 human traits. Nat Genet. 2016; 48(7):709–17.

Wray N, Goddard M, Visscher P. Prediction of individual genetic risk to disease from genome-wide association studies. Genome Res. 2007; 17:1520–1528.

Chatterjee N, Shi J, García-Closas M. Developing and evaluating polygenic risk prediction models for stratified disease prevention. Nat Rev Genet. 2016; 14210(2014):14205–10.

Visscher PM, Wray NR, Zhang Q, Sklar P, McCarthy MI, Brown MA, et al. 10 years of GWAS discovery: biology, function, and translation. Am J Hum Genet. 2017; 101(1):5–22.

Dunham I, Kundaje A, Aldred SF, Collins PJ, Davis CA, Doyle F, et al. An integrated encyclopedia of DNA elements in the human genome. Nature. 2012; 489(7414):57–74.

Roadmap Epigenomics Consortium, Kundaje A, Meuleman W, Ernst J, Bilenky M, Yen A, et al. Integrative analysis of 111 reference human epigenomes. Nature. 2015; 518(7539):317–29.

Ardlie KG, DeLuca DS, Segrè AV, Sullivan TJ, Young TR, Gelfand ET, et al. The Genotype-Tissue Expression (GTEx) pilot analysis: multitissue gene regulation in humans. Science. 2015; 348(6235):648–60.

Regev A, Teichmann SA, Lander ES, Amit I, Benoist C, Birney E, et al. The human cell atlas. Elife. 2017; 6:1–30.

Hirschhorn JN. Genomewide association studies–illuminating biologic pathways. N Engl J Med. 2009; 360(17):1699–701.

Ricigliano VAG, Umeton R, Germinario L, Alma E, Briani M, Di Segni N, et al. Contribution of genome-wide association studies to scientific research: a pragmatic approach to evaluate their impact. PLoS One. 2013; 8(8):e71198.

Edwards SL, Beesley J, French JD, Dunning M. Beyond GWASs: Illuminating the dark road from association to function. Am J Hum Genet. 2013; 93(5):779–97.

Gallagher MD, Chen-Plotkin AS. The post-GWAS Era: from association to function. Am J Hum Genet. 2018; 102(5):717–30.

Black JRM, Clark SJ. Age-related macular degeneration: genome-wide association studies to translation. Genet Med. 2016; 18(4):283–9.

Duerr RH, Taylor KD, Brant SR, Rioux JD, Silverberg MS, Daly MJ, et al. A genome-wide association study identifies IL23R as an inflammatory bowel disease gene. Science. 2006; 314(5804):1461–63.

Cargill M, Schrodi SJ, Chang M, Garcia VE, Brandon R, Callis KP, et al. A large-scale genetic association study confirms IL12B and leads to the identification of IL23R as psoriasis-risk genes. Am J Hum Genet. 2007; 80(2):273–90.

Teng MWL, Bowman EP, McElwee JJ, Smyth MJ, Casanova JL, Cooper AM, et al. IL-12 and IL-23 cytokines: from discovery to targeted therapies for immune-mediated inflammatory diseases. Nat Med. 2015; 21(7):719–29.

Harold D, Abraham R, Hollingworth P, Sims R, Gerrish A, Hamshere ML, et al. Genome-wide association study identifies variants at CLU and PICALM associated with Alzheimer’s disease. Nat Genet. 2009; 41(10):1088–93.

Mansiaux Y, Carrat F. Contribution of genome-wide association studies to scientific research: a bibliometric survey of the citation impacts of GWAS and candidate gene studies published during the same period and in the same journals. PLoS ONE. 2012; 7(12):e51408.

Dolgin E. The greatest hits of the human genome. Nature. 2017; 551:427–31.

Haynes WA, Tomczak A, Khatri P. Gene annotation bias impedes biomedical research. Sci Rep. 2018; 8(1):1–7.

Isserlin R, Bader GD, Edwards A, Frye S, Willson T, Yu FH. The human genome and drug discovery after a decade. Roads (still) not taken; 2011. http://arxiv.org/abs/1102.0448.

Edwards AM, Isserlin R, Bader GD, Frye SV, Willson TM, Yu FH. Too many roads not taken. Nature. 2011; 470(7333):163–5.

Wilkening S, Chen B, Bermejo JL, Canzian F. Is there still a need for candidate gene approaches in the era of genome-wide association studies?. Genomics. 2009; 93(5):415–9.

Brown GR, Hem V, Katz KS, Ovetsky M, Wallin C, Ermolaeva O, et al. Gene: a gene-centered information resource at NCBI. Nucleic Acids Res. 2015; 43(D1):D36–D42.

Amberger JS, Bocchini CA, Schiettecatte F, Scott AF, Hamosh A. OMIM.org: Online Mendelian Inheritance in Man (OMIM®), an online catalog of human genes and genetic disorders. Nucleic Acids Res. 2015; 43(D1):D789–D798.

Pfeiffer T, Hoffmann R. Temporal patterns of genes in scientific publications. Proc Natl Acad Sci U S A. 2007; 104(29):12052–56.

Fadista J, Manning AK, Florez JC, Groop L. The (in)famous GWAS P-value threshold revisited and updated for low-frequency variants. Eur J Hum Genet. 2016; 24(8):1202–5.

Semega JL, Fontenot KR, Kollar MA. Income and poverty in the United States: 2016. U.S. Census Bureau, Current Population Reports, P60-259. Washington, DC: U.S. Government Printing Office; 2017.

World Bank. Poverty and shared prosperity 2016: taking on inequality. Washington, DC: World Bank; 2016.

Caballero B. The global epidemic of obesity: an overview. Epidemiol Rev. 2007; 29:1–5.

Nelson MR, Tipney H, Painter JL, Shen J, Nicoletti P, Shen Y, et al. The support of human genetic evidence for approved drug indications. Nat Genet. 2015; 47(8):856–60.

Giacomini KM, Yee SW, Mushiroda T, Weinshilboum RM, Ratain MJ, Kubo M. Genome-wide association studies of drug response and toxicity: an opportunity for genome medicine. Nat Rev Drug Discov. 2017; 16:70.

Berg JJ, Coop G. A population genetic signal of polygenic adaptation. PLoS Genet. 2014; 10(8):e1004412.

Chakma J, Sun GH, Steinberg JD, Sammut SM, Jagsi R. Asia’s ascent: global trends in biomedical R&D expenditures. N Engl J Med. 2014; 370(1):1–3.

Hunter DJ, Kraft P. Drinking from the fire hose-statistical issues in genomewide association studies. N Engl J Med. 2007; 357(5):436–9.

Sudlow C, Gallacher J, Allen N, Beral V, Burton P, Danesh J, et al. UK Biobank: an open access resource for identifying the causes of a wide range of complex diseases of middle and old age. PLoS Med. 2015; 12(3):1–10.

Cock PJA, Antao T, Chang JT, Chapman BA, Cox CJ, Dalke A, et al. Biopython: freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics. 2009; 25(11):1422–23.

Clarivate Analytics. 2016 Journal Citation Reports®; 2017. http://ipscience-help.thomsonreuters.com/incitesLiveJCR/JCRGroup/howtoCiteJCR/version/10.

Malone J, Holloway E, Adamusiak T, Kapushesky M, Zheng J, Kolesnikov N, et al. Modeling sample variables with an Experimental Factor Ontology. Bioinformatics. 2010; 26(8):1112–8.

Acknowledgments

We thank Yann Klimentidis and Tricia Serio for the helpful comments.

This work was supported by the National Science Foundation [DGE-1143953 to BM].

Availability of data and materials

The data that support the primary findings of this study are available from the NHGRI-EBI GWAS Catalog [5] and NCBI Gene [36]. All data generated during this study are included in this published article and its supplementary information files.

Author information

Authors and affiliations

Department of Molecular and Cellular Biology, University of Arizona, Tucson, AZ, USA

Travis J. Struck & Ryan N. Gutenkunst

Department of Epidemiology and Biostatistics, Mel and Enid Zuckerman College of Public Health, University of Arizona, Tucson, AZ, USA

Brian K. Mannakee

Contributions

RG designed the study, performed the data analysis, and prepared the manuscript. TS collected, processed, and analyzed the data. BM contributed ideas to the data analysis. All authors approved the final manuscript.

Corresponding author

Correspondence to Ryan N. Gutenkunst.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Competing interests

The authors declare that they have no competing interests.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Additional files

Additional file 1

Supplemental tables and figures. Supplemental Tables S1–S8, Figures S1–S10. (PDF 586 KB)

Additional file 2

Gene-years with exceptional publication activity. Gene-years with a statistically significant excess of publications relative to the prediction of the Pfeiffer and Hoffmann model. For GWAS disease genes, the date of the first GWAS to identify that gene is also recorded. (TSV 45 KB)

Additional file 3

Categorization of Nelson et al. traits. Traits from the association data of Nelson et al. [ 43 ], categorized as disease or non-disease. (TSV 16 KB)

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License ( http://creativecommons.org/licenses/by/4.0/ ), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver( http://creativecommons.org/publicdomain/zero/1.0/ ) applies to the data made available in this article, unless otherwise stated.

About this article

Cite this article

Struck, T.J., Mannakee, B.K. & Gutenkunst, R.N. The impact of genome-wide association studies on biomedical research publications. Hum Genomics 12, 38 (2018). https://doi.org/10.1186/s40246-018-0172-4

Received: 30 March 2018

Accepted: 01 August 2018

Published: 13 August 2018

DOI: https://doi.org/10.1186/s40246-018-0172-4

  • Genome-wide association studies
  • Bibliometrics
  • Follow-up research

Human Genomics

ISSN: 1479-7364

Genome-wide association analysis for drought tolerance and component traits in groundnut gene pool

  • Open access
  • Published: 18 April 2024
  • Volume 220, article number 76 (2024)

  • Seltene Abady   ORCID: orcid.org/0000-0002-9618-0741 1 ,
  • Hussein Shimelis 1 ,
  • Pasupuleti Janila 2 ,
  • Ankush Wankhade 2 &
  • Vivek P. Chimote 3  

The potential production and productivity of groundnuts are limited by severe drought stress associated with climate change. The current study aimed to identify genomic regions and candidate genes associated with drought tolerance and component traits for gene introgression and to guide marker-assisted breeding of groundnut varieties. Ninety-nine genetically diverse groundnut genotypes were phenotyped under drought-stressed and non-stressed field conditions in 2018/19 and 2019/20, and using the LeasyScan platform under non-stressed conditions in 2019/20 at the International Crops Research Institute for the Semi-Arid Tropics (ICRISAT), India. The samples were genotyped using 48 K single nucleotide polymorphism (SNP) markers at the University of Georgia, USA. Phenotypic data were collected on 17 agronomic traits and subjected to statistical analyses. The SNP data were computed, and population structure was inferred using a Bayesian clustering method in Structure version 2.3.4, while linkage disequilibrium was calculated using the GAPIT program in R software. Marker-trait associations were deduced using Tassel 5.2.86. Significant phenotypic variations were recorded for drought tolerance and the assessed agronomic traits. GWAS using PCA + K and Q + K models identified significant SNPs associated with leaf area (1 SNP), leaf area index (1 SNP), specific leaf area (1 SNP), leaf relative water content (43 SNPs), number of primary branches (1 SNP) and hundred seed weight (1 SNP). Forty-seven and one marker-trait associations were detected under drought-stressed and non-stressed conditions, respectively. The candidate genes and markers identified in the current study are useful for accelerated groundnut breeding targeting drought tolerance and market-preferred traits.

Introduction

Groundnut is cultivated on 32.72 million ha, with an annual production of 53.93 million tons worldwide (FAOSTAT 2023). It is cultivated primarily as a rain-fed crop in the semi-arid tropics and sub-tropical regions where recurrent drought is widespread. The potential production and productivity of groundnut (Arachis hypogaea L.; 2n = 4x = 40) are limited by recurrent and severe drought stress associated with climate change. Developing and deploying groundnut varieties with drought tolerance and desirable product profiles is vital for food security and global trade. Groundnuts significantly improve human nutritional status. Groundnut seeds are rich sources of carbohydrates, protein, lipids, vitamins, minerals and fiber. The seeds contain all the essential amino acids, making them a critical component of the human diet, especially in communities where animal-derived protein sources are not readily available (Mupunga et al. 2017).

Groundnut is more tolerant to drought stress than other traditional leguminous species (Wan et al. 2014). Nevertheless, the quantity and quality of grain, seed oil and fodder of groundnut are affected by drought stress (Abady et al. 2021). Severe drought stress occurring during the reproductive growth stage can lead to a yield loss of up to 33% (Pereira et al. 2016; Carvalho et al. 2017). Drought during grain filling and maturity reduced groundnut's total oil and linoleic acid contents (Dwived et al. 1996). Additionally, drought affects the inherent symbiotic nitrogen fixation capacity of crops, limiting grain yield and quality and affecting the feed quality of the haulm, such as its nitrogen content, digestibility and metabolizable energy (Blümmel et al. 2012). Therefore, there is a need to develop and deploy drought-tolerant and locally adapted groundnut varieties to mitigate the effects of drought on groundnut yield and product profiles.

Groundnut reportedly has marked genetic variability for drought tolerance and component traits that can be exploited in genetic improvement programs (Azevedo et al. 2010; Abady et al. 2021). Several candidate groundnut varieties with improved drought tolerance and agronomic traits have been developed through conventional breeding by the International Crops Research Institute for the Semi-Arid Tropics (ICRISAT) and through national breeding programs (Janila et al. 2016). However, the global pace of design, release and adoption of new drought-tolerant groundnut varieties is slow for several reasons. Drought tolerance is a polygenic trait conditioned by multiple genes with minor genetic effects and is subject to genotype-by-environment interaction, which slows selection gains (Ravi et al. 2011). Furthermore, the genetic base of the cultivated groundnut is narrow, and the introgression of genes from the wild species has had limited success because ploidy differences lead to unpredictable progeny segregation and selection response (Foncéka et al. 2009; Janila et al. 2016). Drought-tolerant and high-yielding varieties have yet to be developed and deployed globally. This depends on identifying agronomic traits associated with drought tolerance and transferring the genes underlying the target traits to locally adapted genotypes (Edae et al. 2014).

Advanced breeding and genetic innovations such as marker-assisted selection (MAS), genomic selection (GS), and targeted gene editing would accelerate groundnut variety design and commercialization (Hasan et al. 2021; Pandey et al. 2014). These tools have been valuable in breeding programs to facilitate the identification of drought-resistance genes from germplasm, including landraces and wild relatives. The candidate genes can be transferred, pyramided, and fast-tracked in advanced breeding lines (Salgotra and Stewart 2020). Genetic and genomic tools are valuable resources for precision and speed breeding to release drought-resilient and market-preferred varieties.

Identifying genetic markers associated with drought tolerance and economic traits, including high kernel yield, oil and oleic acid contents, haulm yield and quality attributes, and disease and insect resistance, is crucial for developing new varieties with essential traits (Shaibu et al. 2020; Devate et al. 2022). To date, few genes have been reported from groundnut genome-wide association studies for drought tolerance (Bertioli et al. 2016). Zhou et al. (2021) identified SNP markers significantly associated with pod and hundred seed weights in groundnut. Shaibu et al. (2020) identified SNP markers for drought surrogate traits using the soil plant analysis development (SPAD) chlorophyll meter reading and leaf area index in groundnut. Zou et al. (2022) identified five SNP markers significantly associated with chlorophyll content in groundnut. Pandey et al. (2014) identified one SSR marker associated with rust resistance and higher yield in groundnut. Knowledge of the number of genetic markers and of marker-trait associations for drought tolerance and economic traits across a diverse groundnut gene pool remains limited. Knowledge of marker-trait associations is crucial for marker-assisted selection, trait integration and precision breeding in groundnut. Thus, the objective of the current study was to identify genomic regions and candidate genes associated with drought tolerance and component traits for gene introgression, and to guide marker-assisted breeding of drought-tolerant groundnut varieties.

Material and methods

Plant material.

Ninety-nine genetically diverse groundnut genotypes acquired from ICRISAT in Patancheru, India were used for the study. The genotypes were selected based on desirable traits, including drought tolerance, resistance to foliar diseases such as late leaf spot and rust, high oil and oleic acid contents, and early-to-medium maturity. A high-yielding groundnut cultivar, ICGV98412, released in India, Ghana and Ethiopia, was used as a comparative control. The details of the genotypes are described in Supplementary Table 1. The genotypes were evaluated under drought-stressed (DS) and non-stressed (NS) conditions at ICRISAT (latitude 17.51°N, longitude 78.27°E, altitude 545 m) during the 2018/2019 and 2019/2020 post-rainy cropping seasons using a 10 × 10 alpha lattice design with two replications. The plants were phenotyped in five environments: four field experiments [drought-stressed and non-stressed conditions in two seasons (2018/19 and 2019/20)] and one experiment on the LeasyScan platform under non-stressed conditions with four replications.

Phenotypic evaluation and data analysis

Phenotypic data were collected on days to 50% flowering (DF), SPAD chlorophyll meter reading (SCMR), specific leaf area (SLA, expressed in cm² g⁻¹), leaf relative water content (LRWC), plant height (PH, expressed in cm), number of primary branches (PB), pod yield per plant (PY, expressed in g plant⁻¹), shelling percentage (SHP, expressed in %), seed yield per plant (SY, expressed in g plant⁻¹), total biomass per plant (TBM, expressed in g plant⁻¹) and harvest index (HI, expressed in %). From the LeasyScan experiment, data were collected on leaf area (LA), projected leaf area (PLA), leaf area index (LAI), light penetration depth (LPD), digital plant height (DPH) and digital biomass (DBM). The phenotypic data were subjected to analysis of variance. The homogeneity of error variances was tested using the Bartlett test before pooled analysis of variance. Treatment means were separated using the least significant difference (LSD) procedure at the 5% significance level.

The 99 groundnut genotypes were grown under field conditions at ICRISAT, Hyderabad, India. Genomic DNA was extracted from the leaves of three-week-old seedlings at the Center of Excellence in Genomics and Systems Biology at ICRISAT. The DNA was extracted using the modified cetyl trimethyl ammonium bromide (CTAB) method (Mace et al. 2003). DNA was mixed with a loading dye and quantified by loading 1 μl DNA on a 0.8% (w/v) agarose gel containing 10 μl ethidium bromide (10 mg/ml), which was run at 80 V for 30–45 min. Subsequently, the DNA was visualized under a UV transilluminator (Bio-Rad Universal Hood II Gel Doc System). DNA quality and concentration were estimated using NanoDrop spectrometry (UV 160 A, Japan). A DNA sample of 47 ng/µl per genotype was submitted for genotyping. The DNA samples were genotyped with a 48 K Affymetrix SNP array (‘Axiom_Arachis’) (Wankhade et al. 2023).

SNP data were analyzed using the Axiom Analysis Suite (Thermo 2018). SNP markers with more than 20% missing data or a minor allele frequency (MAF) lower than 0.05 were eliminated. This resulted in 15,575 SNP markers, which were used for further analysis. Ninety-nine genotypes were used after data imputation. The genotype data filtering was performed using TASSEL version 5.2.86 software.
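The marker filter described above can be illustrated with a short sketch. It is not the TASSEL implementation; the 0/1/2 genotype coding, the use of `None` for missing calls, and the function name `snp_passes_qc` are assumptions made for illustration only.

```python
from typing import Optional, Sequence


def snp_passes_qc(
    calls: Sequence[Optional[int]],
    max_missing: float = 0.20,
    min_maf: float = 0.05,
) -> bool:
    """Return True if one SNP passes the missing-data and MAF filters.

    calls: one entry per genotype, coded as the number of copies of the
    alternate allele (0, 1 or 2), or None for a missing call (assumed coding).
    """
    n = len(calls)
    observed = [c for c in calls if c is not None]
    if n == 0 or not observed:
        return False
    # Eliminate markers with more than 20% missing data.
    if (n - len(observed)) / n > max_missing:
        return False
    # Minor allele frequency from the observed allele copies.
    alt_freq = sum(observed) / (2 * len(observed))
    maf = min(alt_freq, 1.0 - alt_freq)
    # Eliminate markers with MAF lower than 0.05.
    return maf >= min_maf
```

Applying such a function to every row of a marker-by-genotype matrix would reproduce the kind of filtering that reduced the array to the 15,575 retained SNPs, although the exact count depends on details of the software used.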

Population structure and principal component analysis

The population structure pattern and admixture detection were inferred using a Bayesian model-based clustering algorithm implemented in STRUCTURE version 2.3.4 (Pritchard et al. 2000 ). The length of the burn-in period and Markov Chain Monte Carlo (MCMC) were set at 10,000 iterations (Evanno et al. 2005 ). The K value was set between 1 and 10 to generate the number of subpopulations in the genotypes. Twenty runs were performed for each K-value to accurately estimate the number of populations. Delta K values were calculated, and the appropriate K value was determined by the Evanno et al. ( 2005 ) method using the STRUCTURE Harvester program (Earl et al. 2012 ). SNP marker-based PCA and kinship analysis were subsequently conducted with GAPIT (Lipka et al. 2012 ).
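As a rough illustration of the Delta K criterion, the sketch below computes Delta K from the mean and standard deviation of the log-likelihood values, Ln P(D), across replicate runs at each K. It is a simplified stand-in for what STRUCTURE Harvester reports, not its code; the function name `delta_k` and the input layout are assumptions.

```python
from statistics import mean, stdev
from typing import Dict, List


def delta_k(lnp_by_k: Dict[int, List[float]]) -> Dict[int, float]:
    """Delta K = |L(K+1) - 2 L(K) + L(K-1)| / sd(L(K)).

    lnp_by_k maps each tested K to the Ln P(D) values of its replicate runs.
    L(K) is taken as the mean over runs; Delta K is undefined for the
    smallest and largest K, so those are omitted from the result.
    """
    means = {k: mean(v) for k, v in lnp_by_k.items()}
    sds = {k: stdev(v) for k, v in lnp_by_k.items()}
    result = {}
    for k in sorted(lnp_by_k):
        if (k - 1) in means and (k + 1) in means and sds[k] > 0:
            second_diff = abs(means[k + 1] - 2 * means[k] + means[k - 1])
            result[k] = second_diff / sds[k]
    return result
```

The K with the largest Delta K would then be taken as the most likely number of subpopulations, for example `max(dk, key=dk.get)` on the returned dictionary.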

Genome-wide association analysis (GWAS)

GWAS was performed with TASSEL 5.2.86. Six models were evaluated in the marker-trait association analysis: naïve, Q, K, PCA, PCA + K, and Q + K. Association signals were observed with the PCA + K and Q + K models using a mixed linear model (MLM). Quantile–quantile (QQ) plots of the observed −log10(P) of each SNP against the expected value, and Manhattan plots, were generated using TASSEL 5.2.86. Marker-trait associations with phenotypic variance explained (PVE) of 20% or more were considered major associations. Candidate genes within a 50 kb region upstream or downstream of the peak SNPs were selected using the PeanutBase website tool ( https://www.peanutbase.org ).
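The candidate-gene search within 50 kb of a peak SNP amounts to an interval-overlap query. The sketch below shows that logic on a local annotation table; the `Gene` record, the gene names and the coordinates are hypothetical, and the study itself used the PeanutBase web tool rather than code of this kind.

```python
from typing import List, NamedTuple


class Gene(NamedTuple):
    name: str    # hypothetical gene identifier
    chrom: str   # chromosome label, e.g. "B03"
    start: int   # gene start position in bp
    end: int     # gene end position in bp


def genes_near_snp(genes: List[Gene], chrom: str, pos: int,
                   window: int = 50_000) -> List[str]:
    """Names of genes overlapping [pos - window, pos + window] on chrom."""
    lo, hi = pos - window, pos + window
    return [g.name for g in genes
            if g.chrom == chrom and g.start <= hi and g.end >= lo]
```

A gene is reported when any part of it falls inside the window, so genes that contain the SNP and genes that only partly overlap the flanking region are both returned.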

Linkage disequilibrium (LD) and decay

The LD between polymorphic SNPs retained after filtering at MAF cutoffs of 0.05, 0.1, and 0.2 was calculated as r² using TASSEL 5.2.86. LD decay plots were generated using the R script written by Remington et al. (2001) in RStudio (2021.09.0 Build 351; RStudio, PBC).
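For readers unfamiliar with the r² statistic, a minimal sketch of the standard two-locus definition is given below. It assumes phased biallelic haplotypes coded 0/1, which is a simplification; TASSEL's own handling of genotype data may differ, and the function name `ld_r2` is hypothetical.

```python
from typing import Sequence


def ld_r2(hap_a: Sequence[int], hap_b: Sequence[int]) -> float:
    """r^2 = D^2 / (pA (1 - pA) pB (1 - pB)), with D = pAB - pA * pB.

    hap_a and hap_b hold the allele (0 or 1) carried at locus A and
    locus B by each haplotype, in the same haplotype order.
    """
    if len(hap_a) != len(hap_b) or not hap_a:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(hap_a)
    p_a = sum(hap_a) / n
    p_b = sum(hap_b) / n
    # Frequency of haplotypes carrying allele 1 at both loci.
    p_ab = sum(1 for a, b in zip(hap_a, hap_b) if a == 1 and b == 1) / n
    d = p_ab - p_a * p_b
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("r^2 is undefined for a monomorphic locus")
    return d * d / denom
```

Two loci that always carry the same allele give r² = 1, while loci whose alleles combine at random give r² = 0.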

Phenotypic variation

Significant genetic variation was recorded for yield and yield components among the tested groundnut genotypes evaluated under drought-stressed and non-stressed conditions (Abady et al. 2021). Analysis of variance for canopy-related traits phenotyped using the LeasyScan platform showed highly significant (P < 0.001) genotype differences (Table 1). The mean performance of the groundnut genotypes for 13 phenotypic traits under drought-stressed and non-stressed conditions, and for six canopy-related traits under non-stressed conditions, is presented in Supplementary Tables 2, 3, and 4, in that order. Wide phenotypic variation existed for all the assessed traits.

Population structure, principal component analysis (PCA) and linkage disequilibrium (LD)

Population structure analysis of the 99 groundnut genotypes resolved three sub-populations with 32% admixture genotypes (Abady et al. 2021 ). Allocation into clusters was done at 70% ancestry. Twenty-four, 22 and 21% of the genotypes were assigned to sub-populations 1, 2, and 3, respectively. The PCA based on SNP marker data also confirmed the presence of three subgroups, corresponding with the population structure results (Fig.  1 ). The first three principal components accounted for 32% of the total variation (Fig.  1 a) and revealed three distinct clusters in the population (Fig.  1 b).
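The allocation rule at 70% ancestry can be sketched as follows. The membership coefficients would come from the STRUCTURE output; treating a coefficient of exactly 0.70 as sufficient for assignment is an assumption, as is the function name `assign_subpopulation`.

```python
from typing import Optional, Sequence


def assign_subpopulation(q: Sequence[float],
                         threshold: float = 0.70) -> Optional[int]:
    """Assign a genotype to a sub-population from its ancestry coefficients.

    q: membership coefficients for sub-populations 1..K (summing to about 1).
    Returns the 1-based sub-population index if its coefficient reaches the
    threshold, otherwise None to mark the genotype as admixed.
    """
    best = max(range(len(q)), key=lambda i: q[i])
    return best + 1 if q[best] >= threshold else None
```

For example, a genotype with coefficients (0.85, 0.10, 0.05) would be placed in sub-population 1, whereas one with (0.50, 0.30, 0.20) would be counted among the admixed genotypes.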

Fig. 1 Principal component analysis of the 99 groundnut genotypes based on 15,575 high-quality SNPs with MAF > 0.05 using the first three principal components. The first three principal components explained 32% of the variation, as shown on the scree plot ( a ). The genotypes were grouped into three distinct clusters ( b )

Three MAF thresholds, i.e., 0.05, 0.1 and 0.2, were used to explore the effect of minor alleles on the nature and decay of genome-wide LD; the results are presented in Fig. 2a, b and c, in that order. LD decreased with increasing bin distance. LD declined to half of its original value at 3.98, 6.33 and 14.48 Mb for MAF thresholds of 0.05, 0.1 and 0.2, respectively.
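One simple way to read a half-decay distance off binned LD values is linear interpolation between the two bins that bracket half of the initial r². The sketch below does only that; it is not the Remington et al. (2001) script used in this study, which fits a decay curve, and the function name and input layout are assumptions.

```python
from typing import List, Optional, Tuple


def half_decay_distance(points: List[Tuple[float, float]]) -> Optional[float]:
    """Distance at which mean r^2 first falls to half of its initial value.

    points: (distance, mean r^2) pairs, one per distance bin. The value in
    the nearest bin is taken as the initial r^2. Returns None if r^2 never
    drops to half of that value within the supplied bins.
    """
    pts = sorted(points)
    target = pts[0][1] / 2
    for (d0, r0), (d1, r1) in zip(pts, pts[1:]):
        if r0 > target >= r1:
            # Linear interpolation between the bracketing bins.
            return d0 + (r0 - target) * (d1 - d0) / (r0 - r1)
    return None
```

With distances in Mb, the returned value is directly comparable in units to the half-decay distances reported above, although a fitted curve and this bin-wise interpolation will generally not give identical numbers.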

Fig. 2 Effect of three MAFs, 0.05 ( a ), 0.1 ( b ) and 0.2 ( c ), on the nature of LD and its decay in advanced breeding lines

Marker trait association

Forty-seven and 13 SNP markers significantly associated with DLA, LAI, SLA, LRWC, PB and HSW were identified using the PCA + K and Q + K models, respectively (Table 2). Among the significantly associated SNP markers, nine were identified by both models. Forty-five SNP markers were significantly associated with LRWC, and seven SNP markers were associated with one or two traits. Thus, 50 SNP markers were identified in this study (Table 2 and Fig. 3). The significant SNPs identified for the assessed traits are depicted in Manhattan plots along with QQ plots (Fig. 3). The QQ plots showed that the deviation between observed and expected P values was very small, suggesting that the associations between the SNPs and the traits are true positives.
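The QQ comparison of observed and expected P values can be sketched as below. The expected values use the plotting position (i + 0.5)/n, which is one common convention and an assumption here; TASSEL may use a different one, and the function name `qq_points` is hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def qq_points(pvalues: Sequence[float]) -> List[Tuple[float, float]]:
    """Return (expected, observed) -log10(P) pairs for a QQ plot.

    Observed values are sorted from most to least significant and paired
    with the quantiles expected under a uniform null distribution of P.
    """
    n = len(pvalues)
    observed = sorted((-math.log10(p) for p in pvalues), reverse=True)
    expected = [-math.log10((i + 0.5) / n) for i in range(n)]
    return list(zip(expected, observed))
```

Points lying close to the diagonal for most markers, with departures only at the most significant end, are what one would hope to see when population structure and kinship have been adequately controlled.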

Fig. 3 Manhattan and QQ plots showing SNP markers associated with different agronomic traits among 99 groundnut genotypes based on the PCA + K and Q + K models. Note: a and b denote digital leaf area under non-stressed (NS) conditions; c and d digital leaf area index under NS conditions; e and f specific leaf area under NS conditions; g and h leaf relative water content under drought-stressed (DS) conditions; i and j number of primary branches under DS conditions; k and l number of primary branches under NS conditions; m and n hundred seed weight under DS conditions

The GWAS identified one SNP marker significantly associated with both DLA and LAI under non-stressed (NS) conditions using the Q + K model (Fig. 3a and c). The marker explained 21 and 20% of the phenotypic variance of these traits, respectively. The PCA + K and Q + K models detected one SNP marker significantly associated with SLA under drought-stressed (DS) conditions (Fig. 3e and f). The phenotypic variance of SLA explained by this marker ranged from 22 to 23%. Similarly, the PCA + K and Q + K models detected two SNP markers significantly associated with PB under DS (Fig. 3i and j) and NS (Fig. 3k and l) conditions. The phenotypic variance of the trait explained by the markers ranged from 20 to 23%. The study identified 43 SNPs significantly associated with LRWC under DS conditions through the PCA + K model, the Q + K model, or both (Table 2, Fig. 3g and h). The phenotypic variance of the trait explained by the significant SNPs ranged from 20 to 31%. Further, the GWAS detected one SNP significantly associated with HSW under DS conditions. This marker was identified by both the PCA + K and Q + K models (Fig. 3m and n). The phenotypic variance of the trait explained by this SNP marker ranged from 28 to 31%.

Phenotypic variability

Drought is the leading abiotic stress limiting groundnut production and productivity globally. Significant progress has been reported in groundnut pre-breeding for drought tolerance through conventional breeding methods (Janila et al. 2016). However, the pace of drought tolerance breeding and variety release has been slow due to the complex nature of gene action and the genotype by environment by management interaction effect (Ravi et al. 2011). Deploying drought surrogate traits is critical for effective drought tolerance breeding in crop genetic resources, including groundnut. Furthermore, understanding the genetic basis of physiological and yield-related traits in groundnut could provide an opportunity to develop drought-tolerant cultivars (Pereira et al. 2016). Wankhade et al. (2022) proposed an integrated phenotyping approach for screening groundnut genotypes for drought tolerance. The authors reported early generation selection gains using the LeasyScan method with complementary drought stress indices under a managed stress environment.

The present study revealed wide genetic variability for the assessed physiological and yield-related traits among the tested groundnut genotypes, which were evaluated under drought-stressed and non-stressed conditions (Abady et al. 2021). The analysis of variance revealed highly significant genotypic differences for canopy-related traits, including digital leaf area (LA), digital leaf area index (LAI), specific leaf area (SLA), leaf relative water content (LRWC), digital plant height (PH) and digital biomass (DBM) (Table 1). Traits related to canopy development are tightly associated with plant water use (Vadez et al. 2015). Leaf area influences the rate of transpiration: the larger the leaf area, the greater the rate of transpiration, because broad leaves tend to have more stomata (Maylani et al. 2020). Thus, selecting genotypes with small leaf area could enhance groundnut productivity under DS conditions. Diffuse light penetrates deeper into a plant canopy and increases photosynthesis and crop production (Zhange et al. 2022). Reportedly, there is a strong positive association between biomass production and transpiration efficiency under drought-stressed (DS) conditions, owing to the ability of the genotypes’ root systems to mobilize water from the soil for stem elongation and biomass accumulation (Vadez et al. 2016). Previous findings indicated a positive association between reduced SLA and increased leaf thickness under DS conditions. This results in thicker cell walls, which help to prevent water loss by evaporation and achieve higher water use efficiency (Zhou et al. 2020). LRWC is the most useful parameter for measuring plant water status in terms of the physiological consequence of cellular water deficit (Barr and Weatherley 1962). This parameter represents the balance between the water supply to the leaf tissue and the transpiration rate (Lugojan and Ciulca 2011). Thus, maintenance of higher LRWC under drought-stressed conditions could be a good indication of drought tolerance. The observed genetic variability in this study could be utilized in groundnut breeding programs to develop drought-tolerant and high-yielding varieties.

Population structure and PCA

In the present study, the population structure of the 99 groundnut genotypes revealed the presence of three sub-populations (Abady et al. 2021). Similarly, the PCA results displayed the presence of three sub-groups (Fig. 1b). The low number of sub-groups indicates low genetic differentiation, given that most genotypes were Indian collections. Combining information generated from the genetic population structure and PCA is useful for selecting diverse parents in breeding programs and for mapping marker-trait associations.

Linkage disequilibrium in groundnut

LD is the non-random co-occurrence of two or more alleles (Lewontin and Kojima 1960). Determining LD and its decay with genetic distance helps to assess the resolution of association mapping and the desirable number of SNPs on arrays (Vos et al. 2017). LD decay depends on cultivation patterns, breeding methods, breeding history, and evolutionary history (Devate et al. 2022). Higher LD decay was observed in the present study than in previous findings (Pandey et al. 2014; Otyama et al. 2019). This could be attributed to possible intercrosses among the advanced breeding lines. This suggests that using more SNP markers and a larger population could enhance the power and efficiency of detecting marker-trait associations (MTAs) in groundnut breeding programs.

Candidate genes associated with SNPs

Identifying genomic regions using genome-wide SNP markers is a vital approach for developing climate-resilient varieties. In this study, a pleiotropic gene effect was detected between digital leaf area and leaf area index using the marker AX-177643135, indicating collinearity between leaf canopy traits (Table  2 , Fig.  3 a and c).

Significant SNPs for LRWC were identified on chromosomes A02, A03, A05, A10, B03, B08 and B09. In addition, the following five SNP markers with major effects were identified: AX-176804539, AX-176794990, AX-177641299, AX-176795390 and AX-176822255, located on chromosome A03 at 49.5 kb (PVE = 20%), B03 at 45 kb (PVE = 31%), B09 at 45 kb (PVE = 21%), A03 at 36.5 kb (PVE = 20%) and B03 at 36.5 kb (PVE = 20%), in that order (Table 2, Fig. 3g and h). These markers showed strong association signals for LRWC under drought-stressed conditions.

For leaf relative water content (LRWC) under drought-stressed conditions, the SNP AX-176794990 [chromosome B03; −log10(P value) of 5.68 to 5.79] was located within 21.68 kb of the Araip.9NG64 gene, which encodes a member of the RNA-binding protein (RBP) 24-like protein family involved in RNA processing, export and stability. Muthusamy et al. (2021) and Yan et al. (2022) reviewed the role of RBPs in abiotic stress response and proposed that the proteins regulate stress response through RNA metabolism. BLAST analysis of the Araip.9NG64 gene sequence found a match with UBP1-associated protein 2C (UBP2c), which has two RNA recognition motifs of 85 amino acids each and reportedly plays a crucial role in leaf senescence (Na et al. 2015). Li et al. (2002) reported that ABA-activated protein kinase (AAPK), which is present in guard cells, interacts with AAPK-interacting protein 1 (AAPKIP 1), an RBP that interacts with the mRNA of dehydrin, a protein implicated in drought stress.

The formation of cuticular layers with increased wax and cutin content on leaf surfaces is closely related to drought tolerance. Identifying drought tolerance-associated wax components and cutin monomers, and the genes responsible for their biosynthesis, is essential for understanding the physiological and genetic mechanisms underlying drought tolerance and for improving crop drought resistance (Yang et al. 2022). SNP AX-147235264, located on chromosome B10 [−log10(P value) of 3.88], accounted for 20% of the variance in LRWC under drought-stressed conditions (ST_LRWC). It was located near the Aradu.PKW10 gene, which encodes a CD2 antigen cytoplasmic tail-binding-like protein. CD2 cytoplasmic tail binding protein 2 is a component of the U5 snRNP complex involved in RNA splicing. The gene PSTPIP1 encodes CD2 antigen-binding protein 1 (CD2BP1), also known as proline/serine/threonine phosphatase-interacting protein 1 (PSTPIP1). SNP AX-147235264 was reported by Otyama et al. (2022) to be responsible for linolenic acid accumulation and wax formation under drought-stress conditions (Yang et al. 2022).

The current association analyses identified a SNP, AX-176816874 [chromosome B09; −log10(P value) of 4.29], affecting ST_LRWC. It was located in the Araip.57P4D gene, which encodes a chitinase (Class V). Some chitinases are expressed in response to abiotic stress (Hamid 2013; Zhou et al. 2020). Lv et al. (2022) found that drought stress treatment induces significant upregulation of Class V chitinases.

SNP AX-147244306 [chromosome B03; −log10(P value) of 4.30] was associated with ST_LRWC. It was located in the Araip.4J8RL gene, which encodes a polynucleotide phosphatase/kinase-type protein with intracellular protein interaction domains that has a role in abiotic stress tolerance (Dasauni and Nailwal 2020). It could assist in selection for drought tolerance in groundnut.

SNP AX-147244415, specific to ST_LRWC [chromosome B03; −log10(P value) of 4.29], was located within the Araip.SVH5H gene, which encodes a G-family ATP-binding cassette (ABC) transporter protein of the half-size transporters. These transporters are expressed in the vascular tissues and are mainly involved in the translocation of ABA across the plasma membrane and tonoplast (Jarzyniak and Jasiński 2014). Kuromori et al. (2011) reported that a mutant version of this gene results in increased transpiration losses and drought susceptibility.

SNP AX-176798839 (chromosome A05; − log10 ( P value) of 3.94) associated with ST_LRWC was present in the Aradu.VIU0I gene encoding for a zinc finger MYM-type 1 like protein. It is responsible for signalling and regulation under abiotic stress. Zinc finger proteins enhance plant drought resistance by increasing the levels of osmotic adjustment substances (Han et al. 2020 ).

SNP AX-176817979 [chromosome A02; −log10(P value) of 4.85 and 3.98] was associated with the number of primary branches under non-stressed conditions (Table 2, Fig. 3i and j). It was located within 4.0 kb of the Aradu.ML3P3 gene, which encodes a P-type ATPase of Arabidopsis 2 protein. P-type ATPase type 2 belongs to the haloacid dehalogenase (HAD) superfamily and is split into four groups, i.e., Na⁺, K⁺, H⁺ Ca²⁺, Mg²⁺ and phospholipids (Thever and Saier 2009). Animals and fungi have Na⁺/K⁺-ATPases (P2C ATPases) and Na⁺-ATPases (P2D ATPases), respectively, that carry out Na⁺ exclusion (Axelsen and Palmgren 2001). The Na⁺/K⁺-ATPase helps to maintain low Na⁺ and high K⁺ concentrations within the cells.

In addition, SNP AX-177639302, located on chromosome B09 at 4.05 kb (PVE = 28%), showed a strong association signal for seed weight under drought-stressed conditions (Table 2, Fig. 3m and n). Similarly, Gangurde et al. (2020, 2022) reported a seed weight-associated genomic region on chromosome B09 in groundnut. This marker could provide an opportunity for seed size improvement in groundnut.

Conclusions

The study identified SNP-trait associations through association mapping in Arachis hypogaea. Forty-eight significantly associated regions were detected for important physiological and yield-related traits using the PCA + K and Q + K models. Forty-seven SNPs significantly associated with leaf area, leaf area index, specific leaf area, leaf relative water content, number of primary branches and hundred seed weight under drought-stressed conditions were identified. The MTAs and candidate genes identified in this study could be used to understand the genetic basis of important physiological and yield-related traits and to accelerate the development of drought-tolerant and high-yielding groundnut cultivars. Furthermore, the markers could be validated and deployed in groundnut breeding programs for gene pyramiding and trait integration.

Data availability

The datasets generated during and/or analysed during the current study are available from the corresponding author on reasonable request.

Abady S, Shimelis H, Janila P, Yaduru S, Shayanowako AIT, Deshmukh D, Chaudhari S, Manhor SS (2021) Assessment of the genetic diversity and population structure of groundnut germplasm collections using phenotypic traits and SNP markers: implications for drought tolerance breeding. PLoS ONE 16(11):e0259883. https://doi.org/10.1371/journal.pone.0259883

Axelsen KB, Palmgren MG (2001) Inventory of the superfamily of P-type ion pumps in Arabidopsis. Plant physiol 126:696–706. https://www.jstor.org/stable/4279931

Azevedo NAD, Nogueira RJMC, Melo FPA, Santos R (2010) Physiological and biochemical responses of peanut genotypes to water deficit. J Plant Interact 5:1–10

Barr HD, Weatherley PE (1962) A re-examination of the relative turgidity technique for estimating water deficit in leaves. Aust J Biol Sci 15:413–428

Bertioli DJ, Cannon SB, Froenicke L, Huang G, Farmer AD, Cannon EKS et al (2016) The genome sequences of Arachis duranensis and Arachis ipaensis , the diploid ancestors of cultivated peanut. Nat Genet 48:438–446. https://doi.org/10.1038/ng.3517

Blummel M, Ratnakumar P, Vadez V (2012) Opportunities for exploiting variations in haulm fodder traits of intermittent drought tolerant lines in a reference collection of groundnut ( Arachis hypogaea L.). Field Crops Res 126:200–206. https://doi.org/10.1016/j.fcr.2011.10.004

Carvalho MJ, Vorasoot N, Puppala N, Muitia A, Jogloy S (2017) Effects of terminal drought on growth, yield and yield components in Valencia peanut genotypes. SABRAO J Breed Genet 49:270–279

Dasauni K, Nailwal TK (2020) Zinc finger proteins: Novel sources of genes for abiotic stress tolerance in plants. Transcription Factors for Abiotic Stress Tolerance in Plants. Academic Press, Cambridge, pp 29–45. https://doi.org/10.1016/B978-0-12-819334-1.00003-4

Devate NB, Krishna H, Parmeshwarappa SKV, Manjunath KK, Chauhan D, Singh S, Singh JB, Kumar M, Patil R, Khan H, Jain N, Singh GP, Singh PK (2022) Genome-wide association mapping for component traits of drought and heat tolerance in wheat. Front Plant Sci 13:943033. https://doi.org/10.3389/fpls.2022.943033

Dwivedi SL, Nigam SN, Nageswara Rao RC, Singh U, Rao KVS (1996) Effect of drought on oil, fatty acids and protein contents of groundnut ( Arachis hypogaea L.) seeds. Field Crops Res 48:125–133

Earl DA, von Holdt BM (2012) STRUCTURE HARVESTER: a website and program for visualizing STRUCTURE output and implementing the Evanno method. Conserv Genet Resour 4:359–361. https://doi.org/10.1007/s12686-011-9548-7

Edae EA, Byrne PF, Haley SD, Lopes MS, Reynolds MP (2014) Genome-wide association mapping of yield and yield components of spring wheat under contrasting moisture regimes. Theor Appl Genet 127:791–807. https://doi.org/10.1007/s00122-013-2257-8

Evanno G, Regnaut S, Goudet J (2005) Detecting the number of clusters of individuals using the software STRUCTURE: a simulation study. Mol Ecol 14:2611–2620. https://doi.org/10.1111/j.1365-294X.2005.02553.x

FAOSTAT (2023) “Food and Agriculture Organization of the United Nations Database of Agricultural Production.” FAO Statistical Databases. http://www.fao.org/faostat/ . Accessed 14 January 2023

Gangurde SS, Wang H, Yaduru S, Pandey MK, Fountain JC, Chu Y, Isleib T, Holbrook CC, Xavier A, Culbreath AK, Ozias-Akins P, Varshney RK, Guo B (2020) Nested-association mapping (NAM) based genetic dissection uncovers candidate genes for seed and pod weights in peanut ( Arachis hypogaea ). Plant Biotechnol J 18:1457–1471

Gangurde SS, Khan AW, Janila P, Variath MT, Manohar SS, Singam P, Chitikineni A, Varshney RK, Pandey MK (2022) Whole-genome sequencing based discovery of candidate genes and diagnostic markers for seed weight in groundnut. Plant Genome 16:e20265. https://doi.org/10.1002/tpg2.20265

Hamid R, Khan MA, Ahmad M, Ahmad MM, Abdin MZ, Musarrat J, Javed S (2013) Chitinases: an update. J Pharm and Bioallied Sci 5:21. https://doi.org/10.4103/0975-7406.106559

Han G, Lu C, Guo J, Qiao Z, Sui N, Qiu N, Wang B (2020) C 2 H 2 zinc finger proteins: master regulators of abiotic stress responses in plants. Front Plant Sci 11:115. https://doi.org/10.3389/fpls.2020.00115

Hasan N, Choudhary S, Naaz N, Sharma N, Laskar RA (2021) Recent advancements in molecular marker-assisted selection and applications in plant breeding programmes. J Genet Eng and Biotechnol 19:128. https://doi.org/10.1186/s43141-021-00231-1

Janila P, Nigam SN, Pandey MK, Nagesh P, Varshney RK (2016) Groundnut improvement: use of genetic and genomic tools. Front Plant Sci 4:1–16. https://doi.org/10.3389/fpls.2013.00023

Jarzyniak KM, Jasiński M (2014) Membrane transporters and drought resistance–a complex issue. Front Plant Sci 5:687. https://doi.org/10.3389/fpls.2014.00687

Kuromori T, Sugimoto E, Shinozaki K (2011) Arabidopsis mutants of AtABCG22, an ABC transporter gene, increase water transpiration and drought susceptibility. Plant J 67:885–894. https://doi.org/10.1111/j.1365-313X.2011.04641.x

Lewontin RC, Kojima K (1960) The evolutionary dynamics of complex polymorphisms. Evolution 14:458–472

Li J, Kinoshita T, Pandey S, Ng CKY, Gygi SP, Shimazaki KI, Assmann SM (2002) Modulation of an RNA-binding protein by abscisic-acid-activated protein kinase. Nature 418:793–797

Lipka AE, Tian F, Wang Q, Peiffer J, Li M, Bradbury PJ et al (2012) GAPIT: genome association and prediction integrated tool. Bioinformatics 28(18):2397–2399. https://doi.org/10.1093/bioinformatics/bts444

Lugojan C, Ciulca S (2011) Evaluation of relative water content in winter wheat. J Hortic For Biotechnol 15:173–177

Lv P, Zhang C, Xie P, Yang X, El-Sheikh MA, Hefft DI, Ahmad P, Zhao T, Bhat JA (2022) Genome-wide identification and expression analyses of the chitinase gene family in response to white mold and drought stress in soybean ( Glycine max ). Life 12:1340. https://doi.org/10.3390/life12091340

Mace ES, Buhariwalla KK, Buhariwalla HK, Crouch JH (2003) A high-throughput DNA extraction protocol for tropical molecular breeding programs. Plant Mol Biol Rep 21:459–460. https://doi.org/10.1007/BF02772596

Mathew I, Shimelis H, Shayanowako AIT, Laing M, Chaplot V (2019) Genome-wide association study of drought tolerance and biomass allocation in wheat. PLoS ONE 14(12):e0225383. https://doi.org/10.1371/journal.pone.0225383

Maylani ED, Yuniati R, Wardhana W (2020) The Effect of leaf surface character on the ability of water hyacinth, Eichhornia crassipes (Mart.) Solms. to transpire water. IOP Conf Ser Mater Sci Eng 902(1):012070. https://doi.org/10.1088/1757-899X/902/1/012070

Mupunga I, Mngqawa P, Katerere DR (2017) Peanuts, aflatoxins and undernutrition in children in Sub-Saharan Africa. Nutrients 9:1287. https://doi.org/10.3390/nu9121287

Muthusamy M, Kim JH, Kim JA, Lee SI (2021) Plant RNA binding proteins as critical modulators in drought, high salinity, heat, and cold stress responses: an updated overview. Int J Mol Sci 22:6731

Na JK, Kim JK, Kim DY, Assmann SM (2015) Expression of potato RNA-binding proteins StUBA2a/b and StUBA2c induces hypersensitive-like cell death and early leaf senescence in Arabidopsis. J Exp Bot 66:4023–4033

Otyama PI, Wilkey A, Kulkarni R, Assefa T, Chu Y, Clevenger J et al (2019) Evaluation of linkage disequilibrium, population structure, and genetic diversity in the U.S. peanut mini core collection. BMC Genom 20:481. https://doi.org/10.1186/s12864-019-5824-9

Otyama PI, Chamberlin K, Ozias-Akins P, Graham MA, Cannon EK, Cannon SB, MacDonald GE, Anglin NL (2022) Genome-wide approaches delineate the additive, epistatic, and pleiotropic nature of variants controlling fatty acid composition in peanut (Arachis hypogaea L.). Theor Appl Genet 12:1–21. https://doi.org/10.1093/g3journal/jkab382

Pandey MK, Upadhyaya HD, Rathore A, Vadez V, Sheshshaye MS, Sriswathi M, Govil M, Kumar A, Gowda MVC, Shivali S et al (2014) Genome-wide association studies for 50 agronomic traits in peanut using the reference set comprising 300 genotypes from 48 countries of semi-arid tropics of the world. PLoS ONE 9(8):e105228. https://doi.org/10.1371/journal.pone.0105228

Pereira JWL, Albuquerque MB, Melo Filho PA, Nogueira RJMC, Lima LM, Santos RC (2016) Assessment of drought tolerance of peanut cultivars based on physiological and yield traits in a semiarid environment. Agric Water Manag 166:70–76. https://doi.org/10.1016/j.agwat.2015.12.010

Pritchard JK, Stephens M, Donnelly P (2000) Inference of population structure using multi locus genotype data. Genetics 155:945–959. https://doi.org/10.1093/genetics/155.2.945

Ravi K, Vadez V, Isobe S, Mir RR, Guo Y, Nigam SN, Gowda MVC, Radhakrishnan T, Bertioli DJ, Knapp SJ, Varshney RK (2011) Identification of several small main-effect QTLs and a large number of epistatic QTLs for drought tolerance related traits in groundnut ( Arachis hypogaea L.). Theor Appl Genet 122:1119–1132

Remington DL, Thornsberry JM, Matsuoka Y, Wilson LM, Whitt SR, Doebley J et al (2001) Structure of linkage disequilibrium and phenotypic associations in the maize genome. Proc Natl Acad Sci USA 98:11479–11484. https://doi.org/10.1073/pnas.201394398

Salgotra RK, Stewart CN Jr (2020) Functional markers for precision plant breeding. Int J Mol Sci 21(13):4792

Shaibu AS, Sneller C, Motagi BN, Chepkoech J, Chepngetich M, Miko ZL, Isa AM, Ajeigbe HA, Mohammed SG (2020) Genome-wide detection of snp markers associated with four physiological traits in groundnut ( Arachis hypogaea L.) mini core collection. Agronomy 10:192. https://doi.org/10.3390/agronomy10020192

Singh N, Agarwal N, Yadav HK (2019) Genome-wide SNP-based diversity analysis and association mapping in linseed ( Linum usitatissimum L.). Euphytica 215:139. https://doi.org/10.1007/s10681-019-2462-x

Thermo Fisher Scientific Inc (2018) Axiom TM Analysis Suite (AxAS) v4.0 USER GUIDE. Available at: https://downloads.thermofisher.com/Affymetrix_Softwares/ Axiom_Analysis_Suite_AxAS_v4.0_User_Guide.pdf

Thever MD, Saier MH (2009) Bioinformatic characterization of P-Type ATPases encoded within the fully sequenced genomes of 26 Eukaryotes. J Membr Biol 229:115–130. https://doi.org/10.1007/s00232-009-9176-2

Vadez V, Ratnakumar P (2016) High transpiration efficiency increases pod yield under intermittent drought in dry and hot atmospheric conditions but less so under wetter and cooler conditions in groundnut ( Arachis hypogaea (L.). Field Crops Res 193:16–23

Vadez V, Kholová J, Hummel G, Zhokhavets U, Gupta SK, Hash CT (2015) LeasyScan: a novel concept combining 3D imaging and lysimetry for high-throughput phenotyping of traits controlling plant water budget. J Exp Bot 66:5581–5593. https://doi.org/10.1093/jxb/erv251

Vos PG, Paulo MJ, Voorrips RE, Visser RGF, van Eck HJ, van Eeuwijk FA (2017) Evaluation of LD decay and various LD-decay estimators in simulated and SNP-array data of tetraploid potato. Theor Appl Genet 130:123–135. https://doi.org/10.1007/s00122-016-2798-8

Article   PubMed   Google Scholar  

Wan L, Wu Y, Huang J, Dai X, Lei Y, Yan L, Jiang H, Zhang J, Varshney RK, Liao B (2014) Identification of ERF genes in peanuts and functional analysis of AhERF008 and AhERF019 in abiotic stress response. FunctIntegr Genomics 14:467–477. https://doi.org/10.1007/s10142-014-0381-4

Wankhade AP, Chimote VP, Viswanatha KP, Yadaru S, Deshmukh DB, Gattu S, Sudini HK, Deshmukh MP, Shinde VS, Vemula AK, Pasupuleti J (2023) Genome-wide association mapping for LLS resistance in a MAGIC population of groundnut ( Arachis hypogaea L.). Theor Appl Genet 136:43. https://doi.org/10.1007/s00122-023-04256-7

Wankhade A, Purohit A, Janila P (2022) Step-wise selection for early canopy traits followed by stress tolerance indices as an approach for improving drought tolerance in groundnut ( Arachis hypogaea L.). In: The 7 th Congress on Plant Production In Water - Limited Environment, 28 Nov - 02 Dec 2022, King Fahd Hotel, Dakar, Senegal

Yan Y, Gan J, Tao Y, Okita TW, Tian L (2022) RNA-Binding proteins: the key modulator in stress granule formation and abiotic stress response. Front Plant Sci 13:882596. https://doi.org/10.3389/fpls.2022.882596

Yang F, Han Y, Zhu QH, Zhang X, Xue F, Li Y, Luo H, Qin J, Sun J, Liu F (2022) Impact of water deficiency on leaf cuticle lipids and gene expression networks in cotton (Gossypium hirsutum L.). BMC Plant Biol 22:404. https://doi.org/10.1186/s12870-022-03788-2

Zhang Y, Yang J, Van Haaften M, Li L, Lu S, Wen W, Zheng X, Pan J, Qian T (2022) Interactions between diffuse light and cucumber (Cucumis sativus L.) canopy structure, simulations of light interception in virtual canopies. Agronomy 12:602. https://doi.org/10.3390/agronomy12030602

Zhou N, An Y, Gui Z, Xu S, He X, Gao J, Zeng D, Gan D, Xu W (2020) Identification and expression analysis of chitinase genes in Zizania latifolia in response to abiotic stress. Sci Hortic 261:108952. https://doi.org/10.1016/j.scienta.2019.108952

Zhou X, Guo J, Pandey MK, Varshney RK, Huang L, Luo H, Liu N, Chen W, Lei Y, Liao B, Jiang H (2021) Dissection of the genetic basis of yield-related traits in the chinese peanut mini-core collection through genome-wide association studies. Front Plant Sci 12:637284. https://doi.org/10.3389/fpls.2021.637284

Zou K, Kim KS, Kang D, Kim MC, Ha J, Moon JK, Jun TH (2022) Genome-Wide association study of leaf chlorophyll content using high-density SNP array in peanuts ( Arachis hypogaea L.). Agronomy 12:152. https://doi.org/10.3390/agronomy12010152

Download references

Acknowledgements

The authors are very thankful to the Groundnut Breeding Program and Center of Excellence in Genomics and Systems Biology (CEG) at ICRISAT, India for the technical assistance during DNA extraction and field experimentation. The authors also acknowledge the University of Georgia, Tifton, United States for providing technical assistance during genotyping. ICRISAT is thanked for providing the groundnut germplasm used in the study.

Open access funding provided by University of KwaZulu-Natal. This work was financially supported by the Organization of the Petroleum Exporting Countries (OPEC) Fund for International Development (OFID), the International Foundation for Science (IFS), and the University of KwaZulu-Natal and conducted under CGIAR Research Program on Grain Legume and Dry Land Cereals (CRP-GLDC).

Author information

Authors and affiliations.

African Centre for Crop Improvement (ACCI), School of Agricultural, Earth and Environmental Sciences, University of KwaZulu-Natal, Private Bag X01, Scottsville, Pietermaritzburg, 3209, South Africa

Seltene Abady & Hussein Shimelis

International Crops Research Institute for the Semi-Arid Tropics (ICRISAT), Patancheru, Telangana, India

Pasupuleti Janila & Ankush Wankhade

State Level Biotechnology Centre, Mahatma Phule Krishi Vidyapeeth, Rahuri, 413722, India

Vivek P. Chimote

Contributions

Conceptualization: SA, HS, PJ, AW, VC; Data curation: SA; Formal analysis: SA, AW; Funding acquisition: HS; Methodology: SA, HS; Project administration: PJ; Resources: HS, PJ, AW, VC; Supervision: HS, PJ; Validation: HS; Writing – original draft: SA, HS. All the authors read and approved the manuscript.

Corresponding author

Correspondence to Seltene Abady.

Ethics declarations

Conflict of interest.

The authors have declared no competing interests.

Additional information

Publisher's note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Below is the link to the electronic supplementary material.

Supplementary file1 (DOCX 40 KB)

Supplementary file2 (DOCX 37 KB)

Supplementary file3 (DOCX 35 KB)

Supplementary file4 (DOCX 30 KB)

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Abady, S., Shimelis, H., Janila, P. et al. Genome-wide association analysis for drought tolerance and component traits in groundnut gene pool. Euphytica 220, 76 (2024). https://doi.org/10.1007/s10681-024-03324-3

Received: 10 November 2023

Accepted: 03 March 2024

Published: 18 April 2024

DOI: https://doi.org/10.1007/s10681-024-03324-3

  • Arachis hypogaea
  • Candidate genes
  • Drought-stress
  • Single nucleotide polymorphism

Genome-wide association studies articles from across Nature Portfolio

Genome-wide association studies (GWASs) are unbiased genome screens of unrelated individuals and appropriately matched controls or parent-affected child trios to establish whether any genetic variant is associated with a trait. These studies typically focus on associations between single-nucleotide polymorphisms (SNPs) and major diseases.

Latest Research and Reviews

Genome-wide association analyses identify 95 risk loci and provide insights into the neurobiology of post-traumatic stress disorder

Multi-ancestry genome-wide analyses identify 95 loci associated with post-traumatic stress disorder and implicate candidate genes, pathways and neurobiological systems underlying its pathophysiology.

  • Caroline M. Nievergelt
  • Adam X. Maihofer
  • Karestan C. Koenen

Refining the impact of genetic evidence on clinical success

Human genetic evidence increases the success rate of drugs from clinical development to approval but we are still far from reaching peak genetic insights to aid the discovery of targets for more effective drugs.

  • Eric Vallabh Minikel
  • Jeffery L. Painter
  • Matthew R. Nelson

Multi-ancestry meta-analysis of tobacco use disorder identifies 461 potential risk genes and reveals associations with multiple health outcomes

In 653,790 individuals, this multi-ancestral meta-analysis of tobacco use disorder finds 461 potential risk genes and hundreds of associations with health outcomes, showcasing the utility of electronic health records for genetic research.

  • Sylvanus Toikumo
  • Mariela V. Jennings
  • Sandra Sanchez-Roige

Integrative common and rare variant analyses provide insights into the genetic architecture of liver cirrhosis

A multi-ancestry genome-wide association study of liver cirrhosis and its associated endophenotypes identifies and validates 14 risk variants. Integrative common and rare variant analyses provide insights into the genetic architecture of liver cirrhosis.

  • Jonas Ghouse
  • Gardar Sveinbjörnsson
  • Stefan Stender

A genome-wide association study provides insights into the genetic etiology of 57 essential and non-essential trace elements in humans

A genome-wide association study of 57 trace elements measured in up to 6564 Scandinavians identifies genetic loci associated with blood levels of essential and non-essential trace elements and explores their effects on health outcomes.

  • Marta R. Moksnes
  • Ailin F. Hansen
  • Ben M. Brumpton

Protein-truncating variants in BSN are associated with severe adult-onset obesity, type 2 diabetes and fatty liver disease

Analyses of whole-exome sequencing data identify rare loss-of-function variants in BSN associated with adult-onset obesity, type 2 diabetes and fatty liver disease, with stronger effect sizes than those observed for variants in known obesity risk genes such as MC4R.

  • Maria Chukanova
  • John R. B. Perry

News and Comment

Genetic associations of human metabolic traits

Genetic contribution to heterogeneity in type 2 diabetes

New insights into the genetics of diabetes in pregnancy

Gestational diabetes is a complex metabolic condition thought to have a strong genetic predisposition. A large genome-wide association study of participants from Finland sheds light on the genetic contributors, opening avenues for research into mechanisms that underlie glucose regulation in pregnancy to improve the health of mothers and babies.

  • Aminata Hallimat Cissé
  • Rachel M. Freathy

Rare CTR9 variants and myeloid malignancies

Depression genetics goes global.

  • Shari Wiseman

Connecting clinical and genetic heterogeneity in ADHD

Understanding clinical heterogeneity in attention deficit hyperactivity disorder (ADHD) is important for improving personalized care and long-term outcomes. A study exploits the large scale and breadth of phenotyping of the iPSYCH cohort to link clinical heterogeneity to genetic heterogeneity in ADHD.

  • Chloe X. Yap
  • Jacob Gratten

  • Research article
  • Open access
  • Published: 15 April 2024

Shared genetic architecture between autoimmune disorders and B-cell acute lymphoblastic leukemia: insights from large-scale genome-wide cross-trait analysis

  • Xinghao Yu
  • Yiyin Chen
  • Jia Chen
  • Huimin Lu
  • Depei Wu
  • Yang Xu

BMC Medicine volume 22, Article number: 161 (2024)

Background

This study aimed to investigate the shared genetic structure between autoimmune diseases and B-cell acute lymphoblastic leukemia (B-ALL) and to identify the shared risk loci, genes, and genetic mechanisms involved.

Methods

Based on large-scale genome-wide association study (GWAS) summary-level data sets, we assessed genetic overlap between autoimmune diseases and B-ALL, and cross-trait pleiotropic analysis was performed to detect shared pleiotropic loci and genes. A series of functional annotation and tissue-specific analyses was performed to determine the influence of pleiotropic genes. Heritability enrichment analysis was used to detect crucial immune cells and tissues. Finally, bidirectional Mendelian randomization (MR) methods were used to investigate causal associations.

Results

Our research highlighted shared genetic mechanisms between seven autoimmune disorders and B-ALL. A total of 73 pleiotropic loci were identified at the genome-wide significance level (P < 5 × 10⁻⁸), 16 of which had strong evidence of colocalization. Several of these loci have been previously reported (e.g., 17q21), and some are novel (e.g., 10p12, 5p13). Further gene-level analysis identified 194 unique pleiotropic genes, for example IKZF1, GATA3, IKZF3, GSDMB, and ORMDL3. Pathway analysis determined the key role of cellular response to cytokine stimulus, B cell activation, and JAK-STAT signaling pathways. SNP-level and gene-level tissue enrichment suggested a crucial role for pleiotropic mechanisms in the spleen, whole blood, and EBV-transformed lymphocytes. Also, HyPrColoc and stratified LD score regression analyses revealed that B cells at different developmental stages may be involved in mechanisms shared between the two types of disease. Finally, two-sample MR analysis determined causal effects of asthma and rheumatoid arthritis on B-ALL.

Conclusions

Our research demonstrated a shared genetic architecture between autoimmune disorders and B-ALL and shed light on the potential mechanisms that might be involved.

B-cell acute lymphoblastic leukemia (B-ALL) is a prevalent subtype of leukemia characterized by its highly malignant nature, primarily originating from the clonal expansion and abnormal proliferation of B lymphocytes within the hematopoietic system [1]. Autoimmune disorders are characterized by a disruption in self-tolerance, resulting in pathological alterations and clinical symptoms arising from immune responses targeting self-components [2]. Concurrently, the pathogenesis of several autoimmune disorders is intricately interwoven with the malfunctioning of B cells within the humoral immune system. The excessive activation of self-reactive B cells precipitates an overproduction of autoantibodies and immune complexes, which, in turn, inflict damage upon a multitude of tissues and organs, culminating in the emergence of various autoimmune disorders [3]. To summarize, B cells assume a pivotal role in the orchestration of humoral immune responses, and their deregulation markedly contributes to the onset of autoimmune diseases and B-cell malignancies [4].

Epidemiological investigations have discovered associations between autoimmune disorders and B-cell malignancies. For example, rheumatoid arthritis (RA) patients exhibit a twofold increased risk of concomitant B-cell lymphomas when compared to their healthy counterparts [5]. In patients with systemic lupus erythematosus (SLE) and Sjögren’s syndrome, the risk rises to 2.7–7.5 times [6] and 9–18 times [6], respectively. Previous studies estimated the standardized incidence ratio of ALL to be 2.77 after RA onset [7]. Studies also showed that, at the time of diagnosis of malignancy, 15–30% of patients present with many of the typical features of rheumatic diseases [8]. However, current research has focused primarily on the effect of autoimmune disease onset on the risk of hematological malignancies, particularly diffuse large B-cell lymphoma and follicular lymphoma. This leaves a clear gap in understanding the pleiotropic mechanisms and bidirectional causation between B-ALL (a disease also derived from B lymphocytes) and autoimmune diseases. Only Li et al. have reported a shared mechanism between autoimmunity and B-ALL, specifically demonstrating the essential role of DYRK1a in mediating the noncanonical NF-κB activation induced by BAFF [9]. This underscores the existence of substantial knowledge gaps in this field and the urgent need to ascertain shared risk loci between these two types of disorders. It is worth noting that traditional clinical or epidemiological research may lack the statistical power needed for such investigations.

Recently, the linkage disequilibrium (LD) score regression (LDSC) approach has been developed to indicate whether a genetic correlation exists between two diseases [10]. However, it remains unclear whether the overall genetic correlation is attributable to a few loci or to the entire genome. Few studies to date have systematically evaluated genetic overlap, shared susceptibility genes, and causality between autoimmune diseases and B-ALL. Cross-trait analyses, which use the correlation of GWAS signals to study pleiotropic genetic variants or loci across multiple traits, have been shown to accurately identify shared loci between diseases or traits [11, 12, 13]. These pleiotropic loci can be targeted for intervention to potentially prevent or treat these diseases simultaneously. Recently, a novel method called “PLACO” was developed to identify pleiotropy at the SNP level based on a level-α intersection–union test (IUT) [14]. Therefore, it is important to determine the specific genetic variants or loci that lead to genome-wide genetic correlations and to delve into the shared genetic etiology of these two types of diseases. Our research flowchart is shown in Fig. 1.

Fig. 1 Study workflow

GWAS summary data source

GWAS summary statistics for 16 autoimmune diseases were all publicly available from large-scale GWAS or GWAS meta-analyses: adult-onset asthma (AOA) [15, 16], childhood-onset asthma (COA) [15, 16], Graves’ disease (GD) [17, 18], Hashimoto’s disease (HD) [17, 18], hypothyroidism (HT) [17, 18], primary biliary cirrhosis (PBC) [19, 20], primary sclerosing cholangitis (PSC) [21, 22], inflammatory bowel disease (IBD) [23, 24], Crohn’s disease (CD) [23, 24], ulcerative colitis (UC) [23, 24], RA [25, 26], SLE [27, 28], multiple sclerosis (MS) [29, 30], systemic sclerosis (SS) [31, 32], type 1 diabetes (T1D) [17, 18], and vitiligo [33, 34]. GWAS summary statistics for B-ALL were generated in a meta-analysis of four GWAS including a total of 5321 cases and 16,666 controls of European ancestry [35, 36]. The same quality control procedure was followed for each study. The association between ALL status and SNP genotypes in each study was assessed using logistic regression, with genetic principal components included as covariates. Risk estimates were finally combined by fixed-effects inverse variance weighted (IVW) meta-analysis. The sources and details of these datasets are summarized in Additional file 2: Table S1.
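As a rough illustration of the fixed-effects IVW step described above, the following Python sketch combines per-study log odds ratios for a single SNP. The function name and the four effect sizes are hypothetical and are not taken from the study; the actual meta-analysis was carried out in the cited B-ALL GWAS [35, 36].

```python
import math


def ivw_fixed_effects(betas, ses):
    """Fixed-effects inverse-variance weighted meta-analysis for one SNP.

    betas: per-study log odds ratios; ses: their standard errors.
    Returns the pooled estimate, its standard error, Z-score and two-sided P value.
    """
    weights = [1.0 / se ** 2 for se in ses]      # each study weighted by 1 / variance
    total_weight = sum(weights)
    beta = sum(w * b for w, b in zip(weights, betas)) / total_weight
    se = math.sqrt(1.0 / total_weight)
    z = beta / se
    p = math.erfc(abs(z) / math.sqrt(2.0))       # two-sided normal P value
    return beta, se, z, p


# Hypothetical estimates from four studies (illustrative values only).
study_betas = [0.20, 0.10, 0.15, 0.25]
study_ses = [0.10, 0.10, 0.05, 0.20]
pooled_beta, pooled_se, pooled_z, pooled_p = ivw_fixed_effects(study_betas, study_ses)
print(f"beta={pooled_beta:.3f}, se={pooled_se:.3f}, z={pooled_z:.2f}, p={pooled_p:.2e}")
```

Studies with smaller standard errors dominate the pooled estimate, which is why the third hypothetical study (SE 0.05) pulls the result toward 0.15.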

Genetic overlap at the genome-wide level

We used LDSC to evaluate the genetic structure shared between autoimmune disorders and B-ALL [10]. The LD scores used in LDSC were calculated based on genotypes of common SNPs from European ancestry samples in the 1000 Genomes Project [37]. Standard errors (SE) were estimated by the jackknife method in LDSC, which was further used to correct for attenuation bias. The intercept of LDSC was used to evaluate potential population overlap between studies from different consortia [10]. It is worth noting that no actual population overlap existed between the autoimmune disorder and B-ALL studies in our analysis. We also applied a likelihood-based method called high-definition likelihood (HDL), which uses GWAS summary statistics to estimate genetic correlations and can reduce the variance of these estimates by about 60% compared with the LDSC method [38].
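The jackknife idea behind these standard errors can be sketched generically in Python. This is only an illustration of a delete-one-block jackknife with a simple mean as the estimator; it is not the LDSC implementation, which applies the jackknife to a weighted regression over blocks of SNPs. The function name and the toy data are assumptions.

```python
import math


def block_jackknife_se(values, n_blocks, estimator):
    """Delete-one-block jackknife standard error of `estimator(values)`.

    The data are split into `n_blocks` contiguous blocks; the estimator is
    recomputed with each block left out in turn.
    """
    n = len(values)
    bounds = [i * n // n_blocks for i in range(n_blocks + 1)]
    leave_one_out = []
    for i in range(n_blocks):
        kept = values[:bounds[i]] + values[bounds[i + 1]:]
        leave_one_out.append(estimator(kept))
    mean_loo = sum(leave_one_out) / n_blocks
    variance = (n_blocks - 1) / n_blocks * sum(
        (theta - mean_loo) ** 2 for theta in leave_one_out
    )
    return math.sqrt(variance)


def mean(xs):
    return sum(xs) / len(xs)


# Toy data: with one observation per block this reproduces the usual
# standard error of the mean.
toy_values = [1.0, 2.0, 3.0, 4.0, 5.0]
toy_se = block_jackknife_se(toy_values, n_blocks=5, estimator=mean)
print(round(toy_se, 4))
```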

We further investigated whether the SNP heritability of autoimmune diseases and B-ALL was enriched in specific cells and tissues using stratified LDSC (S-LDSC) regression. S-LDSC was applied to tissue and immune cell data to assess whether specific tissues or cell types showed significant heritability enrichment. We downloaded 54 human tissue datasets from GTEx [39] and 292 immune cell types from the ImmGen consortium [40] (including B cells, γδ T cells, αβ T cells, innate lymphocytes, myeloid cells, stromal cells, and stem cells). After adjusting for the baseline model and all gene sets, we assessed the significance of the SNP heritability enrichment estimated in each tissue and cell type by using the regression coefficient Z-scores and corresponding P values.

Identification of pleiotropic loci and genes by using PLACO

A pleiotropic analysis under composite null hypothesis (PLACO) was used to identify pleiotropy between multiple autoimmune diseases and B-ALL at the SNP level. SNPs reaching the genome-wide significance level (P < 5 × 10⁻⁸) were regarded as pleiotropic variants. Functional mapping and annotation (FUMA) of GWAS was used to determine the genomic regions of these risk variants (i.e., pleiotropic loci) [41]. A Bayesian colocalization analysis was also conducted to determine the pleiotropic loci shared by autoimmune diseases and B-ALL [42]. To explore the shared mechanisms of the identified loci, nearby genes were mapped based on the lead SNPs within each locus. In addition, multi-marker analysis of genomic annotation (MAGMA), a generalized gene-set analysis approach for GWAS data, was used to determine the biological function of these pleiotropic loci. Specifically, we performed MAGMA gene analysis to identify pleiotropic genes by properly incorporating LD between markers and to detect multi-marker effects (P < 0.05/18,345 = 2.73 × 10⁻⁶) [43]. MAGMA gene-set analysis was performed to investigate the biological function of lead SNPs [43], and a total of 10,678 gene sets, including curated gene sets (c2.all) and GO terms (c5.bp, c5.cc, and c5.mf) from the Molecular Signatures Database (MSigDB), were finally tested [44]. Bonferroni correction was applied across all tested gene sets to control false positives (P < 0.05/10,678 = 4.68 × 10⁻⁶). The Metascape web tool (metascape.org) was used to perform a pathway enrichment analysis to determine the function of mapped genes based on MSigDB [44]. Genome-wide tissue-specific enrichment analysis was conducted based on 54 GTEx tissues [45] for the genome-wide pleiotropic results generated by PLACO. We also calculated the average expression (log2 transformed) of all identified pleiotropic genes in all 54 GTEx tissues [45] and tested tissue specificity using differentially expressed genes (DEGs) in each tissue (up- and down-regulated DEGs were precomputed from the signs of the t-statistics).
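The two Bonferroni thresholds quoted above follow directly from dividing 0.05 by the number of tests. The short Python check below reproduces them; the helper names and the example gene P values are hypothetical and serve only to show how such a threshold would be applied.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold under Bonferroni correction."""
    return alpha / n_tests


def significant_items(p_values, threshold):
    """Return the names whose P value falls below the threshold."""
    return sorted(name for name, p in p_values.items() if p < threshold)


gene_threshold = bonferroni_threshold(0.05, 18345)       # MAGMA gene analysis
gene_set_threshold = bonferroni_threshold(0.05, 10678)   # MAGMA gene-set analysis
print(f"{gene_threshold:.2e}", f"{gene_set_threshold:.2e}")

# Hypothetical gene-level P values (not results from the study).
example_p = {"GENE_A": 1.0e-7, "GENE_B": 3.0e-6, "GENE_C": 0.04}
print(significant_items(example_p, gene_threshold))
```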

Summary-based Mendelian randomization

The summary-based Mendelian randomization (SMR) method [46] combines summary-level data from GWAS with data from expression quantitative trait loci (eQTL) studies to identify genes whose expression levels are associated with complex traits due to pleiotropy. It employs the SMR and HEIDI tests to examine pleiotropic associations between gene expression levels and complex traits of interest. This approach can be interpreted as testing whether the effect of a SNP on the phenotype is mediated by gene expression.
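For a single top cis-eQTL, the SMR estimate is commonly written as the ratio of the SNP effect on the trait to the SNP effect on expression, with an approximate one-degree-of-freedom chi-square test statistic built from the two Z-scores. The Python sketch below assumes this standard form; the input values are hypothetical, and the HEIDI heterogeneity test, which needs multiple SNPs and LD information, is not shown.

```python
import math


def smr_test(b_zx, se_zx, b_zy, se_zy):
    """Single-SNP SMR estimate and approximate test.

    b_zx, se_zx: SNP effect on gene expression (eQTL) and its standard error.
    b_zy, se_zy: SNP effect on the trait (GWAS) and its standard error.
    """
    z_zx = b_zx / se_zx
    z_zy = b_zy / se_zy
    b_xy = b_zy / b_zx                                   # expression-to-trait effect
    t_smr = (z_zy ** 2 * z_zx ** 2) / (z_zy ** 2 + z_zx ** 2)
    p_smr = math.erfc(math.sqrt(t_smr / 2.0))            # chi-square(1) upper tail
    return b_xy, t_smr, p_smr


# Hypothetical eQTL and GWAS effects for one SNP (illustrative values only).
b_xy, t_smr, p_smr = smr_test(b_zx=0.5, se_zx=0.05, b_zy=0.1, se_zy=0.025)
print(f"b_xy={b_xy:.2f}, T_SMR={t_smr:.2f}, P={p_smr:.2e}")
```

The statistic is bounded by the weaker of the two associations, so a strong eQTL cannot compensate for a weak GWAS signal.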

Multi-trait colocalization analysis

We used the hypothesis prioritization for multi-trait colocalization (HyPrColoc) method [47] to perform multi-trait colocalization analysis and pinpoint the crucial roles that immune traits play in the onset of autoimmune disorders and B-ALL. The immune-wide GWAS data cover a total of 731 immune cell traits [48] and are publicly available from the GWAS Catalog (GCST0001391–GCST0002121). Detailed information on the GWAS summary datasets for immune cells is provided in Additional file 1: Supplementary Methods.

Causal association analysis

We performed a bidirectional two-sample Mendelian randomization (MR) analysis to assess possible causal effects between autoimmune disorders and B-ALL risk. The “clumping” procedure in PLINK 1.9 was used to extract independent significant SNPs for all autoimmune diseases (P < 5 × 10⁻⁸), with r² set to 0.001 and the window size set to a physical distance of 10,000 kb [49]. Notably, r² was calculated using the 1000 Genomes Project phase 3 as the reference panel. The proportion of variance explained (PVE) and the F statistic (F > 10) were used to measure the strength of instrumental variables (IVs) (see Additional file 1: Supplementary Methods) [50]. To verify causality among these trait pairs, seven MR approaches were performed with each set of IVs, i.e., IVW, debiased IVW (DIVW) [51], the weighted median approach [52], MR pleiotropy residual sum and outlier (MR-PRESSO) [53], MR-Egger [54], MR robust adjusted profile score (MR-RAPS) [55], and the mode-based estimate method [56]. Cochran’s Q statistic was used to examine effect size heterogeneity across the IVs (see Additional file 1: Supplementary Methods) [57, 58]. Additionally, the intercept of MR-Egger regression and the global test of MR-PRESSO were used to detect horizontal pleiotropy [53, 54]. Detailed information on the MR methods used is provided in Additional file 1: Supplementary Methods.
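To make the IVW estimate, instrument strength, and heterogeneity checks more concrete, here is a minimal Python sketch. It assumes first-order weights for the Wald ratios and approximates the per-SNP F statistic as the squared exposure Z-score; the study's exact definitions are given in its Supplementary Methods, and the analysis itself was run with the R packages listed under Software and packages. All function names and numbers here are hypothetical.

```python
import math


def mr_ivw(bx, by, se_y):
    """IVW causal estimate from per-SNP exposure (bx) and outcome (by) effects.

    Uses first-order weights bx^2 / se_y^2 for the per-SNP Wald ratios by / bx.
    Returns the IVW estimate, its standard error and Cochran's Q.
    """
    ratios = [y / x for x, y in zip(bx, by)]
    weights = [(x / s) ** 2 for x, s in zip(bx, se_y)]
    total = sum(weights)
    beta = sum(w * r for w, r in zip(weights, ratios)) / total
    se = math.sqrt(1.0 / total)
    q = sum(w * (r - beta) ** 2 for w, r in zip(weights, ratios))
    return beta, se, q


def f_statistics(bx, se_x):
    """Approximate per-SNP F statistic as the squared exposure Z-score."""
    return [(x / s) ** 2 for x, s in zip(bx, se_x)]


# Hypothetical summary statistics for three instruments.
bx = [0.10, 0.20, 0.05]
se_x = [0.01, 0.02, 0.01]
by = [0.02, 0.05, 0.01]
se_y = [0.01, 0.01, 0.01]

f_values = f_statistics(bx, se_x)
strong = all(f > 10 for f in f_values)       # instrument strength rule of thumb
ivw_beta, ivw_se, cochran_q = mr_ivw(bx, by, se_y)
print(f"IVW beta={ivw_beta:.3f}, se={ivw_se:.3f}, Q={cochran_q:.3f}, strong={strong}")
```

Under homogeneity, Q is compared against a chi-square distribution with (number of instruments − 1) degrees of freedom.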

Software and packages

The main statistical analysis was performed in R (version 3.5.3). LDSC and S-LDSC analyses were implemented with the “LDSC” software (v1.0.1) [10]. PLACO was performed with the “PLACO” package [14]. Bayesian colocalization analysis was performed with the “coloc” package (version 5.2.1) [42], and HyPrColoc was performed with the “hyprcoloc” package (version 1.0) [47]. Functional analysis was performed with the FUMA web tool [41]. MAGMA gene and gene-set analyses were performed with the MAGMA software [43]. Two-sample MR analysis was conducted with the “MendelianRandomization” (version 0.9.0) [59], “mr.raps” (version 0.4.1) [55], and “MRPRESSO” (version 1.0) [53] packages. A copy of the main code used in this research is available at: https://github.com/biostatYu/MRcode-/tree/main/AD_BALL

Shared genetic architecture between autoimmune disorders and B-ALL

We first evaluated the genetic correlation between autoimmune diseases and B-ALL, and the results from the LDSC and HDL methods were highly consistent (Table 1 and Additional file 2: Table S2). Specifically, using the LDSC method, six traits were identified as genetically correlated with B-ALL: AOA, HT, IBD, CD, RA, and MS. With the HDL method, significant genetic correlations with B-ALL were observed for AOA, HT, PBC, RA, and MS, leading to a final union set of seven trait pairs for further analysis. However, the HDL method did not detect significant genetic correlations of IBD or CD with B-ALL. It is worth noting that only RA remained significantly genetically correlated with B-ALL risk after applying the Bonferroni correction (P = 0.003 < 0.05/16).
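The selection logic in this paragraph can be checked with a few lines of Python: the union of the LDSC and HDL hits gives the seven trait pairs, and the Bonferroni threshold for 16 tested diseases is 0.05/16. The variable names are illustrative; the trait lists and the RA P value are those reported above.

```python
ldsc_hits = {"AOA", "HT", "IBD", "CD", "RA", "MS"}   # significant with LDSC
hdl_hits = {"AOA", "HT", "PBC", "RA", "MS"}          # significant with HDL

trait_pairs = sorted(ldsc_hits | hdl_hits)           # union carried forward
only_ldsc = sorted(ldsc_hits - hdl_hits)             # not supported by HDL

bonferroni = 0.05 / 16                               # 16 autoimmune diseases tested
ra_p_value = 0.003                                   # reported P value for RA
ra_survives = ra_p_value < bonferroni

print(len(trait_pairs), trait_pairs)
print(only_ldsc, bonferroni, ra_survives)
```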

Pleiotropic loci and genes identified for multiple autoimmune disorders and B-ALL

Given the shared genetic mechanisms between autoimmune diseases and B-ALL identified by LDSC and HDL, we used a novel pleiotropy analysis (PLACO) to identify potential pleiotropic loci for both types of disease (Additional file 1: Fig. S1). The QQ plots demonstrated no premature divergence between observed and expected values, ruling out the possibility of population stratification (Additional file 1: Fig. S2). Based on PLACO results, we identified a total of 73 pleiotropic genomic risk loci associated with both autoimmune disorders and B-ALL using FUMA (P < 5 × 10⁻⁸) (Fig. 2, Additional file 1: Fig. S1, and Additional file 2: Table S3). Colocalization analysis finally identified 16 of 73 (21.9%) potential pleiotropic loci with PP.H4 greater than 0.7 (e.g., 5p13) (Table 2). The regional plots for each trait pair are presented in Additional file 1: Figs. S3–S8. Notably, some pleiotropic regions were shared between different pairs; for example, genome regions 7p12.2, 10p14, 6q27, and 10p12.31 were identified in four pairs (Additional file 2: Table S4). The MAGMA analysis of gene sets suggested that the identified pleiotropic loci may participate in the control of the immune system, hematopoiesis, and various other processes (Fig. 3A and Additional file 2: Table S5). Notably, a significant monocyte differentiation pathway was found for all trait pairs, and significant leukocyte differentiation was found for five trait pairs. Further tissue-specific analysis found that these risk loci were enriched in several immune-related tissues (e.g., spleen, whole blood, Epstein–Barr virus (EBV)-transformed lymphocytes) (Fig. 3B and Additional file 2: Table S6). ANNOVAR category annotation showed that 28 of 73 lead SNPs (38.4%) were intronic variants and 30 of 73 (41.1%) were intergenic variants. Only 2 of 73 (2.7%) lead SNPs were exonic variants (Additional file 2: Table S3).

Fig. 2 The circular diagram presents pleiotropic loci and genes identified by PLACO among seven trait pairs. Note: Shared loci identified by colocalization analysis are highlighted in orange; shared genes identified by MAGMA analysis are highlighted in blue. B-ALL, B-cell acute lymphoblastic leukemia; AOA, adult-onset asthma; HT, hypothyroidism; PBC, primary biliary cirrhosis; IBD, inflammatory bowel disease; CD, Crohn’s disease; RA, rheumatoid arthritis; MS, multiple sclerosis

figure 3

Bar plot of MAGMA gene-set (A) and tissue-specific (B) analysis for genome-wide pleiotropic results. Note: The red dotted line marks the 0.05 significance threshold after multiple-testing correction, and the blue line marks the uncorrected 0.05 threshold. B-ALL B-cell acute lymphoblastic leukemia, AOA adult-onset asthma, HT hypothyroidism, PBC primary biliary cirrhosis, IBD inflammatory bowel disease, CD Crohn’s disease, RA rheumatoid arthritis, MS multiple sclerosis

Pleiotropic genes identified and enrichment analysis

We used different methods to map the identified SNP-level signals to gene-level signals. MAGMA gene analysis identified a total of 341 significant pleiotropic genes between multiple autoimmune diseases and B-ALL (194 unique) (Additional file 2: Table S7 and Additional file 1: Fig. S9). Additional file 2: Table S8 lists the details of these genes. MAGMA gene analysis detected 92 pleiotropic genes that recurred across different trait pairs, with IKZF1 identified as a pleiotropic gene for six pairs, followed by MLLT10, FIGNL1, RNASET2, CCR6, GATA3, CLN3, PIP4K2A, DDC, RP11-514O12.4, FGFR1OP, and GRB10 in four trait pairs. eQTL analysis identified multiple hits of IKZF1 in blood- and immune-related tissues and datasets (e.g., naïve B cells, CD19 B cells, EBV-transformed lymphocytes, cis-eQTLs, trans-eQTLs, spleen, whole blood). Five genes (i.e., TUFM, ZC2HC1A, RNASET2, GSDMB, and ORMDL3) were significant in five different tissues. We summarize the landscape of pleiotropic genes identified by different methods and in different tissues in Fig. 4. Several genes (RNASET2 and FIGNL1) were significantly mapped in different tissues by different methods. The IKZF1 gene was also highlighted in whole blood (Fig. 4).
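The distinction between total hits, unique genes, and genes recurring across trait pairs can be made concrete with a small tally. The sketch below uses placeholder gene names and trait-pair labels (hypothetical toy input, not the study's results) and an illustrative function name, `summarize_hits`.

```python
from collections import Counter

# Hypothetical toy input: significant genes per trait pair.
# Gene and pair names are placeholders, not results from the study.
significant_genes = {
    "pair_1": {"GENE_A", "GENE_B", "GENE_C"},
    "pair_2": {"GENE_A", "GENE_C"},
    "pair_3": {"GENE_A", "GENE_D"},
}


def summarize_hits(genes_by_pair):
    """Return (total gene-pair hits, number of unique genes,
    {gene: number of pairs} for genes seen in more than one pair)."""
    counts = Counter(gene for genes in genes_by_pair.values() for gene in genes)
    total_hits = sum(counts.values())
    unique_genes = len(counts)
    recurrent = {gene: n for gene, n in counts.items() if n > 1}
    return total_hits, unique_genes, recurrent


total_hits, unique_genes, recurrent = summarize_hits(significant_genes)
print(total_hits, unique_genes, sorted(recurrent.items()))
```

In this toy input there are seven gene-by-pair hits, four unique genes, and two recurrent genes, which mirrors how a total such as 341 hits can correspond to a smaller number of unique genes.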

figure 4

Overview of pleiotropic genes (highlighted in all three signals) for the autoimmune disorders and B-ALL. Note: The signals represent hits of genes across different trait pairs. eQTL expression quantitative trait loci, SMR summary-based Mendelian randomization, AD autoimmune disorders, B-ALL B-cell acute lymphoblastic leukemia, AOA adult-onset asthma, HT hypothyroidism, PBC primary biliary cirrhosis, IBD inflammatory bowel disease, CD Crohn’s disease, RA rheumatoid arthritis, MS multiple sclerosis

The shared mechanism between autoimmune diseases and B-ALL may involve specific organs or tissues. Numerous genes (e.g., TOP2A, IKZF3, MYB, and CD80) showed significant differential expression in EBV-transformed lymphocytes, and APOBR, IKZF1, and IL7R showed significant differential expression in spleen and whole blood tissues (Additional file 1: Fig. S10 and Additional file 2: Table S9). Tissue enrichment analysis showed that these genes were also enriched in the spleen and EBV-transformed lymphocytes (Additional file 1: Fig. S11 and Additional file 2: Table S10). Additional S-LDSC based on multiple tissues identified significant SNP heritability enrichment for all autoimmune diseases (except AOA) in each of the monocytes, blood cells, and spleen, after adjusting for the baseline model (Additional file 1: Fig. S12 and Additional file 2: Table S11). Further enrichment analysis of the GO biological processes associated with these genes indicated higher enrichment in the cellular response to cytokine stimulus, B cell activation, response to tumor necrosis factor, inflammatory response, and receptor signaling pathway via JAK-STAT (Fig. 5A). These pathways play important roles in immune regulation and leukemogenesis. Cell type enrichment analysis showed the highest significance for bone marrow naïve T cells (Fig. 5B). Furthermore, we found that these genes were numerically enriched in several immunologic signatures (e.g., MEMORY VS CD21HIGH TRANSITIONAL BCELL DN) (Fig. 5C). The PPI analysis yielded five PPI networks, which involved the JAK-STAT signaling pathway and multiple pathways related to DNA damage; 22 proteins (e.g., STAT, NFKB1, and GATA3) could participate in these pathways (Fig. 5D). The results also suggest that heritability is enriched in the blood, EBV-transformed lymphocytes, whole blood, and palatine tonsil tissues among five or more autoimmune diseases and B-ALL.

figure 5

A Pathway enrichments for identified pleiotropic genes (KEGG, GO, Wiki pathways). B Cell-type enrichments for identified pleiotropic genes. C Immune signatures enrichments for identified pleiotropic genes. D Protein–protein interaction analysis based on identified pleiotropic genes

Immune-related mechanisms shared between autoimmune disorders and B-ALL

The shared mechanism involving affected tissues such as the spleen, lymphocytes, and whole blood suggested that immune mechanisms play an important role in the relationship between these diseases. We used the S-LDSC method to determine the heritability enrichment of pleiotropy in immune cells and the HyPrColoc method to identify immune cells with colocalization signals at pleiotropic loci. S-LDSC showed heritability enrichment in B cells for both autoimmune diseases and B-ALL. When analyzing the enrichment of immune traits from ImmGen, we also observed that two cell traits in the B cell panel were enriched: B.FrE.BM (CD19+IgM+AA4.1+HSA+) and preB.FrD.BM (CD19+IgM−CD45R+CD43−). Additionally, numerous cell traits in the T cell panel were identified, implying potential shared immune mechanisms (Additional file 1: Fig. S12 and Additional file 2: Table S11). Multi-trait colocalization analysis using HyPrColoc was then performed to pinpoint key immune cells (Additional file 2: Table S12). The results highlight 59 pleiotropic loci, of which 19 were unique, and these loci support the important role of 42 unique immune cells in autoimmune diseases and B-ALL through shared causal variants. Our results support the critical influence of BAFF-R, CD4, CD45, and CD28 on different cells. Notably, a total of six BAFF-R-related immune traits were observed: BAFF-R on B cell, BAFF-R on CD20−, BAFF-R on CD24+CD27+, BAFF-R on IgD+CD24−, BAFF-R on IgD+CD24+, and BAFF-R on IgD+CD38−. Interestingly, BAFF-R on B cell and BAFF-R on CD24+CD27+ were both shared among three trait pairs (i.e., B-ALL&IBD, B-ALL&PBC, B-ALL&RA).

The causal relationship between autoimmune diseases and B-ALL estimated by MR

MR analyses using the IVW method showed significant positive associations between two autoimmune diseases (AOA and RA) and B-ALL risk (Fig. 6A and Additional file 2: Table S13). The risk of B-ALL increased as the risk of AOA increased, with the effect size estimated by the IVW method (OR = 1.223, 95% CI = 1.048–1.426, P = 0.010). Another four methods (DIVW, MR-RAPS, MR-PRESSO, and the slope of MR-Egger) were consistent with the IVW results. Although a significant MR-Egger intercept might indicate potential horizontal pleiotropy, the MR-PRESSO global test ruled out this possibility (P = 0.632). We also observed a significant causal effect of RA onset on B-ALL risk using the IVW method (OR = 1.117, 95% CI = 1.033–1.208, P = 0.005). DIVW, MR-RAPS, and MR-PRESSO supported this association (Fig. 6B), and the MR-Egger intercept and the MR-PRESSO global test ruled out horizontal pleiotropy (Additional file 2: Table S14). Additional scatter and funnel plots eliminated the possibility of potential outliers (Fig. 6C–D). However, after Bonferroni adjustment, no causal association between autoimmune disorders and B-ALL remained statistically significant (threshold P < 0.05/16 ≈ 0.003). Finally, reverse MR analysis ruled out the possibility of reverse causality.
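To make the multiple-testing step concrete, the sketch below computes the Bonferroni threshold for 16 tests and approximately recovers two-sided P values from the reported odds ratios and 95% confidence intervals, assuming a normal approximation on the log-odds scale. The function name `p_from_or_ci` is an illustrative helper and this is a back-of-the-envelope check, not the MR software used in the study; small differences from the published P values are expected because the inputs are rounded.

```python
import math


def p_from_or_ci(odds_ratio, ci_low, ci_high, z_crit=1.959964):
    """Approximate a two-sided P value from an OR and its 95% CI,
    assuming the log(OR) estimate is normally distributed."""
    beta = math.log(odds_ratio)
    # The CI spans 2 * z_crit standard errors on the log scale.
    se = (math.log(ci_high) - math.log(ci_low)) / (2.0 * z_crit)
    z = beta / se
    # Two-sided tail probability of a standard normal: 2 * (1 - Phi(|z|)).
    return math.erfc(abs(z) / math.sqrt(2.0))


# Bonferroni threshold for 16 tests, as stated in the text.
bonferroni_threshold = 0.05 / 16  # 0.003125

# IVW estimates as reported: (OR, lower 95% CI, upper 95% CI).
ivw_estimates = {
    "AOA": (1.223, 1.048, 1.426),
    "RA": (1.117, 1.033, 1.208),
}

approx_p = {name: p_from_or_ci(*values) for name, values in ivw_estimates.items()}

for name, p in approx_p.items():
    survives = p < bonferroni_threshold
    print(f"{name}: approx P = {p:.4f}, passes Bonferroni: {survives}")
```

Both approximate P values are below 0.05 but above 0.05/16, which is consistent with the statement that neither association remained significant after Bonferroni adjustment.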

figure 6

A Forest plot showing causal associations between autoimmune disorders and B-ALL from one-directional MR analysis. Note: Causal effects were estimated using the IVW method. B Forest plot showing causal effects of AOA and RA on B-ALL risk estimated using different methods. C Scatter plot for the significant causal association between AOA and B-ALL risk. D Funnel plot for the significant causal association between AOA and B-ALL risk. E Scatter plot for the significant causal association between RA and B-ALL risk. F Funnel plot for the significant causal association between RA and B-ALL risk. Associations highlighted in red were significant in more than three of the main MR methods

Given the critical contribution of B cells to autoimmune disorders and B-ALL, there may be a complex relationship between them [60]. This study employed comprehensive genetic methods to investigate the genetic correlation between autoimmune disorders and B-ALL. We determined pleiotropic loci using cross-trait PLACO analysis and identified pleiotropic genes using MAGMA. We then identified the key pathways and immunological mechanisms involved. Finally, comprehensive MR analysis and sensitivity analyses established the causal relationships between autoimmune diseases and B-ALL.

Through genetic correlation analysis, we observed significant genetic overlap between B-ALL and seven autoimmune disorders: AOA, HT, IBD, CD, PBC, RA, and MS. We provide strong evidence for a shared genetic mechanism between RA and B-ALL, as well as MR evidence suggesting that patients with RA symptoms should be alerted to the risk of progression to ALL, which is consistent with previous studies [7, 8]. Additionally, one study showed that 34 of 699 ALL patients diagnosed and followed had previously received varying doses of steroids for aplastic events or arthritis-based rheumatic diseases [61]. Because MR methods use genetic variants as instruments, they can largely avoid the influence of possible confounding factors. Therefore, we believe that, in addition to the effect of immunosuppressants, RA itself also plays an important role in the risk of B-ALL. We also observed a significant causal effect of AOA on B-ALL risk, a relationship that was ambiguous in previous studies: a systematic review supported a protective effect of asthma on ALL [62], whereas two studies showed significantly higher risks of ALL in patients with a history of asthma [63, 64].

We identified a series of genetic risk loci associated with both autoimmune diseases and B-ALL, some of which were observed in multiple phenotype pairs (e.g., 7p12.2, 10p14, 6q27, 10p12.31). Previous studies provide evidence of the key role these loci play in the development of autoimmune disorders and B-ALL. For example, loci on 7p12.2 (IKZF1) have been shown to be associated with the risk of childhood B-ALL [65], and IKZF1 has also been identified as a susceptibility gene for SLE [66]. In the GWAS catalog, 7p12.2 has been reported to be associated with multiple autoimmune disorders, including CD [23], IBD [23], RA [25], and MS [67]. GATA3 (10p14) is a key regulator in the immune system, especially in the differentiation and function of type 2 helper (Th2) cells [68]. Th2 cells have been demonstrated to play a role in various autoimmune diseases, including SLE and IBD [69, 70]. Recent research also highlighted the role of a noncoding genetic variant (rs3824462) in GATA3, linking it to an increased risk of Ph-like ALL, a common subtype of B-ALL. That study revealed that rs3824462 induced local and global changes in chromatin conformation, activating the JAK-STAT pathway and promoting disease development [71].

We searched for the identified risk loci in the GWAS catalog (last updated on 20 December 2023) [72] and found that some of the risk loci have been reported to be associated with both B-ALL and autoimmune disorders (Additional file 1: Fig. S13 and Additional file 2: Table S15). For instance, the 17q21 locus is implicated in various autoimmune diseases, including asthma [73, 74], IBD [75, 76], T1D [77], and SLE [78]. This locus, which houses IKZF3, GSDMB, and ORMDL3, genes involved in lymphocyte development [79], pyroptosis [80], and inflammatory response [81], has been challenging to dissect. GSDMB and ORMDL3 are the target genes of rs2290400, whose minor allele is associated with a protective effect against ALL [82]. IKZF3 polymorphism contributes to B-ALL with a 1.5-fold to twofold increase in relative risk [83]. Genes previously reported to be associated with leukemia were also correlated with autoimmune diseases in our results: MLLT10 (10p12) participates in various chromosomal rearrangements associated with ALL and acute myeloid leukemia (AML) [84]. It is implicated in chromatin structure regulation and the DNA damage response, which are deemed crucial for the early development, maintenance, and differentiation of hematopoietic stem cells. While direct evidence for the impact of MLLT10 on autoimmune diseases has not been established, studies have indicated a close association with C-reactive protein levels [85], which are widely recognized as a valuable indicator of disease activity in various autoimmune rheumatic diseases [86]. Conversely, certain genes previously reported to be associated with autoimmune disorders were also associated with B-ALL in our results. IRGM (5q33) encodes a member of the immunity-related GTPase family, which is crucial in innate immunity and inflammatory responses [87]. Previous studies have linked IRGM to CD [88, 89, 90], UC [91], and IBD [23]. CAPSL (5p13) has been reported to be associated with PBC [92], T1D [93], asthma [94], and SLE [95]. Although direct evidence of its association with ALL is lacking, increased mRNA levels have been observed in AML patients [96]. Additionally, the long non-coding RNA C5orf56 (5q31) has been identified for its protective role in IBD [97]. SCHIP1 (3q25) has been associated with SLE [98], while RNASET2 (6q27) has been identified as a risk gene for both vitiligo [99] and GD [100].

Shared genetic structures observed in our research revealed common mechanisms between autoimmune disease and B-ALL. The identified genes participate in several pathways, such as B cell activation, cellular response to cytokine stimulus, and inflammatory response. For each disease pair, we observed a significant enrichment of pleiotropy in the spleen, a critical site for B cell development. Notably, a substantial presence of immune signatures related to BAFF-R, a key regulator of B cell function and survival, was discerned in the multi-trait colocalization analysis. These findings collectively underscore the pivotal role played by B cells in both autoimmune disorders and B-ALL. In autoimmune conditions, B cells are exposed to antigens, undergo activation, and subsequently proliferate and expand clonally, thereby increasing the risk of accumulating genetic mutations, which may finally lead to the emergence and progression of B-ALL [60]. ORMDL3 and IKZF3, mentioned earlier, likely play crucial roles in this context, as prior literature reports ORMDL3’s vital role in B cell survival [101] and IKZF3’s predominant regulation of B cell differentiation, activation response, and proliferation [102]. Furthermore, malignancies arising from B cells consistently exhibit concurrent autoimmune disorders at any stage, whereas those derived from T cells are less commonly linked to autoimmune phenomena [103]. Nevertheless, our findings also identified numerous cell traits in the T cell panel, and we speculate that this may be attributed to interactions between B and T cells. The JAK-STAT pathway may represent a crucial mechanism in this context, as it has been targeted in autoimmune diseases [104] and its role in B-ALL involves the disruption of preleukemic cell differentiation [105]. Our results highlighted the critical role of EBV infection as a trigger for both autoimmune disorders and B-ALL: tissue-specific analysis revealed enriched risk loci in EBV-transformed lymphocytes, and the central role of IKZF1 in these cells was also identified by gene-level analyses. EBV remains latent in memory B cells after infection, and reactivation can induce B cell clonal immortalization, promoting lymphomagenesis [106]. Additionally, EBV-induced autoimmunity has been reported to increase the risk of autoimmune diseases [107].

Limitations

Our study is not without limitations. Firstly, as with other similar studies, the data used in this study were summary-level, and individual-level datasets were not available. Further stratification of the population (e.g., by gender or age) was therefore not possible. Secondly, the sample size of the immune cell GWAS used in this study was limited. Therefore, caution should be exercised in interpreting the role of immune cells and in drawing conclusions from our study. Thirdly, our study was limited to European ancestry and may not be generalizable to other ancestries. Finally, our findings should be interpreted with equal caution because the relatively small sample size of B-ALL may result in limited statistical power.

Our research has uncovered the intricate connections between autoimmune disorders (especially AOA, HT, IBD, CD, RA, and MS) and B-ALL. Identification of pleiotropic risk loci (7p12, 10p14, 6q27, and 10p12) and genes (IKZF1, GATA3, IKZF3, GSDMB, and ORMDL3) shared between diseases suggested shared mechanisms, such as B cell activation and the JAK-STAT pathway, and common triggers such as EBV infection. Additionally, our findings have shed light on the causal links between autoimmune disorders (AOA and RA) and B-ALL.

Availability of data and materials

Data are available in public, open-access repositories corresponding to the original studies (e.g., GWAS catalog). The main code used in our research is available at https://github.com/biostatYu/MRcode/tree/main/AD_BALL .

Abbreviations

AML: Acute myeloid leukemia
AOA: Adult-onset asthma
B-ALL: B-cell acute lymphoblastic leukemia
CD: Crohn’s disease
COA: Childhood-onset asthma
DEGs: Differentially expressed genes
DIVW: Debiased-inverse variance weighted
EBV: Epstein–Barr virus
eQTL: Expression quantitative trait loci
FUMA: Functional mapping and annotation
GD: Graves’ disease
GWAS: Genome-wide association study
HD: Hashimoto's disease
HDL: High-definition likelihood
HT: Hypothyroidism
HyPrColoc: Hypothesis prioritization for multi-trait colocalization
IBD: Inflammatory bowel disease
IUT: Intersection–union test
IVs: Instrumental variables
IVW: Inverse variance weighted
LD: Linkage disequilibrium
LDSC: Linkage disequilibrium score regression
MAGMA: Multi-marker analysis of genomic annotation
MR: Mendelian randomization
MR-PRESSO: Mendelian randomization pleiotropy residual sum and outlier
MR-RAPS: Mendelian randomization robust adjusted profile score
MS: Multiple sclerosis
MSigDB: Molecular signatures database
PBC: Primary biliary cirrhosis
PLACO: Pleiotropic analysis under composite null hypothesis
PSC: Primary sclerosing cholangitis
PVE: Proportion of variance explained
RA: Rheumatoid arthritis
SE: Standard errors
S-LDSC: Stratified-linkage disequilibrium score regression
SLE: Systemic lupus erythematosus
SSc: Systemic sclerosis
T1D: Type 1 diabetes
UC: Ulcerative colitis

Hunger SP, Mullighan CG. Acute lymphoblastic leukemia in children. N Engl J Med. 2015;373(16):1541–52.

Chi X, Huang M, Tu H, Zhang B, Lin X, Xu H, et al. Innate and adaptive immune abnormalities underlying autoimmune diseases: the genetic connections. Sci China Life Sci. 2023;66(7):1482–517.

Lin X, Lu L. B Cell-mediated autoimmune diseases. Adv Exp Med Biol. 2020;1254:145–60.

Shaffer AL, Rosenwald A, Staudt LM. Lymphoid malignancies: the dark side of B-cell differentiation. Nat Rev Immunol. 2002;2(12):920–32.

Smedby KE, Hjalgrim H, Askling J, Chang ET, Gregersen H, Porwit-MacDonald A, et al. Autoimmune and chronic inflammatory disorders and risk of non-Hodgkin lymphoma by subtype. J Natl Cancer Inst. 2006;98(1):51–60.

Ekström Smedby K, Vajdic CM, Falster M, Engels EA, Martínez-Maza O, Turner J, et al. Autoimmune disorders and risk of non-Hodgkin lymphoma subtypes: a pooled analysis within the InterLymph Consortium. Blood. 2008;111(8):4029–38.

Hemminki K, Huang W, Sundquist J, Sundquist K, Ji J. Autoimmune diseases and hematological malignancies: exploring the underlying mechanisms from epidemiological evidence. Semin Cancer Biol. 2020;64:114–21.

Cabral DA, Tucker LB. Malignancies in children who initially present with rheumatic complaints. J Pediatr. 1999;134(1):53–7.

Li Y, Xie X, Jie Z, Zhu L, Yang JY, Ko CJ, et al. DYRK1a mediates BAFF-induced noncanonical NF-κB activation to promote autoimmunity and B-cell leukemogenesis. Blood. 2021;138(23):2360–71.

Bulik-Sullivan B, Finucane HK, Anttila V, Gusev A, Day FR, Loh P-R, et al. An atlas of genetic correlations across human diseases and traits. Nat Genet. 2015;47(11):1236.

Gong W, Guo P, Li Y, Liu L, Yan R, Liu S, et al. Role of the gut-brain axis in the shared genetic etiology between gastrointestinal tract diseases and psychiatric disorders: a genome-wide pleiotropic analysis. JAMA Psychiat. 2023;80(4):360–70.

Yu XH, Yang YQ, Cao RR, Cai MK, Zhang L, Deng FY, et al. Rheumatoid arthritis and osteoporosis: shared genetic effect, pleiotropy and causality. Hum Mol Genet. 2021;30(21):1932–40.

Lu H, Qiao J, Shao Z, Wang T, Huang S, Zeng P. A comprehensive gene-centric pleiotropic association analysis for 14 psychiatric disorders with GWAS summary statistics. BMC Med. 2021;19(1):314.

Ray D, Chatterjee N. A powerful method for pleiotropic analysis under composite null hypothesis identifies novel shared loci between type 2 diabetes and prostate cancer. Plos Genet. 2020;16(12):e1009218.

Ferreira MAR, Mathur R, Vonk JM, Szwajda A, Brumpton B, Granell R, et al. Genetic architectures of childhood- and adult-onset asthma are partly distinct. Am J Hum Genet. 2019;104(4):665–84.

Ferreira MAR, Mathur R, Vonk JM, Szwajda A, Brumpton B, Granell R, et al. Genetic architectures of childhood- and adult-onset asthma are partly distinct. https://www.ebi.ac.uk/gwas/studies/GCST007800 . (2019).

Sakaue S, Kanai M, Tanigawa Y, Karjalainen J, Kurki M, Koshiba S, et al. A cross-population atlas of genetic associations for 220 human phenotypes. Nat Genet. 2021;53(10):1415–24.

Sakaue S, Kanai M, Tanigawa Y, Karjalainen J, Kurki M, Koshiba S, et al. A cross-population atlas of genetic associations for 220 human phenotypes. https://pheweb.jp/downloads . (2021).

Cordell HJ, Fryett JJ, Ueno K, Darlay R, Aiba Y, Hitomi Y, et al. An international genome-wide meta-analysis of primary biliary cholangitis: novel risk loci and candidate drugs. J Hepatol. 2021;75(3):572–81.

Cordell HJ, Fryett JJ, Ueno K, Darlay R, Aiba Y, Hitomi Y, et al. An international genome-wide meta-analysis of primary biliary cholangitis: novel risk loci and candidate drugs. https://www.ebi.ac.uk/gwas/studies/GCST90061440 . (2021).

Ji SG, Juran BD, Mucha S, Folseraas T, Jostins L, Melum E, et al. Genome-wide association study of primary sclerosing cholangitis identifies new risk loci and quantifies the genetic relationship with inflammatory bowel disease. Nat Genet. 2017;49(2):269–73.

Ji SG, Juran BD, Mucha S, Folseraas T, Jostins L, Melum E, et al. Genome-wide association study of primary sclerosing cholangitis identifies new risk loci and quantifies the genetic relationship with inflammatory bowel disease. https://www.ebi.ac.uk/gwas/studies/GCST004030 . (2017).

de Lange KM, Moutsianas L, Lee JC, Lamb CA, Luo Y, Kennedy NA, et al. Genome-wide association study implicates immune activation of multiple integrin genes in inflammatory bowel disease. Nat Genet. 2017;49(2):256–61.

de Lange KM, Moutsianas L, Lee JC, Lamb CA, Luo Y, Kennedy NA, et al. Genome-wide association study implicates immune activation of multiple integrin genes in inflammatory bowel disease. https://www.ebi.ac.uk/gwas/studies/GCST004131 . (2017).

Ishigaki K, Sakaue S, Terao C, Luo Y, Sonehara K, Yamaguchi K, et al. Multi-ancestry genome-wide association analyses identify novel genetic mechanisms in rheumatoid arthritis. Nat Genet. 2022;54(11):1640–51.

Ishigaki K, Sakaue S, Terao C, Luo Y, Sonehara K, Yamaguchi K, et al. Multi-ancestry genome-wide association analyses identify novel genetic mechanisms in rheumatoid arthritis. https://www.ebi.ac.uk/gwas/studies/GCST90132223 . (2022).

Bentham J, Morris DL, Graham DSC, Pinder CL, Tombleson P, Behrens TW, et al. Genetic association analyses implicate aberrant regulation of innate and adaptive immunity genes in the pathogenesis of systemic lupus erythematosus. Nat Genet. 2015;47(12):1457–64.

Bentham J, Morris DL, Graham DSC, Pinder CL, Tombleson P, Behrens TW, et al. Genetic association analyses implicate aberrant regulation of innate and adaptive immunity genes in the pathogenesis of systemic lupus erythematosus. https://www.ebi.ac.uk/gwas/studies/GCST003156 . (2015).

Consortium. IMSG. Multiple sclerosis genomic map implicates peripheral immune cells and microglia in susceptibility. Science (New York, NY). 2019;365(6460):7188.

Consortium. IMSG. Multiple sclerosis genomic map implicates peripheral immune cells and microglia in susceptibility. https://www.ebi.ac.uk/gwas/studies/GCST009597 . (2019).

López-Isac E, Acosta-Herrera M, Kerick M, Assassi S, Satpathy AT, Granja J, et al. GWAS for systemic sclerosis identifies multiple risk loci and highlights fibrotic and vasculopathy pathways. Nat Commun. 2019;10(1):4955.

López-Isac E, Acosta-Herrera M, Kerick M, Assassi S, Satpathy AT, Granja J, et al. GWAS for systemic sclerosis identifies multiple risk loci and highlights fibrotic and vasculopathy pathways. https://www.ebi.ac.uk/gwas/studies/GCST009131 . (2019).

Jin Y, Andersen G, Yorgov D, Ferrara TM, Ben S, Brownson KM, et al. Genome-wide association studies of autoimmune vitiligo identify 23 new risk loci and highlight key pathways and regulatory variants. Nat Genet. 2016;48(11):1418–24.

Jin Y, Andersen G, Yorgov D, Ferrara TM, Ben S, Brownson KM, et al. Genome-wide association studies of autoimmune vitiligo identify 23 new risk loci and highlight key pathways and regulatory variants. https://www.ebi.ac.uk/gwas/studies/GCST004785 . (2016).

Vijayakrishnan J, Qian M, Studd JB, Yang W, Kinnersley B, Law PJ, et al. Identification of four novel associations for B-cell acute lymphoblastic leukaemia risk. Nat Commun. 2019;10(1):5348.

Vijayakrishnan J, Qian M, Studd JB, Yang W, Kinnersley B, Law PJ, et al. Identification of four novel associations for B-cell acute lymphoblastic leukaemia risk. https://www.ebi.ac.uk/gwas/studies/GCST009638 . (2019).

Consortium GP. A global reference for human genetic variation. Nature. 2015;526(7571):68.

Ning Z, Pawitan Y, Shen X. High-definition likelihood inference of genetic correlations across human complex traits. Nat Genet. 2020;52(8):859–64.

GTEx Consortium. The Genotype-Tissue Expression (GTEx) project. Nat Genet. 2013;45(6):580–5.

ImmGen Consortium. Open-source ImmGen: mononuclear phagocytes. Nat Immunol. 2016;17(7):741.

Watanabe K, Taskesen E, van Bochoven A, Posthuma D. Functional mapping and annotation of genetic associations with FUMA. Nat Commun. 2017;8(1):1826.

Giambartolomei C, Vukcevic D, Schadt EE, Franke L, Hingorani AD, Wallace C, et al. Bayesian test for colocalisation between pairs of genetic association studies using summary statistics. Plos Genet. 2014;10(5):e1004383.

de Leeuw CA, Mooij JM, Heskes T, Posthuma D. MAGMA: generalized gene-set analysis of GWAS data. Plos Comput Biol. 2015;11(4):e1004219.

Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, Gillette MA, et al. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci. 2005;102(43):15545–50.

Carithers LJ, Ardlie K, Barcus M, Branton PA, Britton A, Buia SA, et al. A novel approach to high-quality postmortem tissue procurement: the GTEx project. Biopreserv Biobanking. 2015;13(5):311–9.

Zhu Z, Zhang F, Hu H, Bakshi A, Robinson MR, Powell JE, et al. Integration of summary data from GWAS and eQTL studies predicts complex trait gene targets. Nat Genet. 2016;48(5):481–7.

Foley CN, Staley JR, Breen PG, Sun BB, Kirk PDW, Burgess S, et al. A fast and efficient colocalization algorithm for identifying shared genetic risk factors across multiple traits. Nat Commun. 2021;12(1):764.

Orrù V, Steri M, Sidore C, Marongiu M, Serra V, Olla S, et al. Complex genetic signatures in immune cells underlie autoimmunity and inform therapy. Nat Genet. 2020;52(10):1036–45.

Genomes Project C, Auton A, Brooks LD, Durbin RM, Garrison EP, Kang HM, et al. A global reference for human genetic variation. Nature. 2015;526(7571):68–74.

Burgess S, Thompson SG. Avoiding bias from weak instruments in Mendelian randomization studies. Int J Epidemiol. 2011;40(3):755–64.

Ye T, Shao J, Kang H. Debiased inverse-variance weighted estimator in two-sample summary-data Mendelian randomization. Ann Stat. 2021;49(4):2079–100.

Bowden J, Smith GD, Haycock PC, Burgess S. Consistent estimation in Mendelian randomization with some invalid instruments using a weighted median estimator. Genet Epidemiol. 2016;40(4):304–14.

Verbanck M, Chen C-Y, Neale B, Do R. Detection of widespread horizontal pleiotropy in causal relationships inferred from Mendelian randomization between complex traits and diseases. Nat Genet. 2018;50(5):693.

Burgess S, Thompson SG. Interpreting findings from Mendelian randomization using the MR-Egger method. Eur J Epidemiol. 2017;32(5):377–89.

Zhao Q, Wang J, Hemani G, Bowden J, Small DS. Statistical inference in two-sample summary-data Mendelian randomization using robust adjusted profile score. Ann Statist. 2020;48(3):1742–69. https://doi.org/10.1214/19-AOS1866 .

Hartwig FP, Davey Smith G, Bowden J. Robust inference in summary data Mendelian randomization via the zero modal pleiotropy assumption. Int J Epidemiol. 2017;46(6):1985–98.

Thompson SG, Sharp SJ. Explaining heterogeneity in meta-analysis: a comparison of methods. Stat Med. 1999;18(20):2693–708.

Bowden J, Del Greco MF, Minelli C, Davey Smith G, Sheehan NA, Thompson JR. Assessing the suitability of summary data for two-sample Mendelian randomization analyses using MR-Egger regression: the role of the I2 statistic. Int J Epidemiol. 2016;45(6):1961–74.

Yavorska OO, Burgess S. MendelianRandomization: an R package for performing Mendelian randomization analyses using summarized data. Int J Epidemiol. 2017;46(6):1734–9.

Baecklund E, Smedby KE, Sutton LA, Askling J, Rosenquist R. Lymphoma development in patients with autoimmune and inflammatory disorders–what are the driving forces? Semin Cancer Biol. 2014;24:61–70.

Révész T, Kardos G, Kajtár P, Schuler D. The adverse effect of prolonged prednisolone pretreatment in children with acute lymphoblastic leukemia. Cancer. 1985;55(8):1637–40.

Zhou MH, Yang QM. Association of asthma with the risk of acute leukemia and non-Hodgkin lymphoma. Mol Clin Oncol. 2015;3(4):859–64.

Chang JS, Tsai YW, Tsai CR, Wiemels JL. Allergy and risk of childhood acute lymphoblastic leukemia: a population-based and record-based study. Am J Epidemiol. 2012;176(11):970–8.

Spector L, Groves F, DeStefano F, Liff J, Klein M, Mullooly J, et al. Medically recorded allergies and the risk of childhood acute lymphoblastic leukaemia. Eur J Cancer (Oxford, England : 1990). 2004;40(4):579–84.

Papaemmanuil E, Hosking FJ, Vijayakrishnan J, Price A, Olver B, Sheridan E, et al. Loci on 7p12.2, 10q21.2 and 14q11.2 are associated with risk of childhood acute lymphoblastic leukemia. Nat Genet. 2009;41(9):1006–10.

Han JW, Zheng HF, Cui Y, Sun LD, Ye DQ, Hu Z, et al. Genome-wide association study in a Chinese Han population identifies nine new susceptibility loci for systemic lupus erythematosus. Nat Genet. 2009;41(11):1234–7.

Beecham AH, Patsopoulos NA, Xifara DK, Davis MF, Kemppinen A, Cotsapas C, et al. Analysis of immune-related loci identifies 48 new susceptibility variants for multiple sclerosis. Nat Genet. 2013;45(11):1353–60.

Zhu J. GATA3 regulates the development and functions of innate lymphoid cell subsets at multiple stages. Front Immunol. 2017;8:1571.

Ramirez GA, Tassi E, Noviello M, Mazzi BA, Moroni L, Citterio L, et al. Histone‐specific CD4+ T cell plasticity in active and quiescent systemic lupus erythematosus. Arthritis Rheumatol. 2024.

Liu A, Liang X, Wang W, Wang C, Song J, Guo J, et al. Human umbilical cord mesenchymal stem cells ameliorate colon inflammation via modulation of gut microbiota-SCFAs-immune axis. Stem Cell Res Ther. 2023;14(1):271.

Yang H, Zhang H, Luan Y, Liu T, Yang W, Roberts KG, et al. Noncoding genetic variation in GATA3 increases acute lymphoblastic leukemia risk through local and global changes in chromatin conformation. Nat Genet. 2022;54(2):170–9.

Sollis E, Mosaku A, Abid A, Buniello A, Cerezo M, Gil L, et al. The NHGRI-EBI GWAS catalog: knowledgebase and deposition resource. Nucleic Acids Res. 2023;51(D1):D977–85.

Moffatt MF, Kabesch M, Liang L, Dixon AL, Strachan D, Heath S, et al. Genetic variants regulating ORMDL3 expression contribute to the risk of childhood asthma. Nature. 2007;448(7152):470–3.

Li X, Christenson SA, Modena B, Li H, Busse WW, Castro M, et al. Genetic analyses identify GSDMB associated with asthma severity, exacerbations, and antiviral pathways. J Allergy Clin Immunol. 2021;147(3):894–909.

Jostins L, Ripke S, Weersma RK, Duerr RH, McGovern DP, Hui KY, et al. Host-microbe interactions have shaped the genetic architecture of inflammatory bowel disease. Nature. 2012;491(7422):119–24.

Chao KL, Kulakova L, Herzberg O. Gene polymorphism linked to increased asthma and IBD risk alters gasdermin-B structure, a sulfatide and phosphoinositide binding protein. Proc Natl Acad Sci USA. 2017;114(7):E1128–37.

Barrett JC, Clayton DG, Concannon P, Akolkar B, Cooper JD, Erlich HA, et al. Genome-wide association study and meta-analysis find that over 40 loci affect risk of type 1 diabetes. Nat Genet. 2009;41(6):703–7.

Perez RK, Gordon MG, Subramaniam M, Kim MC, Hartoularos GC, Targ S, et al. Single-cell RNA-seq reveals cell type-specific molecular and genetic associations to lupus. Science (New York, NY). 2022;376(6589):eabf1970.

Morgan B, Sun L, Avitahl N, Andrikopoulos K, Ikeda T, Gonzales E, et al. Aiolos, a lymphoid restricted transcription factor that interacts with Ikaros to regulate lymphocyte differentiation. EMBO J. 1997;16(8):2004–13.

Li X, Zhang T, Kang L, Xin R, Sun M, Chen Q, et al. Apoptotic caspase-7 activation inhibits non-canonical pyroptosis by GSDMB cleavage. Cell Death Differ. 2023;30(9):2120–34.

Zhang Y, Willis-Owen SAG, Spiegel S, Lloyd CM, Moffatt MF, Cookson W. The ORMDL3 asthma gene regulates ICAM1 and has multiple effects on cellular inflammation. Am J Respir Crit Care Med. 2019;199(4):478–88.

Wiemels JL, Walsh KM, de Smith AJ, Metayer C, Gonseth S, Hansen HM, et al. GWAS in childhood acute lymphoblastic leukemia reveals novel genetic associations at chromosomes 17q12 and 8q24.21. Nat Commun. 2018;9(1):286.

Cobaleda C, Vicente-Dueñas C, Sanchez-Garcia I. Infectious triggers and novel therapeutic opportunities in childhood B cell leukaemia. Nat Rev Immunol. 2021;21(9):570–81.

Forgione MO, McClure BJ, Yeung DT, Eadie LN, White DL. MLLT10 rearranged acute leukemia: incidence, prognosis, and possible therapeutic strategies. Genes Chromosomes Cancer. 2020;59(12):709–21.

Said S, Pazoki R, Karhunen V, Võsa U, Ligthart S, Bodinier B, et al. Genetic analysis of over half a million people characterises C-reactive protein loci. Nat Commun. 2022;13(1):2198.

Zhang SL, Lin H, Huang F. Special diagnostic value of C-reactive protein in systemic autoimmune rheumatic diseases complicated with infections. Zhonghua Nei Ke Za Zhi. 2020;59(7):489–92.

CAS   PubMed   Google Scholar  

Singh SB, Davis AS, Taylor GA, Deretic V. Human IRGM induces autophagy to eliminate intracellular mycobacteria. Science (New York, NY). 2006;313(5792):1438–41.

Franke A, McGovern DP, Barrett JC, Wang K, Radford-Smith GL, Ahmad T, et al. Genome-wide meta-analysis increases to 71 the number of confirmed Crohn’s disease susceptibility loci. Nat Genet. 2010;42(12):1118–25.

Wellcome Trust Case Control Consortium. Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature. 2007;447(7145):661–78.

Barrett JC, Hansoul S, Nicolae DL, Cho JH, Duerr RH, Rioux JD, et al. Genome-wide association defines more than 30 distinct susceptibility loci for Crohnʼs disease. Nat Genet. 2008;40(8):955–62.

Liu JZ, van Sommeren S, Huang H, Ng SC, Alberts R, Takahashi A, et al. Association analyses identify 38 susceptibility loci for inflammatory bowel disease and highlight shared genetic risk across populations. Nat Genet. 2015;47(9):979–86.

Juran BD, Hirschfield GM, Invernizzi P, Atkinson EJ, Li Y, Xie G, et al. Immunochip analyses identify a novel risk locus for primary biliary cirrhosis at 13q14, multiple independent associations at four established risk loci and epistasis between 1p31 and 7q32 risk variants. Hum Mol Genet. 2012;21(23):5209–21.

Baxter AG, Jordan MA. From markers to molecular mechanisms: type 1 diabetes in the post-GWAS era. The review of diabetic studies : RDS. 2012;9(4):201–23.

Johansson Å, Rask-Andersen M, Karlsson T, Ek WE. Genome-wide association analysis of 350 000 Caucasians from the UK Biobank identifies novel loci for asthma, hay fever and eczema. Hum Mol Genet. 2019;28(23):4022–41.

Wang YF, Zhang Y, Lin Z, Zhang H, Wang TY, Cao Y, et al. Identification of 38 novel loci for systemic lupus erythematosus and genetic heterogeneity between ancestral groups. Nat Commun. 2021;12(1):772.

Uhlen M, Zhang C, Lee S, Sjöstedt E, Fagerberg L, Bidkhori G, et al. A pathology atlas of the human cancer transcriptome. Science (New York, NY). 2017;357(6352):eaan2507.

Ma H, Hu T, Tao W, Tong J, Han Z, Herndler-Brandstetter D, et al. A lncRNA from an inflammatory bowel disease risk locus maintains intestinal host-commensal homeostasis. Cell Res. 2023;33(5):372–88.

Khunsriraksakul C, Li Q, Markus H, Patrick MT, Sauteraud R, McGuire D, et al. Multi-ancestry and multi-trait genome-wide association meta-analyses inform clinical risk prediction for systemic lupus erythematosus. Nat Commun. 2023;14(1):668.

Quan C, Ren YQ, Xiang LH, Sun LD, Xu AE, Gao XH, et al. Genome-wide association study for vitiligo identifies susceptibility loci at 6q27 and the MHC. Nat Genet. 2010;42(7):614–8.

Chu X, Pan CM, Zhao SX, Liang J, Gao GQ, Zhang XM, et al. A genome-wide association study identifies two new risk loci for Graves’ disease. Nat Genet. 2011;43(9):897–901.

Dang J, Bian X, Ma X, Li J, Long F, Shan S, et al. ORMDL3 facilitates the survival of splenic B cells via an ATF6α-endoplasmic reticulum stress-Beclin1 autophagy regulatory pathway. J Immunol (Baltimore, Md : 1950). 2017;199(5):1647–59.

John LB, Ward AC. The Ikaros gene family: transcriptional regulators of hematopoiesis and immunity. Mol Immunol. 2011;48(9–10):1272–8.

Porpaczy E, Jäger U. How I manage autoimmune cytopenias in patients with lymphoid cancer. Blood. 2022;139(10):1479–88.

Banerjee S, Biehl A, Gadina M, Hasni S, Schwartz DM. JAK-STAT signaling as a target for inflammatory and autoimmune diseases: current and future prospects. Drugs. 2017;77(5):521–46.

Fregona V, Bayet M, Bouttier M, Largeaud L, Hamelle C, Jamrog LA, et al. Stem cell-like reprogramming is required for leukemia-initiating activity in B-ALL. J Exp Med. 2024;221(1):20230279.

Thorley-Lawson DA, Gross A. Persistence of the Epstein-Barr virus and the origins of associated lymphomas. N Engl J Med. 2004;350(13):1328–37.

Vietzen H, Berger SM, Kühner LM, Furlano PL, Bsteh G, Berger T, et al. Ineffective control of Epstein-Barr-virus-induced autoimmunity increases the risk for multiple sclerosis. Cell. 2023;186(26):5705–18.e13.

Download references

Acknowledgements

We thank all the studies for making the summary association statistics data publicly available. We are also very grateful to the editor and two referees for their insightful and constructive comments, which substantially improved our original manuscript.

This work was supported by National Key Research and Development Program (2022YFC2502700) to Y.X. and National Natural Science Foundation of China (82020108003 to D.W., 82070187 to Y.X.). D.W. was supported by Priority Academic Program Development of Jiangsu Higher Education Institutions (PAPD) and Jiangsu Provincial Medical Innovation Center (CXZX202201). Y.C. is also supported by Postgraduate Research and Practice Innovation Program of Jiangsu Province (KYCX23_3270). D.W. was supported by Suzhou Science and Technology Program Project (SLT201911). X.Y. was supported by Boxi Cultivation Program of the First Affiliated Hospital of Suzhou University (BXQN2023032).

Author information

Xinghao Yu and Yiyin Chen contributed equally to this work.

Authors and Affiliations

National Clinical Research Center for Hematologic Diseases, Jiangsu Institute of Hematology, The First Affiliated Hospital of Soochow University, Suzhou, China

Xinghao Yu, Yiyin Chen, Jia Chen, Yi Fan, Depei Wu & Yang Xu

Collaborative Innovation Center of Hematology, Institute of Blood and Marrow Transplantation, Soochow University, Suzhou, China

Xinghao Yu, Yiyin Chen, Depei Wu & Yang Xu

Department of Outpatient and Emergency, The First Affiliated Hospital of Soochow University, Suzhou, China

Contributions

DW and YX designed the study. XY obtained the data. XY and YC cleared up the datasets. XY, HL, and YC mainly performed the data analyses. YX, XY, YC, JC, YF, and HL drafted the manuscript. YX, XY, YC, and DW revised the manuscript, and all authors read and approved the final manuscript.

Corresponding authors

Correspondence to Depei Wu or Yang Xu.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Competing interests

All authors declared no potential conflicts of interest.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Additional file 1:

Supplementary Methods and Fig. S1-S13. Supplementary Methods - A supplementary document on GWAS quality control, PLACO method, colocalization analysis, MAGMA analysis, HyPrColoc method, immune cell data description, and Mendelian randomization analysis. Fig. S1. Manhattan plot of the PLACO results. Fig. S2. QQ plots for pleiotropic results performed by PLACO. Fig. S3. Regional plots of each colocalized locus (PP.H4 > 0.7) identified for corresponding trait pair (B-ALL&AOA) by using the PLACO. Fig. S4. Regional plots of each colocalized locus (PP.H4 > 0.7) identified for corresponding trait pair (B-ALL&HT) by using the PLACO. Fig. S5. Regional plots of each colocalized locus (PP.H4 > 0.7) identified for corresponding trait pair (B-ALL&PBC) by using the PLACO. Fig. S6. Regional plots of each colocalized locus (PP.H4 > 0.7) identified for corresponding trait pair (B-ALL&IBD) by using the PLACO. Fig. S7. Regional plots of each colocalized locus (PP.H4 > 0.7) identified for corresponding trait pair (B-ALL&MS) by using the PLACO. Fig. S8. Regional plot of each colocalized locus (PP.H4 > 0.7) identified for corresponding trait pair (B-ALL&RA) by using the PLACO. Fig. S9. Manhattan plot of MAGMA gene analysis. Fig. S10. Heatmap for expression values of pleiotropic genes in different tissues identified by MAGMA analysis. Fig. S11. Gene-set enrichment for identified pleiotropic genes. Red panels represent significant tissues after Bonferroni adjustment. Fig. S12. Heatmap of tissues and immune traits shared between autoimmune disorders and B-ALL identified by S-LDSC. Fig. S13. Heatmap shows whether the identified risk loci have been reported to be associated with B-ALL and AD in the previous studies after searching the GWAS catalog.

Additional file 2:

Table S1. Data sources. Table S2. Genetic correlation analysis conducted by LDSC and HDL. Table S3. Shared pleiotropic loci identified by PLACO. Table S4. Shared pleiotropic loci among different trait pairs. Table S5. MAGMA Gene-set analysis. Table S6. MAGMA tissue-specific analysis. Table S7. MAGMA gene analysis. Table S8. Information of pleiotropy genes identified by MAGMA. Table S9. Expression value of pleiotropy genes identified by MAGMA in different tissues from GTEx. Table S10. Tissue-specific enrichment of pleiotropy genes identified by MAGMA in different tissues from GTEx. Table S11. S-LDSC cell-type heritability enrichment analysis. Table S12. Multi-trait colocalization analysis highlighted key role of immune cells (PP>0.7). Table S13. Bi-direction MR analysis and sensitive analysis. Table S14. Bi-direction MR analysis and sensitive analysis. Table S15. Identified loci reported in previous GWAS analysis for ALL and AD.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ . The Creative Commons Public Domain Dedication waiver ( http://creativecommons.org/publicdomain/zero/1.0/ ) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

About this article

Cite this article

Yu, X., Chen, Y., Chen, J. et al. Shared genetic architecture between autoimmune disorders and B-cell acute lymphoblastic leukemia: insights from large-scale genome-wide cross-trait analysis. BMC Med 22, 161 (2024). https://doi.org/10.1186/s12916-024-03385-0

Received: 08 January 2024

Accepted: 08 April 2024

Published: 15 April 2024

DOI: https://doi.org/10.1186/s12916-024-03385-0

  • Genetic overlap
  • Autoimmune disease

BMC Medicine

ISSN: 1741-7015

Genome-Wide Association Studies and Beyond

John S. Witte

Institute for Human Genetics, Departments of Epidemiology and Biostatistics and Urology, University of California, San Francisco, San Francisco, California 94158-9001; jwitte@ucsf.edu

Genome-wide association studies (GWAS) provide an important avenue for undertaking an agnostic evaluation of the association between common genetic variants and risk of disease. Recent advances in our understanding of human genetic variation and the technology to measure such variation have made GWAS feasible. Over the past few years a multitude of GWAS have identified and replicated many associated variants. These findings are enriching our knowledge about the genetic basis of disease and leading some to advocate using GWA study results for genetic testing. For many of the GWA study results, however, the underlying mechanisms remain unclear and the findings explain only a limited amount of heritability. These issues may be clarified by more detailed investigations, including analyses of less common variants, sequence-level data, and environmental exposures. Such studies should help clarify the potential value of genetic testing to the public’s health.

INTRODUCTION

Genome-wide association studies (GWAS) compare common genetic variants in large numbers of affected cases to those in unaffected controls to determine whether an association with disease exists ( 34 , 55 ). GWAS have been made possible by the identification of millions of single nucleotide polymorphisms (SNPs) across the human genome and the realization that a subset of these SNPs can capture (“tag”) common genetic variation via linkage disequilibrium ( 16 ). In parallel, advances in microarray-based technology have allowed investigators to genotype efficiently enormous numbers of SNPs ( 77 ).

Before the recent flood of GWA study projects, linkage and candidate gene studies were used to try to decipher the genetic basis of disease. Linkage analysis evaluates markers widely spaced across the genome to determine whether they are inherited along with disease in families with numerous affected individuals ( 2 ). Linkage analysis, however, can have low power to detect common genetic risk factors for disease and has low resolution owing to the limited number of meioses within families ( 55 ). Candidate gene studies can overcome these issues, focusing directly on the association between disease and variants in particular genes that have a priori biological support for being involved with disease. This focus comes at a cost: Candidate gene studies ignore much of the genome, and thus are likely to miss many causal regions or genes and instead find many false-positive associations ( 21 , 36 ).

GWAS improve on these approaches, leveraging their strengths while overcoming weaknesses. In particular, GWAS have greater power than linkage studies to detect small to modest effects, even with an extremely strict alpha-level for statistical significance ( 55 ). Moreover, by casting a wide net of genetic markers across the entire genome, this approach does not require one to prespecify particular candidate genes for study and examines much of the common variation across the human genome. GWAS have convincingly detected hundreds of variants associated with a large number of diseases ( 19 ). Many of these findings are novel; associated SNPs in genes or chromosomal regions were not previously implicated in disease. These results are especially exciting in light of the previous difficulties replicating genetic findings for many diseases. For example, linkage and candidate gene studies of prostate cancer have had limited successes in replicating findings across studies, whereas GWAS have detected more than a dozen highly replicated genetic variants associated with this disease ( 80 ).

The enthusiasm surrounding GWA study findings is tempered, however, by the observation that genetic variants detected by most of these studies may not be causal for disease, may explain only a little of disease heritability, and may have limited public health impact ( 45 ). These issues have prompted some to question the value of current GWAS and to advocate shifting future research efforts to the study of less common genetic variation ( 12 ). In light of the mixed opinions of GWAS, we consider here the following important aspects of these studies: design and analysis, findings and implications, limitations, and future prospects.

OVERALL STRATEGY AND METHODS

Laying the Foundation

The ability to undertake GWAS is a direct result of a number of important developments over the past decade. Sequencing of the human genome provided the initial foundation for GWAS ( 35 , 74 ). Substantial efforts detected and validated more than ten million SNPs, which brought to light the common genetic variation across the human genome. It was then determined that much of this variation could be efficiently captured by a subset of “tag” SNPs via the phenomenon of linkage disequilibrium (LD) among neighboring SNPs ( 6 , 11 ). The International Haplotype Map (HapMap) Consortium proceeded to measure the LD structure across multiple ancestral populations ( 10 , 16 , 17 ).
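The idea that one SNP can "tag" its neighbors can be made concrete with a small sketch. Assuming genotypes are coded as minor-allele counts (0, 1, or 2) per subject, the squared Pearson correlation between two SNPs' counts is a commonly used approximation to the LD measure r² when haplotype phase is unknown. The function name, the toy genotypes, and the 0.8 tagging cutoff are illustrative assumptions, not values taken from the studies cited above.

```python
def genotype_r2(snp_a, snp_b):
    """Squared Pearson correlation between two SNPs' minor-allele counts.

    Each argument is a list of 0/1/2 genotype codes, one entry per subject.
    This approximates the LD measure r^2 when haplotype phase is unknown.
    """
    if len(snp_a) != len(snp_b) or len(snp_a) < 2:
        raise ValueError("need two equally long genotype lists")
    n = len(snp_a)
    mean_a = sum(snp_a) / n
    mean_b = sum(snp_b) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in zip(snp_a, snp_b))
    var_a = sum((a - mean_a) ** 2 for a in snp_a)
    var_b = sum((b - mean_b) ** 2 for b in snp_b)
    if var_a == 0 or var_b == 0:
        raise ValueError("a monomorphic SNP has no defined correlation")
    return cov * cov / (var_a * var_b)


# Toy data (hypothetical): the untyped neighbor differs from the typed SNP
# in a single subject.
typed = [0, 1, 2, 1, 0, 2, 1, 0]
neighbor = [0, 1, 2, 1, 0, 2, 1, 1]
TAG_THRESHOLD = 0.8  # assumed cutoff, chosen only for illustration
is_tagged = genotype_r2(typed, neighbor) >= TAG_THRESHOLD
```

When r² between a typed SNP and an untyped neighbor is high, testing the typed SNP recovers most of the association signal at the neighbor, which is why a subset of tag SNPs can stand in for much of the common variation.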

In conjunction with the increasing understanding of the human genome, technological advances in array-based genotyping of SNPs made feasible the simultaneous measurement of hundreds of thousands of SNPs ( 77 ). The number of variants assayed by these SNP arrays has rapidly increased, whereas the array prices have steadily decreased. At present, the arrays directly measure approximately one million SNPs while providing relatively high coverage of the common genetic variation across the human genome ( 26 ).

Multistage Study Designs

Sample sizes in the thousands are generally required for GWAS to have sufficient statistical power to detect the expected modest associations (e.g., odds ratios <1.5) while evaluating hundreds of thousands of SNPs ( 81 ). The large sample sizes and initial high cost of SNP arrays helped motivate the development and use of multistage GWA study designs ( 66 ). First, a subset of the study sample is genotyped in a discovery stage using the genome-wide SNP arrays. Then the most strongly associated SNPs are genotyped with a less-expensive genotyping platform in the remaining samples. This narrowing of the most promising SNPs can continue with additional replication stages, along with fine-mapping of associated regions.

The optimal division of samples across stages depends on a number of factors, but in general, the most efficient approach entails including approximately one-third to one-half of the samples in the initial stage and the remainder in the follow-up stages ( 57 ). How many of the most noteworthy SNPs should be subsequently tested depends on the sample sizes in the respective stages and how many false-negative results one is willing to accept. That said, the initial stages of GWAS may not clearly pinpoint SNPs that will be highly replicated by later stages; therefore, as many SNPs as feasible should be carried over from one GWA study stage to the next (e.g., >1% of the first stage SNPs should be typed in the second stage).
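The carry-over rule can be sketched in a few lines: rank the first-stage SNPs by p-value and keep the top fraction for second-stage genotyping. The function name, the SNP identifiers, and the p-values are hypothetical; only the idea of carrying forward a fixed share of first-stage SNPs comes from the text.

```python
import math


def select_for_followup(stage1_pvalues, fraction=0.01):
    """Return the SNP IDs with the smallest stage-one p-values.

    stage1_pvalues maps SNP ID -> p-value; `fraction` is the share of
    SNPs carried forward (at least one SNP is kept when any are given).
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    n_keep = max(1, math.ceil(fraction * len(stage1_pvalues)))
    ranked = sorted(stage1_pvalues, key=stage1_pvalues.get)
    return ranked[:n_keep]


# Hypothetical stage-one results for four SNPs; half are carried forward.
stage1 = {"snp1": 0.20, "snp2": 1e-6, "snp3": 0.03, "snp4": 4e-4}
followup = select_for_followup(stage1, fraction=0.5)
```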

One must also decide whether the early follow-up stages are considered part of a replication or a joint analysis. It seems most intuitive to use the follow-up data as a replication study, with the goal of confirming the initial findings ( 66 ). However, in light of the modest SNP associations and enormous multiple comparisons issue, to obtain sufficient power one is generally better off combining data from the first couple of stages in a joint analysis ( 57 ). One can view the joint analysis approach as a less expensive, single-stage GWA study: Fewer SNPs are typed in a second stage simply to reduce cost, but data from the first and second stages are combined, analyzed, and penalized for multiple comparisons as though a single-stage GWA study had been undertaken ( 28 ). Of course, even with a joint analysis of the first couple of stages, one must still replicate results with additional samples.

Note that decreasing SNP array costs have made multistage GWA study designs less essential. Genotyping 10,000–20,000 SNPs in a follow-up stage can cost just about as much as genotyping a genome-wide SNP array. Therefore, many of the most recent GWAS simply genotype all samples initially available with a SNP array. This practice also allows for simultaneous consideration of SNPs and other forms of genetic variation in one’s entire study sample (e.g., copy number variants). Nevertheless, new technologies for genotyping millions of SNPs or sequencing entire genomes may be expensive enough that multistage designs again become vital to genome-wide studies (discussed further below).

Subject Selection

Most GWAS select cases with a particular disease and compare them with unaffected controls. Of course, GWAS can also evaluate continuous traits, studying entire groups of subjects or selecting those at the extremes of the trait distribution to increase power for detecting associations ( 22 ). Whatever the phenotype, study subjects should be representative of their source population ( 76 ). In a case-control study, this requirement implies that controls be those individuals who, if diseased, would be cases, and controls are commonly selected to match the cases with respect to ethnicity, age, and sex.

Many GWAS, however, have been successful without overly rigorous control selection. In fact, owing to the high cost of subject recruitment and genotyping, there is a growing movement toward using existing genotype information among controls, who were recruited into previous studies and have been made available to researchers ( 37 ). The potential bias arising from using such convenience controls is tempered by the low measurement error in SNP genotyping, the lack of recall bias when studying inherited variants, the large sample sizes, the stringent criterion for statistical significance, and the rigorous replication of findings. In addition, if such controls result in population stratification bias—confounding of associations due to case-control differences in genetic ancestry ( 67 )—this can be addressed analytically with genomic information ( 8 , 50 , 51 ).

Although most GWAS use unrelated controls, some use family members such as unaffected siblings or parents. Family-based designs directly control for population stratification and for some potential confounding by environmental exposures (i.e., those shared by family members). The increased sharing of genetic information among family members, however, can result in this design having substantially lower power for detecting main effects—although increased power for detecting gene-environment interactions—than studies of unrelated individuals would have ( 83 ).

Statistical Analysis

Once genotyping is complete, SNPs are subject to a number of quality-control checks, such as the proportion of samples successfully genotyped and testing for Hardy-Weinberg equilibrium, and those that fail are removed from further consideration. With SNP genotypes and external linkage disequilibrium information on the underlying structure among neighboring SNPs (e.g., from the HapMap project), one can impute some of the common untyped variants; this option allows for a more thorough and powerful evaluation of potential associations across the genome ( 23 , 41 , 42 ).
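The two quality-control checks named above can be sketched as follows. This is a minimal illustration, assuming missing genotypes are coded as `None` and using the simple one-degree-of-freedom chi-square test of Hardy-Weinberg proportions (exact tests are often preferred in practice). The function names and the pass/fail cutoffs are assumptions for illustration, not thresholds stated in the text.

```python
from math import erfc, sqrt


def call_rate(genotypes):
    """Fraction of samples with a non-missing genotype (None = missing)."""
    return sum(g is not None for g in genotypes) / len(genotypes)


def hwe_chi_square(n_aa, n_ab, n_bb):
    """Chi-square test (1 df) of Hardy-Weinberg proportions.

    Arguments are the observed counts of the three genotypes.
    Returns (chi-square statistic, p-value).
    """
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1 - p
    if p == 0 or q == 0:
        raise ValueError("monomorphic SNP: test is undefined")
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    return chi2, erfc(sqrt(chi2 / 2))


# Hypothetical SNP: genotype counts among controls and a few raw calls.
chi2, p_value = hwe_chi_square(298, 489, 213)
MIN_CALL_RATE = 0.95   # assumed cutoff
MIN_HWE_P = 1e-6       # assumed cutoff
passes_qc = call_rate([0, 1, 2, 1, 0]) >= MIN_CALL_RATE and p_value >= MIN_HWE_P
```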

The relationship with disease is generally evaluated for each SNP using a trend test across the number of minor alleles. This allelic trend test has relatively good statistical properties, even if the true mode of inheritance is recessive or dominant. The analysis of GWA study data can also test multimarker combinations of SNPs, haplotypes, or interactions for their association with disease ( 7 , 47 ). Moreover, the statistical analysis can also leverage additional information about the SNPs, for example, whether they are part of a known pathway or are potentially functional ( 4 ).
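A trend test across the number of minor alleles is commonly implemented as the Cochran-Armitage trend test with scores 0, 1, and 2. The sketch below assumes that form; the function name and the genotype counts are illustrative, and the text does not specify which implementation any particular study used.

```python
from math import erfc, sqrt


def armitage_trend_test(cases, controls, scores=(0, 1, 2)):
    """Cochran-Armitage trend test for a 2 x 3 genotype table.

    cases / controls: counts of subjects carrying 0, 1, 2 minor alleles.
    Returns (chi-square statistic with 1 df, p-value).
    """
    totals = [r + s for r, s in zip(cases, controls)]
    n_cases, n_controls = sum(cases), sum(controls)
    n = n_cases + n_controls
    sum_rt = sum(r * t for r, t in zip(cases, scores))
    sum_nt = sum(m * t for m, t in zip(totals, scores))
    sum_nt2 = sum(m * t * t for m, t in zip(totals, scores))
    denom = n_cases * n_controls * (n * sum_nt2 - sum_nt ** 2)
    if denom == 0:
        raise ValueError("test is undefined for this table")
    chi2 = n * (n * sum_rt - n_cases * sum_nt) ** 2 / denom
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    return chi2, erfc(sqrt(chi2 / 2))


# Hypothetical counts: the minor allele is enriched among cases.
chi2, p_value = armitage_trend_test(cases=(10, 20, 30), controls=(30, 20, 10))
```

Because the statistic depends on the genotype counts only through a linear score, it has a single degree of freedom, which is one reason it retains reasonable power under recessive or dominant inheritance.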

To determine the overall statistical significance of GWA study results, one must address the issue of multiple comparisons arising from evaluating up to 1 million SNPs. The simplest approach is to use a Bonferroni correction, in which the conventional alpha level of p < 0.05 is divided by the number of tests performed (e.g., 0.05/1,000,000 = 5 × 10⁻⁸). This approach may be conservative because some assayed SNPs are correlated, owing to their linkage disequilibrium. But this conservatism is offset by the fact that the measured SNPs also represent unmeasured SNPs, so the effective number of independent tests in current GWAS is ~1 million ( 48 ). Some GWAS also calculate the false discovery rate (FDR) to assess the strength of associations ( 59 , 65 , 75 ). Note that while adhering to strict significance cut points is helpful to address issues of multiple comparisons, such cut points are somewhat arbitrary and do not reflect the potential clinical or biological importance of an association ( 82 ).
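Both corrections are short enough to write out. The sketch below computes the Bonferroni threshold described above and, for the FDR, uses the Benjamini-Hochberg step-up rule; the text does not say which FDR procedure the cited studies used, so that choice is an assumption, as are the function names and the toy p-values.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold under a Bonferroni correction."""
    return alpha / n_tests


def benjamini_hochberg(pvalues, q=0.05):
    """Indices of hypotheses rejected by the Benjamini-Hochberg step-up rule.

    Controls the false discovery rate at level q for independent tests.
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    n_reject = 0
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= rank / m * q:
            n_reject = rank  # keep the largest rank that meets its bound
    return sorted(order[:n_reject])


# Genome-wide threshold for one million tests at alpha = 0.05.
genome_wide = bonferroni_threshold(0.05, 1_000_000)

# Toy example (hypothetical p-values for five SNPs).
rejected = benjamini_hochberg([0.01, 0.041, 0.02, 0.005, 0.5], q=0.05)
```

The FDR rule is less stringent than Bonferroni because it bounds the expected share of false positives among the reported associations, not the chance of any false positive at all.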

GWA STUDY RESULTS

General Summary

GWA study findings are collated and updated in the National Human Genome Research Institute’s “Catalog of Published Genome-Wide Association Studies” ( http://www.genome.gov/gwastudies ) ( 19 ). This catalog presents results from GWAS that evaluated at least 100,000 SNPs in the initial stage and gives details on associated SNPs with p-values < 10⁻⁵. As of June 2009, the catalog includes more than 350 GWA study publications on more than 1600 distinct SNPs associated with more than 200 phenotypes. The chromosomal locations for many of these findings are highlighted in Figure 1 .

Figure 1. Chromosomal locations of genome-wide association (GWA) study results through March 2009. Results are given for 398 publications with p-values ≤ 5 × 10⁻⁸. Reproduced from the National Human Genome Research Institute’s GWAS Catalog: http://www.genome.gov/gwastudies . Credit: D. Leja and T. Manolio

There are some interesting patterns in the first few years’ worth of GWA study results. As expected, common diseases are the most frequently studied by GWAS, including more than half a dozen publications on type 1 and 2 diabetes; prostate, breast, colorectal, and lung cancer; amyotrophic lateral sclerosis; cholesterol and triglyceride levels; Alzheimer’s disease; bipolar disorder and schizophrenia; Crohn’s disease; and rheumatoid arthritis ( Figure 1 ) ( 19 ). Over the past few years, the cumulative number of associated SNPs detected by GWAS has exponentially increased; the number of statistically significant SNPs reported per GWA study and the sample size of the GWAS are also increasing with time ( Figure 2 ). The GWAS have had sample sizes in the hundreds to tens of thousands of subjects; the initial stages had a median sample size of 1,752 individuals (interquartile range: 809 to 4763 people) and the follow-up stages had a median sample size of 3,671 (interquartile range: 1649 to 8968). Although some of the studies have similar numbers of cases and controls, many include a smaller number of cases than controls ( Figure 3 , top panel).

Figure 2. Top panel: cumulative number of GWA study SNPs reported with p-values < 10⁻⁵ over time. Bottom panel: GWA study findings by study sample size and number of SNPs per study over time. Each circle indicates a single publication, and the area of the circle reflects the number of associated SNPs in that study. From these plots we can see the rapid increase in GWA study SNPs and a slight trend toward larger studies and more noteworthy SNPs per study ( 19 ).

Figure 3. Top panel: number of cases and controls in each published GWA study; many GWAS have fewer cases than controls. Bottom panel: total GWA study sample size (cases and controls) versus effect size for studies of binary traits; larger studies are better powered to detect smaller odds ratios ( 19 ).

The effect sizes for the GWA study associations are generally quite modest: The median odds ratio = 1.28 (interquartile range = 1.17 to 1.55, for binary traits). There is an inverse relationship between sample size and effect sizes for the GWAS: Studies with larger sample sizes can detect smaller associations, which is expected because they have higher power than do smaller studies ( Figure 3 , bottom panel). Interestingly, the larger GWAS are unlikely to detect larger effect sizes; the ability of smaller GWAS to detect larger associations may reflect the winner’s curse or false-positive results ( 32 ). The minor allele frequencies of significant GWA study SNPs are relatively common (median = 0.28, interquartile range = 0.16 to 0.39), again as expected on the basis of the design of GWAS and the measurement of common genetic variation by SNP arrays. The median p-value for associated SNPs is 1 × 10⁻⁷ (interquartile range = 3 × 10⁻⁶ to 9 × 10⁻¹²). The association p-values do not appear correlated with the corresponding SNP’s minor allele frequencies; the p-values do, however, appear correlated with the corresponding odds ratios ( Figure 4 ).
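To give a sense of what an odds ratio near the reported median looks like in raw data, the sketch below computes an allelic odds ratio from a 2 × 2 table of allele counts, with a Woolf (log-scale) confidence interval. The allele counts are invented so that the odds ratio lands close to 1.28; they are not data from any study discussed here, and the function name is an assumption.

```python
from math import exp, log, sqrt


def odds_ratio_ci(case_minor, case_major, control_minor, control_major, z=1.96):
    """Allelic odds ratio with a Woolf (log-scale) confidence interval.

    Arguments are allele counts; z = 1.96 gives an approximate 95% interval.
    Returns (odds ratio, lower bound, upper bound).
    """
    odds_ratio = (case_minor * control_major) / (case_major * control_minor)
    se_log_or = sqrt(
        1 / case_minor + 1 / case_major + 1 / control_minor + 1 / control_major
    )
    lower = exp(log(odds_ratio) - z * se_log_or)
    upper = exp(log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


# Hypothetical allele counts from 1,000 cases and 1,000 controls
# (2,000 alleles per group): minor allele frequency 0.30 vs 0.25.
estimate, ci_low, ci_high = odds_ratio_ci(600, 1400, 500, 1500)
```

A frequency difference of only five percentage points between cases and controls yields an odds ratio in this range, which illustrates why samples in the thousands are needed to detect such associations at genome-wide significance.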

Figure 4

Top panel: GWA study p-values and associated SNP minor allele frequencies (MAF). The red line is a smoothed curve across these values and highlights that there is little impact of MAF on p-values for the strongest SNPs from GWAS. Bottom panel: GWA study p-values and corresponding odds ratios. There is a slight trend toward smaller odds ratios having smaller p-values ( 19 ).

About 70% of the GWAS-associated SNPs are in genes or genic regions, and many of the findings pertain to loci that have not been previously implicated in disease. To date, GWA study findings appear to be overrepresented in genes involved with cell adhesion, signal transduction, transport activity, and protein phosphorylation ( 25 ). Finally, it is worth noting that the SNP array platforms used in the GWAS to date are closely split between those offered by Illumina and Affymetrix ( 19 ).

Specific Examples

There are far too many GWAS to discuss each in much detail here. Therefore, we highlight results from three large and highly successful projects: the Wellcome Trust Case Control Consortium (WTCCC) ( 79 ), the deCODE/Icelandic studies ( 15 ), and the Cancer Genetic Markers of Susceptibility (CGEMS) GWAS of prostate cancer ( 68 , 85 ).

The WTCCC encompassed GWAS of seven major diseases: bipolar disorder, coronary artery disease, Crohn’s disease, hypertension, rheumatoid arthritis, type 1 diabetes, and type 2 diabetes ( 79 ). The initial stage included 14,000 affected individuals total from the United Kingdom, 2000 with each disease. For comparison, two sets of control groups, each containing 1500 individuals, were used: one from the 1958 British Birth Cohort and one from the U.K. National Blood Service ( 79 ). All study subjects were genotyped using the Affymetrix GeneChip 500K array. Analyses of the resulting data and additional follow-up work detected and replicated many promising SNP-disease associations, for example, a novel association between FTO and type 2 diabetes and associations for coronary artery disease on chromosome 9 ( 79 ). The importance of this project is highlighted by the fact that more than 700 other papers have cited the original publication as of June 2009.

The deCODE GWAS arise from studies that include more than 40,000 individuals from Iceland ( 15 ). This infrastructure allows for swiftly studying any phenotype for which deCODE has access to a sufficiently large sample size. The deCODE GWAS have looked at a large number of phenotypes using nested case-control studies in Iceland with overlapping control groups and collaborative follow-up and replication studies on subjects from outside of Iceland. They have detected GWA study associations for many different phenotypes, including cancer ( 13 , 14 , 30 , 58 , 69 ); heart disease ( 18 ); obesity ( 71 ); glaucoma ( 70 ); and traits such as age at menarche ( 61 ), bone mineral density ( 60 ), and pigmentation, hair, and eye color ( 60 ). These successes illustrate the value of establishing large, well-characterized populations for evaluating many different diseases.

CGEMS is a multistage GWAS of prostate and breast cancer ( http://cgems.cancer.gov ). Focusing on the prostate cancer study, 1172 cases and 1157 controls of European-American ancestry were selected from the Prostate, Lung, Colorectal, and Ovarian (PLCO) Cancer Screening Trial and genotyped using the Illumina 550K array ( 68 , 85 ). Almost 27,000 of the most strongly associated SNPs were followed up in a second stage comprising four other study populations with a total of 3941 cases and 3964 controls. Noteworthy findings from the CGEMS of prostate cancer included detecting associated SNPs in distinct loci on chromosome 8q24 and a risk SNP in the MSMB gene, which encodes beta-microseminoprotein, a primary constituent of semen ( 68 , 85 ). Interestingly, the strongly associated SNP in MSMB (rs10993994) had only the 24,223rd smallest p-value in the initial stage ( 68 ), which illustrates that replicated SNPs may not initially have exceptionally small p-values and that it is important to follow up a large number of SNPs in the latter stages of GWAS.

The WTCCC and deCODE studies showed that GWAS can successfully use shared controls that are well matched on ethnicity for comparison with multiple phenotypes, even though they may not be fully representative of the cases’ source population. Nevertheless, recruiting controls in the same manner as cases (including obtaining detailed nongenetic information) allows investigators to evaluate gene-environment interactions appropriately; of course, one might simply decide to evaluate such associations with a case-only study design ( 64 ). The WTCCC GWAS also found that processing arrays at different laboratories can result in false-positive findings ( 5 ).

Researchers can request use of the WTCCC and CGEMS GWA study data. Sharing GWA study data is vital because it allows others to replicate findings, combine data, and examine phenotypic clustering with particular genetic variants, maximizing the use of this valuable information. To this end, National Institutes of Health grantees undertaking GWAS are required to develop a data-sharing plan and to deposit their data for use by other scientists (NIH Notice: NOT-OD-08-013). Many GWAS thus far have deposited data in dbGaP for use by the scientific community ( 40 ).

IMPLICATIONS OF GWA STUDY FINDINGS

New insights.

The highly replicated results from GWAS can help clarify the biological basis of disease, providing information about the mechanisms underlying the disease process. If multiple confirmed SNPs arise from a particular biological pathway, this implies that something unique to that pathway may help drive the etiology of disease. Similarly, if multiple diseases are associated with a particular locus, this suggests a common genetic basis for such diseases ( 56 ). For example, the SNPs in the chromosome 8q24 loci are associated with cancer of the prostate, breast, and colon, indicating that this locus is acutely involved with the carcinogenic process ( 9 , 13 , 72 , 84 , 86 ). In fact, more recent mechanistic work shows that the 8q24 prostate and colorectal cancer locus containing the risk SNP rs6983267 is in a transcriptional enhancer and interacts with the proto-oncogene MYC ( 49 , 73 ).

From an epidemiologic perspective, GWA study SNPs may give relatively large estimates of the population-attributable fraction (e.g., 40% for multiple prostate cancer SNPs). These estimates reflect the fairly high-risk allele frequencies for these SNPs, which are commonly between 0.15 and 0.40 ( Figure 4 ). Such population-attributable fraction estimates imply that the associated SNPs account for a sizeable proportion of disease, although these calculations make a number of assumptions and may overestimate or underestimate the true population-attributable fraction ( 43 ).
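To make the population-attributable fraction concrete, the sketch below uses Levin's formula, PAF = p(RR − 1) / [1 + p(RR − 1)], where p is the prevalence of the exposure and RR is the relative risk. This is one common way to compute the quantity and is not necessarily the calculation used in the studies cited above. The function names, the dominant (carrier versus non-carrier) risk model, and the Hardy-Weinberg assumption are illustrative choices, not taken from the text.

```python
def carrier_frequency(risk_allele_freq: float) -> float:
    """Proportion of people carrying at least one risk allele.

    Assumes Hardy-Weinberg equilibrium (an illustrative assumption).
    """
    if not 0.0 <= risk_allele_freq <= 1.0:
        raise ValueError("allele frequency must lie in [0, 1]")
    return 1.0 - (1.0 - risk_allele_freq) ** 2


def levin_paf(exposure_prevalence: float, relative_risk: float) -> float:
    """Population-attributable fraction from Levin's formula."""
    if not 0.0 <= exposure_prevalence <= 1.0:
        raise ValueError("prevalence must lie in [0, 1]")
    excess = exposure_prevalence * (relative_risk - 1.0)
    return excess / (1.0 + excess)


if __name__ == "__main__":
    # Hypothetical inputs of the same order as the medians quoted earlier;
    # the odds ratio is used as a rough stand-in for the relative risk.
    p = carrier_frequency(0.28)
    print(f"carrier frequency = {p:.4f}, PAF = {levin_paf(p, 1.28):.3f}")
```

With these hypothetical inputs a single SNP of modest effect yields a PAF of roughly 0.12, which shows how a common risk allele can produce a sizeable attributable fraction even when its individual effect is small.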

Genetic Testing

Another implication of GWA study results is the rapidly escalating availability of genetic testing. Currently available tests range from those prescribed by a physician for high-penetrance disease genes to those marketed direct-to-consumer for measuring variants that span the entire genome ( 29 ). With direct-to-consumer tests, individuals pay to have variation in their genomes assayed with the same SNP arrays used in GWAS. Results from these assays are returned to the consumer along with varying levels of additional information (e.g., Web-based interpretation of some results, genetic counseling).

Some of the tests include information about the genetic basis of drug response, which is now being studied using GWAS ( 54 ). Pharmacogenomics has important near-term implications for which drug and dose an individual should receive. For example, GWAS confirmed that variants in the gene CYP2C9 impact metabolism of the anticoagulant drug warfarin ( 63 ); such findings suggest that individuals who carry the CYP2C9 variant that results in slow metabolism could be prescribed lower doses of warfarin, reducing potential side effects of severe bleeding and unnecessary health-care costs. In light of such results, personalized medicine is commonly touted as a major potential benefit of pharmacogenomics and GWAS, albeit with some reservations ( 46 ).

Although most individual GWA study SNPs are not effective for genetic testing owing to their modest associations with disease and low penetrance, one might consider genetic tests based on combinations of associated SNPs. For example, when looking at the distribution of five associated SNPs for prostate cancer, men in the top decile of risk alleles carried have an approximate two- to fourfold increase in risk in comparison to men in the lowest decile ( 31 , 68 , 87 ). Based on this increased risk, some investigators advocate a multiple-SNP screening test for prostate cancer ( 87 ), although there are some serious limitations with such tests (discussed below).

LIMITATIONS OF GWAS

Although many GWA study results have been highly replicated, quite a few of the variants are only associated with, not causal for, disease. Determining the causal factors underlying GWA study results can be extremely challenging, requiring fine mapping and mechanistic studies—which are underway for many findings ( 24 ). This is further complicated by the fact that ~30% of the associations detected to date are not even in gene regions ( 19 ). These issues, of course, limit our ability to understand the biological basis of GWA study results and to implement preventive or therapeutic measures.

Little Heritability

GWA study findings often account for only a limited amount of disease heritability ( 39 ), which in part reflects the small magnitude of effect for most SNPs detected by GWAS. Even if large effects are found (e.g., for combinations of SNPs), these SNPs may not have high penetrance and so do not confer a high risk of disease. This dark matter of unexplained heritability has raised concerns about the ultimate value of GWAS ( 12 ).

The original hypothesis motivating GWAS is that common diseases may be caused, in part, by common genetic variants ( 53 ). Thus, GWAS were designed to detect associations between disease and common SNPs (e.g., minor allele frequency >5%). The current SNP arrays measure variants primarily at or above this frequency and may even miss some common variation ( 26 ). Common diseases, however, are undoubtedly also due to rare variants. These variants can act alone or reflect allelic heterogeneity, whereby many different rare alleles within a particular locus each increase risk. Therefore, the inability of existing GWAS to evaluate rare variants may help explain why they account for little heritability.

Another possible explanation is that most GWAS have evaluated genetic variation due only to SNPs. Although SNPs comprise the most common form of genetic variation, copy number variants (CNVs) also give rise to substantial variability and account for some heritability of disease ( 20 , 44 , 52 ). GWAS are now evaluating CNVs, and CNV probes are being incorporated into the most current GWA study arrays. In addition, it has been difficult for investigators to detect gene-gene and gene-environment interactions by GWAS because this practice requires extremely large sample sizes and well-characterized environmental exposures ( 64 ). As sample sizes increase, GWAS will also be able to detect additional SNP associations that have even smaller effect sizes than those observed to date.

Not Very Predictive

Another important limitation of GWA study results—which is especially pertinent in light of the growing direct-to-consumer tests—is that they may not sufficiently distinguish between individuals with low and high risk of disease. For example, the five-SNP test noted above for prostate cancer provides only a slight increase in the area under the receiver operating characteristic curve (AUC) for classifying cases and controls (0.61 to 0.63) ( 87 ). In general, screening tests based on most of the GWA study SNPs detected to date will likely have low positive (and negative) predictive value for disease and have limited usefulness in a diagnostic setting ( 33 , 78 ). Adding more GWA study SNPs with modest disease associations may not much improve the discriminatory ability of such tests. Moreover, few individuals will carry large numbers of GWA study risk alleles, so screening for these in the general population would not be cost-effective. Note also that justification for genetic testing also depends on the existence of effective interventions.

NEXT-GENERATION GWAS

Although the successes of GWAS are tempered by their limitations, they do provide an important advance in our efforts toward deciphering the genetic basis of disease ( 1 ). GWAS highlight the value of agnostic approaches to the search for disease genes. Taking a broad genome-wide view is essential for achieving a more complete understanding of the genetic architecture of complex phenotypes. GWAS also emphasize the significance of undertaking complementary replication and validation studies across multiple populations ( 3 ).

Continued scientific and technological advances will allow investigators to study less common and different sources of genetic variation. Results from the 1000 Genomes project ( http://www.1000genomes.org ) can be used to assay less common SNPs with more sizeable genotyping platforms (e.g., 10 million SNPs). Sequencing technologies are rapidly decreasing in cost, and genome-wide sequence studies will eventually become feasible. Before sequencing all study subjects, future work may use a sequence/genotype hybrid design in which the initial phase sequences a subset of subjects, and based on these results, large-scale genotyping will be undertaken on the remaining study subjects.

As data on less common variants become available, and interest in detecting interactions and pathway effects on disease grows, there will be an increasing need for more complex statistical analysis tools. Methods that maximize the strengths of both agnostic genome-wide and knowledge-based biological approaches may help clarify potential associations such as explicitly incorporating into analysis additional existing information about the properties of genetic variants ( 4 , 27 ). New statistical methods for evaluating rare variants will also be crucial as such data are generated on a wider scale ( 38 ).

The next generation of genome-wide studies will further improve our understanding of the disease process, risks, and response to therapy. The impact of these studies on a person’s and the public’s health will vary substantially. In some cases, the information will have little actionable value, and any corresponding genetic tests could ultimately increase health care costs by prompting individuals to obtain unnecessary medical care. In other situations, the knowledge gained will be extremely valuable and provide great benefit; hopefully this will encompass a large majority of genome-wide findings.

ACKNOWLEDGMENTS

I thank Drs. Iona Cheng and Inga Hallgrimsdottir for helpful comments on this manuscript, and Joel Mefford for creating Figures 2–4. This work was supported by grants R01 CA88164 and U01 CA127298 from the National Institutes of Health.

DISCLOSURE STATEMENT

The author is not aware of any affiliations, memberships, funding, or financial holdings that might be perceived as affecting the objectivity of this review.

LITERATURE CITED

  • Open access
  • Published: 19 April 2024

Asparagine reduces the risk of schizophrenia: a bidirectional two-sample mendelian randomization study of aspartate, asparagine and schizophrenia

  • Huang-Hui Liu 1 , 2   na1 ,
  • Yao Gao 1 , 2   na1 ,
  • Dan Xu 1 , 2 ,
  • Xin-Zhe Du 1 , 2 ,
  • Si-Meng Wei 1 , 2 ,
  • Jian-Zhen Hu 1 , 2 ,
  • Yong Xu 1 , 2 &
  • Liu Sha 1 , 2  

BMC Psychiatry volume 24, Article number: 299 (2024)


Despite ongoing research, the underlying causes of schizophrenia remain unclear. Aspartate and asparagine, essential amino acids, have been linked to schizophrenia in recent studies, but their causal relationship is still unclear. This study used a bidirectional two-sample Mendelian randomization (MR) method to explore the causal relationship between aspartate and asparagine with schizophrenia.

This study employed summary data from genome-wide association studies (GWAS) conducted on European populations to examine the correlation between aspartate and asparagine with schizophrenia. In order to investigate the causal effects of aspartate and asparagine on schizophrenia, this study conducted a two-sample bidirectional MR analysis using genetic factors as instrumental variables.

No causal relationship was found between aspartate and schizophrenia, with an odds ratio (OR) of 1.221 (95% CI: 0.483–3.088, P-value = 0.674). Reverse MR analysis also indicated that no causal effects were found between schizophrenia and aspartate, with an OR of 0.999 (95% CI: 0.987–1.010, P-value = 0.841). There is a negative causal relationship between asparagine and schizophrenia, with an OR of 0.485 (95% CI: 0.262–0.900, P-value = 0.020). Reverse MR analysis indicates that there is no causal effect between schizophrenia and asparagine, with an OR of 1.005 (95% CI: 0.999–1.011, P-value = 0.132).

This study suggests that there may be a potential risk reduction for schizophrenia with increased levels of asparagine, while also indicating the absence of a causal link between elevated or diminished levels of asparagine in individuals diagnosed with schizophrenia. There is no potential causal relationship between aspartate and schizophrenia, whether prospective or reverse MR. However, it is important to note that these associations necessitate additional research for further validation.


Introduction

Schizophrenia is a serious psychiatric illness that affects 0.5–1% of the global population [ 1 ]. The burden of mental illness was estimated to account for 7% of the global disease burden in 2016, and nearly 20% of all years lived with disability [ 2 ]. Schizophrenia is characterized by positive, negative, and cognitive symptoms, which often lead to severe functional impairment and significant social maladaptation [ 3 ]. The causes and pathogenesis of schizophrenia remain unclear. Several hypotheses are based on neurochemical mechanisms, including the dopaminergic and glutamatergic systems [ 4 ]. Although schizophrenia research has made significant advances, further insight into its mechanisms and causes is still needed.

Association genetics research and genome-wide association studies have successfully identified more than 24 candidate genes that serve as molecular biomarkers for susceptibility to treatment-refractory schizophrenia (TRS). It is worth noting that some of the proteins encoded by these genes are related to glutamate transmission, especially the N-methyl-D-aspartate receptor (NMDAR) [ 5 ]. NMDARs are thought to be important for neural plasticity, the brain's ability to adapt to new environments. With age, NMDAR function usually declines, which may reduce plasticity and lead to learning and memory problems. Consequently, the cognitive deficits observed in diverse pathologies, including Alzheimer’s disease, amyotrophic lateral sclerosis, Huntington’s disease, Parkinson’s disease, schizophrenia, and major depression, can be attributed to NMDAR dysfunction [ 4 ]. There are two enantiomers of aspartate (Asp): L and D [ 6 ]. In the brain, D-aspartate (D-Asp) stimulates glutamate receptors and dopaminergic neurons through its direct NMDAR agonist action [ 7 ]. According to the glutamate theory of schizophrenia, glutamate NMDAR dysfunction is a primary contributor to the development of this psychiatric disorder and TRS [ 8 ]. Two autopsy studies have shown that D-Asp levels in prefrontal cortex neurons of patients with schizophrenia are significantly reduced, which is related to an increased expression of D-Asp oxidase [ 9 ] or an increased activity of D-Asp oxidase [ 10 ]. Several studies in animal models and humans have shown that D-amino acids, particularly D-Ser and D-Asp [ 11 ], are able to modulate several NMDAR-dependent processes, including synaptic plasticity, brain development, cognition and brain ageing [ 12 ]. In addition, D-Asp is synthesized in hippocampal and prefrontal cortex neurons, which play an important role in the development of schizophrenia [ 13 ].
Aspartate, the precursor of asparagine (Asn), has been reported to activate the N-methyl-D-aspartate receptor [ 14 ]. Asparagine is essential for the survival of all cells [ 15 ], and its levels were decreased in patients with schizophrenia [ 16 ]. Asparagine can cause metabolic disorders of alanine, aspartate, and glutamic acid, leading to dysfunction of the glutamine-glutamate cycle and further affecting gamma-aminobutyric acid (GABA) levels [ 17 ]. It is widely understood that the imbalance of GABA levels and NMDAR plays a crucial role in the pathogenesis of schizophrenia, causing neurotoxic effects, synaptic dysfunction, and cognitive impairments [ 18 ]. Patients with schizophrenia exhibited significantly higher levels of serum aspartate, glutamate, isoleucine, histidine and tyrosine and significantly lower concentrations of serum asparagine, tryptophan and serine [ 19 ]. Other studies have also shown that patients with schizophrenia have higher levels of asparagine, phenylalanine, and cystine, and lower ratios of tyrosine, tryptophan, and tryptophan to competing amino acids, compared to healthy individuals [ 20 ]. The association of aspartate and asparagine with schizophrenia is not fully understood, and their causal relationship remains unclear.

The MR method uses Mendel's law of independent assortment to infer causality, employing genetic variation to study the effect of an exposure on an outcome. This approach mitigates the confounding that affects conventional observational research and provides causal inference on a reasonable temporal basis [ 21 ]. The genetic variants chosen as instrumental variables must satisfy three core assumptions: the correlation (relevance) assumption, which requires a robust association between the single nucleotide polymorphisms (SNPs) and the exposure; the independence assumption, which requires that the SNPs are not affected by confounding factors; and the exclusivity (exclusion restriction) assumption, which requires that the SNPs influence the outcome solely through the exposure. In a recent study, Mendelian randomization was used to reveal a causal connection between thyroid function and schizophrenia [ 22 ]. According to another Mendelian randomization study, physical activity is causally related to schizophrenia [ 23 ]. Therefore, this study used the Mendelian randomization method to explore the causal effects of aspartate and asparagine on schizophrenia.

To elucidate the causal effects of aspartate and asparagine on schizophrenia, this study used bidirectional MR analysis. In the prospective MR analysis, the exposures were aspartate and asparagine, and the outcome was the risk of schizophrenia. Conversely, in the reverse MR analysis, schizophrenia was the exposure, and aspartate and asparagine were the outcomes.

Materials and methods

Obtaining data sources

Select genetic tools closely related to aspartate or asparagine

In this research, publicly accessible GWAS summary statistical datasets from the MR basic platform were utilized. These datasets consisted of 7721 individuals of European ancestry [ 24 ] for the exposure phenotype instrumental variable of aspartate, and 7761 individuals of European ancestry [ 24 ] for the exposure phenotype instrumental variable of asparagine.

Select genetic tools closely related to schizophrenia

GWAS summary statistics for schizophrenia, comprising 77,096 individuals of European ancestry [ 5 ], were obtained from the MR basic platform and used to select instrumental variables for schizophrenia as the exposure phenotype.

Obtaining outcome data

The publicly accessible GWAS summary statistical dataset for schizophrenia on the MR basic platform, with a sample size of 77,096, was used as outcome data. Additionally, summary-level data for aspartate and asparagine were obtained from the publicly available GWAS summary datasets on the MR basic platform, with sample sizes of 7721 and 7761, respectively, to serve as outcome variables.

Instrumental variable filtering

Eliminating linkage disequilibrium.

The criteria for selecting exposure-related SNPs from the GWAS summary data were: (1) the SNP reaches the significance threshold used for genome-wide screening in this study, P-value < 5 × 10⁻⁶ [ 25 ]; (2) the selected SNPs are independent, with SNPs in linkage disequilibrium removed (r² < 0.001, window size of 10,000 kb) [ 26 ]; and (3) corresponding data for the SNP are available in the GWAS summary data for the outcome.
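The three selection criteria can be sketched as a greedy filtering step. The code below is a minimal illustration and not the pipeline used in this study: the `Snp` record, the `select_instruments` function, and the dictionary of pairwise r² values are all hypothetical, and in practice r² would come from a reference panel. Applying criterion (3) before clumping is also an assumption made for simplicity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snp:
    rsid: str
    chrom: int
    pos: int      # base-pair position
    pval: float   # P-value for association with the exposure


def select_instruments(snps, r2, outcome_rsids,
                       p_threshold=5e-6, r2_threshold=0.001,
                       window_bp=10_000_000):
    """Greedy selection of independent, significant SNPs.

    r2 maps frozenset({rsid_a, rsid_b}) to the pairwise LD r^2;
    pairs missing from the mapping are treated as r^2 = 0.
    """
    # Criteria (1) and (3): significant for the exposure, present in outcome data.
    candidates = sorted(
        (s for s in snps if s.pval < p_threshold and s.rsid in outcome_rsids),
        key=lambda s: s.pval,
    )
    kept = []
    # Criterion (2): keep the most significant SNP, drop later ones in LD with it.
    for snp in candidates:
        in_ld = any(
            k.chrom == snp.chrom
            and abs(k.pos - snp.pos) <= window_bp
            and r2.get(frozenset((k.rsid, snp.rsid)), 0.0) >= r2_threshold
            for k in kept
        )
        if not in_ld:
            kept.append(snp)
    return kept


if __name__ == "__main__":
    # Entirely made-up SNPs for illustration.
    snps = [
        Snp("rsA", 1, 1_000, 1e-8),
        Snp("rsB", 1, 2_000, 1e-7),        # in LD with rsA
        Snp("rsC", 1, 50_000_000, 2e-6),   # same chromosome, outside the window
        Snp("rsD", 2, 500, 1e-3),          # not significant
        Snp("rsE", 3, 100, 1e-9),          # absent from the outcome data
    ]
    ld = {frozenset(("rsA", "rsB")): 0.8}
    chosen = select_instruments(snps, ld, {"rsA", "rsB", "rsC", "rsD"})
    print([s.rsid for s in chosen])
```

Sorting by P-value first means that, within each linked group, the SNP most strongly associated with the exposure is the one retained.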

Eliminating weak instruments

To evaluate whether the instrumental variables selected for this MR study were weak instruments, we calculated the F-statistic. An F-value greater than 10 indicates that the instruments are not weak, supporting the reliability of the study. The F-statistic was calculated as F = [(N − K − 1)/K] × [R²/(1 − R²)], where N denotes the sample size of the exposure GWAS, K the number of instrumental variables, and R² the proportion of variance in the exposure explained by the instrumental variables.
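The F-statistic formula can be evaluated directly. The sketch below is a minimal illustration; the function name and the R² value in the example are hypothetical, since the text does not report the variance explained by the instruments.

```python
def f_statistic(n: int, k: int, r2: float) -> float:
    """F = [(N - K - 1) / K] * [R^2 / (1 - R^2)]."""
    if k < 1 or n - k - 1 <= 0:
        raise ValueError("need k >= 1 and n > k + 1")
    if not 0.0 <= r2 < 1.0:
        raise ValueError("r2 must lie in [0, 1)")
    return ((n - k - 1) / k) * (r2 / (1.0 - r2))


if __name__ == "__main__":
    # N and K as reported for asparagine; R^2 = 0.05 is a made-up value.
    f = f_statistic(n=7761, k=24, r2=0.05)
    print(f"F = {f:.2f}, weak instruments: {f <= 10}")
```

Because F grows with R² and shrinks with the number of instruments K, adding many SNPs that each explain little variance can push F below the threshold of 10.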

The final instrumental variable obtained

After removing SNPs in linkage disequilibrium and weak instrumental variables, 3 SNPs related to aspartate and 24 SNPs related to asparagine were selected for MR analysis.

Bidirectional MR analysis

Research design.

Figure 1 presents the overall design of the MR analysis undertaken in this study. We identified SNPs robustly associated with the target exposure from publicly available published data and then investigated the existence of a causal association between these SNPs and the corresponding outcomes. This study conducted two bidirectional MR analyses: a prospective and a reverse MR analysis of the causal relationship between aspartate and schizophrenia, and a prospective and a reverse MR analysis of the causal relationship between asparagine and schizophrenia.

Figure 1

A MR analysis of aspartate and schizophrenia (located in the upper left corner). B  MR analysis of schizophrenia and aspartate (located in the upper right corner). C  MR analysis of asparagine and schizophrenia (located in the lower left corner). D  MR analysis of schizophrenia and asparagine (located in the lower right corner)

Statistical analysis

Inverse variance weighting (IVW), weighted median, weighted mode, and MR Egger were used to conduct the MR study. The primary findings were derived from the IVW results, and the results of the other methods were treated as sensitivity analyses of the causal effect estimates. Statistical significance was defined as a P-value less than 0.05. To aid interpretation, this study converted the estimated beta values into odds ratios (ORs), each with a 95% confidence interval (CI).
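As an illustration of the primary method, the sketch below computes a fixed-effect IVW estimate from per-SNP summary statistics and converts the resulting beta to an OR with a 95% CI. The text does not state which IVW variant was used, so the fixed-effect form is an assumption, and the function names and input values are hypothetical.

```python
import math


def ivw_estimate(beta_exposure, beta_outcome, se_outcome):
    """Fixed-effect inverse-variance-weighted causal estimate.

    Equivalent to a weighted regression of SNP-outcome effects on
    SNP-exposure effects through the origin, with weights 1 / se_outcome^2.
    Returns (beta, standard_error).
    """
    num = sum(bx * by / se ** 2
              for bx, by, se in zip(beta_exposure, beta_outcome, se_outcome))
    den = sum(bx ** 2 / se ** 2
              for bx, se in zip(beta_exposure, se_outcome))
    if den == 0:
        raise ValueError("all SNP-exposure effects are zero")
    return num / den, math.sqrt(1.0 / den)


def to_odds_ratio(beta, se, z=1.96):
    """Convert a log-odds estimate to (OR, lower 95% CI, upper 95% CI)."""
    return math.exp(beta), math.exp(beta - z * se), math.exp(beta + z * se)


if __name__ == "__main__":
    # Made-up summary statistics for three SNPs.
    bx = [0.10, 0.20, 0.15]
    by = [-0.06, -0.15, -0.10]
    se = [0.03, 0.05, 0.04]
    beta, beta_se = ivw_estimate(bx, by, se)
    odds_ratio, lower, upper = to_odds_ratio(beta, beta_se)
    print(f"OR = {odds_ratio:.3f} (95% CI: {lower:.3f}-{upper:.3f})")
```

An OR below 1 with a confidence interval that excludes 1 corresponds to a protective effect of the exposure, which is how the asparagine result in this study is read.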

Test for directional horizontal pleiotropy

This study used the MR Egger intercept to test for horizontal pleiotropy. A P-value greater than 0.05 indicates no evidence of horizontal pleiotropy in this study, meaning that the instrumental variables are assumed to affect the outcome only through the exposure.
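The MR Egger intercept comes from a weighted regression of SNP-outcome effects on SNP-exposure effects in which the intercept is not forced to zero; an intercept far from zero suggests directional pleiotropy. The sketch below computes only the intercept and slope with hypothetical function names and inputs. The P-value reported in the text would additionally require the standard error of the intercept and a reference distribution, which are omitted here.

```python
def mr_egger(beta_exposure, beta_outcome, se_outcome):
    """Weighted least squares of outcome effects on exposure effects.

    SNPs are first oriented so that every exposure effect is positive.
    Weights are 1 / se_outcome^2. Returns (intercept, slope).
    """
    xs, ys, ws = [], [], []
    for bx, by, se in zip(beta_exposure, beta_outcome, se_outcome):
        sign = -1.0 if bx < 0 else 1.0
        xs.append(sign * bx)
        ys.append(sign * by)
        ws.append(1.0 / se ** 2)

    sw = sum(ws)
    swx = sum(w * x for w, x in zip(ws, xs))
    swy = sum(w * y for w, y in zip(ws, ys))
    swxx = sum(w * x * x for w, x in zip(ws, xs))
    swxy = sum(w * x * y for w, x, y in zip(ws, xs, ys))

    denom = sw * swxx - swx ** 2
    if denom == 0:
        raise ValueError("need at least two distinct exposure effects")
    slope = (sw * swxy - swx * swy) / denom
    intercept = (swy - slope * swx) / sw
    return intercept, slope


if __name__ == "__main__":
    # Made-up effects constructed to have a small positive intercept.
    bx = [0.10, 0.20, 0.30]
    by = [0.06, 0.11, 0.16]
    se = [0.02, 0.03, 0.04]
    intercept, slope = mr_egger(bx, by, se)
    print(f"intercept = {intercept:.3f}, slope = {slope:.3f}")
```

The slope is the MR Egger causal estimate, and the intercept estimates the average pleiotropic effect across the instruments.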

Results of bidirectional MR analysis of aspartate and schizophrenia

Analysis results of aspartate and schizophrenia.

In the prospective MR analysis, this study set aspartate as the exposure and schizophrenia as the outcome. We used 3 SNPs significantly associated with aspartate, identified by screening across the entire genome. The instrumental variables had F-values exceeding 10, indicating the absence of weak instruments and supporting the robustness of our findings. Through MR analysis (Fig. 2A), we assessed the individual influence of each SNP locus on schizophrenia. The IVW results indicate that no causal effect was found between aspartate and schizophrenia, with an OR of 1.221 (95% CI: 0.483–3.088, P-value = 0.674).

In addition, the weighted mode and weighted median methods yielded similar results, indicating the absence of a causal association between aspartate and schizophrenia. The MR Egger analysis likewise showed no statistically significant causal effect of aspartate on schizophrenia (P-value > 0.05) (Table 1; Fig. 2B).

To test the reliability of the results, this study used MR Egger intercept analysis to examine horizontal pleiotropy; the result was P-value = 0.579 > 0.05, indicating the absence of horizontal pleiotropy. Furthermore, a leave-one-out test showed that no single SNP had a substantial impact on the results, indicating that they are stable (Fig. 2C). Accordingly, the MR analysis results support the conclusion that aspartate and schizophrenia do not exhibit a causal relationship.
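A leave-one-out test of this kind can be sketched by recomputing the pooled estimate with each SNP removed in turn. The code below uses a fixed-effect IVW estimate as the pooled estimator, which is an assumption, and the function names, SNP identifiers, and effect sizes are hypothetical.

```python
def ivw_beta(beta_exposure, beta_outcome, se_outcome):
    """Fixed-effect IVW point estimate (weights 1 / se_outcome^2)."""
    num = sum(bx * by / se ** 2
              for bx, by, se in zip(beta_exposure, beta_outcome, se_outcome))
    den = sum(bx ** 2 / se ** 2
              for bx, se in zip(beta_exposure, se_outcome))
    return num / den


def leave_one_out(rsids, beta_exposure, beta_outcome, se_outcome):
    """Map each SNP to the IVW estimate obtained when that SNP is dropped."""
    results = {}
    for i, rsid in enumerate(rsids):
        keep = [j for j in range(len(rsids)) if j != i]
        results[rsid] = ivw_beta(
            [beta_exposure[j] for j in keep],
            [beta_outcome[j] for j in keep],
            [se_outcome[j] for j in keep],
        )
    return results


if __name__ == "__main__":
    # Made-up data in which rs3 is an outlier.
    ids = ["rs1", "rs2", "rs3"]
    bx = [0.10, 0.20, 0.10]
    by = [0.05, 0.10, 0.20]
    se = [0.01, 0.01, 0.01]
    for rsid, beta in leave_one_out(ids, bx, by, se).items():
        print(f"without {rsid}: beta = {beta:.2f}")
```

If the estimates stay close to one another whichever SNP is dropped, no single instrument is driving the result; in this toy example the estimate shifts noticeably when rs3 is removed, which is the pattern the test is designed to expose.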

Analysis results of schizophrenia and aspartate

Unlike the prospective MR analysis, the reverse MR analysis set schizophrenia as the exposure and aspartate as the outcome. Through MR analysis (Fig. 2D), we assessed the individual influence of each SNP locus on aspartate. The IVW results indicate that there is no causal effect between schizophrenia and aspartate, with an OR of 0.999 (95% CI: 0.987–1.010, P-value = 0.841). Similarly, the weighted mode and weighted median methods failed to demonstrate a causal link between schizophrenia and aspartate. Additionally, the MR Egger analysis did not reveal a statistically significant causal effect of schizophrenia on aspartate (P-value > 0.05) (Table 1; Fig. 2E).

The MR Egger intercept was used to test for horizontal pleiotropy, and the result was P-value = 0.226 > 0.05, indicating that this analysis is not affected by horizontal pleiotropy. Furthermore, a leave-one-out test revealed that no individual SNP significantly influenced the robustness of the findings (Fig. 2F).

Figure 2

Depicts the causal association between aspartate and schizophrenia through diverse statistical analyses, as well as the causal association between schizophrenia and aspartate through diverse statistical analyses. A The forest plot of aspartate related SNPs and schizophrenia analysis results, with the red line showing the MR Egger test and IVW method. B  Scatter plot of the analysis results of aspartate and schizophrenia, with the slope indicating the strength of the causal relationship. C  Leave-one-out test of research results on aspartate and schizophrenia. D The forest plot of schizophrenia related SNPs and aspartate analysis results, with the red line showing the MR Egger test and IVW method. E  Scatter plot of the analysis results of schizophrenia and aspartate, with the slope indicating the strength of the causal relationship. F  Leave-one-out test of research results on schizophrenia and aspartate

Results of bidirectional MR analysis of asparagine and schizophrenia

Analysis results of asparagine and schizophrenia.

In the prospective MR analysis, we used asparagine as the exposure and schizophrenia as the outcome to investigate the potential causal relationship between them. Through a rigorous screening process, we identified 24 genome-wide significant SNPs associated with asparagine. In addition, the instrumental variable F-values all exceeded 10, indicating that this analysis was not affected by weak instruments and supporting the stability of the results. This study conducted MR analysis to evaluate the impact of all SNP loci on schizophrenia (Fig. 3A). According to the IVW results, a negative causal relationship was found between asparagine and schizophrenia, with an OR of 0.485 (95% CI: 0.262–0.900, P-value = 0.020).

The weighted median results also indicated an inverse causal relationship between asparagine and schizophrenia. The weighted mode method did not show a causal relationship, and the MR Egger estimate was not statistically significant (P-value > 0.05) (Table 1; Fig. 3B).
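To show why the weighted median can disagree with a weighted mean when some instruments are outliers, here is a minimal sketch of a weighted median with linear interpolation between neighbouring ratio estimates. The function name, the toy ratios, and the equal weights are assumptions for illustration; in practice the weights would typically be the inverse variances of the per-SNP ratio estimates, and all weights are assumed to be positive.

```python
def weighted_median(ratios, weights):
    """Weighted median of per-SNP ratio estimates with linear interpolation."""
    if len(ratios) != len(weights) or len(ratios) < 2:
        raise ValueError("need at least two ratio estimates with matching weights")
    pairs = sorted(zip(ratios, weights))
    total = sum(w for _, w in pairs)
    # Cumulative weight up to the midpoint of each SNP, as a fraction of the total.
    positions = []
    running = 0.0
    for _, w in pairs:
        running += w
        positions.append((running - 0.5 * w) / total)
    # Last SNP whose midpoint lies below the 50% mark.
    below = max(i for i, p in enumerate(positions) if p < 0.5)
    lo, hi = pairs[below][0], pairs[below + 1][0]
    frac = (0.5 - positions[below]) / (positions[below + 1] - positions[below])
    return lo + (hi - lo) * frac


# Hypothetical ratio estimates: two SNPs agree, one is an outlier.
toy_ratios = [0.1, 0.2, 0.9]
toy_weights = [1.0, 1.0, 1.0]

median_estimate = weighted_median(toy_ratios, toy_weights)
mean_estimate = sum(r * w for r, w in zip(toy_ratios, toy_weights)) / sum(toy_weights)
print(median_estimate, mean_estimate)
```

In this toy example the outlying SNP pulls the weighted mean upward, while the weighted median stays with the two SNPs that agree.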

To examine horizontal pleiotropy, the MR Egger intercept was applied; the result (P-value = 0.768 > 0.05) provided no evidence that horizontal pleiotropy affected this analysis. Furthermore, a leave-one-out test showed that no individual SNP had a substantial impact on the results, indicating good stability (Fig. 3C). Therefore, the MR analysis suggests that asparagine is inversely associated with schizophrenia.
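The MR Egger intercept test mentioned here can be sketched as a weighted linear regression of the SNP-outcome effects on the SNP-exposure effects in which the intercept is estimated rather than fixed at zero. The sketch below is a simplified illustration, not the study's software: the function name and data are hypothetical, the weights are assumed to be 1/SE^2 of the outcome effects, and the P-value (usually taken from a t distribution with n - 2 degrees of freedom applied to intercept / SE) is not computed.

```python
import math


def mr_egger(beta_exp, beta_out, se_out):
    """Weighted regression of outcome effects on exposure effects with an intercept.

    Returns (intercept, slope, se_intercept).
    """
    n = len(beta_exp)
    if n < 3:
        raise ValueError("MR Egger needs at least three SNPs")
    # Orient every SNP so that its exposure effect is positive.
    x = [abs(bx) for bx in beta_exp]
    y = [by if bx >= 0 else -by for bx, by in zip(beta_exp, beta_out)]
    w = [1.0 / so ** 2 for so in se_out]

    sw = sum(w)
    sx = sum(wi * xi for wi, xi in zip(w, x))
    sy = sum(wi * yi for wi, yi in zip(w, y))
    sxx = sum(wi * xi * xi for wi, xi in zip(w, x))
    sxy = sum(wi * xi * yi for wi, xi, yi in zip(w, x, y))
    det = sw * sxx - sx * sx  # zero only if all exposure effects are equal

    slope = (sw * sxy - sx * sy) / det
    intercept = (sxx * sy - sx * sxy) / det
    # Residual variance from the weighted residuals (n - 2 degrees of freedom).
    rss = sum(wi * (yi - intercept - slope * xi) ** 2 for wi, xi, yi in zip(w, x, y))
    se_intercept = math.sqrt(rss / (n - 2) * sxx / det)
    return intercept, slope, se_intercept


# Hypothetical summary statistics for five SNPs (not the study's data).
toy_beta_exp = [0.05, -0.10, 0.15, 0.20, 0.12]
toy_beta_out = [0.030, -0.058, 0.080, 0.105, 0.066]
toy_se_out = [0.01, 0.02, 0.01, 0.02, 0.015]

toy_intercept, toy_slope, toy_se_intercept = mr_egger(toy_beta_exp, toy_beta_out, toy_se_out)
print(toy_intercept, toy_slope, toy_se_intercept)
```

An intercept that is small relative to its standard error is what a non-significant intercept test reflects; it is an absence of evidence for directional pleiotropy rather than proof that none exists.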

Analysis results of schizophrenia and asparagine

In the reverse MR analysis, schizophrenia was treated as the exposure and asparagine as the outcome to study the causal effect of schizophrenia on asparagine. Through MR analysis (Fig. 3D), we assessed the individual influence of each SNP locus on asparagine. The IVW results indicated no causal relationship between schizophrenia and asparagine, with an OR of 1.005 (95% CI: 0.999–1.011, P-value = 0.132). The weighted mode and weighted median methods likewise found no causal effect of schizophrenia on asparagine. Additionally, the MR Egger estimate was not statistically significant (P-value > 0.05) (Table 1; Fig. 3E).

To examine horizontal pleiotropy, the MR Egger intercept was applied; the result (P-value = 0.474 > 0.05) provided no evidence that horizontal pleiotropy affected this analysis. Furthermore, a leave-one-out test showed that no individual SNP had a substantial impact on the results, indicating good stability (Fig. 3F).
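A leave-one-out test of the kind described here recomputes the pooled estimate once per SNP, each time dropping that SNP, and checks whether any single omission shifts the result. The following sketch assumes a fixed-effect IVW estimator as the pooled estimate; the SNP labels and summary statistics are hypothetical.

```python
def ivw(beta_exp, beta_out, se_out):
    """Fixed-effect IVW estimate from per-SNP summary statistics."""
    weights = [bx ** 2 / so ** 2 for bx, so in zip(beta_exp, se_out)]
    total = sum(weights)
    return sum(w * by / bx for w, bx, by in zip(weights, beta_exp, beta_out)) / total


def leave_one_out(snp_ids, beta_exp, beta_out, se_out):
    """Return {snp_id: IVW estimate with that SNP removed}."""
    results = {}
    for i, snp in enumerate(snp_ids):
        keep = [j for j in range(len(snp_ids)) if j != i]
        results[snp] = ivw([beta_exp[j] for j in keep],
                           [beta_out[j] for j in keep],
                           [se_out[j] for j in keep])
    return results


# Hypothetical data in which the fourth SNP points in the opposite direction.
snp_ids = ["snp1", "snp2", "snp3", "snp4"]
beta_exp = [0.10, 0.08, 0.12, 0.09]
beta_out = [-0.07, -0.06, -0.09, 0.05]
se_out = [0.03, 0.03, 0.03, 0.03]

loo = leave_one_out(snp_ids, beta_exp, beta_out, se_out)
for snp, estimate in loo.items():
    print(f"without {snp}: IVW estimate = {estimate:.3f}")
```

In this toy data the estimate changes markedly only when the discordant SNP is removed, which is the pattern a leave-one-out plot is meant to reveal; a stable result shows similar estimates across all omissions.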

Figure 3

Causal association between asparagine and schizophrenia (panels A to C) and between schizophrenia and asparagine (panels D to F) across the statistical analyses. A: Forest plot of asparagine-related SNPs and schizophrenia, with the red line showing the MR Egger test and IVW method. B: Scatter plot of the results for asparagine and schizophrenia, with the slope indicating the strength of the causal relationship. C: Leave-one-out test of the results for asparagine and schizophrenia. D: Forest plot of schizophrenia-related SNPs and asparagine, with the red line showing the MR Egger test and IVW method. E: Scatter plot of the results for schizophrenia and asparagine, with the slope indicating the strength of the causal relationship. F: Leave-one-out test of the results for schizophrenia and asparagine.

In this study, the MR results after sensitivity analysis suggested an inverse causal relationship between asparagine and schizophrenia. However, the reverse MR analysis did not reveal any causal relationship between schizophrenia and asparagine, and no causal relationship between aspartate and schizophrenia was found in either the forward or the reverse MR analysis (Fig. 4).

Figure 4

Summary of results from bidirectional two-sample MR study

Studies have reported decreased asparagine levels in patients with schizophrenia [16]. According to Madis Parksepp's research team, five years of continuous antipsychotic (AP) treatment induces significant metabolic changes in individuals diagnosed with schizophrenia: concentrations of asparagine (Asn), glutamine (Gln), methionine, ornithine, and taurine rose substantially, whereas aspartate (Asp), glutamate (Glu), and alpha-aminoadipic acid (α-AAA) levels declined notably. In mice, olanzapine (OLZ) treatment resulted in significantly lower Asn levels than in controls [27]. Asn and Asp play significant roles in various biological processes in the human body, such as glycoprotein synthesis and brain function. Notably, ammonia produced in brain tissue requires a rapid clearance pathway. Asn plays a crucial role in regulating cellular function in neural tissue through metabolic control; it is synthesized from Asp and ammonia by the enzyme asparagine synthetase. The brain also manages ammonia elimination by producing Gln and Asn. This may explain the significant increase in Asn and Gln levels (and the decrease in Asp and Glu levels) over 5 years of illness and AP treatment [28]. Marie Luise Rao's team compared unmedicated patients with schizophrenia, healthy individuals, and patients receiving antipsychotic treatment. Unmedicated patients had higher levels of asparagine, citrulline, phenylalanine, and cysteine, while the ratios of tyrosine, tryptophan, and tryptophan to competing amino acids were significantly lower than in healthy individuals [29].

The findings of our study demonstrate an inverse association between asparagine levels and the susceptibility to schizophrenia, suggesting that asparagine may serve as a protective factor against the development of this psychiatric disorder. However, we did not find a causal relationship between schizophrenia and asparagine. Consequently, additional investigation and scholarly discourse are warranted to gain a comprehensive understanding of this complex association.

Two different autopsy studies measured D-ASP levels in two different brain samples from patients with schizophrenia and a control group [ 14 ]. The first study, which utilized a limited sample size (7–10 subjects per diagnosis), demonstrated a reduction in D-ASP levels within the prefrontal cortex (PFC) postmortem among individuals diagnosed with schizophrenia, amounting to approximately 101%. This decrease was found to be correlated with a notable elevation in D-aspartate oxidase (DDO) mRNA levels within the same cerebral region [ 30 ]. In addition, the second study was conducted on a large sample size (20 subjects/diagnosis/brain regions). The findings of this study indicated a noteworthy decrease of approximately 30% in D-ASP selectivity within the dorsal lateral PFC (DLPFC) of individuals diagnosed with schizophrenia, when compared to corresponding brain regions of individuals without schizophrenia. However, no significant reduction in D-ASP was observed in the hippocampus of patients with schizophrenia. The decrease in D-Asp content was associated with a significant increase (about 25%) in DDO enzyme activity in the DLPFC of schizophrenia patients. This observation highlights the existence of a dysfunctional metabolic process in DDO activity levels in the brains of schizophrenia patients [ 31 ].

Numerous preclinical investigations have demonstrated the influence of D-Asp on various NMDAR-dependent phenotypes linked to schizophrenia. In D-Asp oxidase gene knockout mice given D-Asp, the sustained increase in D-Asp significantly reduced the abnormal neuronal pre-pulse inhibition induced by psychoactive drugs such as MK-801 and amphetamine [32]. According to a review, the free amino acids D-Asp and D-Ser (D-serine) have been identified as highly effective and safe nutrients for promoting mental well-being. These amino acids not only serve as integral components of the central nervous system’s structural proteins, but also play a vital role in maintaining its optimal functioning, owing to their essential role in regulating the levels of neurotransmitters such as dopamine, norepinephrine, and serotonin. For many patients with schizophrenia, amino acid supplementation may be one of the most lasting and effective ways to improve treatment: it can enhance the expected therapeutic effect of AP and alleviate the positive and negative symptoms of schizophrenia [33].

Numerous studies have demonstrated a plausible correlation between aspartate and schizophrenia; however, our prospective and reverse MR investigations have failed to establish a causal link between aspartate and schizophrenia. This discrepancy may be attributed to the indirect influence of aspartate on the central nervous system through the stimulation of NMDAR, necessitating further investigation to elucidate the direct relationship between aspartate and schizophrenia.

This study used a bidirectional two-sample MR analysis to explore the causal relationships of aspartate and asparagine with schizophrenia, as well as the reverse relationships [34]. MR analysis offers several advantages for determining causality [35]. Notably, because alleles are randomly allocated to gametes, instrumental variables can be assumed to be uncorrelated with confounding factors, which mitigates confounding bias in causal inference. Furthermore, the large sample size of the GWAS summary data increases confidence in the results [36]. Consequently, this investigation not only advances existing research on the relationship of aspartate and asparagine with schizophrenia, but may also inform clinical treatment decisions for patients with schizophrenia.

Nevertheless, this study has certain limitations. It relies solely on populations of European ancestry for both exposure and outcome, so it remains uncertain whether these findings can be replicated in populations of non-European ancestry; this requires further investigation. In addition, this study could not evaluate whether the effects of aspartate and asparagine on schizophrenia vary by gender or age, and stratified MR analyses should be performed. Additional experimental research is needed for a comprehensive understanding of the biological mechanisms connecting aspartate and asparagine with schizophrenia.

In summary, our MR analysis found an inverse causal association between asparagine and schizophrenia, suggesting that asparagine reduces the risk of schizophrenia. However, there was no evidence of a causal effect of schizophrenia on asparagine. This study provides new ideas for the early detection of schizophrenia in the clinical setting and offers new insights into the etiology and pathogenesis of schizophrenia. Nonetheless, additional research is required to elucidate the mechanisms that underlie the association of aspartate and asparagine with schizophrenia.

Availability of data and materials

The datasets generated and analysed during the current study are available in the GWAS repository. https://gwas.mrcieu.ac.uk/datasets/met-a-388/ , https://gwas.mrcieu.ac.uk/datasets/met-a-638/ , https://gwas.mrcieu.ac.uk/datasets/ieu-b-42/ .

References

1. Charlson FJ, Ferrari AJ, Santomauro DF, Diminic S, Stockings E, Scott JG, McGrath JJ, Whiteford HA. Global epidemiology and burden of schizophrenia: findings from the global burden of disease study 2016. Schizophr Bull. 2018;44(6):1195–203.
2. Rehm J, Shield KD. Global burden of disease and the impact of mental and addictive disorders. Curr Psychiatry Rep. 2019;21(2):10.
3. Vita A, Minelli A, Barlati S, Deste G, Giacopuzzi E, Valsecchi P, Turrina C, Gennarelli M. Treatment-resistant Schizophrenia: genetic and neuroimaging correlates. Front Pharmacol. 2019;10:402.
4. Adell A. Brain NMDA receptors in schizophrenia and depression. Biomolecules. 2020;10(6):947.
5. Biological insights from 108 schizophrenia-associated genetic loci. Nature. 2014;511(7510):421–7.
6. Abdulbagi M, Wang L, Siddig O, Di B, Li B. D-Amino acids and D-amino acid-containing peptides: potential disease biomarkers and therapeutic targets? Biomolecules. 2021;11(11):1716.
7. Krashia P, Ledonne A, Nobili A, Cordella A, Errico F, Usiello A, D’Amelio M, Mercuri NB, Guatteo E, Carunchio I. Persistent elevation of D-Aspartate enhances NMDA receptor-mediated responses in mouse substantia Nigra pars compacta dopamine neurons. Neuropharmacology. 2016;103:69–78.
8. Kantrowitz JT, Epstein ML, Lee M, Lehrfeld N, Nolan KA, Shope C, Petkova E, Silipo G, Javitt DC. Improvement in mismatch negativity generation during d-serine treatment in schizophrenia: correlation with symptoms. Schizophr Res. 2018;191:70–9.
9. Elkis H, Buckley PF. Treatment-resistant Schizophrenia. Psychiatr Clin North Am. 2016;39(2):239–65.
10. Dunlop DS, Neidle A, McHale D, Dunlop DM, Lajtha A. The presence of free D-aspartic acid in rodents and man. Biochem Biophys Res Commun. 1986;141(1):27–32.
11. de Bartolomeis A, Vellucci L, Austin MC, De Simone G, Barone A. Rational and translational implications of D-Amino acids for treatment-resistant Schizophrenia: from neurobiology to the clinics. Biomolecules. 2022;12(7):909.
12. Taniguchi K, Sawamura H, Ikeda Y, Tsuji A, Kitagishi Y, Matsuda S. D-amino acids as a biomarker in schizophrenia. Diseases. 2022;10(1):9.
13. Singh SP, Singh V. Meta-analysis of the efficacy of adjunctive NMDA receptor modulators in chronic schizophrenia. CNS Drugs. 2011;25(10):859–85.
14. Errico F, Napolitano F, Squillace M, Vitucci D, Blasi G, de Bartolomeis A, Bertolino A, D’Aniello A, Usiello A. Decreased levels of D-aspartate and NMDA in the prefrontal cortex and striatum of patients with schizophrenia. J Psychiatr Res. 2013;47(10):1432–7.
15. Rousseau J, Gagné V, Labuda M, Beaubois C, Sinnett D, Laverdière C, Moghrabi A, Sallan SE, Silverman LB, Neuberg D, et al. ATF5 polymorphisms influence ATF function and response to treatment in children with childhood acute lymphoblastic leukemia. Blood. 2011;118(22):5883–90.
16. Liu L, Zhao J, Chen Y, Feng R. Metabolomics strategy assisted by transcriptomics analysis to identify biomarkers associated with schizophrenia. Anal Chim Acta. 2020;1140:18–29.
17. Cao B, Wang D, Brietzke E, McIntyre RS, Pan Z, Cha D, Rosenblat JD, Zuckerman H, Liu Y, Xie Q, et al. Characterizing amino-acid biosignatures amongst individuals with schizophrenia: a case-control study. Amino Acids. 2018;50(8):1013–23.
18. Olthof BMJ, Gartside SE, Rees A. Puncta of neuronal nitric oxide synthase (nNOS) mediate NMDA receptor signaling in the Auditory Midbrain. J Neurosci. 2019;39(5):876–87.
19. Tortorella A, Monteleone P, Fabrazzo M, Viggiano A, De Luca L, Maj M. Plasma concentrations of amino acids in chronic schizophrenics treated with clozapine. Neuropsychobiology. 2001;44(4):167–71.
20. Rao ML, Strebel B, Gross G, Huber G. Serum amino acid profiles and dopamine in schizophrenic patients and healthy subjects: window to the brain? Amino Acids. 1992;2(1–2):111–8.
21. Davey Smith G, Ebrahim S. What can mendelian randomisation tell us about modifiable behavioural and environmental exposures? BMJ (Clinical Res ed). 2005;330(7499):1076–9.
22. Freuer D, Meisinger C. Causal link between thyroid function and schizophrenia: a two-sample mendelian randomization study. Eur J Epidemiol. 2023;38(10):1081–8.
23. Papiol S, Schmitt A, Maurus I, Rossner MJ, Schulze TG, Falkai P. Association between Physical Activity and Schizophrenia: results of a 2-Sample mendelian randomization analysis. JAMA Psychiat. 2021;78(4):441–4.
24. Shin SY, Fauman EB, Petersen AK, Krumsiek J, Santos R, Huang J, Arnold M, Erte I, Forgetta V, Yang TP, et al. An atlas of genetic influences on human blood metabolites. Nat Genet. 2014;46(6):543–50.
25. Zhao JV, Kwok MK, Schooling CM. Effect of glutamate and aspartate on ischemic heart disease, blood pressure, and diabetes: a mendelian randomization study. Am J Clin Nutr. 2019;109(4):1197–206.
26. Zhou K, Zhu L, Chen N, Huang G, Feng G, Wu Q, Wei X, Gou X. Causal associations between schizophrenia and cancers risk: a mendelian randomization study. Front Oncol. 2023;13:1258015.
27. Zapata RC, Rosenthal SB, Fisch K, Dao K, Jain M, Osborn O. Metabolomic profiles associated with a mouse model of antipsychotic-induced food intake and weight gain. Sci Rep. 2020;10(1):18581.
28. Parksepp M, Leppik L, Koch K, Uppin K, Kangro R, Haring L, Vasar E, Zilmer M. Metabolomics approach revealed robust changes in amino acid and biogenic amine signatures in patients with schizophrenia in the early course of the disease. Sci Rep. 2020;10(1):13983.
29. Rao ML, Gross G, Strebel B, Bräunig P, Huber G, Klosterkötter J. Serum amino acids, central monoamines, and hormones in drug-naive, drug-free, and neuroleptic-treated schizophrenic patients and healthy subjects. Psychiatry Res. 1990;34(3):243–57.
30. Errico F, D’Argenio V, Sforazzini F, Iasevoli F, Squillace M, Guerri G, Napolitano F, Angrisano T, Di Maio A, Keller S, et al. A role for D-aspartate oxidase in schizophrenia and in schizophrenia-related symptoms induced by phencyclidine in mice. Transl Psychiatry. 2015;5(2):e512.
31. Nuzzo T, Sacchi S, Errico F, Keller S, Palumbo O, Florio E, Punzo D, Napolitano F, Copetti M, Carella M, et al. Decreased free d-aspartate levels are linked to enhanced d-aspartate oxidase activity in the dorsolateral prefrontal cortex of schizophrenia patients. NPJ Schizophr. 2017;3:16.
32. Errico F, Rossi S, Napolitano F, Catuogno V, Topo E, Fisone G, D’Aniello A, Centonze D, Usiello A. D-aspartate prevents corticostriatal long-term depression and attenuates schizophrenia-like symptoms induced by amphetamine and MK-801. J Neurosci. 2008;28(41):10404–14.
33. Nasyrova RF, Khasanova AK, Altynbekov KS, Asadullin AR, Markina EA, Gayduk AJ, Shipulin GA, Petrova MM, Shnayder NA. The role of D-Serine and D-aspartate in the pathogenesis and therapy of treatment-resistant schizophrenia. Nutrients. 2022;14(23):5142.
34. Hao D, Liu C. Deepening insights into food and medicine continuum within the context of pharmacophylogeny. Chin Herb Med. 2023;15(1):1–2.
35. Zhang Y. Awareness and ability of paradigm shift are needed for research on dominant diseases of TCM. Chin Herb Med. 2023;15(4):475.
36. Chen J. Essential role of medicine and food homology in health and wellness. Chin Herb Med. 2023;15(3):347–8.

This work was supported by the National Natural Science Foundation of China (82271546, 82301725, 81971601); National Key Research and Development Program of China (2023YFC2506201); Key Project of Science and Technology Innovation 2030 of China (2021ZD0201800, 2021ZD0201805); China Postdoctoral Science Foundation (2023M732155); Fundamental Research Program of Shanxi Province (202203021211018, 202203021212028, 202203021212038); Research Project Supported by Shanxi Scholarship Council of China (2022-190); Scientific Research Plan of Shanxi Health Commission (2020081, 2020SYS03, 2021RC24); Shanxi Provincial Administration of Traditional Chinese Medicine (2023ZYYC2034); Scientific and Technological Innovation Programs of Higher Education Institutions in Shanxi (2022L132); Shanxi Medical University School-level Doctoral Initiation Fund Project (XD2102); Youth Project of First Hospital of Shanxi Medical University (YQ2203); Doctor Fund Project of Shanxi Medical University in Shanxi Province (SD2216); Shanxi Science and Technology Innovation Talent Team (202304051001049); 136 Medical Rejuvenation Project of Shanxi Province, China; STI2030-Major Projects-2021ZD0200700; and Key Laboratory of Health Commission of Shanxi Province (2020SYS03).

Author information

Huang-Hui Liu and Yao Gao contributed equally to this work.

Authors and Affiliations

Department of Psychiatry, First Hospital/First Clinical Medical College of Shanxi Medical University, NO.85 Jiefang Nan Road, Taiyuan, China

Huang-Hui Liu, Yao Gao, Dan Xu, Xin-Zhe Du, Si-Meng Wei, Jian-Zhen Hu, Yong Xu & Liu Sha

Shanxi Key Laboratory of Artificial Intelligence Assisted Diagnosis and Treatment for Mental Disorder, First Hospital of Shanxi Medical University, Taiyuan, China

Contributions

Huang-Hui Liu and Yao Gao provided the concept and designed the study. Huang-Hui Liu and Yao Gao conducted the analyses and wrote the manuscript. Dan Xu, Huang-Hui Liu and Yao Gao participated in data collection. Xin-Zhe Du, Si-Meng Wei and Jian-Zhen Hu participated in the analysis of the data. Liu Sha, Yong Xu and Yao Gao revised and proof-read the manuscript. All authors contributed to the article and approved the submitted version.

Corresponding authors

Correspondence to Yong Xu or Liu Sha .

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ . The Creative Commons Public Domain Dedication waiver ( http://creativecommons.org/publicdomain/zero/1.0/ ) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

About this article

Cite this article.

Liu, HH., Gao, Y., Xu, D. et al. Asparagine reduces the risk of schizophrenia: a bidirectional two-sample mendelian randomization study of aspartate, asparagine and schizophrenia. BMC Psychiatry 24 , 299 (2024). https://doi.org/10.1186/s12888-024-05765-5

Received : 20 February 2024

Accepted : 15 April 2024

Published : 19 April 2024

DOI : https://doi.org/10.1186/s12888-024-05765-5

  • Schizophrenia
  • Mendelian randomization

BMC Psychiatry

ISSN: 1471-244X

research paper on genome wide association studies

Adjusting for principal components can induce spurious associations in genome-wide association studies in admixed populations

  • PMID: 38617337
  • PMCID: PMC11014513
  • DOI: 10.1101/2024.04.02.587682

Principal component analysis (PCA) is widely used to control for population structure in genome-wide association studies (GWAS). Top principal components (PCs) typically reflect population structure, but challenges arise in deciding how many PCs are needed and ensuring that PCs do not capture other artifacts such as regions with atypical linkage disequilibrium (LD). In response to the latter, many groups suggest performing LD pruning or excluding known high LD regions prior to PCA. However, these suggestions are not universally implemented and the implications for GWAS are not fully understood, especially in the context of admixed populations. In this paper, we investigate the impact of pre-processing and the number of PCs included in GWAS models in African American samples from the Women's Health Initiative SNP Health Association Resource and two Trans-Omics for Precision Medicine Whole Genome Sequencing Project contributing studies (Jackson Heart Study and Genetic Epidemiology of Chronic Obstructive Pulmonary Disease Study). In all three samples, we find the first PC is highly correlated with genome-wide ancestry whereas later PCs often capture local genomic features. The pattern of which, and how many, genetic variants are highly correlated with individual PCs differs from what has been observed in prior studies focused on European populations and leads to distinct downstream consequences: adjusting for such PCs yields biased effect size estimates and elevated rates of spurious associations due to the phenomenon of collider bias. Excluding high LD regions identified in previous studies does not resolve these issues. LD pruning proves more effective, but the optimal choice of thresholds varies across datasets. Altogether, our work highlights unique issues that arise when using PCA to control for ancestral heterogeneity in admixed populations and demonstrates the importance of careful pre-processing and diagnostics to ensure that PCs capturing multiple local genomic features are not included in GWAS models.
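As an illustration of the LD pruning step discussed in this abstract, the sketch below greedily drops any variant whose squared correlation with an already-kept nearby variant reaches a threshold. It is a simplified stand-in for the windowed pairwise pruning offered by common genetics toolkits, not the authors' pipeline; the function names, the genotype matrix layout (variants by samples, 0/1/2 dosages), and the toy data are all assumptions. The threshold is left as a parameter because, as the abstract notes, the best choice varies across datasets.

```python
def r_squared(g1, g2):
    """Squared Pearson correlation between two genotype vectors (0/1/2 dosages)."""
    n = len(g1)
    m1, m2 = sum(g1) / n, sum(g2) / n
    cov = sum((a - m1) * (b - m2) for a, b in zip(g1, g2))
    v1 = sum((a - m1) ** 2 for a in g1)
    v2 = sum((b - m2) ** 2 for b in g2)
    if v1 == 0 or v2 == 0:
        return 0.0  # monomorphic variant: correlation undefined, treat as unlinked
    return cov * cov / (v1 * v2)


def ld_prune(genotypes, r2_threshold, window):
    """Greedy pruning: keep a variant only if its r^2 with every already-kept
    variant among the previous `window` positions is below the threshold."""
    kept = []
    for j, g in enumerate(genotypes):
        neighbours = [k for k in kept if j - k <= window]
        if all(r_squared(genotypes[k], g) < r2_threshold for k in neighbours):
            kept.append(j)
    return kept


# Hypothetical genotypes: variant 1 duplicates variant 0; variant 2 is weakly linked.
toy_genotypes = [
    [0, 1, 2, 0, 1, 2],
    [0, 1, 2, 0, 1, 2],
    [2, 0, 1, 1, 2, 0],
]

kept_lenient = ld_prune(toy_genotypes, r2_threshold=0.5, window=10)
kept_strict = ld_prune(toy_genotypes, r2_threshold=0.2, window=10)
print(kept_lenient, kept_strict)
```

The two calls show how the set of variants passed on to PCA depends on the threshold: the stricter setting removes the weakly linked variant as well as the duplicate.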

Author summary: Principal component analysis (PCA) is a widely used technique in human genetics research. One of its most frequent applications is in the context of genetic association studies, wherein researchers use PCA to infer, and then adjust for, the genetic ancestry of study participants. Although a powerful approach, prior work has shown that PCA sometimes captures other features or data quality issues, and pre-processing steps have been suggested to address these concerns. However, the utility and downstream implications of this recommended pre-processing are not fully understood, nor are these steps universally implemented. Moreover, the vast majority of prior work in this area was conducted in studies that exclusively included individuals of European ancestry. Here, we revisit this work in the context of admixed populations, that is, populations with diverse, mixed ancestry that have been largely underrepresented in genetics research to date. We demonstrate the unique concerns that can arise in this context and illustrate the detrimental effects that including principal components in genetic association study models can have when not implemented carefully. Altogether, we hope our work serves as a reminder of the care that must be taken (including careful pre-processing, diagnostics, and modeling choices) when implementing PCA in admixed populations and beyond.

COMMENTS

  1. Genome-wide association studies

    Genome-wide association studies (GWAS) test hundreds of thousands of genetic variants across many genomes to find those statistically associated with a specific trait or disease. This methodology ...

  2. 15 years of genome-wide association studies and no signs of slowing

    First, the decreasing cost of genome-wide genotyping arrays, now >20 times less expensive than 15 years ago, has allowed more studies to participate in gene discovery efforts. Recent GWAS meta ...

  3. Benefits and limitations of genome-wide association studies

    Nature Genetics (2024) Genome-wide association studies (GWAS) involve testing genetic variants across the genomes of many individuals to identify genotype-phenotype associations. GWAS have ...

  4. 10 Years of GWAS Discovery: Biology, Function, and Translation

    Abstract. Application of the experimental design of genome-wide association studies (GWASs) is now 10 years old (young), and here we review the remarkable range of discoveries it has facilitated in population and complex-trait genetics, the biology of diseases, and translation toward new therapeutics. We predict the likely discoveries in the ...

  5. Chapter 11: Genome-Wide Association Studies

    Abstract. Genome-wide association studies (GWAS) have evolved over the last ten years into a powerful tool for investigating the genetic architecture of human disease. In this work, we review the key concepts underlying GWAS, including the architecture of common diseases, the structure of common human genetic variation, technologies for ...

  6. An Overview of Genome-Wide Association Studies

    Abstract. Genome-wide association study (GWAS) is a powerful study design to identify genetic variants of a trait and, in particular, detect the association between common single-nucleotide polymorphisms (SNPs) and common human diseases such as heart disease, inflammatory bowel disease, type 2 diabetes, and psychiatric disorders.

  7. The impact of genome-wide association studies on biomedical research

    The past decade has seen major investment in genome-wide association studies (GWAS). Among the many goals of GWAS, a major one is to identify and motivate research on novel genes involved in complex human disease. To assess whether this goal is being met, we quantified the effect of GWAS on the overall distribution of biomedical research ...

  8. Perspectives and recent progress of genome-wide association studies

    In addition to this, genome-wide genotyping is a prerequisite for genome-wide association studies that have been used successfully to discover the genes, which control polygenic traits including the genetic loci, associated with the trait of interest in fruit crops. ... Biotechnology Research and Application Center, Çukurova University, 01330 ...

  9. Status and prospects of genome-wide association studies in plants

    Abstract. Genome-wide association studies (GWAS) have developed into a powerful and ubiquitous tool for the investigation of complex traits. In large part, this was fueled by advances in genomic technology, enabling us to examine genome-wide genetic variants across diverse genetic materials. The development of the mixed model framework for GWAS ...

  10. Research progress and applications of genome‐wide association study in

    Inexpensive, large-scale or hyper-scale sequencing has allowed the study of the associations between epigenetic or microbiota features and animal phenotypes, and therefore the metagenome-wide association study (MWAS) and epigenome-wide association study (EWAS) have been developed. Partial research highlights are provided in Table 1.

  11. (PDF) Chapter 11: Genome-Wide Association Studies

    Abstract and Figures. Genome-wide association studies (GWAS) have evolved over the last ten years into a powerful tool for investigating the genetic architecture of human disease. In this work, we ...

  12. Genome-wide association studies: assessing trait characteristics in

    GWAS involves testing genetic variants across the genomes of many individuals of a population to identify genotype-phenotype associations. It was initially developed for, and has proven highly successful in, human disease genetics. In plants, GWAS initially focused on single-feature polymorphisms, recombination and linkage disequilibrium, but has now been embraced ...

  13. Genome-Wide Association Studies Fact Sheet

    The impact on medical care from genome-wide association studies could potentially be substantial. Such research is laying the groundwork for the era of personalized medicine, in which the current one-size-fits-all approach to medical care will give way to more customized strategies. In the future, after improvements are made in the cost and efficiency of genome-wide scans and other innovative ...

  14. A scientometric review of genome-wide association studies

    This scientometric review of genome-wide association studies (GWAS) from 2005 to 2018 (3639 studies; 3508 traits) reveals extraordinary increases in sample sizes, rates of discovery and traits ...

  15. Genome-wide association studies have problems due to confounding: Are

    Genome-wide association studies (GWASs) can be affected by confounding, but family-based GWASs use random, within-family genetic variation to avoid this. This Primer explores a study in PLOS Biology which asks how different sources of confounding affect GWASs and whether family-based designs offer a solution.

  16. Genome-wide association study and its applications in the non-model

    Sesame is a rare example of a non-model, minor crop for which numerous genetic loci and candidate genes underlying features of interest have been identified at relatively high resolution. This progress has been achieved through application of the genome-wide association study (GWAS) approach. GWAS has benefited from the availability of high-quality genomes, re-sequencing data from ...

  17. Advancements and Prospects of Genome-wide Association Studies

    Genome-wide association studies (GWAS) aim to identify the genetic variants associated with dichotomous (e.g., type 2 diabetes case/control status) or quantitative (e.g., serum fasting glucose levels) traits. The study involves a high-density scan of the genome, genotyping single nucleotide polymorphisms (SNPs) and then using these genotyped SNPs, matched with an appropriate reference ...

  18. Genome-Wide Association Studies

    Abstract. Genetic association studies have made a major contribution to our understanding of the genetics of complex disorders over the last 10 years through genome-wide association studies (GWAS). In this chapter, we review the key concepts that underlie the GWAS approach. We will describe the "common disease, common variant" theory, and will ...

  19. Genome-wide Association Studies

    The genomic era of biomedical research has given rise to the genome-wide association study (GWAS) approach, which attempts to discover novel genes affecting an outcome by testing a large number (i.e., hundreds of thousands to millions) of genetic variants for association. This article discusses the issues surrounding the GWAS approach with ...

  20. Genome-wide association study in Alzheimer's disease: A bibliometric

    Background: Thousands of research studies concerning genome-wide association studies (GWAS) in Alzheimer's disease (AD) have been published in the last decades. However, a comprehensive understanding of the current research status and future development trends of GWAS in AD has not been clearly established. In this study, we tried to gain a systematic overview of GWAS in AD by bibliometric and ...

  21. Integrating both common and rare variants to predict bone mineral

    For more than 10 years now, genome-wide association studies (GWASs) have identified ~1000 genetic loci associated with BMD, osteoporosis, and fractures in human populations [1]. However, even in large-scale biobank-based GWAS [2], most of the identified genetic variants are common, and few low-frequency and rare variants were identified to be associated with these bone traits by previous ...

  22. Interpreting population- and family-based genome-wide association

    A central aim of genome-wide association studies (GWASs) is to estimate direct genetic effects: the causal effects on an individual's phenotype of the alleles that they carry. However, estimates of direct effects can be subject to genetic and environmental confounding and can also absorb the "indirect" genetic effects of relatives' genotypes. Recently, an important development in ...

  23. The impact of genome-wide association studies on biomedical research

    The past decade has seen major investment in genome-wide association studies (GWAS). Among the many goals of GWAS, a major one is to identify and motivate research on novel genes involved in complex human disease. To assess whether this goal is being met, we quantified the effect of GWAS on the overall distribution of biomedical research publications and on the subsequent publication history ...

  24. Genome-wide association analysis for drought tolerance and ...

    The potential production and productivity of groundnuts are limited due to severe drought stress associated with climate change. The current study aimed to identify genomic regions and candidate genes associated with drought tolerance and component traits for gene introgression and to guide marker-assisted breeding of groundnut varieties. Ninety-nine genetically diverse groundnut genotypes ...

  25. Genome-wide association studies

    A genome-wide association study of 57 trace elements measured in up to 6564 Scandinavians identifies genetic loci associated with blood levels of essential and non-essential trace elements and ...

  26. Shared genetic architecture between autoimmune disorders and B-cell

    This study aimed to examine the shared genetic structure between autoimmune diseases and B-cell acute lymphoblastic leukemia (B-ALL) and to identify the shared risk loci, genes and genetic mechanisms involved. Based on large-scale genome-wide association study (GWAS) summary-level data sets, we observed genetic overlaps between autoimmune diseases and B-ALL, and cross-trait pleiotropic analysis was performed to ...

  27. [PDF] Genome-wide association study of early-onset and late-onset

    DOI: 10.1192/j.eurpsy.2024.26. Corpus ID: 268817000. Genome-wide association study of early-onset and late-onset postpartum depression: the IGEDEPP prospective study. Authors: Sarah Tebeka, Emilie Gloaguen, Jimmy Mullaert, Qin He and ...

  28. Genome-Wide Association Studies and Beyond

    INTRODUCTION. Genome-wide association studies (GWAS) compare common genetic variants in large numbers of affected cases to those in unaffected controls to determine whether an association with disease exists (34, 55). GWAS have been made possible by the identification of millions of single nucleotide polymorphisms (SNPs) across the human genome and the realization that a subset of these SNPs ...

  29. Asparagine reduces the risk of schizophrenia: a bidirectional two

    This study employed summary data from genome-wide association studies (GWAS) conducted on European populations to examine the association of aspartate and asparagine with schizophrenia. To investigate the causal effects of aspartate and asparagine on schizophrenia, this study conducted a two-sample bidirectional MR analysis using ...

  30. Adjusting for principal components can induce spurious ...

    Principal component analysis (PCA) is widely used to control for population structure in genome-wide association studies (GWAS). Top principal components (PCs) typically reflect population structure, but challenges arise in deciding how many PCs are needed and ensuring that PCs do not capture other artifacts such as regions with atypical linkage disequilibrium (LD).