Introduction to Meta-Analysis: A Guide for the Novice

Free Meta-Analysis Software and Macros

MetaXL (Version 2.0)

RevMan (Version 5.3)

Meta-Analysis Macros for SAS, SPSS, and Stata

Opposing theories and disparate findings populate the field of psychology; scientists must interpret the results of any single study in the context of its limitations. Meta-analysis is a robust tool that can help researchers overcome these challenges by assimilating data across studies identified through a literature review. In other words, rather than surveying participants, a meta-analysis surveys studies. The goal is to calculate the direction and/or magnitude of an effect across all relevant studies, both published and unpublished. Despite the utility of this statistical technique, it can intimidate a beginner who has no formal training in the approach. However, any motivated researcher with a statistics background can complete a meta-analysis. This article provides an overview of the main steps of basic meta-analysis.

Meta-analysis has many strengths. First, meta-analysis provides an organized approach for handling a large number of studies. Second, the process is systematic and documented in great detail, which allows readers to evaluate the researchers’ decisions and conclusions. Third, meta-analysis allows researchers to examine an effect within a collection of studies in a more sophisticated manner than a qualitative summary.

However, meta-analysis also involves numerous challenges. First, it consumes a great deal of time and requires a great deal of effort. Second, meta-analysis has been criticized for aggregating studies that are too different (i.e., mixing “apples and oranges”). Third, some scientists argue that the objective coding procedure used in meta-analysis ignores the context of each individual study, such as its methodological rigor. Fourth, when a researcher includes low-quality studies in a meta-analysis, the limitations of these studies impact the mean effect size (i.e., “garbage in, garbage out”). As long as researchers are aware of these issues and consider the potential influence of these limitations on their findings, meta-analysis can serve as a powerful and informative approach to help us draw conclusions from a large literature base.

  Identifying the Right Question

Similar to any research study, a meta-analysis begins with a research question. Meta-analysis can be used in any situation where the goal is to summarize quantitative findings from empirical studies. It can be used to examine different types of effects, including prevalence rates (e.g., percentage of rape survivors with depression), growth rates (e.g., changes in depression from pretreatment to posttreatment), group differences (e.g., comparison of treatment and placebo groups on depression), and associations between variables (e.g., correlation between depression and self-esteem). To select the effect metric, researchers should consider the statistical form of the results in the literature. Any given meta-analysis can focus on only one metric at a time. While selecting a research question, researchers should think about the size of the literature base and select a manageable topic. At the same time, they should make sure the number of existing studies is large enough to warrant a meta-analysis.

Determining Eligibility Criteria

After choosing a relevant question, researchers should then identify and explicitly state the types of studies to be included. These criteria ensure that the studies overlap enough in topic and methodology that it makes sense to combine them. The inclusion and exclusion criteria depend on the specific research question and characteristics of the literature. First, researchers can specify relevant participant characteristics, such as age or gender. Second, researchers can identify the key variables that must be included in the study. Third, the language, date range, and types (e.g., peer-reviewed journal articles) of studies should be specified. Fourth, pertinent study characteristics, such as experimental design, can be defined. Eligibility criteria should be clearly documented and relevant to the research question. Specifying the eligibility criteria prior to conducting the literature search allows the researcher to perform a more targeted search and reduces the number of irrelevant studies. Eligibility criteria can also be revised later, because the researcher may become aware of unforeseen issues during the literature search stage.

Conducting a Literature Search and Review

The next step is to identify, retrieve, and review published and unpublished studies. The goal is to be exhaustive; however, being too broad can result in an overwhelming number of studies to review.

Online databases, such as PsycINFO and PubMed, compile millions of searchable records, including peer-reviewed journals, books, and dissertations. In addition, through these electronic databases, researchers can access the full text of many of the records. It is important that researchers carefully choose search terms and databases, because these decisions impact the breadth of the review. Researchers who aren’t familiar with the research topic should consult with an expert.

Additional ways to identify studies include searching conference proceedings, examining reference lists of relevant studies, and directly contacting researchers. After the literature search is completed, researchers must evaluate each study for inclusion using the eligibility criteria. At least a subset of the studies should be reviewed by two individuals (i.e., double coded) to serve as a reliability check. It is vital that researchers keep meticulous records of this process; for publication, a flow diagram is typically required to depict the search and results. Researchers should allow adequate time, because this step can be quite time consuming.

Calculating Effect Size

Next, researchers calculate an effect size for each eligible study. The effect size is the key component of a meta-analysis because it encodes the results in a numeric value that can then be aggregated. Examples of commonly used effect size metrics include Cohen’s d (i.e., group differences) and Pearson’s r (i.e., association between two variables). The effect size metric is based on the statistical form of the results in the literature and the research question. Because studies that include more participants provide more accurate estimates of an effect than those that include fewer participants, it is important to also calculate the precision of the effect size (e.g., standard error).
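
To make the calculation concrete, here is a minimal Python sketch that computes Cohen’s d and an approximate standard error from group summary statistics; the group means, SDs, and sample sizes are invented, and the SE formula shown is one common large-sample approximation rather than the only option.

```python
import math

def cohens_d(m1, sd1, n1, m2, sd2, n2):
    """Cohen's d for two independent groups, using the pooled SD."""
    pooled_sd = math.sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2))
    d = (m1 - m2) / pooled_sd
    # Common large-sample approximation to the standard error of d
    se = math.sqrt((n1 + n2) / (n1 * n2) + d**2 / (2 * (n1 + n2)))
    return d, se

# Hypothetical study: treatment vs. control group on a depression scale
d, se = cohens_d(m1=12.1, sd1=4.0, n1=40, m2=15.3, sd2=4.5, n2=38)
print(f"d = {d:.2f}, SE = {se:.2f}")
```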

Meta-analysis software guides researchers through the calculation process by requesting the necessary information for the specified effect size metric. I have identified some potentially useful resources and programs below. Although meta-analysis software makes effect size calculations simple, it is good practice for researchers to understand what computations are being used.

The effect size and precision of each individual study are aggregated into a summary statistic, which can be done with meta-analysis software. Researchers should confirm that the effect sizes are independent of each other (i.e., no overlap in participants). Additionally, researchers must select either a fixed effects model (i.e., assumes all studies share one true effect size) or a random effects model (i.e., assumes the true effect size varies among studies). The random effects model is typically preferred when the studies have been conducted using different methodologies. Depending on the software, additional specifications or adjustments may be possible.

During analysis, the effect sizes of the included studies are weighted by their precision (e.g., inverse of the sampling error variance) and the mean is calculated. The mean effect size represents the direction and/or magnitude of the effect summarized across all eligible studies. This statistic is typically accompanied by an estimate of its precision (e.g., confidence interval) and a p-value representing statistical significance. Forest plots are a common way of displaying meta-analysis results.
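
For readers who want to see the arithmetic behind the weighting, the sketch below computes a fixed-effect inverse-variance weighted mean with a 95% confidence interval; the effect sizes and standard errors are invented, and a real analysis would normally rely on dedicated meta-analysis software.

```python
import math

# Hypothetical effect sizes (e.g., Cohen's d) and their standard errors
effects = [0.42, 0.31, 0.55, 0.10]
ses     = [0.10, 0.15, 0.20, 0.12]

weights = [1 / se**2 for se in ses]          # precision = inverse variance
mean_effect = sum(w * y for w, y in zip(weights, effects)) / sum(weights)
se_mean = math.sqrt(1 / sum(weights))

ci_low, ci_high = mean_effect - 1.96 * se_mean, mean_effect + 1.96 * se_mean
print(f"Mean effect = {mean_effect:.3f}, 95% CI [{ci_low:.3f}, {ci_high:.3f}]")
```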

Depending on the situation, follow-up analyses may be advised. Researchers can quantify heterogeneity (e.g., Q, τ², I²), which is a measure of the variation among the effect sizes of included studies. Moderator variables, such as the quality of the studies or age of participants, may be included to examine sources of heterogeneity. Because published studies may be biased towards significant effects, it is important to evaluate the impact of publication bias (e.g., funnel plot, Rosenthal’s Fail-safe N). Sensitivity analysis can indicate how the results of the meta-analysis would change if one study were excluded from the analysis.
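
To make the Q and I² statistics concrete, here is a minimal sketch using the same invented effect sizes as above; it is illustrative only, and estimation of τ² (e.g., for a random-effects model) is not shown here.

```python
import math

effects = [0.42, 0.31, 0.55, 0.10]
ses     = [0.10, 0.15, 0.20, 0.12]
weights = [1 / se**2 for se in ses]

mean_effect = sum(w * y for w, y in zip(weights, effects)) / sum(weights)

# Cochran's Q: weighted squared deviations of each study from the mean effect
q = sum(w * (y - mean_effect)**2 for w, y in zip(weights, effects))
df = len(effects) - 1

# I^2: percentage of total variation attributable to heterogeneity rather than chance
i_squared = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0
print(f"Q = {q:.2f} (df = {df}), I^2 = {i_squared:.1f}%")
```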

If properly conducted and clearly documented, meta-analyses often make significant contributions to a specific field of study and therefore stand a good chance of being published in a top-tier journal. The biggest obstacle for most researchers who attempt meta-analysis for the first time is the amount of work and organization required for proper execution, rather than their level of statistical knowledge.

Recommended Resources

Borenstein, M., Hedges, L. V., Higgins, J. P., & Rothstein, H. R. (2009). Introduction to meta-analysis. Hoboken, NJ: Wiley.

Cooper, H., Hedges, L., & Valentine, J. (2009). The handbook of research synthesis and meta-analysis (2nd ed.). New York, NY: Russell Sage Foundation.

Lipsey, M. W., & Wilson, D. B. (2001). Practical meta-analysis. Thousand Oaks, CA: Sage Publications.

Rothstein, H. R., Sutton, A. J., & Borenstein, M. (2005). Publication bias in meta-analysis: Prevention, assessment, and adjustments. Hoboken, NJ: Wiley.

It is nice to see the software we developed (MetaXL) being mentioned. However, the reason we developed the software and made it publicly available for free is that we disagree with an important statement in the review. This statement is “researchers must select either a fixed effects model (i.e., assumes all studies share one true effect size) or a random effects model (i.e., assumes the true effect size varies among studies)”. We developed MetaXL because we think that the random effects model is seriously flawed and should be abandoned. We implemented two additional models in MetaXL, the Inverse Variance heterogeneity model and the Quality Effects model, both meant to be used in the presence of heterogeneity. More details are in the User Guide, available from the Epigear website.

Thank you very much! The article really helped me to start understanding what meta-analysis is about.

Thank you for sharing this article; it is very helpful. But I am still confused about how to quickly remove duplicate papers without wasting time if we have more than 10,000 papers?

Not being one to blow my own horn all the time, but I would like to suggest that you may want to take a look at a web-based application I wrote that conducts a Hunter-Schmidt type meta-analysis. The application is very easy to use and corrects for sampling error and for error variance due to measurement unreliability. It also exports the results in Excel format. You can also export the dataset effect sizes (r, d, and z), sample sizes, and reliability information in Excel as well.

http://www.lyonsmorris.com/lyons/MaCalc/index.cfm

About the Author

Laura C. Wilson is an Assistant Professor in the Psychology Department at the University of Mary Washington. She earned a PhD in Clinical Psychology from Virginia Tech and MA in General/Experimental Psychology from The College of William & Mary. Her main area of expertise is post-trauma functioning, particularly in survivors of sexual violence or mass trauma (e.g., terrorism, mass shootings, combat). She also has interest in predictors of violence and aggression, including psychophysiological and personality factors.

Chapter 10: Analysing data and undertaking meta-analyses

Jonathan J Deeks, Julian PT Higgins, Douglas G Altman; on behalf of the Cochrane Statistical Methods Group

Key Points:

  • Meta-analysis is the statistical combination of results from two or more separate studies.
  • Potential advantages of meta-analyses include an improvement in precision, the ability to answer questions not posed by individual studies, and the opportunity to settle controversies arising from conflicting claims. However, they also have the potential to mislead seriously, particularly if specific study designs, within-study biases, variation across studies, and reporting biases are not carefully considered.
  • It is important to be familiar with the type of data (e.g. dichotomous, continuous) that result from measurement of an outcome in an individual study, and to choose suitable effect measures for comparing intervention groups.
  • Most meta-analysis methods are variations on a weighted average of the effect estimates from the different studies.
  • Studies with no events contribute no information about the risk ratio or odds ratio. For rare events, the Peto method has been observed to be less biased and more powerful than other methods.
  • Variation across studies (heterogeneity) must be considered, although most Cochrane Reviews do not have enough studies to allow for the reliable investigation of its causes. Random-effects meta-analyses allow for heterogeneity by assuming that underlying effects follow a normal distribution, but they must be interpreted carefully. Prediction intervals from random-effects meta-analyses are a useful device for presenting the extent of between-study variation.
  • Many judgements are required in the process of preparing a meta-analysis. Sensitivity analyses should be used to examine whether overall findings are robust to potentially influential decisions.

Cite this chapter as: Deeks JJ, Higgins JPT, Altman DG (editors). Chapter 10: Analysing data and undertaking meta-analyses. In: Higgins JPT, Thomas J, Chandler J, Cumpston M, Li T, Page MJ, Welch VA (editors). Cochrane Handbook for Systematic Reviews of Interventions version 6.4 (updated August 2023). Cochrane, 2023. Available from www.training.cochrane.org/handbook.

10.1 Do not start here!

It can be tempting to jump prematurely into a statistical analysis when undertaking a systematic review. The production of a diamond at the bottom of a plot is an exciting moment for many authors, but results of meta-analyses can be very misleading if suitable attention has not been given to formulating the review question; specifying eligibility criteria; identifying and selecting studies; collecting appropriate data; considering risk of bias; planning intervention comparisons; and deciding what data would be meaningful to analyse. Review authors should consult the chapters that precede this one before a meta-analysis is undertaken.

10.2 Introduction to meta-analysis

An important step in a systematic review is the thoughtful consideration of whether it is appropriate to combine the numerical results of all, or perhaps some, of the studies. Such a meta-analysis yields an overall statistic (together with its confidence interval) that summarizes the effectiveness of an experimental intervention compared with a comparator intervention. Potential advantages of meta-analyses include the following:

  • To improve precision. Many studies are too small to provide convincing evidence about intervention effects in isolation. Estimation is usually improved when it is based on more information.
  • To answer questions not posed by the individual studies . Primary studies often involve a specific type of participant and explicitly defined interventions. A selection of studies in which these characteristics differ can allow investigation of the consistency of effect across a wider range of populations and interventions. It may also, if relevant, allow reasons for differences in effect estimates to be investigated.
  • To settle controversies arising from apparently conflicting studies or to generate new hypotheses . Statistical synthesis of findings allows the degree of conflict to be formally assessed, and reasons for different results to be explored and quantified.

Of course, the use of statistical synthesis methods does not guarantee that the results of a review are valid, any more than it does for a primary study. Moreover, like any tool, statistical methods can be misused.

This chapter describes the principles and methods used to carry out a meta-analysis for a comparison of two interventions for the main types of data encountered. The use of network meta-analysis to compare more than two interventions is addressed in Chapter 11 . Formulae for most of the methods described are provided in the RevMan Web Knowledge Base under Statistical Algorithms and calculations used in Review Manager (documentation.cochrane.org/revman-kb/statistical-methods-210600101.html), and a longer discussion of many of the issues is available ( Deeks et al 2001 ).

10.2.1 Principles of meta-analysis

The commonly used methods for meta-analysis follow the following basic principles:

  • Meta-analysis is typically a two-stage process. In the first stage, a summary statistic is calculated for each study, to describe the observed intervention effect in the same way for every study. For example, the summary statistic may be a risk ratio if the data are dichotomous, or a difference between means if the data are continuous (see Chapter 6 ).

  • In the second stage, a summary (pooled) intervention effect estimate is calculated as a weighted average of the intervention effects estimated in the individual studies.

  • The combination of intervention effect estimates across studies may optionally incorporate an assumption that the studies are not all estimating the same intervention effect, but estimate intervention effects that follow a distribution across studies. This is the basis of a random-effects meta-analysis (see Section 10.10.4 ). Alternatively, if it is assumed that each study is estimating exactly the same quantity, then a fixed-effect meta-analysis is performed.
  • The standard error of the summary intervention effect can be used to derive a confidence interval, which communicates the precision (or uncertainty) of the summary estimate; and to derive a P value, which communicates the strength of the evidence against the null hypothesis of no intervention effect.
  • As well as yielding a summary quantification of the intervention effect, all methods of meta-analysis can incorporate an assessment of whether the variation among the results of the separate studies is compatible with random variation, or whether it is large enough to indicate inconsistency of intervention effects across studies (see Section 10.10 ).
  • The problem of missing data is one of the numerous practical considerations that must be thought through when undertaking a meta-analysis. In particular, review authors should consider the implications of missing outcome data from individual participants (due to losses to follow-up or exclusions from analysis) (see Section 10.12 ).

Meta-analyses are usually illustrated using a forest plot . An example appears in Figure 10.2.a . A forest plot displays effect estimates and confidence intervals for both individual studies and meta-analyses (Lewis and Clarke 2001). Each study is represented by a block at the point estimate of intervention effect with a horizontal line extending either side of the block. The area of the block indicates the weight assigned to that study in the meta-analysis while the horizontal line depicts the confidence interval (usually with a 95% level of confidence). The area of the block and the confidence interval convey similar information, but both make different contributions to the graphic. The confidence interval depicts the range of intervention effects compatible with the study’s result. The size of the block draws the eye towards the studies with larger weight (usually those with narrower confidence intervals), which dominate the calculation of the summary result, presented as a diamond at the bottom.

Figure 10.2.a Example of a forest plot from a review of interventions to promote ownership of smoke alarms (DiGuiseppi and Higgins 2001). Reproduced with permission of John Wiley & Sons

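To show how the elements described above map onto a plot, here is a rough matplotlib sketch of a forest plot built from invented effect estimates and standard errors; it is a simplified illustration, not a reproduction of the RevMan figure.

```python
import matplotlib.pyplot as plt

# Hypothetical effect estimates (e.g., log risk ratios) and standard errors
studies = ["Study A", "Study B", "Study C"]
effects = [0.42, 0.31, 0.55]
ses     = [0.10, 0.15, 0.20]
weights = [1 / se**2 for se in ses]          # inverse-variance weights

fig, ax = plt.subplots()
for i, (y, se, w) in enumerate(zip(effects, ses, weights)):
    ax.plot([y - 1.96 * se, y + 1.96 * se], [i, i], color="black")  # 95% CI line
    ax.scatter([y], [i], s=w * 3, marker="s", color="black")        # block area reflects weight
ax.axvline(0, linestyle="--", color="grey")                         # line of no effect
ax.set_yticks(range(len(studies)))
ax.set_yticklabels(studies)
ax.invert_yaxis()                                                   # first study at the top
ax.set_xlabel("Effect estimate")
plt.show()
```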

10.3 A generic inverse-variance approach to meta-analysis

A very common and simple version of the meta-analysis procedure is commonly referred to as the inverse-variance method . This approach is implemented in its most basic form in RevMan, and is used behind the scenes in many meta-analyses of both dichotomous and continuous data.

The inverse-variance method is so named because the weight given to each study is chosen to be the inverse of the variance of the effect estimate (i.e. 1 over the square of its standard error). Thus, larger studies, which have smaller standard errors, are given more weight than smaller studies, which have larger standard errors. This choice of weights minimizes the imprecision (uncertainty) of the pooled effect estimate.

10.3.1 Fixed-effect method for meta-analysis

A fixed-effect meta-analysis using the inverse-variance method calculates a weighted average as:

weighted average = Σ(Y_i / SE_i²) / Σ(1 / SE_i²)

where Y_i is the intervention effect estimated in the ith study, SE_i is the standard error of that estimate, and the summation is across all studies. The basic data required for the analysis are therefore an estimate of the intervention effect and its standard error from each study. A fixed-effect meta-analysis is valid under an assumption that all effect estimates are estimating the same underlying intervention effect, which is referred to variously as a ‘fixed-effect’ assumption, a ‘common-effect’ assumption or an ‘equal-effects’ assumption. However, the result of the meta-analysis can be interpreted without making such an assumption (Rice et al 2018).

10.3.2 Random-effects methods for meta-analysis

A variation on the inverse-variance method is to incorporate an assumption that the different studies are estimating different, yet related, intervention effects (Higgins et al 2009). This produces a random-effects meta-analysis, and the simplest version is known as the DerSimonian and Laird method (DerSimonian and Laird 1986). Random-effects meta-analysis is discussed in detail in Section 10.10.4 .
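
As a rough sketch of the DerSimonian and Laird calculation (not the RevMan implementation itself), the code below estimates the between-study variance τ² by the method of moments and re-weights the studies accordingly; the effect estimates and standard errors are invented.

```python
import math

# Hypothetical study effect estimates (e.g., log odds ratios) and standard errors
effects = [0.20, 0.55, -0.05, 0.35]
ses     = [0.15, 0.25, 0.20, 0.10]

w = [1 / se**2 for se in ses]                       # fixed-effect (inverse-variance) weights
fixed_mean = sum(wi * yi for wi, yi in zip(w, effects)) / sum(w)
q = sum(wi * (yi - fixed_mean)**2 for wi, yi in zip(w, effects))
df = len(effects) - 1

# DerSimonian-Laird method-of-moments estimate of the between-study variance tau^2
c = sum(w) - sum(wi**2 for wi in w) / sum(w)
tau2 = max(0.0, (q - df) / c)

# Random-effects weights add tau^2 to each study's within-study variance
w_star = [1 / (se**2 + tau2) for se in ses]
re_mean = sum(wi * yi for wi, yi in zip(w_star, effects)) / sum(w_star)
re_se = math.sqrt(1 / sum(w_star))
print(f"tau^2 = {tau2:.3f}, random-effects mean = {re_mean:.3f} (SE {re_se:.3f})")
```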

10.3.3 Performing inverse-variance meta-analyses

Most meta-analysis programs perform inverse-variance meta-analyses. Usually the user provides summary data from each intervention arm of each study, such as a 2×2 table when the outcome is dichotomous (see Chapter 6, Section 6.4 ), or means, standard deviations and sample sizes for each group when the outcome is continuous (see Chapter 6, Section 6.5 ). This avoids the need for the author to calculate effect estimates, and allows the use of methods targeted specifically at different types of data (see Sections 10.4 and 10.5 ).

When the data are conveniently available as summary statistics from each intervention group, the inverse-variance method can be implemented directly. For example, estimates and their standard errors may be entered directly into RevMan under the ‘Generic inverse variance’ outcome type. For ratio measures of intervention effect, the data must be entered into RevMan as natural logarithms (for example, as a log odds ratio and the standard error of the log odds ratio). However, it is straightforward to instruct the software to display results on the original (e.g. odds ratio) scale. It is possible to supplement or replace this with a column providing the sample sizes in the two groups. Note that the ability to enter estimates and standard errors creates a high degree of flexibility in meta-analysis. It facilitates the analysis of properly analysed crossover trials, cluster-randomized trials and non-randomized trials (see Chapter 23 ), as well as outcome data that are ordinal, time-to-event or rates (see Chapter 6 ).
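
To illustrate what entering ratio measures “as natural logarithms” involves, here is a small sketch that derives a log odds ratio and its standard error from a single hypothetical 2×2 table, in the form a generic inverse-variance analysis would expect; the counts are invented and the SE formula is the usual large-sample (Woolf) approximation.

```python
import math

# Hypothetical 2x2 table: events / no-events in experimental and comparator groups
a, b = 12, 88   # experimental: events, no events
c, d = 20, 80   # comparator:   events, no events

log_or = math.log((a * d) / (b * c))
se_log_or = math.sqrt(1/a + 1/b + 1/c + 1/d)   # Woolf standard error of the log odds ratio

print(f"log OR = {log_or:.3f}, SE = {se_log_or:.3f}")
print(f"OR = {math.exp(log_or):.2f}")           # back-transform for display on the OR scale
```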

10.4 Meta-analysis of dichotomous outcomes

There are four widely used methods of meta-analysis for dichotomous outcomes, three fixed-effect methods (Mantel-Haenszel, Peto and inverse variance) and one random-effects method (DerSimonian and Laird inverse variance). All of these methods are available as analysis options in RevMan. The Peto method can only combine odds ratios, whilst the other three methods can combine odds ratios, risk ratios or risk differences. Formulae for all of the meta-analysis methods are available elsewhere (Deeks et al 2001).

Note that having no events in one group (sometimes referred to as ‘zero cells’) causes problems with computation of estimates and standard errors with some methods: see Section 10.4.4 .

10.4.1 Mantel-Haenszel methods

When data are sparse, either in terms of event risks being low or study size being small, the estimates of the standard errors of the effect estimates that are used in the inverse-variance methods may be poor. Mantel-Haenszel methods are fixed-effect meta-analysis methods using a different weighting scheme that depends on which effect measure (e.g. risk ratio, odds ratio, risk difference) is being used (Mantel and Haenszel 1959, Greenland and Robins 1985). They have been shown to have better statistical properties when there are few events. As this is a common situation in Cochrane Reviews, the Mantel-Haenszel method is generally preferable to the inverse variance method in fixed-effect meta-analyses. In other situations the two methods give similar estimates.

10.4.2 Peto odds ratio method

Peto’s method can only be used to combine odds ratios (Yusuf et al 1985). It uses an inverse-variance approach, but uses an approximate method of estimating the log odds ratio, and uses different weights. An alternative way of viewing the Peto method is as a sum of ‘O – E’ statistics. Here, O is the observed number of events and E is an expected number of events in the experimental intervention group of each study under the null hypothesis of no intervention effect.

The approximation used in the computation of the log odds ratio works well when intervention effects are small (odds ratios are close to 1), events are not particularly common and the studies have similar numbers in experimental and comparator groups. In other situations it has been shown to give biased answers. As these criteria are not always fulfilled, Peto’s method is not recommended as a default approach for meta-analysis.

Corrections for zero cell counts are not necessary when using Peto’s method. Perhaps for this reason, this method performs well when events are very rare (Bradburn et al 2007); see Section 10.4.4.1 . Also, Peto’s method can be used to combine studies with dichotomous outcome data with studies using time-to-event analyses where log-rank tests have been used (see Section 10.9 ).
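
For readers curious about the ‘O – E’ formulation, the following sketch computes the Peto log odds ratio for one hypothetical 2×2 table; it shows only the single-study calculation, not the pooling across studies, and the counts are invented.

```python
import math

# Hypothetical study: events and group sizes
a, n1 = 6, 150    # experimental group: events, total
c, n2 = 9, 145    # comparator group:   events, total

N = n1 + n2
m = a + c                      # total events; O is the observed events in the experimental arm
E = n1 * m / N                 # expected events in the experimental arm under the null
V = (n1 * n2 * m * (N - m)) / (N**2 * (N - 1))   # hypergeometric variance

peto_log_or = (a - E) / V      # Peto approximation to the log odds ratio
se = 1 / math.sqrt(V)
print(f"Peto OR = {math.exp(peto_log_or):.2f} (log OR {peto_log_or:.3f}, SE {se:.3f})")
```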

10.4.3 Which effect measure for dichotomous outcomes?

Effect measures for dichotomous data are described in Chapter 6, Section 6.4.1 . The effect of an intervention can be expressed as either a relative or an absolute effect. The risk ratio (relative risk) and odds ratio are relative measures, while the risk difference and number needed to treat for an additional beneficial outcome are absolute measures. A further complication is that there are, in fact, two risk ratios. We can calculate the risk ratio of an event occurring or the risk ratio of no event occurring. These give different summary results in a meta-analysis, sometimes dramatically so.

The selection of a summary statistic for use in meta-analysis depends on balancing three criteria (Deeks 2002). First, we desire a summary statistic that gives values that are similar for all the studies in the meta-analysis and subdivisions of the population to which the interventions will be applied. The more consistent the summary statistic, the greater is the justification for expressing the intervention effect as a single summary number. Second, the summary statistic must have the mathematical properties required to perform a valid meta-analysis. Third, the summary statistic would ideally be easily understood and applied by those using the review. The summary intervention effect should be presented in a way that helps readers to interpret and apply the results appropriately. Among effect measures for dichotomous data, no single measure is uniformly best, so the choice inevitably involves a compromise.

Consistency: Empirical evidence suggests that relative effect measures are, on average, more consistent than absolute measures (Engels et al 2000, Deeks 2002, Rücker et al 2009). For this reason, it is wise to avoid performing meta-analyses of risk differences, unless there is a clear reason to suspect that risk differences will be consistent in a particular clinical situation. On average there is little difference between the odds ratio and risk ratio in terms of consistency (Deeks 2002). When the study aims to reduce the incidence of an adverse event, there is empirical evidence that risk ratios of the adverse event are more consistent than risk ratios of the non-event (Deeks 2002). Selecting an effect measure based on what is the most consistent in a particular situation is not a generally recommended strategy, since it may lead to a selection that spuriously maximizes the precision of a meta-analysis estimate.

Mathematical properties: The most important mathematical criterion is the availability of a reliable variance estimate. The number needed to treat for an additional beneficial outcome does not have a simple variance estimator and cannot easily be used directly in meta-analysis, although it can be computed from the meta-analysis result afterwards (see Chapter 15, Section 15.4.2). There is no consensus regarding the importance of two other often-cited mathematical properties: the fact that the behaviour of the odds ratio and the risk difference do not rely on which of the two outcome states is coded as the event, and the odds ratio being the only statistic which is unbounded (see Chapter 6, Section 6.4.1).

Ease of interpretation: The odds ratio is the hardest summary statistic to understand and to apply in practice, and many practising clinicians report difficulties in using it. There are many published examples where authors have misinterpreted odds ratios from meta-analyses as risk ratios. Although odds ratios can be re-expressed for interpretation (as discussed here), there must be some concern that routine presentation of the results of systematic reviews as odds ratios will lead to frequent over-estimation of the benefits and harms of interventions when the results are applied in clinical practice. Absolute measures of effect are thought to be more easily interpreted by clinicians than relative effects (Sinclair and Bracken 1994), and allow trade-offs to be made between likely benefits and likely harms of interventions. However, they are less likely to be generalizable.

It is generally recommended that meta-analyses are undertaken using risk ratios (taking care to make a sensible choice over which category of outcome is classified as the event) or odds ratios. This is because it seems important to avoid using summary statistics for which there is empirical evidence that they are unlikely to give consistent estimates of intervention effects (the risk difference), and it is impossible to use statistics for which meta-analysis cannot be performed (the number needed to treat for an additional beneficial outcome). It may be wise to plan to undertake a sensitivity analysis to investigate whether choice of summary statistic (and selection of the event category) is critical to the conclusions of the meta-analysis (see Section 10.14 ).

It is often sensible to use one statistic for meta-analysis and to re-express the results using a second, more easily interpretable statistic. For example, often meta-analysis may be best performed using relative effect measures (risk ratios or odds ratios) and the results re-expressed using absolute effect measures (risk differences or numbers needed to treat for an additional beneficial outcome – see Chapter 15, Section 15.4 . This is one of the key motivations for ‘Summary of findings’ tables in Cochrane Reviews: see Chapter 14 ). If odds ratios are used for meta-analysis they can also be re-expressed as risk ratios (see Chapter 15, Section 15.4 ). In all cases the same formulae can be used to convert upper and lower confidence limits. However, all of these transformations require specification of a value of baseline risk that indicates the likely risk of the outcome in the ‘control’ population to which the experimental intervention will be applied. Where the chosen value for this assumed comparator group risk is close to the typical observed comparator group risks across the studies, similar estimates of absolute effect will be obtained regardless of whether odds ratios or risk ratios are used for meta-analysis. Where the assumed comparator risk differs from the typical observed comparator group risk, the predictions of absolute benefit will differ according to which summary statistic was used for meta-analysis.

10.4.4 Meta-analysis of rare events

For rare outcomes, meta-analysis may be the only way to obtain reliable evidence of the effects of healthcare interventions. Individual studies are usually under-powered to detect differences in rare outcomes, but a meta-analysis of many studies may have adequate power to investigate whether interventions do have an impact on the incidence of the rare event. However, many methods of meta-analysis are based on large sample approximations, and are unsuitable when events are rare. Thus authors must take care when selecting a method of meta-analysis (Efthimiou 2018).

There is no single risk at which events are classified as ‘rare’. Certainly risks of 1 in 1000 constitute rare events, and many would classify risks of 1 in 100 the same way. However, the performance of methods when risks are as high as 1 in 10 may also be affected by the issues discussed in this section. What is typical is that a high proportion of the studies in the meta-analysis observe no events in one or more study arms.

10.4.4.1 Studies with no events in one or more arms

Computational problems can occur when no events are observed in one or both groups in an individual study. Inverse variance meta-analytical methods involve computing an intervention effect estimate and its standard error for each study. For studies where no events were observed in one or both arms, these computations often involve dividing by a zero count, which yields a computational error. Most meta-analytical software routines (including those in RevMan) automatically check for problematic zero counts, and add a fixed value (typically 0.5) to all cells of a 2×2 table where the problems occur. The Mantel-Haenszel methods require zero-cell corrections only if the same cell is zero in all the included studies, and hence need to use the correction less often. However, in many software applications the same correction rules are applied for Mantel-Haenszel methods as for the inverse-variance methods. Odds ratio and risk ratio methods require zero cell corrections more often than difference methods, except for the Peto odds ratio method, which encounters computation problems only in the extreme situation of no events occurring in all arms of all studies.
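
The sketch below mimics the kind of fixed 0.5 continuity correction described above for a hypothetical study with no events in one arm; the exact rules differ between software packages, so treat this as an illustration rather than a replication of RevMan’s behaviour.

```python
import math

def log_or_with_correction(a, b, c, d, correction=0.5):
    """Log odds ratio and SE, adding a fixed value to every cell if any cell is zero."""
    if 0 in (a, b, c, d):
        a, b, c, d = (x + correction for x in (a, b, c, d))
    log_or = math.log((a * d) / (b * c))
    se = math.sqrt(1/a + 1/b + 1/c + 1/d)
    return log_or, se

# Hypothetical sparse study: no events in the experimental arm
log_or, se = log_or_with_correction(a=0, b=120, c=4, d=116)
print(f"log OR = {log_or:.3f}, SE = {se:.3f}")
```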

Whilst the fixed correction meets the objective of avoiding computational errors, it usually has the undesirable effect of biasing study estimates towards no difference and over-estimating variances of study estimates (consequently down-weighting inappropriately their contribution to the meta-analysis). Where the sizes of the study arms are unequal (which occurs more commonly in non-randomized studies than randomized trials), they will introduce a directional bias in the treatment effect. Alternative non-fixed zero-cell corrections have been explored by Sweeting and colleagues, including a correction proportional to the reciprocal of the size of the contrasting study arm, which they found preferable to the fixed 0.5 correction when arm sizes were not balanced (Sweeting et al 2004).

10.4.4.2 Studies with no events in either arm

The standard practice in meta-analysis of odds ratios and risk ratios is to exclude studies from the meta-analysis where there are no events in both arms. This is because such studies do not provide any indication of either the direction or magnitude of the relative treatment effect. Whilst it may be clear that events are very rare on both the experimental intervention and the comparator intervention, no information is provided as to which group is likely to have the higher risk, or on whether the risks are of the same or different orders of magnitude (when risks are very low, they are compatible with very large or very small ratios). Whilst one might be tempted to infer that the risk would be lowest in the group with the larger sample size (as the upper limit of the confidence interval would be lower), this is not justified as the sample size allocation was determined by the study investigators and is not a measure of the incidence of the event.

Risk difference methods superficially appear to have an advantage over odds ratio methods in that the risk difference is defined (as zero) when no events occur in either arm. Such studies are therefore included in the estimation process. Bradburn and colleagues undertook simulation studies which revealed that all risk difference methods yield confidence intervals that are too wide when events are rare, and have associated poor statistical power, which make them unsuitable for meta-analysis of rare events (Bradburn et al 2007). This is especially relevant when outcomes that focus on treatment safety are being studied, as the ability to identify correctly (or attempt to refute) serious adverse events is a key issue in drug development.

It is likely that outcomes for which no events occur in either arm may not be mentioned in reports of many randomized trials, precluding their inclusion in a meta-analysis. It is unclear, though, when working with published results, whether failure to mention a particular adverse event means there were no such events, or simply that such events were not included as a measured endpoint. Whilst the results of risk difference meta-analyses will be affected by non-reporting of outcomes with no events, odds and risk ratio based methods naturally exclude these data whether or not they are published, and are therefore unaffected.

10.4.4.3 Validity of methods of meta-analysis for rare events

Simulation studies have revealed that many meta-analytical methods can give misleading results for rare events, which is unsurprising given their reliance on asymptotic statistical theory. Their performance has been judged suboptimal either through results being biased, confidence intervals being inappropriately wide, or statistical power being too low to detect substantial differences.

In the following we consider the choice of statistical method for meta-analyses of odds ratios. Appropriate choices appear to depend on the comparator group risk, the likely size of the treatment effect and consideration of balance in the numbers of experimental and comparator participants in the constituent studies. We are not aware of research that has evaluated risk ratio measures directly, but their performance is likely to be very similar to corresponding odds ratio measurements. When events are rare, estimates of odds and risks are near identical, and results of both can be interpreted as ratios of probabilities.

Bradburn and colleagues found that many of the most commonly used meta-analytical methods were biased when events were rare (Bradburn et al 2007). The bias was greatest in inverse variance and DerSimonian and Laird odds ratio and risk difference methods, and the Mantel-Haenszel odds ratio method using a 0.5 zero-cell correction. As already noted, risk difference meta-analytical methods tended to show conservative confidence interval coverage and low statistical power when risks of events were low.

At event rates below 1% the Peto one-step odds ratio method was found to be the least biased and most powerful method, and provided the best confidence interval coverage, provided there was no substantial imbalance between treatment and comparator group sizes within studies, and treatment effects were not exceptionally large. This finding was consistently observed across three different meta-analytical scenarios, and was also observed by Sweeting and colleagues (Sweeting et al 2004).

This finding was noted despite the method producing only an approximation to the odds ratio. For very large effects (e.g. risk ratio=0.2) when the approximation is known to be poor, treatment effects were under-estimated, but the Peto method still had the best performance of all the methods considered for event risks of 1 in 1000, and the bias was never more than 6% of the comparator group risk.

In other circumstances (i.e. event risks above 1%, very large effects at event risks around 1%, and meta-analyses where many studies were substantially imbalanced) the best performing methods were the Mantel-Haenszel odds ratio without zero-cell corrections, logistic regression and an exact method. None of these methods is available in RevMan.

Methods that should be avoided with rare events are the inverse-variance methods (including the DerSimonian and Laird random-effects method) (Efthimiou 2018). These directly incorporate the study’s variance in the estimation of its contribution to the meta-analysis, but these are usually based on a large-sample variance approximation, which was not intended for use with rare events. We would suggest that incorporation of heterogeneity into an estimate of a treatment effect should be a secondary consideration when attempting to produce estimates of effects from sparse data – the primary concern is to discern whether there is any signal of an effect in the data.

10.5 Meta-analysis of continuous outcomes

An important assumption underlying standard methods for meta-analysis of continuous data is that the outcomes have a normal distribution in each intervention arm in each study. This assumption may not always be met, although it is unimportant in very large studies. It is useful to consider the possibility of skewed data (see Section 10.5.3 ).

10.5.1 Which effect measure for continuous outcomes?

The two summary statistics commonly used for meta-analysis of continuous data are the mean difference (MD) and the standardized mean difference (SMD). Other options are available, such as the ratio of means (see Chapter 6, Section 6.5.1). Selection of summary statistics for continuous data is principally determined by whether studies all report the outcome using the same scale (when the mean difference can be used) or using different scales (when the standardized mean difference is usually used). The ratio of means can be used in either situation, but is appropriate only when outcome measurements are strictly greater than zero. Further considerations in deciding on an effect measure that will facilitate interpretation of the findings appear in Chapter 15, Section 15.5.

The different roles played in MD and SMD approaches by the standard deviations (SDs) of outcomes observed in the two groups should be understood.

For the mean difference approach, the SDs are used together with the sample sizes to compute the weight given to each study. Studies with small SDs are given relatively higher weight whilst studies with larger SDs are given relatively smaller weights. This is appropriate if variation in SDs between studies reflects differences in the reliability of outcome measurements, but is probably not appropriate if the differences in SD reflect real differences in the variability of outcomes in the study populations.

For the standardized mean difference approach, the SDs are used to standardize the mean differences to a single scale, as well as in the computation of study weights. Thus, studies with small SDs lead to relatively higher estimates of SMD, whilst studies with larger SDs lead to relatively smaller estimates of SMD. For this to be appropriate, it must be assumed that between-study variation in SDs reflects only differences in measurement scales and not differences in the reliability of outcome measures or variability among study populations, as discussed in Chapter 6, Section 6.5.1.2 .

These assumptions of the methods should be borne in mind when unexpected variation of SDs is observed across studies.

10.5.2 Meta-analysis of change scores

In some circumstances an analysis based on changes from baseline will be more efficient and powerful than comparison of post-intervention values, as it removes a component of between-person variability from the analysis. However, calculation of a change score requires measurement of the outcome twice and in practice may be less efficient for outcomes that are unstable or difficult to measure precisely, where the measurement error may be larger than true between-person baseline variability. Change-from-baseline outcomes may also be preferred if they have a less skewed distribution than post-intervention measurement outcomes. Although sometimes used as a device to ‘correct’ for unlucky randomization, this practice is not recommended.

The preferred statistical approach to accounting for baseline measurements of the outcome variable is to include the baseline outcome measurements as a covariate in a regression model or analysis of covariance (ANCOVA). These analyses produce an ‘adjusted’ estimate of the intervention effect together with its standard error. These analyses are the least frequently encountered, but as they give the most precise and least biased estimates of intervention effects they should be included in the analysis when they are available. However, they can only be included in a meta-analysis using the generic inverse-variance method, since means and SDs are not available for each intervention group separately.

In practice an author is likely to discover that the studies included in a review include a mixture of change-from-baseline and post-intervention value scores. However, mixing of outcomes is not a problem when it comes to meta-analysis of MDs. There is no statistical reason why studies with change-from-baseline outcomes should not be combined in a meta-analysis with studies with post-intervention measurement outcomes when using the (unstandardized) MD method. In a randomized study, MD based on changes from baseline can usually be assumed to be addressing exactly the same underlying intervention effects as analyses based on post-intervention measurements. That is to say, the difference in mean post-intervention values will on average be the same as the difference in mean change scores. If the use of change scores does increase precision, the studies presenting change scores will, appropriately, be given higher weights in the analysis than they would have received if post-intervention values had been used, as they will have smaller SDs.

When combining the data on the MD scale, authors must be careful to use the appropriate means and SDs (either of post-intervention measurements or of changes from baseline) for each study. Since the mean values and SDs for the two types of outcome may differ substantially, it may be advisable to place them in separate subgroups to avoid confusion for the reader, but the results of the subgroups can legitimately be pooled together.

In contrast, post-intervention value and change scores should not in principle be combined using standard meta-analysis approaches when the effect measure is an SMD. This is because the SDs used in the standardization reflect different things. The SD when standardizing post-intervention values reflects between-person variability at a single point in time. The SD when standardizing change scores reflects variation in between-person changes over time, so will depend on both within-person and between-person variability; within-person variability in turn is likely to depend on the length of time between measurements. Nevertheless, an empirical study of 21 meta-analyses in osteoarthritis did not find a difference between combined SMDs based on post-intervention values and combined SMDs based on change scores (da Costa et al 2013). One option is to standardize SMDs using post-intervention SDs rather than change score SDs. This would lead to valid synthesis of the two approaches, but we are not aware that an appropriate standard error for this has been derived.

A common practical problem associated with including change-from-baseline measures is that the SD of changes is not reported. Imputation of SDs is discussed in Chapter 6, Section 6.5.2.8 .

10.5.3 Meta-analysis of skewed data

Analyses based on means are appropriate for data that are at least approximately normally distributed, and for data from very large trials. If the true distribution of outcomes is asymmetrical, then the data are said to be skewed. Review authors should consider the possibility and implications of skewed data when analysing continuous outcomes (see MECIR Box 10.5.a ). Skew can sometimes be diagnosed from the means and SDs of the outcomes. A rough check is available, but it is only valid if a lowest or highest possible value for an outcome is known to exist. Thus, the check may be used for outcomes such as weight, volume and blood concentrations, which have lowest possible values of 0, or for scale outcomes with minimum or maximum scores, but it may not be appropriate for change-from-baseline measures. The check involves calculating the observed mean minus the lowest possible value (or the highest possible value minus the observed mean), and dividing this by the SD. A ratio less than 2 suggests skew (Altman and Bland 1996). If the ratio is less than 1, there is strong evidence of a skewed distribution.
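
The rough check described above can be written out in a few lines; the mean, SD, and minimum value below are invented, and the cut-offs of 2 and 1 follow the Altman and Bland (1996) rule described in the text.

```python
def skew_check(mean, sd, lowest_possible=0.0):
    """Ratio used as a rough screen for skew when a lowest possible value exists."""
    ratio = (mean - lowest_possible) / sd
    if ratio < 1:
        return ratio, "strong evidence of a skewed distribution"
    if ratio < 2:
        return ratio, "skew is suggested"
    return ratio, "no evidence of skew from this check"

# Hypothetical outcome: length of hospital stay in days (cannot be below 0)
print(skew_check(mean=5.1, sd=4.2, lowest_possible=0.0))
```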

Transformation of the original outcome data may reduce skew substantially. Reports of trials may present results on a transformed scale, usually a log scale. Collection of appropriate data summaries from the trialists, or acquisition of individual patient data, is currently the approach of choice. Appropriate data summaries and analysis strategies for the individual patient data will depend on the situation. Consultation with a knowledgeable statistician is advised.

Where data have been analysed on a log scale, results are commonly presented as geometric means and ratios of geometric means. A meta-analysis may be then performed on the scale of the log-transformed data; an example of the calculation of the required means and SD is given in Chapter 6, Section 6.5.2.4 . This approach depends on being able to obtain transformed data for all studies; methods for transforming from one scale to the other are available (Higgins et al 2008b). Log-transformed and untransformed data should not be mixed in a meta-analysis.

MECIR Box 10.5.a Relevant expectations for conduct of intervention reviews

10.6 Combining dichotomous and continuous outcomes

Occasionally authors encounter a situation where data for the same outcome are presented in some studies as dichotomous data and in other studies as continuous data. For example, scores on depression scales can be reported as means, or as the percentage of patients who were depressed at some point after an intervention (i.e. with a score above a specified cut-point). This type of information is often easier to understand, and more helpful, when it is dichotomized. However, deciding on a cut-point may be arbitrary, and information is lost when continuous data are transformed to dichotomous data.

There are several options for handling combinations of dichotomous and continuous data. Generally, it is useful to summarize results from all the relevant, valid studies in a similar way, but this is not always possible. It may be possible to collect missing data from investigators so that this can be done. If not, it may be useful to summarize the data in three ways: by entering the means and SDs as continuous outcomes, by entering the counts as dichotomous outcomes and by entering all of the data in text form as ‘Other data’ outcomes.

There are statistical approaches available that will re-express odds ratios as SMDs (and vice versa), allowing dichotomous and continuous data to be combined (Anzures-Cabrera et al 2011). A simple approach is as follows. Based on an assumption that the underlying continuous measurements in each intervention group follow a logistic distribution (which is a symmetrical distribution similar in shape to the normal distribution, but with more data in the distributional tails), and that the variability of the outcomes is the same in both experimental and comparator participants, the odds ratios can be re-expressed as a SMD according to the following simple formula (Chinn 2000):

SMD = (√3/π) × ln(odds ratio) = 0.5513 × ln(odds ratio)

The standard error of the log odds ratio can be converted to the standard error of a SMD by multiplying by the same constant (√3/π=0.5513). Alternatively SMDs can be re-expressed as log odds ratios by multiplying by π/√3=1.814. Once SMDs (or log odds ratios) and their standard errors have been computed for all studies in the meta-analysis, they can be combined using the generic inverse-variance method. Standard errors can be computed for all studies by entering the data as dichotomous and continuous outcome type data, as appropriate, and converting the confidence intervals for the resulting log odds ratios and SMDs into standard errors (see Chapter 6, Section 6.3 ).
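
Here is a small sketch of the Chinn (2000) conversion using invented numbers: it re-expresses a log odds ratio and its standard error on the SMD scale so that a dichotomized-outcome study could enter a generic inverse-variance analysis alongside continuous-outcome studies.

```python
import math

CONVERSION = math.sqrt(3) / math.pi   # approximately 0.5513

def log_or_to_smd(log_or, se_log_or):
    """Re-express a log odds ratio (and its SE) as a standardized mean difference."""
    return CONVERSION * log_or, CONVERSION * se_log_or

# Hypothetical dichotomized-outcome study
smd, se_smd = log_or_to_smd(log_or=-0.65, se_log_or=0.21)
print(f"SMD = {smd:.3f}, SE = {se_smd:.3f}")
```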

10.7 Meta-analysis of ordinal outcomes and measurement scales

Ordinal and measurement scale outcomes are most commonly meta-analysed as dichotomous data (if so, see Section 10.4 ) or continuous data (if so, see Section 10.5 ) depending on the way that the study authors performed the original analyses.

Occasionally it is possible to analyse the data using proportional odds models. This is the case when ordinal scales have a small number of categories, the numbers falling into each category for each intervention group can be obtained, and the same ordinal scale has been used in all studies. This approach may make more efficient use of all available data than dichotomization, but requires access to statistical software and results in a summary statistic for which it is challenging to find a clinical meaning.

The proportional odds model uses the proportional odds ratio as the measure of intervention effect (Agresti 1996) (see Chapter 6, Section 6.6 ), and can be used for conducting a meta-analysis in advanced statistical software packages (Whitehead and Jones 1994). Estimates of log odds ratios and their standard errors from a proportional odds model may be meta-analysed using the generic inverse-variance method (see Section 10.3.3 ). If the same ordinal scale has been used in all studies, but in some reports has been presented as a dichotomous outcome, it may still be possible to include all studies in the meta-analysis. In the context of the three-category model, this might mean that for some studies category 1 constitutes a success, while for others both categories 1 and 2 constitute a success. Methods are available for dealing with this, and for combining data from scales that are related but have different definitions for their categories (Whitehead and Jones 1994).

10.8 Meta-analysis of counts and rates

Results may be expressed as count data when each participant may experience an event, and may experience it more than once (see Chapter 6, Section 6.7 ). For example, ‘number of strokes’, or ‘number of hospital visits’ are counts. These events may not happen at all, but if they do happen there is no theoretical maximum number of occurrences for an individual. Count data may be analysed using methods for dichotomous data if the counts are dichotomized for each individual (see Section 10.4 ), continuous data (see Section 10.5 ) and time-to-event data (see Section 10.9 ), as well as being analysed as rate data.

Rate data occur if counts are measured for each participant along with the time over which they are observed. This is particularly appropriate when the events being counted are rare. For example, a woman may experience two strokes during a follow-up period of two years. Her rate of strokes is one per year of follow-up (or, equivalently 0.083 per month of follow-up). Rates are conventionally summarized at the group level. For example, participants in the comparator group of a clinical trial may experience 85 strokes during a total of 2836 person-years of follow-up. An underlying assumption associated with the use of rates is that the risk of an event is constant across participants and over time. This assumption should be carefully considered for each situation. For example, in contraception studies, rates have been used (known as Pearl indices) to describe the number of pregnancies per 100 women-years of follow-up. This is now considered inappropriate since couples have different risks of conception, and the risk for each woman changes over time. Pregnancies are now analysed more often using life tables or time-to-event methods that investigate the time elapsing before the first pregnancy.

Analysing count data as rates is not always the most appropriate approach and is uncommon in practice. This is because:

  • the assumption of a constant underlying risk may not be suitable; and
  • the statistical methods are not as well developed as they are for other types of data.

The results of a study may be expressed as a rate ratio , that is the ratio of the rate in the experimental intervention group to the rate in the comparator group. The (natural) logarithms of the rate ratios may be combined across studies using the generic inverse-variance method (see Section 10.3.3 ). Alternatively, Poisson regression approaches can be used (Spittal et al 2015).
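As an illustration, the Python sketch below (with hypothetical event counts and person-time) computes a log rate ratio and its approximate standard error using the usual large-sample formula; study-level values of this kind could then be pooled with the generic inverse-variance method. The numbers and function name are invented for the example.

```python
import numpy as np

def log_rate_ratio(events_exp, time_exp, events_comp, time_comp):
    """Log rate ratio and its approximate SE from event counts and person-time."""
    log_rr = np.log((events_exp / time_exp) / (events_comp / time_comp))
    se = np.sqrt(1 / events_exp + 1 / events_comp)  # standard large-sample approximation
    return log_rr, se

# Hypothetical study: 60 strokes over 2710 person-years vs 85 over 2836 person-years
log_rr, se = log_rate_ratio(60, 2710, 85, 2836)
print(f"Rate ratio: {np.exp(log_rr):.2f} "
      f"(95% CI {np.exp(log_rr - 1.96*se):.2f} to {np.exp(log_rr + 1.96*se):.2f})")
# Log rate ratios and SEs from several studies can then be pooled with the
# generic inverse-variance method (Section 10.3.3).
```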

In a randomized trial, rate ratios may often be very similar to risk ratios obtained after dichotomizing the participants, since the average period of follow-up should be similar in all intervention groups. Rate ratios and risk ratios will differ, however, if an intervention affects the likelihood of some participants experiencing multiple events.

It is possible also to focus attention on the rate difference (see Chapter 6, Section 6.7.1 ). The analysis again can be performed using the generic inverse-variance method (Hasselblad and McCrory 1995, Guevara et al 2004).

10.9 Meta-analysis of time-to-event outcomes

Two approaches to meta-analysis of time-to-event outcomes are readily available to Cochrane Review authors. The choice of which to use will depend on the type of data that have been extracted from the primary studies, or obtained from re-analysis of individual participant data.

If ‘O – E’ and ‘V’ statistics have been obtained (see Chapter 6, Section 6.8.2 ), either through re-analysis of individual participant data or from aggregate statistics presented in the study reports, then these statistics may be entered directly into RevMan using the ‘O – E and Variance’ outcome type. There are several ways to calculate these ‘O – E’ and ‘V’ statistics. Peto’s method applied to dichotomous data (Section 10.4.2 ) gives rise to an odds ratio; a log-rank approach gives rise to a hazard ratio; and a variation of the Peto method for analysing time-to-event data gives rise to something in between (Simmonds et al 2011). The appropriate effect measure should be specified. Only fixed-effect meta-analysis methods are available in RevMan for ‘O – E and Variance’ outcomes.

Alternatively, if estimates of log hazard ratios and standard errors have been obtained from results of Cox proportional hazards regression models, study results can be combined using generic inverse-variance methods (see Section 10.3.3 ).

If a mixture of log-rank and Cox model estimates are obtained from the studies, all results can be combined using the generic inverse-variance method, as the log-rank estimates can be converted into log hazard ratios and standard errors using the approaches discussed in Chapter 6, Section 6.8 .
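For illustration, the following Python sketch uses the commonly applied approximation that the log hazard ratio is roughly (O – E)/V with variance 1/V; the numbers are hypothetical, and Chapter 6, Section 6.8 should be consulted for the exact conversions appropriate to the data at hand.

```python
import numpy as np

def log_hr_from_o_minus_e(o_minus_e, v):
    """Approximate log hazard ratio and SE from log-rank 'O - E' and 'V' statistics."""
    log_hr = o_minus_e / v      # Peto-type approximation
    se = 1 / np.sqrt(v)
    return log_hr, se

# Hypothetical study: O - E = -4.2, V = 12.5
log_hr, se = log_hr_from_o_minus_e(-4.2, 12.5)
print(f"HR ≈ {np.exp(log_hr):.2f}, SE(log HR) ≈ {se:.3f}")
# These values can be combined with log hazard ratios from Cox models using
# the generic inverse-variance method.
```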

10.10 Heterogeneity

10.10.1 What is heterogeneity?

Inevitably, studies brought together in a systematic review will differ. Any kind of variability among studies in a systematic review may be termed heterogeneity. It can be helpful to distinguish between different types of heterogeneity. Variability in the participants, interventions and outcomes studied may be described as clinical diversity (sometimes called clinical heterogeneity), and variability in study design, outcome measurement tools and risk of bias may be described as methodological diversity (sometimes called methodological heterogeneity). Variability in the intervention effects being evaluated in the different studies is known as statistical heterogeneity , and is a consequence of clinical or methodological diversity, or both, among the studies. Statistical heterogeneity manifests itself in the observed intervention effects being more different from each other than one would expect due to random error (chance) alone. We will follow convention and refer to statistical heterogeneity simply as heterogeneity .

Clinical variation will lead to heterogeneity if the intervention effect is affected by the factors that vary across studies; most obviously, the specific interventions or patient characteristics. In other words, the true intervention effect will be different in different studies.

Differences between studies in terms of methodological factors, such as use of blinding and concealment of allocation sequence, or if there are differences between studies in the way the outcomes are defined and measured, may be expected to lead to differences in the observed intervention effects. Significant statistical heterogeneity arising from methodological diversity or differences in outcome assessments suggests that the studies are not all estimating the same quantity, but does not necessarily suggest that the true intervention effect varies. In particular, heterogeneity associated solely with methodological diversity would indicate that the studies suffer from different degrees of bias. Empirical evidence suggests that some aspects of design can affect the result of clinical trials, although this is not always the case. Further discussion appears in Chapter 7 and Chapter 8 .

The scope of a review will largely determine the extent to which studies included in a review are diverse. Sometimes a review will include studies addressing a variety of questions, for example when several different interventions for the same condition are of interest (see also Chapter 11 ) or when the differential effects of an intervention in different populations are of interest. Meta-analysis should only be considered when a group of studies is sufficiently homogeneous in terms of participants, interventions and outcomes to provide a meaningful summary (see MECIR Box 10.10.a. ). It is often appropriate to take a broader perspective in a meta-analysis than in a single clinical trial. A common analogy is that systematic reviews bring together apples and oranges, and that combining these can yield a meaningless result. This is true if apples and oranges are of intrinsic interest on their own, but may not be if they are used to contribute to a wider question about fruit. For example, a meta-analysis may reasonably evaluate the average effect of a class of drugs by combining results from trials where each evaluates the effect of a different drug from the class.

MECIR Box 10.10.a Relevant expectations for conduct of intervention reviews

There may be specific interest in a review in investigating how clinical and methodological aspects of studies relate to their results. Where possible these investigations should be specified a priori (i.e. in the protocol for the systematic review). It is legitimate for a systematic review to focus on examining the relationship between some clinical characteristic(s) of the studies and the size of intervention effect, rather than on obtaining a summary effect estimate across a series of studies (see Section 10.11 ). Meta-regression may best be used for this purpose, although it is not implemented in RevMan (see Section 10.11.4 ).

10.10.2 Identifying and measuring heterogeneity

It is essential to consider the extent to which the results of studies are consistent with each other (see MECIR Box 10.10.b ). If confidence intervals for the results of individual studies (generally depicted graphically using horizontal lines) have poor overlap, this generally indicates the presence of statistical heterogeneity. More formally, a statistical test for heterogeneity is available. This Chi² (χ², or chi-squared) test is included in the forest plots in Cochrane Reviews. It assesses whether observed differences in results are compatible with chance alone. A low P value (or a large Chi² statistic relative to its degrees of freedom) provides evidence of heterogeneity of intervention effects (variation in effect estimates beyond chance).

MECIR Box 10.10.b Relevant expectations for conduct of intervention reviews

Care must be taken in the interpretation of the Chi² test, since it has low power in the (common) situation of a meta-analysis when studies have small sample size or are few in number. This means that while a statistically significant result may indicate a problem with heterogeneity, a non-significant result must not be taken as evidence of no heterogeneity. This is also why a P value of 0.10, rather than the conventional level of 0.05, is sometimes used to determine statistical significance. A further problem with the test, which seldom occurs in Cochrane Reviews, is that when there are many studies in a meta-analysis, the test has high power to detect a small amount of heterogeneity that may be clinically unimportant.

Some argue that, since clinical and methodological diversity always occur in a meta-analysis, statistical heterogeneity is inevitable (Higgins et al 2003). Thus, the test for heterogeneity is irrelevant to the choice of analysis; heterogeneity will always exist whether or not we happen to be able to detect it using a statistical test. Methods have been developed for quantifying inconsistency across studies that move the focus away from testing whether heterogeneity is present to assessing its impact on the meta-analysis. A useful statistic for quantifying inconsistency is:

I² = ((Q − df) / Q) × 100%

In this equation, Q is the Chi² statistic and df is its degrees of freedom (Higgins and Thompson 2002, Higgins et al 2003). I² describes the percentage of the variability in effect estimates that is due to heterogeneity rather than sampling error (chance).
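A minimal Python sketch of this calculation, using invented effect estimates and standard errors, is shown below; it computes Cochran’s Q, its P value and the corresponding I² value.

```python
import numpy as np
from scipy import stats

def heterogeneity(estimates, ses):
    """Cochran's Q, its P value, and I² for a fixed-effect (inverse-variance) meta-analysis."""
    y, w = np.asarray(estimates), 1 / np.asarray(ses) ** 2
    pooled = np.sum(w * y) / np.sum(w)
    q = np.sum(w * (y - pooled) ** 2)
    df = len(y) - 1
    p = stats.chi2.sf(q, df)
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0
    return q, df, p, i2

# Hypothetical log risk ratios and standard errors
q, df, p, i2 = heterogeneity([-0.22, -0.05, -0.41, 0.10], [0.12, 0.15, 0.18, 0.20])
print(f"Q = {q:.2f} on {df} df (P = {p:.3f}); I² = {i2:.0f}%")
```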

Thresholds for the interpretation of the I² statistic can be misleading, since the importance of inconsistency depends on several factors. A rough guide to interpretation in the context of meta-analyses of randomized trials is as follows:

  • 0% to 40%: might not be important;
  • 30% to 60%: may represent moderate heterogeneity*;
  • 50% to 90%: may represent substantial heterogeneity*;
  • 75% to 100%: considerable heterogeneity*.

*The importance of the observed value of I² depends on (1) magnitude and direction of effects, and (2) strength of evidence for heterogeneity (e.g. P value from the Chi² test, or a confidence interval for I²: uncertainty in the value of I² is substantial when the number of studies is small).

10.10.3 Strategies for addressing heterogeneity

Review authors must take into account any statistical heterogeneity when interpreting results, particularly when there is variation in the direction of effect (see MECIR Box 10.10.c ). A number of options are available if heterogeneity is identified among a group of studies that would otherwise be considered suitable for a meta-analysis.

MECIR Box 10.10.c  Relevant expectations for conduct of intervention reviews

  • Check again that the data are correct. Severe apparent heterogeneity can indicate that data have been incorrectly extracted or entered into meta-analysis software. For example, if standard errors have mistakenly been entered as SDs for continuous outcomes, this could manifest itself in overly narrow confidence intervals with poor overlap and hence substantial heterogeneity. Unit-of-analysis errors may also be causes of heterogeneity (see Chapter 6, Section 6.2 ).  
  • Do not do a meta -analysis. A systematic review need not contain any meta-analyses. If there is considerable variation in results, and particularly if there is inconsistency in the direction of effect, it may be misleading to quote an average value for the intervention effect.  
  • Explore heterogeneity. It is clearly of interest to determine the causes of heterogeneity among results of studies. This process is problematic since there are often many characteristics that vary across studies from which one may choose. Heterogeneity may be explored by conducting subgroup analyses (see Section 10.11.3 ) or meta-regression (see Section 10.11.4 ). Reliable conclusions can only be drawn from analyses that are truly pre-specified before inspecting the studies’ results, and even these conclusions should be interpreted with caution. Explorations of heterogeneity that are devised after heterogeneity is identified can at best lead to the generation of hypotheses. They should be interpreted with even more caution and should generally not be listed among the conclusions of a review. Also, investigations of heterogeneity when there are very few studies are of questionable value.  
  • Ignore heterogeneity. Fixed-effect meta-analyses ignore heterogeneity. The summary effect estimate from a fixed-effect meta-analysis is normally interpreted as being the best estimate of the intervention effect. However, the existence of heterogeneity suggests that there may not be a single intervention effect but a variety of intervention effects. Thus, the summary fixed-effect estimate may be an intervention effect that does not actually exist in any population, and therefore have a confidence interval that is meaningless as well as being too narrow (see Section 10.10.4 ).  
  • Perform a random-effects meta-analysis. A random-effects meta-analysis may be used to incorporate heterogeneity among studies. This is not a substitute for a thorough investigation of heterogeneity. It is intended primarily for heterogeneity that cannot be explained. An extended discussion of this option appears in Section 10.10.4 .  
  • Reconsider the effect measure. Heterogeneity may be an artificial consequence of an inappropriate choice of effect measure. For example, when studies collect continuous outcome data using different scales or different units, extreme heterogeneity may be apparent when using the mean difference but not when the more appropriate standardized mean difference is used. Furthermore, choice of effect measure for dichotomous outcomes (odds ratio, risk ratio, or risk difference) may affect the degree of heterogeneity among results. In particular, when comparator group risks vary, homogeneous odds ratios or risk ratios will necessarily lead to heterogeneous risk differences, and vice versa. However, it remains unclear whether homogeneity of intervention effect in a particular meta-analysis is a suitable criterion for choosing between these measures (see also Section 10.4.3 ).  
  • Exclude studies. Heterogeneity may be due to the presence of one or two outlying studies with results that conflict with the rest of the studies. In general it is unwise to exclude studies from a meta-analysis on the basis of their results as this may introduce bias. However, if an obvious reason for the outlying result is apparent, the study might be removed with more confidence. Since usually at least one characteristic can be found for any study in any meta-analysis which makes it different from the others, this criterion is unreliable because it is all too easy to fulfil. It is advisable to perform analyses both with and without outlying studies as part of a sensitivity analysis (see Section 10.14 ). Whenever possible, potential sources of clinical diversity that might lead to such situations should be specified in the protocol.

10.10.4 Incorporating heterogeneity into random-effects models

The random-effects meta-analysis approach incorporates an assumption that the different studies are estimating different, yet related, intervention effects (DerSimonian and Laird 1986, Borenstein et al 2010). The approach allows us to address heterogeneity that cannot readily be explained by other factors. A random-effects meta-analysis model involves an assumption that the effects being estimated in the different studies follow some distribution. The model represents our lack of knowledge about why real, or apparent, intervention effects differ, by considering the differences as if they were random. The centre of the assumed distribution describes the average of the effects, while its width describes the degree of heterogeneity. The conventional choice of distribution is a normal distribution. It is difficult to establish the validity of any particular distributional assumption, and this is a common criticism of random-effects meta-analyses. The importance of the assumed shape for this distribution has not been widely studied.

To undertake a random-effects meta-analysis, the standard errors of the study-specific estimates (SE i in Section 10.3.1 ) are adjusted to incorporate a measure of the extent of variation, or heterogeneity, among the intervention effects observed in different studies (this variation is often referred to as Tau-squared, τ², or Tau²). The amount of variation, and hence the adjustment, can be estimated from the intervention effects and standard errors of the studies included in the meta-analysis.
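The following Python sketch (with invented data) illustrates one such adjustment: a moment-based (DerSimonian-Laird) estimate of Tau² is added to each study’s variance before the inverse-variance weights are formed. It is a simplified illustration of the approach discussed further in Section 10.10.4.4, not a reproduction of any particular software implementation.

```python
import numpy as np

def dersimonian_laird(estimates, ses):
    """Random-effects meta-analysis with a moment-based (DerSimonian-Laird) Tau² estimate."""
    y, v = np.asarray(estimates), np.asarray(ses) ** 2
    w = 1 / v
    fixed = np.sum(w * y) / np.sum(w)
    q = np.sum(w * (y - fixed) ** 2)              # Cochran's Q
    c = np.sum(w) - np.sum(w ** 2) / np.sum(w)
    tau2 = max(0.0, (q - (len(y) - 1)) / c)       # truncated at zero
    w_star = 1 / (v + tau2)                       # adjusted (random-effects) weights
    mu = np.sum(w_star * y) / np.sum(w_star)
    se_mu = np.sqrt(1 / np.sum(w_star))
    return mu, se_mu, tau2

# Hypothetical effect estimates and standard errors
mu, se_mu, tau2 = dersimonian_laird([-0.22, -0.05, -0.41, 0.10], [0.12, 0.15, 0.18, 0.20])
print(f"Random-effects mean = {mu:.3f} (SE {se_mu:.3f}), Tau² = {tau2:.3f}")
```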

In a heterogeneous set of studies, a random-effects meta-analysis will award relatively more weight to smaller studies than such studies would receive in a fixed-effect meta-analysis. This is because small studies are more informative for learning about the distribution of effects across studies than for learning about an assumed common intervention effect.

Note that a random-effects model does not ‘take account’ of the heterogeneity, in the sense that it is no longer an issue. It is always preferable to explore possible causes of heterogeneity, although there may be too few studies to do this adequately (see Section 10.11 ).

10.10.4.1 Fixed or random effects?

A fixed-effect meta-analysis provides a result that may be viewed as a ‘typical intervention effect’ from the studies included in the analysis. In order to calculate a confidence interval for a fixed-effect meta-analysis the assumption is usually made that the true effect of intervention (in both magnitude and direction) is the same value in every study (i.e. fixed across studies). This assumption implies that the observed differences among study results are due solely to the play of chance (i.e. that there is no statistical heterogeneity).

A random-effects model provides a result that may be viewed as an ‘average intervention effect’, where this average is explicitly defined according to an assumed distribution of effects across studies. Instead of assuming that the intervention effects are the same, we assume that they follow (usually) a normal distribution. The assumption implies that the observed differences among study results are due to a combination of the play of chance and some genuine variation in the intervention effects.

The random-effects method and the fixed-effect method will give identical results when there is no heterogeneity among the studies.

When heterogeneity is present, a confidence interval around the random-effects summary estimate is wider than a confidence interval around a fixed-effect summary estimate. This will happen whenever the I² statistic is greater than zero, even if the heterogeneity is not detected by the Chi² test for heterogeneity (see Section 10.10.2 ).

Sometimes the central estimate of the intervention effect is different between fixed-effect and random-effects analyses. In particular, if results of smaller studies are systematically different from results of larger ones, which can happen as a result of publication bias or within-study bias in smaller studies (Egger et al 1997, Poole and Greenland 1999, Kjaergard et al 2001), then a random-effects meta-analysis will exacerbate the effects of the bias (see also Chapter 13, Section 13.3.5.6 ). A fixed-effect analysis will be affected less, although strictly it will also be inappropriate.

The decision between fixed- and random-effects meta-analyses has been the subject of much debate, and we do not provide a universal recommendation. Some considerations in making this choice are as follows:

  • Many have argued that the decision should be based on an expectation of whether the intervention effects are truly identical, preferring the fixed-effect model if this is likely and a random-effects model if this is unlikely (Borenstein et al 2010). Since it is generally considered to be implausible that intervention effects across studies are identical (unless the intervention has no effect at all), this leads many to advocate use of the random-effects model.
  • Others have argued that a fixed-effect analysis can be interpreted in the presence of heterogeneity, and that it makes fewer assumptions than a random-effects meta-analysis. They then refer to it as a ‘fixed-effects’ meta-analysis (Peto et al 1995, Rice et al 2018).
  • Under any interpretation, a fixed-effect meta-analysis ignores heterogeneity. If the method is used, it is therefore important to supplement it with a statistical investigation of the extent of heterogeneity (see Section 10.10.2 ).
  • In the presence of heterogeneity, a random-effects analysis gives relatively more weight to smaller studies and relatively less weight to larger studies. If there is additionally some funnel plot asymmetry (i.e. a relationship between intervention effect magnitude and study size), then this will push the results of the random-effects analysis towards the findings in the smaller studies. In the context of randomized trials, this is generally regarded as an unfortunate consequence of the model.
  • A pragmatic approach is to plan to undertake both a fixed-effect and a random-effects meta-analysis, with an intention to present the random-effects result if there is no indication of funnel plot asymmetry. If there is an indication of funnel plot asymmetry, then both methods are problematic. It may be reasonable to present both analyses or neither, or to perform a sensitivity analysis in which small studies are excluded or addressed directly using meta-regression (see Chapter 13, Section 13.3.5.6 ).
  • The choice between a fixed-effect and a random-effects meta-analysis should never be made on the basis of a statistical test for heterogeneity.

10.10.4.2 Interpretation of random-effects meta-analyses

The summary estimate and confidence interval from a random-effects meta-analysis refer to the centre of the distribution of intervention effects, but do not describe the width of the distribution. Often the summary estimate and its confidence interval are quoted in isolation and portrayed as a sufficient summary of the meta-analysis. This is inappropriate. The confidence interval from a random-effects meta-analysis describes uncertainty in the location of the mean of systematically different effects in the different studies. It does not describe the degree of heterogeneity among studies, as may be commonly believed. For example, when there are many studies in a meta-analysis, we may obtain a very tight confidence interval around the random-effects estimate of the mean effect even when there is a large amount of heterogeneity. A solution to this problem is to consider a prediction interval (see Section 10.10.4.3 ).

Methodological diversity creates heterogeneity through biases variably affecting the results of different studies. The random-effects summary estimate will only correctly estimate the average intervention effect if the biases are symmetrically distributed, leading to a mixture of over-estimates and under-estimates of effect, which is unlikely to be the case. In practice it can be very difficult to distinguish whether heterogeneity results from clinical or methodological diversity, and in most cases it is likely to be due to both, so these distinctions are hard to draw in the interpretation.

When there is little information, either because there are few studies or if the studies are small with few events, a random-effects analysis will provide poor estimates of the amount of heterogeneity (i.e. of the width of the distribution of intervention effects). Fixed-effect methods such as the Mantel-Haenszel method will provide more robust estimates of the average intervention effect, but at the cost of ignoring any heterogeneity.

10.10.4.3 Prediction intervals from a random-effects meta-analysis

An estimate of the between-study variance in a random-effects meta-analysis is typically presented as part of its results. The square root of this number (i.e. Tau) is the estimated standard deviation of underlying effects across studies. Prediction intervals are a way of expressing this value in an interpretable way.

To motivate the idea of a prediction interval, note that for absolute measures of effect (e.g. risk difference, mean difference, standardized mean difference), an approximate 95% range of normally distributed underlying effects can be obtained by creating an interval from 1.96×Tau below the random-effects mean, to 1.96×Tau above it. (For relative measures such as the odds ratio and risk ratio, an equivalent interval needs to be based on the natural logarithm of the summary estimate.) In reality, both the summary estimate and the value of Tau are associated with uncertainty. A prediction interval seeks to present the range of effects in a way that acknowledges this uncertainty (Higgins et al 2009). A simple 95% prediction interval can be calculated as:

M ± t(k−2) × √(Tau² + SE(M)²)

where M is the summary mean from the random-effects meta-analysis, t(k−2) is the 95% percentile of a t-distribution with k−2 degrees of freedom, k is the number of studies, Tau² is the estimated amount of heterogeneity and SE(M) is the standard error of the summary mean.
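A minimal Python sketch of this calculation is given below; the summary mean, its standard error, Tau² and the number of studies are invented for illustration, and the two-sided 95% critical value is taken as the 0.975 quantile of the t-distribution with k − 2 degrees of freedom.

```python
import numpy as np
from scipy import stats

def prediction_interval(mean, se_mean, tau2, k, level=0.95):
    """Approximate prediction interval for the underlying effect in a new, similar study."""
    t_crit = stats.t.ppf(1 - (1 - level) / 2, df=k - 2)  # two-sided quantile, k-2 df
    half_width = t_crit * np.sqrt(tau2 + se_mean ** 2)
    return mean - half_width, mean + half_width

# Hypothetical random-effects results: mean log risk ratio, its SE, Tau², 12 studies
low, high = prediction_interval(mean=-0.15, se_mean=0.07, tau2=0.04, k=12)
print(f"95% prediction interval for the log risk ratio: {low:.2f} to {high:.2f}")
```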

The term ‘prediction interval’ relates to the use of this interval to predict the possible underlying effect in a new study that is similar to the studies in the meta-analysis. A more useful interpretation of the interval is as a summary of the spread of underlying effects in the studies included in the random-effects meta-analysis.

Prediction intervals have proved a popular way of expressing the amount of heterogeneity in a meta-analysis (Riley et al 2011). They are, however, strongly based on the assumption of a normal distribution for the effects across studies, and can be very problematic when the number of studies is small, in which case they can appear spuriously wide or spuriously narrow. Nevertheless, we encourage their use when the number of studies is reasonable (e.g. more than ten) and there is no clear funnel plot asymmetry.

10.10.4.4 Implementing random-effects meta-analyses

As introduced in Section 10.3.2 , the random-effects model can be implemented using an inverse-variance approach, incorporating a measure of the extent of heterogeneity into the study weights. RevMan implements a version of random-effects meta-analysis that is described by DerSimonian and Laird, making use of a ‘moment-based’ estimate of the between-study variance (DerSimonian and Laird 1986). The attraction of this method is that the calculations are straightforward, but it has a theoretical disadvantage in that the confidence intervals are slightly too narrow to encompass full uncertainty resulting from having estimated the degree of heterogeneity.

For many years, RevMan has implemented two random-effects methods for dichotomous data: a Mantel-Haenszel method and an inverse-variance method. Both use the moment-based approach to estimating the amount of between-studies variation. The difference between the two is subtle: the former estimates the between-study variation by comparing each study’s result with a Mantel-Haenszel fixed-effect meta-analysis result, whereas the latter estimates it by comparing each study’s result with an inverse-variance fixed-effect meta-analysis result. In practice, the difference is likely to be trivial.

There are alternative methods for performing random-effects meta-analyses that have better technical properties than the DerSimonian and Laird approach with a moment-based estimate (Veroniki et al 2016). Most notable among these is an adjustment to the confidence interval proposed by Hartung and Knapp and by Sidik and Jonkman (Hartung and Knapp 2001, Sidik and Jonkman 2002). This adjustment widens the confidence interval to reflect uncertainty in the estimation of between-study heterogeneity, and it should be used if available to review authors. An alternative option to encompass full uncertainty in the degree of heterogeneity is to take a Bayesian approach (see Section 10.13 ).
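The sketch below illustrates, in Python and with invented data, the basic form of the Hartung-Knapp / Sidik-Jonkman adjustment: the variance of the random-effects mean is rescaled by the weighted dispersion of the study results and a t-distribution with k − 1 degrees of freedom is used for the confidence interval. Published implementations differ in details (for example, some prevent the adjusted interval from being narrower than the unadjusted one), so this should be read as a simplified illustration rather than a definitive implementation.

```python
import numpy as np
from scipy import stats

def hartung_knapp_ci(estimates, ses, tau2, level=0.95):
    """Random-effects mean with a Hartung-Knapp / Sidik-Jonkman style confidence interval."""
    y, v = np.asarray(estimates), np.asarray(ses) ** 2
    w = 1 / (v + tau2)                       # random-effects weights for a given Tau²
    mu = np.sum(w * y) / np.sum(w)
    k = len(y)
    # Variance of the mean rescaled by the weighted residual dispersion
    var_hk = np.sum(w * (y - mu) ** 2) / ((k - 1) * np.sum(w))
    t_crit = stats.t.ppf(1 - (1 - level) / 2, df=k - 1)
    half = t_crit * np.sqrt(var_hk)
    return mu, mu - half, mu + half

# Hypothetical data; Tau² would normally come from one of the estimators discussed above
mu, lo, hi = hartung_knapp_ci([-0.22, -0.05, -0.41, 0.10],
                              [0.12, 0.15, 0.18, 0.20], tau2=0.03)
print(f"Mean {mu:.3f}, 95% CI {lo:.3f} to {hi:.3f}")
```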

An empirical comparison of different ways to estimate between-study variation in Cochrane meta-analyses has shown that they can lead to substantial differences in estimates of heterogeneity, but seldom have major implications for estimating summary effects (Langan et al 2015). Several simulation studies have concluded that an approach proposed by Paule and Mandel should be recommended (Langan et al 2017); whereas a comprehensive recent simulation study recommended a restricted maximum likelihood approach, although noted that no single approach is universally preferable (Langan et al 2019). Review authors are encouraged to select one of these options if it is available to them.

10.11 Investigating heterogeneity

10.11.1 Interaction and effect modification

Does the intervention effect vary with different populations or intervention characteristics (such as dose or duration)? Such variation is known as interaction by statisticians and as effect modification by epidemiologists. Methods to search for such interactions include subgroup analyses and meta-regression. All methods have considerable pitfalls.

10.11.2 What are subgroup analyses?

Subgroup analyses involve splitting all the participant data into subgroups, often in order to make comparisons between them. Subgroup analyses may be done for subsets of participants (such as males and females), or for subsets of studies (such as different geographical locations). Subgroup analyses may be done as a means of investigating heterogeneous results, or to answer specific questions about particular patient groups, types of intervention or types of study.

Subgroup analyses of subsets of participants within studies are uncommon in systematic reviews based on published literature because sufficient details to extract data about separate participant types are seldom published in reports. By contrast, such subsets of participants are easily analysed when individual participant data have been collected (see Chapter 26 ). The methods we describe in the remainder of this chapter are for subgroups of studies.

Findings from multiple subgroup analyses may be misleading. Subgroup analyses are observational by nature and are not based on randomized comparisons. False negative and false positive significance tests increase in likelihood rapidly as more subgroup analyses are performed. If their findings are presented as definitive conclusions there is clearly a risk of people being denied an effective intervention or treated with an ineffective (or even harmful) intervention. Subgroup analyses can also generate misleading recommendations about directions for future research that, if followed, would waste scarce resources.

It is useful to distinguish between the notions of ‘qualitative interaction’ and ‘quantitative interaction’ (Yusuf et al 1991). Qualitative interaction exists if the direction of effect is reversed, that is if an intervention is beneficial in one subgroup but is harmful in another. Qualitative interaction is rare. This may be used as an argument that the most appropriate result of a meta-analysis is the overall effect across all subgroups. Quantitative interaction exists when the size of the effect varies but not the direction, that is if an intervention is beneficial to different degrees in different subgroups.

10.11.3 Undertaking subgroup analyses

Meta-analyses can be undertaken in RevMan both within subgroups of studies as well as across all studies irrespective of their subgroup membership. It is tempting to compare effect estimates in different subgroups by considering the meta-analysis results from each subgroup separately. This should only be done informally by comparing the magnitudes of effect. Noting that either the effect or the test for heterogeneity in one subgroup is statistically significant whilst that in the other subgroup is not statistically significant does not indicate that the subgroup factor explains heterogeneity. Since different subgroups are likely to contain different amounts of information and thus have different abilities to detect effects, it is extremely misleading simply to compare the statistical significance of the results.

10.11.3.1 Is the effect different in different subgroups?

Valid investigations of whether an intervention works differently in different subgroups involve comparing the subgroups with each other. It is a mistake to compare within-subgroup inferences such as P values. If one subgroup analysis is statistically significant and another is not, then the latter may simply reflect a lack of information rather than a smaller (or absent) effect. When there are only two subgroups, non-overlap of the confidence intervals indicates statistical significance, but note that the confidence intervals can overlap to a small degree and the difference still be statistically significant.

A formal statistical approach should be used to examine differences among subgroups (see MECIR Box 10.11.a ). A simple significance test to investigate differences between two or more subgroups can be performed (Borenstein and Higgins 2013). This procedure consists of undertaking a standard test for heterogeneity across subgroup results rather than across individual study results. When the meta-analysis uses a fixed-effect inverse-variance weighted average approach, the method is exactly equivalent to the test described by Deeks and colleagues (Deeks et al 2001). An I² statistic is also computed for subgroup differences. This describes the percentage of the variability in effect estimates from the different subgroups that is due to genuine subgroup differences rather than sampling error (chance). Note that these methods for examining subgroup differences should be used only when the data in the subgroups are independent (i.e. they should not be used if the same study participants contribute to more than one of the subgroups in the forest plot).
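For illustration, the following Python sketch applies this idea to hypothetical subgroup summary estimates: the subgroup results are treated like individual studies in a standard heterogeneity test, giving a Q statistic, P value and I² for subgroup differences. The data and function name are invented for the example.

```python
import numpy as np
from scipy import stats

def subgroup_difference_test(subgroup_means, subgroup_ses):
    """Chi² test for differences across subgroup summary estimates, with an I² for subgroups."""
    m, w = np.asarray(subgroup_means), 1 / np.asarray(subgroup_ses) ** 2
    overall = np.sum(w * m) / np.sum(w)
    q = np.sum(w * (m - overall) ** 2)
    df = len(m) - 1
    p = stats.chi2.sf(q, df)
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0
    return q, df, p, i2

# Hypothetical subgroup summary estimates (e.g. log risk ratios) and their standard errors
q, df, p, i2 = subgroup_difference_test([-0.30, -0.05], [0.10, 0.12])
print(f"Test for subgroup differences: Q = {q:.2f}, df = {df}, P = {p:.3f}, I² = {i2:.0f}%")
```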

If fixed-effect models are used for the analysis within each subgroup, then these statistics relate to differences in typical effects across different subgroups. If random-effects models are used for the analysis within each subgroup, then the statistics relate to variation in the mean effects in the different subgroups.

An alternative method for testing for differences between subgroups is to use meta-regression techniques, in which case a random-effects model is generally preferred (see Section 10.11.4 ). Tests for subgroup differences based on random-effects models may be regarded as preferable to those based on fixed-effect models, due to the high risk of false-positive results when a fixed-effect model is used to compare subgroups (Higgins and Thompson 2004).

MECIR Box 10.11.a Relevant expectations for conduct of intervention reviews

10.11.4 Meta-regression

If studies are divided into subgroups (see Section 10.11.2 ), this may be viewed as an investigation of how a categorical study characteristic is associated with the intervention effects in the meta-analysis. For example, studies in which allocation sequence concealment was adequate may yield different results from those in which it was inadequate. Here, allocation sequence concealment, being either adequate or inadequate, is a categorical characteristic at the study level. Meta-regression is an extension to subgroup analyses that allows the effect of continuous, as well as categorical, characteristics to be investigated, and in principle allows the effects of multiple factors to be investigated simultaneously (although this is rarely possible due to inadequate numbers of studies) (Thompson and Higgins 2002). Meta-regression should generally not be considered when there are fewer than ten studies in a meta-analysis.

Meta-regressions are similar in essence to simple regressions, in which an outcome variable is predicted according to the values of one or more explanatory variables . In meta-regression, the outcome variable is the effect estimate (for example, a mean difference, a risk difference, a log odds ratio or a log risk ratio). The explanatory variables are characteristics of studies that might influence the size of intervention effect. These are often called ‘potential effect modifiers’ or covariates. Meta-regressions usually differ from simple regressions in two ways. First, larger studies have more influence on the relationship than smaller studies, since studies are weighted by the precision of their respective effect estimate. Second, it is wise to allow for the residual heterogeneity among intervention effects not modelled by the explanatory variables. This gives rise to the term ‘random-effects meta-regression’, since the extra variability is incorporated in the same way as in a random-effects meta-analysis (Thompson and Sharp 1999).

The regression coefficient obtained from a meta-regression analysis will describe how the outcome variable (the intervention effect) changes with a unit increase in the explanatory variable (the potential effect modifier). The statistical significance of the regression coefficient is a test of whether there is a linear relationship between intervention effect and the explanatory variable. If the intervention effect is a ratio measure, the log-transformed value of the intervention effect should always be used in the regression model (see Chapter 6, Section 6.1.2.1 ), and the exponential of the regression coefficient will give an estimate of the relative change in intervention effect with a unit increase in the explanatory variable.

Meta-regression can also be used to investigate differences for categorical explanatory variables as done in subgroup analyses. If there are J subgroups, membership of particular subgroups is indicated by using J minus 1 dummy variables (which can only take values of zero or one) in the meta-regression model (as in standard linear regression modelling). The regression coefficients will estimate how the intervention effect in each subgroup differs from a nominated reference subgroup. The P value of each regression coefficient will indicate the strength of evidence against the null hypothesis that the characteristic is not associated with the intervention effect.
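As a simplified illustration of the mechanics, the Python sketch below fits a weighted least-squares meta-regression with one continuous covariate. The data are invented, a fixed value of Tau² is assumed rather than estimated (real implementations such as ‘metareg’ or ‘metafor’ estimate the residual heterogeneity, typically by restricted maximum likelihood, and may apply a Knapp-Hartung adjustment), and the standard errors are the simple model-based ones.

```python
import numpy as np

def meta_regression(estimates, ses, covariate, tau2=0.0):
    """Weighted least-squares meta-regression of effect estimates on one covariate."""
    y = np.asarray(estimates, dtype=float)
    x = np.column_stack([np.ones(len(y)), np.asarray(covariate, dtype=float)])
    w = 1 / (np.asarray(ses) ** 2 + tau2)        # tau2 > 0 gives random-effects weights
    xtwx_inv = np.linalg.inv(x.T @ (w[:, None] * x))
    beta = xtwx_inv @ (x.T @ (w * y))            # (intercept, slope)
    se_beta = np.sqrt(np.diag(xtwx_inv))         # model-based standard errors
    return beta, se_beta

# Hypothetical studies: log odds ratios, SEs, and mean dose (mg) as the covariate
beta, se_beta = meta_regression([-0.10, -0.25, -0.40, -0.55],
                                [0.15, 0.14, 0.18, 0.20],
                                [10, 20, 30, 40], tau2=0.02)
print(f"Slope per mg: {beta[1]:.4f} (SE {se_beta[1]:.4f})")
# Categorical characteristics can be handled the same way by adding dummy (0/1) columns to x.
```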

Meta-regression may be performed using the ‘metareg’ macro available for the Stata statistical package, or using the ‘metafor’ package for R, as well as other packages.

10.11.5 Selection of study characteristics for subgroup analyses and meta-regression

Authors need to be cautious about undertaking subgroup analyses, and interpreting any that they do. Some considerations are outlined here for selecting characteristics (also called explanatory variables, potential effect modifiers or covariates) that will be investigated for their possible influence on the size of the intervention effect. These considerations apply similarly to subgroup analyses and to meta-regressions. Further details may be obtained elsewhere (Oxman and Guyatt 1992, Berlin and Antman 1994).

10.11.5.1 Ensure that there are adequate studies to justify subgroup analyses and meta-regressions

It is very unlikely that an investigation of heterogeneity will produce useful findings unless there is a substantial number of studies. Typical advice for undertaking simple regression analyses is that at least ten observations (i.e. ten studies in a meta-analysis) should be available for each characteristic modelled. However, even this will be too few when the covariates are unevenly distributed across studies.

10.11.5.2 Specify characteristics in advance

Authors should, whenever possible, pre-specify characteristics in the protocol that later will be subject to subgroup analyses or meta-regression. The plan specified in the protocol should then be followed (data permitting), without undue emphasis on any particular findings (see MECIR Box 10.11.b ). Pre-specifying characteristics reduces the likelihood of spurious findings, first by limiting the number of subgroups investigated, and second by preventing knowledge of the studies’ results influencing which subgroups are analysed. True pre-specification is difficult in systematic reviews, because the results of some of the relevant studies are often known when the protocol is drafted. If a characteristic was overlooked in the protocol, but is clearly of major importance and justified by external evidence, then authors should not be reluctant to explore it. However, such post-hoc analyses should be identified as such.

MECIR Box 10.11.b Relevant expectations for conduct of intervention reviews

10.11.5.3 Select a small number of characteristics

The likelihood of a false-positive result among subgroup analyses and meta-regression increases with the number of characteristics investigated. It is difficult to suggest a maximum number of characteristics to look at, especially since the number of available studies is unknown in advance. If more than one or two characteristics are investigated it may be sensible to adjust the level of significance to account for making multiple comparisons.

10.11.5.4 Ensure there is scientific rationale for investigating each characteristic

Selection of characteristics should be motivated by biological and clinical hypotheses, ideally supported by evidence from sources other than the included studies. Subgroup analyses using characteristics that are implausible or clinically irrelevant are not likely to be useful and should be avoided. For example, a relationship between intervention effect and year of publication is seldom in itself clinically informative, and if identified runs the risk of initiating a post-hoc data dredge of factors that may have changed over time.

Prognostic factors are those that predict the outcome of a disease or condition, whereas effect modifiers are factors that influence how well an intervention works in affecting the outcome. Confusion between prognostic factors and effect modifiers is common in planning subgroup analyses, especially at the protocol stage. Prognostic factors are not good candidates for subgroup analyses unless they are also believed to modify the effect of intervention. For example, being a smoker may be a strong predictor of mortality within the next ten years, but there may not be reason for it to influence the effect of a drug therapy on mortality (Deeks 1998). Potential effect modifiers may include participant characteristics (age, setting), the precise interventions (dose of active intervention, choice of comparison intervention), how the study was done (length of follow-up) or methodology (design and quality).

10.11.5.5 Be aware that the effect of a characteristic may not always be identified

Many characteristics that might have important effects on how well an intervention works cannot be investigated using subgroup analysis or meta-regression. These are characteristics of participants that might vary substantially within studies, but that can only be summarized at the level of the study. An example is age. Consider a collection of clinical trials involving adults ranging from 18 to 60 years old. There may be a strong relationship between age and intervention effect that is apparent within each study. However, if the mean ages for the trials are similar, then no relationship will be apparent by looking at trial mean ages and trial-level effect estimates. The problem is one of aggregating individuals’ results and is variously known as aggregation bias, ecological bias or the ecological fallacy (Morgenstern 1982, Greenland 1987, Berlin et al 2002). It is even possible for the direction of the relationship across studies to be the opposite of the direction of the relationship observed within each study.

10.11.5.6 Think about whether the characteristic is closely related to another characteristic (confounded)

The problem of ‘confounding’ complicates interpretation of subgroup analyses and meta-regressions and can lead to incorrect conclusions. Two characteristics are confounded if their influences on the intervention effect cannot be disentangled. For example, if those studies implementing an intensive version of a therapy happened to be the studies that involved patients with more severe disease, then one cannot tell which aspect is the cause of any difference in effect estimates between these studies and others. In meta-regression, co-linearity between potential effect modifiers leads to similar difficulties (Berlin and Antman 1994). Computing correlations between study characteristics will give some information about which study characteristics may be confounded with each other.

10.11.6 Interpretation of subgroup analyses and meta-regressions

Appropriate interpretation of subgroup analyses and meta-regressions requires caution (Oxman and Guyatt 1992).

  • Subgroup comparisons are observational. It must be remembered that subgroup analyses and meta-regressions are entirely observational in their nature. These analyses investigate differences between studies. Even if individuals are randomized to one group or other within a clinical trial, they are not randomized to go in one trial or another. Hence, subgroup analyses suffer the limitations of any observational investigation, including possible bias through confounding by other study-level characteristics. Furthermore, even a genuine difference between subgroups is not necessarily due to the classification of the subgroups. As an example, a subgroup analysis of bone marrow transplantation for treating leukaemia might show a strong association between the age of a sibling donor and the success of the transplant. However, this probably does not mean that the age of donor is important. In fact, the age of the recipient is probably a key factor and the subgroup finding would simply be due to the strong association between the age of the recipient and the age of their sibling.  
  • Was the analysis pre-specified or post hoc? Authors should state whether subgroup analyses were pre-specified or undertaken after the results of the studies had been compiled (post hoc). More reliance may be placed on a subgroup analysis if it was one of a small number of pre-specified analyses. Performing numerous post-hoc subgroup analyses to explain heterogeneity is a form of data dredging. Data dredging is condemned because it is usually possible to find an apparent, but false, explanation for heterogeneity by considering lots of different characteristics.  
  • Is there indirect evidence in support of the findings? Differences between subgroups should be clinically plausible and supported by other external or indirect evidence, if they are to be convincing.  
  • Is the magnitude of the difference practically important? If the magnitude of a difference between subgroups will not result in different recommendations for different subgroups, then it may be better to present only the overall analysis results.  
  • Is there a statistically significant difference between subgroups? To establish whether there is a different effect of an intervention in different situations, the magnitudes of effects in different subgroups should be compared directly with each other. In particular, statistical significance of the results within separate subgroup analyses should not be compared (see Section 10.11.3.1 ).  
  • Are analyses looking at within-study or between-study relationships? For patient and intervention characteristics, differences in subgroups that are observed within studies are more reliable than analyses of subsets of studies. If such within-study relationships are replicated across studies then this adds confidence to the findings.

10.11.7 Investigating the effect of underlying risk

One potentially important source of heterogeneity among a series of studies is when the underlying average risk of the outcome event varies between the studies. The underlying risk of a particular event may be viewed as an aggregate measure of case-mix factors such as age or disease severity. It is generally measured as the observed risk of the event in the comparator group of each study (the comparator group risk, or CGR). The notion is controversial in its relevance to clinical practice since underlying risk represents a summary of both known and unknown risk factors. Problems also arise because comparator group risk will depend on the length of follow-up, which often varies across studies. However, underlying risk has received particular attention in meta-analysis because the information is readily available once dichotomous data have been prepared for use in meta-analyses. Sharp provides a full discussion of the topic (Sharp 2001).

Intuition would suggest that participants are more or less likely to benefit from an effective intervention according to their risk status. However, the relationship between underlying risk and intervention effect is a complicated issue. For example, suppose an intervention is equally beneficial in the sense that for all patients it reduces the risk of an event, say a stroke, to 80% of the underlying risk. Then it is not equally beneficial in terms of absolute differences in risk in the sense that it reduces a 50% stroke rate by 10 percentage points to 40% (number needed to treat=10), but a 20% stroke rate by 4 percentage points to 16% (number needed to treat=25).

Use of different summary statistics (risk ratio, odds ratio and risk difference) will demonstrate different relationships with underlying risk. Summary statistics that show close to no relationship with underlying risk are generally preferred for use in meta-analysis (see Section 10.4.3 ).

Investigating any relationship between effect estimates and the comparator group risk is also complicated by a technical phenomenon known as regression to the mean. This arises because the comparator group risk forms an integral part of the effect estimate. A high risk in a comparator group, observed entirely by chance, will on average give rise to a higher than expected effect estimate, and vice versa. This phenomenon results in a false correlation between effect estimates and comparator group risks. There are methods, which require sophisticated software, that correct for regression to the mean (McIntosh 1996, Thompson et al 1997). These should be used for such analyses, and statistical expertise is recommended.

10.11.8 Dose-response analyses

The principles of meta-regression can be applied to the relationships between intervention effect and dose (commonly termed dose-response), treatment intensity or treatment duration (Greenland and Longnecker 1992, Berlin et al 1993). Conclusions about differences in effect due to differences in dose (or similar factors) are on stronger ground if participants are randomized to one dose or another within a study and a consistent relationship is found across similar studies. While authors should consider these effects, particularly as a possible explanation for heterogeneity, they should be cautious about drawing conclusions based on between-study differences. Authors should be particularly cautious about claiming that a dose-response relationship does not exist, given the low power of many meta-regression analyses to detect genuine relationships.

10.12 Missing data

10.12.1 Types of missing data

There are many potential sources of missing data in a systematic review or meta-analysis (see Table 10.12.a ). For example, a whole study may be missing from the review, an outcome may be missing from a study, summary data may be missing for an outcome, and individual participants may be missing from the summary data. Here we discuss a variety of potential sources of missing data, highlighting where more detailed discussions are available elsewhere in the Handbook .

Whole studies may be missing from a review because they are never published, are published in obscure places, are rarely cited, or are inappropriately indexed in databases. Thus, review authors should always be aware of the possibility that they have failed to identify relevant studies. There is a strong possibility that such studies are missing because of their ‘uninteresting’ or ‘unwelcome’ findings (that is, in the presence of publication bias). This problem is discussed at length in Chapter 13 . Details of comprehensive search methods are provided in Chapter 4 .

Some studies might not report any information on outcomes of interest to the review. For example, there may be no information on quality of life, or on serious adverse effects. It is often difficult to determine whether this is because the outcome was not measured or because the outcome was not reported. Furthermore, failure to report that outcomes were measured may be dependent on the unreported results (selective outcome reporting bias; see Chapter 7, Section 7.2.3.3 ). Similarly, summary data for an outcome, in a form that can be included in a meta-analysis, may be missing. A common example is missing standard deviations (SDs) for continuous outcomes. This is often a problem when change-from-baseline outcomes are sought. We discuss imputation of missing SDs in Chapter 6, Section 6.5.2.8 . Other examples of missing summary data are missing sample sizes (particularly those for each intervention group separately), numbers of events, standard errors, follow-up times for calculating rates, and sufficient details of time-to-event outcomes. Inappropriate analyses of studies, for example of cluster-randomized and crossover trials, can lead to missing summary data. It is sometimes possible to approximate the correct analyses of such studies, for example by imputing correlation coefficients or SDs, as discussed in Chapter 23, Section 23.1 , for cluster-randomized studies and Chapter 23, Section 23.2 , for crossover trials. As a general rule, most methodologists believe that missing summary data (e.g. ‘no usable data’) should not be used as a reason to exclude a study from a systematic review. It is more appropriate to include the study in the review, and to discuss the potential implications of its absence from a meta-analysis.

It is likely that in some, if not all, included studies, there will be individuals missing from the reported results. Review authors are encouraged to consider this problem carefully (see MECIR Box 10.12.a ). We provide further discussion of this problem in Section 10.12.3 ; see also Chapter 8, Section 8.5 .

Missing data can also affect subgroup analyses. If subgroup analyses or meta-regressions are planned (see Section 10.11 ), they require details of the study-level characteristics that distinguish studies from one another. If these are not available for all studies, review authors should consider asking the study authors for more information.

Table 10.12.a Types of missing data in a meta-analysis

MECIR Box 10.12.a Relevant expectations for conduct of intervention reviews

10.12.2 General principles for dealing with missing data

There is a large literature on statistical methods for dealing with missing data. Here we briefly review some key concepts and make some general recommendations for Cochrane Review authors. It is important to think about why data may be missing. Statisticians often use the terms ‘missing at random’ and ‘not missing at random’ to represent different scenarios.

Data are said to be ‘missing at random’ if the fact that they are missing is unrelated to actual values of the missing data. For instance, if some quality-of-life questionnaires were lost in the postal system, this would be unlikely to be related to the quality of life of the trial participants who completed the forms. In some circumstances, statisticians distinguish between data ‘missing at random’ and data ‘missing completely at random’, although in the context of a systematic review the distinction is unlikely to be important. Data that are missing at random may not be important. Analyses based on the available data will often be unbiased, although based on a smaller sample size than the original data set.

Data are said to be ‘not missing at random’ if the fact that they are missing is related to the actual missing data. For instance, in a depression trial, participants who had a relapse of depression might be less likely to attend the final follow-up interview, and more likely to have missing outcome data. Such data are ‘non-ignorable’ in the sense that an analysis of the available data alone will typically be biased. Publication bias and selective reporting bias lead by definition to data that are ‘not missing at random’, and attrition and exclusions of individuals within studies often do as well.

The principal options for dealing with missing data are:

  1. analysing only the available data (i.e. ignoring the missing data);
  2. imputing the missing data with replacement values, and treating these as if they were observed (e.g. last observation carried forward, imputing an assumed outcome such as assuming all were poor outcomes, imputing the mean, imputing based on predicted values from a regression analysis);
  3. imputing the missing data and accounting for the fact that these were imputed with uncertainty (e.g. multiple imputation, or the simple imputation methods of option 2 with an adjustment to the standard error); and
  4. using statistical models to allow for missing data, making assumptions about their relationships with the available data.

Option 2 is practical in most circumstances and very commonly used in systematic reviews. However, it fails to acknowledge uncertainty in the imputed values and typically results in confidence intervals that are too narrow. Options 3 and 4 would require involvement of a knowledgeable statistician.

Five general recommendations for dealing with missing data in Cochrane Reviews are as follows:

  • Whenever possible, contact the original investigators to request missing data.
  • Make explicit the assumptions of any methods used to address missing data: for example, that the data are assumed missing at random, or that missing values were assumed to have a particular value such as a poor outcome.
  • Follow the guidance in Chapter 8 to assess risk of bias due to missing outcome data in randomized trials.
  • Perform sensitivity analyses to assess how sensitive results are to reasonable changes in the assumptions that are made (see Section 10.14).
  • Address the potential impact of missing data on the findings of the review in the Discussion section.

10.12.3 Dealing with missing outcome data from individual participants

Review authors may undertake sensitivity analyses to assess the potential impact of missing outcome data, based on assumptions about the relationship between missingness in the outcome and its true value. Several methods are available (Akl et al 2015). For dichotomous outcomes, Higgins and colleagues propose a strategy involving different assumptions about how the risk of the event among the missing participants differs from the risk of the event among the observed participants, taking account of uncertainty introduced by the assumptions (Higgins et al 2008a). Akl and colleagues propose a suite of simple imputation methods, including a similar approach to that of Higgins and colleagues based on relative risks of the event in missing versus observed participants. Similar ideas can be applied to continuous outcome data (Ebrahim et al 2013, Ebrahim et al 2014). Particular care is required to avoid double counting events, since it can be unclear whether reported numbers of events in trial reports apply to the full randomized sample or only to those who did not drop out (Akl et al 2016).

Although there is a tradition of implementing ‘worst case’ and ‘best case’ analyses to clarify the extreme boundaries of what is theoretically possible, such analyses may not be informative about the most plausible scenarios (Higgins et al 2008a).
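To make this concrete, the sketch below (our own illustration in R with the metafor package, not code from the Handbook) recomputes one invented trial's log odds ratio under an available-case analysis and under best-case and worst-case assumptions about participants with missing outcomes, taking the event to be an unfavourable outcome.

    # Minimal sketch (R, metafor): available-case versus best-case / worst-case
    # handling of missing dichotomous outcomes in one invented trial.
    library(metafor)

    # Observed data: events / analysed / randomized, per arm (invented numbers)
    ev_t <- 12; n_t <- 80; rand_t <- 100   # treatment arm: 20 participants missing
    ev_c <- 20; n_c <- 85; rand_c <- 100   # control arm:   15 participants missing

    log_or <- function(ev_t, ev_c, n_t, n_c) {
      escalc(measure = "OR", ai = ev_t, n1i = n_t, ci = ev_c, n2i = n_c)
    }

    # Available-case analysis: ignore the missing participants
    available <- log_or(ev_t, ev_c, n_t, n_c)

    # Best case for the treatment (event = bad outcome): missing treated
    # participants have no events; missing control participants all have events
    best <- log_or(ev_t, ev_c + (rand_c - n_c), rand_t, rand_c)

    # Worst case for the treatment: the reverse assumptions
    worst <- log_or(ev_t + (rand_t - n_t), ev_c, rand_t, rand_c)

    # yi = log odds ratio, vi = its sampling variance, under each assumption
    results <- rbind(available, best, worst)
    rownames(results) <- c("available", "best", "worst")
    results

Comparing the resulting estimates gives a rough feel for how far the trial's contribution could move under extreme assumptions, which is exactly why such bounds are often less useful than analyses built on plausible assumptions about the missing participants.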

10.13 Bayesian approaches to meta-analysis

Bayesian statistics is an approach to statistics based on a different philosophy from that which underlies significance tests and confidence intervals. It is essentially about the updating of evidence. In a Bayesian analysis, initial uncertainty is expressed through a prior distribution about the quantities of interest. Current data and assumptions concerning how they were generated are summarized in the likelihood. The posterior distribution for the quantities of interest can then be obtained by combining the prior distribution and the likelihood. The likelihood summarizes both the data from studies included in the meta-analysis (for example, 2×2 tables from randomized trials) and the meta-analysis model (for example, assuming a fixed effect or random effects). The result of the analysis is usually presented as a point estimate and 95% credible interval from the posterior distribution for each quantity of interest, which look much like classical estimates and confidence intervals. Potential advantages of Bayesian analyses are summarized in Box 10.13.a. Bayesian analysis may be performed using WinBUGS software (Smith et al 1995, Lunn et al 2000), within R (Röver 2017), or – for some applications – using standard meta-regression software with a simple trick (Rhodes et al 2016).

A difference between Bayesian analysis and classical meta-analysis is that the interpretation is directly in terms of belief: a 95% credible interval for an odds ratio is that region in which we believe the odds ratio to lie with probability 95%. This is how many practitioners actually interpret a classical confidence interval, but strictly in the classical framework the 95% refers to the long-term frequency with which 95% intervals contain the true value. The Bayesian framework also allows a review author to calculate the probability that the odds ratio has a particular range of values, which cannot be done in the classical framework. For example, we can determine the probability that the odds ratio is less than 1 (which might indicate a beneficial effect of an experimental intervention), or that it is no larger than 0.8 (which might indicate a clinically important effect). It should be noted that these probabilities are specific to the choice of the prior distribution. Different meta-analysts may analyse the same data using different prior distributions and obtain different results. It is therefore important to carry out sensitivity analyses to investigate how the results depend on any assumptions made.

In the context of a meta-analysis, prior distributions are needed for the particular intervention effect being analysed (such as the odds ratio or the mean difference) and – in the context of a random-effects meta-analysis – for the amount of heterogeneity among intervention effects across studies. Prior distributions may represent subjective belief about the size of the effect, or may be derived from sources of evidence not included in the meta-analysis, such as information from non-randomized studies of the same intervention or from randomized trials of other interventions. The width of the prior distribution reflects the degree of uncertainty about the quantity. When there is little or no information, a ‘non-informative’ prior can be used, in which all values across the possible range are equally likely.

Most Bayesian meta-analyses use non-informative (or very weakly informative) prior distributions to represent beliefs about intervention effects, since many regard it as controversial to combine objective trial data with subjective opinion. However, prior distributions are increasingly used for the extent of among-study variation in a random-effects analysis. This is particularly advantageous when the number of studies in the meta-analysis is small, say fewer than five or ten. Libraries of data-based prior distributions are available that have been derived from re-analyses of many thousands of meta-analyses in the Cochrane Database of Systematic Reviews (Turner et al 2012).

Box 10.13.a Some potential advantages of Bayesian meta-analysis

Statistical expertise is strongly recommended for review authors who wish to carry out Bayesian analyses. There are several good texts (Sutton et al 2000, Sutton and Abrams 2001, Spiegelhalter et al 2004).
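As a small worked illustration of these ideas, the sketch below fits a Bayesian random-effects meta-analysis to invented log odds ratios using the bayesmeta package for R (Röver 2017), cited above. The argument names follow that package's documented interface as we understand it, and the data and prior choices are purely illustrative, not recommendations.

    # Minimal sketch (R, bayesmeta): Bayesian random-effects meta-analysis of
    # invented study results. Priors are illustrative only.
    library(bayesmeta)

    yi  <- c(-0.45, -0.10, -0.30, 0.05)   # study log odds ratios (invented)
    sei <- c(0.20, 0.25, 0.15, 0.30)      # their standard errors (invented)

    ma <- bayesmeta(y = yi, sigma = sei,
                    labels = paste0("Study ", 1:4),
                    # vague prior for the pooled effect:
                    mu.prior.mean = 0, mu.prior.sd = 2,
                    # weakly informative half-normal prior for the between-study SD (tau):
                    tau.prior = function(t) dhalfnormal(t, scale = 0.5))

    ma   # posterior summaries: point estimates and 95% credible intervals

    # Posterior probability that the pooled log odds ratio is below 0,
    # i.e. that the odds ratio is less than 1
    ma$pposterior(mu = 0)

Re-running the call with a different tau.prior or a different prior for the effect is one simple way of carrying out the sensitivity analyses to prior assumptions recommended above.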

10.14 Sensitivity analyses

The process of undertaking a systematic review involves a sequence of decisions. Whilst many of these decisions are clearly objective and non-contentious, some will be somewhat arbitrary or unclear. For instance, if eligibility criteria involve a numerical value, the choice of value is usually arbitrary: for example, defining groups of older people may reasonably have lower limits of 60, 65, 70 or 75 years, or any value in between. Other decisions may be unclear because a study report fails to include the required information. Some decisions are unclear because the included studies themselves never obtained the information required: for example, the outcomes of those who were lost to follow-up. Further decisions are unclear because there is no consensus on the best statistical method to use for a particular problem.

It is highly desirable to demonstrate that the findings from a systematic review are not dependent on such arbitrary or unclear decisions by using sensitivity analysis (see MECIR Box 10.14.a). A sensitivity analysis is a repeat of the primary analysis or meta-analysis in which alternative decisions or ranges of values are substituted for decisions that were arbitrary or unclear. For example, if the eligibility of some studies in the meta-analysis is dubious because they do not contain full details, sensitivity analysis may involve undertaking the meta-analysis twice: first including all studies and then including only those that are definitely known to be eligible. A sensitivity analysis asks the question, ‘Are the findings robust to the decisions made in the process of obtaining them?’
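As a concrete illustration of this ‘all studies versus definitely eligible studies’ comparison, the sketch below repeats the same random-effects meta-analysis on an invented data set using the metafor package in R; the study data, the eligibility flags, and the helper function are ours and purely illustrative.

    # Minimal sketch (R, metafor): repeating a meta-analysis under two
    # eligibility decisions. All data below are invented for illustration.
    library(metafor)

    dat <- data.frame(
      study    = paste0("Study ", 1:6),
      yi       = c(-0.40, -0.25, -0.55, 0.10, -0.30, -0.20),  # log odds ratios
      vi       = c(0.04, 0.06, 0.09, 0.05, 0.07, 0.03),       # sampling variances
      eligible = c("definite", "definite", "dubious", "definite", "dubious", "definite")
    )

    primary    <- rma(yi, vi, data = dat, method = "REML")       # all studies
    restricted <- rma(yi, vi, data = dat, method = "REML",
                      subset = eligible == "definite")           # definitely eligible only

    # Compare the pooled estimate and its confidence interval under each decision
    summarise_fit <- function(fit) c(estimate = as.numeric(coef(fit)),
                                     ci.lb = fit$ci.lb, ci.ub = fit$ci.ub)
    rbind(all_studies = summarise_fit(primary),
          definitely_eligible = summarise_fit(restricted))

If the two rows tell essentially the same story, the finding can be reported as robust to the eligibility decision; if not, the discrepancy itself becomes a finding to report and explain.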

MECIR Box 10.14.a Relevant expectations for conduct of intervention reviews

There are many decision nodes within the systematic review process that can generate a need for a sensitivity analysis. Examples include:

Searching for studies:

  • Should abstracts whose results cannot be confirmed in subsequent publications be included in the review?

Eligibility criteria:

  • Characteristics of participants: where a majority but not all people in a study meet an age range, should the study be included?
  • Characteristics of the intervention: what range of doses should be included in the meta-analysis?
  • Characteristics of the comparator: what criteria are required to define usual care to be used as a comparator group?
  • Characteristics of the outcome: what time point or range of time points are eligible for inclusion?
  • Study design: should studies with blinded and unblinded outcome assessment both be included, or should study inclusion be restricted by other methodological criteria?

What data should be analysed?

  • Time-to-event data: what assumptions of the distribution of censored data should be made?
  • Continuous data: where standard deviations are missing, when and how should they be imputed? Should analyses be based on change scores or on post-intervention values?
  • Ordinal scales: what cut-point should be used to dichotomize short ordinal scales into two groups?
  • Cluster-randomized trials: what values of the intraclass correlation coefficient should be used when trial analyses have not been adjusted for clustering? (A short sketch of this adjustment follows these lists.)
  • Crossover trials: what values of the within-subject correlation coefficient should be used when this is not available in primary reports?
  • All analyses: what assumptions should be made about missing outcomes? Should adjusted or unadjusted estimates of intervention effects be used?

Analysis methods:

  • Should fixed-effect or random-effects methods be used for the analysis?
  • For dichotomous outcomes, should odds ratios, risk ratios or risk differences be used?
  • For continuous outcomes, where several scales have assessed the same dimension, should results be analysed as a standardized mean difference across all scales or as mean differences individually for each scale?
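As flagged in the cluster-randomized trials item above, a common approach is to deflate each arm's sample size by the design effect, 1 + (m - 1) × ICC, where m is the average cluster size, and to repeat the calculation over a range of plausible ICC values. The R sketch below illustrates this for one invented arm; the numbers are our own and purely illustrative.

    # Minimal sketch (R): how the effective sample size of one trial arm changes
    # with the assumed intraclass correlation coefficient (ICC). Invented numbers.

    effective_n <- function(n, avg_cluster_size, icc) {
      design_effect <- 1 + (avg_cluster_size - 1) * icc
      n / design_effect
    }

    # 400 participants recruited in clusters of about 20, under a range of ICCs
    iccs <- c(0.01, 0.02, 0.05, 0.10)
    round(sapply(iccs, function(icc) effective_n(400, 20, icc)))

The adjusted sample sizes (or correspondingly inflated standard errors) can then be fed into the meta-analysis once for each assumed ICC, and the results compared as a sensitivity analysis.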

Some sensitivity analyses can be pre-specified in the study protocol, but many issues suitable for sensitivity analysis are only identified during the review process, when the individual peculiarities of the studies under investigation become apparent. When sensitivity analyses show that the overall result and conclusions are not affected by the different decisions that could be made during the review process, the results of the review can be regarded with a higher degree of certainty. Where sensitivity analyses identify particular decisions or missing information that greatly influence the findings of the review, greater resources can be deployed to try to resolve uncertainties and obtain extra information, possibly through contacting trial authors and obtaining individual participant data. If this cannot be achieved, the results must be interpreted with an appropriate degree of caution. Such findings may generate proposals for further investigations and future research.

Reporting of sensitivity analyses in a systematic review may best be done by producing a summary table. Rarely is it informative to produce individual forest plots for each sensitivity analysis undertaken.

Sensitivity analyses are sometimes confused with subgroup analyses. Although some sensitivity analyses involve restricting the analysis to a subset of the totality of studies, the two methods differ in two ways. First, sensitivity analyses do not attempt to estimate the effect of the intervention in the group of studies removed from the analysis, whereas in subgroup analyses, estimates are produced for each subgroup. Second, in sensitivity analyses, informal comparisons are made between different ways of estimating the same thing, whereas in subgroup analyses, formal statistical comparisons are made across the subgroups.

10.15 Chapter information

Editors: Jonathan J Deeks, Julian PT Higgins, Douglas G Altman; on behalf of the Cochrane Statistical Methods Group

Contributing authors: Douglas Altman, Deborah Ashby, Jacqueline Birks, Michael Borenstein, Marion Campbell, Jonathan Deeks, Matthias Egger, Julian Higgins, Joseph Lau, Keith O’Rourke, Gerta Rücker, Rob Scholten, Jonathan Sterne, Simon Thompson, Anne Whitehead

Acknowledgements: We are grateful to the following for commenting helpfully on earlier drafts: Bodil Als-Nielsen, Deborah Ashby, Jesse Berlin, Joseph Beyene, Jacqueline Birks, Michael Bracken, Marion Campbell, Chris Cates, Wendong Chen, Mike Clarke, Albert Cobos, Esther Coren, Francois Curtin, Roberto D’Amico, Keith Dear, Heather Dickinson, Diana Elbourne, Simon Gates, Paul Glasziou, Christian Gluud, Peter Herbison, Sally Hollis, David Jones, Steff Lewis, Tianjing Li, Joanne McKenzie, Philippa Middleton, Nathan Pace, Craig Ramsey, Keith O’Rourke, Rob Scholten, Guido Schwarzer, Jack Sinclair, Jonathan Sterne, Simon Thompson, Andy Vail, Clarine van Oel, Paula Williamson and Fred Wolf.

Funding: JJD received support from the National Institute for Health Research (NIHR) Birmingham Biomedical Research Centre at the University Hospitals Birmingham NHS Foundation Trust and the University of Birmingham. JPTH is a member of the NIHR Biomedical Research Centre at University Hospitals Bristol NHS Foundation Trust and the University of Bristol. JPTH received funding from National Institute for Health Research Senior Investigator award NF-SI-0617-10145. The views expressed are those of the author(s) and not necessarily those of the NHS, the NIHR or the Department of Health.

10.16 References

Agresti A. An Introduction to Categorical Data Analysis . New York (NY): John Wiley & Sons; 1996.

Akl EA, Kahale LA, Agoritsas T, Brignardello-Petersen R, Busse JW, Carrasco-Labra A, Ebrahim S, Johnston BC, Neumann I, Sola I, Sun X, Vandvik P, Zhang Y, Alonso-Coello P, Guyatt G. Handling trial participants with missing outcome data when conducting a meta-analysis: a systematic survey of proposed approaches. Systematic Reviews 2015; 4 : 98.

Akl EA, Kahale LA, Ebrahim S, Alonso-Coello P, Schünemann HJ, Guyatt GH. Three challenges described for identifying participants with missing data in trials reports, and potential solutions suggested to systematic reviewers. Journal of Clinical Epidemiology 2016; 76 : 147-154.

Altman DG, Bland JM. Detecting skewness from summary information. BMJ 1996; 313 : 1200.

Anzures-Cabrera J, Sarpatwari A, Higgins JPT. Expressing findings from meta-analyses of continuous outcomes in terms of risks. Statistics in Medicine 2011; 30 : 2967-2985.

Berlin JA, Longnecker MP, Greenland S. Meta-analysis of epidemiologic dose-response data. Epidemiology 1993; 4 : 218-228.

Berlin JA, Antman EM. Advantages and limitations of metaanalytic regressions of clinical trials data. Online Journal of Current Clinical Trials 1994; Doc No 134 .

Berlin JA, Santanna J, Schmid CH, Szczech LA, Feldman KA, Group A-LAITS. Individual patient- versus group-level data meta-regressions for the investigation of treatment effect modifiers: ecological bias rears its ugly head. Statistics in Medicine 2002; 21 : 371-387.

Borenstein M, Hedges LV, Higgins JPT, Rothstein HR. A basic introduction to fixed-effect and random-effects models for meta-analysis. Research Synthesis Methods 2010; 1 : 97-111.

Borenstein M, Higgins JPT. Meta-analysis and subgroups. Prev Sci 2013; 14 : 134-143.

Bradburn MJ, Deeks JJ, Berlin JA, Russell Localio A. Much ado about nothing: a comparison of the performance of meta-analytical methods with rare events. Statistics in Medicine 2007; 26 : 53-77.

Chinn S. A simple method for converting an odds ratio to effect size for use in meta-analysis. Statistics in Medicine 2000; 19 : 3127-3131.

da Costa BR, Nuesch E, Rutjes AW, Johnston BC, Reichenbach S, Trelle S, Guyatt GH, Jüni P. Combining follow-up and change data is valid in meta-analyses of continuous outcomes: a meta-epidemiological study. Journal of Clinical Epidemiology 2013; 66 : 847-855.

Deeks JJ. Systematic reviews of published evidence: Miracles or minefields? Annals of Oncology 1998; 9 : 703-709.

Deeks JJ, Altman DG, Bradburn MJ. Statistical methods for examining heterogeneity and combining results from several studies in meta-analysis. In: Egger M, Davey Smith G, Altman DG, editors. Systematic Reviews in Health Care: Meta-analysis in Context. 2nd edition. London (UK): BMJ Publication Group; 2001. p. 285-312.

Deeks JJ. Issues in the selection of a summary statistic for meta-analysis of clinical trials with binary outcomes. Statistics in Medicine 2002; 21 : 1575-1600.

DerSimonian R, Laird N. Meta-analysis in clinical trials. Controlled Clinical Trials 1986; 7 : 177-188.

DiGuiseppi C, Higgins JPT. Interventions for promoting smoke alarm ownership and function. Cochrane Database of Systematic Reviews 2001; 2 : CD002246.

Ebrahim S, Akl EA, Mustafa RA, Sun X, Walter SD, Heels-Ansdell D, Alonso-Coello P, Johnston BC, Guyatt GH. Addressing continuous data for participants excluded from trial analysis: a guide for systematic reviewers. Journal of Clinical Epidemiology 2013; 66 : 1014-1021 e1011.

Ebrahim S, Johnston BC, Akl EA, Mustafa RA, Sun X, Walter SD, Heels-Ansdell D, Alonso-Coello P, Guyatt GH. Addressing continuous data measured with different instruments for participants excluded from trial analysis: a guide for systematic reviewers. Journal of Clinical Epidemiology 2014; 67 : 560-570.

Efthimiou O. Practical guide to the meta-analysis of rare events. Evidence-Based Mental Health 2018; 21 : 72-76.

Egger M, Davey Smith G, Schneider M, Minder C. Bias in meta-analysis detected by a simple, graphical test. BMJ 1997; 315 : 629-634.

Engels EA, Schmid CH, Terrin N, Olkin I, Lau J. Heterogeneity and statistical significance in meta-analysis: an empirical study of 125 meta-analyses. Statistics in Medicine 2000; 19 : 1707-1728.

Greenland S, Robins JM. Estimation of a common effect parameter from sparse follow-up data. Biometrics 1985; 41 : 55-68.

Greenland S. Quantitative methods in the review of epidemiologic literature. Epidemiologic Reviews 1987; 9 : 1-30.

Greenland S, Longnecker MP. Methods for trend estimation from summarized dose-response data, with applications to meta-analysis. American Journal of Epidemiology 1992; 135 : 1301-1309.

Guevara JP, Berlin JA, Wolf FM. Meta-analytic methods for pooling rates when follow-up duration varies: a case study. BMC Medical Research Methodology 2004; 4 : 17.

Hartung J, Knapp G. A refined method for the meta-analysis of controlled clinical trials with binary outcome. Statistics in Medicine 2001; 20 : 3875-3889.

Hasselblad V, McCrory DC. Meta-analytic tools for medical decision making: A practical guide. Medical Decision Making 1995; 15 : 81-96.

Higgins JPT, Thompson SG. Quantifying heterogeneity in a meta-analysis. Statistics in Medicine 2002; 21 : 1539-1558.

Higgins JPT, Thompson SG, Deeks JJ, Altman DG. Measuring inconsistency in meta-analyses. BMJ 2003; 327 : 557-560.

Higgins JPT, Thompson SG. Controlling the risk of spurious findings from meta-regression. Statistics in Medicine 2004; 23 : 1663-1682.

Higgins JPT, White IR, Wood AM. Imputation methods for missing outcome data in meta-analysis of clinical trials. Clinical Trials 2008a; 5 : 225-239.

Higgins JPT, White IR, Anzures-Cabrera J. Meta-analysis of skewed data: combining results reported on log-transformed or raw scales. Statistics in Medicine 2008b; 27 : 6072-6092.

Higgins JPT, Thompson SG, Spiegelhalter DJ. A re-evaluation of random-effects meta-analysis. Journal of the Royal Statistical Society: Series A (Statistics in Society) 2009; 172 : 137-159.

Kjaergard LL, Villumsen J, Gluud C. Reported methodologic quality and discrepancies between large and small randomized trials in meta-analyses. Annals of Internal Medicine 2001; 135 : 982-989.

Langan D, Higgins JPT, Simmonds M. An empirical comparison of heterogeneity variance estimators in 12 894 meta-analyses. Research Synthesis Methods 2015; 6 : 195-205.

Langan D, Higgins JPT, Simmonds M. Comparative performance of heterogeneity variance estimators in meta-analysis: a review of simulation studies. Research Synthesis Methods 2017; 8 : 181-198.

Langan D, Higgins JPT, Jackson D, Bowden J, Veroniki AA, Kontopantelis E, Viechtbauer W, Simmonds M. A comparison of heterogeneity variance estimators in simulated random-effects meta-analyses. Research Synthesis Methods 2019; 10 : 83-98.

Lewis S, Clarke M. Forest plots: trying to see the wood and the trees. BMJ 2001; 322 : 1479-1480.

Lunn DJ, Thomas A, Best N, Spiegelhalter D. WinBUGS - A Bayesian modelling framework: Concepts, structure, and extensibility. Statistics and Computing 2000; 10 : 325-337.

Mantel N, Haenszel W. Statistical aspects of the analysis of data from retrospective studies of disease. Journal of the National Cancer Institute 1959; 22 : 719-748.

McIntosh MW. The population risk as an explanatory variable in research synthesis of clinical trials. Statistics in Medicine 1996; 15 : 1713-1728.

Morgenstern H. Uses of ecologic analysis in epidemiologic research. American Journal of Public Health 1982; 72 : 1336-1344.

Oxman AD, Guyatt GH. A consumers guide to subgroup analyses. Annals of Internal Medicine 1992; 116 : 78-84.

Peto R, Collins R, Gray R. Large-scale randomized evidence: large, simple trials and overviews of trials. Journal of Clinical Epidemiology 1995; 48 : 23-40.

Poole C, Greenland S. Random-effects meta-analyses are not always conservative. American Journal of Epidemiology 1999; 150 : 469-475.

Rhodes KM, Turner RM, White IR, Jackson D, Spiegelhalter DJ, Higgins JPT. Implementing informative priors for heterogeneity in meta-analysis using meta-regression and pseudo data. Statistics in Medicine 2016; 35 : 5495-5511.

Rice K, Higgins JPT, Lumley T. A re-evaluation of fixed effect(s) meta-analysis. Journal of the Royal Statistical Society Series A (Statistics in Society) 2018; 181 : 205-227.

Riley RD, Higgins JPT, Deeks JJ. Interpretation of random effects meta-analyses. BMJ 2011; 342 : d549.

Röver C. Bayesian random-effects meta-analysis using the bayesmeta R package 2017. https://arxiv.org/abs/1711.08683 .

Rücker G, Schwarzer G, Carpenter J, Olkin I. Why add anything to nothing? The arcsine difference as a measure of treatment effect in meta-analysis with zero cells. Statistics in Medicine 2009; 28 : 721-738.

Sharp SJ. Analysing the relationship between treatment benefit and underlying risk: precautions and practical recommendations. In: Egger M, Davey Smith G, Altman DG, editors. Systematic Reviews in Health Care: Meta-analysis in Context. 2nd edition. London (UK): BMJ Publication Group; 2001. p. 176-188.

Sidik K, Jonkman JN. A simple confidence interval for meta-analysis. Statistics in Medicine 2002; 21 : 3153-3159.

Simmonds MC, Tierney J, Bowden J, Higgins JPT. Meta-analysis of time-to-event data: a comparison of two-stage methods. Research Synthesis Methods 2011; 2 : 139-149.

Sinclair JC, Bracken MB. Clinically useful measures of effect in binary analyses of randomized trials. Journal of Clinical Epidemiology 1994; 47 : 881-889.

Smith TC, Spiegelhalter DJ, Thomas A. Bayesian approaches to random-effects meta-analysis: a comparative study. Statistics in Medicine 1995; 14 : 2685-2699.

Spiegelhalter DJ, Abrams KR, Myles JP. Bayesian Approaches to Clinical Trials and Health-Care Evaluation . Chichester (UK): John Wiley & Sons; 2004.

Spittal MJ, Pirkis J, Gurrin LC. Meta-analysis of incidence rate data in the presence of zero events. BMC Medical Research Methodology 2015; 15 : 42.

Sutton AJ, Abrams KR, Jones DR, Sheldon TA, Song F. Methods for Meta-analysis in Medical Research . Chichester (UK): John Wiley & Sons; 2000.

Sutton AJ, Abrams KR. Bayesian methods in meta-analysis and evidence synthesis. Statistical Methods in Medical Research 2001; 10 : 277-303.

Sweeting MJ, Sutton AJ, Lambert PC. What to add to nothing? Use and avoidance of continuity corrections in meta-analysis of sparse data. Statistics in Medicine 2004; 23 : 1351-1375.

Thompson SG, Smith TC, Sharp SJ. Investigating underlying risk as a source of heterogeneity in meta-analysis. Statistics in Medicine 1997; 16 : 2741-2758.

Thompson SG, Sharp SJ. Explaining heterogeneity in meta-analysis: a comparison of methods. Statistics in Medicine 1999; 18 : 2693-2708.

Thompson SG, Higgins JPT. How should meta-regression analyses be undertaken and interpreted? Statistics in Medicine 2002; 21 : 1559-1574.

Turner RM, Davey J, Clarke MJ, Thompson SG, Higgins JPT. Predicting the extent of heterogeneity in meta-analysis, using empirical data from the Cochrane Database of Systematic Reviews. International Journal of Epidemiology 2012; 41 : 818-827.

Veroniki AA, Jackson D, Viechtbauer W, Bender R, Bowden J, Knapp G, Kuss O, Higgins JPT, Langan D, Salanti G. Methods to estimate the between-study variance and its uncertainty in meta-analysis. Research Synthesis Methods 2016; 7 : 55-79.

Whitehead A, Jones NMB. A meta-analysis of clinical trials involving different classifications of response into ordered categories. Statistics in Medicine 1994; 13 : 2503-2515.

Yusuf S, Peto R, Lewis J, Collins R, Sleight P. Beta blockade during and after myocardial infarction: an overview of the randomized trials. Progress in Cardiovascular Diseases 1985; 27 : 335-371.

Yusuf S, Wittes J, Probstfield J, Tyroler HA. Analysis and interpretation of treatment effects in subgroups of patients in randomized clinical trials. JAMA 1991; 266 : 93-98.


A Guide to Conducting a Meta-Analysis

Affiliations.

  • 1 Department of Psychology, Faculty of Arts and Social Sciences, National University of Singapore, Block AS4, Level 2, 9 Arts Link, Singapore, 117570, Singapore. [email protected].
  • 2 Department of Psychology, Faculty of Arts and Social Sciences, National University of Singapore, Block AS4, Level 2, 9 Arts Link, Singapore, 117570, Singapore.
  • PMID: 27209412
  • DOI: 10.1007/s11065-016-9319-z

Meta-analysis is widely accepted as the preferred method to synthesize research findings in various disciplines. This paper provides an introduction to when and how to conduct a meta-analysis. Several practical questions, such as advantages of meta-analysis over conventional narrative review and the number of studies required for a meta-analysis, are addressed. Common meta-analytic models are then introduced. An artificial dataset is used to illustrate how a meta-analysis is conducted in several software packages. The paper concludes with some common pitfalls of meta-analysis and their solutions. The primary goal of this paper is to provide a summary background to readers who would like to conduct their first meta-analytic study.

Keywords: Literature review; Meta-analysis; Moderator analysis; Systematic review.

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Data Interpretation, Statistical
  • Meta-Analysis as Topic*
  • Publication Bias
  • Review Literature as Topic

The Oxford Handbook of Research Strategies for Clinical Psychology


17 Meta-analysis in Clinical Psychology Research

Andy P. Field, School of Psychology, University of Sussex

  • Published: 01 August 2013

Meta-analysis is now the method of choice for assimilating research investigating the same question. This chapter is a nontechnical overview of the process of conducting meta-analysis in the context of clinical psychology. We begin with an overview of what meta-analysis aims to achieve. The process of conducting a meta-analysis is then described in six stages: (1) how to do a literature search; (2) how to decide which studies to include in the analysis (inclusion criteria); (3) how to calculate effect sizes for each study; (4) how to run a basic meta-analysis using the metafor package for the free software R; (5) how to look for publication bias and moderators of effect sizes; and (6) how to write up the results for publication.

Introduction

Meta-analysis has become an increasingly popular research methodology, with an exponential increase in published papers seen across both the social sciences and science in general. Field (2009) reports data showing that up until 1990 there were very few studies published on the topic of meta-analysis, but that after this date the use of the tool increased meteorically. This trend has occurred in clinical psychology too. Figure 17.1 shows the number of articles with “meta-analysis” in the title published within the domain of “clinical psychology” since the term “meta-analysis” came into common usage. The data show a clear increase in publications after 1990, and a staggering acceleration in the number of published meta-analyses in this area in the past 3 to 5 years. Meta-analysis has been used to draw conclusions about the causes (Bar-Haim, Lamy, Pergamin, Bakermans-Kranenburg, & van Ijzendoorn, 2007; Brewin, Kleiner, Vasterling, & Field, 2007; Burt, 2009; Chan, Xu, Heinrichs, Yu, & Wang, 2010; Kashdan, 2007; Ruocco, 2005), diagnosis (Bloch, Landeros-Weisenberger, Rosario, Pittenger, & Leckman, 2008; Cuijpers, Li, Hofmann, & Andersson, 2010), and preferred treatments (Barbato & D'Avanzo, 2008; Bradley, Greene, Russ, Dutra, & Westen, 2005; Cartwright-Hatton, Roberts, Chitsabesan, Fothergill, & Harrington, 2004; Covin, Ouimet, Seeds, & Dozois, 2008; Hendriks, Voshaar, Keijsers, Hoogduin, & van Balkom, 2008; Kleinstaeuber, Witthoeft, & Hiller, 2011; Malouff, Thorsteinsson, Rooke, Bhullar, & Schutte, 2008; Parsons & Rizzo, 2008; Roberts, Kitchiner, Kenardy, Bisson, & Psych, 2009; Rosa-Alcazar, Sanchez-Meca, Gomez-Conesa, & Marin-Martinez, 2008; Singh, Singh, Kar, & Chan, 2010; Spreckley & Boyd, 2009; Stewart & Chambless, 2009; Villeneuve, Potvin, Lesage, & Nicole, 2010) of a variety of mental health problems. This illustrative selection of articles shows that meta-analysis has been used to determine the efficacy of behavioral, cognitive, couple-based, cognitive-behavioral (CBT), virtual reality, and psychopharmacological interventions, for problems as diverse as schizophrenia, anxiety disorders, depression, chronic fatigue, personality disorders, and autism. There is little in the world of clinical psychology that has not been subjected to meta-analysis.

Figure 17.1. The number of studies using meta-analysis in clinical psychology.

This chapter provides a practical introduction to meta-analysis. For mathematical details, see other sources (e.g., H. M. Cooper, 2010 ; Field, 2001 , 2005a , 2009 ; Field & Gillett, 2010 ; Hedges & Olkin, 1985 ; Hedges & Vevea, 1998 ; Hunter & Schmidt, 2004 ; Overton, 1998 ; Rosenthal & DiMatteo, 2001 ; Schulze, 2004 ). This chapter overviews the important issues when conducting meta-analysis and shows an example of how to conduct a meta-analysis using the free software R ( R Development Core Team, 2010 ).

What Is Meta-analysis?

Clinical psychologists are typically interested in reaching conclusions that can be applied generally. These questions might include whether CBT is efficacious as a treatment for obsessive-compulsive disorder (Rosa-Alcazar et al., 2008), whether antidepressant medication treats the negative symptoms of schizophrenia (Singh et al., 2010), whether virtual reality can be effective in treating specific phobias (Parsons & Rizzo, 2008), whether school-based prevention programs reduce anxiety and/or depression in youth (Mychailyszyn, Brodman, Read, & Kendall, 2011), whether there are memory deficits for emotional information in posttraumatic stress disorder (PTSD; Brewin et al., 2007), what the magnitude of association between exposure to disasters and youth PTSD is (Furr, Corner, Edmunds, & Kendall, 2010), or what the magnitude of threat-related attentional biases in anxious individuals is (Bar-Haim et al., 2007). Although answers to these questions may be attainable in a single study, single studies have two limitations: (1) they are at the mercy of their sample size, because estimates of effects from small samples will be more biased than those from large-sample studies; and (2) replication is an important means to deal with the problems created by measurement error in research (Fisher, 1935). Meta-analysis pools the results from similar studies in the hope of generating more accurate estimates of the true effect in the population. A meta-analysis can tell us:

The mean and variance of underlying population effects— for example, the effects in the population of conducting CBT with depressed adolescents compared to waitlist controls. You can also compute confidence intervals for the population effects.

Variability in effects across studies. It is possible to estimate the variability between effect sizes across studies (the homogeneity of effect sizes). There is accumulating evidence that effect sizes should be heterogeneous across studies (see, e.g., National Research Council, 1992 ). Therefore, variability statistics should be reported routinely. (You will often see significance tests reported for these estimates of variability; however, these tests typically have low power and are probably best ignored.)

Moderator variables. If there is variability in effect sizes, and in most cases there is ( Field, 2005a ), this variability can be explored in terms of moderator variables ( Field, 2003b ; Overton, 1998 ). For example, we might find that attentional biases to threat in anxious individuals are stronger when picture stimuli are used to measure these biases than when words are used.

A Bit of History

More than 70 years ago, Fisher and Pearson discussed ways to combine studies to find an overall probability (Fisher, 1938; Pearson, 1938), and over 60 years ago, Stouffer presented a method for combining effect sizes (Stouffer, 1949). The roots of meta-analysis are buried deep within the psychological and statistical earth. However, clinical psychology has some claim over the popularization of the method: in 1977, Smith and Glass published an influential paper in which they combined effects from 375 studies that had looked at the effects of psychotherapy (Smith & Glass, 1977). They concluded that psychotherapy was effective, and that the type of psychotherapy did not matter. A year earlier, Glass (1976) published a paper in which he coined the term “meta-analysis” (if this wasn't the first usage of the term, then it was certainly one of the first) and summarized the basic principles. Shortly after these two seminal papers, Rosenthal published an influential theoretical paper on meta-analysis, and a meta-analysis combining 345 studies to show that interpersonal expectancies affected behavior (Rosenthal, 1978; Rosenthal & Rubin, 1978). It is probably fair to say that these papers put “meta-analysis” in the spotlight of psychology. However, it was not until the 1980s and early 1990s that three books by Rosenthal (1984, 1991), Hedges and Olkin (1985), and Hunter and Schmidt (1990) provided the first detailed and accessible accounts of how to conduct a meta-analysis. Given a few years for researchers to assimilate these works, it is no surprise that the use and discussion of meta-analysis accelerated after 1990 (see Fig. 17.1). The even more dramatic acceleration in the number of published meta-analyses in the past 5 years is almost certainly due to the widespread availability of computer software packages that make the job of meta-analysis easier than before.

Computer Software for Doing Meta-analysis

An overview of the options.

There are several standalone packages for conducting meta-analyses: for example, the Cochrane Collaboration's Review Manager (RevMan) software ( The Cochrane Collaboration, 2008 ). There is also a package called Comprehensive Meta-Analysis ( Borenstein, Hedges, Higgins, & Rothstein, 2005 ). There are two add-ins for Microsoft Excel: Mix ( Bax, Yu, Ikeda, Tsuruta, & Moons, 2006 ) and MetaEasy ( Kontopantelis & Reeves, 2009 ). These packages implement many different meta-analysis methods, convert effect sizes, and create plots of study effects. Although it is not 100 percent clear from their website, Comprehensive Meta-Analysis appears to be available only for Windows, Mix works only with Excel 2007 and 2010 in Windows, and MetaEasy works with Excel 2007 (again Windows). RevMan uses Java and so is available for Windows, Linux, and MacOS operating systems. Although RevMan and MetaEasy are free and Mix comes in a free light version, Comprehensive Meta-Analysis and the pro version of Mix are commercial products.

SPSS (a commercial statistics package commonly used by clinical psychologists) does not incorporate a menu-driven option for conducting meta-analysis, but it is possible to use its syntax to run a meta-analysis. Field and Gillett (2010) provide a tutorial on meta-analysis and also include syntax files and examples showing how to run a meta-analysis using SPSS. Other SPSS syntax files can be obtained from Lavesque (2001) and Wilson (2004) .

Meta-analysis can also be conducted with R ( R Development Core Team, 2010 ), a freely available package for conducting a staggering array of statistical procedures. R is free, open-source software available for Windows, MacOS, and Linux that is growing in popularity in the psychology community. Scripts for running a variety of meta-analysis procedures on d are available in the meta package that can be installed into R ( Schwarzer, 2005 ). However, my favorite package for conducting meta-analysis in R is metafor ( Viechtbauer, 2010 ) because it has functions to compute effect sizes from raw data, can work with a wide array of different effect sizes ( d, r , odds ratios, relative risks, risk differences, proportions, and incidence rates), produces publication-standard graphics, and implements moderator analysis and fixed- and random-effects methods (more on this later). It is a brilliant package, and given that it can be used for free across Windows, Linux, and MacOS, I have based this chapter on using this package within R .

Getting Started with R

R ( R Development Core Team, 2010 ) is an environment/language for statistical analysis and is the fastest-growing statistics software. R is a command language: we type in commands that we then execute to see the results. Clinical psychologists are likely to be familiar with the point-and-click graphical user interfaces (GUIs) of packages like SPSS and Excel, and so at first R might appear bewildering. However, I will walk through the process step by step assuming that the reader has no knowledge of R . I cannot, obviously, explain everything there is to know about R , and readers are advised to become familiar with the software by reading one of the many good introductory books (e.g., Crawley, 2007 ; Quick, 2010 ; Verzani, 2004 ; Zuur, Ieno, & Meesters, 2009 ), the best of which, in my entirely objective opinion, is Field, Miles, and Field (2012) .

Once you have installed R on your computer and opened the software, you will see the console window, which contains a prompt at which you type commands. Once a command has been typed, you press the return key to execute it. You can also write commands in a script window and execute them from there, which is my preference—see Chapter 3 of Field and colleagues (2012) . R comes with a basic set of functionality, which can be expanded by installing packages stored at a central online location that has mirrors around the globe. To install the metafor package you need to execute this command:

install.packages("metafor")

This command installs the package (you need to be connected to the Internet for this command to work). You need to install the package only once (although whenever you update R you will have to reinstall it), but you need to load the package every time that you want to use it. You do this by executing this command:

library(metafor)

The library() function tells R that you want to use a package (in this case metafor ) in the current session. If you close the program and restart it, then you would need to re-execute the library command to use the metafor package.

The Six Basic Steps of Meta-analysis: An Example

Broadly speaking, there are six sequential steps to conducting a quality meta-analysis: (1) Do a literature search; (2) Decide on inclusion criteria; (3) Calculate the effect sizes; (4) Do the basic meta-analysis; (5) Do some more advanced analysis; and (6) Write it up. In this chapter, to illustrate these six steps, we will use a real dataset from a meta-analysis in which I was involved ( Hanrahan, Field, Jones, & Davey, 2013 ). This meta-analysis looked at the efficacy of cognitive-based treatments for worry in generalized anxiety disorder (GAD), and the part of the analysis that we will use here simply aimed to estimate the efficacy of treatment postintervention and to see whether the type of control group used moderated the effects obtained. This meta-analysis is representative of clinical research in that relatively few studies had addressed this question (it is a small analysis) and sample sizes within each study were relatively small. These data are used as our main example, and the most benefit can be gained from reading the original meta-analysis in conjunction with this chapter. We will now look at each stage of the process of doing a meta-analysis.

Step 1: Do a Literature Search

The first step is to search the literature for studies that have addressed the same research question, using electronic databases such as the ISI Web of Knowledge, PubMed, and PsycInfo. Although the obvious reason for doing this is to find articles, it is also helpful in identifying authors who might have unpublished data (see below). It is often useful to hand-search the reference sections of the articles that you have found to check for articles that you have missed, and to consult directly with noted experts in this literature to ensure that relevant papers have not been missed.

Although it is tempting to assume that meta-analysis is a wonderfully objective tool, it is not without a dash of bias. Possibly the main source of bias is the “file-drawer problem,” or publication bias ( Rosenthal, 1979 ). This bias stems from the reality that significant findings are more likely to be published than nonsignificant findings: significant findings are estimated to be eight times more likely to be submitted than nonsignificant ones ( Greenwald, 1975 ), studies with positive findings are around seven times more likely to be published than studies with results supporting the null hypothesis ( Coursol & Wagner, 1986 ), and 97 percent of articles in psychology journals report significant results ( Sterling, 1959 ). Without rigorous attempts to counteract publication bias, meta-analytic reviews could overestimate population effects because effect sizes in unpublished studies will be smaller ( McLeod & Weisz, 2004 )—up to half the size ( Shadish, 1992 )—of published studies of comparable methodological quality. The best way to minimize the bias is to extend your search to relevant conference proceedings and to contact experts in the field to see if they have or know of any unpublished data. This can be done by direct email to authors in the field, but also by posting a message to a topic-specific newsgroup or email listserv.

In our study, we gathered articles by searching PsycInfo, Web of Science, and Medline for English-language studies using keywords considered relevant. The reference lists of previous meta-analyses and retrieved articles were scanned for relevant studies. Finally, email addresses of published researchers were compiled from retrieved papers, and 52 researchers were emailed and invited to send any unpublished data fitting the inclusion criteria. This search strategy highlights the use of varied resources to ensure that all potentially relevant studies are included and to reduce bias due to the file-drawer problem.

Step 2: Decide on Inclusion Criteria

A second source of bias in a meta-analysis is the inclusion of poorly conducted research. As Field and Gillett (2010) put it:

Although meta-analysis might seem to solve the problem of variance in study quality because these differences will “come out in the wash,” even one red sock (bad study) amongst the white clothes (good studies) can ruin the laundry. (pp. 667–668)

Inclusion criteria depend on the research question being addressed and any specific methodological issues in the field, but the guiding principle is that you want to compare apples with apples, not apples with pears (Eysenck, 1978). In a meta-analysis of CBT, for example, you might decide on a working definition of what constitutes CBT, and maybe exclude studies that do not have proper control groups. It is important to use a precise, reliable set of criteria that is applied consistently to each potential study so as not to introduce subjective bias into the analysis. In your write-up, you should be explicit about your inclusion criteria and report the number of studies that were excluded at each step of the selection process.

In our analysis, at a first pass we excluded studies based on the following criteria: (a) treatments were considered too distinct to be meaningfully compared to face-to-face therapies (e.g., bibliotherapy, telephone, or computer-administered treatment); (b) subsamples of the data were already included in the meta-analysis because they were published over several papers; and (c) information was insufficient to enable us to compute effect sizes. Within this pool of studies, we set the following inclusion criteria:

Studies that included only those participants who met criteria for a diagnosis of GAD outlined by the DSM since GAD was recognized as an independent disorder; that is, the DSM-III-R, DSM-IV, or DSM-IV-TR (prior to DSM-III-R, GAD was simply a poorly characterized residual diagnostic category). This was to avoid samples being heterogeneous.

Studies in which the majority of participants were aged 18 to 65 years. This was because there may be developmental issues that affect the efficacy of therapy in younger samples.

The Penn State Worry Questionnaire (PSWQ) was used to capture symptom change.

Treatments included were defined as any treatment that used cognitive techniques, either in combination with, or without, behavioral techniques.

To ensure that the highest possible quality of data was included, only studies that used a randomized controlled design were included.

Step 3: Calculate the Effect Sizes

What are effect sizes and how do I calculate them?

Your selected studies are likely to have used different outcome measures, and of course we cannot directly compare raw change on a children's self-report inventory to that being measured using a diagnostic tool such as the Anxiety Disorders Interview Schedule (ADIS). Therefore, we need to standardize the effects within each study so that they can be combined and compared. To do this we convert each effect in each study into one of many standard effect size measures. When quantifying group differences on a continuous measure (such as the PSWQ) people tend to favor Cohen's d ; Pearson's r is used more when looking at associations between measures; and if recovery rates are the primary interest, then it is common to see odds ratios used as the effect size measure.

Once an effect size measure is chosen, you need to compute it for each effect that you want to compare for every paper you want to include in the meta-analysis. A given paper may contain several effect sizes depending on the sorts of questions you are trying to address with your meta-analysis. For example, in a meta-analysis on cognitive impairment in PTSD in which I was involved ( Brewin et al., 2007 ), impairment was measured in a variety of ways in individual studies, and so we had to compute several effect sizes within many of the studies. In this situation, we have to make a decision about how to treat the studies that have produced multiple effect sizes that address the same question. A common solution is to calculate the average effect size across all measures of the same outcome within a study ( Rosenthal, 1991 ), so that every study contributes only one effect to the main analysis (as in Brewin et al., 2007 ).

Computing effect sizes is probably the hardest part of a meta-analysis because the data contained within published articles will vary in their detail and specificity. Some articles will report effect sizes, but many will not; articles might use different effect size metrics; you will feel as though some studies have a grudge against you and are trying to make it as hard for you as possible to extract an effect size. If no effect sizes are reported, then you need to try to use the reported data to calculate one. If using d , then you can use means and standard deviations, odds ratios are easily obtained from frequency data, and most effect size measures (including r ) can be obtained from test statistics such as t , z , χ 2 , and F , or probability values for effects (by converting first to z ). A full description of the various ways in which effect sizes can be computed is beyond the present scope, but there are many freely available means to compute effect sizes from raw data and test statistics; some examples are Wilson (2001 , 2004 ) and DeCoster (1998) . To do the meta-analysis you need not just the effect size, but the corresponding value of its sampling variance ( v ) or standard error ( se ); Wilson (2001) , for example, will give you an estimate of the effect size and the sampling variance.
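If you have the relevant summary statistics, metafor itself can do much of this work through its escalc() function, which returns both an effect size and its sampling variance. The snippet below is only a sketch: the means, standard deviations, and sample sizes are made up for illustration, and it assumes the metafor package has already been installed and loaded (see the earlier section on R). With measure = "SMD" the function returns a standardized mean difference with the small-sample bias correction applied.

# Illustrative (made-up) summary statistics for one study: treatment vs. control
es <- escalc(measure = "SMD",
             m1i = 52.1, sd1i = 10.3, n1i = 30,   # treatment group mean, SD, and n
             m2i = 60.4, sd2i = 11.0, n2i = 28)   # control group mean, SD, and n
es           # yi is the effect size, vi is its sampling variance
sqrt(es$vi)  # the corresponding standard error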

If a paper does not include sufficient data to calculate an effect size, contact the authors for the raw data, or relevant statistics from which an effect size can be computed. (If you are on the receiving end of such an email please be sympathetic, as attempts to get data out of researchers can be like climbing up a jelly mountain.)

Effect Sizes for Hanrahan and Colleagues’ Study

When reporting a meta-analysis it is a good idea to tabulate the effect sizes with other helpful information (such as the sample size on which the effect size is based, N) and also to present a stem-and-leaf plot of the effect sizes. For the study conducted by Hanrahan and colleagues, we used d as the effect size measure and corrected for the known bias that d has in small samples using the adjustment described by Hedges (1981). A stem-and-leaf plot graphically organizes the included effect sizes so that the shape and central tendency of their distribution across studies can be seen. Table 17.1 shows a stem-and-leaf plot of the resulting effect sizes, and this should be included in the write-up. This stem-and-leaf plot tells us the effect sizes to one decimal place, with the stem reflecting the value before the decimal point and the leaf showing the first decimal place; for example, we know the smallest effect size was d = −0.2, the largest was d = 3.2, and there were effect sizes of 1.2 and 1.4 (for example). Table 17.2 shows the studies included in the Hanrahan and colleagues' paper, with their corresponding effect sizes (expressed as d), the sample sizes on which these ds are based, and the standard errors associated with each effect size. Note that the ds match those reported in Table 2 of Hanrahan and colleagues (2013).
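For readers who want a sense of what that small-sample adjustment does, a commonly used approximation to Hedges' (1981) correction (the exact version uses gamma functions; this algebraic form is an approximation) multiplies the uncorrected d by a factor J:

J = 1 − 3 / (4(n1 + n2 − 2) − 1), and the corrected effect size is g = J × d.

For example, with n1 = n2 = 20 per group, J = 1 − 3/151 ≈ 0.98, so the corrected effect size is about 2 percent smaller than the uncorrected one; the correction matters most when samples are very small.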

Step 4: Do the Basic Meta-Analysis

Initial considerations.

Meta-analysis aims to estimate the effect in the population (and a confidence interval around it) by combining the effect sizes from different studies using a weighted mean of the effect sizes. The “weight” that is used is usually a value reflecting the sampling precision of the effect size, which is typically a function of sample size. As such, effect sizes with better precision are weighted more highly than effect sizes that are imprecise. There are different methods for estimating the population effects, and these methods have pros and cons. There are two related issues to consider: (1) which method to use and (2) how to conceptualize your data. There are other issues, too, but we will focus on these two because there are articles elsewhere that can be consulted as a next step (e.g., Field, 2001 , 2003a , 2003b , 2005a , 2005b ; Hall & Brannick, 2002 ; Hunter & Schmidt, 2004 ; Rosenthal & DiMatteo, 2001 ; Schulze, 2004 ).
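To make the idea of a weighted mean concrete, in the simplest (fixed-effect, inverse-variance) formulation each study's effect size Ti is weighted by wi = 1/vi, the reciprocal of its sampling variance, and the pooled estimate is

weighted mean = Σ(wi × Ti) / Σwi, with standard error √(1 / Σwi).

Random-effects methods follow the same logic but add an estimate of the between-study variance, τ², to each vi before computing the weights, so that wi = 1/(vi + τ²).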

Choosing a Model

It is tempting simply to tell you to use a random-effects model and end the discussion; however, in the interests of informed decision making I will explain why. Meta-analysis can be conceptualized in two ways: fixed- and random-effects models (Hedges, 1992; Hedges & Vevea, 1998; Hunter & Schmidt, 2000). We can assume that studies in a meta-analysis are sampled from a population in which the average effect size is fixed (Hunter & Schmidt, 2000). Consequently, sample effect sizes should be homogeneous. This is the fixed-effects model. The alternative assumption is that the average effect size in the population varies randomly from study to study: population effect sizes can be thought of as being sampled from a “superpopulation” (Hedges, 1992). In this case, because effect sizes come from populations with varying average effect sizes, they should be heterogeneous. This is the random-effects model. Essentially, the researcher using a random-effects model assumes that the studies included represent a mere random sampling of the larger population of studies that could have been conducted on the topic, whereas the researcher using a fixed-effects model assumes that the studies included are the comprehensive set of representative studies. Therefore, the fixed-effects model can be thought to characterize the scope of existing research, and the random-effects model can be thought to afford inferences about a broader population than just the sample of studies analyzed. When effect size variability is explained by a moderator variable that is treated as “fixed,” then the random-effects model becomes a mixed-effects model (see Overton, 1998).

Statistically speaking, fixed- and random-effects models differ in the sources of error. Fixed-effects models have error derived from sampling studies from a population of studies. Random-effects models have this error too, but in addition there is error created by sampling the populations from a superpopulation.

The two most widely used methods of meta-analysis are those by Hunter and Schmidt (2004) , which is a random-effects method, and the method by Hedges and colleagues (e.g., Hedges, 1992 ; Hedges & Olkin, 1985 ; Hedges & Vevea, 1998 ), who provide both fixed- and random-effects methods. However, multilevel models can also be used in the context of meta-analysis (see Hox, 2002 , Chapter 8).

Your first decision is whether to conceptualize your model as fixed- or random-effects. You might consider the assumptions that can be realistically made about the populations from which your studies are sampled. There is compelling evidence that real-world data in the social sciences are likely to have variable population parameters ( Field, 2003b ; Hunter & Schmidt, 2000 , 2004 ; National Research Council, 1992 ; Osburn & Callender, 1992 ). Field (2005a) found that the standard deviations of effect sizes for all meta-analytic studies (using r ) published in Psychological Bulletin from 1997 to 2002 ranged from 0 to 0.3 and were most frequently in the region of 0.10 to 0.16; similarly, Barrick and Mount (1991) reported that the standard deviation of effect sizes ( r s) in published datasets was around 0.16.

Second, consider the inferences that you wish to make ( Hedges & Vevea, 1998 ): if you want to make inferences that extend only to the studies included in the meta-analysis ( conditional inferences ), then fixed-effect models are appropriate; however, for inferences that generalize beyond the studies in the meta-analysis ( unconditional inferences ), a random-effects model is appropriate.

Third, consider the consequences of making the “wrong” choice. The consequences of applying fixed-effects methods to random-effects data can be quite dramatic: (1) it inflates the Type I error rate of the significance test of the estimate of the population effect from the nominal 5 percent to 11 to 28 percent (Hunter & Schmidt, 2000) or even 43 to 80 percent (Field, 2003b), and (2) published fixed-effects confidence intervals around mean effect sizes have been shown to be, on average, 52 percent narrower than they should be: these nominal 95 percent fixed-effects confidence intervals were on average only 56 percent confidence intervals (Schmidt, Oh, & Hayes, 2009). The consequences of applying random-effects methods to fixed-effects data are considerably less dramatic: in Hedges' method, for example, when sample effect sizes are homogeneous, the additional between-study effect size variance becomes zero, yielding the same result as the fixed-effects method.

This leads me neatly back to my opening sentence of this section: unless you can find a good reason not to, use a random-effects method because (1) social science data normally have heterogeneous effect sizes; (2) psychologists generally want to make inferences that extend beyond the studies in the meta-analysis; and (3) if you apply a random-effects method to homogeneous effect sizes, it does not affect the results (certainly not as dramatically as if you apply a fixed-effects model to heterogeneous effect sizes).

Choosing a Method

Let's assume that you trust me (I have an honest face) and opt for a random-effects model. You then need to decide whether to use the Hunter and Schmidt (H-S; 2004) method or the method of Hedges and colleagues (H-V). The technical differences between these methods have been summarized elsewhere (Field, 2005a) and will not be repeated here. In a series of Monte Carlo simulations comparing the performance of the Hunter and Schmidt and Hedges and Vevea (fixed- and random-effects) methods, Field (2001; but see Hafdahl & Williams, 2009) found that when comparing random-effects methods, the Hunter-Schmidt method yielded the most accurate estimates of population correlations across a variety of situations (a view echoed by Hall & Brannick, 2002, in a similar study). Based on a more extensive set of simulations, Field (2005a) concluded that in general both H-V and H-S random-effects methods produce accurate estimates of the population effect size. Although there were subtle differences in the accuracy of population effect size estimates across the two methods, in practical terms the bias in both methods was negligible. In terms of 95 percent confidence intervals around the population estimate, Hedges' method was in general better at achieving these intervals (the intervals for Hunter and Schmidt's method tended to be too narrow, probably because they recommend using credibility intervals and not confidence intervals).

Hunter and Schmidt's method involves psychometric corrections for the attenuation of observed effect sizes that can be caused by measurement error ( Hunter, Schmidt, & Le, 2006 ), and these psychometric corrections can be incorporated into the H-V method if correlations are used as the effect size, but these corrections were not explored in the studies mentioned above, which limits what they can tell us. Therefore, diligent researchers might consult the various tables in Field (2005a) to assess which method might be most accurate for the given parameters of the meta-analysis that they are about to conduct; however, the small differences between the methods will probably not make a substantive impact on the conclusions that will be drawn from the analysis.

Entering the Data into R

Having computed the effect sizes, we need to enter these into R . In R , commands follow the basic structure of:

Object <- Instructions about how to create the object

Therefore, to create an object that is a variable, we give it a name on the left-hand side of the arrow, and on the right-hand side input the data that makes up the variable. To input data we use the c() function, which simply binds things together into a single object (in this case it binds the different values of d into a single object or variable). To enter the value of d from Table 17.2 , we would execute:

d <- c(1.42, 0.68, -0.17, 2.57, 0.82, 0.13, 0.45, 0.31, 2.44, 3.22, -0.08, 2.25, 0.89, 1.23, 0.27, 2.22, 0.31, 0.83, 0.26)

Executing this command creates an object that we have named “d” (we could have named it “Thelma” if we wanted to, but “d” seems like a fairly descriptive name in the circumstances). If we want to view this variable, we simply execute its name:

> d
[1] 1.42 0.68 -0.17 2.57 0.82 0.13 0.45 0.31 2.44 3.22 -0.08 2.25 0.89 1.23 0.27 2.22 0.31 0.83 0.26

We can enter the standard errors from Table 17.2 in a similar way; this time we create an object that we have decided to call “sed”:

sed <- c(0.278, 0.186, 0.218, 0.590, 0.313, 0.293, 0.230, 0.263, 0.468, 0.609, 0.394, 0.587, 0.284, 0.299, 0.291, 0.490, 0.343, 0.336, 0.323)

Next I'm going to create a variable that gives each effect size a label of the first author of the study from which the effect came and the year. We can do this by executing (note that to enter text strings instead of numbers, we place the text in quotes so that R knows the data are text strings):

study <- c("v.d. Heiden (2010)", "v.d. Heiden (2010)", "Newman (2010)", "Wells (2010)", "Dugas (2010)", "Dugas (2010)", "Westra (2009)", "Leichsenring (2009)", "Roemer (2008)", "Rezvan (2008)", "Rezvan (2008)", "Zinbarg (2007)", "Gosselin (2006)", "Dugas (2003)", "Borkovec (2002)", "Ladouceur (2000)", "Ost (2000)", "Borkovec (1993)", "Borkovec (1993)")

We also have some information about the type of control group used. We'll come back to this later, but if we want to record this information, we can do so using a coding variable. We need to enter values that represent the different types of control group, and then to tell R what these values represent. Let's imagine that we want to code non-therapy controls as 0, CT as 1, and non-CT as 2. First we can enter these values into R :

controlType <- c(0, 1, 1, 2, 0, 2, 1, 2, 0, 0, 1, 0, 2, 0, 1, 0, 2, 2, 2)

Next, we need to tell R that this variable is a coding variable (a.k.a. a factor), using the factor() function. Within this function we name the variable that we want to convert (in this case controlType ), we tell R what numerical values we have used to code levels of the factor by specifying levels = 0:2 (0:2 means zero to 2 inclusive, so we are specifying levels of 0, 1, 2), we then tell it what labels to attach to those levels (in the order of the numerical codes) by including labels = c( “Non-Therapy”, “CT”, “Non-CT”) . Therefore, to turn controlType into a factor based on itself, we execute:

controlType <- factor(controlType, levels = 0:2, labels = c("Non-Therapy", "CT", "Non-CT"))

We now have four variables containing data: d (the effect sizes), sed (their standard errors), study (a string variable that identifies the study from which the effect came), and controlType (a categorical variable that defines what control group was used for each effect size). We can combine these variables into a data frame by executing:

GAD.data <- data.frame(study, controlType, d, sed)

This creates an object called GAD.data (note that in R you cannot use spaces when you name objects) that is a data frame made up of the four variables that we have just created. To “see” this data frame, execute its name:

> GAD.data
                 study controlType     d   sed
1   v.d. Heiden (2010) Non-Therapy  1.42 0.278
2   v.d. Heiden (2010)          CT  0.68 0.186
3        Newman (2010)          CT -0.17 0.218
4         Wells (2010)      Non-CT  2.57 0.590
5         Dugas (2010) Non-Therapy  0.82 0.313
6         Dugas (2010)      Non-CT  0.13 0.293
7        Westra (2009)          CT  0.45 0.230
8  Leichsenring (2009)      Non-CT  0.31 0.263
9        Roemer (2008) Non-Therapy  2.44 0.468
10       Rezvan (2008) Non-Therapy  3.22 0.609
11       Rezvan (2008)          CT -0.08 0.394
12      Zinbarg (2007) Non-Therapy  2.25 0.587
13     Gosselin (2006)      Non-CT  0.89 0.284
14        Dugas (2003) Non-Therapy  1.23 0.299
15     Borkovec (2002)          CT  0.27 0.291
16    Ladouceur (2000) Non-Therapy  2.22 0.490
17          Ost (2000)      Non-CT  0.31 0.343
18     Borkovec (1993)      Non-CT  0.83 0.336
19     Borkovec (1993)      Non-CT  0.26 0.323

You can also prepare the data as a tab-delimited or comma-separated text file (using Excel, SPSS, Stata, or whatever other software you like) and read this file into R using the read.delim() or read.csv() functions. In both cases, you can use the file.choose() function to open a standard dialogue box that lets you choose the file by navigating your file system.¹ For example, to create a data frame called “GAD.data” from a tab-delimited file (.dat), you would execute:

GAD.data <- read.delim(file.choose(), header = TRUE)

Similarly, to create a data frame from a comma-separated text file, you would execute:

GAD.data <- read.csv(file.choose(), header = TRUE)

In both cases the header = TRUE option tells R that the first row of the data file contains variable names. For read.delim() and read.csv() this is actually the default, so the option can be omitted; if your file does not have variable names in the first row, set header = FALSE instead. If this data-entry section has confused you, then read Chapter 3 of Field and colleagues (2012) or your preferred introductory book on R.

Doing the Meta-analysis

To do a basic meta-analysis you use the rma() function. This function has the following general format when using the standard error of effect sizes:

maModel <- rma(yi = variable containing effect sizes, sei = variable containing standard error of effect sizes, data = dataFrame, method = "DL")

maModel is whatever name you want to give your model (but remember you can't use spaces), variable containing effect sizes is replaced with the name of the variable containing your effect sizes, variable containing standard error of effect sizes is replaced with the name of the variable that contains the standard errors, and dataFrame is the name of the data frame containing these variables.

When using the variance of effect sizes we substitute the sei option with vi :

maModel <- rma(yi = variable containing effect sizes, vi = variable containing variance of effect sizes, data = dataFrame, method = "DL")

Therefore, for our GAD analysis, we can execute:

maGAD <- rma(yi = d, sei = sed, data = GAD.data, method = "DL")

This creates an object called maGAD by using the rma() function. Within this function, we have told R that we want to use the object GAD.data as our data frame, and that within this data frame, the variable d contains the effect sizes ( yi = d ) and the variable sed contains the standard errors ( sei = sed ). Finally, we have set the method to be “DL,” which will use the DerSimonian-Laird estimator (which is used in the H-V random-effects method). We can change how the model is estimated by changing this option, which can be set to the following:

method = "FE": produces a fixed-effects meta-analysis

method = "HS": random effects using the Hunter-Schmidt estimator

method = "HE": random effects using the Hedges estimator

method = "DL": random effects using the DerSimonian-Laird estimator

method = "SJ": random effects using the Sidik-Jonkman estimator

method = "ML": random effects using the maximum-likelihood estimator

method = "REML": random effects using the restricted maximum-likelihood estimator (this is the default if you don't specify a method at all)

method = "EB": random effects using the empirical Bayes estimator.
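For example, to re-fit the same model using the Hunter-Schmidt estimator rather than DerSimonian-Laird, you would change only the method option (the object name maGAD.HS below is just an illustrative choice):

maGAD.HS <- rma(yi = d, sei = sed, data = GAD.data, method = "HS")

summary(maGAD.HS)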

To see the results of the analysis we need to use the summary() function and put the name of the model within it:

summary(maGAD)

The resulting output can be seen in Figure 17.2. This output is fairly self-explanatory.² For example, we can see that for Hedges and Vevea's method, the Q statistic, which measures heterogeneity in effect sizes, is highly significant, χ²(18) = 100.50, p < .001. The estimate of between-study variability was τ² = 0.44 (most important, this is not zero), and the proportion of variability due to heterogeneity, I², was 82.09 percent. In other words, there was a lot of variability in study effects. The population effect size and its 95 percent confidence interval are: 0.93 (CI.95 = 0.59 (lower), 1.27 (upper)). We can also see that this population effect size is significant, z = 5.40, p < .001.
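As a quick arithmetic check on that output, with the DerSimonian-Laird estimator I² can be recovered from the Q statistic as the proportion of Q that exceeds its degrees of freedom: I² = 100 × (Q − df)/Q = 100 × (100.50 − 18)/100.50 ≈ 82.09 percent, which reproduces the value reported by R.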

Figure 17.2 R output for a basic meta-analysis.

Based on the homogeneity estimates and tests, we could say that there was considerable variation in effect sizes overall. Also, based on the estimate of population effect size and its confidence interval, we could conclude that there was a strong effect of CT for GAD compared to controls.

Creating a forest plot of the studies and their effect sizes is very easy after having created the meta-analysis model. We simply place the name of the model within the forest() command and execute:

forest(maGAD)

However, I want to add the study labels to the plot, so let's execute:

forest(maGAD, slab = GAD.data$study)

By adding slab = GAD.data$study to the command we introduce study labels (that's what slab stands for) and the labels we use are in the variable called study within the GAD.data data frame (that's what GAD.data$study means). The resulting figure is in Figure 17.3 . It shows each study with a square indicating the effect size from that study (the size of the square is proportional to the weight used in the meta-analysis, so we can see that the first three studies were weighted fairly heavily). The branches of each effect size represent the confidence interval of the effect size. Also note that because we added the slab option, our effects have been annotated using the names in the variable called study in our data frame. Looking at this plot, we can see that there are five studies that produced fairly substantially bigger effects than the rest, and two studies with effect sizes below zero (the dotted line), which therefore showed that CBT was worse than controls. The diamond at the bottom shows the population effect size based on these individual studies (it is the value of the population effect size from our analysis). The forest plot is a very useful way to summarize the studies in the meta-analysis.

Step 5: Do Some More Advanced Analysis

Estimating publication bias.

Various techniques have been developed to estimate the effect of publication bias and to correct for it. The earliest and most commonly reported estimate of publication bias is Rosenthal's (1979) fail-safe N . This was an elegant and easily understood method for estimating the number of unpublished studies that would need to exist to turn a significant population effect size estimate into a nonsignificant one. However, because significance testing the estimate of the population effect size is not really the reason for doing a meta-analysis, the fail-safe N is fairly limited.
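If you do want to report a fail-safe N, metafor implements it in its fsn() function, which by default uses Rosenthal's method; for the GAD data used in this chapter, a minimal call would be:

fsn(yi = d, sei = sed, data = GAD.data)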

The funnel plot (Light & Pillemer, 1984) is a simple and effective graphical technique for exploring potential publication bias. A funnel plot displays effect sizes plotted against the sample size, standard error, conditional variance, or some other measure of the precision of the estimate. An unbiased sample would ideally show a cloud of data points that is symmetrical around the population effect size and has the shape of a funnel. This funnel shape reflects the greater variability in effect sizes from studies with small sample sizes/less precision, and the estimates drawn from larger/more precise studies converging around the population effect size. A sample with publication bias will lack symmetry because studies based on small samples that showed small effects will be less likely to be published than studies based on the same-sized samples that showed larger effects (Macaskill, Walter, & Irwig, 2001).

Figure 17.3 Forest plot of the GAD data.

Funnel plots should be used as a first step before further analysis because factors other than publication bias can cause asymmetry. Some examples are data irregularities including fraud and poor study design ( Egger, Smith, Schneider, & Minder, 1997 ), true heterogeneity of effect sizes (in intervention studies this can happen because the intervention is more intensely delivered in smaller, more personalized studies), and English-language bias (studies with smaller effects are often found in non–English-language journals and get overlooked in the literature search).

To get a funnel plot for a meta-analysis model created in R , we simply place that model into the funnel() function and execute:

funnel(maGAD)

Figure 17.4 shows the resulting funnel plot, which is clearly not symmetrical. The studies with large standard errors (bottom right) consistently produce the largest effect sizes, and the studies are not evenly distributed around the mean effect size (or within the unshaded triangle). This graph shows clear publication bias.
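Visual inspection can also be supplemented with a formal test of funnel plot asymmetry. One option, Egger's regression test (Egger et al., 1997), is implemented in metafor's regtest() function and can be applied directly to the model object we created earlier:

regtest(maGAD)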

Funnel plots offer no means to correct for any bias detected. Trim and fill (Duval & Tweedie, 2000) is a method in which a biased funnel plot is truncated (“trimmed”) and the number (k) of missing studies from the truncated part is estimated. Next, k artificial studies are added (“filled”) to the negative side of the funnel plot (and therefore have small effect sizes) so that in effect the analysis now contains k new “studies” with effect sizes as small in magnitude as the k largest effect sizes that were trimmed. The new “filled” effects are presumed to represent the magnitude of effects identified in hypothetical unpublished studies. A new estimate of the population effect size is then calculated including these artificially small effect sizes. Vevea and Woods (2005) point out that this method can lead to overcorrection because it relies on the strict assumption that all of the “missing” studies are those with the smallest effect sizes. Vevea and Woods propose a more sophisticated correction method based on weight function models of publication bias. These methods use weights to model the process through which the likelihood of a study being published varies (usually based on a criterion such as the significance of a study). Their method can be applied to even small meta-analyses and is relatively flexible in allowing meta-analysts to specify the likely conditions of publication bias in their particular research scenario. (The downside of this flexibility is that it can be hard to know what the precise conditions are.) They specify four typical weight functions: “moderate one-tailed selection,” “severe one-tailed selection,” “moderate two-tailed selection,” and “severe two-tailed selection”; however, they recommend adapting the weight functions based on what the funnel plot reveals (see Vevea & Woods, 2005). These corrections can be applied in R (see Field & Gillett, 2010, for a tutorial) but do not form part of the metafor package and are a little too technical for this introductory chapter.
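The trim-and-fill procedure itself, unlike the Vevea and Woods weight-function approach, is available within metafor through the trimfill() function. A minimal sketch applied to our model object (the object name maGAD.tf is just an illustrative choice) would be:

maGAD.tf <- trimfill(maGAD)   # estimate the number of "missing" studies and add them

summary(maGAD.tf)             # population effect re-estimated with the filled-in studies

funnel(maGAD.tf)              # funnel plot that includes the filled-in studies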

Figure 17.4 Funnel plot of the GAD data.

Moderator Analysis

When there is variability in effect sizes, it can be useful to try to explain this variability using theoretically driven predictors of effect sizes. For example, in our dataset there were three different types of control group used: non-therapy (waitlist), non-CT therapy, and CT therapy. We might reasonably expect effects to be stronger if a waitlist control was used in the study compared to a CT control because the waitlist control gets no treatment at all, whereas CT controls get some treatment. We can test for this using a mixed model (i.e., a random-effects model in which we add a fixed effect).

Moderator models assume a general linear model in which each effect size can be predicted from the moderator effect (represented by β1). In its simplest form, the model for the effect size from study i is

d_i = β0 + β1 × moderator_i + e_i

in which the within-study error variance is represented by e_i. To calculate the moderator effect, β1, a generalized least squares (GLS) estimate is calculated. It is not necessary to know the mathematics behind the process (if you are interested, then read Field, 2003b; Overton, 1998); the main thing to understand is that we're just doing a regression in which effect sizes are predicted. Like any form of regression we can, therefore, predict effect sizes from either continuous variables (such as study quality) or categorical ones (which will be dummy coded using contrast weights).

The package metafor allows both continuous and categorical predictors (moderators) to be entered into the regression model that a researcher wishes to test. Moderator variables are added by including the mods option to the basic meta-analysis command. You can enter a single moderator by specifying mods = variableName (in which variableName is the name of the moderator variable that you want to enter into the model) or enter several moderator variables by including mods = matrixName (in which matrixName is the name of a matrix that contains values of moderator variables in each of its columns). Continuous variables are treated as they are; for categorical variables, you should either dummy code them manually or use the factor() function, as we did earlier, in which case R does the dummy coding for you.

Therefore, in our example, we can add the variable controlType as a moderator by rerunning the model including mods = controlType into the command. This variable is categorical, but because we converted it to a factor earlier on, R will treat it as a dummy-coded categorical variable. The rest of the command is identical to before:

modGAD <- rma(yi = d, sei = sed, data = GAD.data, mods = controlType, method = "DL")

summary(modGAD)

The resulting output is shown in Figure 17.5. This output is fairly self-explanatory; for example, we can see that for Hedges and Vevea's method the estimate of between-study variability, τ² = 0.33, is less than it was before (it was 0.44), which means that our moderator variable has explained some of the variance. However, there is still a significant amount left to explain, χ²(17) = 76.35, p < .001.

Figure 17.5 Output from R for moderation analysis.

The Q statistic shows that the amount of variance explained by controlType is highly significant, χ²(1) = 8.93, p = .0028. In other words, it is a significant predictor of effect sizes. The beta parameter for the moderator and its 95 percent confidence interval are: −0.55, CI.95 = −0.92 (lower), −0.19 (upper). We can also see that this parameter is significant, z = −3.11, p = .0028 (note that the p value is identical to that of the Q statistic because, with a single moderator, they test the same thing). In a nutshell, then, the type of control group had a significant impact on the effect that CT had on GAD (measured by the PSWQ). We could break this effect apart by running the main meta-analysis on the three control groups separately.
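One simple way to do that in metafor is to use the subset argument of rma() to restrict the analysis to one type of control group at a time (the object name below is just an illustrative choice):

maGAD.nonTherapy <- rma(yi = d, sei = sed, data = GAD.data, subset = (controlType == "Non-Therapy"), method = "DL")

summary(maGAD.nonTherapy)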

Step 6: Write It Up

There are several detailed guidelines on how to write up a meta-analysis. For clinical trials, the QUOROM and PRISMA guidelines are particularly useful (Moher et al., 1999; Moher, Liberati, Tetzlaff, Altman, & Grp, 2009), and more generally the American Psychological Association (APA) has published its own Meta-Analysis Reporting Standards (MARS; H. Cooper, Maxwell, Stone, Sher, & Board, 2008). In addition, there are individual articles that offer advice (e.g., Field & Gillett, 2010; Rosenthal, 1995). There is a lot of overlap in these guidelines, and Table 17.3 assimilates them in an attempt to give a thorough overview of the structure and content of a meta-analysis article. This table should need no elaboration, but it is worth highlighting some of the key messages:

Introduction : Be clear about the rationale for the meta-analysis: What are the theoretical, practical, or policy drivers of the research? Why is a meta-analysis necessary? What hypotheses do you have?

Methods : Be clear about your search and inclusion criteria. How did you reach the final sample of studies? The PRISMA guidelines suggest including a flowchart of the selection process: Figure 17.6 shows the type of flowchart suggested by PRISMA, which outlines the number of studies retained and eliminated at each phase of the selection process. Also, state your computational and analytic methods in sufficient detail: Which effect size measure are you using (and did you have any issues in computing these)? Which meta-analytic technique did you apply to the data and why? Did you do a subgroup or moderator analysis?

Results : Include a graphical summary of the effect sizes included in the study. A forest plot is a very effective way to show the reader the raw data. When there are too many studies for a forest plot, consider a stem-and-leaf plot. A summary table of studies and any important study characteristics/moderator variables is helpful. If you have carried out a moderator analysis, then you might also provide stem-and-leaf plots or forest plots for subgroups of the analysis. Always report statistics relating to the variability of effect sizes (these should include the actual estimate of variability as well as statistical tests of variability), and obviously the estimate of the population effect and its associated confidence interval (or credibility interval). Report information on publication bias (e.g., a funnel plot) and preferably a sensitivity analysis (e.g., Vevea and Woods' method).

Discussion : Pull out the key theoretical, policy, or practical messages that emerge from the analysis. Discuss any limitations or potential sources of bias within the analysis. Finally, it is helpful to make a clear statement about how the results inform the future research agenda.

Figure 17.6 The PRISMA-recommended flowchart.

This chapter offered a preliminary but comprehensive overview of the main issues when conducting a meta-analysis. We also used some real data to show how the metafor package in R can be used to conduct the analysis. The analysis begins by collecting articles about the research question you are trying to address through a variety of methods: emailing people in the field for unpublished studies, electronic searches, searches of conference abstracts, and so forth. Next, inclusion criteria should be devised that reflect the concerns pertinent to the particular research question (which might include the type of control group used, diagnostic measures, quality of outcome measure, type of treatment used, or other factors that ensure a minimum level of research quality). Statistical details are then extracted from the papers from which effect sizes can be calculated; the same effect size metric should be used for all studies, and you need to compute the variance or standard error for each effect size too. Choose the type of analysis appropriate for your particular situation (fixed- vs. random-effects, Hedges’ method or Hunter and Schmidt's, etc.), and then apply this method to the data. Describe the effect of publication bias descriptively (e.g., funnel plots), and consider investigating how to re-estimate the population effect under various publication-bias models using Vevea and Woods’ (2005) model. Finally, when reporting the results, make sure that the reader has clear information about the distribution of effect sizes (e.g., a stem-and-leaf plot), the effect size variability, the estimate of the population effect and its 95 percent confidence interval, the extent of publication bias, and whether any moderator variables were explored.

Useful Web Links

Comprehensive Meta-Analysis: http://www.meta-analysis.com/

MetaEasy: http://www.statanalysis.co.uk/meta-analysis.html

metafor package for R : http://www.metafor-project.org/

Mix: http://www.meta-analysis-made-easy.com/

PRISMA (guidelines and checklists for reporting meta-analysis): http://www.prisma-statement.org/

R : http://www.r-project.org/

Review Manager: http://ims.cochrane.org/revman

SPSS (materials accompanying Field & Gillett, 2010 ): http://www.discoveringstatistics.com/meta_analysis/how_to_do_a_meta_analysis.html

I generally find it easier to export from SPSS to a tab-delimited file because this format can also be read by software packages other than R. However, you can read SPSS data files (.sav) into R directly using the read.spss() function, but you need to first install and load a package called foreign .

There are slight differences in the decimal places between the results reported here and those on page 125 of Hanrahan and colleagues' paper because we did not round effect sizes and their standard errors to 2 and 3 decimal places respectively before conducting the analysis.

Bar-Haim, Y. , Lamy, D. , Pergamin, L. , Bakermans-Kranenburg, M. J. , & van Ijzendoorn, M. H. ( 2007 ). Threat-related attentional bias in anxious and nonanxious individuals: A meta-analytic study.   Psychological Bulletin , 133 (1), 1–24. doi: 10.1037/0033-2909.133.1.1

Barbato, A. , & D'Avanzo, B. ( 2008 ). Efficacy of couple therapy as a treatment for depression: A meta-analysis.   Psychiatric Quarterly , 79 (2), 121–132. doi: 10.1007/s11126-008-9068-0

Barrick, M. R. , & Mount, M. K. ( 1991 ). The big 5 personality dimensions and job-performance—a meta-analysis.   Personnel Psychology , 44 (1), 1–26.

Bax, L. , Yu, L. M. , Ikeda, N. , Tsuruta, H. , & Moons, K. G. M. ( 2006 ). Development and validation of MIX: Comprehensive free software for meta-analysis of causal research data.   BMC Medical Research Methodology , 6 (50). http://www.biomedcentral.com/1471-2288/6/50

Bloch, M. H. , Landeros-Weisenberger, A. , Rosario, M. C. , Pittenger, C. , & Leckman, J. F. ( 2008 ). Meta-analysis of the symptom structure of obsessive-compulsive disorder.   American Journal of Psychiatry , 165 (12), 1532–1542. doi: 10.1176/appi.ajp.2008.08020320

Borenstein, M. , Hedges, L. , Higgins, J. , & Rothstein, H. ( 2005 ). Comprehensive meta-analysis (Version 2). Englewood, NJ: Biostat. Retrieved from http://www.meta-analysis.com/

Bradley, R. , Greene, J. , Russ, E. , Dutra, L. , & Westen, D. ( 2005 ). A multidimensional meta-analysis of psychotherapy for PTSD.   American Journal of Psychiatry , 162 (2), 214–227. doi: 10.1176/appi.ajp.162.2.214

Brewin, C. R., Kleiner, J. S., Vasterling, J. J., & Field, A. P. (2007). Memory for emotionally neutral information in posttraumatic stress disorder: A meta-analytic investigation. Journal of Abnormal Psychology, 116(3), 448–463. doi: 10.1037/0021-843x.116.3.448

Burt, S. A. ( 2009 ). Rethinking environmental contributions to child and adolescent psychopathology: a meta-analysis of shared environmental influences.   Psychological Bulletin , 135 (4), 608–637. doi: 10.1037/a0015702

Cartwright-Hatton, S. , Roberts, C. , Chitsabesan, P. , Fothergill, C. , & Harrington, R. ( 2004 ). Systematic review of the efficacy of cognitive behaviour therapies for childhood and adolescent anxiety disorders.   British Journal of Clinical Psychology , 43 , 421–436.

Chan, R. C. K. , Xu, T. , Heinrichs, R. W. , Yu, Y. , & Wang, Y. ( 2010 ). Neurological soft signs in schizophrenia: a meta-analysis.   Schizophrenia Bulletin , 36 (6), 1089–1104. doi: 10.1093/schbul/sbp011

Cooper, H. , Maxwell, S. , Stone, A. , Sher, K. J. , & Board, A. P. C. ( 2008 ). Reporting standards for research in psychology: Why do we need them? What might they be?   American Psychologist , 63 (9), 839–851.

Cooper, H. M. ( 2010 ). Research synthesis and meta-analysis: a step-by-step approach (4th ed.). Thousand Oaks, CA: Sage.

Coursol, A. , & Wagner, E. E. ( 1986 ). Effect of positive findings on submission and acceptance rates: A note on meta-analysis bias.   Professional Psychology , 17 , 136–137.

Covin, R. , Ouimet, A. J. , Seeds, P. M. , & Dozois, D. J. A. ( 2008 ). A meta-analysis of CBT for pathological worry among clients with GAD.   Journal of Anxiety Disorders , 22 (1), 108–116. doi: 10.1016/j.janxdis.2007.01.002

Crawley, M. J. ( 2007 ). The R book . Chichester: Wiley-Blackwell.

Cuijpers, P. , Li, J. , Hofmann, S. G. , & Andersson, G. ( 2010 ). Self-reported versus clinician-rated symptoms of depression as outcome measures in psychotherapy research on depression: A meta-analysis.   Clinical Psychology Review , 30 (6), 768–778. doi: 10.1016/j.cpr.2010.06.001

DeCoster, J. (1998). Microsoft Excel spreadsheets: Meta-analysis . Retrieved October 1, 2006, from http://www.stat-help.com/spreadsheets.html

Duval, S. J. , & Tweedie, R. L. ( 2000 ). A nonparametric “trim and fill” method of accounting for publication bias in meta-analysis.   Journal of the American Statistical Association , 95 (449), 89–98.

Egger, M. , Smith, G. D. , Schneider, M. , & Minder, C. ( 1997 ). Bias in meta-analysis detected by a simple, graphical test.   British Medical Journal , 315 (7109), 629–634.

Eysenck, H. J. ( 1978 ). Exercise in mega-silliness.   American Psychologist , 33 (5), 517–517.

Field, A. P. ( 2001 ). Meta-analysis of correlation coefficients: A Monte Carlo comparison of fixed- and random-effects methods.   Psychological Methods , 6 (2), 161–180.

Field, A. P. ( 2003 a). Can meta-analysis be trusted?   Psychologist , 16 (12), 642–645.

Field, A. P. ( 2003 b). The problems in using fixed-effects models of meta-analysis on real-world data.   Understanding Statistics , 2 , 77–96.

Field, A. P. ( 2005 a). Is the meta-analysis of correlation coefficients accurate when population correlations vary?   Psychological Methods , 10 (4), 444–467.

Field, A. P. ( 2005 b). Meta-analysis. In J. Miles & P. Gilbert (Eds.), A handbook of research methods in clinical and health psychology (pp. 295–308). Oxford: Oxford University Press.

Field, A. P. ( 2009 ). Meta-analysis. In R. E. Millsap & A. Maydeu-Olivares (Eds.), The SAGE handbook of quantitative methods in psychology (pp. 404–422). London: Sage.

Field, A. P. , & Gillett, R. ( 2010 ). How to do a meta-analysis.   British Journal of Mathematical & Statistical Psychology , 63 , 665–694.

Field, A. P. , Miles, J. N. V. , & Field, Z. C. ( 2012 ). Discovering statistics using R: And sex and drugs and rock ‘n’ roll . London: Sage.

Fisher, R. A. ( 1935 ). The design of experiments . Edinburgh: Oliver & Boyd.

Fisher, R. A. ( 1938 ). Statistical methods for research workers (7th ed.). London: Oliver & Boyd.

Furr, J. M. , Corner, J. S. , Edmunds, J. M. , & Kendall, P. C. ( 2010 ). Disasters and youth: a meta-analytic examination of posttraumatic stress.   Journal of Consulting and Clinical Psychology , 78 (6), 765–780. doi: 10.1037/A0021482

Glass, G. V. ( 1976 ). Primary, secondary, and meta-analysis of research.   Educational Researcher , 5 (10), 3–8.

Greenwald, A. G. ( 1975 ). Consequences of prejudice against null hypothesis.   Psychological Bulletin , 82 (1), 1–19.

Hafdahl, A. R. , & Williams, M. A. ( 2009 ). Meta-analysis of correlations revisited: Attempted replication and extension of Field's (2001) simulation studies.   Psychological Methods , 14 (1), 24–42. doi: 10.1037/a0014697

Hall, S. M. , & Brannick, M. T. ( 2002 ). Comparison of two random-effects methods of meta-analysis.   Journal of Applied Psychology , 87 (2), 377–389.

Hanrahan, F. , Field, A. P. , Jones, F. , & Davey, G. C. L. ( 2013 ). A meta-analysis of cognitive-behavior therapy for worry in generalized anxiety disorder.   Clinical Psychology Review , 33 , 120–132..

Hedges, L. ( 1981 ). Distribution Theory for glass's estimator of effect size and related estimators.   Journal of Educational Statistics , 6 , 107–128.

Hedges, L. V. ( 1992 ). Meta-analysis.   Journal of Educational Statistics , 17 (4), 279–296.

Hedges, L. V. , & Olkin, I. ( 1985 ). Statistical methods for meta-analysis . Orlando, FL: Academic Press.

Hedges, L. V. , & Vevea, J. L. ( 1998 ). Fixed- and random-effects models in meta-analysis.   Psychological Methods , 3 (4), 486–504.

Hendriks, G. J. , Voshaar, R. C. O. , Keijsers, G. P. J. , Hoogduin, C. A. L. , & van Balkom, A. J. L. M. ( 2008 ). Cognitive-behavioural therapy for late-life anxiety disorders: a systematic review and meta-analysis.   Acta Psychiatrica Scandinavica , 117 (6), 403–411. doi: 10.1111/j.1600-0447.2008.01190.x

Hox, J. J. ( 2002 ). Multilevel analysis, techniques and applications . Mahwah, NJ: Lawrence Erlbaum Associates.

Hunter, J. E. , & Schmidt, F. L. ( 1990 ). Methods of meta-analysis: correcting error and bias in research findings . Newbury Park, CA: Sage.

Hunter, J. E. , & Schmidt, F. L. ( 2000 ). Fixed effects vs. random effects meta-analysis models: Implications for cumulative research knowledge.   International Journal of Selection and Assessment , 8 (4), 275–292.

Hunter, J. E. , & Schmidt, F. L. ( 2004 ). Methods of meta-analysis: Correcting error and bias in research findings (2nd ed.). Newbury Park, CA: Sage.

Hunter, J. E. , Schmidt, F. L. , & Le, H. ( 2006 ). Implications of direct and indirect range restriction for meta-analysis methods and findings.   Journal of Applied Psychology , 91 (3), 594–612. doi: 10.1037/0021-9010.91.3.594

Kashdan, T. B. ( 2007 ). Social anxiety spectrum and diminished positive experiences: Theoretical synthesis and meta-analysis.   Clinical Psychology Review , 27 (3), 348–365. doi: 10.1016/j.cpr.2006.12.003

Kleinstaeuber, M. , Witthoeft, M. , & Hiller, W. ( 2011 ). Efficacy of short-term psychotherapy for multiple medically unexplained physical symptoms: A meta-analysis.   Clinical Psychology Review , 31 (1), 146–160. doi: 10.1016/j.cpr.2010.09.001

Kontopantelis, E. , & Reeves, D. ( 2009 ). MetaEasy: A meta-analysis add-in for Microsoft Excel.   Journal of Statistical Software , 30 (7). http://www.jstatsoft.org/v30/i07/paper

Lavesque, R. (2001). Syntax: meta-analysis . Retrieved October 1, 2006, from http://www.spsstools.net/

Light, R. J. , & Pillerner, D. B. ( 1984 ). Summing up: The science of reviewing research . Cambridge, MA: Harvard University Press.

Macaskill, P. , Walter, S. D. , & Irwig, L. ( 2001 ). A comparison of methods to detect publication bias in meta-analysis.   Statistics in Medicine , 20 (4), 641–654.

Malouff, J. A. , Thorsteinsson, E. B. , Rooke, S. E. , Bhullar, N. , & Schutte, N. S. ( 2008 ). Efficacy of cognitive behavioral therapy for chronic fatigue syndrome: A meta-analysis.   Clinical Psychology Review , 28 (5), 736–745. doi: 10.1016/j.cpr.2007.10.004

McLeod, B. D. , & Weisz, J. R. ( 2004 ). Using dissertations to examine potential bias in child and adolescent clinical trials.   Journal of Consulting and Clinical Psychology , 72 (2), 235–251.

Moher, D. , Cook, D. J. , Eastwood, S. , Olkin, I. , Rennie, D. , Stroup, D. F. , et al. ( 1999 ). Improving the quality of reports of meta-analyses of randomised controlled trials: the QUOROM statement.   Lancet , 354 (9193), 1896–1900.

Moher, D. , Liberati, A. , Tetzlaff, J. , Altman, D. G. , & Grp, P. ( 2009 ). Preferred reporting items for systematic reviews and meta-analyses: the PRISMA Statement.   Journal of Clinical Epidemiology , 62 (10), 1006–1012. doi: 10.1016/J.Jclinepi.2009.06.005

Mychailyszyn, M.P. , Brodman, D. , Read, K.L. , & Kendall, P.C. ( 2012 ). Cognitive-behavioral school-based interventions for anxious and depressed youth: A meta-analysis of outcomes.   Clinical Psychology: Science and Practice , 19 (2), 129–153.

National Research Council. ( 1992 ). Combining information: Statistical issues and opportunities for research . Washington, D.C.: National Academy Press.

Osburn, H. G. , & Callender, J. ( 1992 ). A note on the sampling variance of the mean uncorrected correlation in meta-analysis and validity generalization.   Journal of Applied Psychology , 77 (2), 115–122.

Overton, R. C. ( 1998 ). A comparison of fixed-effects and mixed (random-effects) models for meta-analysis tests of moderator variable effects.   Psychological Methods , 3 (3), 354–379.

Parsons, T. D. , & Rizzo, A. A. ( 2008 ). Affective outcomes of virtual reality exposure therapy for anxiety and specific phobias: A meta-analysis.   Journal of Behavior Therapy and Experimental Psychiatry , 39 (3), 250–261. doi: 10.1016/j.jbtep.2007.07.007

Pearson, E. S. ( 1938 ). The probability integral transformation for testing goodness of fit and combining tests of significance.   Biometrika , 30 , 134–148.

Quick, J. M. ( 2010 ). The statistical analysis with R beginners guide . Birmingham: Packt.

R Development Core Team. ( 2010 ). R: A language and environment for statistical computing . Vienna, Austria: R Foundation for Statistical Computing. Retrieved from http://www.R-project.org

Roberts, N. P. , Kitchiner, N. J. , Kenardy, J. , Bisson, J. I. , & Psych, F. R. C. ( 2009 ). Systematic review and meta-analysis of multiple-session early interventions following traumatic Eventse American Journal of Psychiatry , 166 (3), 293–301. doi: 10.1176/appi.ajp.2008.08040590

Rosa-Alcazar, A. I. , Sanchez-Meca, J. , Gomez-Conesa, A. , & Marin-Martinez, F. ( 2008 ). Psychological treatment of obsessive-compulsive disorder: A meta-analysis.   Clinical Psychology Review , 28 (8), 1310–1325. doi: 10.1016/j.cpr.2008.07.001

Rosenthal, R. ( 1978 ). Combining results of independent studies.   Psychological Bulletin , 85 (1), 185–193.

Rosenthal, R. ( 1979 ). The file drawer problem and tolerance for null results.   Psychological Bulletin , 86 (3), 638–641.

Rosenthal, R. ( 1984 ). Meta-analytic procedures for social research . Beverly Hills, CA: Sage.

Rosenthal, R. ( 1991 ). Meta-analytic procedures for social research (2nd ed.). Newbury Park, CA: Sage.

Rosenthal, R. ( 1995 ). Writing meta-analytic reviews.   Psychological Bulletin , 118 (2), 183–192.

Rosenthal, R. , & DiMatteo, M. R. ( 2001 ). Meta-analysis: Recent developments in quantitative methods for literature reviews.   Annual Review of Psychology , 52 , 59–82.

Rosenthal, R. , & Rubin, D. B. ( 1978 ). Interpersonal expectancy effects: the first 345 studies.   Behavioral and Brain Sciences , 1 (3), 377–386.

Ruocco, A. C. ( 2005 ). The neuropsychology of borderline personality disorder: A meta-analysis and review.   Psychiatry Research , 137 (3), 191–202. doi: 10.1016/j.psychres.2005.07.004

Schmidt, F. L. , Oh, I. S. , & Hayes, T. L. ( 2009 ). Fixed- versus random-effects models in meta-analysis: Model properties and an empirical comparison of differences in results.   British Journal of Mathematical & Statistical Psychology , 62 , 97–128. doi: 10.1348/000711007x255327

Schulze, R. ( 2004 ). Meta-analysis: a comparison of approaches . Cambridge, MA: Hogrefe & Huber.

Schwarzer, G. (2005). Meta . Retrieved October 1, 2006, from http://www.stats.bris.ac.uk/R/

Shadish, W. R. ( 1992 ). Do family and marital psychotherapies change what people do? A meta-analysis of behavioural outcomes. In T. D. Cook , H. Cooper , D. S. Cordray , H. Hartmann , L. V. Hedges , R. J. Light , T. A. Louis , & F. Mosteller (Eds.), Meta-analysis for explanation: A casebook (pp. 129–208). New York: Sage.

Singh, S. P. , Singh, V. , Kar, N. , & Chan, K. ( 2010 ). Efficacy of antidepressants in treating the negative symptoms of chronic schizophrenia: meta-analysis.   British Journal of Psychiatry , 197 (3), 174–179. doi: 10.1192/bjp.bp.109.067710

Smith, M. L. , & Glass, G. V. ( 1977 ). Meta-analysis of psychotherapy outcome studies.   American Psychologist , 32 (9), 752–760.

Spreckley, M. , & Boyd, R. ( 2009 ). Efficacy of applied behavioral intervention in preschool children with autism for improving cognitive, language, and adaptive behavior: a systematic review and meta-analysis.   Journal of Pediatrics , 154 (3), 338–344. doi: 10.1016/j.jpeds.2008.09.012

Sterling, T. D. ( 1959 ). Publication decisions and their possible effects on inferences drawn from tests of significance—or vice versa.   Journal of the American Statistical Association , 54 (285), 30–34.

Stewart, R. E. , & Chambless, D. L. ( 2009 ). Cognitive-behavioral therapy for adult anxiety disorders in clinical practice: A meta-analysis of effectiveness studies.   Journal of Consulting and Clinical Psychology , 77 (4), 595–606. doi: 10.1037/a0016032

Stouffer, S. A. ( 1949 ). The American soldier: Vol. 1. Adjustment during Army life . Princeton, NJ: Princeton University Press.

The Cochrane Collaboration. ( 2008 ). Review Manager (RevMan) for Windows: Version 5.0 . Copenhagen: The Nordic Cochrane Centre. Retrieved from http://www.cc-ims.net/revman/

Verzani, J. ( 2004 ). Using R for introductory statistics . Boca Raton, FL: Chapman & Hall.

Vevea, J. L. , & Woods, C. M. ( 2005 ). Publication bias in research synthesis: Sensitivity analysis using a priori weight functions.   Psychological Methods , 10 (4), 428–443.

Viechtbauer, W. ( 2010 ). Conducting meta-analyses in R with the metafor package.   Journal of Statistical Software , 36 (3), 1–48.

Villeneuve, K. , Potvin, S. , Lesage, A. , & Nicole, L. ( 2010 ). Meta-analysis of rates of drop-out from psychosocial treatment among persons with schizophrenia spectrum disorder.   Schizophrenia Research , 121 (1-3), 266–270. doi: 10.1016/j.schres.2010.04.003

Wilson, D. B. (2001). Practical meta-analysis effect size calculator . Retrieved August 3, 2010, from http://gunston.gmu.edu/cebcp/EffectSizeCalculator/index.html

Wilson, D. B. (2004). A spreadsheet for calculating standardized mean difference type effect sizes . Retrieved October 1, 2006, from http://mason.gmu.edu/~dwilsonb/ma.html

Zuur, A. F. , Ieno, E. N. , & Meesters, E. H. W. G. ( 2009 ). A beginner's guide to R . New York: Springer.

Open access | Published: 13 July 2021

Systematic review and meta-analysis of depression, anxiety, and suicidal ideation among Ph.D. students

Emily N. Satinsky, Tomoki Kimura, Mathew V. Kiang, Rediet Abebe, Scott Cunningham, Hedwig Lee, Xiaofei Lin, Cindy H. Liu, Igor Rudan, Srijan Sen, Mark Tomlinson, Miranda Yaver & Alexander C. Tsai

Scientific Reports, volume 11, Article number: 14370 (2021)


  • Epidemiology
  • Health policy
  • Quality of life

University administrators and mental health clinicians have raised concerns about depression and anxiety among Ph.D. students, yet no study has systematically synthesized the available evidence in this area. After searching the literature for studies reporting on depression, anxiety, and/or suicidal ideation among Ph.D. students, we included 32 articles. Among 16 studies reporting the prevalence of clinically significant symptoms of depression across 23,469 Ph.D. students, the pooled estimate of the proportion of students with depression was 0.24 (95% confidence interval [CI], 0.18–0.31; I² = 98.75%). In a meta-analysis of the nine studies reporting the prevalence of clinically significant symptoms of anxiety across 15,626 students, the estimated proportion of students with anxiety was 0.17 (95% CI, 0.12–0.23; I² = 98.05%). We conclude that depression and anxiety are highly prevalent among Ph.D. students. Data limitations precluded our ability to obtain a pooled estimate of suicidal ideation prevalence. Programs that systematically monitor and promote the mental health of Ph.D. students are urgently needed.


Introduction

Mental health problems among graduate students in doctoral degree programs have received increasing attention 1 , 2 , 3 , 4 . Ph.D. students (and students completing equivalent degrees, such as the Sc.D.) face training periods of unpredictable duration, financial insecurity and food insecurity, competitive markets for tenure-track positions, and unsparing publishing and funding models 5 , 6 , 7 , 8 , 9 , 10 , 11 , 12 —all of which may have greater adverse impacts on students from marginalized and underrepresented populations 13 , 14 , 15 . Ph.D. students’ mental health problems may negatively affect their physical health 16 , interpersonal relationships 17 , academic output, and work performance 18 , 19 , and may also contribute to program attrition 20 , 21 , 22 . As many as 30 to 50% of Ph.D. students drop out of their programs, depending on the country and discipline 23 , 24 , 25 , 26 , 27 . Further, while mental health problems among Ph.D. students raise concerns for the wellbeing of the individuals themselves and their personal networks, they also have broader repercussions for their institutions and academia as a whole 22 .

Despite the potential public health significance of this problem, most evidence syntheses on student mental health have focused on undergraduate students 28 , 29 or graduate students in professional degree programs (e.g., medical students) 30 . In non-systematic summaries, estimates of the prevalence of clinically significant depressive symptoms among Ph.D. students vary considerably 31 , 32 , 33 . Reliable estimates of depression and other mental health problems among Ph.D. students are needed to inform preventive, screening, or treatment efforts. To address this gap in the literature, we conducted a systematic review and meta-analysis to explore patterns of depression, anxiety, and suicidal ideation among Ph.D. students.

Results

Figure 1: Flowchart of included articles.

The evidence search yielded 886 articles, of which 286 were excluded as duplicates (Fig.  1 ). An additional nine articles were identified through reference lists or grey literature reports published on university websites. Following a title/abstract review and subsequent full-text review, 520 additional articles were excluded.

Of the 89 remaining articles, 74 were unclear about their definition of graduate students or grouped Ph.D. and non-Ph.D. students without disaggregating the estimates by degree level. We obtained contact information for the authors of most of these articles (69 [93%]), requesting additional data. Three authors clarified that their study samples only included Ph.D. students 34 , 35 , 36 . Fourteen authors confirmed that their study samples included both Ph.D. and non-Ph.D. students but provided us with data on the subsample of Ph.D. students 37 , 38 , 39 , 40 , 41 , 42 , 43 , 44 , 45 , 46 , 47 , 48 , 49 , 50 . Where authors clarified that the sample was limited to graduate students in non-doctoral degree programs, did not provide additional data on the subsample of Ph.D. students, or did not reply to our information requests, we excluded the studies due to insufficient information (Supplementary Table S1 ).
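As a quick arithmetic check on the screening flow just described, the reported counts reconcile. The short Python sketch below walks through the totals; the figure of 57 insufficient-information exclusions is inferred from the counts above rather than stated directly in the text.

```python
# Reconciling the screening counts reported above (a sketch; the 57
# insufficient-information exclusions are inferred, not quoted directly).
identified = 886
duplicates = 286
added_by_hand = 9            # reference lists and grey-literature reports
excluded_on_review = 520     # title/abstract and full-text review

remaining = identified - duplicates + added_by_hand - excluded_on_review
assert remaining == 89       # "Of the 89 remaining articles..."

unclear = 74                 # pooled or unclear Ph.D./non-Ph.D. samples
recovered = 3 + 14           # authors who clarified or shared Ph.D.-only data
included = remaining - (unclear - recovered)
assert included == 32        # the 32 articles included in the review
print(remaining, included)   # -> 89 32
```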

Ultimately, 32 articles describing the findings of 29 unique studies were identified and included in the review 16 , 32 , 33 , 34 , 35 , 36 , 37 , 38 , 39 , 40 , 41 , 42 , 43 , 44 , 45 , 46 , 47 , 48 , 50 , 51 , 52 , 53 , 54 , 55 , 56 , 57 , 58 , 59 , 60 , 61 , 62 (Table 1 ). Overall, 26 studies measured depression, 19 studies measured anxiety, and six studies measured suicidal ideation. Three pairs of articles reported data on the same sample of Ph.D. students 33 , 38 , 45 , 51 , 53 , 56 and were therefore grouped in Table 1 and reported as three studies. Publication dates ranged from 1979 to 2019, but most articles (22/32 [69%]) were published after 2015. Most studies were conducted in the United States (20/29 [69%]), with additional studies conducted in Australia, Belgium, China, Iran, Mexico, and South Korea. Two studies were conducted in cross-national settings representing 48 additional countries. None were conducted in sub-Saharan Africa or South America. Most studies included students completing their degrees in a mix of disciplines (17/29 [59%]), while 12 studies were limited to students in a specific field (e.g., biomedicine, education). The median sample size was 172 students (interquartile range [IQR], 68–654; range, 6–6405). Seven studies focused on mental health outcomes in demographic subgroups, including ethnic or racialized minority students 37 , 41 , 43 , international students 47 , 50 , and sexual and gender minority students 42 , 54 .

In all, 16 studies reported the prevalence of depression among a total of 23,469 Ph.D. students (Fig.  2 ; range, 10–47%). Of these, the most widely used depression scales were the PHQ-9 (9 studies) and variants of the Center for Epidemiologic Studies-Depression scale (CES-D, 4 studies) 63 , and all studies assessed clinically significant symptoms of depression over the past one to two weeks. Three of these studies reported findings based on data from different survey years of the same parent study (the Healthy Minds Study) 40 , 42 , 43 , but due to overlap in the survey years reported across articles, these data were pooled. Most of these studies were based on data collected through online surveys (13/16 [81%]). Ten studies (63%) used random or systematic sampling, four studies (25%) used convenience sampling, and two studies (13%) used multiple sampling techniques.

Figure 2: Pooled estimate of the proportion of Ph.D. students with clinically significant symptoms of depression.

The estimated proportion of Ph.D. students assessed as having clinically significant symptoms of depression was 0.24 (95% confidence interval [CI], 0.18–0.31; 95% predictive interval [PI], 0.04–0.54), with significant evidence of between-study heterogeneity (I² = 98.75%). A subgroup analysis restricted to the twelve studies conducted in the United States yielded similar findings (pooled estimate [ES] = 0.23; 95% CI, 0.15–0.32; 95% PI, 0.01–0.60), with no appreciable difference in heterogeneity (I² = 98.91%). A subgroup analysis restricted to the studies that used the PHQ-9 to assess depression yielded a slightly lower prevalence estimate and a slight reduction in heterogeneity (ES = 0.18; 95% CI, 0.14–0.22; 95% PI, 0.07–0.34; I² = 90.59%).

Nine studies reported the prevalence of clinically significant symptoms of anxiety among a total of 15,626 Ph.D. students (Fig.  3 ; range 4–49%). Of these, the most widely used anxiety scale was the 7-item Generalized Anxiety Disorder scale (GAD-7, 5 studies) 64 . Data from three of the Healthy Minds Study articles were pooled into two estimates, because the scale used to measure anxiety changed midway through the parent study (i.e., the Patient Health Questionnaire-Generalized Anxiety Disorder [PHQ-GAD] scale was used from 2007 to 2012 and then switched to the GAD-7 in 2013 40 ). Most studies (8/9 [89%]) assessed clinically significant symptoms of anxiety over the past two to four weeks, with the one remaining study measuring anxiety over the past year. Again, most of these studies were based on data collected through online surveys (7/9 [78%]). Five studies (56%) used random or systematic sampling, two studies (22%) used convenience sampling, and two studies (22%) used multiple sampling techniques.

Figure 3: Pooled estimate of the proportion of Ph.D. students with clinically significant symptoms of anxiety.

The estimated proportion of Ph.D. students assessed as having anxiety was 0.17 (95% CI, 0.12–0.23; 95% PI, 0.02–0.41), with significant evidence of between-study heterogeneity (I² = 98.05%). The subgroup analysis restricted to the five studies conducted in the United States yielded a slightly lower proportion of students assessed as having anxiety (ES = 0.14; 95% CI, 0.08–0.20; 95% PI, 0.00–0.43), with no appreciable difference in heterogeneity (I² = 98.54%).

Six studies reported the prevalence of suicidal ideation (range, 2–12%), but the recall windows varied greatly (e.g., ideation within the past 2 weeks vs. past year), precluding pooled estimation.

Additional stratified pooled estimates could not be obtained. One study of Ph.D. students across 54 countries found that phase of study was a significant moderator of mental health, with students in the comprehensive examination and dissertation phases more likely to experience distress compared with students primarily engaged in coursework 59 . Other studies identified a higher prevalence of mental ill-health among women 54 ; lesbian, gay, bisexual, transgender, and queer (LGBTQ) students 42 , 54 , 60 ; and students with multiple intersecting identities 54 .

Several studies identified correlates of mental health problems including: project- and supervisor-related issues, stress about productivity, and self-doubt 53 , 62 ; uncertain career prospects, poor living conditions, financial stressors, lack of sleep, feeling devalued, social isolation, and advisor relationships 61 ; financial challenges 38 ; difficulties with work-life balance 58 ; and feelings of isolation and loneliness 52 . Despite these challenges, help-seeking appeared to be limited, with only about one-quarter of Ph.D. students reporting mental health problems also reporting that they were receiving treatment 40 , 52 .

Risk of bias

Twenty-one of 32 articles were assessed as having low risk of bias (Supplementary Table S2). Five articles received one point for all five categories on the risk of bias assessment (lowest risk of bias), and one article received no points (highest risk). The mean risk of bias score was 3.22 (standard deviation, 1.34; median, 4; IQR, 2–4). Restricting the estimation sample to 12 studies assessed as having low risk of bias, the estimated proportion of Ph.D. students with depression was 0.25 (95% CI, 0.18–0.33; 95% PI, 0.04–0.57; I² = 99.11%), nearly identical to the primary estimate, with no reduction in heterogeneity. The estimated proportion of Ph.D. students with anxiety, among the 7 studies assessed as having low risk of bias, was 0.12 (95% CI, 0.07–0.17; 95% PI, 0.01–0.34; I² = 98.17%), again with no appreciable reduction in heterogeneity.

Discussion

In our meta-analysis of 16 studies representing 23,469 Ph.D. students, we estimated that the pooled prevalence of clinically significant symptoms of depression was 24%. This estimate is consistent with estimated prevalence rates in other high-stress biomedical trainee populations, including medical students (27%) 30 , resident physicians (29%) 65 , and postdoctoral research fellows (29%) 66 . In the sample of nine studies representing 15,626 Ph.D. students, we estimated that the pooled prevalence of clinically significant symptoms of anxiety was 17%. While validated screening instruments tend to over-identify cases of depression (relative to structured clinical interviews) by approximately a factor of two 67 , 68 , our findings nonetheless point to a major public health problem among Ph.D. students. Available data suggest that the prevalence of depressive and anxiety disorders in the general population ranges from 5 to 7% worldwide 69 , 70 . In contrast, prevalence estimates of major depressive disorder among young adults have ranged from 13% (for young adults between the ages of 18 and 29 years in the 2012–2013 National Epidemiologic Survey on Alcohol and Related Conditions III 71 ) to 15% (for young adults between the ages of 18 and 25 in the 2019 U.S. National Survey on Drug Use and Health 72 ). Likewise, the prevalence of generalized anxiety disorder was estimated at 4% among young adults between the ages of 18 and 29 in the 2001–03 U.S. National Comorbidity Survey Replication 73 . Thus, even accounting for potential upward bias inherent in these studies’ use of screening instruments, our estimates suggest that the rates of recent clinically significant symptoms of depression and anxiety are greater among Ph.D. students compared with young adults in the general population.

Further underscoring the importance of this public health issue, Ph.D. students face unique stressors and uncertainties that may put them at increased risk for mental health and substance use problems. Students grapple with competing responsibilities, including coursework, teaching, and research, while also managing interpersonal relationships, social isolation, caregiving, and financial insecurity 3 , 10 . Increasing enrollment in doctoral degree programs has not been matched with a commensurate increase in tenure-track academic job opportunities, intensifying competition and pressure to find employment post-graduation 5 . Advisor-student power relations rarely offer options for recourse if and when such relationships become strained, particularly in the setting of sexual harassment, unwanted sexual attention, sexual coercion, and rape 74 , 75 , 76 , 77 , 78 . All of these stressors may be magnified—and compounded by stressors unrelated to graduate school—for subgroups of students who are underrepresented in doctoral degree programs and among whom mental health problems are either more prevalent and/or undertreated compared with the general population, including Black, indigenous, and other people of color 13 , 79 , 80 ; women 81 , 82 ; first-generation students 14 , 15 ; people who identify as LGBTQ 83 , 84 , 85 ; people with disabilities; and people with multiple intersecting identities.

Structural- and individual-level interventions will be needed to reduce the burden of mental ill-health among Ph.D. students worldwide 31 , 86 . Despite the high prevalence of mental health and substance use problems 87 , Ph.D. students demonstrate low rates of help-seeking 40 , 52 , 88 . Common barriers to help-seeking include fears of harming one’s academic career, financial insecurity, lack of time, and lack of awareness 89 , 90 , 91 , as well as health care systems-related barriers, including insufficient numbers of culturally competent counseling staff, limited access to psychological services beyond time-limited psychotherapies, and lack of programs that address the specific needs either of Ph.D. students in general 92 or of Ph.D. students belonging to marginalized groups 93 , 94 . Structural interventions focused solely on enhancing student resilience might include programs aimed at reducing stigma, fostering social cohesion, and reducing social isolation, while changing norms around help-seeking behavior 95 , 96 . However, structural interventions focused on changing stressogenic aspects of the graduate student environment itself are also needed 97 , beyond any enhancements to Ph.D. student resilience, including: undercutting power differentials between graduate students and individual faculty advisors, e.g., by diffusing power among multiple faculty advisors; eliminating racist, sexist, and other discriminatory behaviors by faculty advisors 74 , 75 , 98 ; valuing mentorship and other aspects of “invisible work” that are often disproportionately borne by women faculty and faculty of color 99 , 100 ; and training faculty members to emphasize the dignity of, and adequately prepare Ph.D. students for, non-academic careers 101 , 102 .

Our findings should be interpreted with several limitations in mind. First, the pooled estimates are characterized by a high degree of heterogeneity, similar to meta-analyses of depression prevalence in other populations 30 , 65 , 103 , 104 , 105 . Second, we were only able to aggregate depression prevalence across 16 studies and anxiety prevalence across nine studies (the majority of which were conducted in the U.S.) – far fewer than the 183 studies included in a meta-analysis of depression prevalence among medical students 30 and the 54 studies included in a meta-analysis of resident physicians 65 . These differences underscore the need for more rigorous study in this critical area. Many articles were either excluded from the review or from the meta-analyses for not meeting inclusion criteria or not reporting relevant statistics. Future research in this area should ensure the systematic collection of high-quality, clinically relevant data from a comprehensive set of institutions, across disciplines and countries, and disaggregated by graduate student type. As part of conducting research and addressing student mental health and wellbeing, university deans, provosts, and chancellors should partner with national survey and program institutions (e.g., Graduate Student Experience in the Research University [gradSERU] 106 , the American College Health Association National College Health Assessment [ACHA-NCHA], and HealthyMinds). Furthermore, federal agencies that oversee health and higher education should provide resources for these efforts, and accreditation agencies should require monitoring of mental health and programmatic responses to stressors among Ph.D. students.

Third, heterogeneity in reporting precluded a meta-analysis of the suicidality outcomes among the few studies that reported such data. While reducing the burden of mental health problems among graduate students is an important public health aim in itself, more research into understanding non-suicidal self-injurious behavior, suicide attempts, and completed suicide among Ph.D. students is warranted. Fourth, it is possible that the grey literature reports included in our meta-analysis are more likely to be undertaken at research-intensive institutions 52 , 60 , 61 . However, the direction of bias is unpredictable: mental health problems among Ph.D. students in research-intensive environments may be more prevalent due to detection bias, but such institutions may also have more resources devoted to preventive, screening, or treatment efforts 92 . Fifth, inclusion in this meta-analysis and systematic review was limited to those based on community samples. Inclusion of clinic-based samples, or of studies conducted before or after specific milestones (e.g., the qualifying examination or dissertation prospectus defense), likely would have yielded even higher pooled prevalence estimates of mental health problems. And finally, few studies provided disaggregated data according to sociodemographic factors, stage of training (e.g., first year, pre-prospectus defense, all-but-dissertation), or discipline of study. These factors might be investigated further for differences in mental health outcomes.

Clinically significant symptoms of depression and anxiety are pervasive among graduate students in doctoral degree programs, yet this population remains understudied relative to other trainee populations. Structural and clinical interventions to systematically monitor and promote the mental health and wellbeing of Ph.D. students are urgently needed.

Methods

This systematic review and meta-analysis follows the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) approach (Supplementary Table S3 ) 107 . This study was based on data collected from publicly available bibliometric databases and did not require ethical approval from our institutional review boards.

Eligibility criteria

Studies were included if they provided data on either: (a) the number or proportion of Ph.D. students with clinically significant symptoms of depression or anxiety, ascertained using a validated scale; or (b) the mean depression or anxiety symptom severity score and its standard deviation among Ph.D. students. Suicidal ideation was examined as a secondary outcome.

We excluded studies that focused on graduate students in non-doctoral degree programs (e.g., Master of Public Health) or professional degree programs (e.g., Doctor of Medicine, Juris Doctor) because more is known about mental health problems in these populations 30 , 108 , 109 , 110 and because Ph.D. students face unique uncertainties. To minimize the potential for upward bias in our pooled prevalence estimates, we excluded studies that recruited students from campus counseling centers or other clinic-based settings. Studies that measured affective states, or state anxiety, before or after specific events (e.g., terrorist attacks, qualifying examinations) were also excluded.

If articles described the study sample in general terms (i.e., without clarifying the degree level of the participants), we contacted the authors by email for clarification. Similarly, if articles pooled results across graduate students in doctoral and non-doctoral degree programs (e.g., reporting a single estimate for a mixed sample of graduate students), we contacted the authors by email to request disaggregated data on the subsample of Ph.D. students. If authors did not reply after two contact attempts spaced over 2 months, or were unable to provide these data, we excluded these studies from further consideration.

Search strategy and data extraction

PubMed, Embase, PsycINFO, ERIC, and Business Source Complete were searched from inception of each database to November 5, 2019. The search strategy included terms related to mental health symptoms (e.g., depression, anxiety, suicide), the study population (e.g., graduate, doctoral), and measurement category (e.g., depression, Columbia-Suicide Severity Rating Scale) (Supplementary Table S4 ). In addition, we searched the reference lists and the grey literature.
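For illustration, the three term groups can be combined with Boolean operators; the snippet below sketches a hypothetical query of that form (the term lists are illustrative examples, and the authors' exact search strings appear in Supplementary Table S4 rather than here).

```python
# Hypothetical Boolean query combining the three term groups described above.
symptom_terms = ["depression", "anxiety", "suicide", "suicidal ideation"]
population_terms = ["Ph.D. student", "doctoral student", "graduate student"]
measure_terms = ["PHQ-9", "GAD-7", "Columbia-Suicide Severity Rating Scale"]

def or_block(terms):
    """Join a list of terms into a parenthesized OR clause."""
    return "(" + " OR ".join(f'"{t}"' for t in terms) + ")"

query = " AND ".join(or_block(group) for group in
                     (symptom_terms, population_terms, measure_terms))
print(query)
```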

After duplicates were removed, we screened the remaining titles and abstracts, followed by a full-text review. We excluded articles following the eligibility criteria listed above (i.e., those that were not focused on Ph.D. students; those that did not assess depression and/or anxiety using a validated screening tool; those that did not report relevant statistics of depression and/or anxiety; and those that recruited students from clinic-based settings). Reasons for exclusion were tracked at each stage. Following selection of included articles, two members of the research team extracted data and conducted risk of bias assessments. Discrepancies were discussed with a third member of the research team. Key extraction variables included: study design, geographic region, sample size, response rate, demographic characteristics of the sample, screening instrument(s) used for assessment, mean depression or anxiety symptom severity score (and its standard deviation), and the number (or proportion) of students experiencing clinically significant symptoms of depression or anxiety.
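To make the extraction step concrete, a minimal record structure mirroring the key variables listed above might look like the sketch below; the field names and example values are illustrative assumptions, not the authors' actual codebook.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ExtractionRecord:
    """One row per included study; None marks statistics a study did not report."""
    study_id: str
    design: str                     # e.g., "cross-sectional online survey"
    region: str
    sample_size: int
    response_rate: Optional[float]  # proportion, if reported
    instrument: str                 # e.g., "PHQ-9", "GAD-7"
    mean_severity: Optional[float]  # mean symptom severity score
    sd_severity: Optional[float]    # its standard deviation
    n_positive: Optional[int]       # students with clinically significant symptoms

# Hypothetical entry illustrating the fields; values are made up.
example = ExtractionRecord("study_001", "cross-sectional online survey",
                           "United States", 654, 0.42, "PHQ-9", None, None, 157)
```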

Risk of bias assessment

Following prior work 30 , 65 , the Newcastle–Ottawa Scale 111 was adapted and used to assess risk of bias in the included studies. Each study was assessed across 5 categories: sample representativeness, sample size, non-respondents, ascertainment of outcomes, and quality of descriptive statistics reporting (Supplementary Information S5 ). Studies were judged as having either low risk of bias (≥ 3 points) or high risk of bias (< 3 points).
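A minimal sketch of that scoring rule, assuming one point per category and the ≥ 3-point cutoff stated above; the category names are paraphrased from the text, not the authors' exact labels.

```python
# One point per adapted Newcastle-Ottawa category; >= 3 points counts as low risk.
CATEGORIES = ("representativeness", "sample_size", "non_respondents",
              "outcome_ascertainment", "reporting_quality")

def risk_of_bias(points: dict) -> str:
    """Sum the five 0/1 category scores and apply the cutoff used in the review."""
    total = sum(points.get(c, 0) for c in CATEGORIES)
    return "low risk" if total >= 3 else "high risk"

# Hypothetical study scoring 4 of 5 points -> "low risk".
print(risk_of_bias({"representativeness": 1, "sample_size": 1, "non_respondents": 0,
                    "outcome_ascertainment": 1, "reporting_quality": 1}))
```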

Analysis and synthesis

Before pooling the estimated prevalence rates across studies, we first transformed the proportions using a variance-stabilizing double arcsine transformation 112 . We then computed pooled estimates of prevalence using a random effects model 113 . Study-specific confidence intervals were estimated using the score method 114 , 115 . We estimated between-study heterogeneity using the I² statistic 116 . In an attempt to reduce the extent of heterogeneity, we re-estimated pooled prevalence restricting the analysis to studies conducted in the United States and to studies in which depression assessment was based on the 9-item Patient Health Questionnaire (PHQ-9) 117 . All analyses were conducted using Stata (version 16; StataCorp LP, College Station, Tex.). Where heterogeneity limited our ability to summarize the findings using meta-analysis, we synthesized the data using narrative review.
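As a rough illustration of the pooling approach described above, the Python sketch below applies a Freeman–Tukey double arcsine transformation and DerSimonian–Laird random-effects pooling to made-up prevalence data and reports I². It uses a simple sin² back-transformation and a normal-approximation confidence interval rather than the exact harmonic-mean inversion or the score-method intervals used in the paper, and it is not the authors' Stata code.

```python
import numpy as np

def freeman_tukey(x, n):
    """Freeman-Tukey double arcsine transform of x/n and its approximate variance."""
    t = 0.5 * (np.arcsin(np.sqrt(x / (n + 1.0))) + np.arcsin(np.sqrt((x + 1.0) / (n + 1.0))))
    v = 1.0 / (4.0 * n + 2.0)
    return t, v

def dersimonian_laird(t, v):
    """Random-effects pooling; returns pooled value, its variance, tau^2, and I^2 (%)."""
    w = 1.0 / v                               # fixed-effect (inverse-variance) weights
    t_fixed = np.sum(w * t) / np.sum(w)
    q = np.sum(w * (t - t_fixed) ** 2)        # Cochran's Q
    df = len(t) - 1
    c = np.sum(w) - np.sum(w ** 2) / np.sum(w)
    tau2 = max(0.0, (q - df) / c)             # between-study variance estimate
    w_star = 1.0 / (v + tau2)                 # random-effects weights
    pooled = np.sum(w_star * t) / np.sum(w_star)
    var_pooled = 1.0 / np.sum(w_star)
    i2 = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0
    return pooled, var_pooled, tau2, i2

# Made-up study data: counts of students screening positive and sample sizes.
x = np.array([120.0, 45.0, 300.0, 60.0, 210.0])
n = np.array([500.0, 150.0, 1400.0, 300.0, 800.0])

t, v = freeman_tukey(x, n)
pooled, var_pooled, tau2, i2 = dersimonian_laird(t, v)

ci = pooled + np.array([-1.96, 1.96]) * np.sqrt(var_pooled)
# Simple sin^2 back-transformation to the proportion scale (an approximation).
p, lo, hi = np.sin(pooled) ** 2, np.sin(ci[0]) ** 2, np.sin(ci[1]) ** 2
print(f"Pooled prevalence {p:.2f} (95% CI {lo:.2f}-{hi:.2f}); tau^2 = {tau2:.4f}; I^2 = {i2:.1f}%")
```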

References

Woolston, C. Why mental health matters. Nature 557, 129–131 (2018).

Woolston, C. A love-hurt relationship. Nature 550 , 549–552 (2017).

Woolston, C. PhD poll reveals fear and joy, contentment and anguish. Nature 575 , 403–406 (2019).

Byrom, N. COVID-19 and the research community: The challenges of lockdown for early-career researchers. Elife 9 , e59634 (2020).

Alberts, B., Kirschner, M. W., Tilghman, S. & Varmus, H. Rescuing US biomedical research from its systemic flaws. Proc. Natl. Acad. Sci. USA 111 , 5773–5777 (2014).

McDowell, G. S. et al. Shaping the future of research: A perspective from junior scientists. F1000Res 3 , 291 (2014).

Petersen, A. M., Riccaboni, M., Stanley, H. E. & Pammoli, F. Persistence and uncertainty in the academic career. Proc. Natl. Acad. Sci. USA 109 , 5213–5218 (2012).

Leshner, A. I. Rethinking graduate education. Science 349 , 349 (2015).

National Academies of Sciences Engineering and Medicine. Graduate STEM Education for the 21st Century (National Academies Press, 2018).

Charles, S. T., Karnaze, M. M. & Leslie, F. M. Positive factors related to graduate student mental health. J. Am. Coll. Health https://doi.org/10.1080/07448481.2020.1841207 (2021).

Riddle, E. S., Niles, M. T. & Nickerson, A. Prevalence and factors associated with food insecurity across an entire campus population. PLoS ONE 15 , e0237637 (2020).

Soldavini, J., Berner, M. & Da Silva, J. Rates of and characteristics associated with food insecurity differ among undergraduate and graduate students at a large public university in the Southeast United States. Prev. Med. Rep. 14 , 100836 (2019).

Clark, U. S. & Hurd, Y. L. Addressing racism and disparities in the biomedical sciences. Nat. Hum. Behav. 4 , 774–777 (2020).

Gardner, S. K. The challenges of first-generation doctoral students. New Dir. High. Educ. 2013 , 43–54 (2013).

Seay, S. E., Lifton, D. E., Wuensch, K. L., Bradshaw, L. K. & McDowelle, J. O. First-generation graduate students and attrition risks. J. Contin. High. Educ. 56 , 11–25 (2008).

Rummell, C. M. An exploratory study of psychology graduate student workload, health, and program satisfaction. Prof. Psychol. Res. Pract. 46 , 391–399 (2015).

Salzer, M. S. A comparative study of campus experiences of college students with mental illnesses versus a general college sample. J. Am. Coll. Health 60 , 1–7 (2012).

Hysenbegasi, A., Hass, S. & Rowland, C. The impact of depression on the academic productivity of university students. J. Ment. Health Policy Econ. 8 , 145–151 (2005).

Harvey, S. et al. Depression and work performance: An ecological study using web-based screening. Occup. Med. (Lond.) 61 , 209–211 (2011).

Eisenberg, D., Golberstein, E. & Hunt, J. B. Mental health and academic success in college. BE J. Econ. Anal. Policy 9 , 40 (2009).

Lovitts, B. E. Who is responsible for graduate student attrition--the individual or the institution? Toward an explanation of the high and persistent rate of attrition. In:  Annual Meeting of the American Education Research Association (New York, 1996). Available at: https://eric.ed.gov/?id=ED399878.

Gardner, S. K. Student and faculty attributions of attrition in high and low-completing doctoral programs in the United States. High. Educ. 58 , 97–112 (2009).

Lovitts, B. E. Leaving the Ivory Tower: The Causes and Consequences of Departure from Doctoral Study (Rowman & Littlefield Publishers, 2001).

Rigler Jr, K. L., Bowlin, L. K., Sweat, K., Watts, S. & Throne, R. Agency, socialization, and support: a critical review of doctoral student attrition. In:  Proceedings of the Third International Conference on Doctoral Education: Organizational Leadership and Impact , University of Central Florida, Orlando, (2017).

Golde, C. M. The role of the department and discipline in doctoral student attrition: Lessons from four departments. J. High. Educ. 76 , 669–700 (2005).

Council of Graduate Schools. PhD Completion and Attrition: Analysis of Baseline Program Data from the PhD Completion Project (Council of Graduate Schools, 2008).

National Research Council. A Data-Based Assessment of Research-Doctorate Programs in the United States (The National Academies Press, 2011).

Akhtar, P. et al. Prevalence of depression among university students in low and middle income countries (LMICs): A systematic review and meta-analysis. J. Affect. Disord. 274 , 911–919 (2020).

Mortier, P. et al. The prevalence of suicidal thoughts and behaviours among college students: A meta-analysis. Psychol. Med. 48 , 554–565 (2018).

Rotenstein, L. S. et al. Prevalence of depression, depressive symptoms, and suicidal ideation among medical students: A systematic review and meta-analysis. JAMA 316 , 2214–2236 (2016).

Tsai, J. W. & Muindi, F. Towards sustaining a culture of mental health and wellness for trainees in the biosciences. Nat. Biotechnol. 34 , 353–355 (2016).

Levecque, K., Anseel, F., De Beuckelaer, A., Van der Heyden, J. & Gisle, L. Work organization and mental health problems in PhD students. Res. Policy 46 , 868–879 (2017).

Nagy, G. A. et al. Burnout and mental health problems in biomedical doctoral students. CBE Life Sci. Educ. 18 , 1–14 (2019).

Garcia-Williams, A., Moffitt, L. & Kaslow, N. J. Mental health and suicidal behavior among graduate students. Acad. Psychiatry 28 , 554–560 (2014).

Sheldon, K. M. Emotionality differences between artists and scientists. J. Res. Pers. 28 , 481–491 (1994).

Lightstone, S. N., Swencionis, C. & Cohen, H. W. The effect of bioterrorism messages on anxiety levels. Int. Q. Community Health Educ. 24 , 111–122 (2006).

Clark, C. R., Mercer, S. H., Zeigler-Hill, V. & Dufrene, B. A. Barriers to the success of ethnic minority students in school psychology graduate programs. School Psych. Rev. 41 , 176–192 (2012).

Eisenberg, D., Gollust, S. E., Golberstein, E. & Hefner, J. L. Prevalence and correlates of depression, anxiety, and suicidality among university students. Am. J. Orthopsychiatry 77 , 534–542 (2007).

Farrer, L. M., Gulliver, A., Bennett, K., Fassnacht, D. B. & Griffiths, K. M. Demographic and psychosocial predictors of major depression and generalised anxiety disorder in Australian university students. BMC Psychiatry 16 , 241 (2016).

Lipson, S. K., Zhou, S., Wagner, B. III., Beck, K. & Eisenberg, D. Major differences: Variations in undergraduate and graduate student mental health and treatment utilization across academic disciplines. J. Coll. Stud. Psychother. 30 , 23–41 (2016).

Lilly, F. R. W. et al. The influence of racial microaggressions and social rank on risk for depression among minority graduate and professional students. Coll. Stud. J. 52 , 86–104 (2018).

Lipson, S. K., Raifman, J., Abelson, S. & Reisner, S. L. Gender minority mental health in the U.S.: Results of a national survey on college campuses. Am. J. Prev. Med. 57 , 293–301 (2019).

Lipson, S. K., Kern, A., Eisenberg, D. & Breland-Noble, A. M. Mental health disparities among college students of color. J. Adolesc. Health 63 , 348–356 (2018).

Baker, A. J. L. & Chambers, J. Adult recall of childhood exposure to parental conflict: Unpacking the black box of parental alienation. J. Divorce Remarriage 52 , 55–76 (2011).

Golberstein, E., Eisenberg, D. & Gollust, S. E. Perceived stigma and mental health care seeking. Psychiatr. Serv. 59 , 392–399 (2008).

Hindman, R. K., Glass, C. R., Arnkoff, D. B. & Maron, D. D. A comparison of formal and informal mindfulness programs for stress reduction in university students. Mindfulness 6 , 873–884 (2015).

Hirai, R., Frazier, P. & Syed, M. Psychological and sociocultural adjustment of first-year international students: Trajectories and predictors. J. Couns. Psychol. 62 , 438–452 (2015).

Lee, J. S. & Jeong, B. Having mentors and campus social networks moderates the impact of worries and video gaming on depressive symptoms: A moderated mediation analysis. BMC Public Health 14 , 1–12 (2014).

Corral-Frias, N. S., Velardez Soto, S. N., Frias-Armenta, M., Corona-Espinosa, A. & Watson, D. Concurrent validity and reliability of two short forms of the mood and anxiety symptom questionnaire in a student sample from Northwest Mexico. J. Psychopathol. Behav. Assess. 41 , 304–316 (2019).

Meghani, D. T. & Harvey, E. A. Asian Indian international students’ trajectories of depression, acculturation, and enculturation. Asian Am. J. Psychol. 7 , 1–14 (2016).

Barry, K. M., Woods, M., Martin, A., Stirling, C. & Warnecke, E. A randomized controlled trial of the effects of mindfulness practice on doctoral candidate psychological status. J. Am. Coll. Health 67 , 299–307 (2019).

Bolotnyy, V., Basilico, M. & Barreira, P. Graduate student mental health: lessons from American economics departments. J. Econ. Lit. (in press).

Barry, K. M., Woods, M., Warnecke, E., Stirling, C. & Martin, A. Psychological health of doctoral candidates, study-related challenges and perceived performance. High. Educ. Res. Dev. 37 , 468–483 (2018).

Boyle, K. M. & McKinzie, A. E. The prevalence and psychological cost of interpersonal violence in graduate and law school. J. Interpers. Violence   36 , 6319-6350 (2021).

Heinrich, D. L. The causal influence of anxiety on academic achievement for students of differing intellectual ability. Appl. Psychol. Meas. 3 , 351–359 (1979).

Hish, A. J. et al. Applying the stress process model to stress-burnout and stress-depression relationships in biomedical doctoral students: A cross-sectional pilot study. CBE Life Sci. Educ. 18 , 1–11 (2019).

Jamshidi, F. et al. A cross-sectional study of psychiatric disorders in medical sciences students. Mater. Sociomed. 29 , 188–191 (2017).

Liu, C. et al. Prevalence and associated factors of depression and anxiety among doctoral students: The mediating effect of mentoring relationships on the association between research self-efficacy and depression/anxiety. Psychol. Res. Behav. Manag. 12 , 195–208 (2019).

Sverdlik, A. & Hall, N. C. Not just a phase: Exploring the role of program stage on well-being and motivation in doctoral students. J. Adult Contin. Educ. 26 , 1–28 (2019).

University of California Office of the President. The University of California Graduate student Well-Being Survey Report (University of California, 2017).

The Graduate Assembly. Graduate Student Happiness & Well-Being Report (University of California at Berkeley, 2014).

Richardson, C. M., Trusty, W. T. & George, K. A. Trainee wellness: Self-critical perfectionism, self-compassion, depression, and burnout among doctoral trainees in psychology. Couns. Psychol. Q. 33 , 187-198 (2020).

Radloff, L. S. The CES-D Scale: A self-report depression scale for research in the general population. Appl. Psychol. Meas. 1 , 385–401 (1977).

Spitzer, R. L., Kroenke, K., Williams, J. B. W. & Lowe, B. A brief measure for assessing generalized anxiety disorder: The GAD-7. Arch. Intern. Med. 166 , 1092–1097 (2006).

Mata, D. A. et al. Prevalence of depression and depressive symptoms among resident physicians: A systematic review and meta-analysis. JAMA 314, 2373–2383 (2015).

Gloria, C. T. & Steinhardt, M. A. Flourishing, languishing, and depressed postdoctoral fellows: Differences in stress, anxiety, and depressive symptoms. J. Postdoct. Aff. 3 , 1–9 (2013).

Levis, B. et al. Patient Health Questionnaire-9 scores do not accurately estimate depression prevalence: Individual participant data meta-analysis. J. Clin. Epidemiol. 122 , 115-128.e111 (2020).

Tsai, A. C. Reliability and validity of depression assessment among persons with HIV in sub-Saharan Africa: Systematic review and meta-analysis. J. Acquir. Immune Defic. Syndr. 66 , 503–511 (2014).

Baxter, A. J., Scott, K. M., Vos, T. & Whiteford, H. A. Global prevalence of anxiety disorders: A systematic review and meta-regression. Psychol. Med. 43 , 897–910 (2013).

Ferrari, A. et al. Global variation in the prevalence and incidence of major depressive disorder: A systematic review of the epidemiological literature. Psychol. Med. 43 , 471–481 (2013).

Hasin, D. S. et al. Epidemiology of adult DSM-5 major depressive disorder and its specifiers in the United States. JAMA Psychiatry   75 , 336–346 (2018).

US Substance Abuse and Mental Health Services Administration. Key Substance Use and Mental Health Indicators in the United States: Results from the 2019 National Survey on Drug Use and Health (Center for Behavioral Health Statistics and Quality, Substance Abuse and Mental Health Services Administration, 2020).

Kessler, R. C. et al. Lifetime prevalence and age-of-onset distributions of DSM-IV disorders in the National Comorbidity Survey Replication. Arch. Gen. Psychiatry 62 , 593–602 (2005).

Working Group report to the Advisory Committee to the NIH Director. Changing the Culture to End Sexual Harassment (U. S. National Institutes of Health, 2019).

National Academies of Sciences Engineering and Medicine. Sexual Harassment of Women: Climate, Culture, and Consequences in Academic Sciences, Engineering, and Medicine (The National Academies Press, 2018).

Wadman, M. A hidden history. Science 360 , 480–485 (2018).

Hockfield, S., Magley, V. & Yoshino, K. Report of the External Review Committee to Review Sexual Harassment at Harvard University (External Review Committee to Review Sexual Harassment at Harvard University, 2021).

Bartlett, T. & Gluckman, N. She left Harvard. He got to stay. Chronicle High. Educ. 64 , A14 (2021). Available at: https://www.chronicle.com/article/she-left-harvard-he-got-to-stay/.

Tseng, M. et al. Strategies and support for Black, Indigenous, and people of colour in ecology and evolutionary biology. Nat. Ecol. Evol. 4 , 1288–1290 (2020).


Acknowledgements

We thank the following investigators for generously sharing their time and/or data: Gordon J. G. Asmundson, Ph.D., Amy J. L. Baker, Ph.D., Hillel W. Cohen, Dr.P.H., Alcir L. Dafre, Ph.D., Deborah Danoff, M.D., Daniel Eisenberg, Ph.D., Lou Farrer, Ph.D., Christy B. Fraenza, Ph.D., Patricia A. Frazier, Ph.D., Nadia Corral-Frías, Ph.D., Hanga Galfalvy, Ph.D., Edward E. Goldenberg, Ph.D., Robert K. Hindman, Ph.D., Jürgen Hoyer, Ph.D., Ayako Isato, Ph.D., Azharul Islam, Ph.D., Shanna E. Smith Jaggars, Ph.D., Bumseok Jeong, M.D., Ph.D., Ju R. Joeng, Nadine J. Kaslow, Ph.D., Rukhsana Kausar, Ph.D., Flavius R. W. Lilly, Ph.D., Sarah K. Lipson, Ph.D., Frances Meeten, D.Phil., D.Clin.Psy., Dhara T. Meghani, Ph.D., Sterett H. Mercer, Ph.D., Masaki Mori, Ph.D., Arif Musa, M.D., Shizar Nahidi, M.D., Ph.D., Arthur M. Nezu, Ph.D., D.H.L., Angelo Picardi, M.D., Nicole E. Rossi, Ph.D., Denise M. Saint Arnault, Ph.D., Sagar Sharma, Ph.D., Bryony Sheaves, D.Clin.Psy., Kennon M. Sheldon, Ph.D., Daniel Shepherd, Ph.D., Keisuke Takano, Ph.D., Sara Tement, Ph.D., Sherri Turner, Ph.D., Shawn O. Utsey, Ph.D., Ron Valle, Ph.D., Caleb Wang, B.S., Pengju Wang, Katsuyuki Yamasaki, Ph.D.

A.C.T. acknowledges funding from the Sullivan Family Foundation. This paper does not reflect an official statement or opinion from the County of San Mateo.  

Author information

Authors and Affiliations

Center for Global Health, Massachusetts General Hospital, Boston, MA, USA

Emily N. Satinsky & Alexander C. Tsai

San Mateo County Behavioral Health and Recovery Services, San Mateo, CA, USA

Tomoki Kimura

Department of Epidemiology and Population Health, Stanford University, Palo Alto, CA, USA

Mathew V. Kiang

Center for Population Health Sciences, Stanford University School of Medicine, Palo Alto, CA, USA

Harvard Society of Fellows, Harvard University, Cambridge, MA, USA

Rediet Abebe

Department of Electrical Engineering and Computer Science, University of California Berkeley, Berkeley, CA, USA

Department of Economics, Hankamer School of Business, Baylor University, Waco, TX, USA

Scott Cunningham

Department of Sociology, Washington University in St. Louis, St. Louis, MO, USA

Department of Microbiology, Immunology, and Molecular Genetics, Institute for Quantitative and Computational Biosciences, University of California Los Angeles, Los Angeles, CA, USA

Xiaofei Lin

Departments of Newborn Medicine and Psychiatry, Brigham and Women’s Hospital, Boston, MA, USA

Cindy H. Liu

Harvard Medical School, Boston, MA, USA

Cindy H. Liu & Alexander C. Tsai

Centre for Global Health, Edinburgh Medical School, Usher Institute, University of Edinburgh, Edinburgh, Scotland, UK

Department of Psychiatry, University of Michigan, Ann Arbor, MI, USA

Department of Global Health, Institute for Life Course Health Research, Stellenbosch University, Cape Town, South Africa

Mark Tomlinson

School of Nursing and Midwifery, Queens University, Belfast, UK

Fielding School of Public Health, Los Angeles Area Health Services Research Training Program, University of California Los Angeles, Los Angeles, CA, USA

Miranda Yaver

Mongan Institute, Massachusetts General Hospital, Boston, MA, USA

Alexander C. Tsai

Contributions

A.C.T. conceptualized the study and provided supervision. T.K. conducted the search. E.N.S. contacted authors for additional information not reported in published articles. E.N.S. and T.K. extracted data and performed the quality assessment appraisal. E.N.S. and A.C.T. conducted the statistical analysis and drafted the manuscript. T.K., M.V.K., R.A., S.C., H.L., X.L., C.H.L., I.R., S.S., M.T. and M.Y. contributed to the interpretation of the results. All authors provided critical feedback on drafts and approved the final manuscript.

Corresponding authors

Correspondence to Emily N. Satinsky or Alexander C. Tsai .

Ethics declarations

Competing interests.

The authors declare no competing interests.

Additional information

Publisher's note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .

About this article

Cite this article.

Satinsky, E.N., Kimura, T., Kiang, M.V. et al. Systematic review and meta-analysis of depression, anxiety, and suicidal ideation among Ph.D. students. Sci Rep 11 , 14370 (2021). https://doi.org/10.1038/s41598-021-93687-7

Received : 31 March 2021

Accepted : 24 June 2021

Published : 13 July 2021

DOI : https://doi.org/10.1038/s41598-021-93687-7


Meta-Analysis With Complex Research Designs: Dealing With Dependence From Multiple Measures and Multiple Group Comparisons

Nancy Scammacca

Nancy Scammacca, PhD, is a research associate at the Meadows Center for Preventing Educational Risk in the College of Education at the University of Texas at Austin, 1 University Station D4900, Austin, TX 78712

Greg Roberts

Greg Roberts, PhD, is the director of the Vaughn Gross Center for Reading and Language Arts and the associate director of the Meadows Center for Preventing Educational Risk at the University of Texas at Austin

Karla K. Stuebing

Karla K. Stuebing, PhD, is a research professor at the Texas Institute for Measurement, Evaluation, and Statistics at the University of Houston

Previous research has shown that treating dependent effect sizes as independent inflates the variance of the mean effect size and introduces bias by giving studies with more effect sizes more weight in the meta-analysis. This article summarizes the different approaches to handling dependence that have been advocated by methodologists, some of which are more feasible to implement with education research studies than others. A case study using effect sizes from a recent meta-analysis of reading interventions is presented to compare the results obtained from different approaches to dealing with dependence. Overall, mean effect sizes and variance estimates were found to be similar, but estimates of indexes of heterogeneity varied. Meta-analysts are advised to explore the effect of the method of handling dependence on the heterogeneity estimates before conducting moderator analyses and to choose the approach to dependence that is best suited to their research question and their data set.

The inclusion of statistically dependent effect sizes in a meta-analysis can present a serious threat to the validity of the meta-analytic results. Dependence can arise in a number of ways. One common way that dependence presents itself occurs when a study included in a meta-analysis uses more than one outcome measure, such as a reading intervention study that measures both reading fluency and reading comprehension. The resulting effect sizes are dependent because the same participants were measured more than once. Dependence also commonly occurs when a study's research design includes two treatment groups compared with the same control group. Because the same control group participants are included in each treatment/control comparison, the resulting effect sizes are statistically dependent. Failure to resolve or model dependence results in artificially reduced estimates of variance, which in turn inflates Type I error ( Borenstein, Hedges, Higgins, & Rothstein, 2009a ). Treating dependent effect sizes as if they were independent also gives more weight in the meta-analysis to studies that have multiple measures or more than two groups. To avoid these threats to the validity of the meta-analytic results, statistical dependence must either be resolved so that each study contributes a single independent effect size or be modeled using methodological techniques designed to handle dependence.

Prevalence of the Problem of Dependence in Meta-Analyses in Education Research

Education research studies commonly yield a set of dependent effect sizes. For example, Edmonds et al. (2009) extracted 78 effect sizes from 21 studies of interventions for struggling readers, an average of nearly four per study across multiple measures and multiple dependent comparisons. Tran, Sanchez, Arellano, and Swanson (2011) calculated 107 effect sizes from multiple measures across 13 response-to-instruction studies, meaning that an average of eight outcome measures had been used in these studies. In their meta-analysis on the effectiveness of Reading Recovery, D'Agostino and Murphy (2004) calculated 1,379 effect sizes across the multiple outcomes, group comparisons, and testing occasions in the 36 studies that met their inclusion criteria, for an average of approximately 38 effect sizes per study. In a review of education meta-analyses published since 2000, Ahn, Ames, and Myers (2012) found that 37.5% of the 56 meta-analyses included in their report averaged three or more effect sizes per study. The average number of effect sizes per study across all 56 meta-analyses was 3.71. Just 7 of the 56 meta-analytic reports stated that dependence of effect sizes was not an issue in their data set.

Statistical Methods for Handling Dependence From Multiple Outcomes

Much has been written by prominent researchers about how to resolve dependence of effect sizes in a meta-analysis when faced with multiple outcomes. Some methods are more complex and challenging to implement with education research studies than others. On the less complex end of the spectrum, Card (2012) recommended choosing between two straightforward methods of resolving dependence. The first is to select a single outcome to include based on the focus of the meta-analysis. He cautioned that this approach is appropriate only when the meta-analyst can make a strong case for including one outcome over others. A second option, and one that is frequently implemented in education meta-analyses, is to aggregate all measures by computing an average effect size. Although computing an average effect across measures within a study is easy to do, the result may not be the best measure of the effect of the study. This approach effectively punishes studies for attempting to measure the impact of their treatment across a broad array of measures. For example, researchers testing a reading fluency intervention might be interested in knowing if their intervention has any effect on reading comprehension. Such a study conceivably could result in a large effect of 0.80 on a measure of reading fluency and a small effect of 0.20 on a measure of reading comprehension. If these measures are averaged for inclusion in a meta-analysis that is focused broadly on the effect of reading interventions on reading skills, the resulting effect size of 0.50 would not accurately represent the effectiveness of this study's intervention.

Reflecting on this problem, Marín-Martínez and Sánchez-Meca (1999) cautioned meta-analysts to consider whether or not effect sizes within a study are homogenous before averaging them to resolve dependence. If effects within studies are not homogenous, another approach to resolving dependence should be implemented. Cooper (1998) suggested a variation on simply averaging all outcomes. In his shifting-unit-of-analysis approach, effect sizes within studies are combined based on the variables of interest in the meta-analysis to provide a single estimate of the overall effect to include in the meta-analysis. Cooper stated that this approach minimizes violations of the assumption of independence of the effect sizes while preserving as much of the data as possible. However, using this approach can result in running multiple meta-analyses for each outcome type, with some analyses having a small number of studies and little power as a result.

More complex approaches to dealing with dependence from multiple outcomes involve accounting for the correlation between measures when computing a summary effect size across multiple dependent outcomes. As Borenstein et al. (2009b) pointed out, averaging effect sizes across measures makes an implicit assumption that the correlation between measures is 1.0—meaning that each outcome essentially duplicates the information provided by other outcomes. When meta-analysts ignore dependence and include effect sizes from all measures as if the effects were independent, the assumed correlation between measures is 0—meaning that each outcome contributes information that is unrelated to any other outcome. According to Borenstein et al., when making either of these assumptions about the correlation between measures, the result is an incorrect estimate of the variance of the composite effect size that the study contributes to the meta-analysis. Assuming a correlation of 1.0 results in an overestimate of the variance of the composite effect size because all the information provided by the outcomes is redundant. Assuming a correlation of 0 results in an underestimate of the variance for the composite effect size because each effect size is seen as contributing independent information. A larger estimate of the variance results in a larger confidence interval around the effect size and an increased likelihood of finding that the effect size is not significantly different from zero (a Type II error). The opposite is true when an inaccurately small estimate of the variance is calculated, resulting in an inflation of the Type I error rate.
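
The consequences of the assumed correlation can be made concrete with the composite-effect formula described by Borenstein et al. (2009b): the variance of an average of m correlated effect sizes is (1/m²)[Σ V_i + Σ_{i≠j} r_ij √(V_i V_j)]. The short Python sketch below (Python is used here only for illustration; it is not the software discussed in this article) applies that formula to the hypothetical fluency and comprehension effects from the earlier example.

```python
import numpy as np

def composite_effect(effects, variances, r):
    """Average m correlated effect sizes from one study.

    Var(mean) = (1/m^2) * [sum(V_i) + sum_{i != j} r * sqrt(V_i * V_j)],
    using a single common correlation r between every pair of outcomes.
    """
    effects = np.asarray(effects, dtype=float)
    variances = np.asarray(variances, dtype=float)
    m = len(effects)
    sd = np.sqrt(variances)
    cov_sum = variances.sum()
    for i in range(m):
        for j in range(m):
            if i != j:
                cov_sum += r * sd[i] * sd[j]
    return effects.mean(), cov_sum / m**2

# Hypothetical study: g = 0.80 on fluency and g = 0.20 on comprehension,
# each with a sampling variance of 0.04.
for r in (0.0, 0.5, 1.0):
    mean_g, var_g = composite_effect([0.80, 0.20], [0.04, 0.04], r)
    print(f"assumed r = {r:.1f}: composite g = {mean_g:.2f}, variance = {var_g:.3f}")
# Output: variance = 0.020, 0.030, 0.040 -- the composite looks most precise
# under the (usually false) assumption that the outcomes are uncorrelated.
```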

When the correlation between outcomes is known, the dependence can be accounted for mathematically when computing a mean effect for a study. Rosenthal and Rubin (1986) ; Raudenbush, Becker, and Kalaian (1988) ; Gleser and Olkin (1994) ; and Borenstein et al. (2009b) provided equations for calculating an effect size for a study with multiple outcomes that include the correlations between the outcomes. More complex approaches incorporate the correlation between measures into multivariate models for conducting meta-analysis. Kalaian and Raudenbush (1996) described and illustrated the use of multivariate multilevel modeling to conduct meta-analysis in a way that models dependency in effects within studies. In their example, they meta-analyzed studies of the impact of coaching on performance on the Scholastic Aptitude Test (SAT) math and verbal subtests. Given that the correlation between these subtests has been reported by the developers of the SAT, Kalaian and Raudenbush were able to compute the covariance matrix needed for implementing their modeling technique. The structural equation modeling (SEM) approach to meta-analysis proposed by Cheung (2010) also requires that the correlations between multiple measures within a study are known.

In her discussion of multivariate meta-analysis, Becker (2000) acknowledged that in many cases the meta-analyst does not know the correlations between multiple measures used in a particular study. She suggested consulting previous studies or manuals from test publishers to impute a correlation. Theoretically, such an approach makes sense. However, it is often impractical or impossible for a meta-analyst working with education research studies to implement any of these suggestions. Researcher-designed measures are commonly used in education research, and the correlations between such measures are not routinely reported. When a study measures outcomes using standardized tests, the correlations between them might be available from test publishers or in the research literature, but the extent to which these correlations generalize beyond the normative sample to a special population (such as students with learning disabilities) is rarely documented.

When it is not possible to locate the correlation from these sources, Becker (2000) and Borenstein et al. (2009b) suggested conducting sensitivity analyses to determine a possible range of correlations between measures. Conducting sensitivity analyses can be a workable solution when a small number of measures are involved and only a few studies use multiple measures. However, when more than two or three measures are used in multiple studies to be included in the meta-analysis, conducting sensitivity analyses for every pair of outcomes quickly becomes so laborious and time-consuming that it is not feasible, especially because computer programs to conduct sensitivity analysis are not available. In these instances, averaging outcomes with an assumed correlation of 1.0 and inflating Type II error is considered the more conservative approach.

Statistical Methods for Handling Dependence From Multiple Group Comparisons

Many of the same researchers who have suggested methods for dealing with dependence when including studies with multiple outcomes also have described methods for dealing with dependence from multiple group comparisons within studies. Gleser and Olkin (1994) provided equations for a matrix of effect sizes that come from a set of studies where multiple treatments are compared with a no-treatment control group. They assumed that the corpus of studies that the meta-analyst has gathered includes a common and defined set of treatments (such as several types of diet or exercise routines), with some studies including perhaps two of these treatments compared with a no-treatment control group and others including three or four or more. In this scenario, regression models can be fit that account for the dependence in the group comparisons within studies. This approach works well in fields where treatments are standardized or come from a common set of treatments, such as medicine. Within education research, it is rare that the same treatments are present across studies, making it impossible to construct the type of matrix needed to implement Gleser and Olkin's approach.

Borenstein et al. (2009c) proposed a way of dealing with the dependence inherent in multiple group comparisons that is more easily applied to education research. First, they advised meta-analysts to consider if their interest is in comparing the effects of two specific treatments or in computing a combined overall effect of treatment compared with the control group. If one's interest is in comparing treatments, and two treatment groups are compared with a single control group in a given study, an effect size can be computed from the information provided for the two treatments that indicates the benefit of one treatment over the other. In this case, effect sizes from treatment–control comparisons are not included in the meta-analysis, eliminating the dependence from the shared control group. This approach makes sense only if the two treatments are present in a similar enough form across the corpus of studies to allow for similar contrasts across the meta-analysis.

If one's interest is in the overall effect of different types of treatment compared with a control group, calculating a combined effect size and its variance for studies in which multiple treatments are compared with the same control group is a straightforward process as long as the number of participants in each treatment group and the control group is known. The correlation between the effect size for the first treatment group versus the control group and the effect size for the second treatment group versus the control group can be calculated based on the number of participants in each group. A combined weighted mean effect size can be computed that gives more weight to an effect from a treatment with a larger sample size than to another treatment in the same study with a smaller sample size. The variance of this combined effect can be computed in a manner that takes into account the proportion of all study participants that are shared members of the control group. For example, if 50 participants are in one treatment group, 50 participants are in a second treatment group, and 50 participants are in the control group, the proportion of shared participants in the comparison of each treatment group with the control group is 0.50 because 50% of the participants in each comparison are the same. More simply, when means, standard deviations, and sample sizes are available for all treatment groups and the control group, the meta-analyst can calculate a weighted mean and standard deviation for all treatment conditions combined and use these, together with the control group's mean and standard deviation, to calculate a standardized mean difference effect size.
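
The merged-group shortcut described at the end of the preceding paragraph can be sketched in a few lines; the Python below is an illustration under invented group means, standard deviations, and sample sizes, not the combined-variance procedure of Borenstein et al. (2009c) itself. The combined standard deviation is built from the within-arm sums of squares plus the spread of the arm means around the combined mean.

```python
import numpy as np

def merge_groups(means, sds, ns):
    """Collapse several treatment arms into one combined treatment group."""
    means, sds, ns = map(np.asarray, (means, sds, ns))
    n_total = ns.sum()
    grand_mean = np.sum(ns * means) / n_total
    # Within-arm sums of squares plus the between-arm spread of the means.
    ss = np.sum((ns - 1) * sds**2) + np.sum(ns * (means - grand_mean)**2)
    return grand_mean, np.sqrt(ss / (n_total - 1)), n_total

def hedges_g(m1, s1, n1, m2, s2, n2):
    """Hedges's g and its sampling variance for two independent groups."""
    df = n1 + n2 - 2
    s_pooled = np.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / df)
    d = (m1 - m2) / s_pooled
    j = 1 - 3 / (4 * df - 1)                      # small-sample correction
    v_d = (n1 + n2) / (n1 * n2) + d**2 / (2 * (n1 + n2))
    return j * d, j**2 * v_d

# Hypothetical study: two treatment arms of 50 share one control group of 50.
m_t, s_t, n_t = merge_groups(means=[105, 101], sds=[14, 15], ns=[50, 50])
g, v_g = hedges_g(m_t, s_t, n_t, m2=100, s2=15, n2=50)
print(f"combined-treatment g = {g:.2f} (variance = {v_g:.3f})")
```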

Borenstein et al.'s (2009c) approach to computing a combined, weighted mean effect is easier to apply to the types of research methodologies typically found in education research reports than Gleser and Olkin's (1994) approach. It is a sound means of preserving the statistical independence of effect sizes in a meta-analysis. However, independence comes at the cost of losing information about the unique effect of each treatment. Averaging the effects of treatment may not represent the intent of a study's researchers when they designed a multiple treatment versus control study. Additionally, when there are vast differences in the effectiveness of the treatments, this approach handicaps the most effective treatment in a study by averaging it with less effective treatments. When there are many studies with multiple dependent comparisons in a meta-analysis, the overall mean effect will be reduced by the presence of weaker and stronger treatments homogenized into a middling studywise effect size.

New Approaches to Dealing With Dependence From Multiple Outcomes and Comparisons

Robust Variance Estimation

Hedges, Tipton, and Johnson (2010) proposed a new approach to dealing with dependence that can be applied no matter the source or sources of dependence in a data set of effect sizes. Known as robust variance estimation (RVE), it overcomes the need to include the known correlations between measures in order to include all effect sizes from all measures and all group comparisons in the meta-analysis. Instead of modeling dependence as is done in multivariate approaches to meta-analysis that require known correlations, RVE mathematically adjusts the standard errors of the effect sizes to account for the dependence ( Hedges et al., 2010 ; Tanner-Smith & Tipton, 2013 ). An intraclass correlation (ρ) that represents the within-study correlation between effects must be specified when implementing RVE to estimate the effect size weights, but because RVE is not affected very much by the choice of weights, it does not matter whether the correlation is precise ( Hedges et al., 2010 ; Tanner-Smith & Tipton, 2013 ). Because the same ρ is applied to all dependent effect sizes within each study in the meta-analysis, sensitivity analysis with a range of values for ρ can be conducted quite easily to determine how the correlation that is chosen affects the resulting estimates of the mean effect and its variance. Dependence from multiple sources, including multiple measures and multiple group comparisons, can be accommodated simultaneously ( Tanner-Smith & Tipton, 2013 ). RVE is reasonably easy to implement with syntax for several popular statistical software packages provided by Tanner-Smith and Tipton and available from the Peabody Research Institute (n.d.) .
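
Stripped of the refinements in the published macros, the core of RVE is a cluster-robust ("sandwich") variance for the weighted mean effect: residuals are summed within studies before being squared, so within-study dependence is absorbed rather than ignored. The Python sketch below is a bare-bones illustration with hypothetical effect sizes and deliberately simplified weights (no tau-squared term and none of Hedges, Tipton, and Johnson's small-sample corrections); an actual analysis should use the macros cited above and respect the minimum numbers of studies discussed in the next paragraph.

```python
import numpy as np

# Hypothetical dependent effect sizes: study id, Hedges's g, sampling variance.
study = np.array([1, 1, 1, 2, 2, 3, 4, 4])
g     = np.array([0.35, 0.20, 0.50, 0.10, 0.25, 0.60, 0.05, 0.15])
v     = np.array([0.04, 0.05, 0.04, 0.03, 0.03, 0.06, 0.02, 0.02])

ids = np.unique(study)
# Simplified weights: each study's total weight is the inverse of its mean
# sampling variance, split evenly across its effect sizes.
w = np.empty_like(g)
for j in ids:
    mask = study == j
    w[mask] = 1.0 / (mask.sum() * v[mask].mean())

b = np.sum(w * g) / np.sum(w)                 # weighted mean effect
resid = g - b
# Sandwich variance: sum the weighted residuals within each study, then square.
num = sum(np.sum(w[study == j] * resid[study == j])**2 for j in ids)
se_robust = np.sqrt(num) / np.sum(w)
print(f"mean g = {b:.3f}, robust SE = {se_robust:.3f}")
# Only four studies are used here to keep the arithmetic visible; as noted
# below, RVE needs at least 10 independent studies in practice.
```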

There are some important limitations to consider when implementing RVE. Because the math involved in RVE relies on the central limit theorem, simulation studies have shown that a minimum of 10 independent studies are needed to estimate a reliable main effect and a minimum of 40 independent studies are needed to estimate a meta-regression coefficient ( Hedges et al., 2010 ; Tanner-Smith & Tipton, 2013 ). RVE can be used only in meta-regression. If a meta-analysis involves categorical moderators with more than two levels, the dummy-coding of variables required to analyze all pairwise comparisons can be cumbersome to implement in currently available statistical software. Additionally, because the degrees of freedom used to test the statistical significance of the meta-regression coefficients is equal to the number of independent studies minus the number of parameters estimated, meta-analyses with a small number of studies will be restricted in the number of covariates that can be included ( Tanner-Smith & Tipton, 2013 ). Tanner-Smith and Tipton's simulation studies indicated that a minimum of 40 studies with an average of at least five effect sizes per study are needed to estimate a meta-regression coefficient. When fewer studies are included, they found that the confidence interval for the coefficient tends to be too narrow, meaning that the p value for the estimate will be inaccurate. Nevertheless, RVE is a mathematically sound method for modeling dependence that should be strongly considered by education meta-analysts when their data sets meet its requirements.

Three-Level Meta-Analysis

Konstantopoulos (2011) proposed three-level meta-analysis as an extension of the use of two-level random-effects models in meta-analysis. In two-level models, Level 2 variance represents between-study differences in effect size estimates, with the assumption that all studies are contributing an independent effect size. Three-level meta-analysis allows for clustering of dependent effect sizes within studies at Level 2; between-study effects are then estimated at Level 3. Cheung (2013) described how three-level meta-analysis can be used to pool dependent effect sizes within each study, modeling the within-study dependence at Level 2 and the between-study mean effect size and variance at Level 3. This approach to dependence can be applied when the correlations between the dependent effect sizes are not known, as is usually the case when multiple measures are used in a study. Unlike in RVE, three-level meta-analysis provides estimates of both the Level 2 (within study) and Level 3 (between study) variance so that meta-analysts can determine where the variation in effects is the greatest. Covariates can be included in the three-level model at both Level 2 and Level 3 to attempt to explain the variance present at each level.
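
In symbols (with notation introduced here for illustration rather than taken from either paper), the i-th effect size in study j is modeled as y_ij = β0 + u_j(3) + u_ij(2) + e_ij, where Var(e_ij) = v_ij is the known sampling variance, Var(u_ij(2)) = τ²(2) is the Level 2 variance among effect sizes within a study, and Var(u_j(3)) = τ²(3) is the Level 3 variance between studies.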

Cheung (2013) described how to use SEM to conduct a three-level meta-analysis. Some advantages of the SEM approach include its ability to handle missing data on covariates and to provide a means for empirical comparison of the two-level and three-level models to determine which model best fits the data. Cheung provided syntax and a package for running three-level meta-analysis in R, making it easier for other meta-analysts to implement his approach. Like RVE, three-level meta-analysis is a promising solution to the problem of dependence in meta-analysis. However, as Cheung noted, additional studies are needed to demonstrate the strengths and potential limitations of both approaches to dependence because neither technique has been used widely in published research.

How Education Researchers Handle Dependence in Meta-Analysis

Drawing from the methods described above, education researchers have implemented a variety of means of handling dependence from multiple measures and/or multiple group comparisons when conducting a meta-analysis. In their meta-analysis of the effect of writing instruction on reading, Graham and Hebert (2011) resolved the dependence from multiple measures using Cooper's (1998) shifting-unit-of-analysis approach. They separated measures by construct (e.g., reading comprehension, reading fluency) and meta-analyzed effect sizes for each construct separately. When studies included multiple measures of a single construct, they included the average of the effects in their meta-analysis. Graham and Hebert's approach yielded multiple sets of independent effects that they meta-analyzed separately. This approach also can be implemented when studies provide multiple treatment comparisons by conducting separate meta-analyses for each type of treatment.

The advantage of this approach is that it allows the meta-analyst to retain all of the information from each study while preserving statistical independence. However, to do so the meta-analyst must run multiple analyses and cannot draw conclusions about the overall effect from the corpus of studies. Additionally, dividing the corpus of studies into groups by measure type and/or treatment type can result in a significant reduction in power. Nevertheless, this approach remains popular with meta-analysts and has been implemented in a number of other recent meta-analyses in education (e.g., Flynn, Zheng, & Swanson, 2012 ; Gersten et al., 2009 ; Tran et al., 2011 ). In their review of 56 education meta-analyses, Ahn et al. (2012) found that 26.8% of the meta-analyses in their data set used the shifting-unit-of-analysis approach to resolve dependence.

Another common approach to handling dependence in meta-analysis is to select a single measure and/or group comparison that seems to best represent the study's primary research question. Graham and Hebert (2011) took this approach to resolving the statistical dependence in studies that had multiple group comparisons, and Chambers (2004) used it in a meta-analysis of the effects of computers in classrooms. In their meta-analysis of reading comprehension instruction for students with learning disabilities, Berkeley, Scruggs, and Mastropieri (2010) implemented a hybrid of this approach and the approach described above, selecting a single outcome measure from each study that best represented the research question while conducting separate meta-analyses for different types of measures and for measures of treatment effect, maintenance effect, and generalization effect. This approach was used in 14.3% of the 56 education meta-analyses reviewed by Ahn et al. (2012) .

The main advantage of this method of resolving dependence is that it contributes the effect size that conveys the central finding of the study to the meta-analysis. When meta-analysts select a single outcome or group comparison for the meta-analysis, studies that include additional outcomes or comparisons in an attempt to measure the effects of their intervention more broadly or compare it with other types of treatment do not have the effect size of their primary outcome or comparison of interest reduced by averaging it with smaller effects from tertiary outcomes or weaker treatments. However, in large-scale or multicomponent interventions, researchers often expect to see effects of treatment on multiple types of measures or are interested in determining which of several treatments is most effective. In these cases, it can be difficult for the meta-analyst to pick a single measure or group comparison that will best represent the study in the meta-analysis, especially if the study's authors are not clear in describing the outcome or comparison they view as most central to the purpose of their study.

Ahn et al. (2012) documented the use of other approaches to dealing with dependence in the 56 education meta-analyses they reviewed. The approach most commonly used in these meta-analyses was averaging or weighted averaging of the dependent effect sizes within studies. This approach was implemented in 42.9% of the meta-analyses. They also found that a multivariate approach was used in 7.1% of the meta-analyses. A combination of approaches was used in 12.5% of the meta-analyses. In 32.2% of the meta-analyses, researchers either failed to mention whether dependence was an issue in their data set or mentioned it but did not report how they handled it.

Because Hedges et al.'s (2010) RVE approach is a relatively new technique for dealing with dependence, published examples of its use are few in number. Wilson, Tanner-Smith, Lipsey, Steinka-Fry, and Morrison (2011) used RVE to account for dependence in their meta-analysis of high school dropout prevention programs that included 504 effect sizes from 317 independent samples and 152 studies. Uttal et al. (2013) implemented RVE in a meta-analysis that included 1,038 effect sizes from 206 studies that assessed the effect of training programs on spatial skills. Outside of educational research, RVE has been implemented in meta-analyses on the effectiveness of outpatient substance abuse treatment for adolescents ( Tanner-Smith, Wilson, & Lipsey, 2013 ), the relationship between social goals and aggressive behavior in youth ( Samson, Ojanen, & Hollo, 2012 ), and the effect of mindfulness-based stress reduction on physical and mental health in adults ( de Vibe, Bjørndal, Tipton, Hammerstrøm, & Kowalski, 2012 ). No published examples of the use of three-level meta-analysis to handle dependence were found in the educational research literature. Both Konstantopoulos (2011) and Cheung (2013) illustrated the use of three-level meta-analysis with extant data sets. Van den Noortgate, López-López, Marín-Martínez, and Sánchez-Meca (2013) used simulated data sets in their exploration of three-level meta-analysis as a method for handling dependence.

A Case Study in Methods of Dealing With Dependence

To better understand the impact of the choices education meta-analysts face when dealing with multiple measures and multiple group comparisons within studies, different methods of handling dependence were implemented using a set of effect sizes from a meta-analytic study by Scammacca, Roberts, Vaughn, and Stuebing (in press) of reading interventions for struggling readers in Grades 4 to 12. We chose to use an extant set of effect sizes from a recent meta-analysis rather than a simulated data set because we believed that a real-world data set can better emulate the types and nature of dependence that typically exist in studies that education researchers struggle to meta-analyze. In doing so, we acknowledge that simulation studies make an important contribution to the knowledge base and are a necessary next step to the work we present here.

The Scammacca et al. (in press) report involved separate and combined analyses of effect sizes from research published between 1980 and 2004 and between 2005 and 2011. For this report, only effect sizes from the 2005 to 2011 group of 50 studies were used. This more recent group contained many more instances of studies with more than two groups ( k = 17) and with multiple measures ( k = 43) than the earlier group of studies. Because these more complex research designs make up a larger proportion of the set of 50 studies, this set is more representative than the older one of the kinds of study collections that would lead a meta-analyst to confront the issues addressed here. See the appendix for the effect size data used in this case study.

This case study sought to answer the following research questions:

  • Research Question 1: How do different approaches to dealing with dependence in data from multiple outcomes within studies affect meta-analytic estimates of mean effect size, variance, and indexes of heterogeneity?
  • Research Question 2: How do different approaches to dealing with dependence in data from multiple group comparisons within studies affect meta-analytic estimates of mean effect size, variance, and indexes of heterogeneity?

The approaches to handling dependence in this case study include those implemented in other meta-analytic studies that involved education data and others chosen to illustrate alternative means of estimating the overall effect from a study with multiple dependent effects. Additionally, meta-analyses were attempted with all outcomes and all groups as independent for comparison purposes.

Procurement of Corpus of Studies

The studies used in Scammacca et al. (in press) were located through a computer search of ERIC and PsycINFO using descriptors related to reading, learning difficulties/disabilities, and reading intervention; a search of abstracts from other published research syntheses and meta-analyses and reference lists in seminal studies; and a hand search of major journals in which previous intervention studies were published. Studies were included in the meta-analysis if (a) participants were English-speaking struggling readers in Grades 4 to 12 (age 9–21), (b) the study's research design used a multiple-group experimental or quasi-experimental treatment-comparison or multiple-treatment comparison design, (c) the intervention provided any type of reading instruction, (d) data were reported for at least one dependent measure that assessed one or more reading constructs, and (e) sufficient data for calculating effect sizes and standard errors were provided.

Studies that met criteria were coded using a code sheet that included elements specified in the What Works Clearinghouse Design and Implementation Assessment Device ( Institute of Education Sciences, 2008 ) and used in previous research ( Scammacca et al., 2007 ). Researchers with doctorate degrees and doctoral students with experience coding studies for other meta-analyses and research syntheses completed the code sheets. All coders had completed training on how to complete the code sheet and had reached a high level of reliability with others coding the same article independently. Every study was independently coded by two raters. When discrepancies were found between coders, they reviewed the article together and discussed the coding until consensus was reached.

Effect Size Calculation

Effect sizes were calculated using the Hedges (1981) procedure for unbiased effect sizes for Cohen's d (this statistic is also known as Hedges's g ). Hedges's g was calculated using the posttest means and standard deviations for treatment and comparison (or multiple treatment) groups when such data were provided. In some cases, Cohen's d effect sizes were reported and means and standard deviations were not available. For these effects, Cohen's d for posttest mean differences between groups and the treatment and comparison group sample sizes were used to calculate Hedges's g . For each effect, estimates of Hedges's g were weighted by the inverse of the variance to account for variations in precision based on sample size in the studies. All effects were computed using the Comprehensive Meta Analysis (Version 2.2.064) software ( Borenstein, Hedges, Higgins, & Rothstein, 2011 ). Effects were coded for all measures and pairwise group comparisons between treatment and control groups or different treatment groups when no control group was included in the study. The 36 research reports yielded 50 independent studies with a total of 366 effect sizes, an average of about 7 effect sizes per study. At this point, researchers in the original study were faced with the dilemma of how to combine multiple effect sizes from multiple measures and multiple dependent group comparisons to best estimate the mean effect of reading intervention.
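
For the conversion just described (from a reported Cohen's d plus the group sample sizes to Hedges's g and an inverse-variance weight), the standard formulas can be applied directly. The Python snippet below is a small illustration with made-up numbers, not the Comprehensive Meta Analysis computation itself.

```python
def d_to_g(d, n1, n2):
    """Convert Cohen's d to Hedges's g; return g, its variance, and its weight."""
    df = n1 + n2 - 2
    j = 1 - 3 / (4 * df - 1)                    # Hedges (1981) correction factor
    v_d = (n1 + n2) / (n1 * n2) + d**2 / (2 * (n1 + n2))
    g, v_g = j * d, j**2 * v_d
    return g, v_g, 1 / v_g                      # inverse-variance weight

print(d_to_g(0.45, n1=32, n2=30))               # hypothetical reported values
```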

Calculating Mean Effects From Studies With Multiple Measures

Nearly all studies provided data on multiple outcome measures. Scammacca et al. (in press) averaged the effect sizes from multiple measures within each pairwise group comparison using the procedure recommended by Card (2012) , and included the average effect size and the average of its standard error in the meta-analysis. Five other approaches to computing a mean effect across multiple measures within a single independent group comparison were conducted for the present report:

  • The measure that yielded the highest effect size was selected for each independent group comparison.
  • A measure was selected at random using a random number generator for each independent group comparison.
  • A measure was selected for each independent group comparison that seemed to best represent the primary focus of the study's intervention.
  • Measures were analyzed separately based on the type of reading skill measured (fluency, vocabulary, spelling, reading comprehension, word, and word fluency) for each independent group comparison and mean effects were calculated for each skill.
  • All measures were treated as independent estimates of effects for each independent group comparison.

All five approaches were used in meta-analyses to calculate a mean effect and its standard error across all studies for all measures included in the research reports and for norm-referenced, standardized measures only. Because researcher-designed measures tend to have lower reliability than standardized measures, repeating the meta-analyses with only standardized measures allows researchers to investigate the effects of different approaches to dealing with multiple measures while constraining some of the influence of measurement error.

Calculating Mean Effects From Studies With Multiple Dependent Groups

Seventeen of the research reports contained more than one dependent treatment-control or multiple-treatment group comparison. In Scammacca et al. (in press) , the procedure recommended by Borenstein et al. (2009c) was implemented for comparisons that involved dependent groups. This procedure involves computing a combined weighted mean effect size and its standard error in a manner that reflects the degree of dependence in the data. Four other approaches to computing a mean effect across multiple dependent group comparisons were completed for this report:

  • The group comparison that yielded the highest mean effect size across measures included in the study was selected and included in the meta-analysis.
  • A group comparison was selected at random using a random number generator and its mean effect size across measures was included in the meta-analysis.
  • A group comparison was selected that seemed to best represent the primary focus of the study's intervention and its mean effect size across measures was included in the meta-analysis.
  • All group comparisons were treated as independent and each mean effect size across measures was included in the meta-analysis.

Meta-analyses were then conducted on the resulting data using all types of measures and using standardized measures only, for the reason stated above. In each of the different analyses for the multiple dependent group comparisons, the average of the effect sizes for all measures involving the group comparison of interest was used to hold constant the effect of multiple measures while examining different approaches to the problem of multiple dependent group comparisons. In a similar way, the effect of multiple dependent group comparisons was held constant in the analyses involving multiple measures. In these analyses, the Borenstein et al. (2009c) method of combining multiple dependent comparisons was implemented. Results for the RVE approach and three-level meta-analysis are reported separately.

Meta-Analytic Procedures

For all the methods of dealing with multiple measures and multiple dependent group comparisons, a random-effects model was used to analyze effect sizes. This model allows for generalizations to be made beyond the studies included in the analysis to the population of studies from which they come. Mean effect size statistics and their standard errors were computed and heterogeneity of variance was evaluated using the Q statistic, the I² statistic, and the tau-squared statistic. For all but the RVE and three-level meta-analysis approaches, the meta-analyses were conducted in Comprehensive Meta Analysis (Version 2.2.064) software ( Borenstein et al., 2011 ). For the RVE approach, unrestricted, intercept-only meta-regression models were run in SPSS using a macro provided by Tanner-Smith and Tipton (2013) and Peabody Research Institute (n.d.) . Sensitivity analysis with a range of values for ρ was conducted to determine the effect of varying intraclass correlations on estimates of the mean effect size, the Q statistic, and the tau-squared statistic. For three-level meta-analysis, Cheung's (2013) R syntax for the metaSEM package he authored was used. Finally, meta-regression was conducted using number of measures and number of groups as a predictor of effect size in a mixed-effects model using unrestricted maximum likelihood estimation.
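
The quantities named in this paragraph can be computed directly once each study contributes a single independent effect size. The sketch below uses the standard formulas for Q, the DerSimonian and Laird (1986) tau-squared, I², and the random-effects mean and standard error; the input values are hypothetical and the code is an illustration, not the Comprehensive Meta Analysis implementation used in the case study.

```python
import numpy as np

def random_effects_summary(g, v):
    """Q, DerSimonian-Laird tau^2, I^2, and the random-effects mean and SE."""
    g, v = np.asarray(g, float), np.asarray(v, float)
    w = 1 / v                                   # fixed-effect weights
    mean_fe = np.sum(w * g) / np.sum(w)
    q = np.sum(w * (g - mean_fe) ** 2)
    df = len(g) - 1
    c = np.sum(w) - np.sum(w ** 2) / np.sum(w)
    tau2 = max(0.0, (q - df) / c)               # method-of-moments estimate
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0
    w_re = 1 / (v + tau2)                       # random-effects weights
    mean_re = np.sum(w_re * g) / np.sum(w_re)
    se_re = np.sqrt(1 / np.sum(w_re))
    return mean_re, se_re, q, tau2, i2

# One independent effect size (and its variance) per study; values are made up.
print(random_effects_summary([0.10, 0.25, 0.40, 0.05, 0.60],
                             [0.02, 0.03, 0.04, 0.02, 0.05]))
```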

Approaches to Handling Dependence From Multiple Measures

The meta-analyses that implemented different methods of resolving the dependence resulting from having multiple measures within a study produced some points of similarity and some differences across the methods used when considering all types of measures. See Table 1 for results for all types of measures. The mean effect size and variance when using the mean of measures method were nearly identical to those obtained when all measures were treated as independent and when a measure was selected based on the primary research question. Using the highest effect size produced a much larger mean effect, as would be expected, and a slightly larger variance. Random selection of an effect size also produced a somewhat larger estimate of the mean effect and a slightly larger variance. Estimates of heterogeneity varied widely depending on the method used to resolve dependence from multiple measures. Treating all measures as independent, using the highest effect size, and randomly selecting an effect size resulted in the largest values across all three indexes of heterogeneity.

When all types of measures were meta-analyzed by the type of reading skill tested (the shifting-unit-of-analysis approach), reading comprehension measures, word measures, and word fluency measures produced mean effect sizes and variances similar to those from the mean-of-measures, all-measures-independent, and select-by-research-question approaches. Results differed for measures of fluency, vocabulary, and spelling skills, perhaps due to the small number of studies that included vocabulary and spelling measures or due to true differences in the effectiveness of reading interventions on these reading skills. The results for indexes of heterogeneity also differed depending on the domain of reading skills analyzed, with fluency measures showing the most heterogeneity and spelling measures the least.

Results of the meta-analyses that included standardized measures only are shown in Table 2 . As with the analyses of all measures, the mean of measures approach resulted in a mean effect size and variance similar to those obtained when all measures were treated as independent or when measures were selected based on the study's primary research question. Random selection of a measure resulted in a similar mean effect size and variance as well, whereas choosing the measure with the highest effect size resulted in the largest mean effect but a similar variance to other approaches. The Q and I² measures of heterogeneity again had large values for the independent approach, the random approach, the highest effect size approach, and the select-by-research-question approach. In the analyses by reading skill, fluency and vocabulary measures again showed much smaller effects than other domains. The I² index of heterogeneity had large values for reading comprehension and small values for other domains, likely due to the large number of studies that included a standardized measure of reading comprehension.

Approaches to Handling Dependence From Multiple Group Comparisons

Results from the five approaches used to deal with the dependence from multiple group comparisons are shown in Table 3 (all types of measures) and Table 4 (standardized measures only). Mean effect sizes and variances were very similar across all approaches for both sets of analyses. Selecting the group comparison with the highest effect size resulted in the largest mean effect size for both all types of measures and standardized measures only, but not by much. The difference was especially small in the analysis of standardized measures. Interestingly, the variance did not increase when all group comparisons were treated as independent. However, treating all group comparisons as independent did result in very large estimates on the Q index of heterogeneity. Tau-squared and I² estimates were less affected.

The Robust Variance Estimation Approach to Handling Dependence

Results from the meta-analyses that implemented RVE are shown in Table 5 (all types of measures) and Table 6 (standardized measures only). Results are reported to the fifth decimal place to show that very little change occurred in the mean effect size and measures of heterogeneity based on varying the intraclass correlation ρ. Compared with the results presented above for other approaches to dealing with dependence, the RVE approach produced estimates of the mean effect size and standard error that were very similar to those found when using the mean of measures approach and the weighted mean for group comparisons approach when looking both at the meta-analysis of all types of measures and only at standardized measures. The Q statistic for the meta-analysis of standardized measures was just slightly larger using the RVE approach than in other methods used to handle multiple dependent group comparisons, but the increase was enough to indicate the presence of statistically significant heterogeneity. The estimates of heterogeneity generally were larger than those obtained with other methods of dealing with dependence but less than those obtained when dependence was ignored.

Handling Dependence With Three-Level Meta-Analysis

Results from the three-level meta-analysis using all types of outcome measures were similar to those obtained using RVE. The estimate of the mean effect was 0.27 with a standard error of 0.05 (95% confidence interval = 0.18, 0.37). The tau-squared estimate of variance was 0.10 ( SE = 0.01) at Level 2 (within studies) and 0.07 ( SE = 0.02) at Level 3 (between studies), meaning that more within-study than between-study variation was present. In three-level meta-analysis, I² is calculated based on the Q statistic; thus, it is on a different scale and is interpreted differently than the I² statistics that have been presented previously in this article. The Level 2 I² and Level 3 I² values were 0.48 and 0.35, respectively, meaning that 48% of the variation in effect sizes was due to within-study factors and 35% of the variation was due to between-study factors. These findings suggest that Level 2 covariates should be included in the model to account for the within-study variation before between-study covariates are considered.

A three-level meta-analysis using effect sizes from standardized measures only was attempted; it failed to converge on an optimal solution. When restricted maximum likelihood estimation was used to examine the variance components, a solution was reached that estimated tau-squared at Level 2 (within studies) as 1 × 10⁻¹⁰ and at Level 3 (between studies) as 0.02. Therefore, it seems likely that the very small value for tau-squared at Level 2 caused the model to fail to converge on an optimal solution.

Meta-Regression on Number of Measures and Number of Groups

To evaluate the relationship between effect size and the number of measures and number of groups in a study, meta-regression was conducted with each of these variables as a predictor of effect size in a mixed-effects model using unrestricted maximum likelihood estimation. Meta-regressions were run separately for effect sizes from all types of measures and from standardized measures only. The number of measures used in a study was not a statistically significant predictor of effect size when considering all types of outcome measures (β = 0.00, SE = 0.01, Q-model = 0.06, df = 1, p = .805, T² = 0.03) or only standardized outcome measures (β = −0.01, SE = 0.01, Q-model = 0.43, df = 1, p = .513, T² = 0.00). See Figures 1 and 2 for scatterplots of effect sizes by number of measures.
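The following sketch shows the form of such a mixed-effects meta-regression with a single continuous moderator, using inverse-variance weights of 1/(v + tau-squared) and a Q-model test on 1 df. The tau-squared value, the data, and the function are hypothetical; this is an illustration of the technique rather than the analysis reported above.

# Minimal sketch of a mixed-effects meta-regression with one moderator.
import numpy as np
from scipy import stats

def meta_regression(effects, variances, moderator, tau2):
    y = np.asarray(effects, float)
    v = np.asarray(variances, float)
    x = np.asarray(moderator, float)
    w = 1.0 / (v + tau2)                        # random-effects weights
    X = np.column_stack([np.ones_like(x), x])   # intercept + moderator

    XtWX = X.T @ (w[:, None] * X)
    b = np.linalg.solve(XtWX, X.T @ (w * y))    # weighted least squares
    cov_b = np.linalg.inv(XtWX)                 # model-based covariance of b
    se = np.sqrt(np.diag(cov_b))

    # Q-model: improvement in weighted fit over the intercept-only model,
    # compared against a chi-square distribution with 1 df (one moderator).
    mu0 = np.sum(w * y) / np.sum(w)
    q_total = np.sum(w * (y - mu0) ** 2)
    q_resid = np.sum(w * (y - X @ b) ** 2)
    q_model = q_total - q_resid
    p = 1.0 - stats.chi2.cdf(q_model, df=1)
    return {"beta": b[1], "se": se[1], "Q_model": q_model, "p": p}

# Hypothetical effects, variances, and number of measures per effect.
es  = [0.40, 0.35, 0.20, 0.15, 0.30, 0.25]
v   = [0.04, 0.04, 0.05, 0.06, 0.05, 0.04]
n_measures = [4, 3, 1, 1, 2, 2]
print(meta_regression(es, v, n_measures, tau2=0.03))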

Figure 1. Scatterplot of effect size by number of measures used in a study.

Figure 2. Scatterplot of effect size by number of standardized measures used in a study.

The number of groups used in a study was not a statistically significant predictor of effect size when considering all types of outcome measures (β = 0.00, SE = 0.05, Q-model = 0.00, df = 1, p = .999, T² = 0.03). However, the number of groups was a statistically significant predictor of effect size when considering only standardized measures (β = −0.06, SE = 0.03, Q-model = 5.04, df = 1, p = .024, T² = 0.00), with effect sizes from standardized measures decreasing as the number of groups increased. See Figures 3 and 4 for scatterplots of effect sizes by number of groups.

Figure 3. Scatterplot of effect sizes by number of groups in a study using all types of measures.

Figure 4. Scatterplot of effect sizes by number of groups in a study using standardized measures only.

The case study presented here was conducted to demonstrate how different methods of dealing with dependence from multiple measures and multiple group comparisons within studies affect meta-analytic results. Results indicated that most approaches to handling dependence produced similar estimates of the mean effect and variance for this set of effect sizes. The mean effect and variance were especially similar when only standardized measures were included in the analyses. The expected increase in the variance of the mean effect was not observed in the case study data when all measures or all group comparisons were included in the analysis as if they were independent effect sizes.

These findings are not what would be expected based on previous research. In their simulation study of dependence from multiple group comparisons, Kim and Becker (2010) found that the variance of the mean effect increased as the proportion of studies in the meta-analysis that contained dependent comparisons increased. They found that the variance estimate was at least somewhat inflated when as few as 20% of the studies in the meta-analysis included dependent comparisons. In the present case study, 34% of the studies had dependent group comparisons. However, Kim and Becker also noted that variance estimates were most inflated when treatment groups were larger than control groups, which was not generally the case in the studies included in the present case study. Additionally, Kim and Becker's simulations involved a set of 10 studies with 12 and 15 effect sizes representing 20% and 50% dependence. In the present case study, 50 studies with 92 effect sizes were included in the analysis with multiple dependent group comparisons. It may be that the larger number of effect sizes in the case study contributed to the difference in findings. Additional simulation studies with a larger set of effect sizes and additional scenarios of dependence are needed to determine under what circumstances and to what extent dependence inflates variance estimates.
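As one way to explore this question, a small simulation along the following lines (our sketch, not Kim and Becker's design) compares the nominal standard error of the pooled mean, computed under an assumption of independence, with the empirical variability of that mean across replications when two treatment arms in each study share a control group. The sample sizes, true effect size, and number of studies are arbitrary choices for illustration.

# Minimal simulation sketch: dependent effect sizes from shared control groups.
import numpy as np

rng = np.random.default_rng(0)
n_per_group, true_d, n_studies, n_reps = 30, 0.3, 50, 500

def one_replication():
    es, var = [], []
    for _ in range(n_studies):
        c = rng.normal(0.0, 1.0, n_per_group)              # shared control group
        for _ in range(2):                                   # two treatment arms
            t = rng.normal(true_d, 1.0, n_per_group)
            sp = np.sqrt((t.var(ddof=1) + c.var(ddof=1)) / 2)
            d = (t.mean() - c.mean()) / sp                   # standardized mean difference
            es.append(d)
            var.append(2 / n_per_group + d**2 / (4 * n_per_group))  # approx. var of d
    es, var = np.array(es), np.array(var)
    w = 1 / var
    mean = np.sum(w * es) / np.sum(w)
    nominal_se = np.sqrt(1 / np.sum(w))                      # assumes independence
    return mean, nominal_se

means, nominal = zip(*(one_replication() for _ in range(n_reps)))
print("empirical SD of pooled mean:            ", np.std(means))
print("average nominal SE (independence assumed):", np.mean(nominal))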

Based on a single set of effects from reading intervention studies, the case study presented here cannot provide definitive guidance on the best way to resolve dependence resulting from multiple measures or multiple group comparisons within studies. Indeed, there may not be only one best way to resolve dependence, given that data sets of effect sizes can differ widely in the degree and nature of the dependence present in them. Additionally, the choice of method in dealing with dependence must take into account the overall purpose and research questions behind the meta-analysis. Despite being unable to offer definitive guidance, the case study presented here raises some important issues for meta-analysts of education research to consider as they deal with dependence. Furthermore, it draws attention to ways in which primary researchers can assist meta-analysts by providing the information needed to make the best decision about how to handle the multiple dependent effects from their studies.

Implications for Education Meta-Analysts

Consider the Effect of Your Method of Dealing With Dependence When Conducting Moderator Analyses

The greatest differences between the various methods of dealing with dependence in the case study were seen in the indexes of heterogeneity. For the methods of handling multiple measures in the meta-analysis of all measures, Q values varied widely, ranging as high as 363.04. A good deal of variation also was seen in the Q values in the meta-analysis for standardized measures only, though the range was smaller. Q values were especially large when all group comparisons were treated as independent, providing another reason why this approach to dealing with dependence should be avoided.

Meta-analysts who find large Q values likely will want to identify meaningful moderator variables within their set of studies that explain the heterogeneity. If a moderator variable is confounded with the approach taken to deal with dependence, the moderator analysis could falsely attribute significant differences to a characteristic of the studies when the heterogeneity being explained is actually due to the method used to deal with dependence. This false finding could occur, for example, if grade level were used as a moderator variable and multiple measures were more commonly administered to students in upper grades than to students in lower grades. Conversely, when the indexes of heterogeneity are inflated by the method chosen to deal with dependence but that method is not confounded with any moderator variable, meta-analysts may be unable to find moderators that explain the heterogeneity and may never realize that its actual source is the method used to deal with dependence.

Therefore, it is critical for meta-analysts to consider and account for the impact of their method of dealing with dependence when their meta-analysis results in a large Q statistic. If possible, running the meta-analysis using only standardized, norm-referenced measures instead of researcher-developed measures also can help detect whether the size of the Q statistic is due to variance in measurement rather than to meaningful differences between studies. Additionally, looking beyond the often-reported Q statistic and evaluating I² and tau-squared as measures of heterogeneity is important. Simulation studies are needed to model the impact of different approaches to dealing with dependence on estimates of heterogeneity and to determine under what circumstances these estimates are artificially inflated or constrained.

Match Your Method of Dealing With Dependence to Your Research Question and Your Data Set

When the correlations between measures are known or can be obtained, a multivariate approach using multilevel modeling or SEM is the best way to handle dependence in meta-analytic data sets. Because correlations are rarely available in education meta-analyses, different research methodologists have recommended different approaches to dependence when correlations are not known. Given the lack of guidance currently available on a single optimal way to deal with dependence from multiple measures or multiple group comparisons when correlations are not available, meta-analysts of education research would do best to choose an approach that is suited to the data available from their set of studies and the questions they hope to answer through their meta-analyses.

When an overall estimate of the effect of treatment is more central to the purpose of a meta-analysis and/or treatments and measures are sufficiently similar to one another, the RVE, three-level meta-analysis, and mean of measures approaches are the best options. The RVE and three-level meta-analysis approaches are particularly well suited to meta-analyses with a large number of studies and to situations in which continuous or dichotomous categorical moderators are of interest. Because three-level meta-analysis provides variance estimates at both the within-study and between-study levels and allows covariates to be introduced at both levels, it is ideal for meta-analyses in which researchers are interested in exploring sources of systematic variance at the within-study level. However, given the newness of RVE and three-level meta-analysis and the small number of published meta-analyses that have used these techniques, more research is needed to explore their benefits and limitations as solutions to the problem of dependence in education meta-analyses before they can be considered the optimal methods.

When the meta-analyst's research questions are addressed to the mean effect of particular domains of treatment or measurement and obtaining an overall estimate of effect across domains is not important, the shifting-unit-of-analysis approach works well if domains of treatment or measurement can be sorted cleanly into categories and enough independent effect sizes are available in each domain to allow for sufficient power. When the research question driving the meta-analysis is clearly and at least somewhat narrowly defined, selecting a single measure and/or group comparison that is best aligned with the purpose of the meta-analysis is a reasonable approach to dependence. Intentional selection of a single measure or comparison also is warranted when the authors of the studies in the meta-analysis define causal models in a way that makes clear which measures their treatments should affect most directly or which treatment in a multiple-comparison study is most central to their hypothesis. In data sets where a great deal of dependence is present, multiple approaches to resolving dependence might be attempted and the range of the mean effect, variance, and indexes of heterogeneity for each approach reported.

Practice Full Disclosure

The American Psychological Association's Meta-Analysis Reporting Standards (American Psychological Association, 2008) recommended that meta-analysts describe the method used to arrive at a single independent effect size for studies that contain multiple dependent effect sizes. However, it seems that this recommendation is not routinely followed. Ahn et al. (2012) reported that 32.2% of the education meta-analyses they reviewed failed to disclose any information about the dependence present in their corpus of studies or how it was handled. Similarly, in a review of a random sample of 100 meta-analyses in psychology and related disciplines, Dieckmann, Malle, and Bodner (2009) found that information on dependence was missing from 34% of the reports they reviewed. Given the importance of maintaining statistical independence in a meta-analysis, failure to report the extent and type of dependence present in one's corpus of studies and how that dependence was resolved is inexcusable and raises questions about the validity of the meta-analytic results. Additionally, meta-analysts should describe briefly why a particular approach to resolving dependence was chosen and what attempts were made to determine how the mean effect size, variance, and estimates of heterogeneity were affected by the chosen approach.

Implications for Primary Researchers

Clearly Specify All Aspects of Your Causal Model

Given the complexity present in studies where dependence of effect sizes occurs, primary researchers can assist meta-analysts who will be working with these effect sizes by clearly stating the way in which the measures and/or multiple treatment groups are related in their theoretical conceptualization of the effect of their treatment. Readers and future meta-analysts will benefit from knowing which measures a study's researchers view as a primary indicator of the effectiveness of the treatment and which are secondary or tertiary indicators. This information helps meta-analysts who choose to deal with dependence by focusing only on primary indicators to know which measure to include.

When a study introduces dependence from multiple group comparisons, researchers can help meta-analysts by carefully describing the treatment provided to all groups (including details on what, if any, treatment the control group receives, especially if control group members are receiving a business-as-usual treatment provided by their school). Complete descriptions of all groups allow meta-analysts to select group comparisons that align with their research questions and to choose moderator variables to use in attempting to explain heterogeneity. Additionally, primary researchers who include multiple treatment groups should explain why different variations of treatments are being provided, how the treatments differ, and which outcomes are considered primary for each treatment. This information is helpful to meta-analysts who are aggregating independent effects across studies based on similar treatment characteristics or who are interested in including independent effects from certain types of treatment only. Finally, primary researchers might consider whether their causal model would be best represented in future meta-analyses if separate control groups were provided for each treatment group. When researchers are interested in determining the distinct effect of two or more different treatments compared with a control condition, the cost of creating separate control groups would be warranted. Doing so preserves independence while allowing the effect size for each treatment–control comparison to be included rather than averaged.

Provide All Relevant Statistics Needed to Deal With Dependence

Another way that primary researchers can assist meta-analysts in dealing with dependence is by providing the statistical information needed to implement multivariate meta-analytic methods that model dependence. Researchers should provide the correlations between all measures used so that meta-analysts can create the covariance matrices needed for meta-analytic multilevel modeling and multivariate SEM. A simple table of measures and their correlations based on all participants in the study's sample is all that is needed. Additionally, primary researchers with multiple dependent group comparisons should be sure to report the initial sample size and the sample size after attrition for all treatment and comparison groups so that meta-analysts can use this information to calculate a sample-weighted effect size for all treatments versus the shared control group.
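As an illustration of one use of these group-level sample sizes, the sketch below combines the effect sizes from two hypothetical treatment arms that share a control group into a single sample-weighted effect size. The weighting-by-n scheme and the numbers shown are our illustration of one simple possibility, not a prescription from the text.

# Minimal sketch: sample-weighted effect size across treatment arms that
# share one control group. All values are hypothetical.
def sample_weighted_effect(effect_sizes, treatment_ns):
    """Weight each treatment-vs-control effect size by its treatment group n."""
    total_n = sum(treatment_ns)
    return sum(d * n for d, n in zip(effect_sizes, treatment_ns)) / total_n

# Hypothetical study: two treatment arms (n = 40 and n = 25) vs. one control.
print(sample_weighted_effect([0.45, 0.20], [40, 25]))  # -> about 0.35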

With primary researchers increasingly designing more complex, large-scale studies at the request of grant providers, statistical dependence of the resulting effect sizes has become a significant issue for meta-analysts in education research. All the approaches available to meta-analysts to deal with dependence that were described in this report and demonstrated in the case study have benefits and limitations. At the present time, selecting a method for dealing with dependence is one of many choices a researcher must make when conducting a meta-analysis. Further research is needed to test these approaches with simulated and nonsimulated data to determine the conditions under which each approach is best implemented and to provide better guidance in selecting the best approach for a given set of dependent effect sizes. While waiting for this guidance to become available, the best way forward for education meta-analysts is to weigh carefully the advantages and disadvantages of each approach and to provide as much information as possible on the chosen approach so that readers can consider this information when interpreting the meta-analytic results.

Acknowledgments

This research was supported by Grant P50 HD052117 from the Eunice Kennedy Shriver National Institute of Child Health and Human Development and by the Institute of Education Sciences, U.S. Department of Education, through Grant R305F100013 to The University of Texas at Austin as part of the Reading for Understanding Research Initiative. The content is solely the responsibility of the authors and does not necessarily represent the official views of the Eunice Kennedy Shriver National Institute of Child Health and Human Development, the National Institutes of Health, the Institute of Education Sciences, or the U.S. Department of Education.


Contributor Information

Nancy Scammacca, PhD, is a research associate at the Meadows Center for Preventing Educational Risk in the College of Education at the University of Texas at Austin, 1 University Station D4900, Austin, TX 78712.

Greg Roberts, PhD, is the director of the Vaughn Gross Center for Reading and Language Arts and the associate director of the Meadows Center for Preventing Educational Risk at the University of Texas at Austin.

Karla K. Stuebing, PhD, is a research professor at the Texas Institute for Measurement, Evaluation, and Statistics at the University of Houston.

  • Ahn S, Ames AJ, Myers ND. A review of meta-analyses in education: Methodological strengths and weaknesses. Review of Educational Research. 2012;82:436–476. doi:10.3102/0034654312458162.
  • American Psychological Association. Reporting standards for research in psychology: Why do we need them? What might they be? American Psychologist. 2008;63:839–851. doi:10.1037/0003-066X.63.9.839.
  • Becker BJ. Multivariate meta-analysis. In: Tinsley HA, Brown SD, editors. Handbook of applied multivariate statistics and mathematical modeling. San Diego, CA: Academic Press; 2000. pp. 499–525.
  • Berkeley S, Scruggs TE, Mastropieri MA. Reading comprehension instruction for students with learning disabilities, 1995–2006: A meta-analysis. Remedial and Special Education. 2010;31:423–436. doi:10.1177/0741932509355988.
  • Borenstein M, Hedges LV, Higgins JPT, Rothstein HR. Complex data structures: Overview. In: Borenstein M, Hedges LV, Higgins JPT, Rothstein HR, editors. Introduction to meta-analysis. Chichester, England: John Wiley; 2009a. pp. 215–216.
  • Borenstein M, Hedges LV, Higgins JPT, Rothstein HR. Multiple outcomes or time-points within a study. In: Borenstein M, Hedges LV, Higgins JPT, Rothstein HR, editors. Introduction to meta-analysis. Chichester, England: John Wiley; 2009b. pp. 225–238.
  • Borenstein M, Hedges LV, Higgins JPT, Rothstein HR. Multiple comparisons within a study. In: Borenstein M, Hedges LV, Higgins JPT, Rothstein HR, editors. Introduction to meta-analysis. Chichester, England: John Wiley; 2009c. pp. 239–242.
  • Borenstein M, Hedges LV, Higgins JPT, Rothstein HR. Comprehensive meta-analysis (Version 2.2.064). Englewood, NJ: Biostat; 2011.
  • Card N. Applied meta-analysis for social science research. New York, NY: Guilford; 2012.
  • Chambers EA. An introduction to meta-analysis with articles from the Journal of Educational Research (1992–2002). Journal of Educational Research. 2004;98:35–44.
  • Cheung MWL. Fixed-effects meta-analyses as multiple-group structural equation models. Structural Equation Modeling. 2010;17:481–509. doi:10.1080/10705511.2010.489367.
  • Cheung MWL. Modeling dependent effect sizes with three-level meta-analyses: A structural equation modeling approach. Psychological Methods. 2013. Advance online publication. doi:10.1037/a0032968.
  • Cooper H. Synthesizing research: A guide for literature reviews. 3rd ed. Thousand Oaks, CA: Sage; 1998.
  • D'Agostino J, Murphy J. A meta-analysis of Reading Recovery in United States schools. Educational Evaluation and Policy Analysis. 2004;26:23–38. doi:10.3102/01623737026001023.
  • de Vibe M, Bjørndal A, Tipton E, Hammerstrøm K, Kowalski K. Mindfulness based stress reduction (MBSR) for improving health, quality of life, and social functioning in adults. Campbell Systematic Reviews. 2012;3. doi:10.4073/csr.2012.3.
  • Dieckmann NF, Malle BF, Bodner TE. An empirical assessment of meta-analytic practice. Review of General Psychology. 2009;13:101–115. doi:10.1037/a0015107.
  • Edmonds MS, Vaughn S, Wexler J, Reutebuch CK, Cable A, Tackett KK. A synthesis of reading interventions and effects on reading outcomes for older struggling readers. Review of Educational Research. 2009;79:262–300. doi:10.3102/0034654308325998.
  • Flynn LJ, Zheng X, Swanson HL. Instructing struggling older readers: A selective meta-analysis of intervention research. Learning Disabilities Research & Practice. 2012;27:21–32. doi:10.1111/j.1540-5826.2011.00347.x.
  • Gersten R, Chard DJ, Jayanthi M, Baker SK, Morphy P, Flojo J. Mathematics instruction for students with learning disabilities: A meta-analysis of instructional components. Review of Educational Research. 2009;79:1202–1242. doi:10.3102/0034654309334431.
  • Gleser LJ, Olkin I. Stochastically dependent effect sizes. In: Cooper H, Hedges LV, editors. The handbook of research synthesis. New York, NY: Russell Sage Foundation; 1994. pp. 339–355.
  • Graham S, Hebert M. Writing to read: A meta-analysis of the impact of writing and writing instruction on reading. Harvard Educational Review. 2011;81:710–744.
  • Hedges LV. Distribution theory for Glass's estimator of effect size and related estimators. Journal of Education Statistics. 1981;6:107–128.
  • Hedges LV, Tipton E, Johnson MC. Robust variance estimation in meta-regression with dependent effect size estimates. Research Synthesis Methods. 2010;1:39–65. doi:10.1002/jrsm.5.
  • Institute of Education Sciences. What Works Clearinghouse study review standards. 2008. Retrieved from http://ies.ed.gov/ncee/wwc/pdf/reference_resources/wwc_version1_standards.pdf
  • Kalaian HA, Raudenbush SW. A multivariate mixed linear model for meta-analysis. Psychological Methods. 1996;1:227–235. doi:10.1037/1082-989X.1.3.227.
  • Kim R, Becker B. The degree of dependence between multiple-treatment effect sizes. Multivariate Behavioral Research. 2010;45:213–238. doi:10.1080/00273171003680104.
  • Konstantopoulos S. Fixed effects and variance components estimation in three-level meta-analysis. Research Synthesis Methods. 2011;2:61–76. doi:10.1002/jrsm.35.
  • Marín-Martínez F, Sánchez-Meca J. Averaging dependent effect sizes in meta-analysis: A cautionary note about procedures. Spanish Journal of Psychology. 1999;2:32–38.
  • Peabody Research Institute. (n.d.). Methods resources. Retrieved from http://peabody.vanderbilt.edu/research/pri/methods_resources.php
  • Raudenbush SW, Becker BJ, Kalaian H. Modeling multivariate effect sizes. Psychological Bulletin. 1988;103:111–120. doi:10.1037/0033-2909.103.1.111.
  • Rosenthal R, Rubin DB. Meta-analytic procedures for combining studies with multiple effect sizes. Psychological Bulletin. 1986;99:400–406. doi:10.1037/0033-2909.99.3.400.
  • Samson JE, Ojanen T, Hollo A. Social goals and youth aggression: Meta-analysis of prosocial and antisocial goals. Social Development. 2012;21:645–666. doi:10.1111/j.1467-9507.2012.00658.x.
  • Scammacca N, Roberts G, Vaughn S, Edmonds M, Wexler J, Reutebuch CK, Torgesen JK. Reading interventions for adolescent struggling readers: A meta-analysis with implications for practice. Portsmouth, NH: RMC Research Corporation, Center on Instruction; 2007.
  • Scammacca N, Roberts G, Vaughn S, Stuebing K. A meta-analysis of interventions for struggling readers in Grades 4–12: 1980–2011. Journal of Learning Disabilities, in press.
  • Tanner-Smith EE, Tipton E. Robust variance estimation with dependent effect sizes: Practical considerations including a software tutorial in Stata and SPSS. Research Synthesis Methods. 2013. Advance online publication.
  • Tanner-Smith EE, Wilson SJ, Lipsey MW. The comparative effectiveness of outpatient treatment for adolescent substance abuse: A meta-analysis. Journal of Substance Abuse Treatment. 2013;44:145–158. doi:10.1016/j.jsat.2012.05.006.
  • Tran L, Sanchez T, Arellano B, Swanson HL. A meta-analysis of the RTI literature for children at risk for reading disabilities. Journal of Learning Disabilities. 2011;44:283–295. doi:10.1177/0022219410378447.
  • Uttal DH, Meadow NG, Tipton E, Hand LL, Alden AR, Warren C, Newcombe NS. The malleability of spatial skills: A meta-analysis of training studies. Psychological Bulletin. 2013;139:352–402. doi:10.1037/a0028446.
  • Van den Noortgate W, López-López JA, Marín-Martínez F, Sánchez-Meca J. Three-level meta-analysis of dependent effect sizes. Behavior Research Methods. 2013;45:576–594. doi:10.3758/s13428-012-0261-6.
  • Wilson SJ, Tanner-Smith EE, Lipsey MW, Steinka-Fry K, Morrison J. Dropout prevention and intervention programs: Effects on school completion and dropout among school-aged children and youth. Campbell Systematic Reviews. 2011;8. doi:10.4073/csr.2011.8.
