• Methodology
  • Open access
  • Published: 11 October 2016

Reviewing the research methods literature: principles and strategies illustrated by a systematic overview of sampling in qualitative research

  • Stephen J. Gentles 1,4,
  • Cathy Charles 1,
  • David B. Nicholas 2,
  • Jenny Ploeg 3 &
  • K. Ann McKibbon 1

Systematic Reviews volume 5, Article number: 172 (2016)

Abstract

Overviews of methods are potentially useful means to increase clarity and enhance collective understanding of specific methods topics that may be characterized by ambiguity, inconsistency, or a lack of comprehensiveness. This type of review represents a distinct literature synthesis method, although to date, its methodology remains relatively undeveloped despite several aspects that demand unique review procedures. The purpose of this paper is to initiate discussion about what a rigorous systematic approach to reviews of methods, referred to here as systematic methods overviews, might look like by providing tentative suggestions for approaching specific challenges likely to be encountered. The guidance offered here was derived from experience conducting a systematic methods overview on the topic of sampling in qualitative research.

The guidance is organized into several principles that highlight specific objectives for this type of review given the common challenges that must be overcome to achieve them. Optional strategies for achieving each principle are also proposed, along with discussion of how they were successfully implemented in the overview on sampling. We describe seven paired principles and strategies that address the following aspects: delimiting the initial set of publications to consider, searching beyond standard bibliographic databases, searching without the availability of relevant metadata, selecting publications on purposeful conceptual grounds, defining concepts and other information to abstract iteratively, accounting for inconsistent terminology used to describe specific methods topics, and generating rigorous verifiable analytic interpretations. Since a broad aim in systematic methods overviews is to describe and interpret the relevant literature in qualitative terms, we suggest that iterative decision making at various stages of the review process, and a rigorous qualitative approach to analysis are necessary features of this review type.

Conclusions

We believe that the principles and strategies provided here will be useful to anyone choosing to undertake a systematic methods overview. This paper represents an initial effort to promote high quality critical evaluations of the literature regarding problematic methods topics, which have the potential to promote clearer, shared understandings, and accelerate advances in research methods. Further work is warranted to develop more definitive guidance.

Background

While reviews of methods are not new, they represent a distinct review type whose methodology remains relatively under-addressed in the literature despite the clear implications for unique review procedures. One of the few examples to describe it is a chapter containing reflections of two contributing authors in a book of 21 reviews on methodological topics compiled for the British National Health Service, Health Technology Assessment Program [ 1 ]. Notable is their observation of how the differences between the methods reviews and conventional quantitative systematic reviews, specifically attributable to their varying content and purpose, have implications for defining what qualifies as systematic. While the authors describe general aspects of “systematicity” (including rigorous application of a methodical search, abstraction, and analysis), they also describe a high degree of variation within the category of methods reviews itself and so offer little in the way of concrete guidance. In this paper, we present tentative concrete guidance, in the form of a preliminary set of proposed principles and optional strategies, for a rigorous systematic approach to reviewing and evaluating the literature on quantitative or qualitative methods topics. For purposes of this article, we have used the term systematic methods overview to emphasize the notion of a systematic approach to such reviews.

The conventional focus of rigorous literature reviews (i.e., review types for which systematic methods have been codified, including the various approaches to quantitative systematic reviews [ 2 – 4 ], and the numerous forms of qualitative and mixed methods literature synthesis [ 5 – 10 ]) is to synthesize empirical research findings from multiple studies. By contrast, the focus of overviews of methods, including the systematic approach we advocate, is to synthesize guidance on methods topics. The literature consulted for such reviews may include the methods literature, methods-relevant sections of empirical research reports, or both. Thus, this paper adds to previous work published in this journal—namely, recent preliminary guidance for conducting reviews of theory [ 11 ]—that has extended the application of systematic review methods to novel review types that are concerned with subject matter other than empirical research findings.

Published examples of methods overviews illustrate the varying objectives they can have. One objective is to establish methodological standards for appraisal purposes. For example, reviews of existing quality appraisal standards have been used to propose universal standards for appraising the quality of primary qualitative research [ 12 ] or evaluating qualitative research reports [ 13 ]. A second objective is to survey the methods-relevant sections of empirical research reports to establish current practices on methods use and reporting practices, which Moher and colleagues [ 14 ] recommend as a means for establishing the needs to be addressed in reporting guidelines (see, for example [ 15 , 16 ]). A third objective for a methods review is to offer clarity and enhance collective understanding regarding a specific methods topic that may be characterized by ambiguity, inconsistency, or a lack of comprehensiveness within the available methods literature. An example of this is an overview whose objective was to review the inconsistent definitions of intention-to-treat analysis (the methodologically preferred approach to analyze randomized controlled trial data) that have been offered in the methods literature and propose a solution for improving conceptual clarity [ 17 ]. Such reviews are warranted because students and researchers who must learn or apply research methods typically lack the time to systematically search, retrieve, review, and compare the available literature to develop a thorough and critical sense of the varied approaches regarding certain controversial or ambiguous methods topics.

While systematic methods overviews , as a review type, include both reviews of the methods literature and reviews of methods-relevant sections from empirical study reports, the guidance provided here is primarily applicable to reviews of the methods literature since it was derived from the experience of conducting such a review [ 18 ], described below. To our knowledge, there are no well-developed proposals on how to rigorously conduct such reviews. Such guidance would have the potential to improve the thoroughness and credibility of critical evaluations of the methods literature, which could increase their utility as a tool for generating understandings that advance research methods, both qualitative and quantitative. Our aim in this paper is thus to initiate discussion about what might constitute a rigorous approach to systematic methods overviews. While we hope to promote rigor in the conduct of systematic methods overviews wherever possible, we do not wish to suggest that all methods overviews need be conducted to the same standard. Rather, we believe that the level of rigor may need to be tailored pragmatically to the specific review objectives, which may not always justify the resource requirements of an intensive review process.

The example systematic methods overview on sampling in qualitative research

The principles and strategies we propose in this paper are derived from experience conducting a systematic methods overview on the topic of sampling in qualitative research [ 18 ]. The main objective of that methods overview was to bring clarity and deeper understanding to the prominent concepts related to sampling in qualitative research (purposeful sampling strategies, saturation, etc.). Specifically, we interpreted the available guidance, commenting on areas lacking clarity, consistency, or comprehensiveness (without proposing any recommendations on how to do sampling). This was achieved by a comparative and critical analysis of publications representing the most influential (i.e., highly cited) guidance across several methodological traditions in qualitative research.

The specific methods and procedures for the overview on sampling [ 18 ] from which our proposals are derived were developed both after soliciting initial input from local experts in qualitative research and an expert health librarian (KAM) and through ongoing careful deliberation throughout the review process. To summarize, in that review, we employed a transparent and rigorous approach to search the methods literature, selected publications for inclusion according to a purposeful and iterative process, abstracted textual data using structured abstraction forms, and analyzed (synthesized) the data using a systematic multi-step approach featuring abstraction of text, summary of information in matrices, and analytic comparisons.

For this article, we reflected on both the problems and challenges encountered at different stages of the review and our means for selecting justifiable procedures to deal with them. Several principles were then derived by considering the generic nature of these problems, while the generalizable aspects of the procedures used to address them formed the basis of optional strategies. Further details of the specific methods and procedures used in the overview on qualitative sampling are provided below to illustrate both the types of objectives and challenges that reviewers will likely need to consider and our approach to implementing each of the principles and strategies.

Organization of the guidance into principles and strategies

For the purposes of this article, principles are general statements outlining what we propose are important aims or considerations within a particular review process, given the unique objectives or challenges to be overcome with this type of review. These statements follow the general format, “considering the objective or challenge of X, we propose Y to be an important aim or consideration.” Strategies are optional and flexible approaches for implementing the previous principle outlined. Thus, generic challenges give rise to principles, which in turn give rise to strategies.

We organize the principles and strategies below into three sections corresponding to processes characteristic of most systematic literature synthesis approaches: literature identification and selection ; data abstraction from the publications selected for inclusion; and analysis , including critical appraisal and synthesis of the abstracted data. Within each section, we also describe the specific methodological decisions and procedures used in the overview on sampling in qualitative research [ 18 ] to illustrate how the principles and strategies for each review process were applied and implemented in a specific case. We expect this guidance and accompanying illustrations will be useful for anyone considering engaging in a methods overview, particularly those who may be familiar with conventional systematic review methods but may not yet appreciate some of the challenges specific to reviewing the methods literature.

Results and discussion

Literature identification and selection

The identification and selection process includes search and retrieval of publications and the development and application of inclusion and exclusion criteria to select the publications that will be abstracted and analyzed in the final review. Literature identification and selection for overviews of the methods literature is challenging and potentially more resource-intensive than for most reviews of empirical research. This is true for several reasons that we describe below, alongside discussion of the potential solutions. Additionally, we suggest in this section how the selection procedures can be chosen to match the specific analytic approach used in methods overviews.

Delimiting a manageable set of publications

One aspect of methods overviews that can make identification and selection challenging is the fact that the universe of literature containing potentially relevant information regarding most methods-related topics is expansive and often unmanageably so. Reviewers are faced with two large categories of literature: the methods literature , where the possible publication types include journal articles, books, and book chapters; and the methods-relevant sections of empirical study reports , where the possible publication types include journal articles, monographs, books, theses, and conference proceedings. In our systematic overview of sampling in qualitative research, exhaustively searching (including retrieval and first-pass screening) all publication types across both categories of literature for information on a single methods-related topic was too burdensome to be feasible. The following proposed principle follows from the need to delimit a manageable set of literature for the review.

Principle #1:

Considering the broad universe of potentially relevant literature, we propose that an important objective early in the identification and selection stage is to delimit a manageable set of methods-relevant publications in accordance with the objectives of the methods overview.

Strategy #1:

To limit the set of methods-relevant publications that must be managed in the selection process, reviewers have the option to initially review only the methods literature, and exclude the methods-relevant sections of empirical study reports, provided this aligns with the review’s particular objectives.

We propose that reviewers are justified in choosing to select only the methods literature when the objective is to map out the range of recognized concepts relevant to a methods topic, to summarize the most authoritative or influential definitions or meanings for methods-related concepts, or to demonstrate a problematic lack of clarity regarding a widely established methods-related concept and potentially make recommendations for a preferred approach to the methods topic in question. For example, in the case of the methods overview on sampling [ 18 ], the primary aim was to define areas lacking in clarity for multiple widely established sampling-related topics. In the review on intention-to-treat in the context of missing outcome data [ 17 ], the authors identified a lack of clarity based on multiple inconsistent definitions in the literature and went on to recommend separating the issue of how to handle missing outcome data from the issue of whether an intention-to-treat analysis can be claimed.

In contrast to strategy #1, it may be appropriate to select the methods-relevant sections of empirical study reports when the objective is to illustrate how a methods concept is operationalized in research practice or reported by authors. For example, one could review all the publications in 2 years’ worth of issues of five high-impact field-related journals to answer questions about how researchers describe implementing a particular method or approach, or to quantify how consistently they define or report using it. Such reviews are often used to highlight gaps in the reporting practices regarding specific methods, which may be used to justify items to address in reporting guidelines (for example, [ 14 – 16 ]).

It is worth recognizing that other authors have advocated broader positions regarding the scope of literature to be considered in a review, expanding on our perspective. Suri [ 10 ] (who, like us, emphasizes how different sampling strategies are suitable for different literature synthesis objectives) has, for example, described a two-stage literature sampling procedure (pp. 96–97). First, reviewers use an initial approach to conduct a broad overview of the field—for reviews of methods topics, this would entail an initial review of the research methods literature. This is followed by a second more focused stage in which practical examples are purposefully selected—for methods reviews, this would involve sampling the empirical literature to illustrate key themes and variations. While this approach is seductive in its capacity to generate more in-depth and interpretive analytic findings, some reviewers may consider the second stage too resource-intensive to include, however selective the purposeful sampling. In the overview on sampling, where we stopped after the first stage [ 18 ], we discussed our selective focus on the methods literature as a limitation that left opportunities for further analysis of the literature. We explicitly recommended, for example, that theoretical sampling was a topic for which a future review of the methods sections of empirical reports was justified to answer specific questions identified in the primary review.

Ultimately, reviewers must make pragmatic decisions that balance resource considerations, combined with informed predictions about the depth and complexity of literature available on their topic, with the stated objectives of their review. The remaining principles and strategies apply primarily to overviews that include the methods literature, although some aspects may be relevant to reviews that include empirical study reports.

Searching beyond standard bibliographic databases

An important reality affecting identification and selection in overviews of the methods literature is the increased likelihood for relevant publications to be located in sources other than journal articles (which is usually not the case for overviews of empirical research, where journal articles generally represent the primary publication type). In the overview on sampling [ 18 ], out of 41 full-text publications retrieved and reviewed, only 4 were journal articles, while 37 were books or book chapters. Since many books and book chapters did not exist electronically, their full text had to be physically retrieved in hardcopy, while 11 publications were retrievable only through interlibrary loan or purchase request. The tasks associated with such retrieval are substantially more time-consuming than electronic retrieval. Since a substantial proportion of methods-related guidance may be located in publication types that are less comprehensively indexed in standard bibliographic databases, identification and retrieval thus become complicated processes.

Principle #2:

Considering that important sources of methods guidance can be located in non-journal publication types (e.g., books, book chapters) that tend to be poorly indexed in standard bibliographic databases, it is important to consider alternative search methods for identifying relevant publications to be further screened for inclusion.

Strategy #2:

To identify books, book chapters, and other non-journal publication types not thoroughly indexed in standard bibliographic databases, reviewers may choose to consult one or more of the following less standard sources: Google Scholar, publisher web sites, or expert opinion.

In the case of the overview on sampling in qualitative research [ 18 ], Google Scholar had two advantages over other standard bibliographic databases: it indexes and returns records of books and book chapters likely to contain guidance on qualitative research methods topics; and it has been validated as providing higher citation counts than ISI Web of Science (a producer of numerous bibliographic databases accessible through institutional subscription) for several non-biomedical disciplines including the social sciences where qualitative research methods are prominently used [ 19 – 21 ]. While we identified numerous useful publications by consulting experts, the author publication lists generated through Google Scholar searches were uniquely useful to identify more recent editions of methods books identified by experts.

Searching without relevant metadata

Determining what publications to select for inclusion in the overview on sampling [ 18 ] could only rarely be accomplished by reviewing the publication’s metadata. This was because for the many books and other non-journal type publications we identified as possibly relevant, the potential content of interest would be located in only a subsection of the publication. In this common scenario for reviews of the methods literature (as opposed to methods overviews that include empirical study reports), reviewers will often be unable to employ standard title, abstract, and keyword database searching or screening as a means for selecting publications.

Principle #3:

Considering that the presence of information about the topic of interest may not be indicated in the metadata for books and similar publication types, it is important to consider other means of identifying potentially useful publications for further screening.

Strategy #3:

One approach to identifying potentially useful books and similar publication types is to consider what classes of such publications (e.g., all methods manuals for a certain research approach) are likely to contain relevant content, then identify, retrieve, and review the full text of corresponding publications to determine whether they contain information on the topic of interest.

In the example of the overview on sampling in qualitative research [ 18 ], the topic of interest (sampling) was one of numerous topics covered in the general qualitative research methods manuals. Consequently, examples from this class of publications first had to be identified for retrieval according to non-keyword-dependent criteria. Thus, all methods manuals within the three research traditions reviewed (grounded theory, phenomenology, and case study) that might contain discussion of sampling were sought through Google Scholar and expert opinion, their full text obtained, and hand-searched for relevant content to determine eligibility. We used tables of contents and index sections of books to aid this hand searching.

Purposefully selecting literature on conceptual grounds

A final consideration in methods overviews relates to the type of analysis used to generate the review findings. Unlike quantitative systematic reviews where reviewers aim for accurate or unbiased quantitative estimates—something that requires identifying and selecting the literature exhaustively to obtain all relevant data available (i.e., a complete sample)—in methods overviews, reviewers must describe and interpret the relevant literature in qualitative terms to achieve review objectives. In other words, the aim in methods overviews is to seek coverage of the qualitative concepts relevant to the methods topic at hand. For example, in the overview of sampling in qualitative research [ 18 ], achieving review objectives entailed providing conceptual coverage of eight sampling-related topics that emerged as key domains. The following principle recognizes that literature sampling should therefore support generating qualitative conceptual data as the input to analysis.

Principle #4:

Since the analytic findings of a systematic methods overview are generated through qualitative description and interpretation of the literature on a specified topic, selection of the literature should be guided by a purposeful strategy designed to achieve adequate conceptual coverage (i.e., representing an appropriate degree of variation in relevant ideas) of the topic according to objectives of the review.

Strategy #4:

One strategy for choosing the purposeful approach to use in selecting the literature according to the review objectives is to consider whether those objectives imply exploring concepts either at a broad overview level, in which case combining maximum variation selection with a strategy that limits yield (e.g., critical case, politically important, or sampling for influence—described below) may be appropriate; or in depth, in which case purposeful approaches aimed at revealing innovative cases will likely be necessary.

In the methods overview on sampling, the implied scope was broad since we set out to review publications on sampling across three divergent qualitative research traditions—grounded theory, phenomenology, and case study—to facilitate making informative conceptual comparisons. Such an approach would be analogous to maximum variation sampling.

At the same time, the purpose of that review was to critically interrogate the clarity, consistency, and comprehensiveness of literature from these traditions that was “most likely to have widely influenced students’ and researchers’ ideas about sampling” (p. 1774) [ 18 ]. In other words, we explicitly set out to review and critique the most established and influential (and therefore dominant) literature, since this represents a common basis of knowledge among students and researchers seeking understanding or practical guidance on sampling in qualitative research. To achieve this objective, we purposefully sampled publications according to the criterion of influence , which we operationalized as how often an author or publication has been referenced in print or informal discourse. This second sampling approach also limited the literature we needed to consider within our broad scope review to a manageable amount.

To operationalize this strategy of sampling for influence , we sought to identify both the most influential authors within a qualitative research tradition (all of whose citations were subsequently screened) and the most influential publications on the topic of interest by non-influential authors. This involved a flexible approach that combined multiple indicators of influence to avoid the dilemma that any single indicator might provide inadequate coverage. These indicators included bibliometric data (h-index for author influence [ 22 ]; number of cites for publication influence), expert opinion, and cross-references in the literature (i.e., snowball sampling). As a final selection criterion, a publication was included only if it made an original contribution in terms of novel guidance regarding sampling or a related concept; thus, purely secondary sources were excluded. Publish or Perish software (Anne-Wil Harzing; available at http://www.harzing.com/resources/publish-or-perish ) was used to generate bibliometric data via the Google Scholar database. Figure  1 illustrates how identification and selection in the methods overview on sampling was a multi-faceted and iterative process. The authors selected as influential, and the publications selected for inclusion or exclusion are listed in Additional file 1 (Matrices 1, 2a, 2b).

Fig. 1 Literature identification and selection process used in the methods overview on sampling [ 18 ]
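For readers interested in reproducing the bibliometric component of sampling for influence: the h-index [ 22 ] has a simple definition—the largest h such that an author has h publications with at least h citations each. The minimal Python sketch below computes it from a list of citation counts; the counts shown are hypothetical, and in the overview itself this indicator was obtained via Publish or Perish rather than computed by hand.

```python
def h_index(citation_counts):
    """Largest h such that h publications have at least h citations each
    (Hirsch, ref. [22])."""
    ranked = sorted(citation_counts, reverse=True)
    h = 0
    for rank, cites in enumerate(ranked, start=1):
        if cites >= rank:
            h = rank  # this rank still satisfies the h-index condition
        else:
            break
    return h

# Hypothetical citation counts for one author's publications
print(h_index([412, 198, 57, 33, 12, 9, 4, 1]))  # prints 6
```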

In summary, the strategies of seeking maximum variation and sampling for influence were employed in the sampling overview to meet the specific review objectives described. Reviewers will need to consider the full range of purposeful literature sampling approaches at their disposal in deciding what best matches the specific aims of their own reviews. Suri [ 10 ] has recently retooled Patton’s well-known typology of purposeful sampling strategies (originally intended for primary research) for application to literature synthesis, providing a useful resource in this respect.

Data abstraction

The purpose of data abstraction in rigorous literature reviews is to locate and record all data relevant to the topic of interest from the full text of included publications, making them available for subsequent analysis. Conventionally, a data abstraction form—consisting of numerous distinct conceptually defined fields to which corresponding information from the source publication is recorded—is developed and employed. There are several challenges, however, to the processes of developing the abstraction form and abstracting the data itself when conducting methods overviews, which we address here. Some of these problems and their solutions may be familiar to those who have conducted qualitative literature syntheses, which are similarly conceptual.

Iteratively defining conceptual information to abstract

In the overview on sampling [ 18 ], while we surveyed multiple sources beforehand to develop a list of concepts relevant for abstraction (e.g., purposeful sampling strategies, saturation, sample size), there was no way for us to anticipate some concepts prior to encountering them in the review process. Indeed, in many cases, reviewers are unable to determine the complete set of methods-related concepts that will be the focus of the final review a priori without having systematically reviewed the publications to be included. Thus, defining what information to abstract beforehand may not be feasible.

Principle #5:

Considering the potential impracticality of defining a complete set of relevant methods-related concepts from a body of literature one has not yet systematically read, selecting and defining fields for data abstraction must often be undertaken iteratively. Thus, concepts to be abstracted can be expected to grow and change as data abstraction proceeds.

Strategy #5:

Reviewers can develop an initial form or set of concepts for abstraction purposes according to standard methods (e.g., incorporating expert feedback, pilot testing) and remain attentive to the need to iteratively revise it as concepts are added or modified during the review. Reviewers should document revisions and return to re-abstract data from previously abstracted publications as the new data requirements are determined.

In the sampling overview [ 18 ], we developed and maintained the abstraction form in Microsoft Word. We derived the initial set of abstraction fields from our own knowledge of relevant sampling-related concepts, consultation with local experts, and reviewing a pilot sample of publications. Since the publications in this review included a large proportion of books, the abstraction process often began by flagging the broad sections within a publication containing topic-relevant information for detailed review to identify text to abstract. When reviewing flagged text, the reviewer occasionally encountered an unanticipated concept significant enough to warrant being added as a new field to the abstraction form. For example, a field was added to capture how authors described the timing of sampling decisions, whether before (a priori) or after (ongoing) starting data collection, or whether this was unclear. In these cases, we systematically documented the modification to the form and returned to previously abstracted publications to abstract any information that might be relevant to the new field.
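Although we maintained our abstraction form in Microsoft Word, the iterative logic of strategy #5 can be made concrete in a few lines of code. The Python sketch below is a hypothetical illustration (the field names and publication identifier are placeholders, not the actual fields from [ 18 ]) of the essential bookkeeping: every revision to the form is logged, and adding a field reports which previously abstracted publications must be revisited.

```python
from dataclasses import dataclass, field

@dataclass
class AbstractionForm:
    """Sketch of an iteratively revised abstraction form (strategy #5)."""
    fields: set = field(default_factory=set)
    records: dict = field(default_factory=dict)   # publication id -> {field: abstracted text}
    revision_log: list = field(default_factory=list)

    def abstract(self, pub_id, data):
        """Record abstracted text for one publication."""
        self.records[pub_id] = data

    def add_field(self, name, reason):
        """Add a new field mid-review; document the revision and return the
        previously abstracted publications that must be re-abstracted."""
        self.fields.add(name)
        self.revision_log.append((name, reason))
        return [pub for pub, rec in self.records.items() if name not in rec]

form = AbstractionForm(fields={"purposeful sampling strategies", "saturation", "sample size"})
form.abstract("Patton-2002", {"saturation": "...quoted text..."})  # hypothetical entry
to_revisit = form.add_field("timing of sampling decisions",
                            "unanticipated concept encountered during review")
print(to_revisit)  # ['Patton-2002'] -- return to these publications
```

The important property, mirrored in the framework-synthesis analogy discussed next, is that changes to the set of concepts are documented and trigger re-abstraction rather than being applied silently.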

The logic of this strategy is analogous to the logic used in a form of research synthesis called best fit framework synthesis (BFFS) [ 23 – 25 ]. In that method, reviewers initially code evidence using an a priori framework they have selected. When evidence cannot be accommodated by the selected framework, reviewers then develop new themes or concepts from which they construct a new expanded framework. Both the strategy proposed and the BFFS approach to research synthesis are notable for their rigorous and transparent means to adapt a final set of concepts to the content under review.

Accounting for inconsistent terminology

An important complication affecting the abstraction process in methods overviews is that the language used by authors to describe methods-related concepts can easily vary across publications. For example, authors from different qualitative research traditions often use different terms for similar methods-related concepts. Furthermore, as we found in the sampling overview [ 18 ], there may be cases where no identifiable term, phrase, or label for a methods-related concept is used at all, and a description of it is given instead. This can make searching the text for relevant concepts based on keywords unreliable.

Principle #6:

Since accepted terms may not be used consistently to refer to methods concepts, it is necessary to rely on the definitions for concepts, rather than keywords, to identify relevant information in the publication to abstract.

Strategy #6:

An effective means to systematically identify relevant information is to develop and iteratively adjust written definitions for key concepts (corresponding to abstraction fields) that are consistent with, and as inclusive as possible of, the literature reviewed. Reviewers then seek information that matches these definitions (rather than keywords) when scanning a publication for relevant data to abstract.

In the abstraction process for the sampling overview [ 18 ], we noted several concepts of interest to the review for which abstraction by keyword was particularly problematic due to inconsistent terminology across publications: sampling, purposeful sampling, sampling strategy, and saturation (for examples, see Additional file 1, Matrices 3a, 3b, 4). We iteratively developed definitions for these concepts by abstracting text from publications that either provided an explicit definition or from which an implicit definition could be derived, recording this text in fields dedicated to the concept’s definition. Using a method of constant comparison, we used text from definition fields to inform and modify a centrally maintained definition of the corresponding concept to optimize its fit and inclusiveness with the literature reviewed. Table 1 shows, as an example, the final definition constructed in this way for one of the central concepts of the review, qualitative sampling.

We applied iteratively developed definitions when making decisions about what specific text to abstract for an existing field, which allowed us to abstract concept-relevant data even if no recognized keyword was used. For example, this was the case for the sampling-related concept, saturation , where the relevant text available for abstraction in one publication [ 26 ]—“to continue to collect data until nothing new was being observed or recorded, no matter how long that takes”—was not accompanied by any term or label whatsoever.

This comparative analytic strategy (and our approach to analysis more broadly as described in strategy #7, below) is analogous to the process of reciprocal translation —a technique first introduced for meta-ethnography by Noblit and Hare [ 27 ] that has since been recognized as a common element in a variety of qualitative metasynthesis approaches [ 28 ]. Reciprocal translation, taken broadly, involves making sense of a study’s findings in terms of the findings of the other studies included in the review. In practice, it has been operationalized in different ways. Melendez-Torres and colleagues developed a typology from their review of the metasynthesis literature, describing four overlapping categories of specific operations undertaken in reciprocal translation: visual representation, key paper integration, data reduction and thematic extraction, and line-by-line coding [ 28 ]. The approaches suggested in both strategies #6 and #7, with their emphasis on constant comparison, appear to fall within the line-by-line coding category.

Generating credible and verifiable analytic interpretations

The analysis in a systematic methods overview must support its more general objective, which we suggested above is often to offer clarity and enhance collective understanding regarding a chosen methods topic. In our experience, this involves describing and interpreting the relevant literature in qualitative terms. Furthermore, any interpretative analysis required may entail reaching different levels of abstraction, depending on the more specific objectives of the review. For example, in the overview on sampling [ 18 ], we aimed to produce a comparative analysis of how multiple sampling-related topics were treated differently within and among different qualitative research traditions. To promote credibility of the review, however, not only should one seek a qualitative analytic approach that facilitates reaching varying levels of abstraction but that approach must also ensure that abstract interpretations are supported and justified by the source data and not solely the product of the analyst’s speculative thinking.

Principle #7:

Considering the qualitative nature of the analysis required in systematic methods overviews, it is important to select an analytic method whose interpretations can be verified as being consistent with the literature selected, regardless of the level of abstraction reached.

Strategy #7:

We suggest employing the constant comparative method of analysis [ 29 ] because it supports developing and verifying analytic links to the source data throughout progressively interpretive or abstract levels. In applying this approach, we advise a rigorous approach, documenting how supportive quotes or references to the original texts are carried forward in the successive steps of analysis to allow for easy verification.

The analytic approach used in the methods overview on sampling [ 18 ] comprised four explicit steps, progressing in level of abstraction—data abstraction, matrices, narrative summaries, and final analytic conclusions (Fig.  2 ). While we have positioned data abstraction as the second stage of the generic review process (prior to Analysis), above, we also considered it as an initial step of analysis in the sampling overview for several reasons. First, it involved a process of constant comparisons and iterative decision-making about the fields to add or define during development and modification of the abstraction form, through which we established the range of concepts to be addressed in the review. At the same time, abstraction involved continuous analytic decisions about what textual quotes (ranging in size from short phrases to numerous paragraphs) to record in the fields thus created. This constant comparative process was analogous to open coding in which textual data from publications was compared to conceptual fields (equivalent to codes) or to other instances of data previously abstracted when constructing definitions to optimize their fit with the overall literature as described in strategy #6. Finally, in the data abstraction step, we also recorded our first interpretive thoughts in dedicated fields, providing initial material for the more abstract analytic steps.

Fig. 2 Summary of progressive steps of analysis used in the methods overview on sampling [ 18 ]

In the second step of the analysis, we constructed topic-specific matrices, or tables, by copying relevant quotes from abstraction forms into the appropriate cells of matrices (for the complete set of analytic matrices developed in the sampling review, see Additional file 1, Matrices 3 to 10). Each matrix ranged from one to five pages; row headings, nested three-deep, identified the methodological tradition, author, and publication, respectively; and column headings identified the concepts, which corresponded to abstraction fields. Matrices thus allowed us to make further comparisons across methodological traditions, and between authors within a tradition. In the third step of analysis, we recorded our comparative observations as narrative summaries, in which we used illustrative quotes more sparingly. In the final step, we developed analytic conclusions based on the narrative summaries about the sampling-related concepts within each methodological tradition for which clarity, consistency, or comprehensiveness of the available guidance appeared to be lacking. Higher levels of analysis thus built logically from the lower levels, enabling us to verify analytic conclusions by tracing the support for claims back to the original text of the publications reviewed.
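For readers who prefer working programmatically, the structure of these matrices maps naturally onto a table with hierarchical row labels. The sketch below is a hypothetical Python/pandas illustration (the actual matrices in [ 18 ] were ordinary word-processor tables, and the author and publication labels and quotes here are placeholders): it reproduces the three-deep row nesting (tradition, author, publication) and the concept columns, and shows the kind of cross-tradition comparison the matrices supported.

```python
import pandas as pd

# Rows nested three-deep, as in the sampling overview's matrices:
# methodological tradition -> author -> publication (labels are placeholders).
rows = pd.MultiIndex.from_tuples(
    [
        ("grounded theory", "Author A", "Methods manual (1967)"),
        ("grounded theory", "Author B", "Methods manual (2006)"),
        ("phenomenology",   "Author C", "Methods manual (1990)"),
    ],
    names=["tradition", "author", "publication"],
)

# Columns correspond to concepts (abstraction fields); cells hold quotes.
matrix = pd.DataFrame(
    {
        "definition of sampling": ["<quote>", "<quote>", "<quote>"],
        "saturation":             ["<quote>", "<quote>", None],
    },
    index=rows,
)

# Example comparison: how many publications per tradition discuss saturation?
print(matrix["saturation"].groupby(level="tradition").count())
```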

Integrative versus interpretive methods overviews

The analytic product of systematic methods overviews is comparable to qualitative evidence syntheses, since both involve describing and interpreting the relevant literature in qualitative terms. Most qualitative synthesis approaches strive to produce new conceptual understandings that vary in level of interpretation. Dixon-Woods and colleagues [ 30 ] elaborate on a useful distinction, originating from Noblit and Hare [ 27 ], between integrative and interpretive reviews. Integrative reviews focus on summarizing available primary data and involve using largely secure and well defined concepts to do so; definitions are used from an early stage to specify categories for abstraction (or coding) of data, which in turn supports their aggregation; they do not seek as their primary focus to develop or specify new concepts, although they may achieve some theoretical or interpretive functions. For interpretive reviews, meanwhile, the main focus is to develop new concepts and theories that integrate them, with the implication that the concepts developed become fully defined towards the end of the analysis. These two forms are not completely distinct, and “every integrative synthesis will include elements of interpretation, and every interpretive synthesis will include elements of aggregation of data” [ 30 ].

The example methods overview on sampling [ 18 ] could be classified as predominantly integrative because its primary goal was to aggregate influential authors’ ideas on sampling-related concepts; there were also, however, elements of interpretive synthesis since it aimed to develop new ideas about where clarity in guidance on certain sampling-related topics is lacking, and definitions for some concepts were flexible and not fixed until late in the review. We suggest that most systematic methods overviews will be classifiable as predominantly integrative (aggregative). Nevertheless, more highly interpretive methods overviews are also quite possible—for example, when the review objective is to provide a highly critical analysis for the purpose of generating new methodological guidance. In such cases, reviewers may need to sample more deeply (see strategy #4), specifically by selecting empirical research reports (i.e., to go beyond dominant or influential ideas in the methods literature) that are likely to feature innovations or instructive lessons in employing a given method.

In this paper, we have outlined tentative guidance in the form of seven principles and strategies on how to conduct systematic methods overviews, a review type in which methods-relevant literature is systematically analyzed with the aim of offering clarity and enhancing collective understanding regarding a specific methods topic. Our proposals include strategies for delimiting the set of publications to consider, searching beyond standard bibliographic databases, searching without the availability of relevant metadata, selecting publications on purposeful conceptual grounds, defining concepts and other information to abstract iteratively, accounting for inconsistent terminology, and generating credible and verifiable analytic interpretations. We hope the suggestions proposed will be useful to others undertaking reviews on methods topics in future.

As far as we are aware, this is the first published source of concrete guidance for conducting this type of review. It is important to note that our primary objective was to initiate methodological discussion by stimulating reflection on what rigorous methods for this type of review should look like, leaving the development of more complete guidance to future work. While derived from the experience of reviewing a single qualitative methods topic, we believe the principles and strategies provided are generalizable to overviews of both qualitative and quantitative methods topics alike. However, it is expected that additional challenges and insights for conducting such reviews have yet to be defined. Thus, we propose that next steps for developing more definitive guidance should involve an attempt to collect and integrate other reviewers’ perspectives and experiences in conducting systematic methods overviews on a broad range of qualitative and quantitative methods topics. Formalized guidance and standards would improve the quality of future methods overviews, something we believe has important implications for advancing qualitative and quantitative methodology. When undertaken to a high standard, rigorous critical evaluations of the available methods guidance have significant potential to make implicit controversies explicit, and improve the clarity and precision of our understandings of problematic qualitative or quantitative methods issues.

A review process central to most types of rigorous reviews of empirical studies, which we did not explicitly address in a separate review step above, is quality appraisal . The reason we have not treated this as a separate step stems from the different objectives of the primary publications included in overviews of the methods literature (i.e., providing methodological guidance) compared to the primary publications included in the other established review types (i.e., reporting findings from single empirical studies). This is not to say that appraising quality of the methods literature is not an important concern for systematic methods overviews. Rather, appraisal is much more integral to (and difficult to separate from) the analysis step, in which we advocate appraising clarity, consistency, and comprehensiveness—the quality appraisal criteria that we suggest are appropriate for the methods literature. As a second important difference regarding appraisal, we currently advocate appraising the aforementioned aspects at the level of the literature in aggregate rather than at the level of individual publications. One reason for this is that methods guidance from individual publications generally builds on previous literature, and thus we feel that ahistorical judgments about comprehensiveness of single publications lack relevance and utility. Additionally, while different methods authors may express themselves less clearly than others, their guidance can nonetheless be highly influential and useful, and should therefore not be downgraded or ignored based on considerations of clarity—which raises questions about the alternative uses that quality appraisals of individual publications might have. Finally, legitimate variability in the perspectives that methods authors wish to emphasize, and the levels of generality at which they write about methods, makes critiquing individual publications based on the criterion of clarity a complex and potentially problematic endeavor that is beyond the scope of this paper to address. By appraising the current state of the literature at a holistic level, reviewers stand to identify important gaps in understanding that represent valuable opportunities for further methodological development.

To summarize, the principles and strategies provided here may be useful to those seeking to undertake their own systematic methods overview. Additional work is needed, however, to establish guidance that is comprehensive by comparing the experiences from conducting a variety of methods overviews on a range of methods topics. Efforts that further advance standards for systematic methods overviews have the potential to promote high-quality critical evaluations that produce conceptually clear and unified understandings of problematic methods topics, thereby accelerating the advance of research methodology.

References

1. Hutton JL, Ashcroft R. What does “systematic” mean for reviews of methods? In: Black N, Brazier J, Fitzpatrick R, Reeves B, editors. Health services research methods: a guide to best practice. London: BMJ Publishing Group; 1998. p. 249–54.
2. Higgins JPT, Green S, editors. Cochrane handbook for systematic reviews of interventions. Version 5.1.0. The Cochrane Collaboration; 2011.
3. Centre for Reviews and Dissemination. Systematic reviews: CRD’s guidance for undertaking reviews in health care. York: Centre for Reviews and Dissemination; 2009.
4. Liberati A, Altman DG, Tetzlaff J, Mulrow C, Gotzsche PC, Ioannidis JPA, Clarke M, Devereaux PJ, Kleijnen J, Moher D. The PRISMA statement for reporting systematic reviews and meta-analyses of studies that evaluate healthcare interventions: explanation and elaboration. BMJ. 2009;339:b2700.
5. Barnett-Page E, Thomas J. Methods for the synthesis of qualitative research: a critical review. BMC Med Res Methodol. 2009;9(1):59.
6. Kastner M, Tricco AC, Soobiah C, Lillie E, Perrier L, Horsley T, Welch V, Cogo E, Antony J, Straus SE. What is the most appropriate knowledge synthesis method to conduct a review? Protocol for a scoping review. BMC Med Res Methodol. 2012;12(1):114.
7. Booth A, Noyes J, Flemming K, Gerhardus A. Guidance on choosing qualitative evidence synthesis methods for use in health technology assessments of complex interventions. Integrate-HTA; 2016.
8. Booth A, Sutton A, Papaioannou D. Systematic approaches to a successful literature review. 2nd ed. London: Sage; 2016.
9. Hannes K, Lockwood C, editors. Synthesizing qualitative research: choosing the right approach. Chichester: Wiley-Blackwell; 2012.
10. Suri H. Towards methodologically inclusive research syntheses: expanding possibilities. New York: Routledge; 2014.
11. Campbell M, Egan M, Lorenc T, Bond L, Popham F, Fenton C, Benzeval M. Considering methodological options for reviews of theory: illustrated by a review of theories linking income and health. Syst Rev. 2014;3(1):1–11.
12. Cohen DJ, Crabtree BF. Evaluative criteria for qualitative research in health care: controversies and recommendations. Ann Fam Med. 2008;6(4):331–9.
13. Tong A, Sainsbury P, Craig J. Consolidated criteria for reporting qualitative research (COREQ): a 32-item checklist for interviews and focus groups. Int J Qual Health Care. 2007;19(6):349–57.
14. Moher D, Schulz KF, Simera I, Altman DG. Guidance for developers of health research reporting guidelines. PLoS Med. 2010;7(2):e1000217.
15. Moher D, Tetzlaff J, Tricco AC, Sampson M, Altman DG. Epidemiology and reporting characteristics of systematic reviews. PLoS Med. 2007;4(3):e78.
16. Chan AW, Altman DG. Epidemiology and reporting of randomised trials published in PubMed journals. Lancet. 2005;365(9465):1159–62.
17. Alshurafa M, Briel M, Akl EA, Haines T, Moayyedi P, Gentles SJ, Rios L, Tran C, Bhatnagar N, Lamontagne F, et al. Inconsistent definitions for intention-to-treat in relation to missing outcome data: systematic review of the methods literature. PLoS One. 2012;7(11):e49163.
18. Gentles SJ, Charles C, Ploeg J, McKibbon KA. Sampling in qualitative research: insights from an overview of the methods literature. Qual Rep. 2015;20(11):1772–89.
19. Harzing A-W, Alakangas S. Google Scholar, Scopus and the Web of Science: a longitudinal and cross-disciplinary comparison. Scientometrics. 2016;106(2):787–804.
20. Harzing A-WK, van der Wal R. Google Scholar as a new source for citation analysis. Ethics Sci Environ Polit. 2008;8(1):61–73.
21. Kousha K, Thelwall M. Google Scholar citations and Google Web/URL citations: a multi-discipline exploratory analysis. J Assoc Inf Sci Technol. 2007;58(7):1055–65.
22. Hirsch JE. An index to quantify an individual’s scientific research output. Proc Natl Acad Sci U S A. 2005;102(46):16569–72.
23. Booth A, Carroll C. How to build up the actionable knowledge base: the role of ‘best fit’ framework synthesis for studies of improvement in healthcare. BMJ Qual Saf. 2015;24(11):700–8.
24. Carroll C, Booth A, Leaviss J, Rick J. “Best fit” framework synthesis: refining the method. BMC Med Res Methodol. 2013;13(1):37.
25. Carroll C, Booth A, Cooper K. A worked example of “best fit” framework synthesis: a systematic review of views concerning the taking of some potential chemopreventive agents. BMC Med Res Methodol. 2011;11(1):29.
26. Cohen MZ, Kahn DL, Steeves DL. Hermeneutic phenomenological research: a practical guide for nurse researchers. Thousand Oaks: Sage; 2000.
27. Noblit GW, Hare RD. Meta-ethnography: synthesizing qualitative studies. Newbury Park: Sage; 1988.
28. Melendez-Torres GJ, Grant S, Bonell C. A systematic review and critical appraisal of qualitative metasynthetic practice in public health to develop a taxonomy of operations of reciprocal translation. Res Synth Methods. 2015;6(4):357–71.
29. Glaser BG, Strauss A. The discovery of grounded theory. Chicago: Aldine; 1967.
30. Dixon-Woods M, Agarwal S, Young B, Jones D, Sutton A. Integrative approaches to qualitative and quantitative evidence. UK National Health Service; 2004. p. 1–44.


Acknowledgements

Not applicable.

Funding

There was no funding for this work.

Availability of data and materials

The systematic methods overview used as a worked example in this article (Gentles SJ, Charles C, Ploeg J, McKibbon KA: Sampling in qualitative research: insights from an overview of the methods literature. The Qual Rep 2015, 20(11):1772-1789) is available from http://nsuworks.nova.edu/tqr/vol20/iss11/5 .

Authors’ contributions

SJG wrote the first draft of this article, with CC contributing to drafting. All authors contributed to revising the manuscript. All authors except CC (deceased) approved the final draft. SJG, CC, KAM, and JP were involved in developing methods for the systematic methods overview on sampling.

Competing interests

The authors declare that they have no competing interests.

Authors and affiliations

Department of Clinical Epidemiology and Biostatistics, McMaster University, Hamilton, Ontario, Canada

Stephen J. Gentles, Cathy Charles & K. Ann McKibbon

Faculty of Social Work, University of Calgary, Alberta, Canada

David B. Nicholas

School of Nursing, McMaster University, Hamilton, Ontario, Canada

Jenny Ploeg

CanChild Centre for Childhood Disability Research, McMaster University, 1400 Main Street West, IAHS 408, Hamilton, ON, L8S 1C7, Canada

Stephen J. Gentles


Corresponding author

Correspondence to Stephen J. Gentles .

Additional information

Cathy Charles is deceased

Additional file

Additional file 1: Analysis_matrices. (DOC 330 kb)

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License ( http://creativecommons.org/licenses/by/4.0/ ), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver ( http://creativecommons.org/publicdomain/zero/1.0/ ) applies to the data made available in this article, unless otherwise stated.


About this article

Cite this article

Gentles, S.J., Charles, C., Nicholas, D.B. et al. Reviewing the research methods literature: principles and strategies illustrated by a systematic overview of sampling in qualitative research. Syst Rev 5 , 172 (2016). https://doi.org/10.1186/s13643-016-0343-0


Received : 06 June 2016

Accepted : 14 September 2016

Published : 11 October 2016

DOI : https://doi.org/10.1186/s13643-016-0343-0


Keywords

  • Systematic review
  • Literature selection
  • Research methods
  • Research methodology
  • Overview of methods
  • Systematic methods overview
  • Review methods



Mixed methods research: what it is and what it could be

  • Open access
  • Published: 29 March 2019
  • Volume 48 , pages 193–216, ( 2019 )


  • Rob Timans 1 ,
  • Paul Wouters 2 &
  • Johan Heilbron 3  

112k Accesses

90 Citations

13 Altmetric


A Correction to this article was published on 06 May 2019


Combining methods in social scientific research has recently gained momentum through a research strand called Mixed Methods Research (MMR). This approach, which explicitly aims to offer a framework for combining methods, has rapidly spread through the social and behavioural sciences, and this article offers an analysis of the approach from a field theoretical perspective. After a brief outline of the MMR program, we ask how its recent rise can be understood. We then delve deeper into some of the specific elements that constitute the MMR approach, and we engage critically with the assumptions that underlie this particular conception of using multiple methods. We conclude by offering an alternative view regarding methods and method use.


The interest in combining methods in social scientific research has a long history. Terms such as “triangulation,” “combining methods,” and “multiple methods” have been around for quite a while to designate using different methods of data analysis in empirical studies. However, this practice has gained new momentum through a research strand that has recently emerged and that explicitly aims to offer a framework for combining methods. This approach, which goes by the name of Mixed Methods Research (MMR), has rapidly become popular in the social and behavioural sciences. This can be seen, for instance, in Fig.  1 , where the number of publications mentioning “mixed methods” in the title or abstract in the Thomson Reuters Web of Science is depicted. The number increased rapidly over the past ten years, especially after 2006. Footnote 1

Fig. 1 Fraction of the total of articles mentioning Mixed Methods Research appearing in a given year, 1990–2017 (yearly values sum to 1). See footnote 1
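The normalization behind Fig. 1 is easy to reproduce. The following is a minimal sketch rather than the authors' actual pipeline: it assumes yearly counts of matching Web of Science records exported to a CSV (the file and column names are hypothetical) and divides each year's count by the total for 1990–2017, so that the plotted values sum to 1.

```python
import pandas as pd

# Hypothetical export: one row per year with the number of Web of Science
# records mentioning "mixed methods" in title or abstract
# (columns: year, n_articles).
counts = pd.read_csv("wos_mixed_methods_counts.csv")

# Restrict to the window shown in Fig. 1 and normalize so the yearly
# fractions sum to 1 across 1990-2017.
window = counts[(counts["year"] >= 1990) & (counts["year"] <= 2017)].copy()
window["fraction"] = window["n_articles"] / window["n_articles"].sum()

print(window[["year", "fraction"]].to_string(index=False))
```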

The subject of mixed methods thus seems to have gained recognition among social scientists. The rapid rise in the number of articles mentioning the term raises various sociological questions. In this article, we address three of these questions. The first question concerns the degree to which the approach of MMR has become institutionalized within the field of the social sciences. Has MMR become a recognizable realm of knowledge production? Has its ascendance been accompanied by the production of textbooks, the founding of journals, and other indicators of institutionalization? The answer to this question provides an assessment of the current state of MMR. Once that is determined, the second question is how MMR’s rise can be understood. Where does the approach come from and how can its emergence and spread be understood? To answer this question, we use Pierre Bourdieu’s field analytical approach to science and academic institutions (Bourdieu 1975 , 1988 , 2004 , 2007 ; Bourdieu et al. 1991 ). We flesh out this approach in the next section. The third question concerns the substance of the MMR corpus seen in the light of the answers to the previous questions: how can we interpret the specific content of this approach in the context of its socio-historical genesis and institutionalization, and how can we understand its proposal for “mixing methods” in practice?

We proceed as follows. In the next section, we give an account of our theoretical approach. Then, in the third section, we assess the degree of institutionalization of MMR, drawing on the indicators of academic institutionalization developed by Fleck et al. ( 2016 ). In the fourth section, we address the second question by examining the position of the academic entrepreneurs behind the rise of MMR. The aim is to understand these agents’ engagement in MMR, as well as its distinctive content, as being informed by their position in this field. Viewing MMR as a position-taking of academic entrepreneurs, linked to their objective position in this field, allows us to reflect sociologically on the substance of the approach. We offer this reflection in the fifth section, where we indicate some problems with MMR. To anticipate the discussion: these problems have to do with the framing of MMR as a distinct methodology and its specific conceptualization of data and methods of data analysis. We argue that these problems hinder fruitfully combining methods in a practical understanding of social scientific research. Finally, we conclude with some tentative proposals for an alternative view on combining methods.

A field approach

Our investigation of the rise and institutionalization of MMR relies on Bourdieu’s field approach. In general, field theory provides a model for the structural dimensions of practices. In fields, agents occupy a position relative to each other based on the differences in the volume and structure of their capital holdings. Capital can be seen as a resource that agents employ to exert power in the field. The distribution of the form of capital that is specific to the field serves as a principle of hierarchization in the field, differentiating those that hold more capital from those that hold less. This principle allows us to make a distinction between, respectively, the dominant and dominated factions in a field. However, in mature fields all agents—dominant and dominated—share an understanding of what is at stake in the field and tend to accept its principle of hierarchization. They are invested in the game, have an interest in it, and share the field’s illusio .

In the present case, we can interpret the various disciplines in the social sciences as more or less autonomous spaces that revolve around the shared stake in producing legitimate scientific knowledge by the standards of the field. What constitutes legitimate knowledge in these disciplinary fields, the production of which bestows prestige and an aura of competence on scholars, is in large part determined by the dominant agents in the field, who occupy positions in which most of the consecration of scientific work takes place. Scholars operating in a field are endowed with initial and accumulated field-specific capital, and are engaged in the struggle to gain additional capital (mainly scientific and intellectual prestige) in order to advance their position in the field. The main focus of these agents will generally be the disciplinary field in which they built their careers and invested their capital. These various disciplinary spaces are in turn part of a broader field of the social sciences in which the social status and prestige of the various disciplines are at stake. The ensuing disciplinary hierarchy is an important factor to take into account when analysing the circulation of new scientific products such as MMR. Furthermore, a distinction needs to be made between the academic and the scientific field. While the academic field revolves around universities and other degree-granting institutions, the stakes in the scientific field entail the production and valuation of knowledge. Of course, in modern science these fields are closely related, but they do not coincide (Gingras and Gemme 2006 ). For instance, part of the production of legitimate knowledge takes place outside of universities.

This framework makes it possible to contextualize the emergence of MMR in a socio-historical way. It also enables an assessment of some of the characteristics of MMR as a scientific product, since Bourdieu insists on the homology between the objective positions in a field and the position-takings of the agents who occupy these positions. As a new methodological approach, MMR is the result of the position-takings of its producers. The position-takings of the entrepreneurs at the core of MMR can therefore be seen as expressions in the struggles over the authority to define the proper methodology that underlies good scientific work regarding combining methods, and the potential rewards that come with being seen, by other agents, as authoritative on these matters. Possible rewards include a strengthened autonomy of the subfield of MMR and an improved position in the social-scientific field.

The role of these entrepreneurs or ‘intellectual leaders’ who can channel intellectual energy and can take the lead in institution building has been emphasised by sociologists of science as an important aspect of the production of knowledge that is visible and recognized as distinct in the larger scientific field (e.g., Mullins 1973 ; Collins 1998 ). According to Bourdieu, their position can, to a certain degree, explain the strategy they pursue and the options they perceive to be viable in the trade-off regarding the risks and potential rewards for their work.

We do not provide a full-fledged field analysis of MMR here. Rather, we use the concept as a heuristic device to account for the phenomenon of MMR in the social context in which it emerged and diffused. But first, we take stock of the current situation of MMR by focusing on the degree of institutionalization of MMR in the scientific field.

The institutionalization of mixed methods research

When discussing institutionalization, we have to be careful about what we mean by this term. More precisely, we need to be specific about the context and distinguish between institutionalization in the academic field and institutionalization within the scientific field (see Gingras and Gemme 2006 ; Sapiro et al. 2018 ). The former refers to the establishment of degrees, curricula, faculties, etc., or to institutions tied to the academic bureaucracy and academic politics. The latter refers to the emergence of institutions that support the autonomization of scholarship, such as scholarly associations and scientific journals. Since MMR is still a relatively young phenomenon and academic institutionalization tends to lag scientific institutionalization (e.g., for the case of sociology and psychology, see Sapiro et al. 2018 , p. 26), we mainly focus here on the latter dimension.

Drawing on criteria proposed by Fleck et al. ( 2016 ) for the institutionalization of academic disciplines, MMR seems to have achieved a significant degree of institutionalization within the scientific field. MMR quickly gained popularity in the first decade of the twenty-first century (e.g., Tashakkori and Teddlie 2010c , pp. 803–804). A distinct corpus of publications has been produced that aims to educate those interested in MMR and to function as a source of reference for researchers: there are a number of textbooks (e.g., Plowright 2010 ; Creswell and Plano Clark 2011 ; Teddlie and Tashakkori 2008 ); a handbook that is now in its second edition (Tashakkori and Teddlie 2003 , 2010a ); as well as a reader (Plano Clark and Creswell 2007 ). Furthermore, a journal (the Journal of Mixed Methods Research [ JMMR] ) was established in 2007. The JMMR was founded by the editors John Creswell and Abbas Tashakkori with the primary aim of “building an international and multidisciplinary community of mixed methods researchers.” Footnote 2 Contributions to the journal must “fit the definition of mixed methods research” Footnote 3 and explicitly integrate qualitative and quantitative aspects of research, either in an empirical study or in a more theoretical-methodologically oriented piece.

In addition, general textbooks on social research methods and methodology now increasingly devote sections to the issue of combining methods (e.g., Creswell 2008 ; Nagy Hesse-Biber and Leavy 2008 ; Bryman 2012 ), and MMR has been described as a “third paradigm” (Denscombe 2008 ), a “movement” (Bryman 2009 ), a “third methodology” (Tashakkori and Teddlie 2010b ), a “distinct approach” (Greene 2008 ) and an “emerging field” (Tashakkori and Teddlie 2011 ), defined by a common name (that sets it apart from other approaches to combining methods) and shared terminology (Tashakkori and Teddlie 2010b , p. 19). As a further indication of institutionalization, a research association (the Mixed Methods International Research Association—MMIRA) was founded in 2013 and its inaugural conference was held in 2014. Prior to this, there had been a number of conferences on MMR or occasions on which MMR was presented and discussed in other contexts. An example of the first is the conference on mixed method research design held in Basel in 2005. Starting also in 2005, the British Homerton School of Health Studies has organised a series of international conferences on mixed methods. Moreover, MMR was on the list of sessions in a number of conferences on qualitative research (see, e.g., Creswell 2012 ).

Another sign of institutionalization can be found in efforts to forge a common disciplinary identity by providing a narrative about its history. This involves the identification of precursors and pioneers as well as an interpretation of the process that gave rise to a distinctive set of ideas and practices. An explicit attempt to chart the early history of MMR is provided by Johnson and Gray ( 2010 ). They frame MMR as rooted in the philosophy of science, particularly as a way of thinking about science that has transcended some of the most salient historical oppositions in philosophy. Philosophers like Aristotle and Kant are portrayed as thinkers who sought to integrate opposing stances, forwarding “proto-mixed methods ideas” that exhibited the spirit of MMR (Johnson and Gray 2010 , p. 72, p. 86). In this capacity, they (as well as other philosophers like Vico and Montesquieu) are presented as part of MMR’s lineage, providing a philosophical validation of the project by casting it as a continuation of ideas already voiced by great thinkers of the past.

In the second edition of their textbook, Creswell and Plano Clark ( 2011 ) provide an overview of the history of MMR by identifying five historical stages: the first one being a precursor to the MMR approach, consisting of rather atomised attempts by different authors to combine methods in their research. For Creswell and Plano Clark, one of the earliest examples is Campbell and Fiske’s ( 1959 ) combination of quantitative methods to improve the validity of psychological scales that gave rise to the triangulation approach to research. However, they regard this, and other studies that combined methods around that time, as “antecedents to (…) more systematic attempts to forge mixed methods into a complete research design” (Creswell and Plano Clark 2011 , p. 21), and hence label this stage as the “formative period” (ibid., p. 25). Their second stage consists of the emergence of MMR as an identifiable research strand, accompanied by a “paradigm debate” about the possibility of combining qualitative and quantitative data. They locate its beginnings in the late 1980s when researchers in various fields began to combine qualitative and quantitative methods (ibid., pp. 20–21). This provoked a discussion about the feasibility of combining data that were viewed as coming from very different philosophical points of view. The third stage, the “procedural development period,” saw an emphasis on developing more hands-on procedures for designing a mixed methods study, while stage four is identified as consisting of “advocacy and expansion” of MMR as a separate methodology, involving conferences, the establishment of a journal and the first edition of the aforementioned handbook (Tashakkori and Teddlie 2003 ). Finally, the fifth stage is seen as a “reflective period,” in which discussions about the unique philosophical underpinnings and the scientific position of MMR emerge.

Creswell and Plano Clark thus locate the emergence of “MMR proper” at the second stage, when researchers started to use both qualitative and quantitative methods within a single research effort. As reasons for the emergence of MMR at this stage they identify the growing complexity of research problems, the perception of qualitative research as a legitimate form of inquiry (also by quantitative researchers) and the increasing need qualitative researchers felt for generalising their findings. They therefore perceive the emergence of the practice of combining methods as a bottom-up process that grew out of research practices, and at some point in time converged towards a more structural approach. Footnote 4 Historical accounts such as these add a cognitive dimension to the efforts to institutionalize MMR. They lay the groundwork for MMR as a separate subfield with its own identity, topics, problems and intellectual history. The use of terms such as “third paradigm” and “third methodology” also suggests that there is a tendency to perceive and promote MMR as a distinct and coherent way to do research.

In view of the brief exploration of the indicators of institutionalisation of MMR, it seems reasonable to conclude that MMR has become a recognizable and fairly institutionalized strand of research with its own identity and profile within the social scientific field. This can be seen both from the establishment of formal institutions (like associations and journals) and more informal ones that rely more on the tacit agreement between agents about “what MMR is” (an example of this, which we address later in the article, is the search for a common definition of MMR in order to fix the meaning of the term). The establishment of these institutions supports the autonomization of MMR and its emancipation from the field in which it originated, but in which it continues to be embedded. This way, it can be viewed as a semi-autonomous subfield within the larger field of the social sciences and as the result of a differentiation internal to this field (Steinmetz 2016 , p. 109). It is a space that is clearly embedded within this higher level field; for example, members of the subfield of MMR also qualify as members of the overarching field, and the allocation of the most valuable and current form of capital is determined there as well. Nevertheless, as a distinct subfield, it also has specific principles that govern the production of knowledge and the rewards of domination.

We return to the content and form of this specific knowledge later in the article. The next section addresses the question of the socio-genesis of MMR.

Where does mixed methods research come from?

The origins of the subfield of MMR lie in the broader field of social scientific disciplines. We interpret the scholars most involved in MMR (the “pioneers” or “scientific entrepreneurs”) as occupying particular positions within the larger academic and scientific field. Who, then, are the researchers at the heart of MMR? Leech ( 2010 ) interviewed four scholars (out of six) whom she identified as early developers of the field: Alan Bryman (UK; sociology), John Creswell (USA; educational psychology), Jennifer Greene (USA; educational psychology) and Janice Morse (USA; nursing and anthropology). Educated in the 1970s and early 1980s, all four of them indicated that they were initially trained in “quantitative methods” and later acquired skills in “qualitative methods.” For two of them (Bryman and Creswell) the impetus to learn qualitative methods was their involvement in writing on, and teaching of, research methods; for Greene and Morse the initial motivation was more instrumental and related to their concrete research activity at the time. Creswell describes himself as “a postpositivist in the 1970s, self-education as a constructivist through teaching qualitative courses in the 1980s, and advocacy for mixed methods (…) from the 1990s to the present” (Creswell 2011 , p. 269). Of this group, only Morse had the benefit of learning about qualitative methods as part of her educational training (in nursing and anthropology; Leech 2010 , p. 267). Independently, Creswell ( 2012 ) identified (in addition to Bryman, Greene and Morse) John Hunter and Allen Brewer (USA; Northwestern and Boston College) and Nigel Fielding (University of Surrey, UK) as important early movers in MMR.

The selections that Leech and Creswell make regarding the key actors are based on those actors’ close involvement with the “MMR movement.” They are corroborated by a simple analysis of the articles that appeared in the Journal of Mixed Methods Research ( JMMR ), founded in 2007 as an outlet for MMR.

Table 1 lists all the authors that have published in the issues of the journal since its first publication in 2007 and that have either received more than 14 (4%) of the citations allocated among the group of 343 authors (the TLCS score in Table 1 ), or have written more than 2 articles for the journal (1.2% of all the articles that have appeared from 2007 until October 2013), together with their educational background (i.e., the discipline in which they completed their PhD).
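The inclusion rule for Table 1 can be stated as a simple filter. The sketch below is illustrative only, not the authors' code; the file and column names are assumptions, with `tlcs` standing for the total local citation score mentioned above (the threshold of 14 citations is roughly 4% of those allocated among the 343 authors).

```python
import pandas as pd

# Hypothetical table of JMMR authors, 2007 to October 2013
# (columns: author, tlcs, n_articles).
authors = pd.read_csv("jmmr_authors.csv")

# Inclusion rule described in the text: more than 14 citations (TLCS),
# or more than 2 articles in the journal.
selected = authors[(authors["tlcs"] > 14) | (authors["n_articles"] > 2)]

print(selected.sort_values("tlcs", ascending=False).to_string(index=False))
```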

All the members of Leech’s selection, except for Morse, and the members of Creswell’s selection (except Hunter, Brewer, and Fielding) are represented in the selection based on the entries in the JMMR . Footnote 5 The same holds for two of the three additional authors identified by Creswell. Hunter and Brewer have developed a somewhat different approach to combining methods that explicitly targets data gathering techniques and largely avoids epistemological discussions. In Brewer and Hunter ( 2006 ) they discuss the MMR approach very briefly and only include two references in their bibliography to the handbook of Tashakkori and Teddlie ( 2003 ), and at the end of 2013 they had not published in the JMMR . Fielding, meanwhile, has written two articles for the JMMR (Fielding and Cisneros-Puebla 2009 ; Fielding 2012 ). In general, it seems reasonable to assume that a publication in a journal that positions itself as part of a systematic attempt to build a research tradition, and can be viewed as part of a strategic effort to advance MMR as a distinct alternative to more “traditional” academic research—particularly in methods—at least signals a degree of adherence to the effort and acceptance of the rules of the game it lays out. This would locate Fielding closer to the MMR movement than the others.

The majority of the researchers listed in Table 1 have a background in psychology or social psychology (35%) or sociology (25%). Most of them work in the United States or are UK citizens, and the positions they occupied at the beginning of 2013 indicate that most of these are in applied research: educational research and educational psychology account for 50% of all the disciplinary occupations of the group that were still employed in academia. This is consistent with the view that MMR originated in applied disciplines and thematic studies like education and nursing, rather than “pure disciplines” like psychology and sociology (Tashakkori and Teddlie 2010b , p. 32). Although most of the 20 individuals mentioned in Table 1 have taught methods courses in academic curricula (for 15 of them, we could determine that they were involved in the teaching of qualitative, quantitative, or mixed methods), there are few individuals with a background in statistics or a neighbouring discipline: only Amy Dellinger did her PhD in “research methodology.” In addition, as far as we could determine, only three individuals held a position in a methodological department at some time: Dellinger, Tony Onwuegbuzie, and Nancy Leech.

The pre-eminence of applied fields in MMR is supported when we turn our attention to the circulation of MMR. To assess this, we proceeded as follows. We selected 10 categories in the Web of Science that form a rough representation of the space of social science disciplines, taking care to include the most important so-called “studies.” These thematically orientated, interdisciplinary research areas have progressively expanded since they emerged at the end of the 1960s as a critique of the traditional disciplines (Heilbron et al. 2017 ). For each category, we selected the 10 journals with the highest 5-year impact factor in their category in the period 2007–2015. The lists were compiled every two years over this period, resulting in 5 top-ten lists for each of the following Web of Science categories: Economics, Psychology, Sociology, Anthropology, Political Science, Nursing, Education & Educational Research, Business, Cultural Studies, and Family Studies. After removing journals that occurred more than once, we obtained a list of 164 journals.

We searched the titles and abstracts of the articles appearing in these journals over the period 1992–2016 for occurrences of the terms “mixed method” or “multiple methods” and variants thereof. We chose this particular period and combination of search terms to see if a shift from a more general use of the term “multiple methods” to “mixed methods” occurred following the institutionalization of MMR. In total, we found 797 articles (out of a total of 241,521 articles that appeared in these journals during that time), published in 95 different journals. Table 2 lists the 20 journals that contain at least 1% (8 articles) of the total amount of articles.
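Since the article does not spell out the exact query or data layout, the following sketch of the matching step rests on assumptions: the file and column names are invented, and the regular expression is one plausible reading of the phrase "variants thereof".

```python
import re
import pandas as pd

# Hypothetical export: one row per article from the 164 selected journals,
# 1992-2016, with columns: journal, year, title, abstract.
records = pd.read_csv("journal_articles_1992_2016.csv")

# Case-insensitive pattern covering "mixed method(s)", "multi-method",
# "multiple methods", and hyphenated variants; an assumed reading of
# "variants thereof", since the exact query is not given.
pattern = re.compile(r"\b(?:mixed|multi(?:ple)?)[- ]methods?\b", re.IGNORECASE)

text = records["title"].fillna("") + " " + records["abstract"].fillna("")
hits = records[text.str.contains(pattern)]

# Journals contributing at least 1% (8 articles) of all matches, as in Table 2.
by_journal = hits["journal"].value_counts()
print(by_journal[by_journal >= 8])
```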

As is clear from Table 2 , the largest number of articles in the sample were published in journals in the field of nursing: 332 articles (42%) appeared in journals that can be assigned to this category. The next largest category is Education & Educational Research, to which 224 (28%) of the articles can be allocated. By contrast, classical social science disciplines are barely represented. In Table 2 only the journal Field Methods (Anthropology) and the Journal of Child Psychology and Psychiatry (Psychology) are related to classical disciplines. In Table 3 , the articles in the sample are categorized according to the disciplinary category of the journal in which they appeared. Overall, the traditional disciplines are clearly underrepresented: in the Economics category, for example, only the Journal of Economic Geography contains three articles that make a reference to mixed methods.

Focusing on the core MMR group, the top ten authors of the group together collect 458 citations from the 797 articles in the sample, locating them at the center of the citation network. Creswell is the most cited author (210 citations), and his work, too, receives most of its citations from journals in nursing and education studies.

The question whether a terminological shift has occurred from “multiple methods” to “mixed methods” must be answered affirmatively for this sample. Prior to 2001, most articles (23 out of 31) refer to “multiple methods” or “multi-method” in their title or abstract, while the term “mixed methods” gains traction after 2001. This shift occurs first in journals in nursing studies, with journals in education studies following somewhat later. The same fields are also the first to cite the first textbooks and handbooks of MMR.
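How such a shift can be tabulated is sketched below, under the same assumed data layout as the previous example; the term classification and the precedence rule for articles mentioning both terms are editorial assumptions, not the authors'.

```python
import pandas as pd

# Same hypothetical export as above: columns journal, year, title, abstract.
records = pd.read_csv("journal_articles_1992_2016.csv")
text = records["title"].fillna("") + " " + records["abstract"].fillna("")

mixed = text.str.contains(r"\bmixed[- ]methods?\b", case=False)
multi = text.str.contains(r"\bmulti(?:ple)?[- ]methods?\b", case=False)

# Classify each matching article; "mixed methods" takes precedence when
# both term families occur in the same title or abstract (an assumption).
records["term"] = "neither"
records.loc[multi, "term"] = "multiple/multi-method"
records.loc[mixed, "term"] = "mixed methods"

# Per-year tabulation: before 2001 most matches should use "multiple methods"
# (23 of 31 in the authors' sample), with "mixed methods" taking off after.
print(pd.crosstab(records["year"], records["term"]))
```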

Taken together, these results corroborate the notion that MMR circulates mainly in nursing and education studies. How can this be understood from a field theoretical perspective? MMR can be seen as an innovation in the social scientific field, introducing a new methodology for combining existing methods in research. In general, innovation is a relatively risky strategy. Coming up with a truly rule-breaking innovation often involves a small probability of great success and a large probability of failure. However, it is important to add some nuance to this general observation. First, the risk an innovator faces depends on her position in the field. Agents occupying positions at the top of their field’s hierarchy are rich in specific capital and can more easily afford to undertake risky projects. In the scientific field, these are the agents richest in scientific capital. They have the knowledge, authority, and reputation (derived from recognition by their peers; Bourdieu 2004 , p. 34) that tends to decrease the risk they face and increase the chances of success. Moreover, the positions richest in scientific capital will, by definition, be the most consecrated ones. This consecration involves scientific rather than academic capital (cf. Wacquant 2013 , p. 20) and within disciplines these consecrated positions often are related to orthodox position-takings. This presents a paradox: although they have the capital to take more risks, they have also invested heavily in the orthodoxy of the field and will thus be reluctant to upset the status quo and risk destroying the value of their investment. This results in a tendency to take a more conservative stance, aimed at preserving the status quo in the field and defending their position. Footnote 6

For agents in dominated positions this logic is reversed. Possessing less scientific capital, they hold less consecrated positions and their chances of introducing successful innovations are much lower. This leaves them, too, with two possible strategies. One is to revert to a strategy of adaptation, accepting the established hierarchy in the field and embarking on a slow advancement to gain the necessary capital to make their mark from within the established order. However, Bourdieu notes that sometimes agents with a relatively marginal position in the field will engage in a “flight forward” and pursue higher-risk strategies. Strategies promoting a heterodox approach challenge the orthodoxy and the principles of hierarchization of the field, and, if successful (which will be the case only with a small probability), can rake in significant profits by laying claim to a new orthodoxy (Bourdieu 1975 , p. 104; Bourdieu 1993 , pp. 116–117).

Thus, the coupling of innovative strategies to specific field positions based on the amount of scientific capital alone is not straightforward. It is therefore helpful to introduce a second differentiation in the field that, following Bourdieu ( 1975 , p. 103), is based on the differences between the expected profits from these strategies. Here a distinction can be made between an autonomous and a heteronomous pole of the field, i.e., between the purest, most “disinterested” positions and the most “temporal” positions that are more pervious to the heteronomous logic of social hierarchies outside the scientific field. Of course, this difference is a matter of degree, as even the works produced at the most heteronomous positions still have to adhere to the standards of the scientific field to be seen as legitimate. But within each discipline this dimension captures the difference between agents predominantly engaged in fundamental, scholarly work—“production solely for the producers”—and agents more involved in applied lines of research. The main component of the expected profit from innovation in the first case is scientific, whereas in the second case the balance tends to shift towards more temporal profits. This two-fold structuring of the field allows for a more nuanced conception of innovation than the dichotomy “conservative” versus “radical.” Holders of large amounts of scientific capital at the autonomous pole of the field are the producers and conservators of orthodoxy, producing and diffusing what can be called “orthodox innovations” through their control of relatively powerful networks of consecration and circulation. Innovations can be radical or revolutionary in a rational sense, but they tend to originate from questions raised by the orthodoxy of the field. Likewise, the strategy to innovate in this sense can be very risky in that success is in no way guaranteed, but the risk is mitigated by the assurance of peers that these are legitimate questions, tackled in a way that is consistent with orthodoxy and that does not threaten control of the consecration and circulation networks.

These producers are seen as intellectual leaders by most agents in the field, especially by those aspiring to become part of the specific networks of production and circulation they maintain. The exceptions are the agents located at the autonomous end of the field who possess less scientific capital and outright reject the orthodoxy produced by the field’s elite. Being strictly focused on the most autonomous principles of legitimacy, they are unable to accommodate and have no choice but to reject the orthodoxy. Their only hope is to engage in heterodox innovations that may one day become the new orthodoxy.

The issue is less antagonistic at the heteronomous side of the field, at least as far as the irreconcilable position-takings at the autonomous pole are concerned. The main battle here is also for scientific capital, but it is complemented by the legitimacy scientific capital brings in gaining access to those who hold power outside of the scientific field. At the dominant side, those with more scientific capital tend to have access to the field of power (the agents who hold the most economic and cultural capital), for example by holding positions in policy advisory committees or company boards. The dominated groups at this side of the field will cater more to practitioners or professionals outside of the field of science.

Overall, there will be fewer innovations on this side. Moreover, innovative strategies will be less concerned with the intricacies of the pure discussions that prevail at the autonomous pole and will be of a more practical nature, pursued with different degrees of legitimacy according to the differences in scientific capital. This affects the form these more practical, process-orientated innovations take. At the dominant side of this pole, agents tend to accept the outcome of the struggles at the autonomous pole: they will accept the orthodoxy because mastery of it provides them with scientific capital and the legitimacy they need to gain access to those in power. In contrast, agents at the dominated side will be more interested in doing “what works,” neutralizing the points of conflict at the autonomous pole and deriving less value from strictly following the orthodoxy. This way, a four-fold classification of innovative strategies in the scientific field emerges (see Fig.  2 ) that helps to understand the context in which MMR was developed.

Fig. 2 Scientific field and scientific innovation

In summary, the small group of researchers who have been identified as the core of MMR consists predominantly of users of methods, who were educated and have worked exclusively at US and British universities. The specific approach to combining methods that is proposed by MMR has been successful from an institutional point of view, achieving visibility through the foundation of a journal and an association and a considerable output of core MMR scholars in terms of books, conference proceedings, and journal articles. Its origins and circulation in vocational studies rather than classical academic disciplines can be understood from the position these studies occupy in the scientific field and the kinds of position-taking and innovations these positions give rise to. This context allows a reflexive understanding of the content of MMR and the issues that are dominant in the approach. We turn to this in the next section.

Mixed methods research: Position-taking

The position of the subfield of MMR in the scientific field is related to the position-takings of agents that form the core of this subfield (Bourdieu 1993 , p. 35). The space of position-takings, in turn, provides the framework to study the most salient issues that are debated within the subfield. Since we can consider MMR to be an emerging subfield, where positions and position-takings are not as clearly defined as in more mature and settled fields, it comes as no surprise that there is a lively discussion of fundamental matters. Out of the various topics that are actively discussed, we have distilled three themes that are important for the way the subfield of MMR conveys its autonomy as a field and as a distinct approach to research. Footnote 7 In our view, these also represent the main problems with the way MMR approaches the issue of combining methods.

Methodology making and standardization

The first topic is that the approach is moving towards defining a unified MMR methodology. There are differences in opinion as to how this is best achieved, but there is widespread agreement that some kind of common methodological and conceptual foundation of MMR is needed. To this end, some propose a broad methodology that can serve as a distinct marker of MMR research. For instance, in their introduction to the handbook, Tashakkori and Teddlie ( 2010b ) propose a definition of the methodology of mixed methods research as “the broad inquiry logic that guides the selection of specific methods and that is informed by conceptual positions common to mixed methods practitioners” (Tashakkori and Teddlie 2010b , p. 5). When they (later on in the text) provide two methodological principles that differentiate MMR from other communities of scholars, they state that they regard it as a “crucial mission” for the MMR community to generate distinct methodological principles (Tashakkori and Teddlie 2010b , pp. 16–17). They envision an MMR methodology that can function as a “guide” for selecting specific methods. Others are more in favour of finding a philosophical foundation that underlies MMR. For instance, Morgan ( 2007 ) and Hesse-Biber ( 2010 ) consider pragmatism as a philosophy that distinguishes MMR from qualitative (constructivist) and quantitative (positivist) research and that can provide a rationale for the paradigmatic pluralism typical of MMR.

Furthermore, there is wide agreement that some unified definition of MMR would be beneficial, but it is precisely here that there is a large variation in interpretations regarding the essentials of MMR. This can be seen in the plethora of definitions that have been proposed. Johnson et al. ( 2007 ) identified 19 alternative definitions of MMR at the time, out of which they condensed their own:

[MMR] is the type of research in which a researcher or team of researchers combines elements of qualitative and quantitative research approaches (e.g., use of qualitative and quantitative viewpoints, data collection, analysis, inference techniques) for the broad purpose of breadth and depth of understanding and corroboration. Footnote 8

Four years later, the issue is not settled yet. Creswell and Plano Clark ( 2011 ) list a number of authors who have proposed a different definition of MMR, and conclude that there is a common trend in the content of these definitions over time. They take the view that earlier texts on mixing methods stressed a “disentanglement of methods and philosophy,” while later texts locate the practice of mixing methods in “all phases of the research process” (Creswell and Plano Clark 2011 , p. 2). It would seem, then, that according to these authors the definitions of MMR have become more abstract, further away from the practicality of “merely” combining methods. Specifically, researchers now seem to speak of mixing higher order concepts: some speak of mixing methodologies, others refer to mixing “research approaches,” or combining “types of research,” or engage in “multiple ways of seeing the social world” (Creswell and Plano Clark 2011 ).

This shift is in line with the direction in which MMR has developed, which emphasises practical ‘manuals’ and schemas for conducting research. A relatively large portion of the MMR literature is devoted to classifications of mixed methods designs. These classifications provide the basis for typologies that, in turn, provide guidelines to conduct MMR in a concrete research project. Tashakkori and Teddlie ( 2003 ) view these typologies as important elements of the organizational structure and legitimacy of the field. In addition, Leech and Onwuegbuzie ( 2009 ) see typologies as helpful guides for researchers and of pedagogical value (Leech and Onwuegbuzie 2009 , p. 272). Proposals for typologies can be found in textbooks, articles, and contributions to the handbook(s). For example, Creswell et al. ( 2003 , pp. 169–170) reviewed a number of studies and identified 8 different ways to classify MMR studies. This list was updated and extended by Creswell and Plano Clark ( 2011 , pp. 56–59) to 15 typologies. Leech and Onwuegbuzie ( 2009 ) identified 35 different research designs in the contributions to Tashakkori and Teddlie ( 2003 ) alone, and proposed their own three-dimensional typology that resulted in 8 different types of mixed methods studies. As another example of the ubiquity of these typologies, Nastasi et al. ( 2010 ) classified a large number of existing typologies in MMR into 7 “meta-typologies” that each emphasize different aspects of the research process as important markers for MMR. According to the authors, these typologies have the same function in MMR as the more familiar names of “qualitative” or “quantitative” methods (e.g., “content analysis” or “structural equation modelling”): to signal to readers of research what is going on, what procedures have been followed, how to interpret results, etc. (see also Creswell et al. 2003 , pp. 162–163). The criteria underlying these typologies mainly have to do with the degree of mixing (e.g., are methods mixed throughout the research project or not?), the timing (e.g., sequential or concurrent mixing of methods) and the emphasis (e.g., is one approach dominant, or do they have equal status?).

We find this strong drive to develop methodologies, definitions, and typologies of MMR as guides to valid mixed methods research problematic. What it amounts to in practice is a methodology that lays out the basic guidelines for doing MMR in a “proper way.” This entails the danger of straitjacketing reflection about the use of methods, decoupling it from theoretical and empirical considerations and thus favouring the unreflexive use of a standard methodology. Researchers are asked to make a choice for a particular MMR design and adhere to the guidelines for a “proper” MMR study. Such methodological prescription diametrically opposes the initial critique of the mechanical and unreflexive use of methods. The insight offered by Bourdieu’s notion of reflexivity is, on the contrary, that actual research practice is fundamentally open, guided by a logic of practice that cannot be captured by a preconceived and all-encompassing logic independent of that practice. Reflexivity in this view cannot be achieved by hiding behind the construct of a standardized methodology—of whatever signature—it can only be achieved by objectifying the process of objectification that goes on within the context of the field in which the researcher is embedded. This reflexivity, then, requires an analysis of the position of the researcher as a critical component of the research process, both as the embodiment of past choices that have consequences for the strategic position in the scientific field, and as predispositions regarding the choice of the subject and content of a research project. If we add the insight of STS researchers that the point of deconstructing science and technology is not so much to offer a new best way of doing science or technology as to provide insights into the critical moments in research (for a take on such a debate, see, for example, Edge 1995 , pp. 16–20), this calls for a sociology of science that takes methods much more seriously as objects of study. Such a programme should be based on studying the process of codification and standardization of methods in their historical context of production, circulation, and use. It would provide a basis for a sociological understanding of methods that can illuminate the critical moments in research alluded to above, enabling a systematic reflection on the process of objectification. This, in turn, allows a more sophisticated validation of using—and combining—methods than relying on prescribed methodologies.

The role of epistemology

The second theme discussed in a large number of contributions is the role epistemology plays in MMR. In a sense, epistemology provides the lifeblood for MMR in that methods in MMR are mainly seen in epistemological terms. This interpretation of methods is at the core of the knowledge claim of MMR practitioners, i.e., that the mixing of methods means mixing broad, different ways of knowing, which leads to better knowledge of the research object. It is also part of the identity that MMR consciously assumes, and that serves to set it apart from previous, more practical attempts to combine methods. This can be seen in the historical overview that Creswell and Plano Clark ( 2011 ) presented and that was discussed above. This reading, in which combining methods has evolved from the rather unproblematic level (one could alternatively say “naïve” or “unaware”) of instrumental use of various tools and techniques into an act that requires deeper thinking on a methodological and epistemological level, provides the legitimacy of MMR.

At the core of the MMR approach we thus find that methods are seen as unproblematic representations of different epistemologies. But this leads to a paradox, since the epistemological frameworks need to be held flexible enough to allow researchers to integrate elements of each of them (in the shape of methods) into one MMR design. As a consequence, the issue becomes the following: methods need to be disengaged from too strict an interpretation of the epistemological context in which they were developed in order for them to be “mixable,” but, at the same time, they must keep the epistemology attributed to them firmly intact.

In the MMR discourse two epistemological positions are identified that matter most: a positivist approach that gives rise to quantitative methods and a constructivist approach that is home to qualitative methods. For MMR to be a feasible endeavour, the differences between both forms of research must be defined as reconcilable. This position necessitates an engagement with those who hold that the quantitative/qualitative dichotomy is unbridgeable. Within MMR an interesting way of doing so has emerged. In the first issue of the Journal of Mixed Methods Research , Morgan ( 2007 ) frames the debate about research methodology in the social sciences in terms of Kuhnian paradigms, and he argues that the pioneers of the emancipation of qualitative research methods used a particular interpretation of the paradigm concept to state their case against the then dominant paradigm in the social sciences. According to Morgan, they interpreted a paradigm mainly in metaphysical terms, stressing the connections among the trinity of ontology, epistemology, and methodology as used in the philosophy of knowledge (Morgan 2007 , p. 57). This allowed these scholars to depict the line between research traditions in stark, contrasting terms, using Kuhn’s idea of “incommensurability” in the sense of its “early Kuhn” interpretation. This strategy fixed the contrast between the proposed alternative approach (a “constructivist paradigm”) and the traditional approach (constructed as “the positivist paradigm”) to research as a whole, and offered the alternative approach as a valid option rooted in the philosophy of knowledge. Morgan focuses especially on the work of Egon Guba and Yvonna Lincoln, who developed what they initially termed a “naturalistic paradigm” as an alternative to their perception of positivism in the social sciences (e.g., Guba and Lincoln 1985 ). Footnote 9 MMR requires a more flexible or “a-paradigmatic stance” towards research, which would entail that “in real-world practice, methods can be separated from the epistemology out of which they emerged” (Patton 2002 , quoted in Tashakkori and Teddlie 2010b , p. 14).

This proposal of an ‘interpretative flexibility’ (Bijker 1987 , 1997 ) regarding paradigms is an interesting proposition. But it immediately raises the question: why stop there? Why not take a deeper look into the epistemological technology of methods themselves, to let the muted components speak up in order to look for alternative “mixing interfaces” that could potentially provide equally valid benefits in terms of the understanding of a research object? The answer, of course, was already seen above. It is that the MMR approach requires situating methods epistemologically in order to keep them intact as unproblematic mediators of specific epistemologies and, thus, make the methodological prescriptions work. There are several problems with this. First, seeing methods solely through an epistemological lens is problematic, but it would be less consequential if it were applied to multiple elements of methods separately. This would at least allow a look under the hood of a method, and new ways of mixing methods could be opened up that go beyond the crude “qualitative” versus “quantitative” dichotomy. Second, there is also the issue of the ontological dimension of methods that is disregarded in an exclusively epistemological framing of methods (e.g., Law 2004 ). Taking this ontological dimension seriously has at least two important facets. First, it draws attention to the ontological assumptions that are woven into methods in their respective fields of production and that are imported into fields of users. Second, it entails the ontological consequences of practising methods: using, applying, and referring to methods, and the realities this produces. This latter facet brings the world-making and boundary-drawing capacities of methods to the fore. Both facets are ignored in MMR. We say more about the first facet in the next section. With regard to the second facet, a crucial element concerns the data that are generated, collected, and analysed in a research project. But rather than problematizing the link between the performativity of methods and the data that are enacted within the frame of a method, here too MMR relies on a dichotomy: that between quantitative and qualitative data. Methods are primarily viewed as ways of gathering data or as analytic techniques dealing with a specific kind of data. Methods and data are conceptualised as intertwined: methods too are seen as either quantitative or qualitative (often written as QUANT and QUAL in the literature), and perform the role of linking epistemology and data. In the final analysis, the MMR approach is based on the epistemological legitimization of the dichotomy between qualitative and quantitative data in order to define and combine methods: data obtain epistemological currency through the supposedly inseverable link to certain methods, and methods are reduced to the role of acting as neutral mediators between them.

In this way, methods are effectively reduced to, on the one hand, placeholders for epistemological paradigms and, on the other hand, mediators between one kind of data and the appropriate epistemology. To put it bluntly, the name “mixed methods research” is actually a misnomer, because what is mixed are paradigms or “approaches,” not methods. Thus, the act of mixing methods à la MMR has the paradoxical effect of encouraging a crude black box approach to methods. This is a third problematic characteristic of MMR, because it hinders a detailed study of methods that can lead to a much richer perspective on mixing methods.

Black boxed methods and how to open them

The third problem that we identified with the MMR approach, then, is that with the impetus to standardize the MMR methodology by fixing methods epistemologically, complemented by a dichotomous view of data, methods are, in the words of philosopher Bruno Latour, “blackboxed.” This is a peculiar result of the prescription for mixing methods as proposed by MMR, which thus not only denies practice and the ontological dimensions of methods and data, but also casts methods in the role of unyielding black boxes. Footnote 10 With this in mind, it will come as no surprise that most foundational contributions to the MMR literature do not explicitly define what a method is, nor do they provide an elaborate historical account of individual methods. The particular framing of methods in MMR results in a blind spot for the historical and social context of the production and circulation of methods as intellectual products. Instead it chooses to reify the boundaries that are drawn between “qualitative” and “quantitative” methods and reproduce them in the methodology it proposes. Footnote 11 This is an example of “circulation without context” (Bourdieu 2002 , p. 4): classifications that are constructed in the field of use or reception without taking the constellation within the field of production seriously.

Of course, this does not mean that the reality of the differences between quantitative and qualitative research must be denied. These labels are sticky and symbolically laden. They have come, in many ways, to represent “two cultures” (Goertz and Mahoney 2012 ) of research, institutionalised in academia, and the effects of nominally “belonging” to (or being assigned to) one particular category have very real consequences in terms of, for instance, access to research grants and specific journals. However, if the goal of an approach such as MMR is to open up new pathways in social science research (and why should that not be the case?), it is hard to see how that is accomplished by defining the act of combining methods solely in terms of reified differences between research using qualitative and quantitative data. In our view, methods are far richer and more interesting constructs than that, and a practice of combining methods in research should reflect that. Footnote 12

Addressing these problems invites a reflection on methods and on using (multiple) methods that is missing in the MMR perspective. A fruitful way to open up the black boxes and take into account the epistemological and ontological facets of methods is to make them, and their use, the object of sociological-historical investigation. Methods are constituted through particular practices. In Bourdieusian terms, they are objectifications of the subjectively understood practices of scientists “in other fields.” Rather than basing a practice of combining methods on an uncritical acceptance of the historically grown classification of types of social research (and using these as the building blocks of a methodology of mixing methods), we propose the development of a multifaceted approach that is based on a study of the different socio-historical contexts and practices in which methods developed and circulated.

A sociological understanding of methods based on these premises provides the tools to break with the dichotomously designed interface for combining methods in MMR. Instead, focusing on the historical and social contexts of production and use can reveal the traces that these contexts leave: in the internal structure of methods, in how they are perceived, in how they are put into practice, and in how this practice informs the ontological effects of methods. Seeing methods as complex technologies, with a history that entails the struggles among the different agents involved in their production and use, opens the way to identifying multiple interfaces for combining them: the one-sided boxes become polyhedra. The critical study of methods as “objects of objectification” also invites analyses of the way in which methods intervene between subject (researcher) and object, and of the way in which different methods are employed in practice to draw this boundary differently. The reflexive position generated by such a systematic juxtaposition of methods is a fruitful basis from which to come to a richer perspective on combining methods.

Conclusion

We critically reviewed the emerging practice of combining methods under the label of MMR. MMR challenges the mono-method approaches that are still dominant in the social sciences, and this is both refreshing and important. Combining methods should indeed be taken much more seriously in the social sciences.

However, the direction that the practice of combining methods is taking under the MMR approach seems problematic to us. We identified three main concerns. First, MMR scholars seem committed to designing a standardized methodological framework for combining methods. This is unfortunate, since it amounts to enforcing an unnecessary codification of aspects of research practice that should not be formally standardized. Second, MMR constructs methods as unproblematic representations of an epistemology. Although methods must be separable from their native epistemology for MMR to work, at the same time they have to be nested within a qualitative or a quantitative research approach, which are characterized by the data they use. By this logic, combining quantitative methods with other quantitative methods, or qualitative methods with other qualitative methods, cannot offer the same benefits: they originate from the same way of viewing and knowing the world, so it would have the same effect as blending two gradations of the same colour paint. The importance attached to the epistemological grounding of methods and data in MMR also disregards the ontological aspects of methods. In this article, we have argued that this one-sided perspective is problematic. Seeing the combining of methods as equivalent to combining epistemologies that are somehow pure and internally homogeneous because they can be placed in a qualitative or quantitative framework essentially amounts to reifying these categories.

It also leads to the third problem: the black-boxing of methods as neutral mediators between these epistemologies and data. This constitutes a problem not only for understanding methods as intellectual products, but also for the practice of combining methods itself, because it ignores the socio-historical context in which individual methods developed and hinders a sociologically grounded notion of combining methods.

We proceed from a different perspective on methods. In our view, methods are complex constructions. They are world-making technologies that encapsulate different assumptions on causality, rely on different conceptual relations and categorizations, allow for different degrees of emergence, and employ different theories of the data that they internalise as objects of analysis. Even more importantly, their current form as intellectual products cannot be separated from the historical context of their production, circulation, and use.

A fully developed exposition of such an approach will have to await further work. So far, the sociological study of methods has not (yet) developed into a consistent research programme, but important elements can be derived from existing contributions such as MacKenzie (1981), Chapoulie (1984), Platt (1996), Freeman (2004), and Desrosières (2008a, b). The work on the "social life of methods" (e.g., Savage 2013) also contains important leads for the development of a systematic sociological approach to method production and circulation. Based on the discussion in this article and the contributions listed above, some tantalizing questions can be formulated. How are methods and their elements objectified? How are epistemology and ontology defined in different fields, and how do those definitions feed into methods? How do methods circulate, and how are they translated and used in different contexts? What are the main controversies in fields of users, and how are these related to the field of production? What are the homologies between these fields?

Setting out to answer these questions opens up the possibility of exploring other interesting combinations of methods: combinations that emerge from different practices, situated in different historical and epistemological contexts, each with its own set of interpretations of methods' constituent elements. One of these constituent elements must surely be the data-theoretical assumptions that different methods incorporate. The problematization of data has become all the more pressing now that the debate about the consequences of "big data" for social scientific practices has become prominent (Savage and Burrows 2007; Levallois et al. 2013; Burrows and Savage 2014). Whereas MMR emphasizes the dichotomy between qualitative and quantitative data, a historical analysis of the production and use of methods can explore the more subtle, differing interpretations and enactments of the "same" data. These differences inform method construction, controversies surrounding methods, and, hence, opportunities for combining methods, which could then be constructed on the basis of alternative conceptualisations of data. Again, while in some contexts it might be enlightening to rely on the distinction between data as qualitative or quantitative, and to combine methods based on this categorization, it is an exciting possibility that in other research contexts other conceptualisations of data might be of more value in enhancing a specific (contextual) form of knowledge.

Change history

06 May 2019

Unfortunately, Figure 2 was incorrectly published.

Notes

The search term used was "mixed method*" in the "topic" search field of SSCI, A&HCI, and CPCI-SSH as contained in the Web of Science. A Google NGram search (not shown) confirmed this pattern. The results of a search for "mixed methods" and "mixed methods research" showed a very steep increase after 1994: in the first case, the normalized share in the total corpus increased by 855% from 1994 to 2008. Creswell (2012) also reports an almost hundred-fold increase in the number of theses and dissertations with "mixed methods" in the citation and abstract (from 26 in 1990–1994 to 2524 in 2005–2009).

Retrieved from https://uk.sagepub.com/en-gb/eur/journal-of-mixed-methods-research/journal201775#aims-and-scope on 1/17/2019.

In terms of antecedents of mixed methods research, it is interesting to note that Bourdieu, whose sociology of science we draw on, was, from his earliest studies in Algeria onwards, a strong advocate of combining research methods. He made it into a central characteristic of his approach to social science in Bourdieu et al. ( 1991 [1968]). His approach, as we see below, was very different from the one now proposed under the banner of MMR. Significantly, there is no mention of Bourdieu’s take on combining methods in any of the sources we studied.

Morse’s example in particular warns us that restricting the analysis to the authors that have published in the JMMR runs the risk of missing some important contributors to the spread of MMR through the social sciences. On her website, Morse lists 11 publications (journal articles, book chapters, and books) that explicitly make reference to mixed methods (and a substantial number of other publications are about methodological aspects of research), so the fact that she has not (yet) published in the JMMR cannot, by itself, be taken as an indication of a lesser involvement with the practice of combining methods. See the website of Janice Morse at https://faculty.utah.edu/u0556920-Janice_Morse_RN,_PhD,_FAAN/hm/index.hml accessed 1/17/2019.

Bourdieu ( 1999 , p. 26) mentions that one has to be a scientific capitalist to be able to start a scientific revolution. But here he refers explicitly to the autonomy of the scientific field, making it virtually impossible for amateurs to stand up against the historically accumulated capital in the field and incite a revolution.

The themes summarize the key issues through which MMR as a group comes "into difference" (Bourdieu 1993, p. 32). Of course, as in any (sub)field, the agents identified above often differ in their opinions on some of these key issues, or disagree on whether there should be a high degree of convergence of opinions at all. For instance, Bryman (2009) worried that MMR could become "a ghetto." For him, the institutional landmarks of having a journal, conferences, and a handbook increase the risk of "not considering the whole range of possibilities." He added: "I don't regard it as a field, I kind of think of it as a way of thinking about how you go about research" (Bryman, cited in Leech 2010, p. 261). It is interesting to note that Bryman, like fellow sociologists Morgan and Denscombe, had published only one paper in the JMMR by the end of 2016 (Bryman passed away in June of 2017). Although these papers are among the most cited papers in the journal (see Table 1), this low number is consistent with the more eclectic approach that Bryman proposed.

Johnson, Onwuegbuzie, and Turner ( 2007 , p. 123).

Guba and Lincoln (1985) discuss the features of their version of a positivistic approach mainly in ontological and epistemological terms, but they are also careful to distinguish the opposition between naturalistic and positivist approaches from the difference between what they call the quantitative and the qualitative paradigms. Since they go on to state that, in principle, quantitative methods can be used within a naturalistic approach (although in practice, qualitative methods would be preferred by researchers embracing this paradigm), they seem to locate methods on a somewhat "lower," i.e., less incommensurable, level. However, in their later work (together, with others, and individually) and that of others in their wake, there seems to have been a shift towards a stricter interpretation of the qualitative/quantitative divide in metaphysical terms, enabling Tashakkori and Teddlie (2010b) to label this group "purists" (Tashakkori and Teddlie 2010b, p. 13).

See, for instance, Onwuegbuzie et al.’s ( 2011 ) classification of 58 qualitative data analysis techniques and 18 quantitative data analysis techniques.

This can also be seen in Morgan’s ( 2018 ) response to Sandelowski’s ( 2014 ) critique of the binary distinctions in MMR between qualitative and quantitative research approaches and methods. Morgan denounces the essentialist approach to categorizing qualitative and quantitative research in favor of a categorization based on “family resemblances,” in which he draws on Wittgenstein. However, this denies the fact that the essentialist way of categorizing is very common in the MMR corpus, particularly in textbooks and manuals (e.g., Plano Clark and Ivankova 2016 ). Moreover, and more importantly, he still does not extend this non-essentialist model of categorization to the level of methods, referring, for instance, to the different strengths of qualitative and quantitative methods in mixed methods studies (Morgan 2018 , p. 276).

While it goes beyond the scope of this article to delve into the history of the qualitative-quantitative divide in the social sciences, some broad observations can be made here. The history of method use in the social sciences can briefly be summarized as, first, a rather fluid use of what can retrospectively be called different methods in large-scale research projects—such as the Yankee City study of Lloyd Warner and his associates (see Platt 1996, p. 102), the study of union democracy by Lipset et al. (1956), and the Marienthal study by Lazarsfeld and his associates (Jahoda et al. 1933); see Brewer and Hunter (2006, p. xvi)—followed by an increasing emphasis on quantitative data and the objectification and standardization of methods. The rise of research using qualitative data can be understood as a reaction against this use and interpretation of method in the social sciences. However, out of the ensuing clash a new, still dominant classification of methods emerged, one that relies on the framing of methods as either "qualitative" or "quantitative." Moreover, these labels have become synonymous with epistemological positions that are reproduced in MMR.

A proposal to come to such an approach can be found in Timans ( 2015 ).

References

Bijker, W. (1987). The social construction of bakelite: Toward a theory of invention. In W. Bijker, T. Hughes, T. Pinch, & D. Douglas (Eds.), The social construction of technological systems: New directions in the sociology and history of technology. Cambridge, MA: MIT Press.

Bijker, W. (1997). Of bicycles, bakelites, and bulbs: Toward a theory of sociotechnical change . Cambridge, MA: MIT Press.

Bourdieu, P. (1975). La spécificité du champ scientifique et les conditions sociales du progrès de la raison. Sociologie et Sociétés, 7(1), 91–118.

Bourdieu, P. (1988). Homo academicus . Stanford, CA: Stanford University Press.

Bourdieu, P. (1993). The field of cultural production . Cambridge, UK: Polity Press.

Bourdieu, P. (1999). Les historiens et la sociologie de Pierre Bourdieu. Le Bulletin de la Société d'Histoire Moderne et Contemporaine/SHMC, 1999 (3&4), 4–27.

Bourdieu, P. (2002). Les conditions sociales de la circulation internationale des idées. Actes de la Recherche en Sciences Sociales, 145 (5), 3–8.

Bourdieu, P. (2004). Science of science and reflexivity . Cambridge, UK: Polity.

Bourdieu, P. (2007). Sketch for a self-analysis . Cambridge, UK: Polity.

Bourdieu, P., Chamboredon, J., & Passeron, J. (1991). The craft of sociology: Epistemological preliminaries . Berlin, Germany: De Gruyter.

Brewer, J., & Hunter, A. (2006). Multimethod research: A synthesis of styles . London, UK: Sage.

Bryman, A. (2009). Sage Methodspace: Alan Bryman on research methods. Retrieved from http://www.youtube.com/watch?v=bHzM9RlO6j0 . Accessed 3/7/2019.

Bryman, A. (2012). Social research methods . Oxford, UK: Oxford University Press.

Burrows, R., & Savage, M. (2014). After the crisis? Big data and the methodological challenges of empirical sociology. Big Data & Society, 1 (1), 1–6.

Campbell, D., & Fiske, D. (1959). Convergent and discriminant validation by the multitrait-multimethod matrix. Psychological Bulletin, 56 (2), 81–105.

Chapoulie, J. (1984). Everett C. Hughes et le développement du travail de terrain en sociologie. Revue Française de Sociologie, 25 (4), 582–608.

Collins, R. (1998). The sociology of philosophies: A global theory of intellectual change . Cambridge, MA: Harvard University Press.

Creswell, J. (2008). Research design: Qualitative, quantitative, and mixed methods approaches . Thousand Oaks, CA: Sage.

Creswell, J. (2011). Controversies in mixed methods research. In N. Denzin & Y. Lincoln (Eds.), The Sage handbook of qualitative research . Thousand Oaks, CA: Sage.

Creswell, J. (2012). Qualitative inquiry and research design: Choosing among five approaches . Thousand Oaks, CA: Sage.

Creswell, J., & Plano Clark, V. (2011). Designing and conducting mixed methods research (2nd ed.). Thousand Oaks, CA: Sage.

Creswell, J., Plano Clark, V., Gutmann, M., & Hanson, W. (2003). Advanced mixed methods research designs. In A. Tashakkori & C. Teddlie (Eds.), Handbook of mixed methods in social and behavioral research . Thousand Oaks, CA: Sage.

Denscombe, M. (2008). Communities of practice: A research paradigm for the mixed methods approach. Journal of Mixed Methods Research, 2(3), 270–283.

Desrosières, A. (2008a). Pour une sociologie historique de la quantification - L’Argument statistique I . Paris, France: Presses des Mines.

Desrosières, A. (2008b). Gouverner par les nombres - L’Argument statistique II . Paris, France: Presses des Mines.

Edge, D. (1995). Reinventing the wheel. In D. Edge, S. Jasanof, G. Markle, J. Petersen, & T. Pinch (Eds.), Handbook of science and technology studies . Thousand Oaks, CA: Sage.

Fielding, N. (2012). Triangulation and mixed methods designs: Data integration with new research technologies. Journal of Mixed Methods Research, 6(2), 124–136.

Fielding, N., & Cisneros-Puebla, C. (2009). CAQDAS-GIS convergence: Toward a new integrated mixed method research practice? Journal of Mixed Methods Research, 3 (4), 349–370.

Fleck, C., Heilbron, J., Karady, V., & Sapiro, G. (2016). Handbook of indicators of institutionalization of academic disciplines in SSH. Serendipities, Journal for the Sociology and History of the Social Sciences, 1 (1) Retrieved from http://serendipities.uni-graz.at/index.php/serendipities/issue/view/1 . Accessed 10/10/2018.

Freeman, L. (2004). The development of social network analysis: A study in the sociology of science . Vancouver, Canada: Empirical Press.

Gingras, Y., & Gemme, B. (2006). L’Emprise du champ scientifique sur le champ universitaire et ses effets. Actes de la Recherche en Sciences Sociales, 164 , 51–60.

Goertz, G., & Mahoney, J. (2012). A tale of two cultures: Qualitative and quantitative research in the social sciences. Princeton, NJ: Princeton University Press.

Greene, J. (2008). Is mixed methods social inquiry a distinctive methodology? Journal of Mixed Methods Research, 2 (1), 7–22.

Guba, E., & Lincoln, Y. (1985). Naturalistic inquiry . Thousand Oaks, CA: Sage.

Heilbron, J., Bedecarré, M., & Timans, R. (2017). European journals in the social sciences and humanities. Serendipities, Journal for the Sociology and History of the Social Sciences, 2 (1), 33–49 Retrieved from http://serendipities.uni-graz.at/index.php/serendipities/issue/view/5 . Accessed 10/10/2018.

Hesse-Biber, S. (2010). Mixed methods research: Merging theory with practice . New York, NY: Guilford Press.

Jahoda, M., Lazarsfeld, P., & Zeisel, H. (1933). Die Arbeitslosen von Marienthal. Psychologische Monographien, 5.

Johnson, R., & Gray, R. (2010). A history of philosophical and theoretical issues for mixed methods research. In A. Tashakkori & C. Teddlie (Eds.), Handbook of mixed methods in social and behavioral research (2nd ed.). Thousand Oaks, CA: Sage.

Johnson, R., Onwuegbuzie, A., & Turner, L. (2007). Toward a definition of mixed methods research. Journal of Mixed Methods Research, 1 (2), 112–133.

Law, J. (2004). After method: Mess in social science research . London, UK: Routledge.

Leech, N. (2010). Interviews with the early developers of mixed methods research. In A. Tashakkori & C. Teddlie (Eds.), Handbook of mixed methods in social and behavioral research (2nd ed.). Thousand Oaks, CA: Sage.

Leech, N., & Onwuegbuzie, A. (2009). A typology of mixed methods research designs. Quality & Quantity, 43(2), 265–275.

Levallois, C., Steinmetz, S., & Wouters, P. (2013). Sloppy data floods or precise social science methodologies? In P. Wouters, A. Beaulieu, A. Scharnhorst, & S. Wyatt (Eds.), Virtual knowledge . Cambridge, MA: MIT Press.

Lipset, S., Trow, M., & Coleman, J. (1956). Union democracy: The internal politics of the international typographical union. Glencoe, IL: Free Press.

MacKenzie, D. (1981). Statistics in Britain: 1865–1930: The social construction of scientific knowledge . Edinburgh, UK: Edinburgh University Press.

Morgan, D. (2007). Paradigms lost and pragmatism regained: Methodological implications of combining qualitative and quantitative methods. Journal of Mixed Methods Research, 1 (1), 48–76.

Morgan, D. (2018). Living with blurry boundaries: The values of distinguishing between qualitative and quantitative research. Journal of Mixed Methods Research, 12 (3), 268–276.

Mullins, N. (1973). Theories and theory groups in contemporary American sociology . New York, NY: Harper & Row.

Nagy Hesse-Biber, S., & Leavy, P. (Eds.). (2008). Handbook of emergent methods . New York, NY and London, UK: Guilford Press.

Nastasi, B., Hitchcock, J., & Brown, L. (2010). An inclusive framework for conceptualizing mixed method design typologies. In A. Tashakkori & C. Teddlie (Eds.), Handbook of mixed methods in social and behavioral research (2nd ed.). Thousand Oaks, CA: Sage.

Onwuegbuzie, A., Leech, N., & Collins, K. (2011). Toward a new era for conducting mixed analyses: The role of quantitative dominant and qualitative dominant crossover mixed analyses. In M. Williams & P. Vogt (Eds.), Handbook of innovation in social research methods . London, UK: Sage.

Patton, M. (2002). Qualitative research and evaluation methods (3rd ed.). Thousand Oaks, CA: Sage.

Plano Clark, V., & Creswell, J. (Eds.). (2007). The mixed methods reader . Thousand Oaks, CA: Sage.

Plano Clark, V., & Ivankova, N. (2016). Mixed methods research: A guide to the field . Thousand Oaks, CA: Sage.

Platt, J. (1996). A history of sociological research methods in America: 1920–1960 . Cambridge, UK: Cambridge University Press.

Plowright, D. (2010). Using mixed methods – Frameworks for an integrated methodology . Thousand Oaks, CA: Sage.

Sandelowski, M. (2014). Unmixing mixed methods research. Research in Nursing & Health, 37 (1), 3–8.

Sapiro, G., Brun, E., & Fordant, C. (2018). The rise of the social sciences and humanities in France: Institutionalization, professionalization and autonomization. In C. Fleck, M. Duller, & V. Karady (Eds.), Shaping human science disciplines: Institutional developments in Europe and beyond. Basingstoke, UK: Palgrave.

Savage, M. (2013). The ‘social life of methods’: A critical introduction. Theory, Culture and Society, 30 (4), 3–21.

Savage, M., & Burrows, R. (2007). The coming crisis of empirical sociology. Sociology, 41 (5), 885–899.

Steinmetz, G. (2016). Social fields, subfields and social spaces at the scale of empires: Explaining the colonial state and colonial sociology. The Sociological Review, 64(2, Suppl.), 98–123.

Tashakkori, A., & Teddlie, C. (Eds.). (2003). Handbook of mixed methods in social and behavioral research . Thousand Oaks, CA: Sage.

Tashakkori, A., & Teddlie, C. (Eds.). (2010a). Handbook of mixed methods in social and behavioral research (2nd ed.). Thousand Oaks, CA: Sage.

Tashakkori, A., & Teddlie, C. (2010b). Overview of contemporary issues in mixed methods. In A. Tashakkori & C. Teddlie (Eds.), Handbook of mixed methods in social and behavioral research (2nd ed.). Thousand Oaks, CA: Sage.

Tashakkori, A., & Teddlie, C. (2010c). Epilogue. In A. Tashakkori & C. Teddlie (Eds.), Handbook of mixed methods in social and behavioral research (2nd ed.). Thousand Oaks, CA: Sage.

Tashakkori, A., & Teddlie, C. (2011). Mixed methods research: Contemporary issues in an emerging field. In N. Denzin & Y. Lincoln (Eds.), The SAGE handbook of qualitative research (4th ed.). Thousand Oaks, CA: Sage.

Teddlie, C., & Tashakkori, A. (2008). Foundations of mixed methods research – Integrating quantitative and qualitative approaches in the social and behavioral sciences . Thousand Oaks, CA: Sage.

Timans, R. (2015). Studying the Dutch business elite: Relational concepts and methods . Doctoral dissertation, Erasmus University Rotterdam, the Netherlands.

Wacquant, L. (2013). Bourdieu 1993: A case study in scientific consecration. Sociology, 47 (1), 15–29.

Acknowledgments

This research is part of the Interco-SSH project, funded by the European Union under the 7th Research Framework Programme (grant agreement no. 319974). Johan Heilbron would like to thank Louise and John Steffens, members of the Friends Founders’ Circle, who assisted his stay at the Princeton Institute for Advanced Study in 2017-18 during which he completed his part of the present article.

Author information

Authors and affiliations

Erasmus Centre for Economic Sociology (ECES), Erasmus University Rotterdam, Rotterdam, Netherlands

Rob Timans

Centre for Science and Technology Studies (CWTS), Leiden University, Leiden, Netherlands

Paul Wouters

Erasmus Centre for Economic Sociology (ECES), Rotterdam and Centre européen de sociologie et de science politique de la Sorbonne (CESSP-CNRS-EHESS), Paris, France

Johan Heilbron

Corresponding author

Correspondence to Rob Timans.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

About this article

Timans, R., Wouters, P. & Heilbron, J. Mixed methods research: what it is and what it could be. Theor Soc 48, 193–216 (2019). https://doi.org/10.1007/s11186-019-09345-5

Published: 29 March 2019

Issue Date: 01 April 2019

DOI: https://doi.org/10.1007/s11186-019-09345-5

Keywords

  • Data
  • Field analysis
  • Mixed methods research
  • Multiple methods
  • Reflexivity
  • Sociology of science

Research Methods: Peer-Reviewed Journal Articles

What Is a Peer-Reviewed (Academic) Journal?

Peer review is a process that journals use to ensure the articles they publish represent the best scholarship currently available. When an article is submitted to a peer-reviewed journal, the editors send it out to other scholars in the same field (the author's peers) to get their opinion on the quality of the scholarship, its relevance to the field, its appropriateness for the journal, etc.

Publications that don't use peer review (Time, Cosmo, Salon) just rely on the judgement of the editors as to whether an article is up to snuff or not. That's why you can't count on them for solid, scientific scholarship. --University of Texas at Austin

Databases Containing Peer-Reviewed Journal Articles

Each database containing peer-reviewed journals has different content coverage and materials. The databases listed in this Research Guide are available only to Truckee Meadows Community College students, faculty, and staff. You will need your TMCC credentials (username and password) to access them off-campus.

When searching a database, a search term will frequently retrieve many articles. Browse the article abstracts to find one or more relevant to your search.

Some of the databases provide citations for the articles.

Consult a librarian for assistance.

  • Databases with peer-reviewed articles and content. This list can also be sorted by subject!

How to Read a Peer-Reviewed Journal Article

Tips for Reading a Research Article

  • Read the Abstract. It consists of a brief summary of the research questions and methods. It may also state the findings. Because it is short and often written in dense psychological language, you may need to read it a couple of times. Try to restate the abstract in your own nontechnical language.
  • Read the Introduction. This is the beginning of the article, appearing first after the Abstract. This contains information about the authors' interest in the research, why they chose the topic, their hypothesis, and methods. This part also sets out the operational definitions of variables.
  • Read the Discussion section. Skip over the Methods section for the time being. The Discussion section will explain the main findings in great detail and discuss any methodological problems or flaws that the researchers discovered.
  • Read the Methods section. Now that you know the results and what the researchers claim the results mean, you are prepared to read about the Methods. This section explains the type of research and the techniques and assessment instruments used. If the research utilized self-reports and questionnaires, the questions and statements used may be set out either in this section or in an appendix that appears at the end of the report.
  • Read the Results section. This is the most technically challenging part of a research report. But you already know the findings (from reading about them in the Discussion section). This section explains the statistical analyses that led the authors to their conclusions.
  • Read the Conclusion. The last section of the report (before any appendices) summarizes the findings, but, more important for social research, it sets out what the researchers think is the value of their research for real-life application and for public policy. This section often contains suggestions for future research, including issues that the researchers became aware of in the course of the study.
  • Following the conclusions are appendices, usually tables of findings, presentations of questions and statements used in self-reports and questionnaires, and examples of forms used (such as forms for behavioral assessments).

Modified from Net Lab

  • Research article
  • Open access
  • Published: 06 March 2019

Tools used to assess the quality of peer review reports: a methodological systematic review

  • Cecilia Superchi   ORCID: orcid.org/0000-0002-5375-6018 1 , 2 , 3 ,
  • José Antonio González 1 ,
  • Ivan Solà 4 , 5 ,
  • Erik Cobo 1 ,
  • Darko Hren 6 &
  • Isabelle Boutron 7  

BMC Medical Research Methodology volume  19 , Article number:  48 ( 2019 ) Cite this article

24k Accesses

42 Citations

66 Altmetric

Metrics details

A strong need exists for a validated tool that clearly defines peer review report quality in biomedical research, as it will allow the evaluation of interventions aimed at improving the peer review process in well-performed trials. We aim to identify and describe existing tools for assessing the quality of peer review reports in biomedical research.

We conducted a methodological systematic review by searching PubMed, EMBASE (via Ovid) and The Cochrane Methodology Register (via The Cochrane Library) as well as Google® for all reports in English describing a tool for assessing the quality of a peer review report in biomedical research. Data extraction was performed in duplicate using a standardized data extraction form. We extracted information on the structure, development and validation of each tool. We also identified quality components across tools using a systematic multi-step approach and we investigated quality domain similarities among tools by performing hierarchical, complete-linkage clustering analysis.

We identified a total of 24 tools: 23 scales and 1 checklist. Six tools consisted of a single item and 18 had several items, ranging from 4 to 26. None of the tools reported a definition of 'quality'. Only 1 tool described the scale development and 10 provided measures of validity and reliability. Five tools were used as an outcome in a randomized controlled trial (RCT). Moreover, we classified the quality components of the 18 tools with more than one item into 9 main quality domains and 11 subdomains. The tools contained from two to seven quality domains. Some domains and subdomains were considered in most tools, such as the detailed/thorough nature of the reviewer's comments (11/18). Others were rarely considered, such as whether or not the reviewer commented on the statistical methods (1/18).

Several tools are available to assess the quality of peer review reports; however, the development and validation process is questionable and the concepts evaluated by these tools vary widely. The results from this study and from further investigations will inform the development of a new tool for assessing the quality of peer review reports in biomedical research.

Peer Review reports

The use of editorial peer review originates in the eighteenth century [ 1 ]. It is a longstanding and established process that generally aims to provide a fair decision-making mechanism and improve the quality of a submitted manuscript [ 2 ]. Despite the long history and application of the peer review system, its efficacy is still a matter of controversy [ 3 , 4 , 5 , 6 , 7 ]. About 30 years after the first international Peer Review Congress, there are still ‘scarcely any bars to eventual publication. There seems to be no study too fragmented, no hypothesis too trivial [...] for a paper to end up in print’ (Drummond Rennie, chair of the advisory board) [ 8 ].

Recent evidence suggests that many current editors and peer reviewers in biomedical journals still lack the appropriate competencies [ 9 ]. In particular, it has been shown that peer reviewers rarely receive formal training [ 3 ]. Moreover, their capacity to detect errors [ 10 , 11 ], identify deficiencies in reporting [ 12 ] and spin [ 13 ] has been found lacking.

Some systematic reviews have been performed to estimate the effect of interventions aimed at improving the peer review process [ 2 , 14 , 15 ]. These studies showed that there is still a lack of evidence supporting the use of interventions to improve the quality of the peer review process. Furthermore, Bruce and colleagues highlighted the urgent need to clarify outcomes, such as peer review report quality, that should be used in randomized controlled trials evaluating these interventions [ 15 ].

A validated tool that clearly defines peer review report quality in biomedical research is greatly needed. This will allow researchers to have a structured instrument to evaluate the impact of interventions aimed at improving the peer review process in well-performed trials. Such a tool could also be regularly used by editors to evaluate the work of reviewers.

Herein, as a starting point for the development of a new tool, we identify and describe existing tools that assess the quality of peer review reports in biomedical research.

Study design

We conducted a methodological systematic review and followed the standard Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines [16]. The quality of peer review reports is an outcome that in the long term is related to clinical relevance and patient care. However, the protocol was not registered in PROSPERO, as this review does not contain direct health-related outcomes [17].

Information sources and search strategy

We searched PubMed, EMBASE (via Ovid) and The Cochrane Methodology Register (via The Cochrane Library) from their inception to October 27, 2017 as well as Google® (search date: October 20, 2017) for all reports describing a tool to assess the quality of a peer review report in biomedical research. Search strategies were refined in collaboration with an expert methodologist (IS) and are presented in the Additional file  1 . We hand-searched the citation lists of included papers and consulted a senior editor with expertise in editorial policies and peer review processes to further identify relevant reports.

Eligibility criteria

We included all reports describing a tool to assess the quality of a peer review report. Sanderson and colleagues defined a tool as ‘any structured instrument aimed at aiding the user to assess the quality [...]’ [ 18 ]. Building on this definition, we defined a quality tool as any structured or unstructured instrument assisting the user to assess the quality of peer review report (for definitions see Table  1 ). We restricted inclusion to the English language.

Study selection

We exported the references retrieved from the search into the reference manager EndNote X7 (Clarivate Analytics, Philadelphia, United States), which was subsequently used to remove duplicates. We reviewed all records manually to verify and remove duplicates that had not been previously detected. A reviewer (CS) screened all titles and abstracts of the retrieved citations. A second reviewer (JAG) carried out quality control on a 25% random sample obtained using the statistical software R 3.3.3 [19]. We obtained and independently examined the full-text copies of potentially eligible reports for further assessment. In the case of disagreement, consensus was reached through discussion or by involving a third reviewer (DH). We reported the result of this process through a PRISMA flowchart [16]. When several tools were reported in the same article, they were included as separate tools. When a tool was reported in more than one article, we extracted data from all related reports.
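
The review drew its 25% quality-control sample in R 3.3.3; the following is a minimal Python sketch of the same idea (the record identifiers and seed are hypothetical):

```python
import random

# Hypothetical identifiers for the deduplicated records retrieved by the search
# (the review reports 4312 retrieved records in total).
records = [f"rec-{i:04d}" for i in range(1, 4313)]

random.seed(42)  # fixed seed so the quality-control sample is reproducible
qc_sample = random.sample(records, k=round(0.25 * len(records)))  # 25% random sample

print(len(qc_sample))  # 1078 records for the second reviewer to re-screen
```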

Data extraction

General characteristics of tools

We designed a data extraction form using Google® Docs and extracted the general characteristics of the tools. We determined whether each tool was a scale or a checklist. We defined a tool as a scale when it included a numeric or nominal overall quality score, and as a checklist when an overall quality score was not present. We recorded the total number of items (for definitions see Table 1). For scales with more than 1 item, we extracted how items were weighted, how the overall score was calculated, and the scoring range. Moreover, we checked whether the scoring instructions were adequately defined, partially defined, or not defined, according to the subjective judgement of two reviewers (CS and JAG) (an example of the definition for scoring instructions is shown in Table 2). Finally, we extracted all information related to the development and validation of each tool and the assessment of its reliability, and whether the concept of quality was defined.
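
As a sketch of how the extracted fields might be represented (the field names and values below are our own illustration, not the authors' actual form):

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical record mirroring the extraction fields described above.
@dataclass
class ToolRecord:
    name: str
    kind: str                      # "scale" (has an overall score) or "checklist"
    n_items: int
    item_weighting: Optional[str]  # e.g. "equal"; only for multi-item scales
    overall_score: Optional[str]   # "sum", "mean", or "summary"
    scoring_range: Optional[str]   # e.g. "1-5 per item"
    instructions: str              # "adequate", "partial", or "not defined"

example = ToolRecord("Hypothetical scale", "scale", 5, "equal", "mean", "1-5 per item", "partial")
print(example.kind, example.n_items)
```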

Two reviewers (CS and JAG) piloted and refined the data extraction form on a random 5% sample of extracted articles. Full data extraction was conducted by two reviewers (CS and JAG) working independently for all included articles. In the case of disagreement, consensus was obtained by discussion or by involving a third reviewer (DH). Authors of the reports were contacted in cases where we needed further clarification of the tool.

Quality components of the peer review report considered in the tools

We followed the systematic multi-step approach recently described by Gentles [ 20 ], which is based on a constant comparative method of analysis developed within the Grounded Theory approach [ 21 ]. Initially, a researcher (CS) extracted all items included in the tools and for each item identified a ‘key concept’ representing a quality component of peer review reports. Next, two researchers (CS and DH) organized the key concepts into a domain-specific matrix (analogous to the topic-specific matrices described by Gentles). Initially, the matrix consisted of domains for peer review report quality, followed by items representative of each domain and references to literature sources that items were extracted from. As the analysis progressed, subdomains were created and the final version of the matrix included domains, subdomains, items and references.
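
The domain-specific matrix might be sketched as a nested mapping; the domain, subdomain, and item texts below are illustrative placeholders, not the authors' actual entries (their matrix is given in Additional file 4):

```python
# Sketch of the domain-specific matrix: domain -> subdomain -> [(item, source)].
matrix = {
    "Characteristics of reviewer's comments": {
        "clarity": [("Are the reviewer's comments clear?", "[35]")],
        "constructiveness": [("Are the comments constructive?", "[27]")],
    },
    "Timeliness of the review report": {
        None: [("Was the report returned within the deadline?", "[35]")],
    },
}
print(sorted(matrix))
```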

Furthermore, we calculated, for each tool, the proportion of its items falling in each domain. These proportions constitute the tool's domain profile. We then computed the matrix of Euclidean distances between the domain profiles and used these distances to perform a hierarchical, complete-linkage clustering analysis, which provided us with a tree structure that we represent in a chart. Through this graphical summary, we were able to identify domain similarities among the different tools, which helped us draw our analytical conclusions. The calculations and graphical representations were produced using the statistical software R 3.3.3 [19].
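
The analysis was carried out in R; the following Python sketch (with made-up profiles over the nine domains) illustrates the same pipeline of profile construction, Euclidean distances, and complete-linkage clustering:

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical domain profiles: one row per tool, one column per quality domain;
# each row holds the proportion of the tool's items falling in that domain.
profiles = np.array([
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.6, 0.2, 0.2],  # e.g. a five-item tool (3/1/1 items)
    [0.2, 0.1, 0.1, 0.0, 0.1, 0.0, 0.3, 0.1, 0.1],
    [0.0, 0.0, 0.0, 0.5, 0.0, 0.0, 0.5, 0.0, 0.0],
])

distances = pdist(profiles, metric="euclidean")       # condensed distance matrix
tree = linkage(distances, method="complete")          # complete-linkage hierarchy
clusters = fcluster(tree, t=2, criterion="maxclust")  # cut the tree into 2 clusters
print(clusters)
```

Tools whose items concentrate in the same domains end up close together in the tree, which is what the five clusters described in the Results summarize.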

Study selection and general characteristics of reports

The screening process is summarized in a flow diagram (Fig. 1). Of the 4312 records retrieved, we finally included 46 reports: 39 research articles, 3 editorials, 2 information guides, 1 letter to the editor, and 1 study available only as an abstract (excluded studies are listed in Additional file 2; included studies are listed in Additional file 3).

Fig. 1 Study selection flow diagram

General characteristics of the tools

In the 46 reports, we identified 24 tools, including 23 scales and 1 checklist. The tools were developed from 1985 to 2017. Four tools had from 2 to 4 versions [ 22 , 23 , 24 , 25 ]. Five tools were used as an outcome in a randomized controlled trial [ 23 , 25 , 26 , 27 , 28 ]. Table  3 lists the general characteristics of the identified tools. Table  4 presents a more complete descriptive summary of the tools’ characteristics, including types and measures of validity and reliability.

Six scales consisted of a single item enquiring into the overall quality of the peer review report, all of them directly asking users to score the overall quality [22, 25, 29, 30, 31, 32]. These tools assessed the quality of a peer review report using: 1) a 4- or 5-point Likert scale (n = 4); 2) a rating of 'good', 'fair', or 'poor' (n = 1); or 3) a restricted scale from 80 to 100 (n = 1). Seventeen scales and one checklist had several items, ranging in number from 4 to 26. Of these, 10 used the same weight for each item [23, 24, 27, 28, 33, 34, 35, 36, 37, 38]. The overall quality score was the sum of the scores for each item (n = 3), the mean of the item scores (n = 6), or a summary score (n = 11) (for definitions see Table 1). Three scales reported more than one way to assess the overall quality [23, 24, 36]. The scoring system instructions were not defined in 67% of the tools.
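
A minimal sketch of the two arithmetic variants, applied to hypothetical item scores (the third, 'summary score', variant is defined in the paper's Table 1 and is a single overall judgement rather than a computation over items):

```python
# Hypothetical five-item scale, each item scored 1-5 with equal weights.
item_scores = [4, 3, 5, 2, 4]

overall_sum = sum(item_scores)                      # sum of the item scores -> 18
overall_mean = sum(item_scores) / len(item_scores)  # mean of the item scores -> 3.6

print(overall_sum, overall_mean)
```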

None of the tools reported a definition of peer review report quality, and only one described the tool's development [39]. The first version of this tool was designed by a development group composed of four researchers and three editors. It was based on a tool used in an earlier study, which had been developed by reviewing the literature and interviewing editors. Subsequently, the tool was modified by rewording some questions after group discussions, and a guideline for using the tool was drawn up.

Only 3 tools assessed and reported a validation process [39, 40, 41]. The assessed types of validity included face validity, content validity, construct validity, and preliminary criterion validity. Face and content validity could involve either a sole editor and author or a group of researchers and editors. Construct validity was assessed with multiple regression analysis using discriminant criteria (reviewer characteristics such as age, sex, and country of residence) and convergent criteria (training in epidemiology and/or statistics), or by comparing the overall assessment of the peer review report by authors with an assessment of specific components (n = 4–8) of the peer review report by editors or authors. Preliminary criterion validity was assessed by comparing grades obtained by an editor to those obtained by an editor-in-chief using an earlier version of the tool. Reliability was assessed for 9 tools [24, 25, 26, 27, 31, 36, 39, 41, 42]; all reported inter-rater reliability and 2 also reported test-retest reliability. One tool reported internal consistency measured with Cronbach's alpha [39].

Quality components of the peer review reports considered in the tools with more than one item

We extracted 132 items included in the 18 tools. One item asking for the percentage of co-reviews the reviewer had graded was not included in the classification because it represented a method of measuring reviewer’s performance and not a component of peer review report quality.

We organized the key concepts from each item into 'topic-specific matrices' (Additional file 4), identifying 9 main domains and 11 subdomains: 1) relevance of the study (n = 9); 2) originality of the study (n = 5); 3) interpretation of study results (n = 6); 4) strengths and weaknesses of the study (n = 12) (general, methods, and statistical methods); 5) presentation and organization of the manuscript (n = 8); 6) structure of the reviewer's comments (n = 4); 7) characteristics of the reviewer's comments (n = 14) (clarity, constructiveness, detail/thoroughness, fairness, knowledgeability, tone); 8) timeliness of the review report (n = 7); and 9) usefulness of the review report (n = 10) (decision making and manuscript improvement). The total number of tools corresponding to each domain and subdomain is shown in Fig. 2. An explanation and example of all domains and subdomains is provided in Table 5. Some domains and subdomains were considered in most tools, such as whether the reviewer's comments were detailed/thorough (n = 11) and constructive (n = 9), whether the reviewer commented on the relevance of the study (n = 9), and whether the peer review report was useful for manuscript improvement (n = 9). However, other items were rarely considered, such as whether the reviewer made comments on the statistical methods (n = 1).

Fig. 2 Frequency of quality domains and subdomains

Clustering analysis among tools

We created a domain profile for each tool. For example, the tool developed by Justice et al. consisted of 5 items [35]. We classified three items under the domain 'Characteristics of the reviewer's comments', one under 'Timeliness of the review report', and one under 'Usefulness of the review report'. According to this classification, the domain profile (represented by proportions of domains) for this tool was 0.6:0.2:0.2 for these three domains and 0 for the remaining ones. The hierarchical clustering used the matrix of Euclidean distances among domain profiles, which led to five main clusters (Fig. 3).
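
To make the distance concrete: against a hypothetical tool whose items all fall in the characteristics of the reviewer's comments domain (a profile of 1.0 in that domain and 0 elsewhere), the Euclidean distance to the profile above would be √((1 − 0.6)² + 0.2² + 0.2²) = √0.24 ≈ 0.49; the six domains absent from both profiles contribute nothing to the distance.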

Fig. 3 Hierarchical clustering of tools based on the nine quality domains. The figure shows which quality domains are present in each tool. A slice of the chart represents a tool, and each slice is divided into sectors indicating quality domains (in different colours). The area of each sector corresponds to the proportion of each domain within the tool. For instance, the 'Review Rating' tool consists of two domains: Timeliness, meaning that 25% of all its items are encompassed in this domain, and Characteristics of reviewer's comments, occupying the remaining 75%. The blue lines starting from the centre of the chart define how the tools are divided into the five clusters. Clusters #1, #2 and #3 are sub-nodes of a major node grouping all three, meaning that the tools in these clusters have a similar domain profile compared to the tools in clusters #4 and #5

The first cluster consisted of 5 tools developed from 1990 to 2016. All included at least one item in the characteristics of the reviewer's comments domain, representing at least 50% of each domain profile. The second cluster contained 3 tools developed from 1994 to 2006, characterized by incorporating at least one item in each of the usefulness and timeliness domains. The third cluster included 6 tools developed from 1998 to 2010 and exhibited the most heterogeneous mix of domains. These tools were distinct from the rest because they encompassed items related to interpretation of the study results and originality of the study. Moreover, the third cluster included two tools with different versions and variations. The first, second, and third clusters were linked together in the part of the hierarchical tree grouping tools with at least one quality component in the characteristics of the reviewer's comments domain. The fourth cluster held 2 tools developed from 2011 to 2017, each with at least one component in the strengths and weaknesses domain. Finally, the fifth cluster included 2 tools developed from 2009 to 2012 that consisted of the same 2 domains. The fourth and fifth clusters were separated from the rest of the hierarchical tree, grouping tools that covered only a few domains.

Discussion

To the best of our knowledge, this is the first comprehensive review to systematically identify tools used in biomedical research for assessing the quality of peer review reports. We identified 24 tools from both the medical literature and an internet search: 23 scales and 1 checklist. One in four tools consisted of a single item that simply asked the evaluator for a direct assessment of the peer review report's 'overall quality'. The remaining tools had between 4 and 26 items, with the overall quality assessed as the sum of all item scores, their mean, or a summary score.

Since a definition of overall quality was not provided, these tools consisted exclusively of a subjective quality assessment by the evaluators. Moreover, we found that only one study reported a rigorous development process of the tool, although it included a very limited number of people. This is of concern because it means that the identified tools were, in fact, not suitable to assess the quality of a peer review report, particularly because they lack a focused theoretical basis. We found 10 tools that were evaluated for validity and reliability; in particular, criterion validity was not assessed for any tool.

Most of the scales with more than one item resulted in a summary score, and these scales did not consider how items could be weighted differently. Although commonly used, scales are controversial tools for assessing quality, primarily because combining item scores into a summary implicitly assigns weights, which can bias the estimation of the measured construct [43]. It is not clear how weights should be assigned to each item of a scale [18]; thus, different weightings would produce different scales, which could provide varying quality assessments of an individual study [44].

In our methodological systematic review, we found only one checklist. However, it was neither rigorously developed nor validated, and therefore we could not consider it adequate for assessing peer review report quality. We believe that checklists may be a more appropriate means for assessing quality because they do not produce an overall score, meaning they do not require weighting of the items.

It is necessary to clearly define what the tool measures. For example, the Risk of Bias (RoB) tool [ 45 ] has a clear aim (to assess trial conduct and not reporting), and it provides a detailed definition of each domain in the tool, including support for judgment. Furthermore, it was developed with transparent procedures, including wide consultation and review of the empirical evidence. Bias and uncertainty can arise when using tools that are not evidence-based, rigorously developed, validated and reliable; and this is particularly true for tools that are used for evaluating interventions aimed at improving the peer review process in RCTs, thus affecting how trial results are interpreted.

We found that most of the items included in the different tools neither covered the scientific aspects of a peer review report nor were specific to biomedical research. Surprisingly, few tools included an item related to the methods used in the study, and only one inquired about the statistical methods.

In line with a previous study published in 1990 [28], we believe that the quality components found across all tools could be further organized according to the perspective of either an editor or an author, specifically by taking into account the different yet complementary uses of a peer review report. For instance, a reviewer's comments on the relevance of the study and the interpretation of the study's results could assist editors in making an editorial decision, while the clarity and detail/thoroughness of a reviewer's comments are important attributes that help authors improve manuscript quality. We plan to further investigate the perspectives of biomedical editors and authors towards the quality of peer review reports by conducting an international online survey. We will also include patient editors as survey participants, as their involvement in the peer review process can further ensure that research manuscripts are relevant and appropriate to end-users [46].

The present study has strengths but also some limitations. Although we implemented a comprehensive search strategy by following the guidance for conducting methodological reviews [20], we cannot exclude the possibility that some tools were not identified. Moreover, we limited the eligibility criteria to reports published in English. Finally, although the number of eligible records we identified through Google® was very limited, it is possible that we introduced selection bias due to a (re)search bubble effect [47].

Due to the lack of a standard definition of quality, a variety of tools exist for assessing the quality of a peer review report. Overall, we were able to establish 9 quality domains; each of the 18 multi-item tools covered between two and seven of them. The variety of items and item combinations amongst tools raises concern about variations in the quality of publications across biomedical journals. Low-quality biomedical research implies a tremendous waste of resources [48] and directly affects patients' lives. We strongly believe that a validated tool providing a clear definition of peer review report quality is necessary in order to evaluate interventions aimed at improving the peer review process in well-performed trials.

Conclusions

The findings from this methodological systematic review show that the tools for assessing the quality of a peer review report have various components, which have been grouped into 9 domains. We plan to survey a sample of editors and authors in order to refine our preliminary classifications. The results from further investigations will allow us to develop a new tool for assessing the quality of peer review reports. This in turn could be used to evaluate interventions aimed at improving the peer review process in RCTs. Furthermore, it would help editors: 1) evaluate the work of reviewers; 2) provide specific feedback to reviewers; and 3) identify reviewers who provide outstanding review reports. Finally, it might be further used to score the quality of peer review reports in developing programs to train new reviewers.

Abbreviations

PRISMA: Preferred Reporting Items for Systematic Reviews and Meta-Analyses

RCT: Randomized controlled trial

RoB: Risk of Bias

References

Kronick DA. Peer review in 18th-century scientific journalism. JAMA. 1990;263(10):1321–2.

Jefferson T, Alderson P, Wager E, Davidoff F. Effects of editorial peer review. JAMA. 2002;287(21):2784–6.

Smith R. Peer review: a flawed process at the heart of science and journals. J R Soc Med. 2006;99:178–82.

Baxt WG, Waeckerle JF, Berlin JA, Callaham ML. Who reviews the reviewers? Feasibility of using a fictitious manuscript to evaluate peer reviewer performance. Ann Emerg Med. 1998;32(3):310–7.

Kravitz RL, Franks P, Feldman MD, Gerrity M, Byrne C, William M. Editorial peer reviewers' recommendations at a general medical journal: are they reliable and do editors care? PLoS One. 2010;5(4):2–6.

Yaffe MB. Re-reviewing peer review. Sci Signal. 2009;2(85):1–3.

Stahel PF, Moore EE. Peer review for biomedical publications: we can improve the system. BMC Med. 2014;12(179):1–4.

Rennie D. Make peer review scientific. Nature. 2016;535:31–3.

Moher D. Custodians of high-quality science: are editors and peer reviewers good enough? https://www.youtube.com/watch?v=RV2tknDtyDs&t=454s . Accessed 16 Oct 2017.

Ghimire S, Kyung E, Kang W, Kim E. Assessment of adherence to the CONSORT statement for quality of reports on randomized controlled trial abstracts from four high-impact general medical journals. Trials. 2012;13:77.

Boutron I, Dutton S, Ravaud P, Altman DG. Reporting and interpretation of randomized controlled trials with statistically nonsignificant results. JAMA. 2010;303(20):2058–64.

Hopewell S, Collins GS, Boutron I, Yu L-M, Cook J, Shanyinde M, et al. Impact of peer review on reports of randomised trials published in open peer review journals: retrospective before and after study. BMJ. 2014;349:g4145.

Lazarus C, Haneef R, Ravaud P, Boutron I. Classification and prevalence of spin in abstracts of non-randomized studies evaluating an intervention. BMC Med Res Methodol. 2015;15:85.

Jefferson T, Rudin M, Brodney Folse S, et al. Editorial peer review for improving the quality of reports of biomedical studies. Cochrane Database Syst Rev. 2007;2:MR000016.

Bruce R, Chauvin A, Trinquart L, Ravaud P, Boutron I. Impact of interventions to improve the quality of peer review of biomedical journals: a systematic review and meta-analysis. BMC Med. 2016;14:85.

Moher D, Liberati A, Tetzlaff J, Altman DG, The PRISMA Group. Preferred reporting items for systematic reviews and meta-analyses: the PRISMA statement. PLoS Med. 2009;6(7):e1000097.

NHS. PROSPERO International prospective register of systematic reviews. https://www.crd.york.ac.uk/prospero/ . Accessed 6 Nov 2017.

Sanderson S, Tatt ID, Higgins JPT. Tools for assessing quality and susceptibility to bias in observational studies in epidemiology: a systematic review and annotated bibliography. Int J Epidemiol. 2007;36:666–76.

R Core Team. R: a language and environment for statistical computing. http://www.r-project.org/ . Accessed 4 Dec 2017.

Gentles SJ, Charles C, Nicholas DB, Ploeg J, McKibbon KA. Reviewing the research methods literature: principles and strategies illustrated by a systematic overview of sampling in qualitative research. Syst Rev. 2016;5:172.

Glaser B, Strauss A. The discovery of grounded theory. Chicago: Aldine; 1967.

Friedman DP. Manuscript peer review at the AJR: facts, figures, and quality assessment. Am J Roentgenol. 1995;164(4):1007–9.

Black N, Van Rooyen S, Godlee F, Smith R, Evans S. What makes a good reviewer and a good review for a general medical journal? JAMA. 1998;280(3):231–3.

Henly SJ, Dougherty MC. Quality of manuscript reviews in nursing research. Nurs Outlook. 2009;57(1):18–26.

Callaham ML, Baxt WG, Waeckerle JF, Wears RL. Reliability of editors’ subjective quality ratings of peer reviews of manuscripts. JAMA. 1998;280(3):229–31.

Callaham ML, Knopp RK, Gallagher EJ. Effect of written feedback by editors on quality of reviews: two randomized trials. JAMA. 2002;287(21):2781–3.

Van Rooyen S, Godlee F, Evans S, Black N, Smith R. Effect of open peer review on quality of reviews and on reviewers ’ recommendations : a randomised trial. BMJ. 1999;318(7175):23–7.

Mcnutt RA, Evans AT, Fletcher RH, Fletcher SW. The effects of blinding on the quality of peer review. JAMA. 1990;263(10):1371–6.

Moore A, Jones R. Supporting and enhancing peer review in the BJGP. Br J Gen Pract. 2014;64(624):e459–61.

Stossel TP. Reviewer status and review quality. N Engl J Med. 1985;312(10):658–9.

Thompson SR, Agel J, Losina E. The JBJS peer-review scoring scale: a valid, reliable instrument for measuring the quality of peer review reports. Learn Publ. 2016;29:23–5.

Rajesh A, Cloud G, Harisinghani MG. Improving the quality of manuscript reviews : impact of introducing a structured electronic template to submit reviews. AJR. 2013;200:20–3.

Shattell MM, Chinn P, Thomas SP, Cowling WR. Authors’ and editors’ perspectives on peer review quality in three scholarly nursing journals. J Nurs Scholarsh. 2010;42(1):58–65.

Jawaid SA, Jawaid M, Jafary MH. Characteristics of reviewers and quality of reviews: a retrospective study of reviewers at Pakistan journal of medical sciences. Pakistan J Med Sci. 2006;22(2):101–6.

Justice AC, Cho MK, Winker MA, Berlin JA. Does masking author identity improve peer review quality ? A randomized controlled trial. JAMA. 1998;280(3):240–3.

Henly SJ, Bennett JA, Dougherty MC. Scientific and statistical reviews of manuscripts submitted to nursing research: comparison of completeness, quality, and usefulness. Nurs Outlook. 2010;58(4):188–99.

Hettyey A, Griggio M, Mann M, Raveh S, Schaedelin FC, Thonhauser KE, et al. Peerage of science: will it work? Trends Ecol Evol. 2012;27(4):189–90.

Publons. Publons for editors: overview. https://static1.squarespace.com/static/576fcda2e4fcb5ab5152b4d8/t/58e21609d482e9ebf98163be/1491211787054/Publons_for_Editors_Overview.pdf . Accessed 20 Oct 2017.

Van Rooyen S, Black N, Godlee F. Development of the review quality instrument (RQI) for assessing peer reviews of manuscripts. J Clin Epidemiol. 1999;52(7):625–9.

Evans AT, McNutt RA, Fletcher SW, Fletcher RH. The characteristics of peer reviewers who produce good-quality reviews. J Gen Intern Med. 1993;8(8):422–8.

Feurer I, Becker G, Picus D, Ramirez E, Darcy M, Hicks M. Evaluating peer reviews: pilot testing of a grading instrument. JAMA. 1994;272(2):98–100.

Landkroon AP, Euser AM, Veeken H. Quality assessment of reviewers’ reports using a simple instrument. Obstet Gynecol. 2006;108(4):979–85.

Greenland S, O’Rourke K. On the bias produced by quality scores in meta-analysis, and a hierarchical view of proposed solutions. Biostatistics. 2001;2(4):463–71.

Jüni P, Witschi A, Bloch R. The hazards of scoring the quality of clinical trials for meta-analysis. JAMA. 1999;282(11):1054–60.

Higgins JPT, Altman DG, Gøtzsche PC, Jüni P, Moher D, Oxman AD, et al. The Cochrane Collaboration’s tool for assessing risk of bias in randomised trials. BMJ. 2011;343:d5928.

Schroter S, Price A, Flemyng E, et al. Perspectives on involvement in the peer-review process: surveys of patient and public reviewers at two journals. BMJ Open. 2018;8:e023357.

Ćurković M, Košec A. Bubble effect: including internet search engines in systematic reviews introduces selection bias and impedes scientific reproducibility. BMC Med Res Methodol. 2018;18(1):130.

Chalmers I, Bracken MB, Djulbegovic B, Garattini S, Grant J, Gülmezoglu AM, et al. How to increase value and reduce waste when research priorities are set. Lancet. 2014;383(9912):156–65.

Kliewer MA, Freed KS, DeLong DM, Pickhardt PJ, Provenzale JM. Reviewing the reviewers: comparison of review quality and reviewer characteristics at the American journal of roentgenology. AJR. 2005;184(6):1731–5.

Berquist T. Improving your reviewer score: it’s not that difficult. AJR. 2017;209:711–2.

Callaham ML, Mcculloch C. Longitudinal trends in the performance of scientific peer reviewers. Ann Emerg Med. 2011;57(2):141–8.

Yang Y. Effects of training reviewers on quality of peer review: a before-and-after study (Abstract). https://peerreviewcongress.org/abstracts_2009.html . Accessed 7 Nov 2017.

Prechelt L. Review quality collector. https://reviewqualitycollector.org/static/pdf/rqdef-example.pdf . Accessed 20 Oct 2017.

Das Sinha S, Sahni P, Nundy S. Does exchanging comments of Indian and non-Indian reviewers improve the quality of manuscript reviews? Natl Med J India. 1999;12(5):210–3.

Callaham ML, Schriger DL. Effect of structured workshop training on subsequent performance of journal peer reviewers. Ann Emerg Med. 2002;40(3):323–8.

Download references

Acknowledgments

The authors would like to thank the MiRoR consortium for their support, Elizabeth Moylan for helping to identify further relevant reports, and Melissa Sharp for providing advice during the writing of this article.

Funding

This project was supported by the European Union’s Horizon 2020 research and innovation programme under the Marie Sklodowska-Curie grant agreement no. 676207. The funders had no role in the study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Availability of data and materials

The datasets supporting the conclusions of the present study will be available in the Zenodo repository in the Methods in Research on Research (MiRoR) community [ https://zenodo.org/communities/miror/?page=1&size=20 ].

Author information

Authors and Affiliations

Department of Statistics and Operations Research, Barcelona-Tech, UPC, c/ Jordi Girona 1-3, 08034, Barcelona, Spain

Cecilia Superchi, José Antonio González & Erik Cobo

INSERM, U1153 Epidemiology and Biostatistics Sorbonne Paris Cité Research Center (CRESS), Methods of therapeutic evaluation of chronic diseases Team (METHODS), F-75014, Paris, France

Cecilia Superchi

Paris Descartes University, Sorbonne Paris Cité, Paris, France

Iberoamerican Cochrane Centre, Hospital de la Santa Creu i Sant Pau, C/ Sant Antoni Maria Claret 167, Pavelló 18 - planta 0, 08025, Barcelona, Spain

CIBER de Epidemiología y Salud Pública (CIBERESP), Madrid, Spain

Department of Psychology, Faculty of Humanities and Social Sciences, University of Split, Split, Croatia

Centre d’épidémiologie Clinique, Hôpital Hôtel-Dieu, 1 place du Paris Notre-Dame, 75004, Paris, France

Isabelle Boutron


Contributions

All authors provided intellectual contributions to the development of this study. CS, EC, and IB had the initial idea and, with JAG and DH, designed the study. CS designed the search in collaboration with IS. CS conducted the screening, and JAG carried out quality control on a 25% random sample. CS and JAG conducted the data extraction. CS conducted the analysis and, with JAG, designed the figures. CS led the writing of the manuscript. IB supervised the manuscript preparation. All authors provided detailed comments on earlier drafts and approved the final manuscript.

Corresponding author

Correspondence to Cecilia Superchi.

Ethics declarations

Ethics approval and consent to participate

Not required.

Consent for publication

Not applicable.

Competing interests

All authors have completed the ICMJE uniform disclosure form at http://www.icmje.org/coi_disclosure.pdf (available on request from the corresponding author) and declare that (1) no authors have support from any company for the submitted work; (2) IB is the deputy director of French EQUATOR that might have an interest in the work submitted; (3) no author’s spouse, partner, or children have any financial relationships that could be relevant to the submitted work; and (4) none of the authors has any non-financial interests that could be relevant to the submitted work.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Additional files

Additional file 1:

Search strategies. (PDF 182 kb)

Additional file 2:

Excluded studies. (PDF 332 kb)

Additional file 3:

Included studies. (PDF 244 kb)

Additional file 4:

Classification of peer review report quality components. (PDF 2660 kb)

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License ( http://creativecommons.org/licenses/by/4.0/ ), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver ( http://creativecommons.org/publicdomain/zero/1.0/ ) applies to the data made available in this article, unless otherwise stated.


About this article

Cite this article

Superchi, C., González, J.A., Solà, I. et al. Tools used to assess the quality of peer review reports: a methodological systematic review. BMC Med Res Methodol 19, 48 (2019). https://doi.org/10.1186/s12874-019-0688-x


Received: 11 July 2018

Accepted: 20 February 2019

Published: 06 March 2019

DOI: https://doi.org/10.1186/s12874-019-0688-x


Keywords

  • Peer review
  • Quality control
  • Systematic review



Research Methods: How to Perform an Effective Peer Review

Affiliations

  • 1 Paul C. Gaffney Division of Pediatric Hospital Medicine, UPMC Children's Hospital of Pittsburgh, Pittsburgh, Pennsylvania.
  • 2 Department of Pediatrics, University of Pittsburgh School of Medicine, Pittsburgh, Pennsylvania.
  • 3 Weill Department of Medicine, Weill Cornell Medicine, New York, New York.
  • 4 Department of Medicine, University of Minnesota Medical School, Minneapolis, Minnesota.
  • 5 Department of Pediatrics, University of Minnesota Medical School, Minneapolis, Minnesota.
  • PMID: 36214067
  • DOI: 10.1542/hpeds.2022-006764

Scientific peer review has existed for centuries and is a cornerstone of the scientific publication process. Because the number of scientific publications has rapidly increased over the past decades, so has the number of peer reviews and peer reviewers. In this paper, drawing on the relevant medical literature and our collective experience as peer reviewers, we provide a user guide to the peer review process, including discussion of the purpose and limitations of peer review, the qualities of a good peer reviewer, and a step-by-step process of how to conduct an effective peer review.

Copyright © 2022 by the American Academy of Pediatrics.

MeSH terms

  • Peer Review*
  • Peer Review, Research*


What Is Peer Review? | Types & Examples

Published on December 17, 2021 by Tegan George. Revised on June 22, 2023.

Peer review, sometimes referred to as refereeing, is the process of evaluating submissions to an academic journal. Using strict criteria, a panel of reviewers in the same subject area decides whether to accept each submission for publication.

Peer-reviewed articles are considered a highly credible source due to the stringent process they go through before publication.

There are various types of peer review. The main difference between them is to what extent the authors, reviewers, and editors know each other’s identities. The most common types are:

  • Single-blind review
  • Double-blind review
  • Triple-blind review
  • Collaborative review
  • Open review

Relatedly, peer assessment is a process in which your peers provide feedback on something you’ve written, based on a set of criteria or benchmarks from an instructor, offering constructive criticism, compliments, or guidance to help you improve your draft.


Many academic fields use peer review, largely to determine whether a manuscript is suitable for publication. Peer review enhances the credibility of the manuscript. For this reason, academic journals are among the most credible sources you can refer to.

However, peer review is also common in non-academic settings. The United Nations, the European Union, and many individual nations use peer review to evaluate grant applications. It is also widely used in medical and health-related fields as a teaching or quality-of-care measure.

Peer assessment is often used in the classroom as a pedagogical tool. Both receiving feedback and providing it are thought to enhance the learning process, helping students think critically and collaboratively.


Depending on the journal, there are several types of peer review.

Single-blind peer review

The most common type of peer review is single-blind (or single anonymized) review. Here, the names of the reviewers are not known by the author.

While this gives the reviewers the ability to give feedback without the possibility of interference from the author, there has been substantial criticism of this method in the last few years. Many argue that single-blind reviewing can lead to poaching or intellectual theft or that anonymized comments cause reviewers to be too harsh.

Double-blind peer review

In double-blind (or double anonymized) review, both the author and the reviewers are anonymous.

Arguments for double-blind review highlight that this mitigates any risk of prejudice on the side of the reviewer, while protecting the nature of the process. In theory, it also leads to manuscripts being published on merit rather than on the reputation of the author.

Triple-blind peer review

While triple-blind (or triple anonymized) review, in which the identities of the author, reviewers, and editors are all anonymized, does exist, it is difficult to carry out in practice.

Proponents of adopting triple-blind review for journal submissions argue that it minimizes potential conflicts of interest and biases. However, ensuring anonymity is logistically challenging, and current editing software is not always able to fully anonymize everyone involved in the process.

Collaborative peer review

In collaborative review, authors and reviewers interact with each other directly throughout the process. However, the identity of the reviewer is not known to the author. This gives all parties the opportunity to resolve any inconsistencies or contradictions in real time, and provides them a rich forum for discussion. It can mitigate the need for multiple rounds of editing and minimize back-and-forth.

Collaborative review can be time- and resource-intensive for the journal, however. For these collaborations to occur, there has to be a set system in place, often a technological platform, with staff monitoring and fixing any bugs or glitches.

Open peer review

Lastly, in open review, all parties know each other’s identities throughout the process. Often, open review can also include feedback from a larger audience, such as an online forum, or reviewer feedback included as part of the final published product.

While many argue that greater transparency prevents plagiarism or unnecessary harshness, there is also concern about the quality of future scholarship if reviewers feel they have to censor their comments.

In general, the peer review process includes the following steps:

  • First, the author submits the manuscript to the editor.
  • Next, the editor screens the manuscript and decides whether to:
    • Reject the manuscript and send it back to the author, or
    • Send it onward to the selected peer reviewer(s).
  • Next, the peer review process occurs. The reviewer provides feedback, addressing any major or minor issues with the manuscript, and gives their advice regarding what edits should be made.
  • Lastly, the edited manuscript is sent back to the author. They input the edits and resubmit it to the editor for publication.


In an effort to be transparent, many journals are now disclosing who reviewed each article in the published product. There are also increasing opportunities for collaboration and feedback, with some journals allowing open communication between reviewers and authors.

It can seem daunting at first to conduct a peer review or peer assessment. If you’re not sure where to start, there are several best practices you can use.

Summarize the argument in your own words

Summarizing the main argument helps the author see how their argument is interpreted by readers, and gives you a jumping-off point for providing feedback. If you’re having trouble doing this, it’s a sign that the argument needs to be clearer, more concise, or worded differently.

If the author sees that you’ve interpreted their argument differently than they intended, they have an opportunity to address any misunderstandings when they get the manuscript back.

Separate your feedback into major and minor issues

It can be challenging to keep feedback organized. One strategy is to start out with any major issues and then flow into the more minor points. It’s often helpful to keep your feedback in a numbered list, so the author has concrete points to refer back to.

Major issues typically consist of any problems with the style, flow, or key points of the manuscript. Minor issues include spelling errors, citation errors, or other smaller, easy-to-apply feedback.

Tip: Try not to focus too much on the minor issues. If the manuscript has a lot of typos, consider making a note that the author should address spelling and grammar issues, rather than going through and fixing each one.

The best feedback you can provide is anything that helps them strengthen their argument or resolve major stylistic issues.

Give the type of feedback that you would like to receive

No one likes being criticized, and it can be difficult to give honest feedback without sounding overly harsh or critical. One strategy you can use here is the “compliment sandwich,” where you “sandwich” your constructive criticism between two compliments.

Be sure you are giving concrete, actionable feedback that will help the author submit a successful final draft. While you shouldn’t tell them exactly what they should do, your feedback should help them resolve any issues they may have overlooked.

As a rule of thumb, your feedback should be:

  • Easy to understand
  • Constructive


Below is a brief annotated research example.

Influence of phone use on sleep

Studies show that teens from the US are getting less sleep than they were a decade ago (Johnson, 2019). On average, teens only slept for 6 hours a night in 2021, compared to 8 hours a night in 2011. Johnson mentions several potential causes, such as increased anxiety, changed diets, and increased phone use.

The current study focuses on the effect phone use before bedtime has on the number of hours of sleep teens are getting.

For this study, a sample of 300 teens was recruited using social media, such as Facebook, Instagram, and Snapchat. The first week, all teens were allowed to use their phone the way they normally would, in order to obtain a baseline.

The sample was then divided into 3 groups:

  • Group 1 was not allowed to use their phone before bedtime.
  • Group 2 used their phone for 1 hour before bedtime.
  • Group 3 used their phone for 3 hours before bedtime.

All participants were asked to go to sleep around 10 p.m. to control for variation in bedtime. In the morning, their Fitbit showed the number of hours they’d slept. They kept track of these numbers themselves for 1 week.

Two independent t tests were used to compare Group 1 with Group 2, and Group 1 with Group 3. The first t test showed no significant difference (p > .05) in the number of hours slept between Group 1 (M = 7.8, SD = 0.6) and Group 2 (M = 7.0, SD = 0.8). The second t test showed a significant difference (p < .01) between Group 1 (M = 7.8, SD = 0.6) and Group 3 (M = 6.1, SD = 1.5).

This shows that teens sleep fewer hours a night if they use their phone for over an hour before bedtime, compared to teens who use their phone for 0 to 1 hours.
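
For readers who want to see how this style of comparison is computed, here is a minimal sketch using Python's scipy. The raw data are simulated from the example's reported group sizes, means, and SDs (an assumption), so the computed p-values will not reproduce the values reported above.

```python
# Minimal sketch: independent-samples t tests in the style of the example.
# Data are simulated from the reported means/SDs; results will differ.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(4)
g1 = rng.normal(7.8, 0.6, size=100)  # no phone before bedtime
g2 = rng.normal(7.0, 0.8, size=100)  # 1 hour of phone use
g3 = rng.normal(6.1, 1.5, size=100)  # 3 hours of phone use

print("Group 1 vs Group 2:", ttest_ind(g1, g2))
print("Group 1 vs Group 3:", ttest_ind(g1, g3))
```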

Peer review is an established and hallowed process in academia, dating back hundreds of years. It provides various fields of study with metrics, expectations, and guidance to ensure published work is consistent with predetermined standards.

  • Protects the quality of published research

Peer review can stop obviously problematic, falsified, or otherwise untrustworthy research from being published. Any content that raises red flags for reviewers can be closely examined in the review stage, preventing plagiarized or duplicated research from being published.

  • Gives you access to feedback from experts in your field

Peer review represents an excellent opportunity to get feedback from renowned experts in your field and to improve your writing through their feedback and guidance. Experts with knowledge about your subject matter can give you feedback on both style and content, and they may also suggest avenues for further research that you hadn’t yet considered.

  • Helps you identify any weaknesses in your argument

Peer review acts as a first defense, helping you ensure your argument is clear and that there are no gaps, vague terms, or unanswered questions for readers who weren’t involved in the research process. This way, you’ll end up with a more robust, more cohesive article.

While peer review is a widely accepted metric for credibility, it’s not without its drawbacks.

  • Reviewer bias

The more transparent double-blind system is not yet very common, which can lead to bias in reviewing. A common criticism is that an excellent paper by a new researcher may be declined, while an objectively lower-quality submission by an established researcher would be accepted.

  • Delays in publication

The thoroughness of the peer review process can lead to significant delays in publishing time. Research that was current at the time of submission may not be as current by the time it’s published. There is also high risk of publication bias , where journals are more likely to publish studies with positive findings than studies with negative findings.

  • Risk of human error

By its very nature, peer review carries a risk of human error. In particular, falsification often cannot be detected, given that reviewers would have to replicate entire experiments to ensure the validity of results.



A credible source should pass the CRAAP test and follow these guidelines:

  • The information should be up to date and current.
  • The author and publication should be a trusted authority on the subject you are researching.
  • The sources the author cited should be easy to find, clear, and unbiased.
  • For a web source, the URL and layout should signify that it is trustworthy.

Cite this Scribbr article


George, T. (2023, June 22). What Is Peer Review? | Types & Examples. Scribbr. Retrieved April 9, 2024, from https://www.scribbr.com/methodology/peer-review/


  • Open access
  • Published: 06 April 2024

Statistical analyses of ordinal outcomes in randomised controlled trials: a scoping review

  • Chris J. Selman   ORCID: orcid.org/0000-0002-1277-5538 1 , 2 ,
  • Katherine J. Lee 1 , 2 ,
  • Kristin N. Ferguson 4 ,
  • Clare L. Whitehead 4 , 5 ,
  • Brett J. Manley 4 , 6 , 7 &
  • Robert K. Mahar 1 , 3  

Trials volume  25 , Article number:  241 ( 2024 ) Cite this article

294 Accesses

3 Altmetric

Metrics details

Randomised controlled trials (RCTs) aim to estimate the causal effect of one or more interventions relative to a control. One type of outcome that can be of interest in an RCT is an ordinal outcome, which is useful to answer clinical questions regarding complex and evolving patient states. The target parameter of interest for an ordinal outcome depends on the research question and the assumptions the analyst is willing to make. This review aimed to provide an overview of how ordinal outcomes have been used and analysed in RCTs.

The review included RCTs with an ordinal primary or secondary outcome published between 2017 and 2022 in four highly ranked medical journals (the British Medical Journal , New England Journal of Medicine , The Lancet , and the Journal of the American Medical Association ) identified through PubMed. Details regarding the study setting, design, the target parameter, and statistical methods used to analyse the ordinal outcome were extracted.

The search identified 309 studies, of which 144 were eligible for inclusion. The most used target parameter was an odds ratio, reported in 78 (54%) studies. The ordinal outcome was dichotomised for analysis in 47 (33%) studies, and the most common statistical model used to analyse the ordinal outcome on the full ordinal scale was the proportional odds model (64 [44%] studies). Notably, 86 (60%) studies did not explicitly check or describe the robustness of the assumptions for the statistical method(s) used.

Conclusions

The results of this review indicate that in RCTs that use an ordinal outcome, there is variation in the target parameter and the analytical approaches used, with many dichotomising the ordinal outcome. Few studies provided assurance regarding the appropriateness of the assumptions and methods used to analyse the ordinal outcome. More guidance is needed to improve the transparent reporting of the analysis of ordinal outcomes in future trials.

Peer Review reports

Randomised controlled trials (RCTs) aim to estimate the causal effect of one or more interventions relative to a control or reference intervention. Ordinal outcomes are useful in RCTs because the categories can represent multiple patient states within a single endpoint. An ordinal outcome is one that comprises monotonically ranked categories ordered hierarchically, such that the distance between any two categories is not necessarily equal (or even meaningfully quantifiable) [ 1 ]. Ordinal outcomes should have categories that are mutually exclusive and unambiguously defined, and they can be used to capture improvement and deterioration relative to a baseline value where relevant [ 2 ]. If an ordinal scale is being used to capture change in patient status, the ordinal outcome should also be symmetric, to avoid favouring a better or worse health outcome [ 2 ]. Commonly used ordinal outcomes in RCTs include the modified-Rankin scale, a 7-category measure of disability following stroke or neurological insult [ 3 , 4 , 5 , 6 ]; the Glasgow Outcome Scale-Extended (GOS-E), an 8-category measure of functional impairment after traumatic brain injury [ 7 ]; and the World Health Organization (WHO) COVID-19 Clinical Progression Scale [ 8 ], an 11-point measure of disease severity among patients with COVID-19. The WHO Clinical Progression Scale, developed specifically for COVID-19 in 2020 [ 8 ], has been used in many RCTs evaluating COVID-19 disease severity and progression [ 9 , 10 ] and has helped to increase familiarity with ordinal data and with modelling approaches for ordinal outcomes among clinicians and statisticians alike [ 11 ].

Randomised controlled trials that use ordinal outcomes need to be designed and analysed with care. This includes the need to explicitly define the target parameter to compare the intervention groups (i.e. the target of estimation, for example, a proportional odds ratio (OR)), the analysis approach, and whether assumptions used in the analysis are valid. Although this is true for all RCTs, these issues are more complex when using an ordinal outcome compared to a binary or continuous outcome. For example, the choice of target parameter for an ordinal outcome depends on both the research question [ 12 , 13 ] and the assumptions that the analyst is willing to make about the data.

One option is to preserve the ordinal nature of the outcome, which can give rise to a number of different target parameters. Principled analysis of ordinal data often relies on less familiar statistical methods and underlying assumptions, and many statistical methods have been proposed to analyse ordinal outcomes. One approach to estimating the effect of treatment on the distribution of an ordinal endpoint is to use a cumulative logistic model [ 14 , 15 ]. This model uses the distribution of the cumulative log-odds of the ordinal outcome to estimate a set of ORs [ 16 ], which, for an increase in the value of a covariate, represent the odds of being in the same or a higher category at each level of the ordinal scale [ 15 ]. Modelling is vastly simplified by assuming that each covariate in the model exerts the same effect on the cumulative log-odds for each binary split of the ordinal outcome, regardless of the threshold. This is known as the proportional odds (PO) assumption, and the model is referred to as ordered logistic regression or the PO model (we use the latter term herein). The PO model has the desirable properties of palindromic invariance (the parameter estimates are equivalent when the order of the categories is reversed) and invariance under collapsibility (the estimated target parameter is unchanged when categories of the response are combined or removed) [ 17 ]. Studies have shown that an ordinal analysis of the outcome using a PO model increases statistical power relative to an analysis of the dichotomised scale [ 18 , 19 ]. The target parameter from this model, the proportional or common OR, also has a relatively intuitive interpretation [ 20 , 21 ], representing a shift in the distribution of ordinal scale scores toward a better outcome in an intervention group compared to a reference group.
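
To make the PO model concrete, the sketch below fits a cumulative logit (proportional odds) model to simulated data using Python's statsmodels; the 7-category outcome, variable names, and effect size are illustrative assumptions, not data from any study in this review.

```python
# Minimal sketch of a proportional odds (cumulative logit) model fit,
# assuming simulated data; statsmodels' OrderedModel implements the model.
import numpy as np
import pandas as pd
from statsmodels.miscmodels.ordinal_model import OrderedModel

rng = np.random.default_rng(0)
n = 400
treatment = rng.integers(0, 2, size=n)  # 1 = intervention, 0 = control

# Simulate a 7-category ordinal outcome (categories 0..6) whose latent
# distribution shifts upward under treatment (true log-OR = 0.5).
latent = rng.logistic(loc=0.5 * treatment, size=n)
outcome = pd.cut(latent, bins=[-np.inf, -2, -1, 0, 1, 2, 3, np.inf],
                 labels=False)

exog = pd.DataFrame({"treatment": treatment})
model = OrderedModel(outcome, exog, distr="logit")
result = model.fit(method="bfgs", disp=False)

# The treatment coefficient is the cumulative log-OR, assumed constant
# across every binary split of the scale (the PO assumption).
print("Proportional OR:", np.exp(result.params["treatment"]))
```

Under the PO assumption a single OR summarises the shift across all six cut-points of the 7-category scale; reversing the category order simply inverts the OR, which is the palindromic invariance property noted above.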

The PO model makes the assumption that the odds are proportional for each binary split of the ordinal outcome. If this assumption is violated, the proportional OR may be misleading in certain circumstances. Specifically, violation of PO can affect type I or II error rates and/or distort the magnitude of the treatment effect. For example, violation of proportional odds can increase the likelihood of a type I error, since the model may incorrectly identify evidence of a relationship between the treatment and outcome. It may also increase the likelihood of a type II error, since a model that fails to capture the true complexity of the relationship may fail to identify a relationship between the treatment and the ordinal outcome. In addition, a treatment may exert a harmful effect for some categories of the ordinal outcome but a beneficial effect for the remaining categories, which can ‘average’ out to no treatment effect when a constant OR is assumed across the levels of the ordinal scale. Violation of PO may also be harmful if there is interest in estimating predicted probabilities for the categories of the ordinal scale, which will be too low or too high for some outcomes when PO is assumed. Although the PO assumption will ‘average’ the treatment effect across the categories of the ordinal outcome, this may not be a problem if the treatment effects at each cut-point are all in the same direction and the research aim is simply to show whether the treatment is effective, even in the presence of non-PO. If the PO assumption is meaningfully violated and the interest is either in the treatment effect on a specific range of the outcome or in obtaining predicted probabilities for each category of the scale, the PO model can be extended to a partial proportional odds (PPO) model, which allows the PO assumption to be relaxed for a specific set of covariates or for all covariates in the model [ 22 ]. There are two types of PPO models: the unconstrained PPO model, in which the cumulative log-ORs vary freely across some or all of the cut-points [ 23 ], and the constrained PPO model, which assumes some functional relationship between the cumulative log-ORs [ 21 ]. However, such an approach may be less efficient than using a PO model [ 24 , 25 ].

Alternative statistical methods that can be used to analyse the ordinal outcome include multinomial regression, which estimates an OR for each category of the ordinal outcome relative to a baseline category. The disadvantage of multinomial regression is that the number of ORs requiring estimation increases with the number of categories in the ordinal outcome, so a larger sample size may be required to ensure adequate precision for the many target parameters. Other methods are the continuation ratio model and the adjacent-category logistic model, though these models lack the two desirable properties of palindromic invariance and invariance under collapsibility [ 15 , 17 , 26 ].
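
For contrast with the PO model, the hedged sketch below fits a multinomial logistic model, which ignores the ordering and estimates one OR per non-baseline category; the simulated data and variable names are assumptions, and statsmodels' MNLogit is used.

```python
# Minimal sketch: multinomial logistic regression on a 4-category outcome,
# estimating a separate OR per category relative to the baseline category.
# Simulated data for illustration only; the ordering of the scale is ignored.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 500
treatment = rng.integers(0, 2, size=n)
shift = rng.binomial(1, 0.5, size=n) * treatment
outcome = np.clip(rng.integers(0, 4, size=n) + shift, 0, 3)  # categories 0..3

X = sm.add_constant(treatment.astype(float))
fit = sm.MNLogit(outcome, X).fit(disp=False)

# One column of coefficients per non-baseline category (here 3 columns),
# so the number of ORs grows with the number of categories.
print(np.exp(fit.params))
```

With a 7-category scale this approach would yield six treatment ORs rather than one, which illustrates the precision cost described above.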

Another option is to use alternative methods, such as the Mann-Whitney U test or Wilcoxon rank-sum test [ 27 ] (referred to as the Wilcoxon test herein). The Wilcoxon test is equivalent to the PO model with a single binary exposure variable [ 15 , 28 ]. The treatment effect from a Wilcoxon test is the concordance probability, which represents the probability that a randomly chosen observation from the treatment group is greater than a randomly chosen observation from the control group [ 29 , 30 ]. This parameter closely mirrors the OR derived from the PO model; importantly, the direction of the OR from the PO model always agrees with the direction of the concordance probability. The disadvantages of the Wilcoxon test are that the concordance probability may be unfamiliar to clinicians and that the test cannot be adjusted for covariates.
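
A hedged sketch of this approach follows: scipy's Mann-Whitney U (Wilcoxon rank-sum) test on simulated ordinal scores, with the concordance probability recovered from the U statistic. The data and group sizes are illustrative assumptions.

```python
# Minimal sketch: Wilcoxon (Mann-Whitney U) test on ordinal scores, with
# the concordance probability as the effect measure. Simulated data only.
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(2)
control = rng.integers(0, 7, size=150)                     # scores 0..6
treated = np.clip(rng.integers(0, 7, size=150)
                  + rng.binomial(1, 0.5, size=150), 0, 6)  # mild upward shift

u_stat, p_value = mannwhitneyu(treated, control, alternative="two-sided")

# U / (n1 * n2) estimates P(treated > control), with ties counted as 1/2:
# the concordance probability described above.
concordance = u_stat / (len(treated) * len(control))
print(f"U = {u_stat:.0f}, p = {p_value:.3f}, concordance = {concordance:.3f}")
```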

Another option is to dichotomise the ordinal outcome and use an OR or risk difference as the target parameter, estimated using logistic or binomial regression. This produces an effect estimate with a clear clinical interpretation that may be suitable for specific clinical settings. The disadvantage of dichotomising an ordinal outcome is that it discards potentially useful information within the levels of the scale, meaning the trial may require a larger sample size to maintain the same statistical power to detect a clinically important treatment effect [ 19 ], which may not be feasible in all RCTs depending on cost constraints or the rate of recruitment. The decision to dichotomise may also depend on when the outcome is measured. This was highlighted in a study showing that an ordinal analysis of the modified-Rankin scale captured differences in long-term outcomes in survivors of stroke better than an analysis that dichotomised the ordinal outcome [ 3 , 31 ].
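
As a rough sketch of the dichotomisation approach (the cut-point, variable names, and data below are assumptions for illustration, not the method of any included study), a logistic regression on the collapsed outcome yields a single OR:

```python
# Minimal sketch: dichotomising an ordinal outcome at an assumed clinically
# meaningful cut-point and estimating an OR via logistic regression.
# Simulated data for illustration only.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n = 400
treatment = rng.integers(0, 2, size=n)
shift = rng.binomial(1, 0.5, size=n) * treatment
outcome = np.clip(rng.integers(0, 7, size=n) + shift, 0, 6)  # scores 0..6

good = (outcome >= 4).astype(int)  # assumed "good outcome" dichotomy

X = sm.add_constant(treatment.astype(float))
fit = sm.Logit(good, X).fit(disp=False)
print("Odds ratio:", np.exp(fit.params[1]))
```

Every score above the cut-point is treated identically here, which is precisely the information loss described in the paragraph above.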

An alternative to dichotomisation is to treat the ordinal outcome as continuous and focus on the mean difference as the target parameter. The choice to treat the outcome as continuous may be based on the number of categories: the more categories there are, the more the outcome resembles a continuum, particularly if proximate categories measure similar states or if the scale reflects a latent continuous variable. This has the advantage that modelling is straightforward and familiar, but it can lead to ill-defined clinical interpretations of the treatment effect, since the differences between proximate categories are neither equal nor necessarily quantifiable. Such an analysis also wrongly assumes that the outcome has an unbounded range.

There has been commentary [ 32 ] and research on the methodology of using ordinal outcomes in certain RCT settings, which has mainly focused on the benefit of an ordinal analysis using a PO model [ 19 , 33 , 34 , 35 ], including investigations into the use of a PPO model when the PO assumption is violated [ 36 ]. However, these studies have focused on a limited number of statistical methods, mostly in specific medical areas such as neurology, and so may not be applicable more generally. Given the growing use of ordinal outcomes in RCTs, it is crucial to gain a deeper understanding of how ordinal outcomes are utilised in practice. This understanding will help identify any issues in the use of ordinal outcomes in RCTs and facilitate discussions on improving the reporting and analysis of such outcomes. To address this, we conducted a scoping review to systematically examine the use and analysis of ordinal outcomes in the current literature. Specifically, we aimed to:

Identify which target parameters are of interest in RCTs that use an ordinal outcome and whether these are explicitly defined.

Describe how ordinal outcomes are analysed in RCTs to estimate a treatment effect.

Describe whether RCTs that use an ordinal outcome adequately report key methodological aspects specific to the analysis of the ordinal outcome.

A pre-specified protocol was developed for this scoping review [ 37 ]. Deviations from the protocol are outlined in Additional file 1. Here, we provide an overview of the protocol and present the findings from the review, which are reported using the Preferred Reporting Items for Systematic Reviews and Meta-Analyses extension for Scoping Reviews (PRISMA-ScR) checklist [ 38 ].

Eligibility criteria

Studies were included in the review if they were published in one of four highly ranked medical journals (the British Medical Journal (BMJ), New England Journal of Medicine (NEJM), Journal of the American Medical Association (JAMA), or The Lancet) between 1 January 2017 and 31 July 2022 and reported the results of at least one RCT (e.g. a manuscript reporting results from multiple trials was eligible) with either a primary or secondary outcome measured on an ordinal scale. These journals were chosen because they are leading medical journals that publish original, peer-reviewed research with primarily clinical aims and have been used in other reviews of trial methodology [ 39 , 40 ]. RCTs were defined using the Cochrane definition: a study that prospectively assigns individuals to one of two (or more) interventions using some random or quasi-random method of allocation [ 41 ].

Studies were excluded from this review if they were written in a language other than English, since we did not have sufficient resources to translate them. We also excluded studies that were purely methodological, those for which the abstract or full text was not available, those that reported data from non-human subjects, and those that provided only a commentary, review, opinion, or description. Manuscripts that reported only a trial protocol or statistical analysis plan were also excluded, since one of the main objectives of this review was to determine which statistical methods are being used to analyse trial data. Studies that used ordinal outcomes measured on a numerical rating or visual analogue scale were also excluded: although these scales are often considered ordinal, they imply equidistance between contiguous categories and can conceivably be analysed as continuous data.

Information sources

Studies were identified for inclusion in the review by searching the online bibliographic database PubMed; the search was executed on 3 August 2022.

Search strategy

The search strategy for this review was developed by CJS in consultation with KJL and RKM. The search strategy employed terms that have been developed to identify RCTs [ 41 ] and terms that have been used to describe an ordinal outcome in published manuscripts for RCTs. The complete search strategy that was used in this review is described in Table 1 .

Selection of sources of evidence

There was no pre-specified sample size for this review. All eligible studies that were identified via the search strategy were included in the review.

Piloting of the eligibility criteria was conducted by CJS and RKM, who independently assessed the titles and abstracts of 20 studies to ensure consistency between reviewers. CJS then performed the search on the PubMed database. All titles and abstracts identified were extracted into Covidence, a web-based tool for managing systematic reviews [ 42 ]. A two-phase screening process was employed: all abstracts and titles were screened by CJS in the first phase, and studies that were not excluded moved to the second phase, in which the full text was evaluated against the eligibility criteria by CJS. A random sample of 40 studies was also assessed for eligibility by a second reviewer (one of KJL, RKM, BJM, or CLW). All studies deemed eligible were included in the data extraction.

Data extraction

A data extraction questionnaire was developed in Covidence [ 42 ], piloted by CJS and RKM using a sample of 10 studies, and further refined. The final version of the questionnaire is shown in Additional file 2, and a full list of the data extraction items is provided in Table 2. Data were extracted from both the main manuscript and any supplementary material, including statistical analysis plans. CJS extracted data from all eligible studies in the review. Double data extraction was performed by KJL and RKM on a random sample of 20 studies. Any uncertainties in the screening and data extraction process were discussed and resolved by consensus among all reviewers. Simplifications and assumptions made for eligibility and data extraction are outlined in Additional file 1.

Synthesis of results

The data extracted from Covidence were cleaned and analysed using Stata [ 43 ]. Descriptive statistics were used to summarise the data: frequencies and percentages were reported for categorical variables, and medians and interquartile ranges (IQRs) for continuous variables. Qualitative data were synthesised in a narrative format.
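
As an illustration of these summaries (using Python/pandas rather than Stata, with an invented data frame and variable names), the same descriptive statistics can be computed as follows:

```python
# Minimal sketch of the descriptive summaries described above: frequencies
# and percentages for a categorical variable, median and IQR for a
# continuous one. The data frame below is invented for illustration.
import pandas as pd

df = pd.DataFrame({
    "journal": ["NEJM", "JAMA", "Lancet", "NEJM", "BMJ", "JAMA"],
    "n_participants": [380, 202, 803, 450, 150, 612],
})

counts = df["journal"].value_counts()
summary = pd.DataFrame({"n": counts, "%": (100 * counts / len(df)).round(1)})
print(summary)

q1, med, q3 = df["n_participants"].quantile([0.25, 0.5, 0.75])
print(f"median = {med:.0f} (IQR: {q1:.0f}-{q3:.0f})")
```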

Results of the search

The initial search identified 309 studies, of which 46 were excluded for not being an RCT. There were 263 studies that underwent full text review. Of these, 119 were excluded: 110 because they did not have an ordinal outcome, and nine because they were not an RCT. In total, 144 studies were eligible for data extraction [ 44 , 45 , 46 , 47 , 48 , 49 , 50 , 51 , 52 , 53 , 54 , 55 , 56 , 57 , 58 , 59 , 60 , 61 , 62 , 63 , 64 , 65 , 66 , 67 , 68 , 69 , 70 , 71 , 72 , 73 , 74 , 75 , 76 , 77 , 78 , 79 , 80 , 81 , 82 , 83 , 84 , 85 , 86 , 87 , 88 , 89 , 90 , 91 , 92 , 93 , 94 , 95 , 96 , 97 , 98 , 99 , 100 , 101 , 102 , 103 , 104 , 105 , 106 , 107 , 108 , 109 , 110 , 111 , 112 , 113 , 114 , 115 , 116 , 117 , 118 , 119 , 120 , 121 , 122 , 123 , 124 , 125 , 126 , 127 , 128 , 129 , 130 , 131 , 132 , 133 , 134 , 135 , 136 , 137 , 138 , 139 , 140 , 141 , 142 , 143 , 144 , 145 , 146 , 147 , 148 , 149 , 150 , 151 , 152 , 153 , 154 , 155 , 156 , 157 , 158 , 159 , 160 , 161 , 162 , 163 , 164 , 165 , 166 , 167 , 168 , 169 , 170 , 171 , 172 , 173 , 174 , 175 , 176 , 177 , 178 , 179 , 180 , 181 , 182 , 183 , 184 , 185 , 186 , 187 ]. A flow diagram of the study selection is shown in Fig. 1 . The questionnaire that was used to extract the data from each study is provided in Additional file 2 .

Fig. 1 Flow diagram of the study

Study characteristics

A summary of the study characteristics is presented in Table 3. The highest proportion of studies was published in the NEJM (61 studies, 42%), followed by JAMA (40, 28%) and The Lancet (34, 24%), with only nine studies (6%) published in the BMJ. The number of studies that used an ordinal outcome was higher in 2020 and 2021 (30, 21% in each year) than in earlier years (21, 15% in 2019; 24, 17% in 2018; and 23, 16% in 2017). Nearly all studies were conducted in a clinical setting (141, 98%). The most common medical condition studied was stroke (39, 28%), followed by COVID-19 (22, 16%) and atopic dermatitis (6, 4%). The most common medical field was neurology (54, 38%), followed by infectious diseases (22, 16%, all of which were COVID-19 studies), dermatology (13, 9%), and psychiatry (12, 9%). Studies were mostly funded by public sources (104, 72%). The median number of participants in the primary analysis of the ordinal outcome was 380 (interquartile range (IQR): 202–803).

Of the 144 included studies, 58 (40%) used some form of adaptive design: 47 (33%) had explicitly defined early stopping rules for efficacy or futility, 18 (13%) used sample size re-estimation, three (2%) used response-adaptive randomisation, three (2%) used covariate-adaptive randomisation, three (2%) were platform trials, and three (2%) used adaptive enrichment focused on specific subgroups of patients.

Ordinal outcomes and target parameters

A summary of the properties of the ordinal outcomes used in the studies is shown in Table 4. An ordinal scale was used as a primary outcome in 59 (41%) studies. Most studies used an ordinal scale to describe an outcome at a single point in time (128, 89%), with 16 (11%) studies using an ordinal outcome to capture changes over time. One study used a Likert scale whose categories were ambiguously defined in the manuscript. Another study used an ordinal outcome to measure change over time, but the scale was asymmetric and biased towards a favourable outcome. The median number of categories in the ordinal outcome was 7 (IQR: 6–7), ranging from 3 to 23.

There were 32 studies that determined the sample size in advance based on the ordinal outcome, of which 26 (81%) used an analytical approach and six (19%) used simulation. Among the studies that used an analytical approach, five reported using the Whitehead method and three reported using a t-test; for the remainder, it was unclear which specific method was used to compute the sample size.

The ordinal outcome was dichotomised for analysis in 47 (33%) studies. Justifications for dichotomisation included that the dichotomy represented a clinically meaningful effect and/or was common in analyses of the outcome in similar studies (reported in 24 studies), that the dichotomised outcome represented an agreeable endpoint based on feedback from clinicians and/or patients and families (two studies), or that the assumptions of the statistical model for the categorical outcome were violated (three studies).

A variety of target parameters were used for the ordinal outcomes. The target parameter could be determined in 130 studies; however, 59 of these (45%) did not clearly or explicitly define the target parameter of interest, and it had to be inferred from the information provided in the manuscript. Where the target parameter could be determined, an OR was the most common (78, 54%), followed by a risk difference (31, 22%). A difference in means or medians was the target parameter in 11 (8%) and 8 (6%) studies, respectively. There were 14 (10%) studies that did not estimate a target parameter, either because the study was descriptive in nature, because the analysis used a non-parametric procedure, or because the target parameter could not be determined (or some combination thereof).

Statistical methods and assumptions

A variety of descriptive measures were used to summarise the distribution of the ordinal outcome by intervention group (Table 5). The most common descriptive statistics were frequencies and/or percentages in each category of the ordinal outcome (116, 81%), followed by the median score across all categories (33, 23%) and IQRs (31, 22%). The mean and standard deviation across the categories of the ordinal outcome were summarised in only 16 (11%) and 10 (7%) studies, respectively.

Many different statistical methods were used to analyse the ordinal outcome (Table 5). The PO model was the most common (64, 44%), and was used to estimate a proportional OR in 62 studies. Among studies that used a PO model, the interpretation of the target parameter varied (see Additional file 3); the most frequent definition was that the proportional OR represented an ordinal shift in the distribution of ordinal scale scores toward a better outcome in the intervention relative to the control group (12, 19%). When the outcome was dichotomised, logistic regression was used in 16 studies (11% of all studies), usually to estimate an OR or a risk difference using g-computation. Seven studies estimated a risk difference or risk ratio using binomial regression. Studies also calculated and reported a risk difference with corresponding 95% confidence intervals estimated using methods such as the Wald method or bootstrapping (31, 22%). There were 19 (13%) studies that used a non-parametric method to analyse the ordinal outcome (either dichotomised or not), including the Cochran-Mantel-Haenszel test (15, 10%) to estimate an odds ratio, the Wilcoxon test (14, 10%; none of which reported a concordance probability as the target parameter), and Fisher's exact or chi-square tests (12, 8%). Other methods included the Hodges-Lehmann estimator, used to estimate a median difference (3, 2%), and the Van Elteren test (2, 1%), an extension of the Wilcoxon test for comparing treatments in a stratified experiment. Linear regression was used in 16 (11%) studies, which tended to estimate a mean or risk difference (despite the model having unbounded support).
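
To make the most commonly used method concrete, the following is a minimal sketch (in Python with statsmodels, on simulated data; all names are illustrative and not taken from any reviewed trial) of fitting a cumulative-logit PO model and extracting the proportional OR:

```python
import numpy as np
import pandas as pd
from statsmodels.miscmodels.ordinal_model import OrderedModel

rng = np.random.default_rng(1)
n = 400
treat = rng.integers(0, 2, n)                       # 1 = intervention, 0 = control
# 7-category ordinal outcome generated from a treatment-shifted logistic latent variable
latent = 0.5 * treat + rng.logistic(size=n)
outcome = pd.cut(latent, [-np.inf, -2, -1, 0, 1, 2, 3, np.inf], labels=False)

model = OrderedModel(outcome, pd.DataFrame({"treat": treat}), distr="logit")
res = model.fit(method="bfgs", disp=False)

# exp(beta) is the proportional OR: the common odds of a higher (better) category
# for intervention vs control, assumed constant across all cut-points of the scale.
print(np.exp(res.params["treat"]))
```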

The majority of studies (86, 60%) did not explicitly check the validity of the assumptions of the statistical method(s) used. For example, no study that analysed the ordinal outcome using linear regression commented on the appropriateness of the specific numeric scores assigned to the outcome categories. Among the 64 studies that used a PO model, 20 (31%) did not report whether the PO assumption was satisfied. Overall, 46 studies reported checking key modelling assumptions; however, the method used to check these assumptions was not reported in 6 (13%) of them. The most common way to verify model assumptions was through statistical tests (31, 67%), followed by graphical methods (2, 4%).

Among the 44 studies that assessed the validity of the PO assumption for a PO model, 13 (30%) used a likelihood ratio test, 10 (23%) used the Brant test, and 10 (23%) also used the score test. Six (14%) studies assessed the robustness of the PO assumption by fitting a logistic regression model at every level of the ordinal outcome across the scale, presenting the OR for each dichotomous break. Two studies assessed the PO assumption graphically, plotting either the inverse cumulative log odds or the empirical cumulative log odds. It was unclear which method was used in ten studies that reported having checked the assumption.
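
One of the checks described above, fitting a logistic regression at every binary split of the scale, can be sketched as follows (simulated data in the same illustrative style as the earlier sketch; under PO the split-specific ORs should be roughly equal):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(1)
treat = rng.integers(0, 2, 400)
outcome = pd.cut(0.5 * treat + rng.logistic(size=400),
                 [-np.inf, -2, -1, 0, 1, 2, 3, np.inf], labels=False)

X = sm.add_constant(treat)
for c in range(1, int(outcome.max()) + 1):    # each dichotomous break: Y >= c
    fit = sm.Logit((outcome >= c).astype(int), X).fit(disp=False)
    print(f"split Y >= {c}: OR = {np.exp(fit.params[1]):.2f}")
```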

There were 12 studies (8%) that reported using a different statistical method than originally planned. Ten of these had originally planned to use a PO model, but the PO assumption was determined to have been violated and an alternative method was chosen. One study removed the covariate that was reported to have violated the PO assumption and still used a PO model to analyse the outcome. Two studies used an unconstrained PPO model and reported an adjusted OR for each binary split of the ordinal outcome. Three studies used a Wilcoxon test, with one stratifying by a baseline covariate that violated the PO assumption. Another study dichotomised the ordinal outcome for the analysis. One study used a Van Elteren test to estimate a median difference (which inappropriately assumes an equal distance between proximate categories), another used a Poisson model with robust standard errors, and one study retained the planned analysis despite the violation of PO. Notably, no study in which a covariate other than the treatment violated the PO assumption reported using a PPO model, and seven studies did not report which covariate(s) violated the PO assumption.

Frequentist inference was the most common framework for the analysis (133, 92%), with Bayesian methods used in eight (6%) studies (two studies used both); all eight studies using Bayesian methods had an adaptive design. Of these, seven used a Bayesian PO model for the analysis, of which four used a Dirichlet prior distribution to model the baseline probabilities and three used a normally distributed prior on the proportional log-OR scale. Two of these studies reported the median proportional OR with a corresponding 95% credible interval, while one reported the mean proportional OR. Three studies reported that the models were fitted using a Markov chain Monte Carlo algorithm with either 10,000 (one study) or 100,000 (two studies) samples from the joint posterior distribution. No study reported how the goodness-of-fit of the model was assessed.
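
For context, one way such a Bayesian PO model can be written (a sketch consistent with the priors reported above, not the specification of any particular study) is

\[
\begin{aligned}
(\pi_1, \ldots, \pi_K) &\sim \operatorname{Dirichlet}(\alpha_1, \ldots, \alpha_K), \\
\beta &\sim \mathcal{N}(\mu_0, \sigma_0^2), \\
\operatorname{logit} P(Y_i \ge j \mid T_i) &= \operatorname{logit}\Big(\sum\nolimits_{k \ge j} \pi_k\Big) + \beta\, T_i, \qquad j = 2, \ldots, K,
\end{aligned}
\]

where \(\pi_k\) are the baseline (control-arm) category probabilities, \(T_i\) is the treatment indicator, and \(e^{\beta}\) is the proportional OR whose posterior median or mean, with a 95% credible interval, corresponds to the summaries reported in these studies.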

Of the 38 studies that collected repeated measurements of the ordinal outcome, 18 (47%) adjusted for the baseline measurement, 14 (37%) used mixed effects models, and four (11%) used generalised estimating equations to capture the correlation among the repeated measures for an individual.
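
A minimal sketch (simulated repeated measures; all names illustrative) of a marginal ordinal analysis with generalised estimating equations, using statsmodels' OrdinalGEE, is shown below; a mixed effects alternative would instead add subject-level random effects.

```python
import numpy as np
import pandas as pd
from statsmodels.genmod.cov_struct import Independence
from statsmodels.genmod.generalized_estimating_equations import OrdinalGEE

rng = np.random.default_rng(2)
n, t = 100, 3                                      # 100 subjects, 3 visits each
ids = np.repeat(np.arange(n), t)                   # cluster (subject) identifiers
treat = np.repeat(rng.integers(0, 2, n), t)
latent = 0.5 * treat + rng.normal(size=n * t)
y = pd.cut(latent, [-np.inf, -1, 0, 1, np.inf], labels=False)  # 4 ordered categories

model = OrdinalGEE(y, pd.DataFrame({"treat": treat}), groups=ids,
                   cov_struct=Independence())
res = model.fit()
print(res.summary())    # exp(coef) for treat is the marginal proportional OR
```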

A range of statistical packages were used for the analysis of the ordinal outcome, with SAS (81, 56%) and R (35, 24%) being the most common. Twelve (8%) studies did not report the software used.

Discussion

This review has provided an overview of how ordinal outcomes are used and analysed in contemporary RCTs. Below we discuss the insights this review provides into the study design, statistical analysis, and reporting of trials using ordinal outcomes.

Target parameter

The target parameter of interest is an important consideration when planning any trial and should be aligned with the research question [ 12 , 13 ]. The most common target parameter in this review was an OR, either for a dichotomised version of the ordinal outcome or in an analysis that used the full ordinal scale. When an ordinal analysis was used, the target parameter was commonly a proportional OR, although its interpretation varied between studies. The most common interpretation was that the proportional OR represents an average shift in the distribution of the ordinal scale scores toward a better outcome in the intervention, relative to the comparator(s) [ 19 , 35 , 188 , 189 ]. Many of the studies that dichotomised the ordinal outcome lacked justification for doing so and, in one case, dichotomisation was chosen only because the PO assumption was violated, even though this changed the target parameter.
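
For reference, the cumulative-logit PO model underlying this interpretation can be written, for an outcome \(Y\) with \(K\) ordered categories and treatment indicator \(T\), as

\[
\operatorname{logit} P(Y_i \ge j \mid T_i) = \alpha_j + \beta\, T_i, \qquad j = 2, \ldots, K,
\]

with a single \(\beta\) common to every cut-point, so that \(e^{\beta}\) is the same OR for every dichotomisation \(Y \ge j\) of the scale; it is this common shift across all cut-points that licenses the 'average shift toward a better outcome' reading.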

Some studies in our review treated the ordinal outcome as if it were continuous and used a difference in means or medians as the target parameter. These quantities do not represent a clinically meaningful effect when the outcome is ordinal, since proximate categories of the scale are not necessarily separated by a quantifiable or equal distance, which can hinder the translation of trial results into practice. If a study is to use a mean difference, the researchers should justify the appropriateness of the specific numeric scores assigned to the ordinal outcome categories.
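
The coding dependence of the mean difference is easy to demonstrate; the following sketch (simulated, illustrative) shows that relabelling the categories with a different monotone scoring changes the mean difference, whereas a rank-based summary is unaffected:

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(3)
ctrl = rng.integers(0, 5, 200)                         # categories 0..4
trt = np.clip(ctrl + rng.integers(0, 2, 200), 0, 4)   # mild shift upwards

for scores in (np.array([0, 1, 2, 3, 4]),              # two equally defensible
               np.array([0, 1, 2, 5, 10])):            # monotone codings
    print("mean difference:", scores[trt].mean() - scores[ctrl].mean())

print("Mann-Whitney U:", mannwhitneyu(trt, ctrl).statistic)  # coding-invariant
```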

The target parameter, and the statistical method used to estimate it, could not be determined in some studies. Notably, the target parameter was not explicitly defined in almost half of the studies, despite current recommendations on the importance of clearly defining the estimand of interest, one component of which is the target parameter [ 12 , 13 ]. Furthermore, there was a lack of clarity in defining the target parameter when a PO model was used, even though its interpretation is analogous to the OR for a binary outcome, applying to an interval of the ordinal scale rather than a single value. Consistency in the definition of target parameters in RCTs would make interpretation easier for clinicians and applied researchers. Explicit definition of the target parameter of interest is essential for readers to understand what a clinically meaningful treatment effect represents, and also reflects the current push within clinical research with regard to estimands [ 12 , 13 ].

Statistical methods

It is important to summarise the distribution of the outcome by intervention group in any RCT. When the outcome is ordinal, frequencies and percentages in each category provide a useful summary of this distribution. Most studies in this review reported frequencies and percentages in each category, although some studies that dichotomised the outcome reported these summaries only for the dichotomised scale. Some studies reported means and standard deviations across the categories, which, as noted previously, may not have a valid interpretation.
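
A minimal sketch (simulated, illustrative) of this recommended descriptive summary, frequencies and percentages in each category by arm:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({"arm": rng.integers(0, 2, 200),
                   "outcome": rng.integers(0, 7, 200)})   # 7 ordered categories

counts = pd.crosstab(df["outcome"], df["arm"], margins=True)
pct = (pd.crosstab(df["outcome"], df["arm"], normalize="columns") * 100).round(1)
print(counts, pct, sep="\n\n")
```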

Although a range of statistical methods can be used to analyse an ordinal outcome, we found that the PO model was the most commonly used. This is likely because the PO model is relatively well known among statisticians, is straightforward to fit in most statistical packages, and possesses the desirable properties of palindromic invariance and invariance under collapsibility. However, it is important to assess and report whether the PO assumption has been met when the aim is to estimate the treatment effect across the different categories of the outcome or to estimate predicted probabilities in each category; the validity of the PO assumption is less important when the objective is simply to understand whether one treatment is 'better' on average than a comparator. In this review, studies that used a PO model commonly defined a target parameter relating to a treatment benefiting patients at every level of the outcome scale, yet only 44 of the 64 studies reported checking the PO assumption, highlighting a deficiency in this practice. Statistical tests were the most common way of assessing the PO assumption, although it may be preferable to avoid hypothesis testing for this purpose, particularly with small sample sizes, as these tests can have poor statistical power [ 22 , 190 ]. Researchers should also keep in mind that when the PO assumption is tested, the type I error of the analysis may change, and that p-values and confidence intervals based on the updated model ignore the model-fitting uncertainty [ 191 ].

When the PO assumption was violated, a PPO model was rarely used; instead, baseline covariates were removed from the model to address the departure from PO. The underuse of the PPO model could be due to a lack of awareness that such models exist and can address violations of PO. A PPO model could have been particularly useful in the studies where only covariates other than the treatment violated the PO assumption, as it could have been used to estimate a single proportional OR for the treatment effect. Of note, however, an unconstrained PPO model does not strictly require ordinality, as the categories can be rearranged with little effect on the model fit [ 192 ], and the estimated probabilities can be negative [ 193 ].
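
For reference, an unconstrained PPO model of the kind described here can be written, for treatment \(T\) and an offending covariate \(Z\), as

\[
\operatorname{logit} P(Y_i \ge j \mid T_i, Z_i) = \alpha_j + \beta\, T_i + \gamma_j Z_i, \qquad j = 2, \ldots, K,
\]

where the cut-point-specific coefficients \(\gamma_j\) absorb the departure from PO for \(Z\) while the treatment effect is still summarised by a single proportional log-OR \(\beta\).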

There are other methods that can be used to assess the validity of the PO assumption, such as plotting the predicted log-odds for the different categories of the ordinal outcome, whose curves should be parallel if the assumption holds [ 16 ]. Another option is to fit a logistic regression model at every level of the ordinal outcome across the scale and compare the estimated ORs and corresponding confidence intervals for each binary split, or to simulate predictive distributions. However, estimating separate ORs in this way can be inefficient, particularly when the ordinal outcome has a large number of categories. Arguably more important than assessing the validity of the PO assumption is assessing the impact of making the assumption compared with not making it. If the treatment effect goes in the same direction across each category of the ordinal scale and the objective is simply to understand whether one treatment is better overall, then departures from PO may not be important. If, however, the interest is in estimating a treatment effect for every level of the ordinal outcome, and/or the treatment has a detrimental effect at one end of the ordinal scale but a beneficial effect for the remaining categories, there should be careful consideration of the validity of the type I and II error rates and of the estimated treatment effect if a PO model is used.
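
The graphical check mentioned above, plotting the empirical cumulative log-odds by arm, which should give roughly parallel curves under PO, can be sketched as follows (simulated, illustrative; degenerate proportions of 0 or 1 would need special handling):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
treat = rng.integers(0, 2, 400)
outcome = np.digitize(0.5 * treat + rng.logistic(size=400),
                      [-2, -1, 0, 1, 2, 3])           # categories 0..6

cuts = np.arange(1, outcome.max() + 1)
for arm, label in [(0, "control"), (1, "intervention")]:
    p = np.array([(outcome[treat == arm] >= c).mean() for c in cuts])
    plt.plot(cuts, np.log(p / (1 - p)), marker="o", label=label)
plt.xlabel("cut-point c")
plt.ylabel("empirical log odds of Y >= c")
plt.legend()
plt.show()
```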

Finally, a handful of studies used the Wilcoxon, chi-square, or Fisher's exact test (the latter being overly conservative [ 194 ] and potentially providing misleading results); commonly, only a p-value, and no target parameter, was reported when these methods were used. The lack of a target parameter for the treatment effect can make it difficult for clinicians to translate the results into practice.

Strengths and limitations

A strength of this study is that we reviewed a large number of RCTs that used ordinal outcomes, published in four highly ranked medical journals, to characterise the current state of practice for analysing ordinal outcomes. The screening and data extraction process was conducted systematically, and pilot tests and double data extraction ensured the consistency and reliability of the extracted data. The PRISMA-ScR checklist was used to ensure a high standard of reporting.

This review does, however, have limitations. The restriction to the PubMed database and four highly ranked medical journals may affect the generalisability of this review. We made this decision given the scoping nature of the review, to ensure reproducibility, and to keep the total number of included studies manageable. We also aimed to include studies likely to reflect best practice in how research using ordinal outcomes is currently conducted and reported. Given the selected journals' reputation for rigour, these findings are likely to reflect a best-case scenario. In addition, our search strategy may have missed certain phrases or variants (particularly those describing an ordinal outcome); however, we attempted to mitigate this through our piloting phase. Finally, we did not review the protocol papers of the trials, which may have included additional information related to the statistical methodology, such as the methods planned for assessing the PO assumption and any alternative methods to be used instead.

Implications of this research

This review has implications for researchers designing RCTs that use an ordinal outcome. Although the majority of included studies were in the fields of neurology and infectious diseases, the results of this review apply to RCTs in any medical field that uses an ordinal outcome. We have shown that there is substantial variation in the analysis and reporting of ordinal outcomes in practice. Our results suggest that researchers should carefully consider the target parameter of interest and explicitly report what it represents; this is particularly important for an ordinal outcome, which can be unfamiliar to readers. Defining the target parameter upfront will help ensure that appropriate analytical methods are used to analyse the ordinal outcome and will make transparent the assumptions the researchers are willing to make.

Our review also highlights the need for careful assessment and reporting of the validity of the model assumptions made during the analysis of an ordinal outcome. Doing so will ensure that robust statistical methods that align with the research question and categorical nature of the ordinal outcome are used to estimate a valid, clinically relevant target parameter that can be translated to practice.

Availability of data and materials

The datasets and code generated and/or analysed during the current study are available on GitHub [ 195 ].

Abbreviations

RCT: Randomised controlled trial

PO: Proportional odds

PPO: Partial proportional odds

SAP: Statistical analysis plan

Velleman PF, Wilkinson L. Nominal, ordinal, interval, and ratio typologies are misleading. Am Stat. 1993;47(1):65–72.

MacKenzie CR, Charlson ME. Standards for the use of ordinal scales in clinical trials. Br Med J (Clin Res Ed). 1986;292(6512):40–3.

Banks JL, Marotta CA. Outcomes validity and reliability of the modified Rankin scale: implications for stroke clinical trials: a literature review and synthesis. Stroke. 2007;38(3):1091–6.

de la Ossa NP, Abilleira S, Jovin TG, García-Tornel Á, Jimenez X, Urra X, et al. Effect of direct transportation to thrombectomy-capable center vs local stroke center on neurological outcomes in patients with suspected large-vessel occlusion stroke in nonurban areas: the RACECAT randomized clinical Trial. JAMA. 2022;327(18):1782–94.

Hubert GJ, Hubert ND, Maegerlein C, Kraus F, Wiestler H, Müller-Barna P, et al. Association between use of a flying intervention team vs patient interhospital transfer and time to endovascular thrombectomy among patients with acute ischemic stroke in nonurban Germany. JAMA. 2022;327(18):1795–805.

Bösel J, Niesen WD, Salih F, Morris NA, Ragland JT, Gough B, et al. Effect of early vs standard approach to tracheostomy on functional outcome at 6 months among patients with severe stroke receiving mechanical ventilation: the SETPOINT2 Randomized Clinical Trial. JAMA. 2022;327(19):1899–909.

Wilson L, Boase K, Nelson LD, Temkin NR, Giacino JT, Markowitz AJ, et al. A manual for the glasgow outcome scale-extended interview. J Neurotrauma. 2021;38(17):2435–46.

Marshall JC, Murthy S, Diaz J, Adhikari N, Angus DC, Arabi YM, et al. A minimal common outcome measure set for COVID-19 clinical research. Lancet Infect Dis. 2020;20(8):e192–7.

Lovre D, Bateman K, Sherman M, Fonseca VA, Lefante J, Mauvais-Jarvis F. Acute estradiol and progesterone therapy in hospitalised adults to reduce COVID-19 severity: a randomised control trial. BMJ Open. 2021;11(11):e053684.

Song AT, Rocha V, Mendrone-Júnior A, Calado RT, De Santis GC, Benites BD, et al. Treatment of severe COVID-19 patients with either low-or high-volume of convalescent plasma versus standard of care: a multicenter Bayesian randomized open-label clinical trial (COOP-COVID-19-MCTI). Lancet Reg Health-Am. 2022;10:100216.

Mathioudakis AG, Fally M, Hashad R, Kouta A, Hadi AS, Knight SB, et al. Outcomes evaluated in controlled clinical trials on the management of COVID-19: a methodological systematic review. Life. 2020;10(12):350.

Akacha M, Bretz F, Ohlssen D, Rosenkranz G, Schmidli H. Estimands and their role in clinical trials. Stat Biopharm Res. 2017;9(3):268–71.

Mallinckrodt C, Molenberghs G, Lipkovich I, Ratitch B. Estimands, estimators and sensitivity analysis in clinical trials. CRC Press; 2019.

Walker SH, Duncan DB. Estimation of the probability of an event as a function of several independent variables. Biometrika. 1967;54(1–2):167–79.

McCullagh P. Regression models for ordinal data. J R Stat Soc Ser B Methodol. 1980;42(2):109–27.

Harrell FE, et al. Regression modeling strategies: with applications to linear models, logistic and ordinal regression, and survival analysis, vol 3. Springer; 2015.

Ananth CV, Kleinbaum DG. Regression models for ordinal responses: a review of methods and applications. Int J Epidemiol. 1997;26(6):1323–33.

Armstrong BG, Sloan M. Ordinal regression models for epidemiologic data. Am J Epidemiol. 1989;129(1):191–204.

Roozenbeek B, Lingsma HF, Perel P, Edwards P, Roberts I, Murray GD, et al. The added value of ordinal analysis in clinical trials: an example in traumatic brain injury. Crit Care. 2011;15(3):1–7.

Breheny P. Proportional odds models. 2015. MyWeb. https://myweb.uiowa.edu/pbreheny/uk/teaching/760-s13/notes/4-23.pdf .

Abreu MNS, Siqueira AL, Cardoso CS, Caiaffa WT. Ordinal logistic regression models: application in quality of life studies. Cad Saúde Pública. 2008;24:s581–91.

Peterson B, Harrell FE Jr. Partial proportional odds models for ordinal response variables. J R Stat Soc: Ser C: Appl Stat. 1990;39(2):205–17.

Fullerton AS. A conceptual framework for ordered logistic regression models. Sociol Methods Res. 2009;38(2):306–47.

Senn S, Julious S. Measurement in clinical trials: a neglected issue for statisticians? Stat Med. 2009;28(26):3189–209.

Maas AI, Steyerberg EW, Marmarou A, McHugh GS, Lingsma HF, Butcher I, et al. IMPACT recommendations for improving the design and analysis of clinical trials in moderate to severe traumatic brain injury. Neurotherapeutics. 2010;7:127–34.

McFadden D, et al. Conditional logit analysis of qualitative choice behavior.  Front Econ. 1973;105–142.

Wilcoxon F. Individual comparisons by ranking methods. Springer; 1992.

Liu Q, Shepherd BE, Li C, Harrell FE Jr. Modeling continuous response variables using ordinal regression. Stat Med. 2017;36(27):4316–35.

Fay MP, Brittain EH, Shih JH, Follmann DA, Gabriel EE. Causal estimands and confidence intervals associated with Wilcoxon-Mann-Whitney tests in randomized experiments. Stat Med. 2018;37(20):2923–37.

De Neve J, Thas O, Gerds TA. Semiparametric linear transformation models: effect measures, estimators, and applications. Stat Med. 2019;38(8):1484–501.

Ganesh A, Luengo-Fernandez R, Wharton RM, Rothwell PM. Ordinal vs dichotomous analyses of modified Rankin Scale, 5-year outcome, and cost of stroke. Neurology. 2018;91(21):e1951–60.

French B, Shotwell MS. Regression models for ordinal outcomes. JAMA. 2022;328(8):772–3.

Bath PM, Geeganage C, Gray LJ, Collier T, Pocock S. Use of ordinal outcomes in vascular prevention trials: comparison with binary outcomes in published trials. Stroke. 2008;39(10):2817–23.

Scott SC, Goldberg MS, Mayo NE. Statistical assessment of ordinal outcomes in comparative studies. J Clin Epidemiol. 1997;50(1):45–55.

McHugh GS, Butcher I, Steyerberg EW, Marmarou A, Lu J, Lingsma HF, et al. A simulation study evaluating approaches to the analysis of ordinal outcome data in randomized controlled trials in traumatic brain injury: results from the IMPACT Project. Clin Trials. 2010;7(1):44–57.

DeSantis SM, Lazaridis C, Palesch Y, Ramakrishnan V. Regression analysis of ordinal stroke clinical trial outcomes: an application to the NINDS t-PA trial. Int J Stroke. 2014;9(2):226–31.

Selman CJ, Lee KJ, Whitehead CL, Manley BJ, Mahar RK. Statistical analyses of ordinal outcomes in randomised controlled trials: protocol for a scoping review. Trials. 2023;24(1):1–7.

Tricco AC, Lillie E, Zarin W, O’Brien KK, Colquhoun H, Levac D, et al. PRISMA extension for scoping reviews (PRISMA-ScR): checklist and explanation. Ann Intern Med. 2018;169(7):467–73.

Bell ML, Fiero M, Horton NJ, Hsu CH. Handling missing data in RCTs; a review of the top medical journals. BMC Med Res Methodol. 2014;14(1):1–8.

Berwanger O, Ribeiro RA, Finkelsztejn A, Watanabe M, Suzumura EA, Duncan BB, et al. The quality of reporting of trial abstracts is suboptimal: survey of major general medical journals. J Clin Epidemiol. 2009;62(4):387–92.

Higgins JP, Thomas J, Chandler J, Cumpston M, Li T, Page MJ, et al. Cochrane handbook for systematic reviews of interventions. John Wiley & Sons; 2019.

Veritas Health Innovation. Covidence systematic review software. Melbourne; 2022.

StataCorp. Stata statistical software: release 17. College Station: StataCorp LP; 2021.

Hanley DF, Lane K, McBee N, Ziai W, Tuhrim S, Lees KR, et al. Thrombolytic removal of intraventricular haemorrhage in treatment of severe stroke: results of the randomised, multicentre, multiregion, placebo-controlled CLEAR III trial. Lancet. 2017;389(10069):603–11. https://doi.org/10.1016/S0140-6736(16)32410-2 .

Nangia J, Wang T, Osborne C, Niravath P, Otte K, Papish S, et al. Effect of a scalp cooling device on alopecia in women undergoing chemotherapy for breast cancer: the SCALP randomized clinical trial. JAMA. 2017;317(6):596–605. https://doi.org/10.1001/jama.2016.20939 . United States.

Ruzicka T, Hanifin JM, Furue M, Pulka G, Mlynarczyk I, Wollenberg A, et al. Anti-interleukin-31 receptor A antibody for atopic dermatitis. N Engl J Med. 2017;376(9):826–35. https://doi.org/10.1056/NEJMoa1606490 . United States.

Németh G, Laszlovszky I, Czobor P, Szalai E, Szatmári B, Harsányi J, et al. Cariprazine versus risperidone monotherapy for treatment of predominant negative symptoms in patients with schizophrenia: a randomised, double-blind, controlled trial. Lancet. 2017;389(10074):1103–13. https://doi.org/10.1016/S0140-6736(17)30060-0 . England.

Mathieson S, Maher CG, McLachlan AJ, Latimer J, Koes BW, Hancock MJ, et al. Trial of pregabalin for acute and chronic sciatica. N Engl J Med. 2017;376(12):1111–20. https://doi.org/10.1056/NEJMoa1614292 . United States.

Baud O, Trousson C, Biran V, Leroy E, Mohamed D, Alberti C. Association between early low-dose hydrocortisone therapy in extremely preterm neonates and neurodevelopmental outcomes at 2 years of age. JAMA. 2017;317(13):1329–37. https://doi.org/10.1001/jama.2017.2692 . United States.

van den Berg LA, Dijkgraaf MG, Berkhemer OA, Fransen PS, Beumer D, Lingsma HF, et al. Two-year outcome after endovascular treatment for acute ischemic stroke. N Engl J Med. 2017;376(14):1341–9. https://doi.org/10.1056/NEJMoa1612136 . United States.

Kaufman J, Fitzpatrick P, Tosif S, Hopper SM, Donath SM, Bryant PA, et al. Faster clean catch urine collection (Quick-Wee method) from infants: randomised controlled trial. BMJ. 2017;357:j1341. https://doi.org/10.1136/bmj.j1341 .

Costa Leme A, Hajjar LA, Volpe MS, Fukushima JT, De Santis Santiago RR, Osawa EA, et al. Effect of intensive vs moderate alveolar recruitment strategies added to lung-protective ventilation on postoperative pulmonary complications: a randomized clinical trial. JAMA. 2017;317(14):1422–32. https://doi.org/10.1001/jama.2017.2297 . United States.

Breitenstein C, Grewe T, Flöel A, Ziegler W, Springer L, Martus P, et al. Intensive speech and language therapy in patients with chronic aphasia after stroke: a randomised, open-label, blinded-endpoint, controlled trial in a health-care setting. Lancet. 2017;389(10078):1528–38. https://doi.org/10.1016/S0140-6736(17)30067-3 . England.

Wechsler ME, Akuthota P, Jayne D, Khoury P, Klion A, Langford CA, et al. Mepolizumab or placebo for eosinophilic granulomatosis with polyangiitis. N Engl J Med. 2017;376(20):1921–32. https://doi.org/10.1056/NEJMoa1702079 .

Devinsky O, Cross JH, Laux L, Marsh E, Miller I, Nabbout R, et al. Trial of cannabidiol for drug-resistant seizures in the Dravet syndrome. N Engl J Med. 2017;376(21):2011–20. https://doi.org/10.1056/NEJMoa1611618 . United States.

Anderson CS, Arima H, Lavados P, Billot L, Hackett ML, Olavarría VV, et al. Cluster-randomized, crossover trial of head positioning in acute stroke. N Engl J Med. 2017;376(25):2437–47. https://doi.org/10.1056/NEJMoa1615715 . United States.

Juch JNS, Maas ET, Ostelo RWJG, Groeneweg JG, Kallewaard JW, Koes BW, et al. Effect of radiofrequency denervation on pain intensity among patients with chronic low back pain: the Mint randomized clinical trials. JAMA. 2017;318(1):68–81. https://doi.org/10.1001/jama.2017.7918 .

Mohamed S, Johnson GR, Chen P, Hicks PB, Davis LL, Yoon J, et al. Effect of antidepressant switching vs augmentation on remission among patients with major depressive disorder unresponsive to antidepressant treatment: the VAST-D randomized clinical trial. JAMA. 2017;318(2):132–45. https://doi.org/10.1001/jama.2017.8036 .

Kanes S, Colquhoun H, Gunduz-Bruce H, Raines S, Arnold R, Schacterle A, et al. Brexanolone (SAGE-547 injection) in post-partum depression: a randomised controlled trial. Lancet. 2017;390(10093):480–9. https://doi.org/10.1016/S0140-6736(17)31264-3 . England.

Lapergue B, Blanc R, Gory B, Labreuche J, Duhamel A, Marnat G, et al. Effect of endovascular contact aspiration vs stent retriever on revascularization in patients with acute ischemic stroke and large vessel occlusion: the ASTER randomized clinical trial. JAMA. 2017;318(5):443–52. https://doi.org/10.1001/jama.2017.9644 .

Lindley RI, Anderson CS, Billot L, Forster A, Hackett ML, Harvey LA, et al. Family-led rehabilitation after stroke in India (ATTEND): a randomised controlled trial. Lancet. 2017;390(10094):588–99. https://doi.org/10.1016/S0140-6736(17)31447-2 . England.

Berlowitz DR, Foy CG, Kazis LE, Bolin LP, Conroy MB, Fitzpatrick P, et al. Effect of intensive blood-pressure treatment on patient-reported outcomes. N Engl J Med. 2017;377(8):733–44. https://doi.org/10.1056/NEJMoa1611179 .

Hui D, Frisbee-Hume S, Wilson A, Dibaj SS, Nguyen T, De La Cruz M, et al. Effect of lorazepam with haloperidol vs haloperidol alone on agitated delirium in patients with advanced cancer receiving palliative care: a randomized clinical trial. JAMA. 2017;318(11):1047–56. https://doi.org/10.1001/jama.2017.11468 .

Roffe C, Nevatte T, Sim J, Bishop J, Ives N, Ferdinand P, et al. Effect of routine low-dose oxygen supplementation on death and disability in adults with acute stroke: the stroke oxygen study randomized clinical trial. JAMA. 2017;318(12):1125–35. https://doi.org/10.1001/jama.2017.11463 .

Dwivedi R, Ramanujam B, Chandra PS, Sapra S, Gulati S, Kalaivani M, et al. Surgery for drug-resistant epilepsy in children. N Engl J Med. 2017;377(17):1639–47. https://doi.org/10.1056/NEJMoa1615335 . United States.

Nogueira RG, Jadhav AP, Haussen DC, Bonafe A, Budzik RF, Bhuva P, et al. Thrombectomy 6 to 24 hours after stroke with a mismatch between deficit and infarct. N Engl J Med. 2018;378(1):11–21. https://doi.org/10.1056/NEJMoa1706442 . United States.

Zheng MX, Hua XY, Feng JT, Li T, Lu YC, Shen YD, et al. Trial of Contralateral seventh cervical nerve transfer for spastic arm paralysis. N Engl J Med. 2018;378(1):22–34. https://doi.org/10.1056/NEJMoa1615208 . United States.

Atri A, Frölich L, Ballard C, Tariot PN, Molinuevo JL, Boneva N, et al. Effect of idalopirdine as adjunct to cholinesterase inhibitors on change in cognition in patients with Alzheimer disease: three randomized clinical trials. JAMA. 2018;319(2):130–42. https://doi.org/10.1001/jama.2017.20373 .

Bassler D, Shinwell ES, Hallman M, Jarreau PH, Plavka R, Carnielli V, et al. Long-term effects of inhaled budesonide for bronchopulmonary dysplasia. N Engl J Med. 2018;378(2):148–57. https://doi.org/10.1056/NEJMoa1708831 . United States.

Raskind MA, Peskind ER, Chow B, Harris C, Davis-Karim A, Holmes HA, et al. Trial of prazosin for post-traumatic stress disorder in military veterans. N Engl J Med. 2018;378(6):507–17. https://doi.org/10.1056/NEJMoa1507598 . United States.

Albers GW, Marks MP, Kemp S, Christensen S, Tsai JP, Ortega-Gutierrez S, et al. Thrombectomy for stroke at 6 to 16 hours with selection by perfusion imaging. N Engl J Med. 2018;378(8):708–18. https://doi.org/10.1056/NEJMoa1713973 .

Bath PM, Woodhouse LJ, Appleton JP, Beridze M, Christensen H, Dineen RA, et al. Antiplatelet therapy with aspirin, clopidogrel, and dipyridamole versus clopidogrel alone or aspirin and dipyridamole in patients with acute cerebral ischaemia (TARDIS): a randomised, open-label, phase 3 superiority trial. Lancet. 2018;391(10123):850–9. https://doi.org/10.1016/S0140-6736(17)32849-0 .

Krebs EE, Gravely A, Nugent S, Jensen AC, DeRonne B, Goldsmith ES, et al. Effect of opioid vs nonopioid medications on pain-related function in patients with chronic back pain or hip or knee osteoarthritis pain: the SPACE randomized clinical trial. JAMA. 2018;319(9):872–82. https://doi.org/10.1001/jama.2018.0899 .

Campbell BCV, Mitchell PJ, Churilov L, Yassi N, Kleinig TJ, Dowling RJ, et al. Tenecteplase versus alteplase before thrombectomy for ischemic stroke. N Engl J Med. 2018;378(17):1573–82. https://doi.org/10.1056/NEJMoa1716405 . United States.

Mellor R, Bennell K, Grimaldi A, Nicolson P, Kasza J, Hodges P, et al. Education plus exercise versus corticosteroid injection use versus a wait and see approach on global outcome and pain from gluteal tendinopathy: prospective, single blinded, randomised clinical trial. BMJ. 2018;361. https://doi.org/10.1136/bmj.k1662 .

Sprigg N, Flaherty K, Appleton JP, Al-Shahi Salman R, Bereczki D, Beridze M, et al. Tranexamic acid for hyperacute primary IntraCerebral Haemorrhage (TICH-2): an international randomised, placebo-controlled, phase 3 superiority trial. Lancet. 2018;391(10135):2107–15. https://doi.org/10.1016/S0140-6736(18)31033-X .

Jolly K, Sidhu MS, Hewitt CA, Coventry PA, Daley A, Jordan R, et al. Self management of patients with mild COPD in primary care: randomised controlled trial. BMJ. 2018;361. https://doi.org/10.1136/bmj.k2241 .

Brock PR, Maibach R, Childs M, Rajput K, Roebuck D, Sullivan MJ, et al. Sodium thiosulfate for protection from cisplatin-induced hearing loss. N Engl J Med. 2018;378(25):2376–85. https://doi.org/10.1056/NEJMoa1801109 .

Khatri P, Kleindorfer DO, Devlin T, Sawyer RN Jr, Starr M, Mejilla J, et al. Effect of alteplase vs aspirin on functional outcome for patients with acute ischemic stroke and minor nondisabling neurologic deficits: the PRISMS randomized clinical trial. JAMA. 2018;320(2):156–66. https://doi.org/10.1001/jama.2018.8496 .

Wang Y, Li Z, Zhao X, Wang C, Wang X, Wang D, et al. Effect of a multifaceted quality improvement intervention on hospital personnel adherence to performance measures in patients with acute ischemic stroke in china: a randomized clinical trial. JAMA. 2018;320(3):245–54. https://doi.org/10.1001/jama.2018.8802 . United States.

Fossat G, Baudin F, Courtes L, Bobet S, Dupont A, Bretagnol A, et al. Effect of in-bed leg cycling and electrical stimulation of the quadriceps on global muscle strength in critically ill adults: a randomized clinical trial. JAMA. 2018;320(4):368–78. https://doi.org/10.1001/jama.2018.9592 .

Thomalla G, Simonsen CZ, Boutitie F, Andersen G, Berthezene Y, Cheng B, et al. MRI-guided thrombolysis for stroke with unknown time of onset. N Engl J Med. 2018;379(7):611–22. https://doi.org/10.1056/NEJMoa1804355 . United States.

Perkins GD, Ji C, Deakin CD, Quinn T, Nolan JP, Scomparin C, et al. A randomized trial of epinephrine in out-of-hospital cardiac arrest. N Engl J Med. 2018;379(8):711–21. https://doi.org/10.1056/NEJMoa1806842 . United States.

Wang HE, Schmicker RH, Daya MR, Stephens SW, Idris AH, Carlson JN, et al. Effect of a strategy of initial laryngeal tube insertion vs endotracheal intubation on 72-hour survival in adults with out-of-hospital cardiac arrest: a randomized clinical trial. JAMA. 2018;320(8):769–78. https://doi.org/10.1001/jama.2018.7044 .

Benger JR, Kirby K, Black S, Brett SJ, Clout M, Lazaroo MJ, et al. Effect of a strategy of a supraglottic airway device vs tracheal intubation during out-of-hospital cardiac arrest on functional outcome: the AIRWAYS-2 randomized clinical trial. JAMA. 2018;320(8):779–91. https://doi.org/10.1001/jama.2018.11597 .

Meltzer-Brody S, Colquhoun H, Riesenberg R, Epperson CN, Deligiannidis KM, Rubinow DR, et al. Brexanolone injection in post-partum depression: two multicentre, double-blind, randomised, placebo-controlled, phase 3 trials. Lancet. 2018;392(10152):1058–70. https://doi.org/10.1016/S0140-6736(18)31551-4 . England.

Cooper DJ, Nichol AD, Bailey M, Bernard S, Cameron PA, Pili-Floury S, et al. Effect of early sustained prophylactic hypothermia on neurologic outcomes among patients with severe traumatic brain injury: the POLAR randomized clinical trial. JAMA. 2018;320(21):2211–20. https://doi.org/10.1001/jama.2018.17075 .

Bonell C, Allen E, Warren E, McGowan J, Bevilacqua L, Jamal F, et al. Effects of the Learning Together intervention on bullying and aggression in English secondary schools (INCLUSIVE): a cluster randomised controlled trial. Lancet. 2018;392(10163):2452–64. https://doi.org/10.1016/S0140-6736(18)31782-3 .

Stunnenberg BC, Raaphorst J, Groenewoud HM, Statland JM, Griggs RC, Woertman W, et al. Effect of mexiletine on muscle stiffness in patients with nondystrophic myotonia evaluated using aggregated N-of-1 trials. JAMA. 2018;320(22):2344–53. https://doi.org/10.1001/jama.2018.18020 .

Burt RK, Balabanov R, Burman J, Sharrack B, Snowden JA, Oliveira MC, et al. Effect of nonmyeloablative hematopoietic stem cell transplantation vs continued disease-modifying therapy on disease progression in patients with relapsing-remitting multiple sclerosis: a randomized clinical trial. JAMA. 2019;321(2):165–74. https://doi.org/10.1001/jama.2018.18743 .

Dennis M, Mead G, Forbes J, Graham C, Hackett M, Hankey GJ, et al. Effects of fluoxetine on functional outcomes after acute stroke (FOCUS): a pragmatic, double-blind, randomised, controlled trial. Lancet. 2019;393(10168):265–74. https://doi.org/10.1016/S0140-6736(18)32823-X .

Anderson CS, Huang Y, Lindley RI, Chen X, Arima H, Chen G, et al. Intensive blood pressure reduction with intravenous thrombolysis therapy for acute ischaemic stroke (ENCHANTED): an international, randomised, open-label, blinded-endpoint, phase 3 trial. Lancet. 2019;393(10174):877–88. https://doi.org/10.1016/S0140-6736(19)30038-8 . England.

Basner M, Asch DA, Shea JA, Bellini LM, Carlin M, Ecker AJ, et al. Sleep and alertness in a duty-hour flexibility trial in internal medicine. N Engl J Med. 2019;380(10):915–23. https://doi.org/10.1056/NEJMoa1810641 .

Bath PM, Scutt P, Anderson CS, Appleton JP, Berge E, Cala L, et al. Prehospital transdermal glyceryl trinitrate in patients with ultra-acute presumed stroke (RIGHT-2): an ambulance-based, randomised, sham-controlled, blinded, phase 3 trial. Lancet. 2019;393(10175):1009–20. https://doi.org/10.1016/S0140-6736(19)30194-1 .

Hanley DF, Thompson RE, Rosenblum M, Yenokyan G, Lane K, McBee N, et al. Efficacy and safety of minimally invasive surgery with thrombolysis in intracerebral haemorrhage evacuation (MISTIE III): a randomised, controlled, open-label, blinded endpoint phase 3 trial. Lancet. 2019;393(10175):1021–32. https://doi.org/10.1016/S0140-6736(19)30195-3 .

Turk AS 3rd, Siddiqui A, Fifi JT, De Leacy RA, Fiorella DJ, Gu E, et al. Aspiration thrombectomy versus stent retriever thrombectomy as first-line approach for large vessel occlusion (COMPASS): a multicentre, randomised, open label, blinded outcome, non-inferiority trial. Lancet. 2019;393(10175):998–1008.  https://doi.org/10.1016/S0140-6736(19)30297-1 . England.

Ma H, Campbell BCV, Parsons MW, Churilov L, Levi CR, Hsu C, et al. Thrombolysis guided by perfusion imaging up to 9 hours after onset of stroke. N Engl J Med. 2019;380(19):1795–803. https://doi.org/10.1056/NEJMoa1813046 . United States.

Fischer K, Al-Sawaf O, Bahlo J, Fink AM, Tandon M, Dixon M, et al. Venetoclax and obinutuzumab in patients with CLL and coexisting conditions. N Engl J Med. 2019;380(23):2225–36. https://doi.org/10.1056/NEJMoa1815281 . United States.

Shehabi Y, Howe BD, Bellomo R, Arabi YM, Bailey M, Bass FE, et al. Early sedation with dexmedetomidine in critically ill patients. N Engl J Med. 2019;380(26):2506–17. https://doi.org/10.1056/NEJMoa1904710 . United States.

Johnston KC, Bruno A, Pauls Q, Hall CE, Barrett KM, Barsan W, et al. Intensive vs standard treatment of hyperglycemia and functional outcome in patients with acute ischemic stroke: the SHINE randomized clinical trial. JAMA. 2019;322(4):326–35. https://doi.org/10.1001/jama.2019.9346 .

Widmark A, Gunnlaugsson A, Beckman L, Thellenberg-Karlsson C, Hoyer M, Lagerlund M, et al. Ultra-hypofractionated versus conventionally fractionated radiotherapy for prostate cancer: 5-year outcomes of the HYPO-RT-PC randomised, non-inferiority, phase 3 trial. Lancet. 2019;394(10196):385–95. https://doi.org/10.1016/S0140-6736(19)31131-6 . England.

Pittock SJ, Berthele A, Fujihara K, Kim HJ, Levy M, Palace J, et al. Eculizumab in aquaporin-4-positive neuromyelitis optica spectrum disorder. N Engl J Med. 2019;381(7):614–25.  https://doi.org/10.1056/NEJMoa1900866 . United States.

Gunduz-Bruce H, Silber C, Kaul I, Rothschild AJ, Riesenberg R, Sankoh AJ, et al. Trial of SAGE-217 in patients with major depressive disorder. N Engl J Med. 2019;381(10):903–11.  https://doi.org/10.1056/NEJMoa1815981 . United States.

Nave AH, Rackoll T, Grittner U, Bläsing H, Gorsler A, Nabavi DG, et al. Physical Fitness Training in Patients with Subacute Stroke (PHYS-STROKE): multicentre, randomised controlled, endpoint blinded trial. BMJ. 2019;366:l5101. https://doi.org/10.1136/bmj.l5101 .

Sands BE, Peyrin-Biroulet L, Loftus EV Jr, Danese S, Colombel JF, Törüner M, et al. Vedolizumab versus adalimumab for moderate-to-severe ulcerative colitis. N Engl J Med. 2019;381(13):1215–26.  https://doi.org/10.1056/NEJMoa1905725 . United States.

Cree BAC, Bennett JL, Kim HJ, Weinshenker BG, Pittock SJ, Wingerchuk DM, et al. Inebilizumab for the treatment of neuromyelitis optica spectrum disorder (N-MOmentum): a double-blind, randomised placebo-controlled phase 2/3 trial. Lancet. 2019;394(10206):1352–63.  https://doi.org/10.1016/S0140-6736(19)31817-3 . England.

Cooper K, Breeman S, Scott NW, Scotland G, Clark J, Hawe J, et al. Laparoscopic supracervical hysterectomy versus endometrial ablation for women with heavy menstrual bleeding (HEALTH): a parallel-group, open-label, randomised controlled trial. Lancet. 2019;394(10207):1425–36. https://doi.org/10.1016/S0140-6736(19)31790-8 .

Reddihough DS, Marraffa C, Mouti A, O’Sullivan M, Lee KJ, Orsini F, et al. Effect of fluoxetine on obsessive-compulsive behaviors in children and adolescents with autism spectrum disorders: a randomized clinical trial. JAMA. 2019;322(16):1561–9. https://doi.org/10.1001/jama.2019.14685 .

John LK, Loewenstein G, Marder A, Callaham ML. Effect of revealing authors’ conflicts of interests in peer review: randomized controlled trial. BMJ. 2019;367. https://doi.org/10.1136/bmj.l5896 .

Yamamura T, Kleiter I, Fujihara K, Palace J, Greenberg B, Zakrzewska-Pniewska B, et al. Trial of satralizumab in neuromyelitis optica spectrum disorder. N Engl J Med. 2019;381(22):2114–24.  https://doi.org/10.1056/NEJMoa1901747 . United States.

Hoskin PJ, Hopkins K, Misra V, Holt T, McMenemin R, Dubois D, et al. Effect of single-fraction vs multifraction radiotherapy on ambulatory status among patients with spinal canal compression from metastatic cancer: the SCORAD randomized clinical trial. JAMA. 2019;322(21):2084–94. https://doi.org/10.1001/jama.2019.17913 .

Lascarrou JB, Merdji H, Le Gouge A, Colin G, Grillet G, Girardie P, et al. Targeted temperature management for cardiac arrest with nonshockable rhythm. N Engl J Med. 2019;381(24):2327–37.  https://doi.org/10.1056/NEJMoa1906661 . United States.

Ständer S, Yosipovitch G, Legat FJ, Lacour JP, Paul C, Narbutt J, et al. Trial of nemolizumab in moderate-to-severe prurigo nodularis. N Engl J Med. 2020;382(8):706–16.  https://doi.org/10.1056/NEJMoa1908316 . United States.

Hill MD, Goyal M, Menon BK, Nogueira RG, McTaggart RA, Demchuk AM, et al. Efficacy and safety of nerinetide for the treatment of acute ischaemic stroke (ESCAPE-NA1): a multicentre, double-blind, randomised controlled trial. Lancet. 2020;395(10227):878–87.  https://doi.org/10.1016/S0140-6736(20)30258-0 . England.

Olsen HT, Nedergaard HK, Strøm T, Oxlund J, Wian KA, Ytrebø LM, et al. Nonsedation or light sedation in critically ill, mechanically ventilated patients. N Engl J Med. 2020;382(12):1103–11.  https://doi.org/10.1056/NEJMoa1906759 . United States.

Campbell BCV, Mitchell PJ, Churilov L, Yassi N, Kleinig TJ, Dowling RJ, et al. Effect of intravenous tenecteplase dose on cerebral reperfusion before thrombectomy in patients with large vessel occlusion ischemic stroke: the EXTEND-IA TNK Part 2 randomized clinical trial. JAMA. 2020;323(13):1257–65. https://doi.org/10.1001/jama.2020.1511 .

Deyle GD, Allen CS, Allison SC, Gill NW, Hando BR, Petersen EJ, et al. Physical therapy versus glucocorticoid injection for osteoarthritis of the knee. N Engl J Med. 2020;382(15):1420–29.  https://doi.org/10.1056/NEJMoa1905877 . United States.

Koblan KS, Kent J, Hopkins SC, Krystal JH, Cheng H, Goldman R, et al. A non-D2-receptor-binding drug for the treatment of schizophrenia. N Engl J Med. 2020;382(16):1497–506.  https://doi.org/10.1056/NEJMoa1911772 . United States.

Cao B, Wang Y, Wen D, Liu W, Wang J, Fan G, et al. A trial of lopinavir-ritonavir in adults hospitalized with severe COVID-19. N Engl J Med. 2020;382(19):1787–99. https://doi.org/10.1056/NEJMoa2001282 .

Wang Y, Zhang D, Du G, Du R, Zhao J, Jin Y, et al. Remdesivir in adults with severe COVID-19: a randomised, double-blind, placebo-controlled, multicentre trial. Lancet. 2020;395(10236):1569–78. https://doi.org/10.1016/S0140-6736(20)31022-9 .

Yang P, Zhang Y, Zhang L, Treurniet KM, Chen W, Peng Y, et al. Endovascular thrombectomy with or without intravenous alteplase in acute stroke. N Engl J Med. 2020;382(21):1981–93.  https://doi.org/10.1056/NEJMoa2001123 . United States.

Martins SO, Mont’Alverne F, Rebello LC, Abud DG, Silva GS, Lima FO, et al. Thrombectomy for stroke in the public health care system of Brazil. N Engl J Med. 2020;382(24):2316–26.  https://doi.org/10.1056/NEJMoa2000120 . United States.

Kabashima K, Matsumura T, Komazaki H, Kawashima M. Trial of nemolizumab and topical agents for atopic dermatitis with pruritus. N Engl J Med. 2020;383(2):141–50.  https://doi.org/10.1056/NEJMoa1917006 . United States.

Johnston SC, Amarenco P, Denison H, Evans SR, Himmelmann A, James S, et al. Ticagrelor and aspirin or aspirin alone in acute ischemic stroke or TIA. N Engl J Med. 2020;383(3):207–17.  https://doi.org/10.1056/NEJMoa1916870 . United States.

Lebwohl MG, Papp KA, Stein Gold L, Gooderham MJ, Kircik LH, Draelos ZD, et al. Trial of roflumilast cream for chronic plaque psoriasis. N Engl J Med. 2020;383(3):229–39.  https://doi.org/10.1056/NEJMoa2000073 . United States.

Simpson EL, Sinclair R, Forman S, Wollenberg A, Aschoff R, Cork M, et al. Efficacy and safety of abrocitinib in adults and adolescents with moderate-to-severe atopic dermatitis (JADE MONO-1): a multicentre, double-blind, randomised, placebo-controlled, phase 3 trial. Lancet. 2020;396(10246):255–66.  https://doi.org/10.1016/S0140-6736(20)30732-7 . England.

Rowell SE, Meier EN, McKnight B, Kannas D, May S, Sheehan K, et al. Effect of out-of-hospital tranexamic acid vs placebo on 6-month functional neurologic outcomes in patients with moderate or severe traumatic brain injury. JAMA. 2020;324(10):961–74. https://doi.org/10.1001/jama.2020.8958 .

van der Vlist AC, van Oosterom RF, van Veldhoven PLJ, Bierma-Zeinstra SMA, Waarsing JH, Verhaar JAN, et al. Effectiveness of a high volume injection as treatment for chronic Achilles tendinopathy: randomised controlled trial. BMJ. 2020;370. https://doi.org/10.1136/bmj.m3027 .

Spinner CD, Gottlieb RL, Criner GJ, Arribas López JR, Cattelan AM, Soriano Viladomiu A, et al. Effect of remdesivir vs standard care on clinical status at 11 days in patients with moderate COVID-19: a randomized clinical trial. JAMA. 2020;324(11):1048–57. https://doi.org/10.1001/jama.2020.16349 .

Horne AW, Vincent K, Hewitt CA, Middleton LJ, Koscielniak M, Szubert W, et al. Gabapentin for chronic pelvic pain in women (GaPP2): a multicentre, randomised, double-blind, placebo-controlled trial. Lancet. 2020;396(10255):909–17. https://doi.org/10.1016/S0140-6736(20)31693-7 .

Furtado RHM, Berwanger O, Fonseca HA, Corrêa TD, Ferraz LR, Lapa MG, et al. Azithromycin in addition to standard of care versus standard of care alone in the treatment of patients admitted to the hospital with severe COVID-19 in Brazil (COALITION II): a randomised clinical trial. Lancet. 2020;396(10256):959–67. https://doi.org/10.1016/S0140-6736(20)31862-6 .

Tomazini BM, Maia IS, Cavalcanti AB, Berwanger O, Rosa RG, Veiga VC, et al. Effect of dexamethasone on days alive and ventilator-free in patients with moderate or severe acute respiratory distress syndrome and COVID-19: the CoDEX randomized clinical trial. JAMA. 2020;324(13):1307–16. https://doi.org/10.1001/jama.2020.17021 .

Beigel JH, Tomashek KM, Dodd LE, Mehta AK, Zingman BS, Kalil AC, et al. Remdesivir for the treatment of COVID-19 - final report. N Engl J Med. 2020;383(19):1813–26. https://doi.org/10.1056/NEJMoa2007764 .

Goldman JD, Lye DCB, Hui DS, Marks KM, Bruno R, Montejano R, et al. Remdesivir for 5 or 10 days in patients with severe COVID-19. N Engl J Med. 2020;383(19):1827–37. https://doi.org/10.1056/NEJMoa2015301 .

Cavalcanti AB, Zampieri FG, Rosa RG, Azevedo LCP, Veiga VC, Avezum A, et al. Hydroxychloroquine with or without azithromycin in mild-to-moderate COVID-19. N Engl J Med. 2020;383(21):2041–52. https://doi.org/10.1056/NEJMoa2019014 .

Self WH, Semler MW, Leither LM, Casey JD, Angus DC, Brower RG, et al. Effect of hydroxychloroquine on clinical status at 14 days in hospitalized patients with COVID-19: a randomized clinical trial. JAMA. 2020;324(21):2165–76. https://doi.org/10.1001/jama.2020.22240 .

Martínez-Fernández R, Máñez-Miró JU, Rodríguez-Rojas R, Del Álamo M, Shah BB, Hernández-Fernández F, et al. Randomized trial of focused ultrasound subthalamotomy for Parkinson’s disease. N Engl J Med. 2020;383(26):2501–13.  https://doi.org/10.1056/NEJMoa2016311 . United States.

Hutchinson PJ, Edlmann E, Bulters D, Zolnourian A, Holton P, Suttner N, et al. Trial of dexamethasone for chronic subdural hematoma. N Engl J Med. 2020;383(27):2616–27.  https://doi.org/10.1056/NEJMoa2020473 . United States.

Klein AL, Imazio M, Cremer P, Brucato A, Abbate A, Fang F, et al. Phase 3 trial of interleukin-1 trap rilonacept in recurrent pericarditis. N Engl J Med. 2021;384(1):31–41.  https://doi.org/10.1056/NEJMoa2027892 . United States.

Post R, Germans MR, Tjerkstra MA, Vergouwen MDI, Jellema K, Koot RW, et al. Ultra-early tranexamic acid after subarachnoid haemorrhage (ULTRA): a randomised controlled trial. Lancet. 2021;397(10269):112–8.  https://doi.org/10.1016/S0140-6736(20)32518-6 . England.

Suzuki K, Matsumaru Y, Takeuchi M, Morimoto M, Kanazawa R, Takayama Y, et al. Effect of mechanical thrombectomy without vs with intravenous thrombolysis on functional outcome among patients with acute ischemic stroke: the SKIP randomized clinical trial. JAMA. 2021;325(3):244–53. https://doi.org/10.1001/jama.2020.23522 .

Zi W, Qiu Z, Li F, Sang H, Wu D, Luo W, et al. Effect of endovascular treatment alone vs intravenous alteplase plus endovascular treatment on functional independence in patients with acute ischemic stroke: the DEVT randomized clinical trial. JAMA. 2021;325(3):234–43. https://doi.org/10.1001/jama.2020.23523 .

Veiga VC, Prats JAGG, Farias DLC, Rosa RG, Dourado LK, Zampieri FG, et al. Effect of tocilizumab on clinical outcomes at 15 days in patients with severe or critical coronavirus disease 2019: randomised controlled trial. BMJ. 2021;372. https://doi.org/10.1136/bmj.n84 .

Gordon KB, Foley P, Krueger JG, Pinter A, Reich K, Vender R, et al. Bimekizumab efficacy and safety in moderate to severe plaque psoriasis (BE READY): a multicentre, double-blind, placebo-controlled, randomised withdrawal phase 3 trial. Lancet. 2021;397(10273):475–86.  https://doi.org/10.1016/S0140-6736(21)00126-4 . England.


Acknowledgements

Not applicable.

Funding

This work forms part of Chris Selman’s PhD, which is supported by the Research Training Program Scholarship, administered by the Australian Commonwealth Government and The University of Melbourne, Australia. Chris Selman’s PhD was also supported by a Centre of Research Excellence grant from the National Health and Medical Research Council of Australia (ID 1171422) to the Australian Trials Methodology (AusTriM) Research Network. Research at the Murdoch Children’s Research Institute is supported by the Victorian Government’s Operational Infrastructure Support Program. This work was supported by the Australian National Health and Medical Research Council (NHMRC) Centre for Research Excellence grants to the Victorian Centre for Biostatistics (ID 1035261) and the Australian Trials Methodology Research Network (ID 1171422), including through seed funding awarded to Robert Mahar. Katherine Lee is funded by an NHMRC Career Development Fellowship (ID 1127984). Brett Manley is funded by an NHMRC Investigator Grant (Leadership 1). The funding bodies played no role in the study conception, design, data collection, data analysis, data interpretation, or writing of the report.

Author information

Authors and Affiliations

Clinical Epidemiology and Biostatistics Unit, Murdoch Children’s Research Institute, Parkville, VIC, 3052, Australia

Chris J. Selman, Katherine J. Lee & Robert K. Mahar

Department of Paediatrics, University of Melbourne, Parkville, VIC, 3052, Australia

Chris J. Selman & Katherine J. Lee

Centre for Epidemiology and Biostatistics, Melbourne School of Population and Global Health, University of Melbourne, Parkville, VIC, 3052, Australia

Robert K. Mahar

Department of Obstetrics and Gynaecology, University of Melbourne, Parkville, VIC, 3052, Australia

Kristin N. Ferguson, Clare L. Whitehead & Brett J. Manley

Department of Maternal Fetal Medicine, The Royal Women’s Hospital, Parkville, VIC, 3052, Australia

Clare L. Whitehead

Newborn Research, The Royal Women’s Hospital, Parkville, VIC, 3052, Australia

Brett J. Manley

Clinical Sciences, Murdoch Children’s Research Institute, Parkville, VIC, 3052, Australia


Contributions

CJS, RKM, KJL, CLW, and BJM conceived the study and CJS wrote the first draft of the manuscript. All authors contributed to the design of the study, revision of the manuscript, and take responsibility for its content.

Corresponding author

Correspondence to Chris J. Selman.

Ethics declarations

Ethics approval and consent to participate

As data and information were extracted only from published studies, ethics approval was not required.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Additional file 1.

Deviations from the protocol. This presents a summary of the deviations from the protocol, with reasons. We also provide an explanation of any simplifications and assumptions that were made for eligibility criteria and data extraction.

Additional file 2.

Data extraction questionnaire. This is a copy of the data extraction questionnaire that was used for this review, in PDF format.

Additional file 3.

Interpretation of the proportional odds ratio in proportional odds models. This presents a summary of the ways that the proportional odds ratio was interpreted across the studies.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ . The Creative Commons Public Domain Dedication waiver ( http://creativecommons.org/publicdomain/zero/1.0/ ) applies to the data made available in this article, unless otherwise stated in a credit line to the data.


About this article

Cite this article

Selman, C.J., Lee, K.J., Ferguson, K.N. et al. Statistical analyses of ordinal outcomes in randomised controlled trials: a scoping review. Trials 25, 241 (2024). https://doi.org/10.1186/s13063-024-08072-2


Received: 02 July 2023

Accepted: 22 March 2024

Published: 06 April 2024

DOI: https://doi.org/10.1186/s13063-024-08072-2


Keywords

  • Ordinal outcomes
  • Proportional odds model
  • Randomised controlled trials
  • Scoping review



Sociological Research Methods 4155 - Dr. Powell


Peer Reviewed Articles


How to recognize peer-reviewed (refereed) journals

In many cases professors will require that students utilize articles from “peer-reviewed” journals. Sometimes the phrases “refereed journals” or “scholarly journals” are used to describe the same type of journals. But what are peer-reviewed (or refereed or scholarly) journal articles, and why do faculty require their use?

Three categories of information resources:

  • Newspapers and magazines containing news - Articles are written by reporters who may or may not be experts in the field of the article. Consequently, articles may contain incorrect information.
  • Journals containing articles written by academics and/or professionals — Although the articles are written by “experts,” any particular “expert” may have some ideas that are really “out there!”
  • Peer-reviewed (refereed or scholarly) journals - Articles are written by experts and are reviewed by several other experts in the field before the article is published in the journal, in order to ensure the article’s quality. (The article is more likely to be scientifically valid, reach reasonable conclusions, etc.) In most cases the reviewers do not know who the author of the article is, so that the article succeeds or fails on its own merit, not the reputation of the expert.

Helpful hint!

Not all information in a peer-reviewed journal is actually refereed, or reviewed. For example, editorials, letters to the editor, book reviews, and other types of information don’t count as articles, and may not be accepted by your professor.

How do you determine whether an article qualifies as being a peer-reviewed journal article?

First, you need to be able to identify which journals are peer-reviewed. There are generally four methods for doing this:

  • Limit a database search to peer-reviewed journals only. Some databases allow you to limit searches for articles to peer-reviewed journals only. For example, Academic Search Complete has this feature on the initial search screen - click on the pertinent box to limit the search. In some databases you may have to go to an “advanced” or “expert” search screen to do this. Remember, many databases do not allow you to limit your search in this way.
  • Examine the journal itself:
    a. Locate the journal in the Library or online, then identify the most current entire year’s issues.
    b. Locate the masthead of the publication. This oftentimes consists of a box towards either the front or the end of the periodical, and contains publication information such as the editors of the journal, the publisher, the place of publication, the subscription cost, and similar information.
    c. Does the journal say that it is peer-reviewed? If so, you’re done! If not, move on to step d.
    d. Check in and around the masthead to locate the method for submitting articles to the publication. If you find information similar to “to submit articles, send three copies…”, the journal is probably peer-reviewed. In this case, you are inferring that the publication is going to send the multiple copies of the article to the journal’s reviewers. This may not always be the case, so relying upon this criterion alone may prove inaccurate.
    e. If you do not see this type of statement in the first issue of the journal that you look at, examine the remaining issues to see if this information is included. Sometimes publications will include this information in only a single issue a year.
  • Check for scholarly characteristics. Is it scholarly, using technical terminology? Does the article format approximate the following: abstract, literature review, methodology, results, conclusion, and references? Are the articles written by scholarly researchers in the field that the periodical pertains to? Is advertising non-existent, or kept to a minimum? Are there references listed in footnotes or bibliographies? If you answered yes to all these questions, the journal may very well be peer-reviewed. This determination would be strengthened by having met the previous criterion of a multiple-copies submission requirement. If you answered no to these questions, the journal is probably not peer-reviewed.
  • Find the official web site on the internet, and check to see if it states that the journal is peer-reviewed. Be careful to use the official site (often located at the journal publisher’s web site), and, even then, the information could potentially be inaccurate.

If you have used the previous four methods in trying to determine if an article is from a peer-reviewed journal and are still unsure, speak to your instructor.


Treatments for ADHD in Children and Adolescents: A Systematic Review

Bradley S. Peterson , Joey Trampush , Margaret Maglione , Maria Bolshakova , Mary Rozelle , Jeremy Miles , Sheila Pakdaman , Morah Brown , Sachi Yagyu , Aneesa Motala , Susanne Hempel; Treatments for ADHD in Children and Adolescents: A Systematic Review. Pediatrics April 2024; 153 (4): e2024065787. 10.1542/peds.2024-065787


Abstract

Context: Effective treatment of attention-deficit/hyperactivity disorder (ADHD) is essential to improving youth outcomes.

Objective: This systematic review provides an overview of the available treatment options.

Data sources: We identified controlled treatment evaluations published from 1980 to June 2023 in 12 databases; treatments were not restricted by intervention content.

Study selection: Studies in children and adolescents with clinically diagnosed ADHD, reporting patient health and psychosocial outcomes, were eligible. Publications were screened by trained reviewers, supported by machine learning.

Data extraction: Data were abstracted and critically appraised by 1 reviewer and checked by a methodologist. Data were pooled using random-effects models. Strength of evidence and applicability assessments followed Evidence-based Practice Center standards.

Results: In total, 312 studies reported in 540 publications were included. We grouped evidence for medication, psychosocial interventions, parent support, nutrition and supplements, neurofeedback, neurostimulation, physical exercise, complementary medicine, school interventions, and provider approaches. Several treatments improved ADHD symptoms. Medications had the strongest evidence base for improving outcomes, including disruptive behaviors and broadband measures, but were associated with adverse events.

Limitations: We found limited evidence from studies comparing alternative treatments directly, and indirect analyses identified few systematic differences across stimulants and nonstimulants. The identified combinations of medication with youth-directed psychosocial interventions did not systematically produce better results than monotherapy, though few combinations have been evaluated.

Conclusions: A growing number of treatments are available that improve ADHD symptoms and other outcomes, in particular for school-aged youth. Medication therapies remain important treatment options but are associated with adverse events.

Attention-deficit/hyperactivity disorder (ADHD) is a common mental health problem in youth, with a prevalence of ∼5.3%. 1 , 2   Youth with ADHD are prone to future risk-taking problems, including substance abuse, motor vehicle accidents, unprotected sex, criminal behavior, and suicide attempts. 3   Although stimulant medications are currently the mainstay of treatment of school-age youth with ADHD, other treatments have been developed for ADHD, including cognitive training, neurofeedback, neuromodulation, and dietary and nutritional interventions. 4   – 7  

This systematic review summarizes evidence for treatments of ADHD in children and adolescents. The evidence review extends back to 1980, when contemporary diagnostic criteria for ADHD and long-acting stimulants were first introduced. Furthermore, we did not restrict to a set of prespecified known interventions for ADHD, and instead explored the range of available treatment options for children and adolescents, including novel treatments. Medication evaluations had to adhere to a randomized controlled trial (RCT) design; all other treatments could be evaluated in RCTs or in the nonrandomized controlled studies that are more common in the psychological literature, as long as the study reported on a concurrent comparator. Outcomes were selected with input from experts and stakeholders and were not restricted to ADHD symptoms. To our knowledge, no previous review of ADHD treatments has been as comprehensive in the range of interventions, clinical and psychosocial outcomes, participant ages, and publication years.

Methods

The review aims were developed in consultation with the Agency for Healthcare Research and Quality (AHRQ), the Patient-Centered Outcomes Research Institute, the topic nominator American Academy of Pediatrics (AAP), key informants, a technical expert panel (TEP), and public input. The TEP reviewed the protocol and advised on key outcomes. Subgroup analyses and key outcomes were prespecified. The review is registered in PROSPERO (#CRD42022312656) and the protocol is available on the AHRQ Web site as part of a larger evidence report on ADHD. The systematic review followed the methods of the AHRQ Evidence-based Practice Center Program. 8

Eligibility criteria

  • Population: Children or adolescents with a clinical diagnosis of ADHD, age <18 years
  • Interventions: Any ADHD treatment, alone or in combination, with ≥4 weeks’ treatment
  • Comparators: No treatment, waitlist, placebo, passive comparators, or active comparators
  • Outcomes: Patient health and psychosocial outcomes
  • Setting: Any
  • Study designs: RCTs for medication; RCTs, controlled clinical trials without random assignment, or cohort studies comparing 1 or more treatment groups for nondrug treatments. Studies either had to be large or demonstrate that they could detect effects as a standalone study (operationalized as ≥100 participants or a power calculation; see the sketch after this list)
  • Other limiters: English language (to ensure transparency for a US guideline), published from 1980
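The size criterion above ("large or able to detect effects as a standalone study") can be made concrete with a standard power calculation. The sketch below, which is illustrative only and not a tool the review describes, asks what standardized mean difference a two-arm trial of a given size could detect with 80% power at a two-sided alpha of 0.05; the per-arm sample sizes are hypothetical.

```python
# Illustrative only: a generic two-sample power calculation showing what
# effect size a study of a given size could detect. The review's criterion
# (>=100 participants or a reported power calculation) is applied to the
# included studies; this code is for context, not part of the review itself.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
for n_per_arm in (30, 50, 100):
    # Solve for the minimum detectable standardized mean difference (SMD)
    # at 80% power and a two-sided alpha of 0.05.
    mde = analysis.solve_power(effect_size=None, nobs1=n_per_arm,
                               alpha=0.05, power=0.80, ratio=1.0)
    print(f"n = {n_per_arm} per arm -> minimum detectable SMD ~ {mde:.2f}")
```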

Search strategy

We searched the databases PubMed, Embase, PsycINFO, ERIC, and ClinicalTrials.gov. We identified reviews for reference-mining through PubMed, Cochrane Database of Systematic Reviews, Campbell Collaboration, What Works in Education, PROSPERO, ECRI Guidelines Trust, G-I-N, and ClinicalKey. The search underwent peer review; the full strategy is in the Online Appendix. All citations were reviewed by trained literature reviewers supported by machine learning to ensure no studies were inadvertently missed. Two independent reviewers assessed full-text studies for eligibility. Publications reporting on the same participants were consolidated into 1 record so that no study entered the analyses more than once. The TEP reviewed studies to ensure all were captured.
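The paper does not specify how its machine-learning screening support worked, so the following is only a minimal sketch of one common setup: a TF-IDF representation of titles and abstracts feeding a logistic-regression classifier that ranks unscreened citations by predicted relevance, with human reviewers making every final decision. All records and labels below are invented.

```python
# Illustrative sketch only: the review does not describe its screening model.
# This shows one common setup -- TF-IDF features + logistic regression -- for
# ranking citations by predicted relevance. Toy data; reviewers would still
# verify every citation by hand.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical labelled examples from an initial human screening round.
titles = [
    "Randomized trial of methylphenidate in children with ADHD",
    "Survey of parental attitudes toward screen time",
    "Atomoxetine versus placebo for adolescent ADHD symptoms",
    "Qualitative study of teacher burnout in middle schools",
]
labels = [1, 0, 1, 0]  # 1 = potentially eligible, 0 = exclude

vectorizer = TfidfVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(titles)
model = LogisticRegression(max_iter=1000).fit(X, labels)

# Score a new, unscreened citation so reviewers can prioritise it.
new = ["Guanfacine extended-release for ADHD: a controlled study"]
prob = model.predict_proba(vectorizer.transform(new))[0, 1]
print(f"Predicted probability of relevance: {prob:.2f}")
```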

Data extraction

The data abstraction form included extensive guidance to aid reproducibility and standardization in recording study details, outcomes, 9 – 12 study quality, and applicability. One reviewer abstracted data, and a methodologist checked its accuracy and completeness. Data are publicly available in the Systematic Review Data Repository.

Risk of bias

We assessed 6 domains 13 : selection, performance, attrition, detection, reporting, and study-specific biases ( Supplemental Figs 6 and 7 ).

Data synthesis and analysis

We organized analyses by treatment and comparison type. We grouped treatments according to intervention content and target (eg, youth or parents). The intervention taxonomy differentiated medication, psychosocial interventions, parent support, nutrition and supplements, neurofeedback, neurostimulation, physical exercise, complementary medicine, school interventions, and provider approaches. We differentiated effects versus passive control groups (eg, placebo) and comparative effects (ie, comparing to an alternative treatment). The following outcomes were selected as key outcomes: (1) ADHD symptoms (eg, ADHD Rating Scale 14 , 15   ), (2) disruptive behavior (eg, conduct problems), (3) broadband measures (eg, Clinical Global Impression 16   ), (4) functional impairment (eg, Weiss Functional Impairment Rating Scale 17 , 18   ), (5) academic performance (eg, grade point average), (6) appetite suppression, and (7) number of participants reporting adverse events.

Studies reported on a large range of outcome measures as documented in the evidence table in the Online Appendix. To facilitate comparisons across studies, we converted outcomes to scale-independent standardized mean differences (SMDs) for continuous symptom outcome variables and relative risks (RRs) for categorical reports, presenting summary estimates and 95% confidence intervals (CIs) for all analyses. We used random-effects models performed in R with Metafor_v4.2-0 for statistical pooling, correcting for small numbers of studies when necessary, to synthesize available evidence. 19   We conducted sensitivity analyses for all analyses that included studies without random assignment. We also compared treatment effectiveness indirectly across studies in meta-regressions that added potential, prespecified effect modifiers to the meta-analytic model. In particular, we assessed whether ADHD presentation or cooccurring disorders modified intervention effects. We tested for heterogeneity using graphical displays, documented I 2 statistics (values >50% are highlighted in the text), and explored sources of heterogeneity in subgroup and sensitivity analyses. 20  
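To make the pooling step concrete, here is a minimal sketch, with invented study data, of the standard computation: arm-level summaries are converted to bias-corrected SMDs (Hedges' g), then combined with a DerSimonian-Laird random-effects model, yielding a pooled estimate, 95% CI, and the I² heterogeneity statistic. The authors used R's metafor; this Python version reimplements the textbook formulas purely for illustration.

```python
# Minimal sketch (invented data) of the pooling pipeline described above:
# arm-level summaries -> bias-corrected SMDs (Hedges' g) -> DerSimonian-Laird
# random-effects pooling with an I^2 heterogeneity statistic.
import numpy as np

def hedges_g(m1, sd1, n1, m2, sd2, n2):
    """Return the bias-corrected standardized mean difference and its variance."""
    sp = np.sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2))
    d = (m1 - m2) / sp
    j = 1 - 3 / (4 * (n1 + n2) - 9)  # Hedges' small-sample correction
    g = j * d
    var = (n1 + n2) / (n1 * n2) + g**2 / (2 * (n1 + n2))
    return g, var

# Hypothetical treatment-vs-placebo studies: (mean, SD, n) per arm,
# where lower scores mean fewer ADHD symptoms.
studies = [(10.1, 6.0, 60, 14.3, 6.5, 58),
           (11.5, 7.1, 45, 15.9, 7.0, 47),
           (9.8, 5.5, 80, 13.1, 6.2, 82)]
effects = np.array([hedges_g(*s) for s in studies])
g, v = effects[:, 0], effects[:, 1]

# DerSimonian-Laird estimate of between-study variance (tau^2).
w = 1 / v
fixed = np.sum(w * g) / np.sum(w)
q = np.sum(w * (g - fixed)**2)
df = len(g) - 1
c = np.sum(w) - np.sum(w**2) / np.sum(w)
tau2 = max(0.0, (q - df) / c)
i2 = max(0.0, (q - df) / q) * 100  # heterogeneity as a percentage

# Random-effects pooled estimate and 95% confidence interval.
w_re = 1 / (v + tau2)
pooled = np.sum(w_re * g) / np.sum(w_re)
se = np.sqrt(1 / np.sum(w_re))
print(f"Pooled SMD {pooled:.2f} "
      f"(95% CI {pooled - 1.96 * se:.2f} to {pooled + 1.96 * se:.2f}), "
      f"I^2 = {i2:.0f}%")
```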

We assessed publication bias with Begg and Egger tests 21 , 22 and used the trim-and-fill method for alternative estimates where necessary. 23 Applicability of findings to real-world clinical practices in typical US settings was assessed qualitatively using AHRQ’s Methods Guide. An overall strength of evidence (SoE) assessment communicating our confidence in each finding was determined initially by 1 researcher with experience in the use of specified standardized criteria 24 ( Supplemental Information ), then discussed with the study team. We downgraded SoE for study limitations, imprecision, inconsistency, and reporting bias, and we differentiated high, moderate, low, and insufficient SoE.
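Egger's test, mentioned above, reduces to a simple regression: each study's standardized effect (g divided by its standard error) is regressed on its precision (1 divided by the standard error), and an intercept that differs from zero suggests funnel-plot asymmetry. The sketch below uses invented effect sizes and is a generic implementation, not the authors' code.

```python
# Generic Egger regression test for funnel-plot asymmetry (toy data only):
# regress standardized effects (g / SE) on precision (1 / SE) and test
# whether the intercept differs from zero.
import numpy as np
from scipy import stats

g = np.array([-0.9, -0.6, -0.7, -0.3, -1.2])   # hypothetical study SMDs
se = np.array([0.35, 0.20, 0.25, 0.12, 0.40])  # their standard errors

res = stats.linregress(1 / se, g / se)

# Two-sided p-value for the intercept from its t statistic (n - 2 df);
# intercept_stderr requires scipy >= 1.6.
t_int = res.intercept / res.intercept_stderr
p_int = 2 * stats.t.sf(abs(t_int), df=len(g) - 2)
print(f"Egger intercept = {res.intercept:.2f}, p = {p_int:.3f}")
```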

Results

We screened 23 139 citations and retrieved 7534 publications in full text to assess against the eligibility criteria. In total, 312 treatment studies, reported in 540 publications (see list of included studies in the Online Appendix), met eligibility criteria ( Fig 1 ).

Fig 1. Literature flow diagram.

Although studies from 1980 were eligible, the earliest study meeting all eligibility criteria was from 1995. All included studies are documented in the evidence table in the Supplemental Information . The following highlights key findings. Results for intervention groups and individual studies, subgroup and sensitivity analyses, characteristics of participants and interventions contributing to the analyses, and considerations that determined the SoE for results are documented in the Online Appendix.

Medications

As a class, traditional stimulants (methylphenidate, amphetamines) significantly improved ADHD symptom severity (SMD, −0.88; CI, −1.13 to −0.63; studies = 12; n = 1620) and broadband measures (RR, 0.38; CI, 0.30–0.48; studies = 12; n = 1582) (both high SoE), but not functional impairment (SMD, 1.00; CI, −0.25 to 2.26; studies = 4; n = 540) ( Fig 2 , Supplemental Fig 8 , Supplemental Table 1 ). Methylphenidate formulations significantly improved ADHD symptoms (SMD, −0.68; CI, −0.91 to −0.46; studies = 7; n = 863) ( Fig 2 , Supplemental Table 1 ) and broadband measures (SMD, 0.66; CI, 0.04–1.28; studies = 2; n = 302). Only 1 study assessed academic performance, reporting large improvements compared with a control group (SMD, −1.37; CI, −1.72 to −1.03; n = 156) ( Supplemental Fig 9 ). 25   Methylphenidate statistically significantly suppressed appetite (RR, 2.80; CI, 1.47–5.32; studies = 8; n = 1110) ( Fig 3 ), and more patients reported adverse events (RR, 1.32; CI, 1.25–1.40; studies = 6; n = 945). Amphetamine formulations significantly improved ADHD symptoms (SMD, −1.16; CI, −1.64 to −0.67; studies = 5; n = 757) ( Fig 2 , Supplemental Table 1 ) but not broadband measures (SMD, 0.68; CI, −0.72 to 2.08; studies = 3; n = 561) ( Supplemental Fig 9 ). Amphetamines significantly suppressed appetite (RR, 7.08; CI, 2.72–18.42; studies = 8; n = 1229) ( Fig 3 ), and more patients reported adverse events (RR, 1.41; CI, 1.25–1.58; studies = 8; n = 1151). Modafinil (US Food and Drug Administration [FDA]-approved to treat narcolepsy and sleep apnea but not ADHD) in each individual study significantly improved ADHD symptoms, but aggregated estimates were nonsignificant (SMD, −0.76; CI, −1.75 to 0.23; studies = 4; n = 667) ( Fig 2 , Supplemental Table 1 ) because of high heterogeneity (I 2 = 91%). It did not improve broadband measures (RR, 0.49; CI, −0.12 to 2.07; studies = 3; n = 539) ( Supplemental Fig 9 ), and it significantly suppressed appetite (RR, 4.44; CI, 2.27–8.69; studies = 5; n = 780) ( Fig 3 ).

Fig 2. Medication effects on ADHD symptom severity. S-AMPH-LDX, lisdexamfetamine; S-AMPH-MAS, mixed amphetamine salts; S-MPH-DEX, dexmethylphenidate; S-MPH-ER, extended-release methylphenidate; S-MPH-IR, immediate-release methylphenidate; S-MPH-OROS, osmotic-release oral system methylphenidate; S-MPH-TP, dermal patch methylphenidate; NS-NRI-ATX, atomoxetine; NS-NRI-VLX, viloxazine; NS-ALA-CLON, clonidine; NS-ALA-GXR, guanfacine extended-release.

Fig 3. Medication effects on appetite suppression. Abbreviations as in the legend for Fig 2.

As a class, nonstimulants significantly improved ADHD symptoms (SMD, −0.52; CI, −0.59 to −0.46; studies = 37; n = 6065; high SoE) ( Fig 2 , Supplemental Table 1 ), broadband measures (RR, 0.66; CI, 0.58–0.76; studies = 12; n = 2312) ( Supplemental Fig 8 ), and disruptive behaviors (SMD, 0.66; CI, 0.22–1.10; studies = 4; n = 523), but not functional impairment (SMD, 0.20; CI, −0.05 to 0.44; studies = 6; n = 1163). Norepinephrine reuptake inhibitors (NRI) improved ADHD symptoms (SMD, −0.55; CI, −0.62 to −0.47; studies=28; n = 4493) ( Fig 2 , Supplemental Table 1 ) but suppressed appetite (RR, 3.23; CI, 2.40–4.34; studies = 27; n = 4176) ( Fig 3 ), and more patients reported adverse events (RR, 1.31; CI, 1.18–1.46; studies = 15; n = 2600). Alpha-agonists (guanfacine and clonidine) improved ADHD symptoms (SMD, −0.52; CI, −0.67 to −0.37; studies = 11; n = 1885) ( Fig 2 , Supplemental Table 1 ), without (guanfacine) significantly suppressing appetite (RR, 1.49; CI, 0.94–2.37; studies = 4; n = 919) ( Fig 3 ), but more patients reported adverse events (RR, 1.21; CI, 1.11–1.31; studies = 14, n = 2544).

One study compared amphetamine versus methylphenidate, head-to-head, finding more improvement in ADHD symptoms (SMD, −0.46; CI, −0.73 to −0.19; n = 222) and broadband measures (SMD, 0.29; CI, 0.02–0.56; n = 211), but not functional impairment (SMD, 0.16; CI, −0.11 to 0.43; n = 211), 26   with lisdexamfetamine (an amphetamine) than osmotic-release oral system methylphenidate. No difference was found in appetite suppression (RR, 1.01; CI, 0.72–1.42; studies = 2, n = 414) ( Fig 3 ) or adverse events (RR, 1.11; CI, 0.93–1.33; study = 1, n = 222). Indirect comparisons yielded significantly larger effects for amphetamine than methylphenidate in improving ADHD symptoms ( P = .02) but not broadband measures ( P = .97) or functional impairment ( P = .68). Stimulants did not differ in appetite suppression ( P = .08) or adverse events ( P = .35).

One study provided information on NRI versus alpha-agonists by directly comparing an alpha-agonist (guanfacine) with an NRI (atomoxetine), 27   finding significantly greater improvement in ADHD symptoms with guanfacine (SMD, −0.47; CI, −0.73 to −0.2; n = 226) but not a broadband measure (RR, 0.84; CI, 0.68–1.04; n = 226). It reported less appetite suppression for guanfacine (RR, 0.48; CI, 0.27–0.83; n = 226) but no difference in adverse events (RR, 1.14; CI, 0.97–1.34; n = 226). Indirect comparisons did not indicate significantly different effect sizes for ADHD symptoms ( P = .90), disruptive behaviors ( P = .31), broadband measures ( P = .41), functional impairment ( P = .46), or adverse events ( P = .06), but suggested NRIs more often suppressed appetite compared with guanfacine ( P = .01).

Studies directly comparing nonstimulants versus stimulants (all were the NRI atomoxetine and stimulants methylphenidate in all but 1) tended to favor stimulants but did not yield significance for ADHD symptom severity (SMD, 0.23; CI, −0.03 to 0.49; studies = 7; n = 1611) ( Fig 2 ). Atomoxetine slightly but statistically significantly produced greater improvements in disruptive behaviors (SMD, −0.08; CI, −0.14 to −0.03; studies = 4; n = 608) ( Supplemental Fig 10 ) but not broadband measures (SMD, −0.16; CI, −0.36 to 0.04; studies = 4; n = 1080) ( Supplemental Fig 9 ). They did not differ significantly in appetite suppression (RR, 0.82; CI, 0.53–1.26; studies = 8; n = 1463) ( Fig 3 ) or number with adverse events (RR, 1.11; CI, 0.90–1.37; studies = 4; n = 756). Indirect comparisons indicated significant differences favoring stimulants over nonstimulants in improving ADHD symptom severity ( P < .0001), broadband measures ( P = .0002), and functional impairment ( P = .04), but not appetite suppression ( P = .31) or number with adverse events ( P = .12).

Several studies assessed whether adding nonstimulant to stimulant medication (all were alpha-agonists added to different stimulants) improved outcomes compared with stimulant medication alone, yielding a small but significant additional improvement in ADHD symptoms (SMD, −0.36; CI, −0.52 to −0.19; studies = 5; n = 724) ( Fig 4 ).

Combination treatment. CLON, clonidine; GXR, guanfacine.

Youth-directed psychosocial treatments

We identified 32 studies evaluating psychosocial, psychological, or behavioral interventions targeting ADHD youth, either alone or combined with components for parents and teachers. Interventions were highly diverse, and most were complex with multiple components (see supplemental results in the Online Appendix). They significantly improved ADHD symptoms (SMD, −0.35; CI, −0.51 to −0.19; studies = 14; n = 1686; moderate SoE) ( Fig 4 ), even when restricting to RCTs only (SMD, −0.36; CI, −0.53 to −0.19; removing high-risk-of-bias studies left 7 with similar effects SMD, −0.38; CI, −0.69 to −0.07), with minimal heterogeneity (I 2 = 52%); but not disruptive behaviors (SMD, −0.18; CI, −0.48 to 0.12; studies = 8; n = 947) or academic performance (SMD, −0.07; CI, −0.49 to 0.62; studies = 3; n = 459) ( Supplemental Fig 11 ).

Parent support

We identified 19 studies primarily targeting parents of youth aged 3 to 18 years, though only 3 included teenagers. Interventions were highly diverse (see Online Appendix), but significantly improved ADHD symptoms (SMD, −0.31; CI, −0.57 to −0.05; studies = 11; n = 1078; low SoE) ( Fig 4 ), even when restricting to RCTs only (SMD, −0.35; CI, −0.61 to −0.09; removing high-risk-of-bias studies yielded the same point estimate, but CIs were wider, and the effect was nonsignificant SMD, −0.31; CI, −0.76 to 0.14). There was some evidence of publication bias (Begg P = .16; Egger P = .02), but the trim and fill method to correct it found a similar effect (SMD, −0.43; CI, −0.63 to −0.22). Interventions improved broadband scores (SMD, 0.41; CI, 0.23–0.58; studies = 7; n = 613) and disruptive behaviors (SMD, −0.52; CI, −0.85 to −0.18; studies = 4; n = 357) but not functional impairment (SMD, 0.35; CI, −0.69 to 1.39; studies = 3; n = 252) (all low SoE) ( Supplemental Fig 12 ).

School interventions

We identified 10 studies, mostly for elementary or middle schools (see Online Appendix). Interventions did not significantly improve ADHD symptoms (SMD, −0.50; CI, −1.05 to 0.06; studies = 5; n = 822; moderate SoE) ( Fig 4 ), but there was evidence of heterogeneity (I 2 = 87%). Although most studies reported improved academic performance, this was not statistically significant across studies (SMD, −0.19; CI, −0.48 to 0.09; studies = 5; n = 854) ( Supplemental Fig 13 ).

Cognitive training

We identified 22 studies, for youth aged 6 to 17 years without intellectual disability (see Online Appendix). Cognitive training did improve ADHD symptoms (SMD, −0.37; CI, −0.65 to −0.06; studies = 12; n = 655; low SoE) ( Fig 4 ), with some heterogeneity (I 2 = 65%), but not functional impairment (SMD, 0.41; CI, −0.24 to 1.06; studies = 5; n = 387) ( Supplemental Fig 14 ) or disruptive behaviors (SMD, −0.29; CI, −0.84 to 0.27; studies [all RCTs] = 5; n = 337). It improved broadband measures (SMD, 0.50; CI, 0.12–0.88; studies = 6; n = 344; RCTs only: SMD, 0.43; CI, −0.06 to 0.93) (both low SoE). It did not increase adverse events (RR, 3.30; CI, 0.03–431.32; studies = 2; n = 402).

Neurofeedback

We identified 21 studies: Two-thirds involved θ/β EEG marker modulation, and one-third modulation of slow cortical potentials (see Online Appendix). Neurofeedback significantly improved ADHD symptoms (SMD, −0.44; CI, −0.65 to −0.22; studies = 12; n = 945; low SoE) ( Fig 4 ), with little heterogeneity (I 2 = 33%); restricting to the 10 RCTs yielded the same point estimate, also statistically significant (SMD, −0.44; CI, −0.71 to −0.16). Neurofeedback did not systematically improve disruptive behaviors (SMD, −0.33; CI, −1.33 to 0.66; studies = 4; n = 372), or functional impairment (SMD, 0.21; CI, −0.14 to 0.55; studies = 3; n = 332) ( Supplemental Fig 15 ).

Nutrition and supplements

We identified 39 studies with highly diverse nutrition interventions (see Online Appendix), including omega-3 (studies = 13), vitamins (studies = 3), or diets (studies = 3), and several evaluated supplements as augmentation to stimulants. Most were placebo-controlled. Across studies, interventions improved ADHD symptoms (SMD, −0.39; CI, −0.67 to −0.12; studies = 23; n = 2357) ( Fig 4 ), even when restricting to RCTs (SMD, −0.32; CI, −0.55 to −0.08), with high heterogeneity (I 2 = 89%) but no publication bias. The group of nutritional approaches also improved disruptive behaviors (SMD, −0.28; CI, −0.37 to −0.18; studies [all RCTs] = 5; n = 360) ( Supplemental Fig 16 , low SoE), without increasing the number reporting adverse events (RR, 0.77; CI, 0.47–1.27; studies = 8; n = 735). However, we did not identify any specific supplements that consistently improved outcomes, including omega-3 (eg, ADHD symptoms: SMD, −0.11; CI, −0.45, 0.24; studies = 7; n = 719; broadband measures: SMD, 0.04; CI, −0.24 to 0.32; studies = 7; n = 755, low SoE).

Complementary, alternative, or integrative medicine

We identified 6 studies assessing acupuncture, homeopathy, and hippotherapy. They did not individually or as a group significantly improve ADHD symptoms (SMD, −0.15; CI, −1.84 to 1.53; studies = 3; n = 313) ( Fig 4 ) or improve other outcomes across studies (eg, broadband measures: SMD, 0.03; CI, −3.66 to 3.73; studies = 2; n = 218) ( Supplemental Fig 17 ).

Combined medication and behavioral treatments

Eleven identified studies evaluated a combination of medication- and youth-directed psychosocial treatments. Most allowed children to have common cooccurring conditions, but intellectual disability and severe neurodevelopmental conditions were exclusionary. Medication treatments were stimulant or atomoxetine. Psychosocial treatments included multimodal psychosocial treatment, cognitive behavioral therapy, solution-focused therapy, behavioral therapy, and a humanistic intervention. Studies mostly compared combinations of medication and psychosocial treatment to medication alone, rather than no treatment or placebo. Combined therapy did not statistically significantly improve ADHD symptoms across studies (SMD, −0.36; CI, −0.73 to 0.01; studies = 7; n = 841; low SoE; only 2 individual studies reported statistically significant effects) ( Fig 5 ) or broadband measures (SMD, 0.42; CI, −0.72 to 1.56; studies = 3; n = 171), but there was indication of heterogeneity (I 2 = 71% and 62%, respectively).

Nonmedication intervention effects on ADHD symptom severity.

Moderation of treatment response

We found little evidence that either ADHD presentation (inattentive, hyperactive, combined-type) or cooccurring psychiatric disorders modified treatment effects on any ADHD outcome, but few studies addressed this question systematically (see Online Appendix).

Long-term outcomes

Only a very small number of studies (33 of 312) reported on outcomes at or beyond 12 months of follow-up (see Online Appendix). Many did not report on key outcomes of this review. Studies evaluating combined psychosocial and medication interventions, such as the multimodal treatment of ADHD study, 28   did not find sustained effects beyond 12 months. Analyses for medication, psychosocial, neurofeedback, parent support, school intervention, and provider-focused interventions did not find sustained effects for more than a single study reporting on the same outcome. No complementary medicine, neurostimulation, physical exercise, or cognitive training studies reported long-term outcomes.

Discussion

We identified a large body of evidence contributing to knowledge of ADHD treatments. A substantial number of treatments have been evaluated in strong study designs that provide evidence statements regarding the effects of the treatments on children and adolescents with ADHD. The body of evidence shows that numerous intervention classes significantly improve ADHD symptom severity. This includes large but variable effects for amphetamines, moderate-sized effects for methylphenidate, NRIs, and alpha-agonists, and small effects for youth-directed psychosocial treatment, parent support, neurofeedback, and cognitive training. The SoE for effects on ADHD symptoms was high across FDA-approved medications (methylphenidate, amphetamines, NRIs, alpha-agonists); moderate for psychosocial interventions; and low for parent support, neurofeedback, and nutritional interventions. Augmentation of stimulant medication with non-stimulants produced small but significant additional improvement in ADHD symptoms over stimulant medication alone (low SoE).

We also summarized evidence for outcomes beyond specific ADHD symptoms. On broadband measures (ie, global clinical measures not restricted to assessing specific symptoms and documenting overall psychosocial adjustment), methylphenidate (low SoE), nonstimulant medications (moderate SoE), and cognitive training (low SoE) yielded significant, medium-sized effects, and parent support yielded small effects (moderate SoE). For disruptive behaviors, nonstimulant medications (high SoE) and parent support (low SoE) produced significant improvement with medium effects. No treatment modality significantly improved functional impairment or academic performance, though the latter was rarely assessed as a treatment outcome.

The enormous variability in treatment components and delivery of youth-directed psychotherapies, parent support, neurofeedback, and nutrition and supplement therapies, and in ADHD outcomes they have targeted, complicates the synthesis and meta-analysis of their effects compared with the much more uniform interventions, delivery, and outcome assessments for medication therapies. Moreover, most psychosocial and parent support studies compared an active treatment against wait list controls or treatment as usual, which did not control well for the effects of parent or therapist attention or other nonspecific effects of therapy, and they have rarely been able to blind adequately either participants or study assessors to treatment assignment. 29 , 30   These design limitations weaken the SoE for these interventions.

The large number of studies, combined with their medium-to-large effect sizes, indicate collectively and with high SoE that FDA-approved medications improve ADHD symptom severity, broadband measures, functional impairment, and disruptive behaviors. Indirect comparison showed larger effect sizes for stimulants than for nonstimulants in improving ADHD symptoms and functional impairment. Results for amphetamines and methylphenidate varied, and we did not identify head-to-head comparisons of NRIs versus alpha-agonists that met eligibility criteria. Despite compelling evidence for their effectiveness, stimulants and nonstimulants produced more adverse events than did other interventions, with a high SoE. Stimulants and nonstimulant NRIs produced significantly more appetite suppression than placebo, with similar effect sizes for methylphenidate, amphetamine, and NRI, and much larger effects for modafinil. Nonstimulant alpha-agonists (specifically, guanfacine) did not suppress appetite. Rates of other adverse events were similar between NRIs and alpha-agonists.

Perhaps contrary to common belief, we found no evidence that youth-directed psychosocial and medication interventions are systematically better in improving ADHD outcomes when delivered as combination treatments 31   – 33   ; both were effective as monotherapies, but the combination did not signal additional statistically significant benefits (low SoE). However, it should be noted that few psychosocial and medication intervention combinations have been studied to date. We also found that treatment outcomes did not vary with ADHD presentation or the presence of cooccurring psychiatric disorders, but indirect analyses are limited in detecting these effect modifiers, and more research is needed. Furthermore, although children of all ages were eligible for inclusion in the review, we note that very few studies assessed treatments (especially medications) in children <6 years of age; evidence is primarily available for school-age children and adolescents. Finally, despite the research volume, we still know little about long-term effects of ADHD treatments. The limited available body of evidence suggests that most interventions, including combined medication and psychological treatment, yield few significant long-term improvements for most ADHD outcomes.

Clinical implications

This review provides compelling evidence that numerous, diverse treatments are available and helpful for the treatment of ADHD. These include stimulant and nonstimulant medications, youth-targeted psychosocial treatments, parent support, neurofeedback, and cognitive training, though nonmedication interventions appear to have considerably weaker effects than medications on ADHD symptoms. Nonetheless, the body of evidence provides youth with ADHD, their parents, and health care providers with options.

The paucity of head-to-head studies comparing treatments precludes research-based recommendations regarding which is likely to be most helpful and which should be tried first, and decisions need to be based on clinical considerations and patient preferences. Stimulant and nonstimulant NRI medications, separately and in head-to-head comparisons, have shown similar effectiveness and rates of side effects, including appetite suppression, across identified studies. The moderate effect sizes for nonstimulant alpha-agonists, their low rate of appetite suppression, and their evidence for effectiveness in augmenting the effects of stimulant medications in reducing ADHD symptom severity provide additional treatment options. Furthermore, we found low SoE that neurofeedback and cognitive training improve ADHD symptoms. We also found that nutritional supplements and dietary interventions improve ADHD symptoms and disruptive behaviors. The SoE for nutritional interventions, however, is still low, and despite the research volume, we did not identify systematic benefits for specific supplements.

Clinical guidelines currently advise starting treatment of youth >6 years of age with FDA-approved medications, 33   which the findings of this review support. Furthermore, FDA-approved medications have been shown to significantly improve broadband measures, and nonstimulant medications have been shown to improve disruptive behaviors, suggesting their clinical benefits extend beyond improving only ADHD symptoms. Clinical guidelines for preschool children advise parent training and/or classroom behavioral interventions as the first line of treatment, if available. These recommendations remain supported by the present review, given the paucity of studies in preschool children in general, and because many existing studies, in particular medication and youth-directed psychosocial interventions, do not include young children. 31   – 33  

Strengths and limitations

This review incorporated publications dating from 1980, assessing diverse intervention targets (youth, parent, school) and ADHD outcomes across numerous functional domains. Limitations in its scope derive from the eligibility criteria. Requiring ≥4 weeks of treatment ensured that interventions were intended as patient treatment rather than proof-of-concept experiments, but it also excluded some early studies contributing to the field and other brief but intense psychosocial interventions. Requiring studies to be sufficiently large to detect effects excluded smaller studies that contribute to the evidence base. We explicitly did not restrict to RCTs (ie, a traditional medical study design), but instead identified all studies with concurrent comparators so as not to bias against psychosocial research; nonetheless, the large majority of identified studies were RCTs. Our review aimed to provide an overview of the diverse treatment options, and we abstracted findings regardless of the suitability of the study results for meta-analysis. Although many ADHD treatments are very different in nature and the clinical decision for 1 treatment approach over another is likely not made primarily on effect size estimates, future research could use the identified study pool and systematically analyze the comparative effectiveness of functionally interchangeable treatments in a network meta-analysis, building on previous work on medication options. 34

Future research needs

Future studies of psychosocial, parent, school-based, neurofeedback, and nutritional treatments should employ more uniform interventions and study designs that provide a higher SoE for effectiveness, including active attention comparators and effective blinding of outcome assessments. Higher-quality studies are needed for exercise and neuromodulation interventions. More trials are needed that compare alternative interventions head-to-head or compare combination treatments with monotherapy. Clinical trials should assess patient-centered outcomes other than ADHD symptoms, including functional impairment and academic performance. Much more research is needed to assess long-term treatment effectiveness, compliance, and safety, including in preschool youth. Studies should assess patient characteristics as modifiers of treatment effects, to identify which treatments are most effective for which patients. To aid discovery and confirmation of these modifiers, studies should make publicly available all individual-level demographic, clinical, treatment, and outcome data.

Acknowledgments

We thank the following individuals providing expertise and helpful comments that contributed to the systematic review: Esther Lee, Becky Nguyen, Cynthia Ramirez, Erin Tokutomi, Ben Coughli, Jennifer Rivera, Coleman Schaefer, Cindy Pham, Jerusalem Belay, Anne Onyekwuluje, Mario Gastelum, Karin Celosse, Samantha Fleck, Janice Kang, and Sreya Molakalaplli for help with data acquisition. We thank Kymika Okechukwu, Lauren Pilcher, Joanna King, and Robyn Wheatley from the American Academy of Pediatrics; Jennie Dalton and Paula Eguino Medina from the Patient-Centered Outcomes Research Institute; Christine Chang and Kim Wittenberg from AHRQ; and Mary Butler from the Minnesota Evidence-based Practice Center. We thank Glendy Burnett, Eugenia Chan, MD, MPH; Matthew J. Gormley, PhD; Laurence Greenhill, MD; Joseph Hagan, Jr, MD; Cecil Reynolds, PhD; Le’Ann Solmonson, PhD, LPC-S, CSC; and Peter Ziemkowski, MD, FAAFP; who served as key informants. We thank Angelika Claussen, PhD; Alysa Doyle, PhD; Tiffany Farchione, MD; Matthew J. Gormley, PhD; Laurence Greenhill, MD; Jeffrey M. Halperin, PhD; Marisa Perez-Martin, MS, LMFT; Russell Schachar, MD; Le’Ann Solmonson, PhD, LPC-S, CSC; and James Swanson, PhD; who served as a technical expert panel. Finally, we thank Joel Nigg, PhD; and Peter S. Jensen, MD; for their peer review of the data.

Drs Peterson and Hempel conceptualized and designed the study, collected data, conducted the analyses, drafted the initial manuscript, and critically reviewed and revised the manuscript; Dr Trampush conducted the critical appraisal; Drs Bolshakova and Pakdaman, and Ms Rozelle, Ms Maglione, and Ms Brown screened citations and abstracted the data; Dr Miles conducted the analyses; Ms Yagyu designed and executed the search strategy; Ms Motala served as data manager; and all authors provided critical input for the manuscript, approved the final manuscript as submitted, and agree to be accountable for all aspects of the work.

This study is registered at PROSPERO, #CRD42022312656. Data are available in SRDRPlus.

COMPANION PAPER: A companion to this article can be found online at www.pediatrics.org/cgi/doi/10.1542/peds.2024-065854 .

FUNDING: The work is based on research conducted by the Southern California Evidence-based Practice Center under contract to the Agency for Healthcare Research and Quality (AHRQ), Rockville, MD (Contract No. 75Q80120D00009). The Patient-Centered Outcomes Research Institute funded the research (Publication No. 2023-SR-03). The findings and conclusions in this manuscript are those of the authors, who are responsible for its contents; the findings and conclusions do not necessarily represent the views of the AHRQ or the Patient-Centered Outcomes Research Institute, its board of governors or methodology committee. Therefore, no statement in this report should be construed as an official position of the Patient-Centered Outcomes Research Institute, the AHRQ, or the US Department of Health and Human Services.

CONFLICT OF INTEREST DISCLOSURES: The authors have indicated they have no conflicts of interest relevant to this article to disclose.

ABBREVIATIONS: ADHD, attention-deficit/hyperactivity disorder; AHRQ, Agency for Healthcare Research and Quality; FDA, US Food and Drug Administration; CI, confidence interval; NRI, norepinephrine reuptake inhibitor; RCT, randomized controlled trial; RR, relative risk; SMD, standardized mean difference; SoE, strength of evidence; TEP, technical expert panel


  • Open access
  • Published: 26 March 2024

Predicting and improving complex beer flavor through machine learning

  • Michiel Schreurs   ORCID: orcid.org/0000-0002-9449-5619 1 , 2 , 3   na1 ,
  • Supinya Piampongsant 1 , 2 , 3   na1 ,
  • Miguel Roncoroni   ORCID: orcid.org/0000-0001-7461-1427 1 , 2 , 3   na1 ,
  • Lloyd Cool   ORCID: orcid.org/0000-0001-9936-3124 1 , 2 , 3 , 4 ,
  • Beatriz Herrera-Malaver   ORCID: orcid.org/0000-0002-5096-9974 1 , 2 , 3 ,
  • Christophe Vanderaa   ORCID: orcid.org/0000-0001-7443-5427 4 ,
  • Florian A. Theßeling 1 , 2 , 3 ,
  • Łukasz Kreft   ORCID: orcid.org/0000-0001-7620-4657 5 ,
  • Alexander Botzki   ORCID: orcid.org/0000-0001-6691-4233 5 ,
  • Philippe Malcorps 6 ,
  • Luk Daenen 6 ,
  • Tom Wenseleers   ORCID: orcid.org/0000-0002-1434-861X 4 &
  • Kevin J. Verstrepen   ORCID: orcid.org/0000-0002-3077-6219 1 , 2 , 3  

Nature Communications volume  15 , Article number:  2368 ( 2024 ) Cite this article

53k Accesses

861 Altmetric

Metrics details

  • Chemical engineering
  • Gas chromatography
  • Machine learning
  • Metabolomics
  • Taste receptors

The perception and appreciation of food flavor depends on many interacting chemical compounds and external factors, and therefore proves challenging to understand and predict. Here, we combine extensive chemical and sensory analyses of 250 different beers to train machine learning models that allow predicting flavor and consumer appreciation. For each beer, we measure over 200 chemical properties, perform quantitative descriptive sensory analysis with a trained tasting panel and map data from over 180,000 consumer reviews to train 10 different machine learning models. The best-performing algorithm, Gradient Boosting, yields models that significantly outperform predictions based on conventional statistics and accurately predict complex food features and consumer appreciation from chemical profiles. Model dissection allows identifying specific and unexpected compounds as drivers of beer flavor and appreciation. Adding these compounds results in variants of commercial alcoholic and non-alcoholic beers with improved consumer appreciation. Together, our study reveals how big data and machine learning uncover complex links between food chemistry, flavor and consumer perception, and lays the foundation to develop novel, tailored foods with superior flavors.


Introduction

Predicting and understanding food perception and appreciation is one of the major challenges in food science. Accurate modeling of food flavor and appreciation could yield important opportunities for both producers and consumers, including quality control, product fingerprinting, counterfeit detection, spoilage detection, and the development of new products and product combinations (food pairing) 1 , 2 , 3 , 4 , 5 , 6 . Accurate models for flavor and consumer appreciation would contribute greatly to our scientific understanding of how humans perceive and appreciate flavor. Moreover, accurate predictive models would also facilitate and standardize existing food assessment methods and could supplement or replace assessments by trained and consumer tasting panels, which are variable, expensive and time-consuming 7 , 8 , 9 . Lastly, apart from providing objective, quantitative, accurate and contextual information that can help producers, models can also guide consumers in understanding their personal preferences 10 .

Despite the myriad of applications, predicting food flavor and appreciation from its chemical properties remains a largely elusive goal in sensory science, especially for complex foods and beverages 11 , 12 . A key obstacle is the immense number of flavor-active chemicals underlying food flavor. Flavor compounds can vary widely in chemical structure and concentration, making them technically challenging and labor-intensive to quantify, even in the face of innovations in metabolomics, such as non-targeted metabolic fingerprinting 13 , 14 . Moreover, sensory analysis is perhaps even more complicated. Flavor perception is highly complex, resulting from hundreds of different molecules interacting at the physicochemical and sensorial level. Sensory perception is often non-linear, characterized by complex and concentration-dependent synergistic and antagonistic effects 15 , 16 , 17 , 18 , 19 , 20 , 21 that are further convoluted by the genetics, environment, culture and psychology of consumers 22 , 23 , 24 . Perceived flavor is therefore difficult to measure, with problems of sensitivity, accuracy, and reproducibility that can only be resolved by gathering sufficiently large datasets 25 . Trained tasting panels are considered the prime source of quality sensory data, but they require meticulous training and are low-throughput and costly. Public databases containing consumer reviews of food products could provide a valuable alternative, especially for studying appreciation scores, which do not require formal training 25 . Public databases offer the advantage of amassing large amounts of data, increasing the statistical power to identify potential drivers of appreciation. However, public datasets suffer from biases, including a bias in the volunteers who contribute to the database, as well as confounding factors such as price, cult status and psychological conformity towards previous ratings of the product.

Classical multivariate statistics and machine learning methods have been used to predict flavor of specific compounds by, for example, linking structural properties of a compound to its potential biological activities or linking concentrations of specific compounds to sensory profiles 1 , 26 . Importantly, most previous studies focused on predicting organoleptic properties of single compounds (often based on their chemical structure) 27 , 28 , 29 , 30 , 31 , 32 , 33 , thus ignoring the fact that these compounds are present in a complex matrix in food or beverages and excluding complex interactions between compounds. Moreover, the classical statistics commonly used in sensory science 34 , 35 , 36 , 37 , 38 , 39 require a large sample size and sufficient variance amongst predictors to create accurate models. They are not fit for studying an extensive set of hundreds of interacting flavor compounds, since they are sensitive to outliers, have a high tendency to overfit and are less suited for non-linear and discontinuous relationships 40 .

In this study, we combine extensive chemical analyses and sensory data of a set of different commercial beers with machine learning approaches to develop models that predict taste, smell, mouthfeel and appreciation from compound concentrations. Beer is particularly suited to model the relationship between chemistry, flavor and appreciation. First, beer is a complex product, consisting of thousands of flavor compounds that partake in complex sensory interactions 41 , 42 , 43 . This chemical diversity arises from the raw materials (malt, yeast, hops, water and spices) and biochemical conversions during the brewing process (kilning, mashing, boiling, fermentation, maturation and aging) 44 , 45 . Second, the advent of the internet saw beer consumers embrace online review platforms, such as RateBeer (ZX Ventures, Anheuser-Busch InBev SA/NV) and BeerAdvocate (Next Glass, inc.). In this way, the beer community provides massive data sets of beer flavor and appreciation scores, creating extraordinarily large sensory databases to complement the analyses of our professional sensory panel. Specifically, we characterize over 200 chemical properties of 250 commercial beers, spread across 22 beer styles, and link these to the descriptive sensory profiling data of a 16-person in-house trained tasting panel and data acquired from over 180,000 public consumer reviews. These unique and extensive datasets enable us to train a suite of machine learning models to predict flavor and appreciation from a beer’s chemical profile. Dissection of the best-performing models allows us to pinpoint specific compounds as potential drivers of beer flavor and appreciation. Follow-up experiments confirm the importance of these compounds and ultimately allow us to significantly improve the flavor and appreciation of selected commercial beers. Together, our study represents a significant step towards understanding complex flavors and reinforces the value of machine learning to develop and refine complex foods. In this way, it represents a stepping stone for further computer-aided food engineering applications 46 .

To generate a comprehensive dataset on beer flavor, we selected 250 commercial Belgian beers across 22 different beer styles (Supplementary Fig.  S1 ). Beers with ≤ 4.2% alcohol by volume (ABV) were classified as non-alcoholic and low-alcoholic. Blonds and Tripels constitute a significant portion of the dataset (12.4% and 11.2%, respectively) reflecting their presence on the Belgian beer market and the heterogeneity of beers within these styles. By contrast, lager beers are less diverse and dominated by a handful of brands. Rare styles such as Brut or Faro make up only a small fraction of the dataset (2% and 1%, respectively) because fewer of these beers are produced and because they are dominated by distinct characteristics in terms of flavor and chemical composition.

Extensive analysis identifies relationships between chemical compounds in beer

For each beer, we measured 226 different chemical properties, including common brewing parameters such as alcohol content, iso-alpha acids, pH, sugar concentration 47 , and over 200 flavor compounds (Methods, Supplementary Table  S1 ). A large portion (37.2%) are terpenoids arising from hopping, responsible for herbal and fruity flavors 16 , 48 . A second major category comprises yeast metabolites, such as esters and alcohols, that result in fruity and solvent notes 48 , 49 , 50 . Other measured compounds are primarily derived from malt or from other microbes such as non- Saccharomyces yeasts and bacteria (‘wild flora’). Compounds that arise from spices or staling are labeled under ‘Others’. Five attributes (caloric value, total acids, total esters, hop aroma, and sulfur compounds) are calculated from multiple individually measured compounds.

As a first step in identifying relationships between chemical properties, we determined correlations between the concentrations of the compounds (Fig.  1 , upper panel, Supplementary Data  1 and 2 , and Supplementary Fig.  S2 ; for the sake of clarity, only a subset of the measured compounds is shown in Fig.  1 ). Compounds of the same origin typically show a positive correlation, while absence of correlation hints at parameters varying independently. For example, the hop aroma compounds citronellol and alpha-terpineol show moderate correlations with each other (Spearman’s rho=0.39 and 0.57), but not with the bittering hop component iso-alpha acids (Spearman’s rho=0.16 and −0.07). This illustrates how brewers can independently modify hop aroma and bitterness by selecting hop varieties and dosage time. If hops are added early in the boiling phase, chemical conversions increase bitterness while aromas evaporate; conversely, late addition of hops preserves aroma but limits bitterness 51 . Similarly, hop-derived iso-alpha acids show a strong anti-correlation with lactic acid and acetic acid, likely reflecting growth inhibition of lactic acid and acetic acid bacteria, or the consequent use of fewer hops in sour beer styles, such as West Flanders ales and Fruit beers, that rely on these bacteria for their distinct flavors 52 . Finally, yeast-derived esters (ethyl acetate, ethyl decanoate, ethyl hexanoate, ethyl octanoate) and alcohols (ethanol, isoamyl alcohol, isobutanol, and glycerol) correlate with Spearman coefficients above 0.5, suggesting that these secondary metabolites are correlated with the yeast genetic background and/or fermentation parameters and may be difficult to influence individually, although the choice of yeast strain may offer some control 53 .

figure 1

Spearman rank correlations are shown. Descriptors are grouped according to their origin (malt (blue), hops (green), yeast (red), wild flora (yellow), Others (black)), and sensory aspect (aroma, taste, palate, and overall appreciation). Please note that for the chemical compounds, for the sake of clarity, only a subset of the total number of measured compounds is shown, with an emphasis on the key compounds for each source. For more details, see the main text and Methods section. Chemical data can be found in Supplementary Data  1 , correlations between all chemical compounds are depicted in Supplementary Fig.  S2 and correlation values can be found in Supplementary Data  2 . See Supplementary Data  4 for sensory panel assessments and Supplementary Data  5 for correlation values between all sensory descriptors.

Interestingly, different beer styles show distinct patterns for some flavor compounds (Supplementary Fig.  S3 ). These observations agree with expectations for key beer styles, and serve as a control for our measurements. For instance, Stouts generally show high values for color (darker), while hoppy beers contain elevated levels of iso-alpha acids, compounds associated with bitter hop taste. Acetic and lactic acid are not prevalent in most beers, with notable exceptions such as Kriek, Lambic, Faro, West Flanders ales and Flanders Old Brown, which use acid-producing bacteria ( Lactobacillus and Pediococcus ) or unconventional yeast ( Brettanomyces ) 54 , 55 . Glycerol, ethanol and esters show similar distributions across all beer styles, reflecting their common origin as products of yeast metabolism during fermentation 45 , 53 . Finally, low/no-alcohol beers contain low concentrations of glycerol and esters. This is in line with the production process for most of the low/no-alcohol beers in our dataset, which are produced through limiting fermentation or by stripping away alcohol via evaporation or dialysis, with both methods having the unintended side-effect of reducing the amount of flavor compounds in the final beer 56 , 57 .

Besides expected associations, our data also reveals less trivial associations between beer styles and specific parameters. For example, geraniol and citronellol, two monoterpenoids responsible for citrus, floral and rose flavors and characteristic of Citra hops, are found in relatively high amounts in Christmas, Saison, and Brett/co-fermented beers, where they may originate from terpenoid-rich spices such as coriander seeds instead of hops 58 .

Tasting panel assessments reveal sensorial relationships in beer

To assess the sensory profile of each beer, a trained tasting panel evaluated each of the 250 beers for 50 sensory attributes, including different hop, malt and yeast flavors, off-flavors and spices. Panelists used a tasting sheet (Supplementary Data  3 ) to score the different attributes. Panel consistency was evaluated by repeating 12 samples across different sessions and performing ANOVA. In 95% of cases no significant difference was found across sessions ( p  > 0.05), indicating good panel consistency (Supplementary Table  S2 ).
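
As a rough sketch of this repeatability check (not the authors' published script; the column names are hypothetical), a per-sample one-way ANOVA across sessions can be computed with SciPy:

    # Panel-consistency sketch: one-way ANOVA across tasting sessions for each
    # repeated beer. The data layout ('beer_id', 'session', 'score') is assumed
    # for illustration and may differ from the authors' actual analysis.
    import pandas as pd
    from scipy.stats import f_oneway

    def session_consistency(scores: pd.DataFrame) -> pd.Series:
        """Return one ANOVA p value per repeated beer; p > 0.05 suggests consistency."""
        pvals = {}
        for beer, grp in scores.groupby("beer_id"):
            per_session = [g["score"].to_numpy() for _, g in grp.groupby("session")]
            if len(per_session) > 1:
                pvals[beer] = f_oneway(*per_session).pvalue
        return pd.Series(pvals, name="anova_p")

    # Toy example: one beer scored by three panelists in two sessions
    toy = pd.DataFrame({
        "beer_id": ["A"] * 6,
        "session": [1, 1, 1, 2, 2, 2],
        "score":   [3.0, 3.5, 3.2, 3.1, 3.4, 3.3],
    })
    print(session_consistency(toy))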

Aroma and taste perception reported by the trained panel are often linked (Fig.  1 , bottom left panel and Supplementary Data  4 and 5 ), with high correlations between hops aroma and taste (Spearman’s rho=0.83). Bitter taste was found to correlate with hop aroma and taste in general (Spearman’s rho=0.80 and 0.69), and particularly with “grassy” noble hops (Spearman’s rho=0.75). Barnyard flavor, most often associated with sour beers, is identified together with stale hops (Spearman’s rho=0.97) that are used in these beers. Lactic and acetic acid, which often co-occur, are correlated (Spearman’s rho=0.66). Interestingly, sweetness and bitterness are anti-correlated (Spearman’s rho = −0.48), confirming the hypothesis that they mask each other 59 , 60 . Beer body is highly correlated with alcohol (Spearman’s rho = 0.79), and overall appreciation is found to correlate with multiple aspects that describe beer mouthfeel (alcohol, carbonation; Spearman’s rho= 0.32, 0.39), as well as with hop and ester aroma intensity (Spearman’s rho=0.39 and 0.35).

Similar to the chemical analyses, sensorial analyses confirmed typical features of specific beer styles (Supplementary Fig.  S4 ). For example, sour beers (Faro, Flanders Old Brown, Fruit beer, Kriek, Lambic, West Flanders ale) were rated acidic, with flavors of both acetic and lactic acid. Hoppy beers were found to be bitter and showed hop-associated aromas like citrus and tropical fruit. Malt taste is most detected among scotch, stout/porters, and strong ales, while low/no-alcohol beers, which often have a reputation for being ‘worty’ (reminiscent of unfermented, sweet malt extract) appear in the middle. Unsurprisingly, hop aromas are most strongly detected among hoppy beers. Like its chemical counterpart (Supplementary Fig.  S3 ), acidity shows a right-skewed distribution, with the most acidic beers being Krieks, Lambics, and West Flanders ales.

Tasting panel assessments of specific flavors correlate with chemical composition

We find that the concentrations of several chemical compounds strongly correlate with specific aromas or tastes, as evaluated by the tasting panel (Fig.  2 , Supplementary Fig.  S5 , Supplementary Data  6 ). In some cases, these correlations confirm expectations and serve as a useful control for data quality. For example, iso-alpha acids, the bittering compounds in hops, strongly correlate with bitterness (Spearman’s rho=0.68), while ethanol and glycerol correlate with tasters’ perceptions of alcohol and body, the mouthfeel sensation of fullness (Spearman’s rho=0.82/0.62 and 0.72/0.57, respectively). Darker color from roasted malts is a good indicator of malt perception (Spearman’s rho=0.54).

figure 2

Heatmap colors indicate Spearman’s Rho. Axes are organized according to sensory categories (aroma, taste, mouthfeel, overall), chemical categories and chemical sources in beer (malt (blue), hops (green), yeast (red), wild flora (yellow), Others (black)). See Supplementary Data  6 for all correlation values.

Interestingly, for some relationships between chemical compounds and perceived flavor, correlations are weaker than expected. For example, the rose-smelling phenethyl acetate only weakly correlates with floral aroma. This hints at more complex relationships and interactions between compounds and suggests a need for a more complex model than simple correlations. Lastly, we uncovered unexpected correlations. For instance, the esters ethyl decanoate and ethyl octanoate appear to correlate slightly with hop perception and bitterness, possibly due to their fruity flavor. Iron is anti-correlated with hop aromas and bitterness, most likely because it is also anti-correlated with iso-alpha acids. This could be a sign of metal chelation of hop acids 61 , given that our analyses measure unbound hop acids and total iron content, or could result from the higher iron content in dark and Fruit beers, which typically have less hoppy and bitter flavors 62 .

Public consumer reviews complement expert panel data

To complement and expand the sensory data of our trained tasting panel, we collected 180,000 reviews of our 250 beers from the online consumer review platform RateBeer. This provided numerical scores for beer appearance, aroma, taste, palate, overall quality as well as the average overall score.

Public datasets are known to suffer from biases, such as price, cult status and psychological conformity towards previous ratings of a product. For example, prices correlate with appreciation scores in these online consumer reviews (rho=0.49, Supplementary Fig.  S6 ), but not for our trained tasting panel (rho=0.19). This suggests that prices affect consumer appreciation, an effect that has been reported for wine 63 , while blind tastings are unaffected. Moreover, we observe that some beer styles, like lagers and non-alcoholic beers, generally receive lower scores, reflecting that online reviewers are mostly beer aficionados with a preference for specialty beers over lager beers. In general, we find a modest correlation between our trained panel’s overall appreciation score and the online consumer appreciation scores (Fig.  3 , rho=0.29). Apart from the aforementioned biases in the online datasets, serving temperature, sample freshness and surroundings, which are all tightly controlled during the tasting panel sessions, can vary tremendously across online consumers and can further contribute to differences (in appreciation, among other aspects) between the two categories of tasters. Importantly, in contrast to the overall appreciation scores, for many sensory aspects the results from the professional panel correlated well with results obtained from RateBeer reviews. Correlations were highest for features that are relatively easy to recognize even for untrained tasters, like bitterness, sweetness, alcohol and malt aroma (Fig.  3 and below).

figure 3

RateBeer text mining results can be found in Supplementary Data  7 . Rho values shown are Spearman correlation values, with asterisks indicating significant correlations ( p  < 0.05, two-sided). All p values were smaller than 0.001, except for Esters aroma (0.0553), Esters taste (0.3275), Esters aroma—banana (0.0019), Coriander (0.0508) and Diacetyl (0.0134).

Besides collecting consumer appreciation from these online reviews, we developed automated text analysis tools to gather additional data from review texts (Supplementary Data  7 ). Processing review texts on the RateBeer database yielded comparable results to the scores given by the trained panel for many common sensory aspects, including acidity, bitterness, sweetness, alcohol, malt, and hop tastes (Fig.  3 ). This is in line with what would be expected, since these attributes require less training for accurate assessment and are less influenced by environmental factors such as temperature, serving glass and odors in the environment. Consumer reviews also correlate well with our trained panel for 4-vinyl guaiacol, a compound associated with a very characteristic aroma. By contrast, correlations for more specific aromas like ester, coriander or diacetyl are underrepresented in the online reviews, underscoring the importance of using a trained tasting panel and standardized tasting sheets with explicit factors to be scored for evaluating specific aspects of a beer. Taken together, our results suggest that public reviews are trustworthy for some, but not all, flavor features and can complement or substitute taste panel data for these sensory aspects.
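
The text-analysis tools themselves are not reproduced here; as a minimal illustration of the general idea (the keyword lists and matching rules below are invented for this sketch and are not the published pipeline), flavor mentions can be counted across review texts:

    # Toy review-mining sketch: count reviews mentioning each flavor category.
    # The keyword lists are invented for illustration; the authors' tools are
    # more sophisticated.
    import re
    from collections import Counter

    FLAVOR_TERMS = {
        "bitter": ["bitter", "bitterness"],
        "sweet":  ["sweet", "sweetness", "sugary"],
        "acidic": ["sour", "acidic", "tart"],
    }

    def flavor_mentions(reviews):
        """Count how many reviews mention each flavor category at least once."""
        counts = Counter()
        for text in reviews:
            lowered = text.lower()
            for label, terms in FLAVOR_TERMS.items():
                if any(re.search(rf"\b{re.escape(t)}\b", lowered) for t in terms):
                    counts[label] += 1
        return counts

    print(flavor_mentions(["Nicely tart and a touch sour", "Too bitter for me"]))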

Models can predict beer sensory profiles from chemical data

The rich datasets of chemical analyses, tasting panel assessments and public reviews gathered in the first part of this study provided us with a unique opportunity to develop predictive models that link chemical data to sensorial features. Given the complexity of beer flavor, basic statistical tools such as correlations or linear regression may not always be the most suitable for making accurate predictions. Instead, we applied different machine learning models that can capture both simple linear and complex interactive relationships. Specifically, we constructed a set of regression models to predict (a) trained panel scores for beer flavor and quality and (b) public reviews’ appreciation scores from beer chemical profiles. We trained and tested 10 different models (Methods): 3 linear regression-based models (simple linear regression with first-order interactions (LR), lasso regression with first-order interactions (Lasso), and a partial least squares regressor (PLSR)), 5 decision tree models (AdaBoost regressor (ABR), extra trees (ET), gradient boosting regressor (GBR), random forest (RF) and XGBoost regressor (XGBR)), 1 support vector regressor (SVR), and 1 artificial neural network (ANN) model.
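
These model families correspond to widely available implementations; a sketch of how such a suite could be assembled with scikit-learn and XGBoost follows (defaults shown for brevity, so this is illustrative rather than the authors' exact configuration):

    # Illustrative model suite; hyperparameter choices and tuning are omitted,
    # and the authors' exact configurations may differ.
    from sklearn.linear_model import LinearRegression, Lasso
    from sklearn.cross_decomposition import PLSRegression
    from sklearn.ensemble import (AdaBoostRegressor, ExtraTreesRegressor,
                                  GradientBoostingRegressor, RandomForestRegressor)
    from sklearn.svm import SVR
    from sklearn.neural_network import MLPRegressor
    from xgboost import XGBRegressor

    MODELS = {
        # For LR and Lasso, first-order interaction terms would be generated
        # upstream, e.g., with sklearn.preprocessing.PolynomialFeatures.
        "LR":    LinearRegression(),
        "Lasso": Lasso(alpha=0.01),
        "PLSR":  PLSRegression(n_components=10),
        "ABR":   AdaBoostRegressor(),
        "ET":    ExtraTreesRegressor(),
        "GBR":   GradientBoostingRegressor(),
        "RF":    RandomForestRegressor(),
        "XGBR":  XGBRegressor(),
        "SVR":   SVR(),
        "ANN":   MLPRegressor(hidden_layer_sizes=(64, 32), max_iter=2000),
    }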

To compare the performance of our machine learning models, the dataset was randomly split into a training and a test set, stratified by beer style. After a model was trained on the training set, its performance was evaluated by its ability to predict the test set, using multi-output models scored by the coefficient of determination (see Methods). Additionally, individual-attribute models were ranked per descriptor and the average rank was calculated, as proposed by Korneva et al. 64 . Importantly, both ways of evaluating the models’ performance agreed in general. Performance of the different models varied (Table  1 ). Notably, all models perform better at predicting RateBeer results than results from our trained tasting panel. One reason could be that sensory data are inherently variable, and this variability is averaged out by the large number of public reviews from RateBeer. Additionally, all tree-based models perform better at predicting taste than aroma. Linear models (LR) performed particularly poorly, with negative R 2 values, due to severe overfitting (training set R 2  = 1). Overfitting is a common issue in linear models with many parameters and limited samples, especially when interaction terms further amplify the number of parameters. L1 regularization (Lasso) successfully overcomes this overfitting, out-competing multiple tree-based models on the RateBeer dataset. Similarly, the dimensionality reduction of PLSR avoids overfitting and improves performance, to some extent. Still, tree-based models (ABR, ET, GBR, RF and XGBR) show the best performance, out-competing the linear models (LR, Lasso, PLSR) commonly used in sensory science 65 .
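
A minimal sketch of this evaluation protocol, assuming a chemical matrix X, a multi-column sensory target y, and style labels (all placeholder names, not the authors' code):

    # Stratified train/test split plus held-out R^2 for a multi-output model.
    from sklearn.ensemble import GradientBoostingRegressor
    from sklearn.metrics import r2_score
    from sklearn.model_selection import train_test_split
    from sklearn.multioutput import MultiOutputRegressor

    def evaluate(X, y, styles, seed=0):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=0.2, stratify=styles, random_state=seed)
        model = MultiOutputRegressor(GradientBoostingRegressor(random_state=seed))
        model.fit(X_tr, y_tr)
        # Average coefficient of determination across all sensory attributes
        return r2_score(y_te, model.predict(X_te), multioutput="uniform_average")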

GBR models showed the best overall performance in predicting sensory responses from chemical information, with R 2 values up to 0.75 depending on the predicted sensory feature (Supplementary Table  S4 ). The GBR models predict consumer appreciation (RateBeer) better than our trained panel’s appreciation (R 2 value of 0.67 compared to R 2 value of 0.09) (Supplementary Table  S3 and Supplementary Table  S4 ). ANN models showed intermediate performance, likely because neural networks typically perform best with larger datasets 66 . The SVR shows intermediate performance, mostly due to the weak predictions of specific attributes that lower the overall performance (Supplementary Table  S4 ).

Model dissection identifies specific, unexpected compounds as drivers of consumer appreciation

Next, we leveraged our models to infer important contributors to sensory perception and consumer appreciation. Consumer preference is a crucial sensory aspect, because a product that shows low consumer appreciation scores often does not succeed commercially 25 . Additionally, the requirement for a large number of representative evaluators makes consumer trials one of the more costly and time-consuming aspects of product development. Hence, a model for predicting chemical drivers of overall appreciation would be a welcome addition to the available toolbox for food development and optimization.

Since GBR models on our RateBeer dataset showed the best overall performance, we focused on these models. Specifically, we used two approaches to identify important contributors. First, rankings of the most important predictors for each sensorial trait in the GBR models were obtained based on impurity-based feature importance (mean decrease in impurity). High-ranked parameters were hypothesized to be either the true causal chemical properties underlying the trait, to correlate with the actual causal properties, or to take part in sensory interactions affecting the trait 67 (Fig.  4A ). In a second approach, we used SHAP 68 to determine which parameters contributed most to the model for making predictions of consumer appreciation (Fig.  4B ). SHAP calculates parameter contributions to model predictions on a per-sample basis, which can be aggregated into an importance score.
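
A sketch of the two importance analyses described above, assuming a fitted single-target GBR model and a pandas DataFrame X of chemical properties (placeholder names, not the authors' code):

    # Impurity-based importances (MDI) and mean absolute SHAP values for a
    # fitted GradientBoostingRegressor; X is a samples-by-compounds DataFrame.
    import numpy as np
    import pandas as pd
    import shap

    def importance_tables(model, X: pd.DataFrame) -> pd.DataFrame:
        mdi = pd.Series(model.feature_importances_, index=X.columns, name="mdi")
        explainer = shap.TreeExplainer(model)
        shap_values = explainer.shap_values(X)  # shape: (n_samples, n_features)
        mean_abs = pd.Series(np.abs(shap_values).mean(axis=0),
                             index=X.columns, name="mean_abs_shap")
        return pd.concat([mdi, mean_abs], axis=1).sort_values(
            "mean_abs_shap", ascending=False)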

figure 4

A The impurity-based feature importance (mean decrease in impurity, MDI) calculated from the Gradient Boosting Regression (GBR) model predicting RateBeer appreciation scores. The top 15 highest ranked chemical properties are shown. B SHAP summary plot for the top 15 parameters contributing to our GBR model. Each point on the graph represents a sample from our dataset. The color represents the concentration of that parameter, with bluer colors representing lower values and redder colors representing higher values. Greater absolute values on the horizontal axis indicate a higher impact of the parameter on the prediction of the model. C Spearman correlations between the 15 most important chemical properties and consumer overall appreciation. Numbers indicate the Spearman rho correlation coefficient and the rank of this correlation compared to all other correlations. The top 15 important compounds were determined using SHAP (panel B).

Both approaches identified ethyl acetate as the most predictive parameter for beer appreciation (Fig.  4 ). Ethyl acetate is the most abundant ester in beer with a typical ‘fruity’, ‘solvent’ and ‘alcoholic’ flavor, but is often considered less important than other esters like isoamyl acetate. The second most important parameter identified by SHAP is ethanol, the most abundant beer compound after water. Apart from directly contributing to beer flavor and mouthfeel, ethanol drastically influences the physical properties of beer, dictating how easily volatile compounds escape the beer matrix to contribute to beer aroma 69 . Importantly, it should also be noted that the importance of ethanol for appreciation is likely inflated by the very low appreciation scores of non-alcoholic beers (Supplementary Fig.  S4 ). Despite not often being considered a driver of beer appreciation, protein level also ranks highly in both approaches, possibly due to its effect on mouthfeel and body 70 . Lactic acid, which contributes to the tart taste of sour beers, is the fourth most important parameter identified by SHAP, possibly due to the generally high appreciation of sour beers in our dataset.

Interestingly, some of the most important predictive parameters for our model are not well-established as beer flavors or are even commonly regarded as negative for beer quality. For example, our models identify methanethiol and ethyl phenyl acetate, an ester commonly linked to beer staling 71 , as key factors contributing to beer appreciation. Although there is no doubt that high concentrations of these compounds are considered unpleasant, the effects of modest concentrations are not yet known 72 , 73 .

To compare our approach to conventional statistics, we evaluated how well the 15 most important SHAP-derived parameters correlate with consumer appreciation (Fig.  4C ). Interestingly, only 6 of the properties derived by SHAP rank amongst the top 15 most correlated parameters. For some chemical compounds, the correlations are so low that they would likely have been considered unimportant. For example, lactic acid, the fourth most important parameter, shows a bimodal distribution for appreciation, with sour beers forming a separate cluster that is missed entirely by the Spearman correlation. Additionally, the correlation plots reveal outliers, emphasizing the need for robust analysis tools. Together, this highlights the need for alternative models, like the Gradient Boosting model, that better grasp the complexity of (beer) flavor.

Finally, to observe the relationships between these chemical properties and their predicted targets, partial dependence plots were constructed for the six most important predictors of consumer appreciation 74 , 75 , 76 (Supplementary Fig.  S7 ). One-way partial dependence plots show how a change in concentration affects the predicted appreciation. These plots reveal an important limitation of our models: appreciation predictions remain constant at ever-increasing concentrations. This implies that once a threshold concentration is reached, further increasing the concentration does not affect appreciation. This is false, as it is well-documented that certain compounds become unpleasant at high concentrations, including ethyl acetate (‘nail polish’) 77 and methanethiol (‘sulfury’ and ‘rotten cabbage’) 78 . The inability of our models to grasp that flavor compounds have optimal levels, above which they become negative, is a consequence of working with commercial beer brands where (off-)flavors are rarely too high to negatively impact the product. The two-way partial dependence plots show how changing the concentration of two compounds influences predicted appreciation, visualizing their interactions (Supplementary Fig.  S7 ). In our case, the top 5 parameters are dominated by additive or synergistic interactions, with high concentrations for both compounds resulting in the highest predicted appreciation.
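
With scikit-learn, such plots can be generated directly from a fitted model; a sketch follows (feature names are illustrative placeholders, continuing from the fitted model and DataFrame X above):

    # One-way and two-way partial dependence for selected predictors; the
    # feature names below are illustrative, not the exact column labels used.
    import matplotlib.pyplot as plt
    from sklearn.inspection import PartialDependenceDisplay

    top_features = ["ethyl_acetate", "ethanol", "lactic_acid"]
    PartialDependenceDisplay.from_estimator(model, X, features=top_features)
    PartialDependenceDisplay.from_estimator(
        model, X, features=[("ethyl_acetate", "ethanol")])  # two-way interaction
    plt.show()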

To assess the robustness of our best-performing models and model predictions, we performed 100 iterations of the GBR, RF and ET models. In general, all iterations of the models yielded similar performance (Supplementary Fig.  S8 ). Moreover, the main predictors (including the top predictors ethanol and ethyl acetate) remained virtually the same, especially for GBR and RF. For the iterations of the ET model, we did observe more variation in the top predictors, which is likely a consequence of the model’s inherent random architecture in combination with co-correlations between certain predictors. However, even in this case, several of the top predictors (ethanol and ethyl acetate) remain unchanged, although their rank in importance changes (Supplementary Fig.  S8 ).
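
A sketch of such a stability check for a single target such as appreciation (placeholder variable names; the authors' procedure may differ in detail):

    # Refit across random seeds; record held-out R^2 and the top-ranked feature.
    import pandas as pd
    from sklearn.ensemble import GradientBoostingRegressor
    from sklearn.metrics import r2_score
    from sklearn.model_selection import train_test_split

    def robustness(X: pd.DataFrame, y, styles, n_iter=100):
        rows = []
        for seed in range(n_iter):
            X_tr, X_te, y_tr, y_te = train_test_split(
                X, y, test_size=0.2, stratify=styles, random_state=seed)
            m = GradientBoostingRegressor(random_state=seed).fit(X_tr, y_tr)
            imp = pd.Series(m.feature_importances_, index=X.columns)
            rows.append({"seed": seed,
                         "r2": r2_score(y_te, m.predict(X_te)),
                         "top_feature": imp.idxmax()})
        return pd.DataFrame(rows)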

Next, we investigated if a combination of RateBeer and trained panel data into one consolidated dataset would lead to stronger models, under the hypothesis that such a model would suffer less from bias in the datasets. A GBR model was trained to predict appreciation on the combined dataset. This model underperformed compared to the RateBeer model, both in the native case and when including a dataset identifier (R 2  = 0.67, 0.26 and 0.42 respectively). For the latter, the dataset identifier is the most important feature (Supplementary Fig.  S9 ), while most of the feature importance remains unchanged, with ethyl acetate and ethanol ranking highest, like in the original model trained only on RateBeer data. It seems that the large variation in the panel dataset introduces noise, weakening the models’ performances and reliability. In addition, it seems reasonable to assume that both datasets are fundamentally different, with the panel dataset obtained by blind tastings by a trained professional panel.

Lastly, we evaluated whether beer style identifiers would further enhance the model’s performance. A GBR model was trained with parameters that explicitly encoded the styles of the samples. This did not improve model performance (R 2  = 0.66 with style information vs R 2  = 0.67 without). The most important chemical features are consistent with the model trained without style information (e.g., ethanol and ethyl acetate), and, with the exception of the most preferred (strong ale) and least preferred (low/no-alcohol) styles, none of the styles were among the most important features (Supplementary Fig.  S9 , Supplementary Tables  S5 and S6 ). This is likely due to a combination of style-specific chemical signatures, such as iso-alpha acids and lactic acid, that implicitly convey style information to the original models, as well as the low number of samples belonging to some styles, making it difficult for the model to learn style-specific patterns. Moreover, beer styles are not rigorously defined, with some styles overlapping in features and some beers being misattributed to a specific style, all of which leads to more noise in models that use style parameters.

Model validation

To test if our predictive models give insight into beer appreciation, we set up experiments aimed at improving existing commercial beers. We specifically selected overall appreciation as the trait to be examined because of its complexity and commercial relevance. Beer flavor comprises a complex bouquet rather than single aromas and tastes 53 . Hence, adding a single compound to the extent that a difference is noticeable may lead to an unbalanced, artificial flavor. Therefore, we evaluated the effect of combinations of compounds. Because Blond beers represent the most extensive style in our dataset, we selected a beer from this style as the starting material for these experiments (Beer 64 in Supplementary Data  1 ).

In the first set of experiments, we adjusted the concentrations of compounds that made up the most important predictors of overall appreciation (ethyl acetate, ethanol, lactic acid, ethyl phenyl acetate) together with correlated compounds (ethyl hexanoate, isoamyl acetate, glycerol), bringing them up to 95th percentile ethanol-normalized concentrations (Methods) within the Blond group (‘Spiked’ concentration in Fig.  5A ). Compared to controls, the spiked beers were found to have significantly improved overall appreciation among trained panelists, with panelists noting increased intensity of ester flavors, sweetness, alcohol, and body fullness (Fig.  5B ). To disentangle the contribution of ethanol to these results, a second experiment was performed without the addition of ethanol. This resulted in a similar outcome, including increased perception of alcohol and overall appreciation.
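
A sketch of how such spiking targets could be derived (the 'style' and 'ethanol' column names are hypothetical, assumed for illustration):

    # Compute 95th-percentile ethanol-normalized concentrations within a style.
    import pandas as pd

    def spike_targets(chem: pd.DataFrame, compounds, style="Blond") -> pd.Series:
        """Return ethanol-normalized target levels per compound for one style."""
        subset = chem[chem["style"] == style]
        normalized = subset[compounds].div(subset["ethanol"], axis=0)
        return normalized.quantile(0.95)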

figure 5

Adding the top chemical compounds, identified as best predictors of appreciation by our model, into poorly appreciated beers results in increased appreciation from our trained panel. Results of sensory tests between base beers and those spiked with compounds identified as the best predictors by the model. A Blond and Non/Low-alcohol (0.0% ABV) base beers were brought up to 95th-percentile ethanol-normalized concentrations within each style. B For each sensory attribute, tasters indicated the more intense sample and selected the sample they preferred. The numbers above the bars correspond to the p values that indicate significant changes in perceived flavor (two-sided binomial test: alpha 0.05, n  = 20 or 13).

In a last experiment, we tested whether using the model’s predictions can boost the appreciation of a non-alcoholic beer (beer 223 in Supplementary Data  1 ). Again, the addition of a mixture of predicted compounds (omitting ethanol, in this case) resulted in a significant increase in appreciation, body, ester flavor and sweetness.
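
The preference comparisons reported in Fig. 5 rest on two-sided exact binomial tests against chance; a sketch with toy counts (not results from the study):

    # Two-sided exact binomial test of preference against chance (p = 0.5).
    # The counts below are toy values for illustration only.
    from scipy.stats import binomtest

    n_tasters, n_prefer_spiked = 20, 16
    result = binomtest(n_prefer_spiked, n_tasters, p=0.5, alternative="two-sided")
    print(f"p = {result.pvalue:.4f}")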

Predicting flavor and consumer appreciation from chemical composition is one of the ultimate goals of sensory science. A reliable, systematic and unbiased way to link chemical profiles to flavor and food appreciation would be a significant asset to the food and beverage industry. Such tools would substantially aid in quality control and recipe development, offer an efficient and cost-effective alternative to pilot studies and consumer trials, and ultimately allow food manufacturers to more efficiently produce superior, tailor-made products that better meet the demands of specific consumer groups.

A limited set of studies have previously tried, with varying degrees of success, to predict beer flavor and beer popularity based on (a limited set of) chemical compounds and flavors 79 , 80 . Current sensitive, high-throughput technologies allow measuring an unprecedented number of chemical compounds and properties in a large set of samples, yielding a dataset that can train models that help close the gaps between chemistry and flavor, even for a complex natural product like beer. To our knowledge, no previous research gathered data at this scale (250 samples, 226 chemical parameters, 50 sensory attributes and 5 consumer scores) to disentangle and validate the chemical aspects driving beer preference using various machine-learning techniques. We find that modern machine learning models outperform conventional statistical tools, such as correlations and linear models, and can successfully predict flavor appreciation from chemical composition. This could be attributed to the natural incorporation of interactions and non-linear or discontinuous effects in machine learning models, which are not easily grasped by the linear model architecture. While linear models and partial least squares regression represent the most widespread statistical approaches in sensory science, in part because they allow interpretation 65 , 81 , 82 , modern machine learning methods allow for building better predictive models while preserving the possibility to dissect and exploit the underlying patterns. Of the 10 different models we trained, tree-based models, such as our best-performing GBR, showed the best overall performance in predicting sensory responses from chemical information, outcompeting artificial neural networks. This agrees with previous reports for models trained on tabular data 83 . Our results are in line with the findings of Colantonio et al., who also identified the gradient boosting architecture as performing best at predicting appreciation and flavor (of tomatoes and blueberries, in their specific study) 26 . Importantly, besides our larger experimental scale, we were able to directly confirm our models’ predictions in vivo.

Our study confirms that flavor compound concentration does not always correlate with perception, suggesting complex interactions that are often missed by more conventional statistics and simple models. Specifically, we find that tree-based algorithms may perform best in developing models that link complex food chemistry with aroma. Furthermore, we show that massive datasets of untrained consumer reviews provide a valuable source of data, that can complement or even replace trained tasting panels, especially for appreciation and basic flavors, such as sweetness and bitterness. This holds despite biases that are known to occur in such datasets, such as price or conformity bias. Moreover, GBR models predict taste better than aroma. This is likely because taste (e.g. bitterness) often directly relates to the corresponding chemical measurements (e.g., iso-alpha acids), whereas such a link is less clear for aromas, which often result from the interplay between multiple volatile compounds. We also find that our models are best at predicting acidity and alcohol, likely because there is a direct relation between the measured chemical compounds (acids and ethanol) and the corresponding perceived sensorial attribute (acidity and alcohol), and because even untrained consumers are generally able to recognize these flavors and aromas.

The predictions of our final models, trained on review data, hold even for blind tastings with small groups of trained tasters, as demonstrated by our ability to validate specific compounds as drivers of beer flavor and appreciation. Since adding a single compound to the extent of a noticeable difference may result in an unbalanced flavor profile, we specifically tested our identified key drivers as a combination of compounds. While this approach does not allow us to validate if a particular single compound would affect flavor and/or appreciation, our experiments do show that this combination of compounds increases consumer appreciation.

It is important to stress that, while it represents an important step forward, our approach still has several major limitations. A key weakness of the GBR model architecture is that amongst co-correlating variables, the largest main effect is consistently preferred for model building. As a result, co-correlating variables often have artificially low importance scores, both for impurity and SHAP-based methods, like we observed in the comparison to the more randomized Extra Trees models. This implies that chemicals identified as key drivers of a specific sensory feature by GBR might not be the true causative compounds, but rather co-correlate with the actual causative chemical. For example, the high importance of ethyl acetate could be (partially) attributed to the total ester content, ethanol or ethyl hexanoate (rho=0.77, rho=0.72 and rho=0.68), while ethyl phenylacetate could hide the importance of prenyl isobutyrate and ethyl benzoate (rho=0.77 and rho=0.76). Expanding our GBR model to include beer style as a parameter did not yield additional power or insight. This is likely due to style-specific chemical signatures, such as iso-alpha acids and lactic acid, that implicitly convey style information to the original model, as well as the smaller sample size per style, limiting the power to uncover style-specific patterns. This can be partly attributed to the curse of dimensionality, where the high number of parameters results in the models mainly incorporating single parameter effects, rather than complex interactions such as style-dependent effects 67 . A larger number of samples may overcome some of these limitations and offer more insight into style-specific effects. On the other hand, beer style is not a rigid scientific classification, and beers within one style often differ a lot, which further complicates the analysis of style as a model factor.

Our study is limited to beers from Belgian breweries. Although these beers cover a large portion of the beer styles available globally, some beer styles and consumer patterns may be missing, while other features might be overrepresented. For example, many Belgian ales exhibit yeast-driven flavor profiles, which is reflected in the chemical drivers of appreciation discovered by this study. In future work, expanding the scope to include diverse markets and beer styles could lead to the identification of even more drivers of appreciation and better models for special niche products that were not present in our beer set.

In addition to inherent limitations of GBR models, there are also some limitations associated with studying food aroma. Even if our chemical analyses measured most of the known aroma compounds, the total number of flavor compounds in complex foods like beer is still larger than the subset we were able to measure in this study. For example, hop-derived thiols, that influence flavor at very low concentrations, are notoriously difficult to measure in a high-throughput experiment. Moreover, consumer perception remains subjective and prone to biases that are difficult to avoid. It is also important to stress that the models are still immature and that more extensive datasets will be crucial for developing more complete models in the future. Besides more samples and parameters, our dataset does not include any demographic information about the tasters. Including such data could lead to better models that grasp external factors like age and culture. Another limitation is that our set of beers consists of high-quality end-products and lacks beers that are unfit for sale, which limits the current model in accurately predicting products that are appreciated very badly. Finally, while models could be readily applied in quality control, their use in sensory science and product development is restrained by their inability to discern causal relationships. Given that the models cannot distinguish compounds that genuinely drive consumer perception from those that merely correlate, validation experiments are essential to identify true causative compounds.

Despite the inherent limitations, dissection of our models enabled us to pinpoint specific molecules as potential drivers of beer aroma and consumer appreciation, including compounds that were unexpected and would not have been identified using standard approaches. Important drivers of beer appreciation uncovered by our models include protein levels, ethyl acetate, ethyl phenyl acetate and lactic acid. Currently, many brewers already use lactic acid to acidify their brewing water and ensure optimal pH for enzymatic activity during the mashing process. Our results suggest that adding lactic acid can also improve beer appreciation, although its individual effect remains to be tested. Interestingly, ethanol appears to be unnecessary to improve beer appreciation, both for blond beer and alcohol-free beer. Given the growing consumer interest in alcohol-free beer, with a predicted annual market growth of >7% 84 , it is relevant for brewers to know what compounds can further increase consumer appreciation of these beers. Hence, our model may readily provide avenues to further improve the flavor and consumer appreciation of both alcoholic and non-alcoholic beers, which is generally considered one of the key challenges for future beer production.

Whereas we see a direct implementation of our results for the development of superior alcohol-free beverages and other food products, our study can also serve as a stepping stone for the development of novel alcohol-containing beverages. We want to echo the growing body of scientific evidence for the negative effects of alcohol consumption, both on the individual level by the mutagenic, teratogenic and carcinogenic effects of ethanol 85 , 86 , as well as the burden on society caused by alcohol abuse and addiction. We encourage the use of our results for the production of healthier, tastier products, including novel and improved beverages with lower alcohol contents. Furthermore, we strongly discourage the use of these technologies to improve the appreciation or addictive properties of harmful substances.

The present work demonstrates that despite some important remaining hurdles, combining the latest developments in chemical analyses, sensory analysis and modern machine learning methods offers exciting avenues for food chemistry and engineering. Soon, these tools may provide solutions in quality control and recipe development, as well as new approaches to sensory science and flavor research.

Beer selection

A total of 250 commercial Belgian beers were selected to cover the broad diversity of beer styles and the corresponding diversity in chemical composition and aroma (Supplementary Fig.  S1 ).

Chemical dataset

Sample preparation.

Beers within their expiration date were purchased from commercial retailers. Samples were prepared in biological duplicates at room temperature, unless explicitly stated otherwise. Bottle pressure was measured with a manual pressure device (Steinfurth Mess-Systeme GmbH) and used to calculate the CO 2 concentration. The beer was poured through two filter papers (Macherey-Nagel, 500713032 MN 713 ¼) to remove carbon dioxide and prevent spontaneous foaming. Samples were then prepared for measurement by targeted Headspace-Gas Chromatography-Flame Ionization Detector/Flame Photometric Detector (HS-GC-FID/FPD), Headspace-Solid Phase Microextraction-Gas Chromatography-Mass Spectrometry (HS-SPME-GC-MS), colorimetric analysis, enzymatic analysis, and Near-Infrared (NIR) analysis, as described in the sections below. The mean values of biological duplicates are reported for each compound.

HS-GC-FID/FPD

HS-GC-FID/FPD (Shimadzu GC 2010 Plus) was used to measure higher alcohols, acetaldehyde, esters, 4-vinyl guaiacol, and sulfur compounds. Each measurement comprised 5 ml of sample pipetted into a 20 ml glass vial containing 1.75 g NaCl (VWR, 27810.295). 100 µl of a 2-heptanol (Sigma-Aldrich, H3003) internal standard solution in ethanol (Fisher Chemical, E/0650DF/C17) was added for a final concentration of 2.44 mg/L. Samples were flushed with nitrogen for 10 s, sealed with a silicone septum, stored at −80 °C and analyzed in batches of 20.

The GC was equipped with a DB-WAXetr column (length, 30 m; internal diameter, 0.32 mm; layer thickness, 0.50 µm; Agilent Technologies, Santa Clara, CA, USA) to the FID and an HP-5 column (length, 30 m; internal diameter, 0.25 mm; layer thickness, 0.25 µm; Agilent Technologies, Santa Clara, CA, USA) to the FPD. N 2 was used as the carrier gas. Samples were incubated for 20 min at 70 °C in the headspace autosampler (flow rate, 35 cm/s; injection volume, 1000 µL; injection mode, split; Combi PAL autosampler, CTC analytics, Switzerland). The injector, FID and FPD temperatures were kept at 250 °C. The GC oven temperature was first held at 50 °C for 5 min, then allowed to rise to 80 °C at a rate of 5 °C/min, followed by a second ramp of 4 °C/min until 200 °C, held for 3 min, and a final ramp of 4 °C/min until 230 °C, held for 1 min. Results were analyzed with the GCSolution software version 2.4 (Shimadzu, Kyoto, Japan). The GC was calibrated with a 5% EtOH solution (VWR International) containing the volatiles under study (Supplementary Table  S7 ).

HS-SPME-GC-MS

HS-SPME-GC-MS (Shimadzu GCMS-QP-2010 Ultra) was used to measure additional volatile compounds, mainly comprising terpenoids and esters. Samples were analyzed by HS-SPME using a triphase DVB/Carboxen/PDMS 50/30 μm SPME fiber (Supelco Co., Bellefonte, PA, USA) followed by gas chromatography (Thermo Fisher Scientific Trace 1300 series, USA) coupled to a mass spectrometer (Thermo Fisher Scientific ISQ series MS) equipped with a TriPlus RSH autosampler. 5 ml of degassed beer sample was placed in 20 ml vials containing 1.75 g NaCl (VWR, 27810.295). 5 µl internal standard mix was added, containing 2-heptanol (1 g/L) (Sigma-Aldrich, H3003), 4-fluorobenzaldehyde (1 g/L) (Sigma-Aldrich, 128376), 2,3-hexanedione (1 g/L) (Sigma-Aldrich, 144169) and guaiacol (1 g/L) (Sigma-Aldrich, W253200) in ethanol (Fisher Chemical, E/0650DF/C17). Each sample was incubated at 60 °C in the autosampler oven with constant agitation. After 5 min equilibration, the SPME fiber was exposed to the sample headspace for 30 min. The compounds trapped on the fiber were thermally desorbed in the injection port of the chromatograph by heating the fiber for 15 min at 270 °C.

The GC-MS was equipped with a low-polarity RXi-5Sil MS column (length, 20 m; internal diameter, 0.18 mm; film thickness, 0.18 µm; Restek, Bellefonte, PA, USA). Injection was performed in splitless mode at 320 °C, with a split flow of 9 ml/min, a purge flow of 5 ml/min, and an open valve time of 3 min. To obtain a pulsed injection, a programmed gas flow was used whereby the helium flow was set at 2.7 mL/min for 0.1 min and then decreased at 20 ml/min to the normal 0.9 mL/min. The oven temperature was first held at 30 °C for 3 min, raised to 80 °C at 7 °C/min, then raised at 2 °C/min to 125 °C, and finally raised at 8 °C/min to a final temperature of 270 °C.

Mass acquisition range was 33 to 550 amu at a scan rate of 5 scans/s. Electron impact ionization energy was 70 eV. The interface and ion source were kept at 275 °C and 250 °C, respectively. A mix of linear n-alkanes (from C7 to C40, Supelco Co.) was injected into the GC-MS under identical conditions to serve as external retention index markers. Identification and quantification of the compounds were performed using an in-house developed R script, as described in Goelen et al. and Reher et al. 87 , 88 (for package information, see Supplementary Table S8). Briefly, chromatograms were analyzed using AMDIS (v2.71) 89 to separate overlapping peaks and obtain pure compound spectra. The NIST MS Search software (v2.0 g), in combination with the NIST2017, FFNSC3 and Adams4 libraries, was used to manually identify the empirical spectra, taking into account the expected retention time. After background subtraction and correction for retention time shifts between samples run on different days (based on the alkane ladders), compound elution profiles were extracted and integrated using a list of 284 target compounds of interest, which were either recovered in our identified AMDIS list of spectra or known to occur in beer. Compound elution profiles were estimated for every peak in every chromatogram over a time-restricted window using weighted non-negative least squares analysis, after which peak areas were integrated 87 , 88 . Batch effects were corrected by normalizing against the most stable internal standard compound, 4-fluorobenzaldehyde. Of the 284 target compounds analyzed, 167 were visually judged to have reliable elution profiles and were used for the final analysis.
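
The in-house R scripts themselves are not reproduced here, but the central deconvolution step can be illustrated. Below is a minimal Python sketch of non-negative least squares (NNLS) deconvolution on simulated data; the study's actual implementation (refs. 87, 88) was written in R, used weighted NNLS, and operated on measured reference spectra, so every value and name here is a hypothetical stand-in.

```python
import numpy as np
from scipy.optimize import nnls

rng = np.random.default_rng(0)

# Hypothetical reference mass spectra for three co-eluting target compounds
# (rows: compounds, columns: m/z channels).
S = rng.random((3, 50))

# Simulate Gaussian elution profiles over a 40-scan retention-time window.
t = np.arange(40)
def gauss(mu, sigma, height):
    return height * np.exp(-0.5 * ((t - mu) / sigma) ** 2)
C_true = np.vstack([gauss(15, 3, 100), gauss(20, 3, 60), gauss(25, 3, 30)])

# Observed window: mixture of the three compounds plus measurement noise.
X = C_true.T @ S + rng.normal(0, 0.5, size=(40, 50))

# Deconvolution: one non-negative least squares fit per scan recovers each
# compound's contribution to the mixed spectrum at that time point.
C_est = np.array([nnls(S.T, scan)[0] for scan in X])  # shape: (scans, compounds)

# Integrate each recovered elution profile to obtain peak areas.
areas = np.trapz(C_est, t, axis=0)
print(areas)
```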

Discrete photometric and enzymatic analysis

Discrete photometric and enzymatic analysis (Thermo Scientific Gallery Plus Beermaster Discrete Analyzer) was used to measure acetic acid, ammonia, beta-glucan, iso-alpha acids, color, sugars, glycerol, iron, pH, protein, and sulfite. 2 ml of sample was used for the analyses. Information regarding the reagents and standard solutions used for the analyses and calibrations is included in Supplementary Tables S7 and S9.

NIR analyses

NIR analysis (Anton Paar Alcolyzer Beer ME System) was used to measure ethanol. Measurements comprised 50 ml of sample, and a 10% EtOH solution was used for calibration.

Correlation calculations

Pairwise Spearman rank correlations were calculated between all chemical properties.
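
As a simple illustration, if the chemical measurements are collected in a pandas DataFrame with one row per beer (the file name and layout below are hypothetical), the full correlation matrix is a single call:

```python
import pandas as pd

# Hypothetical layout: rows are the 250 beers, columns the chemical properties.
chem = pd.read_csv("chemical_properties.csv", index_col=0)

# Pairwise Spearman rank correlations between all chemical properties.
corr = chem.corr(method="spearman")
print(corr.shape)  # (n_properties, n_properties)
```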

Sensory dataset

Trained panel.

Our trained tasting panel consisted of volunteers who gave prior verbal informed consent. All compounds used in the validation experiment were of food-grade quality. The tasting sessions were approved by the Social and Societal Ethics Committee of KU Leuven (G-2022-5677-R2(MAR)). All online reviewers agreed to the Terms and Conditions of the RateBeer website.

Sensory analysis was performed according to the American Society of Brewing Chemists (ASBC) Sensory Analysis Methods 90 . 30 volunteers were screened through a series of triangle tests. The sixteen most sensitive and consistent tasters were retained as taste panel members. The resulting panel was diverse in age (22–42 years, mean 29), sex (56% male) and nationality (7 different countries). The panel developed a consensus vocabulary to describe beer aroma, taste and mouthfeel. Panelists were trained to identify and score 50 different attributes, using a 7-point scale to rate attribute intensity. The scoring sheet is included as Supplementary Data 3. Sensory assessments took place between 10 a.m. and 12 p.m. The beers were served in black-colored glasses. Per session, between 5 and 12 beers of the same style were tasted at 12 °C to 16 °C. Two reference beers were added to each set, indicated as ‘Reference 1’ and ‘Reference 2’, allowing panel members to calibrate their ratings. Not all panelists were present at every tasting. Scores were mean-centered and scaled by standard deviation per taster. Values are represented as z-scores and clustered by Euclidean distance. Pairwise Spearman correlations were calculated between taste and aroma sensory attributes. Panel consistency was evaluated by repeating samples in different sessions and performing ANOVA to identify differences, using the ‘stats’ package (v4.2.2) in R (for package information, see Supplementary Table S8).
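
The scaling and consistency checks were done in R (‘stats’ v4.2.2); the sketch below shows equivalent logic in Python, assuming hypothetical long-format data with columns taster, session, beer, attribute, and score:

```python
import pandas as pd
from scipy import stats

# Hypothetical long-format panel data: one row per rating.
df = pd.read_csv("panel_scores.csv")  # columns: taster, session, beer, attribute, score

# Mean-center and scale scores per taster, yielding z-scores.
df["z"] = df.groupby("taster")["score"].transform(lambda s: (s - s.mean()) / s.std())

# Panel consistency: for a beer served in several sessions, test per attribute
# whether session means differ (one-way ANOVA).
repeated = df[df["beer"] == "Reference 1"]  # hypothetical repeated sample
for attr, grp in repeated.groupby("attribute"):
    groups = [g["z"].values for _, g in grp.groupby("session")]
    print(attr, stats.f_oneway(*groups).pvalue)
```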

Online reviews from a public database

The ‘scrapy’ package in Python (v3.6) (for package information, see Supplementary Table S8) was used to collect 232,288 online reviews (mean = 922, min = 6, max = 5343 reviews per beer) from RateBeer, an online beer review database. Each review entry comprised five numerical scores (appearance, aroma, taste, palate and overall quality) and an optional review text. The total number of reviews per reviewer was collected separately. Numerical scores were scaled and centered per rater, and mean scores were calculated per beer.

For the review texts, the language was estimated using the ‘langdetect’ and ‘langid’ packages in Python; reviews that were classified as English by both packages were kept. Reviewers with fewer than 100 entries overall were discarded, leaving 181,025 reviews from >6000 reviewers from >40 countries. Text processing was done using the ‘nltk’ package in Python. Texts were corrected for slang and misspellings; proper nouns and rare words relevant to the beer context (‘Chimay’, ‘Lambic’, etc.) were specified and kept as-is. A dictionary of semantically similar sensorial terms (for example, ‘floral’ and ‘flower’) was created, and such terms were collapsed into a single term. Words were stemmed and lemmatized to avoid identifying words such as ‘acid’ and ‘acidity’ as separate terms. Numbers and punctuation were removed.
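
A minimal sketch of this language filtering and text normalization; the protected-word set and synonym dictionary below are tiny hypothetical stand-ins for the ones built for the study:

```python
import string
from langdetect import detect                            # pip install langdetect
import langid                                             # pip install langid
from nltk.stem import PorterStemmer, WordNetLemmatizer    # pip install nltk
# nltk.download("wordnet") is required once for the lemmatizer.

def is_english(text):
    # Keep a review only if both classifiers agree it is English.
    return detect(text) == "en" and langid.classify(text)[0] == "en"

KEEP_AS_IS = {"chimay", "lambic"}   # hypothetical protected beer terms
SYNONYMS = {"flower": "floral"}     # hypothetical sensory-term dictionary

stemmer, lemmatizer = PorterStemmer(), WordNetLemmatizer()

def preprocess(text):
    # Remove numbers and punctuation, lowercase, then normalize each token.
    text = text.lower().translate(str.maketrans("", "", string.punctuation + string.digits))
    tokens = []
    for word in text.split():
        if word in KEEP_AS_IS:
            tokens.append(word)
            continue
        word = SYNONYMS.get(word, word)
        tokens.append(stemmer.stem(lemmatizer.lemmatize(word)))
    return tokens

print(preprocess("Floral aroma, classic Lambic"))
```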

Sentences from up to 50 randomly chosen reviews per beer were manually categorized according to the aspect of beer they describe (appearance, aroma, taste, palate, or overall quality; these are not to be confused with the five numerical scores described above) or flagged as irrelevant if they contained no useful information. If a beer had fewer than 50 reviews, all of its reviews were manually classified. This labeled data set was used to train a model that classified the remaining sentences for all beers 91 . Sentences describing taste and aroma were extracted, and term frequency–inverse document frequency (TF-IDF) was used to calculate enrichment scores for sensorial words per beer.
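
As an illustration of the TF-IDF step, assuming the taste and aroma sentences for each beer have already been concatenated into one document per beer (the beer names and texts below are hypothetical placeholders):

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical input: one document of taste/aroma sentences per beer.
docs = {
    "Beer A": "fruity banana aroma sweet malt spicy phenolic",
    "Beer B": "funky dry hoppy bitter earthy",
}

vec = TfidfVectorizer()
tfidf = vec.fit_transform(docs.values())  # rows: beers, columns: terms

# Enrichment score of each sensorial word per beer.
scores = pd.DataFrame(tfidf.toarray(), index=list(docs),
                      columns=vec.get_feature_names_out())
print(scores.loc["Beer B"].nlargest(3))
```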

The sex of the tasting subjects was not considered when building our sensory database. Instead, results from different panelists were averaged, both for our trained panel (56% male, 44% female) and for the RateBeer reviews (70% male, 30% female for RateBeer as a whole).

Beer price collection and processing

Beer prices were collected from the following stores: Colruyt, Delhaize, Total Wine, BeerHawk, The Belgian Beer Shop, The Belgian Shop, and Beer of Belgium. Where applicable, prices were converted to Euros and normalized per liter. Spearman correlations were calculated between these prices and mean overall appreciation scores from RateBeer and the taste panel, respectively.

Pairwise Spearman rank correlations were calculated between all sensory properties.

Machine learning models

Predictive modeling of sensory profiles from chemical data.

Regression models were constructed to predict (a) trained panel scores for beer flavors and quality from beer chemical profiles and (b) public reviews’ appreciation scores from beer chemical profiles. Z-scores were used to represent sensory attributes in both data sets. Chemical properties with log-normal distributions (Shapiro-Wilk test, p < 0.05) were log-transformed. Missing chemical measurements (0.1% of all data) were replaced with mean values per attribute. Observations from 250 beers were randomly separated into a training set (70%, 175 beers) and a test set (30%, 75 beers), stratified per beer style. Chemical measurements (p = 231) were normalized based on the training set average and standard deviation. In total, ten model types were trained: three linear regression-based models (linear regression with first-order interaction terms (LR), lasso regression with first-order interaction terms (Lasso), and partial least squares regression (PLSR)); five decision tree models (AdaBoost regressor (ABR), Extra Trees (ET), gradient boosting regressor (GBR), random forest (RF), and XGBoost regressor (XGBR)); one support vector machine model (SVR); and one artificial neural network model (ANN). The models were implemented using the ‘scikit-learn’ package (v1.2.2) and the ‘xgboost’ package (v1.7.3) in Python (v3.9.16). Models were trained, and hyperparameters optimized, using five-fold cross-validated grid search with the coefficient of determination (R²) as the evaluation metric. The ANN (scikit-learn’s MLPRegressor) was optimized using Bayesian Tree-structured Parzen Estimator optimization with the ‘Optuna’ Python package (v3.2.0). Individual models were trained per attribute, and a multi-output model was trained on all attributes simultaneously.
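
A minimal sketch of this training setup for a single sensory attribute, using the gradient boosting regressor as the example model; the file names, array layouts and the small hyperparameter grid are illustrative assumptions, not the study's actual settings:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

# Hypothetical arrays: X holds the chemical features per beer, y the z-scored
# sensory attribute, and styles the beer style labels used for stratification.
X, y, styles = np.load("X.npy"), np.load("y.npy"), np.load("styles.npy")

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=styles, random_state=0)

# Normalize features using training-set statistics only.
mu, sd = X_tr.mean(axis=0), X_tr.std(axis=0)
X_tr, X_te = (X_tr - mu) / sd, (X_te - mu) / sd

# Five-fold cross-validated grid search with R^2 as the evaluation metric.
grid = {"n_estimators": [200, 500], "learning_rate": [0.01, 0.1], "max_depth": [3, 5]}
search = GridSearchCV(GradientBoostingRegressor(random_state=0),
                      grid, cv=5, scoring="r2")
search.fit(X_tr, y_tr)
print(search.best_params_, search.score(X_te, y_te))
```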

Model dissection

GBR outperformed the other methods, yielding models with the highest average R² values on both the trained panel and public review data sets. Impurity-based rankings of the most important predictors for each predicted sensorial trait were obtained using the ‘scikit-learn’ package. To examine the relationships between these chemical properties and their predicted targets, partial dependence plots (PDPs) were constructed for the six most important predictors of consumer appreciation 74 , 75 .
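
Continuing the hypothetical sketch above (`search` and `X_tr` come from the previous block), the impurity-based ranking and the PDPs are available directly from the fitted scikit-learn model:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.inspection import PartialDependenceDisplay

gbr = search.best_estimator_  # fitted GBR from the previous sketch

# Impurity-based importance ranking of the chemical predictors.
top6 = np.argsort(gbr.feature_importances_)[::-1][:6]
print("Most important predictor columns:", top6)

# Partial dependence of the predicted attribute on the top six predictors.
PartialDependenceDisplay.from_estimator(gbr, X_tr, features=top6.tolist())
plt.show()
```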

The ‘SHAP’ package in Python (v0.41.0) was used to provide an alternative ranking of predictor importance and to visualize each predictor’s effect as a function of its concentration 68 .
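
A corresponding SHAP sketch for the same hypothetical model; TreeExplainer computes exact SHAP values for tree ensembles, and the summary plot ranks predictors by mean absolute SHAP value while showing each predictor's effect as a function of its value:

```python
import shap  # pip install shap

explainer = shap.TreeExplainer(gbr)        # gbr, X_tr from the sketches above
shap_values = explainer.shap_values(X_tr)

# Beeswarm summary: predictor importance and effect direction per sample.
shap.summary_plot(shap_values, X_tr)
```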

Validation of causal chemical properties

To validate the effects of the most important model features on predicted sensory attributes, beers were spiked with the chemical compounds identified by the models and descriptive sensory analyses were carried out according to the American Society of Brewing Chemists (ASBC) protocol 90 .

Compound spiking was done 30 min before tasting. Compounds were spiked into fresh beer bottles, which were immediately resealed and inverted three times. Fresh bottles of beer were opened for the same duration, resealed, and inverted three times to serve as controls. Pairs of spiked samples and controls were served simultaneously, chilled and in dark glasses, as outlined in the Trained panel section above. Tasters were instructed to select the glass with the higher flavor intensity for each attribute (directional difference test 92 ) and to select the glass they preferred.

The final concentration after spiking was equal to the within-style average, after normalizing by ethanol concentration. This was done to ensure balanced flavor profiles in the spiked beers. The same methods were applied to improve a non-alcoholic beer. The spiked compounds were: ethyl acetate (Merck KGaA, W241415), ethyl hexanoate (Merck KGaA, W243906), isoamyl acetate (Merck KGaA, W205508), phenethyl acetate (Merck KGaA, W285706), ethanol (96%, Colruyt), glycerol (Merck KGaA, W252506), and lactic acid (Merck KGaA, 261106).

Significant differences in preference or perceived intensity were determined by performing a two-sided binomial test on each attribute.
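
For example, if 13 of the 16 panelists preferred the spiked sample (a hypothetical outcome), the test reduces to a single call:

```python
from scipy.stats import binomtest

# Two-sided binomial test against chance preference (p = 0.5).
result = binomtest(k=13, n=16, p=0.5, alternative="two-sided")
print(result.pvalue)
```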

Reporting summary

Further information on research design is available in the  Nature Portfolio Reporting Summary linked to this article.

Data availability

The data that support the findings of this work are available in the Supplementary Data files and have been deposited to Zenodo under accession code 10653704 93 . The RateBeer scores data are under restricted access; they are not publicly available, as they are the property of RateBeer (ZX Ventures, USA). Access can be obtained from the authors upon reasonable request and with permission of RateBeer (ZX Ventures, USA). Source data are provided with this paper.

Code availability

The code for training the machine learning models, analyzing the models, and generating the figures has been deposited to Zenodo under accession code 10653704 93 .

Tieman, D. et al. A chemical genetic roadmap to improved tomato flavor. Science 355 , 391–394 (2017).

Plutowska, B. & Wardencki, W. Application of gas chromatography–olfactometry (GC–O) in analysis and quality assessment of alcoholic beverages – A review. Food Chem. 107 , 449–463 (2008).

Legin, A., Rudnitskaya, A., Seleznev, B. & Vlasov, Y. Electronic tongue for quality assessment of ethanol, vodka and eau-de-vie. Anal. Chim. Acta 534 , 129–135 (2005).

Loutfi, A., Coradeschi, S., Mani, G. K., Shankar, P. & Rayappan, J. B. B. Electronic noses for food quality: A review. J. Food Eng. 144 , 103–111 (2015).

Ahn, Y.-Y., Ahnert, S. E., Bagrow, J. P. & Barabási, A.-L. Flavor network and the principles of food pairing. Sci. Rep. 1 , 196 (2011).

Bartoshuk, L. M. & Klee, H. J. Better fruits and vegetables through sensory analysis. Curr. Biol. 23 , R374–R378 (2013).

Piggott, J. R. Design questions in sensory and consumer science. Food Qual. Prefer. 6 , 217–220 (1995).

Kermit, M. & Lengard, V. Assessing the performance of a sensory panel-panellist monitoring and tracking. J. Chemom. 19 , 154–161 (2005).

Cook, D. J., Hollowood, T. A., Linforth, R. S. T. & Taylor, A. J. Correlating instrumental measurements of texture and flavour release with human perception. Int. J. Food Sci. Technol. 40 , 631–641 (2005).

Chinchanachokchai, S., Thontirawong, P. & Chinchanachokchai, P. A tale of two recommender systems: The moderating role of consumer expertise on artificial intelligence based product recommendations. J. Retail. Consum. Serv. 61 , 1–12 (2021).

Ross, C. F. Sensory science at the human-machine interface. Trends Food Sci. Technol. 20 , 63–72 (2009).

Chambers, E. IV & Koppel, K. Associations of volatile compounds with sensory aroma and flavor: The complex nature of flavor. Molecules 18 , 4887–4905 (2013).

Pinu, F. R. Metabolomics—The new frontier in food safety and quality research. Food Res. Int. 72 , 80–81 (2015).

Danezis, G. P., Tsagkaris, A. S., Brusic, V. & Georgiou, C. A. Food authentication: state of the art and prospects. Curr. Opin. Food Sci. 10 , 22–31 (2016).

Shepherd, G. M. Smell images and the flavour system in the human brain. Nature 444 , 316–321 (2006).

Meilgaard, M. C. Prediction of flavor differences between beers from their chemical composition. J. Agric. Food Chem. 30 , 1009–1017 (1982).

Xu, L. et al. Widespread receptor-driven modulation in peripheral olfactory coding. Science 368 , eaaz5390 (2020).

Kupferschmidt, K. Following the flavor. Science 340 , 808–809 (2013).

Billesbølle, C. B. et al. Structural basis of odorant recognition by a human odorant receptor. Nature 615 , 742–749 (2023).

Smith, B. Perspective: Complexities of flavour. Nature 486 , S6–S6 (2012).

Pfister, P. et al. Odorant receptor inhibition is fundamental to odor encoding. Curr. Biol. 30 , 2574–2587 (2020).

Moskowitz, H. W., Kumaraiah, V., Sharma, K. N., Jacobs, H. L. & Sharma, S. D. Cross-cultural differences in simple taste preferences. Science 190 , 1217–1218 (1975).

Eriksson, N. et al. A genetic variant near olfactory receptor genes influences cilantro preference. Flavour 1 , 22 (2012).

Ferdenzi, C. et al. Variability of affective responses to odors: Culture, gender, and olfactory knowledge. Chem. Senses 38 , 175–186 (2013).

Lawless, H. T. & Heymann, H. Sensory evaluation of food: Principles and practices. (Springer, New York, NY). https://doi.org/10.1007/978-1-4419-6488-5 (2010).

Colantonio, V. et al. Metabolomic selection for enhanced fruit flavor. Proc. Natl. Acad. Sci. 119 , e2115865119 (2022).

Fritz, F., Preissner, R. & Banerjee, P. VirtualTaste: a web server for the prediction of organoleptic properties of chemical compounds. Nucleic Acids Res 49 , W679–W684 (2021).

Tuwani, R., Wadhwa, S. & Bagler, G. BitterSweet: Building machine learning models for predicting the bitter and sweet taste of small molecules. Sci. Rep. 9 , 1–13 (2019).

Dagan-Wiener, A. et al. Bitter or not? BitterPredict, a tool for predicting taste from chemical structure. Sci. Rep. 7 , 1–13 (2017).

Pallante, L. et al. Toward a general and interpretable umami taste predictor using a multi-objective machine learning approach. Sci. Rep. 12 , 1–11 (2022).

Malavolta, M. et al. A survey on computational taste predictors. Eur. Food Res. Technol. 248 , 2215–2235 (2022).

Lee, B. K. et al. A principal odor map unifies diverse tasks in olfactory perception. Science 381 , 999–1006 (2023).

Mayhew, E. J. et al. Transport features predict if a molecule is odorous. Proc. Natl. Acad. Sci. 119 , e2116576119 (2022).

Niu, Y. et al. Sensory evaluation of the synergism among ester odorants in light aroma-type liquor by odor threshold, aroma intensity and flash GC electronic nose. Food Res. Int. 113 , 102–114 (2018).

Yu, P., Low, M. Y. & Zhou, W. Design of experiments and regression modelling in food flavour and sensory analysis: A review. Trends Food Sci. Technol. 71 , 202–215 (2018).

Oladokun, O. et al. The impact of hop bitter acid and polyphenol profiles on the perceived bitterness of beer. Food Chem. 205 , 212–220 (2016).

Linforth, R., Cabannes, M., Hewson, L., Yang, N. & Taylor, A. Effect of fat content on flavor delivery during consumption: An in vivo model. J. Agric. Food Chem. 58 , 6905–6911 (2010).

Guo, S., Na Jom, K. & Ge, Y. Influence of roasting condition on flavor profile of sunflower seeds: A flavoromics approach. Sci. Rep. 9 , 11295 (2019).

Ren, Q. et al. The changes of microbial community and flavor compound in the fermentation process of Chinese rice wine using Fagopyrum tataricum grain as feedstock. Sci. Rep. 9 , 3365 (2019).

Hastie, T., Friedman, J. & Tibshirani, R. The Elements of Statistical Learning. (Springer, New York, NY). https://doi.org/10.1007/978-0-387-21606-5 (2001).

Dietz, C., Cook, D., Huismann, M., Wilson, C. & Ford, R. The multisensory perception of hop essential oil: a review. J. Inst. Brew. 126 , 320–342 (2020).

Roncoroni, M. & Verstrepen, K. J. Belgian Beer: Tested and Tasted. (Lannoo, 2018).

Meilgaard, M. C. Flavor chemistry of beer: Part II: Flavor and threshold of 239 aroma volatiles. Master Brew. Assoc. Am. Tech. Q 12 (1975).

Bokulich, N. A. & Bamforth, C. W. The microbiology of malting and brewing. Microbiol. Mol. Biol. Rev. MMBR 77 , 157–172 (2013).

Dzialo, M. C., Park, R., Steensels, J., Lievens, B. & Verstrepen, K. J. Physiology, ecology and industrial applications of aroma formation in yeast. FEMS Microbiol. Rev. 41 , S95–S128 (2017).

Datta, A. et al. Computer-aided food engineering. Nat. Food 3 , 894–904 (2022).

American Society of Brewing Chemists. Beer Methods. (American Society of Brewing Chemists, St. Paul, MN, U.S.A.).

Olaniran, A. O., Hiralal, L., Mokoena, M. P. & Pillay, B. Flavour-active volatile compounds in beer: production, regulation and control. J. Inst. Brew. 123 , 13–23 (2017).

Verstrepen, K. J. et al. Flavor-active esters: Adding fruitiness to beer. J. Biosci. Bioeng. 96 , 110–118 (2003).

Meilgaard, M. C. Flavour chemistry of beer. part I: flavour interaction between principal volatiles. Master Brew. Assoc. Am. Tech. Q 12 , 107–117 (1975).

Briggs, D. E., Boulton, C. A., Brookes, P. A. & Stevens, R. Brewing 227–254. (Woodhead Publishing). https://doi.org/10.1533/9781855739062.227 (2004).

Bossaert, S., Crauwels, S., De Rouck, G. & Lievens, B. The power of sour - A review: Old traditions, new opportunities. BrewingScience 72 , 78–88 (2019).

Verstrepen, K. J. et al. Flavor active esters: Adding fruitiness to beer. J. Biosci. Bioeng. 96 , 110–118 (2003).

Snauwaert, I. et al. Microbial diversity and metabolite composition of Belgian red-brown acidic ales. Int. J. Food Microbiol. 221 , 1–11 (2016).

Spitaels, F. et al. The microbial diversity of traditional spontaneously fermented lambic beer. PLoS ONE 9 , e95384 (2014).

Blanco, C. A., Andrés-Iglesias, C. & Montero, O. Low-alcohol Beers: Flavor Compounds, Defects, and Improvement Strategies. Crit. Rev. Food Sci. Nutr. 56 , 1379–1388 (2016).

Jackowski, M. & Trusek, A. Non-alcoholic beer production – an overview. Pol. J. Chem. Technol. 20 , 32–38 (2018).

Takoi, K. et al. The contribution of geraniol metabolism to the citrus flavour of beer: Synergy of geraniol and β-citronellol under coexistence with excess linalool. J. Inst. Brew. 116 , 251–260 (2010).

Kroeze, J. H. & Bartoshuk, L. M. Bitterness suppression as revealed by split-tongue taste stimulation in humans. Physiol. Behav. 35 , 779–783 (1985).

Mennella, J. A. et al. “A spoonful of sugar helps the medicine go down”: Bitter masking by sucrose among children and adults. Chem. Senses 40 , 17–25 (2015).

Wietstock, P., Kunz, T., Perreira, F. & Methner, F.-J. Metal chelation behavior of hop acids in buffered model systems. BrewingScience 69 , 56–63 (2016).

Sancho, D., Blanco, C. A., Caballero, I. & Pascual, A. Free iron in pale, dark and alcohol-free commercial lager beers. J. Sci. Food Agric. 91 , 1142–1147 (2011).

Rodrigues, H. & Parr, W. V. Contribution of cross-cultural studies to understanding wine appreciation: A review. Food Res. Int. 115 , 251–258 (2019).

Korneva, E. & Blockeel, H. Towards better evaluation of multi-target regression models. in ECML PKDD 2020 Workshops (eds. Koprinska, I. et al.) 353–362 (Springer International Publishing, Cham, 2020). https://doi.org/10.1007/978-3-030-65965-3_23 .

Ares, G. Mathematical and Statistical Methods in Food Science and Technology. (Wiley, 2013).

Grinsztajn, L., Oyallon, E. & Varoquaux, G. Why do tree-based models still outperform deep learning on tabular data? Preprint at http://arxiv.org/abs/2207.08815 (2022).

Gries, S. T. Statistics for Linguistics with R: A Practical Introduction. in Statistics for Linguistics with R (De Gruyter Mouton, 2021). https://doi.org/10.1515/9783110718256 .

Lundberg, S. M. et al. From local explanations to global understanding with explainable AI for trees. Nat. Mach. Intell. 2 , 56–67 (2020).

Ickes, C. M. & Cadwallader, K. R. Effects of ethanol on flavor perception in alcoholic beverages. Chemosens. Percept. 10 , 119–134 (2017).

Kato, M. et al. Influence of high molecular weight polypeptides on the mouthfeel of commercial beer. J. Inst. Brew. 127 , 27–40 (2021).

Wauters, R. et al. Novel Saccharomyces cerevisiae variants slow down the accumulation of staling aldehydes and improve beer shelf-life. Food Chem. 398 , 1–11 (2023).

Li, H., Jia, S. & Zhang, W. Rapid determination of low-level sulfur compounds in beer by headspace gas chromatography with a pulsed flame photometric detector. J. Am. Soc. Brew. Chem. 66 , 188–191 (2008).

Dercksen, A., Laurens, J., Torline, P., Axcell, B. C. & Rohwer, E. Quantitative analysis of volatile sulfur compounds in beer using a membrane extraction interface. J. Am. Soc. Brew. Chem. 54 , 228–233 (1996).

Molnar, C. Interpretable Machine Learning: A Guide for Making Black-Box Models Interpretable. (2020).

Zhao, Q. & Hastie, T. Causal interpretations of black-box models. J. Bus. Econ. Stat. Publ. Am. Stat. Assoc. 39 , 272–281 (2019).

Hastie, T., Tibshirani, R. & Friedman, J. The Elements of Statistical Learning. (Springer, 2019).

Labrado, D. et al. Identification by NMR of key compounds present in beer distillates and residual phases after dealcoholization by vacuum distillation. J. Sci. Food Agric. 100 , 3971–3978 (2020).

Lusk, L. T., Kay, S. B., Porubcan, A. & Ryder, D. S. Key olfactory cues for beer oxidation. J. Am. Soc. Brew. Chem. 70 , 257–261 (2012).

Gonzalez Viejo, C., Torrico, D. D., Dunshea, F. R. & Fuentes, S. Development of artificial neural network models to assess beer acceptability based on sensory properties using a robotic pourer: A comparative model approach to achieve an artificial intelligence system. Beverages 5 , 33 (2019).

Gonzalez Viejo, C., Fuentes, S., Torrico, D. D., Godbole, A. & Dunshea, F. R. Chemical characterization of aromas in beer and their effect on consumers liking. Food Chem. 293 , 479–485 (2019).

Gilbert, J. L. et al. Identifying breeding priorities for blueberry flavor using biochemical, sensory, and genotype by environment analyses. PLOS ONE 10 , 1–21 (2015).

Goulet, C. et al. Role of an esterase in flavor volatile variation within the tomato clade. Proc. Natl. Acad. Sci. 109 , 19009–19014 (2012).

Borisov, V. et al. Deep Neural Networks and Tabular Data: A Survey. IEEE Trans. Neural Netw. Learn. Syst. 1–21 https://doi.org/10.1109/TNNLS.2022.3229161 (2022).

Statista. Statista Consumer Market Outlook: Beer - Worldwide.

Seitz, H. K. & Stickel, F. Molecular mechanisms of alcohol-mediated carcinogenesis. Nat. Rev. Cancer 7 , 599–612 (2007).

Voordeckers, K. et al. Ethanol exposure increases mutation rate through error-prone polymerases. Nat. Commun. 11 , 3664 (2020).

Goelen, T. et al. Bacterial phylogeny predicts volatile organic compound composition and olfactory response of an aphid parasitoid. Oikos 129 , 1415–1428 (2020).

Reher, T. et al. Evaluation of hop (Humulus lupulus) as a repellent for the management of Drosophila suzukii. Crop Prot. 124 , 104839 (2019).

Stein, S. E. An integrated method for spectrum extraction and compound identification from gas chromatography/mass spectrometry data. J. Am. Soc. Mass Spectrom. 10 , 770–781 (1999).

American Society of Brewing Chemists. Sensory Analysis Methods. (American Society of Brewing Chemists, St. Paul, MN, U.S.A., 1992).

McAuley, J., Leskovec, J. & Jurafsky, D. Learning Attitudes and Attributes from Multi-Aspect Reviews. Preprint at https://doi.org/10.48550/arXiv.1210.3926 (2012).

Meilgaard, M. C., Civille, G. V. & Carr, B. T. Sensory Evaluation Techniques. (CRC Press, Boca Raton). https://doi.org/10.1201/b16452 (2014).

Schreurs, M. et al. Data from: Predicting and improving complex beer flavor through machine learning. Zenodo https://doi.org/10.5281/zenodo.10653704 (2024).

Acknowledgements

We thank all lab members for their discussions and thank all tasting panel members for their contributions. Special thanks go out to Dr. Karin Voordeckers for her tremendous help in proofreading and improving the manuscript. M.S. was supported by a Baillet-Latour fellowship, L.C. acknowledges financial support from KU Leuven (C16/17/006), F.A.T. was supported by a PhD fellowship from FWO (1S08821N). Research in the lab of K.J.V. is supported by KU Leuven, FWO, VIB, VLAIO and the Brewing Science Serves Health Fund. Research in the lab of T.W. is supported by FWO (G.0A51.15) and KU Leuven (C16/17/006).

Author information

These authors contributed equally: Michiel Schreurs, Supinya Piampongsant, Miguel Roncoroni.

Authors and Affiliations

VIB—KU Leuven Center for Microbiology, Gaston Geenslaan 1, B-3001, Leuven, Belgium

Michiel Schreurs, Supinya Piampongsant, Miguel Roncoroni, Lloyd Cool, Beatriz Herrera-Malaver, Florian A. Theßeling & Kevin J. Verstrepen

CMPG Laboratory of Genetics and Genomics, KU Leuven, Gaston Geenslaan 1, B-3001, Leuven, Belgium

Leuven Institute for Beer Research (LIBR), Gaston Geenslaan 1, B-3001, Leuven, Belgium

Laboratory of Socioecology and Social Evolution, KU Leuven, Naamsestraat 59, B-3000, Leuven, Belgium

Lloyd Cool, Christophe Vanderaa & Tom Wenseleers

VIB Bioinformatics Core, VIB, Rijvisschestraat 120, B-9052, Ghent, Belgium

Łukasz Kreft & Alexander Botzki

AB InBev SA/NV, Brouwerijplein 1, B-3000, Leuven, Belgium

Philippe Malcorps & Luk Daenen

Contributions

S.P., M.S. and K.J.V. conceived the experiments. S.P., M.S. and K.J.V. designed the experiments. S.P., M.S., M.R., B.H. and F.A.T. performed the experiments. S.P., M.S., L.C., C.V., L.K., A.B., P.M., L.D., T.W. and K.J.V. contributed analysis ideas. S.P., M.S., L.C., C.V., T.W. and K.J.V. analyzed the data. All authors contributed to writing the manuscript.

Corresponding author

Correspondence to Kevin J. Verstrepen.

Ethics declarations

Competing interests.

K.J.V. is affiliated with bar.on. The other authors declare no competing interests.

Peer review

Peer review information.

Nature Communications thanks Florian Bauer, Andrew John Macintosh and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. A peer review file is available.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

The online version contains supplementary material: a peer review file, a description of additional supplementary files, Supplementary Data 1–7, a reporting summary, and source data.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .

About this article

Cite this article.

Schreurs, M., Piampongsant, S., Roncoroni, M. et al. Predicting and improving complex beer flavor through machine learning. Nat Commun 15 , 2368 (2024). https://doi.org/10.1038/s41467-024-46346-0

Received : 30 October 2023

Accepted : 21 February 2024

Published : 26 March 2024

DOI : https://doi.org/10.1038/s41467-024-46346-0


Peer Review in Scientific Publications: Benefits, Critiques, & A Survival Guide

Jacalyn Kelly

1 Clinical Biochemistry, Department of Pediatric Laboratory Medicine, The Hospital for Sick Children, University of Toronto, Toronto, Ontario, Canada

Tara Sadeghieh

Khosrow Adeli

2 Department of Laboratory Medicine and Pathobiology, University of Toronto, Toronto, Canada

3 Chair, Communications and Publications Division (CPD), International Federation of Clinical Chemistry and Laboratory Medicine (IFCC), Milan, Italy

The authors declare no conflicts of interest regarding publication of this article.

Peer review has been defined as a process of subjecting an author’s scholarly work, research or ideas to the scrutiny of others who are experts in the same field. It functions to encourage authors to meet the accepted high standards of their discipline and to control the dissemination of research data to ensure that unwarranted claims, unacceptable interpretations or personal views are not published without prior expert review. Despite its widespread use by most journals, the peer review process has also been widely criticized due to the slowness of the process to publish new findings and due to perceived bias by the editors and/or reviewers. Within the scientific community, peer review has become an essential component of the academic writing process. It helps ensure that papers published in scientific journals answer meaningful research questions and draw accurate conclusions based on professionally executed experimentation. Submission of low-quality manuscripts has become increasingly prevalent, and peer review acts as a filter to prevent this work from reaching the scientific community. The major advantage of a peer review process is that peer-reviewed articles provide a trusted form of scientific communication. Since scientific knowledge is cumulative and builds on itself, this trust is particularly important. Despite the positive impacts of peer review, critics argue that the peer review process stifles innovation in experimentation and acts as a poor screen against plagiarism. Despite its downfalls, no foolproof system has yet been developed to take the place of peer review; however, researchers have been looking into electronic means of improving the peer review process. Unfortunately, the recent explosion in online-only/electronic journals has led to mass publication of a large number of scientific articles with little or no peer review. This poses a significant risk to advances in scientific knowledge and its future potential. The current article summarizes the peer review process, highlights the pros and cons associated with different types of peer review, and describes new methods for improving peer review.

WHAT IS PEER REVIEW AND WHAT IS ITS PURPOSE?

Peer Review is defined as “a process of subjecting an author’s scholarly work, research or ideas to the scrutiny of others who are experts in the same field” ( 1 ). Peer review is intended to serve two primary purposes. Firstly, it acts as a filter to ensure that only high quality research is published, especially in reputable journals, by determining the validity, significance and originality of the study. Secondly, peer review is intended to improve the quality of manuscripts that are deemed suitable for publication. Peer reviewers provide suggestions to authors on how to improve the quality of their manuscripts, and also identify any errors that need correcting before publication.

HISTORY OF PEER REVIEW

The concept of peer review was developed long before the scholarly journal. In fact, the peer review process is thought to have been used as a method of evaluating written work since ancient Greece ( 2 ). The peer review process was first described by a physician named Ishaq bin Ali al-Rahwi of Syria, who lived from 854-931 CE, in his book Ethics of the Physician ( 2 ). There, he stated that physicians must take notes describing the state of their patients’ medical conditions upon each visit. Following treatment, the notes were scrutinized by a local medical council to determine whether the physician had met the required standards of medical care. If the medical council deemed that the appropriate standards were not met, the physician in question could receive a lawsuit from the maltreated patient ( 2 ).

The invention of the printing press in 1453 allowed written documents to be distributed to the general public ( 3 ). At this time, it became more important to regulate the quality of the written material that became publicly available, and editing by peers increased in prevalence. In 1620, Francis Bacon wrote the work Novum Organum, where he described what eventually became known as the first universal method for generating and assessing new science ( 3 ). His work was instrumental in shaping the Scientific Method ( 3 ). In 1665, the French Journal des sçavans and the English Philosophical Transactions of the Royal Society were the first scientific journals to systematically publish research results ( 4 ). Philosophical Transactions of the Royal Society is thought to be the first journal to formalize the peer review process in 1665 ( 5 ), however, it is important to note that peer review was initially introduced to help editors decide which manuscripts to publish in their journals, and at that time it did not serve to ensure the validity of the research ( 6 ). It did not take long for the peer review process to evolve, and shortly thereafter papers were distributed to reviewers with the intent of authenticating the integrity of the research study before publication. The Royal Society of Edinburgh adhered to the following peer review process, published in their Medical Essays and Observations in 1731: “Memoirs sent by correspondence are distributed according to the subject matter to those members who are most versed in these matters. The report of their identity is not known to the author.” ( 7 ). The Royal Society of London adopted this review procedure in 1752 and developed the “Committee on Papers” to review manuscripts before they were published in Philosophical Transactions ( 6 ).

Peer review in the systematized and institutionalized form has developed immensely since the Second World War, at least partly due to the large increase in scientific research during this period ( 7 ). It is now used not only to ensure that a scientific manuscript is experimentally and ethically sound, but also to determine which papers sufficiently meet the journal’s standards of quality and originality before publication. Peer review is now standard practice by most credible scientific journals, and is an essential part of determining the credibility and quality of work submitted.

IMPACT OF THE PEER REVIEW PROCESS

Peer review has become the foundation of the scholarly publication system because it effectively subjects an author’s work to the scrutiny of other experts in the field. Thus, it encourages authors to strive to produce high quality research that will advance the field. Peer review also supports and maintains integrity and authenticity in the advancement of science. A scientific hypothesis or statement is generally not accepted by the academic community unless it has been published in a peer-reviewed journal ( 8 ). The Institute for Scientific Information ( ISI ) only considers journals that are peer-reviewed as candidates to receive Impact Factors. Peer review is a well-established process which has been a formal part of scientific communication for over 300 years.

OVERVIEW OF THE PEER REVIEW PROCESS

The peer review process begins when a scientist completes a research study and writes a manuscript that describes the purpose, experimental design, results, and conclusions of the study. The scientist then submits this paper to a suitable journal that specializes in a relevant research field, a step referred to as pre-submission. The editors of the journal will review the paper to ensure that the subject matter is in line with that of the journal, and that it fits with the editorial platform. Papers that fail to meet these basic requirements are rejected at this stage. If the journal editors feel the paper sufficiently meets these requirements and is written by a credible source, they will send the paper to accomplished researchers in the field for a formal peer review. Peer reviewers are also known as referees (this process is summarized in Figure 1). The role of the editor is to select the most appropriate manuscripts for the journal, and to implement and monitor the peer review process. Editors must ensure that peer reviews are conducted fairly, and in an effective and timely manner. They must also ensure that there are no conflicts of interest involved in the peer review process.

Figure 1. Overview of the review process.

When a reviewer is provided with a paper, he or she reads it carefully and scrutinizes it to evaluate the validity of the science, the quality of the experimental design, and the appropriateness of the methods used. The reviewer also assesses the significance of the research, and judges whether the work will contribute to advancement in the field by evaluating the importance of the findings, and determining the originality of the research. Additionally, reviewers identify any scientific errors and references that are missing or incorrect. Peer reviewers give recommendations to the editor regarding whether the paper should be accepted, rejected, or improved before publication in the journal. The editor will mediate author-referee discussion in order to clarify the priority of certain referee requests, suggest areas that can be strengthened, and overrule reviewer recommendations that are beyond the study’s scope ( 9 ). If the paper is accepted, as per suggestion by the peer reviewer, the paper goes into the production stage, where it is tweaked and formatted by the editors, and finally published in the scientific journal. An overview of the review process is presented in Figure 1 .

WHO CONDUCTS REVIEWS?

Peer reviews are conducted by scientific experts with specialized knowledge on the content of the manuscript, as well as by scientists with a more general knowledge base. Peer reviewers can be anyone who has competence and expertise in the subject areas that the journal covers. Reviewers can range from young and up-and-coming researchers to old masters in the field. Often, the young reviewers are the most responsive and deliver the best quality reviews, though this is not always the case. On average, a reviewer will conduct approximately eight reviews per year, according to a study on peer review by the Publishing Research Consortium (PRC) ( 7 ). Journals will often have a pool of reviewers with diverse backgrounds to allow for many different perspectives. They will also keep a rather large reviewer bank, so that reviewers do not get burnt out, overwhelmed or time constrained from reviewing multiple articles simultaneously.

WHY DO REVIEWERS REVIEW?

Referees are typically not paid to conduct peer reviews and the process takes considerable effort, so the question is raised as to what incentive referees have to review at all. Some feel an academic duty to perform reviews, and are of the mentality that if their peers are expected to review their papers, then they should review the work of their peers as well. Reviewers may also have personal contacts with editors, and may want to assist as much as possible. Others review to keep up-to-date with the latest developments in their field, and reading new scientific papers is an effective way to do so. Some scientists use peer review as an opportunity to advance their own research as it stimulates new ideas and allows them to read about new experimental techniques. Other reviewers are keen on building associations with prestigious journals and editors and becoming part of their community, as sometimes reviewers who show dedication to the journal are later hired as editors. Some scientists see peer review as a chance to become aware of the latest research before their peers, and thus be first to develop new insights from the material. Finally, in terms of career development, peer reviewing can be desirable as it is often noted on one’s resume or CV. Many institutions consider a researcher’s involvement in peer review when assessing their performance for promotions ( 11 ). Peer reviewing can also be an effective way for a scientist to show their superiors that they are committed to their scientific field ( 5 ).

ARE REVIEWERS KEEN TO REVIEW?

A 2009 international survey of 4000 peer reviewers conducted by the charity Sense About Science at the British Science Festival at the University of Surrey, found that 90% of reviewers were keen to peer review ( 12 ). One third of respondents to the survey said they were happy to review up to five papers per year, and an additional one third of respondents were happy to review up to ten.

HOW LONG DOES IT TAKE TO REVIEW ONE PAPER?

On average, it takes approximately six hours to review one paper ( 12 ), however, this number may vary greatly depending on the content of the paper and the nature of the peer reviewer. One in every 100 participants in the “Sense About Science” survey claims to have taken more than 100 hours to review their last paper ( 12 ).

HOW TO DETERMINE IF A JOURNAL IS PEER REVIEWED

Ulrichsweb is a directory that provides information on over 300,000 periodicals, including information regarding which journals are peer reviewed ( 13 ). After logging into the system using an institutional login (e.g., from the University of Toronto), search terms, journal titles or ISSN numbers can be entered into the search bar. The database provides the title, publisher, and country of origin of the journal, and indicates whether the journal is still actively publishing. The black book symbol (labelled ‘refereed’) reveals that the journal is peer reviewed.

THE EVALUATION CRITERIA FOR PEER REVIEW OF SCIENTIFIC PAPERS

As previously mentioned, when a reviewer receives a scientific manuscript, he/she will first determine if the subject matter is well suited for the content of the journal. The reviewer will then consider whether the research question is important and original, a process which may be aided by a literature scan of review articles.

Scientific papers submitted for peer review usually follow a specific structure that begins with the title, followed by the abstract, introduction, methodology, results, discussion, conclusions, and references. The title must be descriptive and include the concept and organism investigated, and potentially the variable manipulated and the systems used in the study. The peer reviewer evaluates if the title is descriptive enough, and ensures that it is clear and concise. A study published in Nucleic Acids Research (NAR) by the Oxford University Press in 2006 indicated that the title of a manuscript plays a significant role in determining reader interest, as 72% of respondents said they could usually judge whether an article will be of interest to them based on the title and the author, while 13% of respondents claimed to always be able to do so ( 14 ).

The abstract is a summary of the paper, which briefly mentions the background or purpose, methods, key results, and major conclusions of the study. The peer reviewer assesses whether the abstract is sufficiently informative and if the content of the abstract is consistent with the rest of the paper. The NAR study indicated that 40% of respondents could determine whether an article would be of interest to them based on the abstract alone 60-80% of the time, while 32% could judge an article based on the abstract 80-100% of the time ( 14 ). This demonstrates that the abstract alone is often used to assess the value of an article.

The introduction of a scientific paper presents the research question in the context of what is already known about the topic, in order to identify why the question being studied is of interest to the scientific community, and what gap in knowledge the study aims to fill ( 15 ). The introduction identifies the study’s purpose and scope, briefly describes the general methods of investigation, and outlines the hypothesis and predictions ( 15 ). The peer reviewer determines whether the introduction provides sufficient background information on the research topic, and ensures that the research question and hypothesis are clearly identifiable.

The methods section describes the experimental procedures, and explains why each experiment was conducted. The methods section also includes the equipment and reagents used in the investigation. The methods section should be detailed enough that it can be used to repeat the experiment ( 15 ). Methods are written in the past tense and in the active voice. The peer reviewer assesses whether the appropriate methods were used to answer the research question, and if they were written with sufficient detail. If information is missing from the methods section, it is the peer reviewer’s job to identify what details need to be added.

The results section is where the outcomes of the experiment and trends in the data are explained without judgement, bias or interpretation ( 15 ). This section can include statistical tests performed on the data, as well as figures and tables in addition to the text. The peer reviewer ensures that the results are described with sufficient detail, and determines their credibility. Reviewers also confirm that the text is consistent with the information presented in tables and figures, and that all figures and tables included are important and relevant ( 15 ). The peer reviewer will also make sure that table and figure captions are appropriate both contextually and in length, and that tables and figures present the data accurately.

The discussion section is where the data is analyzed. Here, the results are interpreted and related to past studies ( 15 ). The discussion describes the meaning and significance of the results in terms of the research question and hypothesis, and states whether the hypothesis was supported or rejected. This section may also provide possible explanations for unusual results and suggestions for future research ( 15 ). The discussion should end with a conclusions section that summarizes the major findings of the investigation. The peer reviewer determines whether the discussion is clear and focused, and whether the conclusions are an appropriate interpretation of the results. Reviewers also ensure that the discussion addresses the limitations of the study, any anomalies in the results, the relationship of the study to previous research, and the theoretical implications and practical applications of the study.

The references are found at the end of the paper, and list all of the information sources cited in the text to describe the background, methods, and/or interpret results. Depending on the citation method used, the references are listed in alphabetical order according to author last name, or numbered according to the order in which they appear in the paper. The peer reviewer ensures that references are used appropriately, cited accurately, formatted correctly, and that none are missing.

Finally, the peer reviewer determines whether the paper is clearly written and if the content seems logical. After thoroughly reading through the entire manuscript, they determine whether it meets the journal’s standards for publication, and whether it falls within the top 25% of papers in its field ( 16 ) to determine priority for publication. An overview of what a peer reviewer looks for when evaluating a manuscript, in order of importance, is presented in Figure 2.

Figure 2. How a peer reviewer evaluates a manuscript.

To increase the chance of success in the peer review process, the author must ensure that the paper fully complies with the journal guidelines before submission. The author must also be open to criticism and suggested revisions, and learn from mistakes made in previous submissions.

ADVANTAGES AND DISADVANTAGES OF THE DIFFERENT TYPES OF PEER REVIEW

The peer review process is generally conducted in one of three ways: open review, single-blind review, or double-blind review. In an open review, both the author of the paper and the peer reviewer know one another’s identity. Alternatively, in single-blind review, the reviewer’s identity is kept private, but the author’s identity is revealed to the reviewer. In double-blind review, the identities of both the reviewer and author are kept anonymous. Open peer review is advantageous in that it prevents the reviewer from leaving malicious comments, being careless, or procrastinating completion of the review ( 2 ). It encourages reviewers to be open and honest without being disrespectful. Open reviewing also discourages plagiarism amongst authors ( 2 ). On the other hand, open peer review can also prevent reviewers from being honest for fear of developing bad rapport with the author. The reviewer may withhold or tone down their criticisms in order to be polite ( 2 ). This is especially true when younger reviewers are given a more esteemed author’s work, in which case the reviewer may be hesitant to provide criticism for fear that it will damper their relationship with a superior ( 2 ). According to the Sense About Science survey, editors find that completely open reviewing decreases the number of people willing to participate, and leads to reviews of little value ( 12 ). In the aforementioned study by the PRC, only 23% of authors surveyed had experience with open peer review ( 7 ).

Single-blind peer review is by far the most common. In the PRC study, 85% of authors surveyed had experience with single-blind peer review ( 7 ). This method is advantageous as the reviewer is more likely to provide honest feedback when their identity is concealed ( 2 ). This allows the reviewer to make independent decisions without the influence of the author ( 2 ). The main disadvantage of reviewer anonymity, however, is that reviewers who receive manuscripts on subjects similar to their own research may be tempted to delay completing the review in order to publish their own data first ( 2 ).

Double-blind peer review is advantageous as it prevents the reviewer from being biased against the author based on their country of origin or previous work ( 2 ). This allows the paper to be judged based on the quality of the content, rather than the reputation of the author. The Sense About Science survey indicates that 76% of researchers think double-blind peer review is a good idea ( 12 ), and the PRC survey indicates that 45% of authors have had experience with double-blind peer review ( 7 ). The disadvantage of double-blind peer review is that, especially in niche areas of research, it can sometimes be easy for the reviewer to determine the identity of the author based on writing style, subject matter or self-citation, and thus, impart bias ( 2 ).

Masking the author’s identity from peer reviewers, as is the case in double-blind review, is generally thought to minimize bias and maintain review quality. A study by Justice et al. in 1998 investigated whether masking author identity affected the quality of the review ( 17 ). One hundred and eighteen manuscripts were randomized; 26 were peer reviewed as normal, and 92 were moved into the ‘intervention’ arm, where editor quality assessments were completed for 77 manuscripts and author quality assessments were completed for 40 manuscripts ( 17 ). There was no perceived difference in quality between the masked and unmasked reviews. Additionally, the masking itself was often unsuccessful, especially with well-known authors ( 17 ). However, a previous study conducted by McNutt et al. had different results ( 18 ). In this case, blinding was successful 73% of the time, and they found that when author identity was masked, the quality of review was slightly higher ( 18 ). Although Justice et al. argued that this difference was too small to be consequential, their study targeted only biomedical journals, and the results cannot be generalized to journals of a different subject matter ( 17 ). Additionally, there were problems masking the identities of well-known authors, introducing a flaw in the methods. Regardless, Justice et al. concluded that masking author identity from reviewers may not improve review quality ( 17 ).

In addition to open, single-blind and double-blind peer review, there are two experimental forms of peer review. In some cases, following publication, papers may be subjected to post-publication peer review. As many papers are now published online, the scientific community has the opportunity to comment on these papers, engage in online discussions and post formal reviews. For example, the online publishers PLOS and BioMed Central allow registered users of their sites to post comments on published papers ( 10 ). Philica is another journal launched with this experimental form of peer review. Only 8% of authors surveyed in the PRC study had experience with post-publication review ( 7 ). Another experimental form of peer review, called dynamic peer review, has also emerged. Dynamic peer review is conducted on websites such as Naboj, which allow scientists to review articles posted to preprint repositories ( 19 ). The review is conducted on the repository itself and is a continuous process, which allows the public to see both the article and the reviews as the article is being developed ( 19 ). Dynamic peer review helps prevent plagiarism, as the scientific community will already be familiar with the work before the peer-reviewed version appears in print ( 19 ). Dynamic review also reduces the time lag between manuscript submission and publication. An example of a preprint server is arXiv, developed by Paul Ginsparg in 1991 and used primarily by physicists ( 19 ). These alternative forms of peer review are still experimental and not yet established; traditional peer review is time-tested and still widely used. All methods of peer review have their advantages and deficiencies, and all are prone to error.

PEER REVIEW OF OPEN ACCESS JOURNALS

Open access (OA) journals are becoming increasingly popular as they allow the potential for widespread distribution of publications in a timely manner ( 20 ). Nevertheless, there can be issues regarding the peer review process of open access journals. In a study published in Science in 2013, John Bohannon submitted 304 slightly different versions of a fictional scientific paper (written by a fake author, working out of a non-existent institution) to a selected group of OA journals. This study was performed to determine whether papers submitted to OA journals are properly reviewed before publication, in comparison to subscription-based journals. The journals in this study were selected from the Directory of Open Access Journals (DOAJ) and Beall’s List, a list of potentially predatory publishers, and all required a fee for publishing ( 21 ). Of the 304 journals, 157 accepted the fake paper, suggesting that acceptance was based on financial interest rather than on the quality of the article itself, while 98 journals promptly rejected the fakes ( 21 ). Although this study highlights useful information on the problems associated with lower-quality publishers that do not have an effective peer review system in place, the article also generalizes the results to all OA journals, which can be detrimental to the general perception of OA journals. Two limitations of the study made it impossible to accurately determine the relationship between peer review and OA journals: 1) there was no control group of subscription-based journals, and 2) the fake papers were sent to a non-randomized selection of journals, introducing bias.

JOURNAL ACCEPTANCE RATES

Based on a recent survey, the average acceptance rate for papers submitted to scientific journals is about 50% ( 7 ). Of all submitted manuscripts, 20% are rejected prior to review and 30% are rejected following review ( 7 ). Of the 50% that are accepted, most (41% of all submissions) are accepted on the condition of revision, while only 9% are accepted without a request for revision ( 7 ).

SATISFACTION WITH THE PEER REVIEW SYSTEM

Based on a recent survey by the PRC, 64% of academics are satisfied with the current system of peer review, and only 12% claimed to be ‘dissatisfied’ ( 7 ). The large majority, 85%, agreed with the statement that ‘scientific communication is greatly helped by peer review’ ( 7 ). There was a similarly high level of support (83%) for the idea that peer review ‘provides control in scientific communication’ ( 7 ).

HOW TO PEER REVIEW EFFECTIVELY

The following are ten tips on how to be an effective peer reviewer as indicated by Brian Lucey, an expert on the subject ( 22 ):

1) Be professional

Peer review is a mutual responsibility among fellow scientists, and scientists are expected, as part of the academic community, to take part in it. Those who expect others to review their work should commit to reviewing the work of others in return, and put effort into it.

2) Be pleasant

If the paper is of low quality, suggest that it be rejected, but do not leave ad hominem comments. There is no benefit to being ruthless.

3) Read the invite

When a journal emails a scientist to ask them to conduct a peer review, it will usually provide a link to either accept or decline the invitation. Do not respond to the email; respond through the link.

4) Be helpful

Suggest how the authors can overcome the shortcomings in their paper. A review should guide the author on what is good and what needs work from the reviewer’s perspective.

5) Be scientific

The peer reviewer plays the role of a scientific peer, not an editor for proofreading or decision-making. Don’t fill a review with comments on editorial and typographic issues. Instead, focus on adding value with scientific knowledge and on commenting on the credibility of the research conducted and conclusions drawn. If the paper has many typographical errors, suggest that it be professionally proofread as part of the review.

6) Be timely

Stick to the timeline given when conducting a peer review. Editors track who is reviewing what and when and will know if someone is late on completing a review. It is important to be timely both out of respect for the journal and the author, as well as to not develop a reputation of being late for review deadlines.

7) Be realistic

The peer reviewer must be realistic about the work presented, the changes they suggest and their own role. Reviewers who set the bar too high, or who propose changes that are too ambitious, force editors to override their recommendations.

8) Be empathetic

Ensure that the review is scientific, helpful and courteous. Be sensitive and respectful with word choice and tone in a review.

9) Be open

Remember that both specialists and generalists can provide valuable insight when peer reviewing. Editors will try to get both specialised and general reviewers for any particular paper to allow for different perspectives. If someone is asked to review, the editor has determined they have a valid and useful role to play, even if the paper is not in their area of expertise.

10) Be organised

A review requires structure and logical flow. Reviewers should proofread their review for structural, grammatical and spelling errors, as well as for clarity, before submitting it. Most publishers provide short guides on structuring a peer review on their website. Begin with an overview of the proposed improvements; then provide feedback on the paper’s structure, the quality of data sources and methods of investigation used, the logical flow of argument, and the validity of conclusions drawn. Finally, comment on style, voice and lexical concerns, with suggestions on how to improve.

In addition, the American Physiological Society (APS) recommends in its Peer Review 101 Handout that peer reviewers put themselves in both the editor’s and the author’s shoes, to ensure that they provide what both need and expect ( 11 ). To please the editor, the reviewer should ensure that the peer review is completed on time and that it provides clear explanations to back up recommendations. To be helpful to the author, the reviewer must ensure that the feedback is constructive. It is suggested that the reviewer take time to think about the paper: read it once, wait at least a day, and then re-read it before writing the review ( 11 ). The APS also suggests that graduate students and researchers pay attention to how peer reviewers edit their work, and to which edits they find helpful, in order to learn how to peer review effectively ( 11 ). Additionally, it is suggested that graduate students practice reviewing by editing their peers’ papers and asking a faculty member for feedback on their efforts. Young scientists are advised to offer to peer review as often as possible in order to become skilled at the process ( 11 ). The majority of students, fellows and trainees do not get formal training in peer review, but rather learn by observing their mentors. According to the APS, one acquires experience through networking and referrals, and should therefore try to strengthen relationships with journal editors by offering to review manuscripts ( 11 ). The APS also suggests that experienced reviewers provide constructive feedback to students and junior colleagues on their peer review efforts, and encourage them to peer review in order to demonstrate the importance of this process in improving science ( 11 ).

The peer reviewer should only comment on areas of the manuscript that they are knowledgeable about ( 23 ). If there is any section of the manuscript they feel unqualified to review, they should mention this in their comments and not provide further feedback on that section. The peer reviewer is not permitted to share any part of the manuscript with a colleague (even one more knowledgeable in the subject matter) without first obtaining permission from the editor ( 23 ). If peer reviewers come across something in the paper they are unsure of, they can consult the literature to try to gain insight. It is important for scientists to remember that if a paper can be improved by the expertise of one of their colleagues, the journal must be informed of the colleague’s help, and approval must be obtained for the colleague to read the protected document. Additionally, the colleague must be identified in the confidential comments to the editor, to ensure that he/she is appropriately credited for any contributions ( 23 ). It is the reviewer’s job to make sure that the assisting colleague is aware of the confidentiality of the peer review process ( 23 ). Once the review is complete, the manuscript must be destroyed and cannot be saved electronically by the reviewers ( 23 ).

COMMON ERRORS IN SCIENTIFIC PAPERS

When performing a peer review, there are some common scientific errors to look out for. Most of these errors are violations of logic and common sense: they include contradictory statements, unwarranted conclusions, suggestions of causation when there is only support for correlation, inappropriate extrapolation, circular reasoning, and pursuit of a trivial question ( 24 ). It is also common for authors to claim that two variables are different because the effects of one are statistically significant while the effects of the other are not, rather than directly comparing the two variables ( 24 ); the sketch following this paragraph illustrates why that inference fails. Authors sometimes overlook a confounding variable and do not control for it, or forget to include important details on how their experiments were controlled or on the physical state of the organisms studied ( 24 ). Another common fault is the author’s failure to define terms or use words with precision, as these practices can mislead readers ( 24 ). Jargon and/or misused terms can be a serious problem in papers. Inaccurate statements about specific citations are also a common occurrence ( 24 ). Additionally, many studies produce knowledge that can be applied to areas of science beyond the scope of the original study; it is therefore better for reviewers to look at the novelty of the idea, conclusions, data and methodology, rather than scrutinize whether or not the paper answered the specific question at hand ( 24 ). Although it is important to recognize these points, when performing a review it is generally better practice for the peer reviewer not to focus on a checklist of things that could be wrong, but rather to carefully identify the problems specific to each paper and continuously ask themselves whether anything is missing ( 24 ). An extremely detailed description of how to conduct peer review effectively is presented in the paper How I Review an Original Scientific Article, written by Frederic G. Hoppin, Jr. It can be accessed through the American Physiological Society website under the Peer Review Resources section.
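
To see why “significant versus non-significant” does not establish a difference, consider the following minimal sketch in Python. The effect estimates, standard errors and the `two_sided_p` helper are invented for illustration and are not taken from any study cited here:

```python
# Hypothetical illustration: effect A is statistically significant and
# effect B is not, yet A and B do not differ significantly from each other.
from math import sqrt, erf

est_a, se_a = 0.25, 0.10  # z = 2.5 -> significant at the 0.05 level
est_b, se_b = 0.15, 0.10  # z = 1.5 -> not significant

def two_sided_p(z):
    """Two-sided p-value for a z statistic under a standard normal."""
    cdf = 0.5 * (1 + erf(abs(z) / sqrt(2)))
    return 2 * (1 - cdf)

print("p for effect A:", two_sided_p(est_a / se_a))  # ~0.012
print("p for effect B:", two_sided_p(est_b / se_b))  # ~0.134

# The valid comparison tests the difference between the effects directly.
diff_z = (est_a - est_b) / sqrt(se_a**2 + se_b**2)
print("p for A vs. B:", two_sided_p(diff_z))  # ~0.48: no evidence of a difference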

CRITICISM OF PEER REVIEW

A major criticism of peer review is that there is little evidence that the process actually works: that it is an effective screen for good-quality scientific work, or that it improves the quality of the scientific literature. As a 2002 study published in the Journal of the American Medical Association concluded, ‘Editorial peer review, although widely used, is largely untested and its effects are uncertain’ ( 25 ). Critics also argue that peer review is not effective at detecting errors. Highlighting this point, an experiment by Godlee et al. published in the British Medical Journal (BMJ) inserted eight deliberate errors into a paper that was nearly ready for publication and sent the paper to 420 potential reviewers ( 7 ). Of the 420 reviewers, 221 (53%) responded. The average number of errors spotted was two; no reviewer spotted more than five, and 35 reviewers (16%) did not spot any.

Another criticism of peer review is that it is not conducted thoroughly by scientific conferences whose goal is to obtain large numbers of submitted papers. Such conferences often accept any paper sent in, regardless of its credibility or the prevalence of errors, because the more papers they accept, the more money they can make from author registration fees ( 26 ). This misconduct was exposed by three MIT graduate students, Jeremy Stribling, Dan Aguayo and Maxwell Krohn, who developed a simple computer program called SCIgen that generates nonsense papers and presents them as scientific papers ( 26 ). A nonsense SCIgen paper submitted to a conference was promptly accepted. In 2014, Nature reported that French researcher Cyril Labbé had discovered sixteen SCIgen nonsense papers published by the German academic publisher Springer ( 26 ). Over 100 nonsense papers generated by SCIgen were also published by the US Institute of Electrical and Electronic Engineers (IEEE) ( 26 ). Both organisations have been working to remove the papers. Labbé developed a program to detect SCIgen papers and has made it freely available so that publishers and conference organizers can avoid accepting nonsense work in the future; it is available at http://scigendetect.on.imag.fr/main.php ( 26 ).
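
To give a sense of how easily such nonsense papers can be produced, here is a minimal sketch of a SCIgen-style generator: it repeatedly expands a hand-written context-free grammar until only words remain. The toy grammar below is invented for illustration; it is not SCIgen’s actual grammar or code:

```python
import random

# A toy context-free grammar: non-terminals are uppercase keys,
# terminals are ordinary words. Invented for illustration only.
GRAMMAR = {
    "SENTENCE": [["We", "VERB", "that", "NOUN_PHRASE", "is", "ADJ", "."]],
    "VERB": [["demonstrate"], ["argue"], ["conjecture"]],
    "NOUN_PHRASE": [["the", "ADJ", "NOUN"], ["our", "NOUN"]],
    "ADJ": [["scalable"], ["stochastic"], ["Bayesian"]],
    "NOUN": [["framework"], ["methodology"], ["heuristic"]],
}

def expand(symbol):
    """Recursively expand a symbol by picking a random production."""
    if symbol not in GRAMMAR:  # terminal: an actual word
        return [symbol]
    words = []
    for part in random.choice(GRAMMAR[symbol]):
        words.extend(expand(part))
    return words

print(" ".join(expand("SENTENCE")))
# e.g. "We conjecture that our heuristic is scalable ."
```

Output from such a generator is grammatical but meaningless, which is precisely why its acceptance by a conference indicates that no substantive review took place.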

Additionally, peer review is often criticized for being unable to accurately detect plagiarism. However, many believe that detecting plagiarism cannot practically be included as a component of peer review. As explained by Alice Tuff, development manager at Sense About Science, ‘The vast majority of authors and reviewers think peer review should detect plagiarism (81%) but only a minority (38%) think it is capable. The academic time involved in detecting plagiarism through peer review would cause the system to grind to a halt’ ( 27 ). The publishing house Elsevier began developing electronic plagiarism tools with the help of journal editors in 2009 to help address this issue ( 27 ).
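
Automated plagiarism detection is feasible where manual detection is not because it reduces the problem to measuring text overlap. Below is a minimal sketch of the common shingling approach; it is a generic illustration of the technique, not the algorithm of any particular commercial tool, and the sample texts and threshold are invented:

```python
# Shingle-based overlap detection: compare the sets of overlapping
# n-word sequences ("shingles") drawn from two documents.

def shingles(text, n=3):
    """Return the set of overlapping n-word sequences in the text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a, b):
    """Jaccard similarity of two sets: 0 = disjoint, 1 = identical."""
    return len(a & b) / len(a | b) if (a | b) else 0.0

submitted = "peer review is the standard in scientific research publications"
indexed = "peer review is now the standard in scientific publications"

score = jaccard(shingles(submitted), shingles(indexed))
print(f"similarity: {score:.2f}")
if score > 0.2:  # illustrative threshold; real tools tune this carefully
    print("flag for manual inspection")
```

In practice such scores only flag suspicious overlap against a large index of prior publications; a human still has to judge whether a match reflects plagiarism, fair quotation or coincidence.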

It has also been argued that peer review has lowered research quality by limiting creativity amongst researchers. Proponents of this view claim that peer review has discouraged scientists from pursuing innovative research ideas and bold research questions with the potential to make major advances and paradigm shifts in the field, because researchers believe such work is likely to be rejected by their peers upon review ( 28 ). Indeed, in some cases peer review may result in the rejection of innovative research, as some studies may not seem particularly strong initially yet may be capable of yielding very interesting and useful developments when examined under different circumstances, or in the light of new information ( 28 ). Scientists who do not believe in peer review argue that the process stifles the development of ingenious ideas, and thus the release of fresh knowledge and new developments into the scientific community.

Another issue for which peer review is criticized is that the number of people competent to conduct reviews is limited compared to the vast number of papers that need reviewing. An enormous number of papers is published each year (1.3 million papers in 23,750 journals in 2006), more than the available pool of competent peer reviewers could possibly cover ( 29 ). As a result, people who lack the required expertise to analyze the quality of a research paper are conducting reviews, and weak papers are being accepted. It is now possible to publish any paper in an obscure journal that claims to be peer-reviewed, though the paper or journal itself could be substandard ( 29 ). On a similar note, the US National Library of Medicine indexes 39 journals that specialize in alternative medicine, and though they all identify themselves as “peer-reviewed”, they rarely publish any high-quality research ( 29 ). This highlights the fact that peer review of more controversial or specialized work is typically performed by people who are interested in it and hold views or opinions similar to the author’s, which can bias their review. For instance, a paper on homeopathy is likely to be reviewed by fellow practicing homeopaths, and thus is likely to be accepted as credible, though other scientists may find it to be nonsense ( 29 ). In some cases, papers are initially published but their credibility is challenged at a later date and they are subsequently retracted. Retraction Watch is a website dedicated to revealing papers that have been retracted after publication, potentially due to improper peer review ( 30 ).

Additionally, despite its many positive outcomes, peer review is criticized for delaying the dissemination of new knowledge into the scientific community, and for being an unpaid activity that takes scientists’ time away from activities they would otherwise prioritize, such as research and teaching, for which they are paid ( 31 ). As described by Eva Amsen, Outreach Director for F1000Research, peer review was originally developed as a means of helping editors choose which papers to publish when journals had to limit the number of papers they could print in one issue ( 32 ). Nowadays, however, most journals are available online, either exclusively or in addition to print, and many journals have very limited print runs ( 32 ). Since journals no longer face page limits, any good work can and should be published. Consequently, being selective to save space in a journal is no longer a valid reason for peer reviewers to reject a paper ( 32 ). However, some reviewers have used this excuse when they have ulterior personal motives, such as getting their own research published first.

RECENT INITIATIVES TOWARDS IMPROVING PEER REVIEW

F1000Research was launched in January 2013 by Faculty of 1000 as an open access journal that immediately publishes papers (after an initial check to ensure that the paper is in fact produced by a scientist and has not been plagiarised), and then conducts transparent post-publication peer review ( 32 ). F1000Research aims to prevent delays in new science reaching the academic community that are caused by prolonged publication times ( 32 ). It also aims to make peer reviewing more fair by eliminating any anonymity, which prevents reviewers from delaying the completion of a review so they can publish their own similar work first ( 32 ). F1000Research offers completely open peer review, where everything is published, including the name of the reviewers, their review reports, and the editorial decision letters ( 32 ).

PeerJ was founded by Jason Hoyt and Peter Binfield in June 2012 as an open access, peer reviewed scholarly journal for the biological and medical sciences ( 33 ). PeerJ selects articles to publish based only on scientific and methodological soundness, not on subjective determinants of ‘impact’, ‘novelty’ or ‘interest’ ( 34 ). It works on a “lifetime publishing plan” model, which charges scientists for publishing plans that give them lifetime rights to publish with PeerJ, rather than charging them per publication ( 34 ). PeerJ also encourages open peer review, and authors are given the option to post the full peer review history of their submission with their published article ( 34 ). PeerJ also offers a preprint review service called PeerJ Pre-prints, in which paper drafts are reviewed before being sent to PeerJ to publish ( 34 ).

Rubriq is an independent peer review service designed by Shashi Mudunuri and Keith Collier to improve the peer review system ( 35 ). Rubriq is intended to decrease redundancy in the peer review process so that the time lost in redundant reviewing can be put back into research ( 35 ). According to Keith Collier, over 15 million hours are lost each year to redundant peer review, as papers get rejected from one journal and are subsequently submitted to a less prestigious journal where they are reviewed again ( 35 ). Authors often have to submit their manuscript to multiple journals, and are often rejected multiple times before they find the right match. This process could take months or even years ( 35 ). Rubriq makes peer review portable in order to help authors choose the journal that is best suited for their manuscript from the beginning, thus reducing the time before their paper is published ( 35 ). Rubriq operates under an author-pay model, in which the author pays a fee and their manuscript undergoes double-blind peer review by three expert academic reviewers using a standardized scorecard ( 35 ). The majority of the author’s fee goes towards a reviewer honorarium ( 35 ). The papers are also screened for plagiarism using iThenticate ( 35 ). Once the manuscript has been reviewed by the three experts, the most appropriate journal for submission is determined based on the topic and quality of the paper ( 35 ). The paper is returned to the author in 1-2 weeks with the Rubriq Report ( 35 ). The author can then submit their paper to the suggested journal with the Rubriq Report attached. The Rubriq Report will give the journal editors a much stronger incentive to consider the paper as it shows that three experts have recommended the paper to them ( 35 ). Rubriq also has its benefits for reviewers; the Rubriq scorecard gives structure to the peer review process, and thus makes it consistent and efficient, which decreases time and stress for the reviewer. Reviewers also receive feedback on their reviews and most significantly, they are compensated for their time ( 35 ). Journals also benefit, as they receive pre-screened papers, reducing the number of papers sent to their own reviewers, which often end up rejected ( 35 ). This can reduce reviewer fatigue, and allow only higher-quality articles to be sent to their peer reviewers ( 35 ).

According to Eva Amsen, peer review and scientific publishing are moving in a new direction, in which all papers will be posted online, and a post-publication peer review will take place that is independent of specific journal criteria and solely focused on improving paper quality ( 32 ). Journals will then choose papers that they find relevant based on the peer reviews and publish those papers as a collection ( 32 ). In this process, peer review and individual journals are uncoupled ( 32 ). In Keith Collier’s opinion, post-publication peer review is likely to become more prevalent as a complement to pre-publication peer review, but not as a replacement ( 35 ). Post-publication peer review will not serve to identify errors and fraud but will provide an additional measurement of impact ( 35 ). Collier also believes that as journals and publishers consolidate into larger systems, there will be stronger potential for “cascading” and shared peer review ( 35 ).

CONCLUDING REMARKS

Peer review has become fundamental in assisting editors in selecting credible, high-quality, novel and interesting research papers to publish in scientific journals, and in ensuring the correction of any errors or issues present in submitted papers. Though the peer review process still has flaws and deficiencies, a more suitable screening method for scientific papers has not yet been proposed or developed. Researchers have begun, and must continue, to look for ways of addressing the current issues with peer review, to ensure that it is a foolproof system that allows only quality research papers into the scientific community.
