Graph Analytics in 2024: Types, Tools, and Top 10 Use Cases


Analytics is generally applied to numeric data to gain insights, whereas graph analytics analyzes the relationships between entities. By applying graph algorithms to the relationship data stored in graph databases, graph analytics solutions uncover insights in fields such as social network analysis, fraud detection, supply chain management, and search engine optimization.

What is a graph?

To understand graph analytics, we first need to understand what a graph is. A graph is a mathematical structure that represents relationships between entities. Two elements make up a graph: nodes (or vertices), which represent entities, and edges (or links), which represent relationships. The study of graphs is known in mathematics as graph theory.

There are different types of graphs:

  • Directed graphs: Every edge points from one node to another. A directed graph, also called a digraph or directed network, represents asymmetric relationships.
  • Undirected graphs: Edges connect pairs of nodes without a direction. An undirected graph, also called an undirected network, expresses symmetric relationships.
  • Weighted graphs: Each edge carries a numerical weight. Weights are required for shortest-path problems and other analyses (a small construction example follows this list).
  • Cyclic graphs: A cyclic graph contains at least one path that leads from a node back to itself. A graph that contains no cycle is called acyclic.
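
To make these definitions concrete, here is a minimal sketch (not from the original article) that builds a small directed, weighted graph with the Python networkx library; the node names and weights are invented for illustration.

```python
# A minimal sketch using networkx (pip install networkx).
# Node names and weights are illustrative placeholders.
import networkx as nx

# Directed, weighted graph: edges have a direction and a numeric weight.
g = nx.DiGraph()
g.add_edge("A", "B", weight=4)
g.add_edge("B", "C", weight=2)
g.add_edge("A", "C", weight=7)

# Undirected view of the same relationships (direction is dropped).
ug = g.to_undirected()

# Weights enable shortest-path analysis, as mentioned above.
print(nx.shortest_path(g, "A", "C", weight="weight"))   # ['A', 'B', 'C']
print(nx.is_directed_acyclic_graph(g))                  # True: no cycles here
```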

What is Graph Analytics?

Graph analytics, also called network analysis, is the analysis of relationships among entities such as customers, products, operations, and devices. Organizations leverage graph models to gain insights for purposes such as marketing or analyzing social networks.

Many businesses work with graphs. Some examples are:

  • Telecom operators run fixed or mobile networks, which can be modeled as graphs.
  • Telecom customers call and message one another, and these relationships also form graphs.

Why is it important now?

Graph analytics is gaining attention partly because of its expected market growth. According to a graph analytics market report, the graph analytics market was worth roughly $600 million in 2019 and is expected to reach about $2.5 billion by 2024, a compound annual growth rate (CAGR) of 34% over the forecast period.

What are the different types of graph analytics?

For each type of graph analytics, there are numerous algorithms, ranging from simple heuristics to computationally intensive methods that aim for exact solutions. Which algorithm to use depends on how valuable an exact answer is relative to its computational cost.

Analyzing the current graph

  • Centrality analysis: Estimates how important a node is for the connectivity of the network. It helps identify the most influential people in a social network or the most frequently accessed web pages, for example with the PageRank algorithm.
  • Community detection: The distance and density of relationships can be used to find groups of people who interact frequently with each other in a social network. Community analytics also covers detecting communities and analyzing how they behave over time.
  • Connectivity analysis: Determines how strongly or weakly two nodes are connected.
  • Path analysis: Examines the paths between nodes and is mostly used for shortest-distance problems. (A sketch combining several of these analyses follows this list.)
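
As a rough illustration of how several of these analyses look in code, the following hedged sketch runs centrality, community detection, connectivity, and path analysis on a small built-in example network with networkx; it is an illustrative example, not a production workflow.

```python
# A hedged sketch with networkx on a small built-in example network
# (Zachary's karate club); real analyses run on far larger graphs.
import networkx as nx
from networkx.algorithms.community import label_propagation_communities

g = nx.karate_club_graph()

# Centrality analysis: PageRank scores flag influential nodes.
pagerank = nx.pagerank(g)
print(max(pagerank, key=pagerank.get))          # most influential member

# Community detection: groups of densely connected nodes.
print([sorted(c) for c in label_propagation_communities(g)])

# Connectivity analysis: how many node-disjoint paths link two members.
print(nx.node_connectivity(g, 0, 33))

# Path analysis: shortest path between two members.
print(nx.shortest_path(g, source=0, target=33))
```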

Predicting future changes

  • Link prediction: Estimates new or undocumented relationships by evaluating the proximity and structural similarity of nodes (see the sketch below).
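
A minimal sketch of proximity-based link prediction, assuming networkx is available; the Jaccard coefficient used here is just one of many possible heuristics.

```python
# A minimal sketch; jaccard_coefficient is one simple link-prediction heuristic.
import networkx as nx

g = nx.karate_club_graph()   # stand-in for a real social or transaction network

# Score non-adjacent node pairs by the overlap of their neighborhoods.
scores = sorted(nx.jaccard_coefficient(g), key=lambda t: t[2], reverse=True)
for u, v, score in scores[:5]:
    print(f"predicted link {u}-{v} with score {score:.2f}")
```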

What are its use cases?

Graph analytics applications exist in journalism, telecom, social networks, finance, and operations.

A now classic example of using graph analytics to identify networks of relationships is the International Consortium of Investigative Journalists (ICIJ) research on Panama Papers. This research shed light on how authoritarian leaders and politicians used complex sets of shell companies to obscure their wealth from the public.

Armed with graph analytics and document extraction tools, journalists were able to extract structured data from thousands of documents on companies in offshore jurisdictions and then navigate that data to identify the real owners of these companies.

Graph analytics is also used to spot fraud and unlawful actions such as money laundering and payments to sanctioned entities. To detect criminals, analysts combine social media, texting, phone call, and email data into a graph that shows how these records relate to known criminal records. With that graph, government agencies can identify threats from non-obvious patterns of relationships.

  • Financial transactions form graphs and can be analyzed for compliance purposes. For example, banks need to ensure that their customers are not in any way connected to sanctioned entities (a simplified sketch of such a check follows this list).
  • Loan decisions can be informed by social or financial networks.
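
Here is a simplified sketch of the kind of sanctions-screening check described above; the entities and relationships are fictitious and the logic is deliberately minimal.

```python
import networkx as nx

# Fictitious ownership/payment network.
g = nx.Graph()
g.add_edges_from([
    ("customer_1", "shell_co_A"),
    ("shell_co_A", "shell_co_B"),
    ("shell_co_B", "sanctioned_entity_X"),
    ("customer_2", "supplier_Y"),
])

# Flag customers that are connected, directly or indirectly, to a sanctioned entity.
for customer in ["customer_1", "customer_2"]:
    if nx.has_path(g, customer, "sanctioned_entity_X"):
        hops = nx.shortest_path_length(g, customer, "sanctioned_entity_X")
        print(f"{customer}: connected to sanctioned entity ({hops} hops) -> review")
    else:
        print(f"{customer}: no connection found")
```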

National security

Though controversial, graph analytics is used by national intelligence agencies to detect unlawful activity. Communication activity of both suspected and unsuspected individuals is collected and analyzed to identify non-obvious relationships and potential crimes.

Financial entities are required to prevent payments to sanctioned entities and graph analytics are used to spot such payments.

Fraud detection

In businesses that work with networks such as telecom companies, e-commerce marketplaces or financial institutions, graph analytics is used in fraud detection.

Supply Chain Optimization

Graph analytics algorithms such as shortest path and partitioning are tools to optimize routes in airlines, transportation networks, and supply chain networks.

Utility optimization

Companies that provide utilities such as water, sewage services, electricity, dams, and natural gas can leverage graph analysis to build the most optimal utility distribution network.

Social Network Analysis

Social media networks such as Instagram, Spotify, and LinkedIn are relationship- and connection-driven applications. Graph analytics helps identify influencers and communities in these networks. Social network influencer marketing is an emerging trend, driven by the growing number of social media users and increasing customer skepticism toward more established forms of marketing.

Recommendation engines

You have probably noticed social networks suggesting “People you may know” or “Songs you may like”. These recommendations rely on collaborative filtering, a method commonly used by recommendation engines. Collaborative filtering relies on graph analytics to identify similar users and enable personalized recommendations.

Technology companies that are not social networks also rely on collaborative filtering. For example, eBay provides the most relevant search results according to purchase history.
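The sketch below illustrates the basic idea of graph-based collaborative filtering on a fictitious user-song network; real recommendation engines are far more sophisticated.

```python
import networkx as nx

# Fictitious user-song interactions as a bipartite graph.
g = nx.Graph()
g.add_edges_from([
    ("alice", "song_1"), ("alice", "song_2"),
    ("bob", "song_1"), ("bob", "song_2"), ("bob", "song_3"),
    ("carol", "song_3"),
])

def recommend(user):
    listened = set(g.neighbors(user))
    # Users who share at least one song with `user`.
    similar = {u for song in listened for u in g.neighbors(song)} - {user}
    # Songs those similar users listened to that `user` has not.
    scores = {}
    for u in similar:
        for song in g.neighbors(u):
            if song not in listened:
                scores[song] = scores.get(song, 0) + 1
    return sorted(scores, key=scores.get, reverse=True)

print(recommend("alice"))  # ['song_3'], via the overlap with bob
```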

Pandemic Search

During the COVID-19 pandemic, graph databases helped governments track the spread of the highly infectious virus. A company called We-Yun built an application on the Neo4j graph database that allows Chinese citizens to check whether they came into contact with a known carrier of the virus. The image below is a screenshot of the application, showing all known cases connected to a given name.

An illustration of how a graph database helps track people during a pandemic

How is it different than regular analytics?

Regular analytics relies on statistics, computer programming, and operations research to uncover insights, whereas graph analytics uses graph-specific algorithms to analyze relationships between entities. Algorithms such as PageRank, graph partitioning, graph clustering, and shortest path are specific to graph analytics.

Graph databases, which are necessary for advanced graph analytics, are more flexible than relational database management systems (RDBMS). RDBMSs have rigid schemas, which makes it difficult to add new types of relationships; in graph databases, new relationships can be added flexibly.

What are the leading graph database software tools?

Graph database tools are required for advanced graph analytics. Graph databases connect nodes (representing customers, companies, or any other entity) with relationships (edges) to form graphs that users can query. Some of the leading graph database software tools are:

  • Amazon Neptune
  • Apache Giraph
  • Neo4j

For example, Neo4j is available both as open source and under a commercial license for enterprises.
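
As an illustration of how such a database might be queried from code, here is a hedged sketch using the official Neo4j Python driver; the connection URI, credentials, labels, and data model are assumptions for the example, not taken from the article.

```python
from neo4j import GraphDatabase  # pip install neo4j

# Connection details below are placeholders.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

# Hypothetical Cypher query: customers connected to a sanctioned entity within 3 hops.
query = """
MATCH (c:Customer)-[*1..3]-(s:Entity {sanctioned: true})
RETURN DISTINCT c.name AS customer
"""

with driver.session() as session:
    for record in session.run(query):
        print(record["customer"])

driver.close()
```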




Reader comment: Thanks for this overview. Do you know of any libraries that can be used to create graph-like visualizations without a graph database? Can a graph database be used to produce the 2D headliner image for this article?


Reply: To prepare a 2D graphic, a graph database would be more effort than it is worth. If you need something like the 2D headliner image of this article, you can use a JavaScript charting library like Plotly, which can do an X-Y scatter with lines.
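
For readers who want to try this approach, Plotly also has a Python API (plotly.graph_objects); the hedged sketch below uses it to draw a small node-link picture as a scatter with lines. The node positions and edges are invented.

```python
# A minimal sketch of the "scatter with lines" approach mentioned above,
# using Plotly's Python API. Coordinates are made up for illustration.
import plotly.graph_objects as go

nodes = {"A": (0, 0), "B": (1, 2), "C": (2, 1), "D": (3, 3)}
edges = [("A", "B"), ("B", "C"), ("B", "D")]

fig = go.Figure()
for u, v in edges:
    x0, y0 = nodes[u]
    x1, y1 = nodes[v]
    fig.add_trace(go.Scatter(x=[x0, x1], y=[y0, y1], mode="lines",
                             line=dict(color="gray"), showlegend=False))
fig.add_trace(go.Scatter(x=[p[0] for p in nodes.values()],
                         y=[p[1] for p in nodes.values()],
                         mode="markers+text", text=list(nodes.keys()),
                         textposition="top center", showlegend=False))
fig.show()
```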



A Beginner's Guide to Graph Analytics and Deep Learning

Graphs are networks of dots and lines. - Richard J. Trudeau

Contents

  • Concrete Examples of Graph Data Structures
  • Difficulties of Graph Data: Size and Structure
  • Representing and Traversing Graphs for Machine Learning
  • Further Resources on Graph Data Structures and Deep Learning

Graphs are data structures that can be ingested by various algorithms, notably neural nets, learning to perform tasks such as classification, clustering and regression.

TL;DR: here’s one way to make graph data ingestable for the algorithms:

Algorithms can “embed” each node of a graph into a real vector (similar to the embedding of a word). The result is a vector representation of each node in the graph, with some of the graph’s information preserved. Once you have that real-number vector, you can feed it to a neural network.


The simplest definition of a graph is “a collection of items connected by edges.” Anyone who played with Tinker Toys as a child was building graphs with their spools and sticks. There are many problems where it’s helpful to think of things as graphs.[1] The items are often called nodes, points, or vertices (the plural of vertex), and the connections are called edges or links. Here are a few concrete examples of a graph:

  • Cities are nodes and highways are edges
  • Humans are nodes and relationships between them are edges (in a social network)
  • States are nodes and the transitions between them are edges (for more on states, see our post on deep reinforcement learning ). For example, a video game is a graph of states connected by actions that lead from one state to the next…
  • Atoms are nodes and chemical bonds are edges (in a molecule)
  • Web pages are nodes and hyperlinks are edges (Hello, Google)
  • A thought is a graph of synaptic firings (edges) between neurons (nodes)
  • A neural network is a graph … that makes predictions about other graphs. The nodes are places where computation happens and the edges are the paths by which signal flows through the mathematical operations

Any ontology, or knowledge graph, charts the interrelationship of entities (combining symbolic AI with the graph structure):

  • Taxonomies of animal species
  • Diseases that share etiologies and symptoms
  • Medications that share ingredients

Applying neural networks and other machine-learning techniques to graph data can be difficult.

The first question to answer is: What kind of graph are you dealing with?

Let’s say you have a finite state machine, where each state is a node in the graph. You can give each state-node a unique ID, maybe a number, and represent the whole machine as a square matrix: give the rows the names of the states and the columns the same names, so that the matrix contains an element for every state to intersect with every other state. Then you can mark each element with a 1 or 0 to indicate whether the two states are connected in the graph, or use a weighted entry (a continuous number) to indicate the likelihood of a transition from one state to the next. (The transition matrix below represents a finite state machine for the weather.)

[Figure: transition matrix for a weather finite state machine]
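
A numeric sketch of such a transition matrix, with invented weather states and probabilities:

```python
import numpy as np

# Rows and columns share the same state names; entry (i, j) is the
# probability of moving from state i to state j.
states = ["sunny", "cloudy", "rainy"]
transition = np.array([
    [0.7, 0.2, 0.1],   # from sunny
    [0.3, 0.4, 0.3],   # from cloudy
    [0.2, 0.4, 0.4],   # from rainy
])

# Each row sums to 1, as required for transition probabilities.
assert np.allclose(transition.sum(axis=1), 1.0)

# A 0/1 adjacency matrix is the unweighted special case.
adjacency = (transition > 0).astype(int)
print(adjacency)
```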

That seems simple enough, but many graphs, like social network graphs with billions of nodes (where each member is a node and each connection to another member is an edge), are simply too large to be computed. Size is one problem that graphs present as a data structure. In other words, you can’t efficiently store a large social network in a tensor. They don’t compute.

Neural nets do well on vectors and tensors; data types like images (which have structure embedded in them via pixel proximity – they have fixed size and spatiality); and sequences such as text and time series (which display structure in one direction, forward in time).

Graphs have an arbitrary structure : they are collections of things without a location in space, or with an arbitrary location. They have no proper beginning and no end, and two nodes connected to each other are not necessarily “close”.

You usually don’t feed whole graphs into neural networks, for example. They would have to be the same shape and size, and you’d have to line up your graph nodes with your network’s input nodes. But the whole point of graph-structured input is to not know or have that order. There’s no first, there’s no last.


The second question when dealing with graphs is: What kind of question are you trying to answer by applying machine learning to them? In social networks, you’re usually trying to make a decision about what kind of person you’re looking at (represented by the node), or what kind of friends and interactions that person has. So you’re making predictions about the node itself or its edges.

Since that’s the case, you can address the uncomputable size of a Facebook-scale graph by looking at a node and its neighbors maybe 1-3 degrees away; i.e. a subgraph. The immediate neighborhood of the node, taking k steps down the graph in all directions, probably captures most of the information you care about. You’re filtering out the giant graph’s overwhelming size.
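
A minimal sketch of that neighborhood-sampling idea using networkx; the generated graph below is just a stand-in for a much larger social network.

```python
import networkx as nx

# A synthetic stand-in for a huge social graph.
g = nx.barabasi_albert_graph(100000, 3, seed=42)

# Keep only the neighborhood within k hops of the node we care about.
k = 2
subgraph = nx.ego_graph(g, n=0, radius=k)
print(subgraph.number_of_nodes(), "of", g.number_of_nodes(), "nodes retained")
```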

Let’s say you decide to give each node an arbitrary representation vector, like a low-dimensional word embedding, each node’s vector being the same length. The next step would be to traverse the graph, and that traversal could be represented by arranging the node vectors next to each other in a matrix. You could then feed that matrix representing the graph to a recurrent neural net. That’s basically DeepWalk (see below), which treats truncated random walks across a large graph as sentences.

If you turn each node into an embedding, much like word2vec does with words, then you can force a neural net model to learn representations for each node, which can then be helpful in making downstream predictions about them. (How close is this node to other things we care about?)
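
The sketch below captures the gist of that random-walk-plus-word2vec recipe using networkx and gensim; it is a simplified illustration, not the reference DeepWalk implementation, and the hyperparameters are arbitrary.

```python
import random
import networkx as nx
from gensim.models import Word2Vec  # pip install gensim (4.x API assumed)

g = nx.karate_club_graph()

def random_walk(graph, start, length=10):
    # Truncated random walk, returned as a "sentence" of node IDs.
    walk = [start]
    for _ in range(length - 1):
        neighbors = list(graph.neighbors(walk[-1]))
        if not neighbors:
            break
        walk.append(random.choice(neighbors))
    return [str(node) for node in walk]

# Treat walks as sentences and nodes as words.
walks = [random_walk(g, node) for node in g.nodes() for _ in range(10)]
model = Word2Vec(walks, vector_size=32, window=5, min_count=0, sg=1, workers=1)

print(model.wv["0"][:5])                    # embedding for node 0
print(model.wv.most_similar("0", topn=3))   # structurally nearby nodes
```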

Another more recent approach is a graph convolutional network , which is very similar to convolutional networks: it passes a node filter over a graph much as you would pass a convolutional filter over an image, registering each time it sees a certain kind of node. The readings taken by the filters are stacked and passed to a maxpooling layer, which discards all but the strongest signal, before we return to a filter-passing convolutional layer.
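
The core propagation step of such a layer can be sketched in a few lines of numpy; this is a simplified illustration of the idea (self-loops, symmetric normalization, one layer, random weights), not a full graph convolutional network.

```python
import numpy as np

# A: adjacency matrix of a 4-node toy graph; X: random node features.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 1],
              [0, 1, 0, 0],
              [0, 1, 0, 0]], dtype=float)
X = np.random.randn(4, 8)        # 8-dimensional node features
W = np.random.randn(8, 16)       # "learnable" weights, randomly initialized here

# Add self-loops and symmetrically normalize: A_hat = D^-1/2 (A + I) D^-1/2.
A_tilde = A + np.eye(4)
D_inv_sqrt = np.diag(1.0 / np.sqrt(A_tilde.sum(axis=1)))
A_hat = D_inv_sqrt @ A_tilde @ D_inv_sqrt

# One propagation step: aggregate neighbor features, transform, apply ReLU.
H = np.maximum(0, A_hat @ X @ W)
print(H.shape)   # (4, 16)
```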

One interesting aspect of graphs is so-called side information, or the attributes and features associated with each node. For example, each node could have an image associated with it, in which case an algorithm attempting to make a decision about that graph might have a CNN subroutine embedded in it for those image nodes. Or the side data could be text, and the graph could be a tree (the leaves are words, intermediate nodes are phrases combining the words) over which we run a recursive neural net, an algorithm popularized by Richard Socher.

Finally, you can compute derivative functions such as graph Laplacians from the tensors that represent the graphs, much like you might perform an eigen analysis on a tensor. These functions will tell you things about the graph that may help you classify or cluster it. (See below for more information.)
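
A small numpy sketch of computing a graph Laplacian and its eigendecomposition, with an invented adjacency matrix:

```python
import numpy as np

# Adjacency matrix of a small undirected graph.
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)

# Unnormalized graph Laplacian: L = D - A.
D = np.diag(A.sum(axis=1))
L = D - A

# Eigenvalues/eigenvectors of L underpin spectral clustering; the number of
# (near-)zero eigenvalues equals the number of connected components.
eigenvalues, eigenvectors = np.linalg.eigh(L)
print(np.round(eigenvalues, 3))
```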

[1] In a weird meta way, it’s just graphs all the way down, not turtles. A human scientist whose head is full of firing synapses (graph) is both embedded in a larger social network (graph) and engaged in constructing ontologies of knowledge (graph) and making predictions about data with neural nets (graph).

  • Stanford Course (Video): CS224W: Machine Learning with Graphs by Jure Leskovec

Below are a few papers discussing how neural nets can be applied to data in graphs.

Graph Matching Networks for Learning the Similarity of Graph Structured Objects

This paper addresses the challenging problem of retrieval and matching of graph structured objects, and makes two key contributions. First, we demonstrate how Graph Neural Networks (GNN), which have emerged as an effective model for various supervised prediction problems defined on structured data, can be trained to produce embedding of graphs in vector spaces that enables efficient similarity reasoning. Second, we propose a novel Graph Matching Network model that, given a pair of graphs as input, computes a similarity score between them by jointly reasoning on the pair through a new cross-graph attention-based matching mechanism. We demonstrate the effectiveness of our models on different domains including the challenging problem of control-flow-graph based function similarity search that plays an important role in the detection of vulnerabilities in software systems. The experimental analysis demonstrates that our models are not only able to exploit structure in the context of similarity learning but they can also outperform domain-specific baseline systems that have been carefully hand-engineered for these problems.

A Comprehensive Survey on Graph Neural Networks

by Zonghan Wu, Shirui Pan, Fengwen Chen, Guodong Long, Chengqi Zhang, Philip S. Yu

Deep learning has revolutionized many machine learning tasks in recent years, ranging from image classification and video processing to speech recognition and natural language understanding. The data in these tasks are typically represented in the Euclidean space. However, there is an increasing number of applications where data are generated from non-Euclidean domains and are represented as graphs with complex relationships and interdependency between objects. The complexity of graph data has imposed significant challenges on existing machine learning algorithms. Recently, many studies on extending deep learning approaches for graph data have emerged. In this survey, we provide a comprehensive overview of graph neural networks (GNNs) in data mining and machine learning fields. We propose a new taxonomy to divide the state-of-the-art graph neural networks into different categories. With a focus on graph convolutional networks, we review alternative architectures that have recently been developed; these learning paradigms include graph attention networks, graph autoencoders, graph generative networks, and graph spatial-temporal networks. We further discuss the applications of graph neural networks across various domains and summarize the open source codes and benchmarks of the existing algorithms on different learning tasks. Finally, we propose potential research directions in this fast-growing field.

Representation Learning on Graphs: Methods and Applications (2017)

by William Hamilton, Rex Ying and Jure Leskovec

Machine learning on graphs is an important and ubiquitous task with applications ranging from drug design to friendship recommendation in social networks. The primary challenge in this domain is finding a way to represent, or encode, graph structure so that it can be easily exploited by machine learning models. Traditionally, machine learning approaches relied on user-defined heuristics to extract features encoding structural information about a graph (e.g., degree statistics or kernel functions). However, recent years have seen a surge in approaches that automatically learn to encode graph structure into low-dimensional embeddings, using techniques based on deep learning and nonlinear dimensionality reduction. Here we provide a conceptual review of key advancements in this area of representation learning on graphs, including matrix factorization-based methods, random-walk based algorithms, and graph convolutional networks. We review methods to embed individual nodes as well as approaches to embed entire (sub)graphs. In doing so, we develop a unified framework to describe these recent approaches, and we highlight a number of important applications and directions for future work.

A Short Tutorial on Graph Laplacians, Laplacian Embedding, and Spectral Clustering

by Radu Horaud

Community Detection with Graph Neural Networks (2017)

by Joan Bruna and Xiang Li

DeepWalk: Online Learning of Social Representations (2014)

by Bryan Perozzi, Rami Al-Rfou and Steven Skiena

We present DeepWalk, a novel approach for learning latent representations of vertices in a network. These latent representations encode social relations in a continuous vector space, which is easily exploited by statistical models. DeepWalk generalizes recent advancements in language modeling and unsupervised feature learning (or deep learning) from sequences of words to graphs. DeepWalk uses local information obtained from truncated random walks to learn latent representations by treating walks as the equivalent of sentences. We demonstrate DeepWalk’s latent representations on several multi-label network classification tasks for social networks such as BlogCatalog, Flickr, and YouTube. Our results show that DeepWalk outperforms challenging baselines which are allowed a global view of the network, especially in the presence of missing information. DeepWalk’s representations can provide F1 scores up to 10% higher than competing methods when labeled data is sparse. In some experiments, DeepWalk’s representations are able to outperform all baseline methods while using 60% less training data. DeepWalk is also scalable. It is an online learning algorithm which builds useful incremental results, and is trivially parallelizable. These qualities make it suitable for a broad class of real world applications such as network classification, and anomaly detection.

DeepWalk is implemented in Deeplearning4j .

Deep Neural Networks for Learning Graph Representations (2016) by Shaosheng Cao, Wei Lu and Qiongkai Xu

In this paper, we propose a novel model for learning graph representations, which generates a low-dimensional vector representation for each vertex by capturing the graph structural information. Different from other previous research efforts, we adopt a random surfing model to capture graph structural information directly, instead of using the sampling-based method for generating linear sequences proposed by Perozzi et al. (2014). The advantages of our approach will be illustrated from both theoretical and empirical perspectives. We also give a new perspective for the matrix factorization method proposed by Levy and Goldberg (2014), in which the pointwise mutual information (PMI) matrix is considered as an analytical solution to the objective function of the skip-gram model with negative sampling proposed by Mikolov et al. (2013). Unlike their approach, which involves the use of the SVD for finding the low-dimensional projections from the PMI matrix, the stacked denoising autoencoder is introduced in our model to extract complex features and model non-linearities. To demonstrate the effectiveness of our model, we conduct experiments on clustering and visualization tasks, employing the learned vertex representations as features. Empirical results on datasets of varying sizes show that our model outperforms other state-of-the-art models in such tasks.

Deep Feature Learning for Graphs

by Ryan A. Rossi, Rong Zhou, Nesreen K. Ahmed

Learning multi-faceted representations of individuals from heterogeneous evidence using neural networks (2015)

by Jiwei Li, Alan Ritter and Dan Jurafsky

Inferring latent attributes of people online is an important social computing task, but requires integrating the many heterogeneous sources of information available on the web. We propose learning individual representations of people using neural nets to integrate rich linguistic and network evidence gathered from social media. The algorithm is able to combine diverse cues, such as the text a person writes, their attributes (e.g. gender, employer, education, location) and social relations to other people. We show that by integrating both textual and network evidence, these representations offer improved performance at four important tasks in social media inference on Twitter: predicting (1) gender, (2) occupation, (3) location, and (4) friendships for users. Our approach scales to large datasets and the learned representations can be used as general features in, and have the potential to benefit, a large number of downstream tasks including link prediction, community detection, or probabilistic reasoning over social networks.

node2vec: Scalable Feature Learning for Networks (Stanford, 2016) by Aditya Grover and Jure Leskovec

Prediction tasks over nodes and edges in networks require careful effort in engineering features used by learning algorithms. Recent research in the broader field of representation learning has led to significant progress in automating prediction by learning the features themselves. However, present feature learning approaches are not expressive enough to capture the diversity of connectivity patterns observed in networks. Here we propose node2vec, an algorithmic framework for learning continuous feature representations for nodes in networks. In node2vec, we learn a mapping of nodes to a low-dimensional space of features that maximizes the likelihood of preserving network neighborhoods of nodes. We define a flexible notion of a node’s network neighborhood and design a biased random walk procedure, which efficiently explores diverse neighborhoods. Our algorithm generalizes prior work which is based on rigid notions of network neighborhoods, and we argue that the added flexibility in exploring neighborhoods is the key to learning richer representations. We demonstrate the efficacy of node2vec over existing state-of-the-art techniques on multi-label classification and link prediction in several real-world networks from diverse domains. Taken together, our work represents a new way for efficiently learning state-of-the-art task-independent representations in complex networks.

Gated Graph Sequence Neural Networks (Toronto and Microsoft, 2017) by Yujia Li, Daniel Tarlow, Marc Brockschmidt and Richard Zemel

Graph-structured data appears frequently in domains including chemistry, natural language semantics, social networks, and knowledge bases. In this work, we study feature learning techniques for graph-structured inputs. Our starting point is previous work on Graph Neural Networks (Scarselli et al., 2009), which we modify to use gated recurrent units and modern optimization techniques and then extend to output sequences. The result is a flexible and broadly useful class of neural network models that has favorable inductive biases relative to purely sequence-based models (e.g., LSTMs) when the problem is graph-structured. We demonstrate the capabilities on some simple AI (bAbI) and graph algorithm learning tasks. We then show it achieves state-of-the-art performance on a problem from program verification, in which subgraphs need to be matched to abstract data structures.

Graph Classification with 2D Convolutional Neural Networks

Deep Learning on Graphs: A Survey (December 2018)

Viewing Matrices & Probability as Graphs

Graph Convolutional Networks, by Kipf

Diffusion in Networks: An Interactive Essay

Matrices as Tensor Network Diagrams

Innovations in Graph Representation Learning


Chris V. Nicholson

Chris V. Nicholson is a venture partner at Page One Ventures . He previously led Pathmind and Skymind. In a prior life, Chris spent a decade reporting on tech and finance for The New York Times, Businessweek and Bloomberg, among others.


Graph analytics 101: reveal the story behind your data

by Catherine Kearns , 29th June 2021

If you’re serious about finding the stories buried deep in your graph visualizations, you need graph analytics.

They help you discover everything from the fastest route through a supply chain network to similar patterns of activity in a cryptocurrency blockchain, and from cliques in a social media network to the most influential members of an organized crime group.

Our KeyLines and ReGraph graph visualization toolkits have advanced graph analytics capabilities to help you build powerful applications that reveal insights fast.

This blog post focuses on what they are, why they’re important, and how they give users a deeper understanding of their graph data.

A large KeyLines network visualization showing connected entities in dark mode

About graphs & graph visualization

Before we explore graph analytics, let’s cover some graph fundamentals.

A graph is a model of data that features connections (called links or edges) between entities (called nodes or vertices). Those connections tell us what kind of relationships exist between entities, making them just as important as the entities themselves.

Relationships are complex, especially when you’re dealing with huge networks of connected entities. To understand them better, you need graph visualization to bring your graph data to life.


When you visualize graph data for the first time, you’ll immediately recognize interesting structures and connections. That’s because our brains are great at spotting patterns when they’re presented in a tangible format.

A zoomed in KeyLines graph visualization showing how social media users are connected through tweets and FaceBook posts

The real benefits of graph visualization lie in the ability to interact with the connected data to understand the full story. Your users don’t just want to see where connections exist. When they’re trying to understand relationships, they also need to know:

  • Why do some have more important connections than others?
  • How do certain relationships impact the network as a whole?
  • Which connections have the power to make or break a group dynamic?

For answers, they must rely on graph analytics.

What are graph analytics?

While traditional analytics try to make sense of data points – either individually or in aggregate – graph analytics use a sophisticated set of algorithms specifically designed to uncover powerful insights in graph data.

Each algorithm analyzes connections in a different way and reveals something new. They tell us what’s really going on in a network: who has the most influence, who is well connected, who belongs to a sub-network, and more.

A zoomed in ReGraph graph visualization showing the most influential people in a mafia family

Why do I need graph analytics?

If you’re visualizing graph data already, you’ll know how important it is to simplify complexity, filter out noise, and drill down on the right details.

When users can run sophisticated graph analytics at the touch of a button, they gain much deeper knowledge and understanding. As a result, they can make informed decisions quickly based on details they can trust.

A zoomed in ReGraph graph visualization showing email communications between Enron employees to identify individual roles and hierarchies

Without these algorithms, crucial information about the network remains hidden, and progress on your analysis is painfully slow.

The calculations graph analytics perform aren’t simple research tasks that can be done by hand: they’re advanced mathematical computations completed in an instant even on your largest, most complex datasets. The algorithms do the heavy lifting for you.

Which graph analytics should I use?

Different graph analytics complement each other to reveal new and interesting data insights. Here are the most popular algorithms.

Path analysis algorithm

This algorithm helps users understand the different ways to travel through (or ‘traverse’) a network. By measuring how many ‘hops’ each node is from every other node, it calculates distances across the network.

It answers important questions, such as, “what’s the shortest (or fastest) path A would take to reach Z?” It’s useful for identifying routes between physical locations, and finding the quickest lines of communication between people in an organization.

A zoomed in ReGraph graph visualization and map showing the shortest path in the energy grid between Ukraine and the UK

Read about another practical example of shortest path analysis .
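
As a generic illustration of hop counts and shortest paths (the toolkits described in this article provide their own built-in path analysis), here is a hedged networkx sketch on a fictitious communication network.

```python
import networkx as nx

# Fictitious communication network inside an organization.
g = nx.Graph()
g.add_edges_from([("ana", "ben"), ("ben", "cara"), ("cara", "dev"),
                  ("ana", "erin"), ("erin", "dev")])

# How many hops is each person from "ana"?
print(nx.single_source_shortest_path_length(g, "ana"))

# Shortest route from "ana" to "dev", e.g. ['ana', 'erin', 'dev'].
print(nx.shortest_path(g, "ana", "dev"))
```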

Pattern matching

This graph analytics technique identifies subgroups with similar characteristics inside a network. Repeated patterns in the graph could identify subgroups that are connected in a way that isn’t immediately obvious.

Real world examples include spotting repeated unusual activity in financial fraud, and investigating terrorist cells suspected of working for the same organization.

A zoomed in KeyLines graph visualization showing links between a worldwide terror network

Community detection

Large, complex networks can contain groups of densely-connected nodes. These communities are interesting to analysts – they want to know what bonds them, how the community is evolving and what impact it has on the wider network.

Community detection algorithms reveal which circles people belong to within their wider social network. It’s essential for organizational network analysis to identify informal groups that exist across formal hierarchies.

Once we’ve spotted sub-groups, we can take the analysis one step further by exploring their timelines using our time-based analysis tool, KronoGraph . This reveals how relationships evolved, who’s the most active communicator, and how they keep in contact with the wider network.

A KronoGraph timeline visualization revealing detailed patterns of communication between individuals

Read about another example of community detection routines .
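
A hedged sketch of modularity-based community detection on a small example network using networkx (generic code, not the toolkits discussed here):

```python
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

g = nx.karate_club_graph()   # a classic small social network

for i, group in enumerate(greedy_modularity_communities(g)):
    print(f"community {i}: {sorted(group)}")
```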

Centrality measures

These techniques help us to understand how nodes interact and which are the most important in the network.

Key to social network analysis are centrality measures. These reveal those who are strategically well placed, those who act as ‘gatekeepers’ between different parts of the network, those who can spread information more quickly, and those who have the most influence.

A KronoGraph timeline visualization and a ReGraph graph visualization revealing in-depth details of network connections.
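
The sketch below computes four common centrality measures with networkx on a small example network; it is a generic illustration rather than toolkit-specific code.

```python
import networkx as nx

g = nx.karate_club_graph()

degree = nx.degree_centrality(g)            # most direct connections
betweenness = nx.betweenness_centrality(g)  # "gatekeepers" between parts of the network
closeness = nx.closeness_centrality(g)      # who can reach (or inform) everyone fastest
eigenvector = nx.eigenvector_centrality(g)  # influence via well-connected neighbors

print("top gatekeeper:", max(betweenness, key=betweenness.get))
print("most influential:", max(eigenvector, key=eigenvector.get))
```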

Who should use graph analytics?

Every organization relying on connected data analysis to make key decisions needs graph analytics.

Our customers use them across a wide range of domains, including:

  • Fraud detection – typical insurance claims feature small, isolated clusters of people. If community detection graph analytics reveal that larger groups of highly-connected individuals are involved in multiple claims, it suggests an organized insurance scam.
  • Security and intelligence – to disrupt an organized crime network, you need to identify the most influential individuals whose removal would have the greatest impact.
  • Cyber security – using path analysis to find the most efficient way to reroute a central server if part of the wider IT network infrastructure is compromised.

For other examples – from network infrastructure to compliance, pharmaceuticals to knowledge graphs – see our use cases .

Get started

Graph analytics are the best way to understand how networks behave. Together with our toolkits’ other advanced features, including graph layout algorithms and custom styling options , they uncover the most important nodes and highlight the connections that matter.

You’ll find demos of how to use graph analytics in your applications, together with example source code, in our SDKs.

A screen showing a hybrid graph and timeline visualization created using ReGraph and KronoGraph



Research Briefing | Published: 26 June 2023

A software resource for large graph processing and analysis

Nature Computational Science, volume 3, pages 586–587 (2023)


GRAPE is a software resource for graph processing, learning and embedding that is orders of magnitude faster than existing state-of-the-art libraries. GRAPE can quickly process real-world graphs with millions of nodes and billions of edges, enabling complex graph analyses and research in graph-based machine learning and in diverse disciplines.




This is a summary of: Cappelletti, L. et al. GRAPE for fast and scalable graph processing and random-walk-based embedding. Nat. Comput. Sci. https://doi.org/10.1038/s43588-023-00465-8 (2023).




How to Use Tables & Graphs in a Research Paper


It might not seem very relevant to the story and outcome of your study, but how you visually present your experimental or statistical results can play an important role during the review and publication process of your article. A presentation that is in line with the overall logical flow of your story helps you guide the reader effectively from your introduction to your conclusion. 

If your results (and the way you organize and present them) don’t follow the story you outlined in the beginning, then you might confuse the reader and they might end up doubting the validity of your research, which can increase the chance of your manuscript being rejected at an early stage. This article illustrates the options you have when organizing and writing your results and will help you make the best choice for presenting your study data in a research paper.

Why does data visualization matter?

Your data and the results of your analysis are the core of your study. Of course, you need to put your findings and what you think your findings mean into words in the text of your article. But you also need to present the same information visually, in the results section of your manuscript, so that the reader can follow and verify that they agree with your observations and conclusions. 

The way you visualize your data can either help the reader to comprehend quickly and identify the patterns you describe and the predictions you make, or it can leave them wondering what you are trying to say or whether your claims are supported by evidence. Different types of data therefore need to be presented in different ways, and whatever way you choose needs to be in line with your story. 

Another thing to keep in mind is that many journals have specific rules or limitations (e.g., how many tables and graphs you are allowed to include, what kind of data needs to go on what kind of graph) and specific instructions on how to generate and format data tables and graphs (e.g., maximum number of subpanels, length and detail level of tables). In the following, we will go over the main points you need to consider when organizing your data and writing your results section.

Table of Contents:

  • Types of Data
  • When to Use Data Tables
  • When to Use Data Graphs
  • Common Types of Graphs in Research Papers
  • Journal Guidelines: What to Consider Before Submission

Depending on the aim of your research and the methods and procedures you use, your data can be quantitative or qualitative. Quantitative data, whether objective (e.g., size measurements) or subjective (e.g., rating one’s own happiness on a scale), is what is usually collected in experimental research. Quantitative data are expressed in numbers and analyzed with the most common statistical methods. Qualitative data, on the other hand, can consist of case studies or historical documents, or can be collected through surveys and interviews. Qualitative data are expressed in words and need to be categorized and interpreted to yield meaningful outcomes.

Quantitative data example: Height differences between two groups of participants
Qualitative data example: Subjective feedback on the food quality in the work cafeteria

Depending on what kind of data you have collected and what story you want to tell with it, you have to find the best way of organizing and visualizing your results.

When you want to show the reader in detail how your independent and dependent variables interact, then a table (with data arranged in columns and rows) is your best choice. In a table, readers can look up exact values, compare those values between pairs or groups of related measurements (e.g., growth rates or outcomes of a medical procedure over several years), look at ranges and intervals, and select specific factors to search for patterns. 

Tables are not restrained to a specific type of data or measurement. Since tables really need to be read, they activate the verbal system. This requires focus and some time (depending on how much data you are presenting), but it gives the reader the freedom to explore the data according to their own interest. Depending on your audience, this might be exactly what your readers want. If you explain and discuss all the variables that your table lists in detail in your manuscript text, then you definitely need to give the reader the chance to look at the details for themselves and follow your arguments. If your analysis only consists of simple t-tests to assess differences between two groups, you can report these results in the text (in this case: mean, standard deviation, t-statistic, and p-value), and do not necessarily need to include a table that simply states the same numbers again. If you did extensive analyses but focus on only part of that data (and clearly explain why, so that the reader does not think you forgot to talk about the rest), then a graph that illustrates and emphasizes the specific result or relationship that you consider the main point of your story might be a better choice.


When to Use Data Graphs

Graphs are a visual display of information and show the overall shape of your results rather than the details. If used correctly, a visual representation helps your (or your reader’s) brain to quickly understand large amounts of data and spot patterns, trends, and exceptions or outliers. Graphs also make it easier to illustrate relationships between entire data sets. This is why, when you analyze your results, you usually don’t just look at the numbers and the statistical values of your tests, but also at histograms, box plots, and distribution plots, to quickly get an overview of what is going on in your data.

Line graphs

When you want to illustrate a change over a continuous range or time, a line graph is your best choice. Changes in different groups or samples over the same range or time can be shown by lines of different colors or with different symbols.

Example: Let’s collapse across the different food types and look at the growth of our four fish species over time.

[Figure: line graph showing growth of aquarium fish over one month]

Bar graphs

You should use a bar graph when your data are not continuous but divided into categories that are not necessarily connected, such as different samples, methods, or setups. In our example, the different fish types or the different types of food are such non-continuous categories.

Example: Let’s collapse across the food types again and also across time, and only compare the overall weight increase of our four fish types at the end of the feeding period.

[Figure: bar graph showing the increase in weight of different fish species over one month]

Scatter plots

Scatter plots can be used to illustrate the relationship between two variables, but note that both have to be continuous. The following example adds “fish length” as an additional variable, since none of the variables in our table above (fish type, fish food, time) are continuous and can therefore not be used for this kind of graph.

[Figure: scatter plot showing the growth of aquarium fish over time, plotting weight versus length]
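
To make the three graph types concrete, here is a hedged matplotlib sketch; the fish species, weights, and lengths are invented to mirror the article’s example, not real data.

```python
# Illustrative only: values are made up to mirror the fish-growth example.
import matplotlib.pyplot as plt

weeks = [1, 2, 3, 4]
weight_a = [1.0, 1.4, 1.9, 2.5]   # species A weight (g) over time
weight_b = [1.0, 1.2, 1.5, 1.7]   # species B

fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(12, 3))

# Line graph: change over a continuous range (time).
ax1.plot(weeks, weight_a, marker="o", label="species A")
ax1.plot(weeks, weight_b, marker="s", label="species B")
ax1.set(xlabel="Week", ylabel="Weight (g)", title="Growth over time")
ax1.legend()

# Bar graph: comparison across non-continuous categories.
ax2.bar(["A", "B", "C", "D"], [1.5, 0.7, 1.1, 0.9])
ax2.set(xlabel="Species", ylabel="Weight gain (g)", title="Total weight gain")

# Scatter plot: relationship between two continuous variables.
lengths = [2.1, 2.6, 3.0, 3.4, 3.9]
weights = [1.0, 1.3, 1.8, 2.2, 2.6]
ax3.scatter(lengths, weights)
ax3.set(xlabel="Length (cm)", ylabel="Weight (g)", title="Weight vs. length")

plt.tight_layout()
plt.savefig("fish_growth_panels.png", dpi=300)
```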

As you see, these example graphs all contain less data than the table above, but they lead the reader to exactly the key point of your results or the finding you want to emphasize. If you let your readers search for these observations in a big table full of details that are not necessarily relevant to the claims you want to make, you can create unnecessary confusion. Most journals allow you to provide bigger datasets as supplementary information, and some even require you to upload all your raw data at submission. When you write up your manuscript, however, matching the data presentation to the storyline is more important than throwing everything you have at the reader. 

Don’t forget that every graph needs clear x- and y-axis labels, a title above the figure that summarizes what is shown, and a descriptive legend/caption below. Since your caption needs to stand alone and the reader needs to be able to understand it without looking at the text, you need to explain what you measured/tested and spell out all labels and abbreviations you use in any of your graphs once more in the caption (even if you think the reader “should” remember everything by now, make it easy for them and guide them through your results once more). Have a look at this article if you need help on how to write strong and effective figure legends.

Even if you have thought about the data you have, the story you want to tell, and how to guide the reader most effectively through your results, you need to check whether the journal you plan to submit to has specific guidelines and limitations when it comes to tables and graphs. Some journals allow you to submit any tables and graphs initially, as long as tables are editable (for example in Word format, not an image) and graphs are of high enough resolution.

Some others, however, have very specific instructions even at the submission stage, and almost all journals will ask you to follow their formatting guidelines once your manuscript is accepted. The closer your figures are already to those guidelines, the faster your article can be published. This PLOS One Figure Preparation Checklist is a good example of how extensive these instructions can be – don’t wait until the last minute to realize that you have to completely reorganize your results because your target journal does not accept tables above a certain length or graphs with more than 4 panels per figure. 

Some things you should always pay attention to (and look at already published articles in the same journal if you are unsure or if the author instructions seem confusing) are the following:

  • How many tables and graphs are you allowed to include?
  • What file formats are you allowed to submit?
  • Are there specific rules on resolution/dimension/file size?
  • Should your figure files be uploaded separately or placed into the text?
  • If figures are uploaded separately, do the files have to be named in a specific way?
  • Are there rules on what fonts to use or to avoid and how to label subpanels?
  • Are you allowed to use color? If not, make sure your data sets are distinguishable.

If you are dealing with digital image data, then it might also be a good idea to familiarize yourself with the difference between “adjusting” images for clarity and visibility and image manipulation, which constitutes scientific misconduct. And to fully prepare your research paper for publication before submitting it, be sure to receive proofreading services, including journal manuscript editing and research paper editing, from Wordvice’s professional academic editors.


A Behavior Analyst’s Guide to Supervising Fieldwork, pp. 69–99

Graphing, Interpreting Graphs, and Experimental Designs

  • Tonya N. Davis & Jessica S. Akers, Baylor University, Waco, TX, USA
  • First Online: 06 January 2023


Behavior analysts collect data on client behavior to make countless decisions within a behavior change program. To use these data efficiently, they must be graphically displayed and systematically analyzed. In this chapter, you will introduce your supervisees to line and bar graphs and the essential components of a graph. You will review the steps for visually analyzing data within and across conditions, as well as discuss AB, ABAB, and multielement designs. During the group supervision meeting, your supervisee will practice identifying essential graph components in fictional graphed data, graph data sets, and then visually analyze their graphical displays. During the individual supervision meeting without a client, they will repeat this process of graphing and then visually analyzing the graph with raw data sets they collected during sessions with their client. Because graphing and visual analysis are tasks that take place without the client present, rather than conduct activities related to this topic during the supervision meeting with the client, you will observe your supervisee and provide performance feedback using the Supervision Observation: Procedural Fidelity Checklist. This form is designed to evaluate general supervisee behavior not related to a specific skill or intervention methodology.

  • Visual analysis
  • Variability
  • Immediacy of change
  • Consistency across similar conditions
  • Experimental design
  • ABAB design
  • Reversal design
  • Multielement design




Appendix A: Graph Component Checklist (provided as electronic supplementary material, pptx, 1183 kb)

Appendix B: Sample Graphs

  • Figure B.1: Percent of 10-second intervals in which Simon engaged in aggression across baseline and intervention conditions
  • Figure B.2: Frequency of mands for assistance across baseline and intervention conditions
  • Figure B.3: Frequency of disruptive behavior during baseline, noncontingent reinforcement (NCR), and differential reinforcement of alternative behavior (DRA)
  • Figure B.4: Percent of multiplication facts answered correctly across baseline, an FR 5 schedule of reinforcement, and an FR 1 schedule of reinforcement
  • Figure B.5: Rate of behavior
  • Figure B.8: Percent of intervals under A and B conditions

Appendix C: Sample Graphs Answer Sheet

Figure B.1:
  • Vertical axis range is incorrect (should be 0–100%)
  • Missing condition labels

Figure B.2:
  • Horizontal axis is not in equal intervals
  • Missing the vertical axis label
  • Gridlines are visible

Figure B.3:
  • Border lines need to be removed
  • Data markers need to be differentiated so that one can determine which data path represents which variable in the second phase
  • The key is missing

Figure B.4:
  • Horizontal axis labels are missing
  • Vertical axis labels are missing
  • The condition change line is missing
  • The condition labels are missing

Figure B.5:
  • Data markers need to be revised so that one can distinguish between the two data paths (suggested to increase the size of the markers and select markers that are more easily differentiated)
  • Data paths need to be revised so that one can distinguish between the two data paths (suggested to decrease the thickness of the lines so that marker shapes are more distinguishable)

Figure B.6:
  • Graph should only use black ink
  • The vertical axis is missing

Figure B.7:
  • The vertical axis range is too large; it should be scaled 1–3
  • The figure caption is unclear
  • The vertical axis label is unclear

Figure B.8:
  • Condition labels are unclear
  • Data markers need to be revised because it is unclear whether the same behavior is graphed in the first and second conditions
  • The vertical axis range is likely inaccurate; if graphing the percent of intervals, the values would likely range closer to 0–100%
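To make these corrections concrete, below is a minimal matplotlib sketch with made-up data (not material from the chapter) showing how several of the recurring issues above can be avoided: a 0–100% vertical axis, an equal-interval horizontal axis, no gridlines, black ink only, a condition change line, and condition labels.

```python
# Minimal sketch (hypothetical data) of an AB-style line graph that avoids the
# common problems noted above: correct 0-100% axis range, equal-interval
# horizontal axis, no gridlines, black ink only, a labelled condition change
# line, and condition labels above each phase.
import matplotlib.pyplot as plt

sessions = list(range(1, 11))
percent_intervals = [80, 75, 85, 78, 40, 30, 22, 15, 10, 8]   # made-up values
baseline_sessions = 4                                          # baseline ends at session 4

fig, ax = plt.subplots(figsize=(5, 3))
# Data path is not connected across the phase change, so plot each phase separately.
ax.plot(sessions[:baseline_sessions], percent_intervals[:baseline_sessions], "ko-")
ax.plot(sessions[baseline_sessions:], percent_intervals[baseline_sessions:], "ko-")
ax.axvline(baseline_sessions + 0.5, color="black", linestyle=":")  # condition change line
ax.set_ylim(0, 100)                          # percent data: scale the vertical axis 0-100%
ax.set_xticks(sessions)                      # equal-interval horizontal axis
ax.set_xlabel("Sessions")
ax.set_ylabel("Percent of intervals with target behavior")
ax.text(2.5, 92, "Baseline", ha="center")    # condition labels
ax.text(7.5, 92, "Intervention", ha="center")
ax.grid(False)                               # no visible gridlines
for side in ("top", "right"):                # remove extra border lines
    ax.spines[side].set_visible(False)
fig.tight_layout()
plt.show()
```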

Appendix D: Graphing Practice Data Sets

  • Data Set One: Percent of intervals
  • Data Set Two: Percent of trials
  • Data Set Three: Rate
  • Data Set Four: Duration (in seconds)
  • Data Set Five: Frequency
  • Data Set Six: Percent of trials
  • Data Set Seven: Rate
  • Data Set Eight: Duration (in minutes)
  • Data Set Nine: Percent of intervals
  • Data Set Ten: Frequency

Appendix E: Graphs for Visual Analysis

  • Figure 9: Rate of disruptive behavior across baseline and intervention conditions
  • Figure 10: Percent of intervals with target two-word mands across baseline and intervention conditions
  • Figure 11: Percent of intervals with property destruction across baseline and intervention conditions
  • Figure 12: Percent of intervals with crying across baseline and intervention conditions
  • Figure 13: Percent of opportunities in which the client correctly responded to a greeting across baseline and intervention conditions
  • Figure 14: Percent of opportunities in which the client responded to his name by orienting to the speaker across baseline and intervention conditions
  • Figure 15: Percent of opportunities in which the client engaged in self-injury across baseline and intervention conditions
  • Figure 16: Percent of opportunities in which the client correctly reported his address across baseline and intervention


Davis, T.N., Akers, J.S. (2022). Graphing, Interpreting Graphs, and Experimental Designs. In: A Behavior Analyst’s Guide to Supervising Fieldwork. Springer, Cham. https://doi.org/10.1007/978-3-031-09932-8_6



Visibility graph analysis for brain: scoping review

In the past two decades, network-based analysis has garnered considerable attention for analyzing time series data across various fields. Time series data can be transformed into graphs or networks using different methods, with the visibility graph (VG) being a widely utilized approach. The VG holds extensive applications in comprehending, identifying, and predicting specific characteristics of time series data. Its practicality extends to domains such as medicine, economics, meteorology, tourism, and others. This research presents a scoping review of scholarly articles published in reputable English-language journals and conferences, focusing on VG-based analysis methods related to brain disorders. The aim is to provide a foundation for further and future research endeavors, beginning with an introduction to the VG and its various types. To achieve this, a systematic search and refinement of relevant articles were conducted in two prominent scientific databases: Google Scholar and Scopus. A total of 51 eligible articles were selected for a comprehensive analysis of the topic. These articles were categorized based on publication year, type of VG used, rationale for utilization, machine learning algorithms employed, frequently occurring keywords, top authors and universities, evaluation metrics, applied network properties, and brain disorders examined, such as Epilepsy, Alzheimer’s disease, Autism, Alcoholism, Sleep disorders, Fatigue, Depression, and other related conditions. Moreover, there are recommendations for future advancements in research, which involve utilizing cutting-edge techniques like graph machine learning and deep learning. Additionally, the exploration of understudied medical conditions such as attention deficit hyperactivity disorder and Parkinson’s disease is also suggested.

1. Introduction

The human brain is undoubtedly one of the most complex and mysterious organs in the human body. Understanding the neural mechanisms of brain activities has posed a significant challenge for scientists ( Sporns and Honey, 2006 ). Specifically, the brain is a network of numerous different regions, each with its own specific function and task, constantly sharing information with each other ( van den Heuvel and Hulshoff Pol, 2010 ). To comprehend brain function or pairwise interactions between different regions of the brain, researchers often rely on non-invasive techniques such as functional magnetic resonance imaging (fMRI), which analyzes structural and functional modifications in brain disorders ( Matthews and Jezzard, 2004 ). Researchers also make use of electroencephalography (EEG) and magnetoencephalography (MEG), non-invasive methods for recording the brain’s electrical activity that allow them to record and analyze brain activity without harming the subject.

The utilization of time-series data obtained from non-invasive techniques such as fMRI, EEG, and MEG is one of the most valuable resources for carrying out computations and studies related to the brain ( Yu et al., 2020 ). Several methods exist for analyzing time-series data, one of the most recent and significant being the visibility graph (VG), proposed by Lacasa et al. (2008) . This method maps the time-series data to a graph and then performs computations on that graph. It has emerged as an important tool for gaining insights into brain function and inter-regional interactions.

The use of VGs in brain research is just one example of the many computational methods that are being developed to understand the brain. Machine learning techniques, graph theory and network methods ( van den Heuvel and Hulshoff Pol, 2010 ), dimensionality reduction techniques, fuzzy models ( Yu et al., 2020 ), etc. are all being used to analyze brain data and gain insights into brain function. By analyzing time-series data of brain activity, researchers are able to obtain valuable information for the diagnosis, prediction, and analysis of brain diseases and related applications. For instance, early diagnosis of a disease can slow down its progression and even lead to its cure. Nevertheless, network theory cannot be applied to time series data directly, because such data lacks an explicitly relational structure. Consequently, researchers have devised methods like VG analysis to harness the potential of network theory in analyzing and predicting brain time series data. This approach enables the exploration of new techniques and interpretations, thereby expanding the scope of research in this field.

Because VG analysis is still a young field within brain research, we chose a scoping review to survey the topic. A scoping review is a type of literature review that aims to map the existing literature on a particular topic, identify gaps in the research, and provide an overview of the available evidence. Unlike a systematic review, which focuses on answering a specific research question using a predefined set of criteria, a scoping review is more exploratory in nature and can be used to identify research gaps and inform the development of future research questions. Scoping reviews can be particularly useful in fields where the literature is rapidly evolving or where there is a large volume of research on a particular topic ( Munn et al., 2018 ).

In this article, we aim to provide an overview of research related to the brain’s VG, which has demonstrated significant potential for analysis and promising results. The focus of our review is to explore the scope and breadth of brain-related research that utilizes VG analysis, with the goal of providing insights and knowledge that can guide and inspire future research opportunities. Understanding the brain and its mechanisms is crucial for advancing medical treatments, developing new therapies for neurological disorders, and improving overall human health. Our review aims to highlight the importance of VG analysis in this pursuit and its potential to revolutionize the field of brain research. Specifically, our research aims to address the following questions:

  • 1. What is the VG, and how is it used in brain research?
  • 2. What are the current applications of VG analysis in brain research, and what are the limitations and challenges associated with these applications?
  • 3. What are the future research opportunities and directions in brain research utilizing VG analysis, and how can these opportunities be pursued to advance our understanding of brain function and neurological disorders?

Our paper is structured into five sections. After the introduction in the first section, Section 2 introduces the concept of VGs and their types, followed by an explanation of VG analysis, its definition, and purpose. Section 3 details the research methodology, including eligibility criteria for paper selection, information sources, and the PRISMA flow diagram. In Section 4, we present our findings and interpretations using various diagram formats and perspectives to enhance our analysis and deepen the understanding of the brain’s VG for disorders. Lastly, Section 5 summarizes our findings, discusses future research potential, and proposes avenues for further investigation to advance knowledge in this field.

2. Background

2.1. Visibility graph

The VG is a potent tool for analyzing time series data by mapping it onto a network. In its creation method, each time series sample is represented as a node in a graph, with edges between nodes defined based on their mutual visibility. Specifically, to create an edge between two nodes (t_i, y_i) and (t_j, y_j), it must be determined whether the two corresponding time series samples can “see” each other, which is checked against every intermediate sample. Here, t_i stands for the i-th time point in the time series, and y_i is the value associated with that point. This is known as visibility, and two vertices are connected provided that every intermediate sample (t_k, y_k), with t_i < t_k < t_j, satisfies the following condition:

y_k < y_j + (y_i − y_j) · (t_j − t_k) / (t_j − t_i)    (Equation 1)

This algorithm has wide applicability across many domains, including medicine, economics, and the social sciences. By mapping time series data in this way, it becomes possible to search for patterns and relationships within the data, which can lead to new insights and discoveries. The resulting graph is a simple graph with three properties ( Figure 1 ): it is connected, it is undirected, and it preserves information from the original time series through the visibility condition in Equation 1. In summary, the VG approach provides a highly effective framework for analyzing time series data. Its ability to capture complex patterns and relationships within the data makes it a valuable tool for researchers and analysts in a range of fields.
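As a concrete illustration (a minimal sketch, not code from any of the reviewed studies, and assuming an evenly sampled series and the networkx library), the natural VG can be built with a brute-force check of Equation 1:

```python
# Minimal sketch of natural visibility graph (VG) construction, assuming networkx.
# For each pair of samples (t_i, y_i), (t_j, y_j), an edge is added when every
# intermediate sample (t_k, y_k) lies strictly below the straight line joining
# them, i.e. when the visibility condition (Equation 1) holds.
import networkx as nx

def natural_visibility_graph(series):
    """Map a 1-D, evenly sampled time series onto its natural VG (O(n^2) pair check)."""
    n = len(series)
    g = nx.Graph()
    g.add_nodes_from(range(n))
    for i in range(n):
        for j in range(i + 1, n):
            visible = True
            for k in range(i + 1, j):
                # Height of the i-j sightline at time k (t_k = k for evenly sampled data)
                sightline = series[j] + (series[i] - series[j]) * (j - k) / (j - i)
                if series[k] >= sightline:
                    visible = False
                    break
            if visible:
                g.add_edge(i, j)
    return g

# Toy example on a short, evenly sampled segment
toy_series = [0.87, 0.49, 0.36, 0.83, 0.87, 0.49, 0.36, 0.83]
vg = natural_visibility_graph(toy_series)
print(vg.number_of_nodes(), vg.number_of_edges())
```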

Figure 1. (A) A time series bar chart and its extracted graph (graph nodes correspond to time points on the horizontal axis, and graph edges connect values that meet the condition in Equation 1), and (B) the resulting graph from mapping the time series.

The previous description pertained to the natural VG, commonly referred to simply as the VG. In addition, several other types of VGs have been developed, each with its own specific characteristics and applications. Here, we list the most commonly used types of VGs in brain-related studies, along with a brief description of each one. We also provide a table, Table 3 , that summarizes the brain disorders that have been associated with each type of VG, together with the related references:

Table 3. Association of visibility graph type with investigated brain disorder and related references.

  • Horizontal visibility graph (HVG): similar to the natural VG, but nodes are connected by edges if the data points are visible to each other along a horizontal line of sight, offering a simplified representation of the time series ( Luque et al., 2009 ); a minimal sketch of this criterion follows the list below.
  • Limited penetration visibility graph (LPVG): a modified form of the VG where edges are allowed between nodes within a certain distance threshold, offering a more localized representation of the underlying patterns in a time series to reduce the effect of noise in the data ( Ting-Ting et al., 2012 ).
  • Visibility graph similarity (VGS): the average similarity estimate between two graphs is created with the help of mutual correlation between a sequence of degrees obtained from the time series interval in the VG ( Ahmadlou and Adeli, 2012 ).
  • Weighted visibility graph (WVG): a version of the natural VG where edges between nodes are assigned weights based on the degree of visibility between corresponding data points ( Supriya et al., 2016a ).
  • Weighted horizontal visibility graph (WHVG): an advanced version of the HVG that also considers the weight of edges and weakens the effect of nodes at long distances ( Zhu et al., 2014b ).
  • Power of scale-freeness visibility graph (PSVG): states that the degree distribution of its nodes satisfies a power law, and hence the extracted network has the scale-free property ( Lacasa et al., 2008 , 2009 ; Ahmadlou et al., 2012 ).
  • Multilayer visibility graph (MVG): transforms multidimensional time series into multilayer networks for extracting high-dimensional information after analyzing the feature structure of the network ( Nicosia et al., 2014 ).
  • Difference visibility graph (DVG): a graph whose edges and degrees are equal to the difference of the edges and degrees of a VG and an HVG with fixed nodes, which is very useful for acquiring the fundamental features of the signal ( Zhu et al., 2014a ).
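As a rough companion to the earlier sketch, and again only an illustrative implementation assuming evenly sampled data and networkx, the HVG criterion reduces to requiring every intermediate value to be strictly lower than both endpoints:

```python
# Minimal sketch of the horizontal visibility graph (HVG): nodes i and j are
# connected when every intermediate value is strictly smaller than both
# endpoints, i.e. y_k < min(y_i, y_j) for all i < k < j.
import networkx as nx

def horizontal_visibility_graph(series):
    n = len(series)
    g = nx.Graph()
    g.add_nodes_from(range(n))
    for i in range(n):
        for j in range(i + 1, n):
            if all(series[k] < min(series[i], series[j]) for k in range(i + 1, j)):
                g.add_edge(i, j)
    return g

hvg = horizontal_visibility_graph([0.87, 0.49, 0.36, 0.83, 0.87, 0.49, 0.36, 0.83])
print(sorted(hvg.degree()))  # degree sequence, the kind of quantity examined in PSVG-style analyses
```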

The VG derived from brain time series data should be interpreted cautiously, as it does not necessarily reflect the true intricacies of the real brain’s network structure. This graph is a simplified mathematical construct based on observable time series data, and it does not correspond directly to the actual connectivity between neurons or regions of interest within the brain. While the VG may reveal certain patterns or relationships within the time series data, it cannot provide insights into the specific neural connections, synapses, or the underlying physical architecture of the brain. It is a representation of data in graph format, where edges and nodes are constructed by a mathematical algorithm, and these connections may not hold biological significance.

2.2. Visibility graph analysis

Visibility graph analysis is a straightforward and efficient method for analyzing time series data by examining graph features. This method has numerous applications in brain analysis, providing valuable information on the fundamental characteristics of brain networks. By utilizing advanced network detection, prediction, and analysis techniques, this approach assists researchers in the field of brain diseases in gaining comprehensive and informative insights. As a result, this analysis method plays a crucial role in enhancing our understanding of brain network dynamics and in developing effective treatments for brain-related conditions. By analyzing the local topological properties of the resulting networks with graph-theoretic measures, we can extract valuable information about brain features.

Network analysis metrics can be defined based on different network features, including connectivity, centrality, and distance ( Artameeyanant et al., 2017 ). Connectivity-based metrics include average degree, average clustering coefficient, modularity, and density; centrality-based metrics measure the dominance and closeness of central nodes; and distance-based metrics assess average shortest path length, global efficiency, and network diameter. The following sections examine these metrics. Other network analysis metrics, such as sparsity, density, and small-world or scale-free properties, can be used to determine more precise network features.
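For illustration, once a time series has been mapped to a VG (as in the earlier sketch), many of these connectivity-, centrality-, and distance-based metrics can be computed directly with standard networkx calls; the small stand-in graph below is only a placeholder for a real VG.

```python
# Sketch of common network analysis metrics for a visibility graph, using networkx.
import networkx as nx

def vg_metrics(g: nx.Graph) -> dict:
    """Collect a few connectivity-, centrality-, and distance-based metrics."""
    return {
        "average_degree": sum(d for _, d in g.degree()) / g.number_of_nodes(),
        "average_clustering": nx.average_clustering(g),           # connectivity-based
        "density": nx.density(g),
        "degree_centrality": nx.degree_centrality(g),             # centrality-based
        "diameter": nx.diameter(g),                                # distance-based (VGs are connected)
        "average_shortest_path": nx.average_shortest_path_length(g),
        "global_efficiency": nx.global_efficiency(g),
    }

# Stand-in example; in practice, pass a VG built from a real EEG/fMRI time series.
print(vg_metrics(nx.path_graph(6)))
```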

Table 1 lists the most common metrics based on our article review. For instance, Ji et al. (2016) found that the degree distribution of the extracted brain network of individuals with job stress increases with the k coefficient, while that of normal individuals decreases with it. Similarly, Wang et al. (2018) , by analyzing brain network features, showed that the connections in the brain network of individuals with Alzheimer’s disease are more scattered than those of healthy individuals, indicating a scale-free property, and that the range of connections in their brain network is reduced. Therefore, calculating and studying network analysis metrics in the graph-theoretical domain can be highly beneficial.

Table 1. Simple description of network analysis metrics.

Visibility graph analysis has demonstrated utility in various applications within the field of brain research. Specifically, it can be categorized into three distinct areas of application as described below:

  • 1. Enhanced understanding: The application of novel graph-based analysis techniques on time series data has opened up new avenues for comprehending complex brain dynamics. By investigating network properties such as centrality measures or clustering coefficients, researchers can uncover previously unknown concepts in the time series, such as crucial time points or regions that hold significance. We will refer to this property in the paper as VG analysis.
  • 2. Disease diagnosis: VG analysis proves valuable in the classification of different states of disorders, such as Epilepsy, where two distinct states are present. Leveraging this analysis technique, it becomes possible to accurately classify and differentiate between various time series associated with different disorders, aiding in the diagnosis process.
  • 3. Disease prediction: Utilizing time series data from individuals with specific disorders, VG analysis offers the potential to predict future changes or anticipate trends. By applying this analysis methodology, researchers can derive valuable insights into the trajectory of the disease, enabling early detection and intervention strategies.

3. Methodology

In this article, a scoping review was conducted following the PRISMA methodology ( Tricco et al., 2018 ). The eligibility criteria for selecting studies and the methods for collecting them were established and are discussed in the upcoming sections.

3.1. Eligibility criteria

For the purposes of this scoping review, we established strict eligibility criteria to identify relevant studies for inclusion. Our search strategy targeted original research articles and conference proceedings on the topic of VG analysis. To ensure consistency and accuracy, we limited our search to studies published in English. We also sought to include studies examining the application of VG analysis in the context of brain disorders, including but not limited to Alzheimer’s disease, epilepsy, and fatigue. To ensure the quality and relevance of studies considered for inclusion, we excluded works in progress, editorials, dissertation papers, book chapters, and position papers. The application of these eligibility criteria allowed us to comprehensively identify and evaluate studies that met our research objectives.

3.2. Information sources and search strategy

To identify relevant studies for a scoping review, a comprehensive search was conducted between 4 January 2022 and 4 June 2022. Various platforms, including Google Scholar, Scopus, and search article functions provided by leading journal publishers such as Elsevier, Springer, and Taylor & Francis, were utilized to retrieve high-quality scholarly content from scientific journals, books, and conference proceedings. An advanced search technique with an “AND” condition was used to design the search strategy, focusing on the most relevant studies published between 2008 and 2023. Google Scholar search results were limited to the first 20 pages, and citing articles of the retrieved search results were also examined for potentially valuable articles. Date and topic filters were applied within Google Scholar to ensure efficient and precise search results. Scopus, the largest abstract and citation database, was instrumental in retrieving relevant studies. The initial search terms were “visibility graph analysis,” “EEG,” or “brain,” with advanced search filters limiting results to articles published from 2008 onward. The inclusion of article titles, keywords, and abstracts narrowed the search to 61 relevant articles, which were further refined to 53 articles and two doctoral dissertations based on full-text analysis.

The search article functions provided by eight leading journal publishers were also utilized to broaden the search for relevant articles and confirm the findings obtained from Scopus. This comprehensive search strategy, combined with careful selection criteria and filters, resulted in a thorough exploration of the literature, ensuring that relevant studies were included in the scoping review. It is worth mentioning that the year 2008 marks the inception of the VG concept by Lacasa et al. (2008) , signifying the foundational period for the development of this scoping review research.

3.3. Study selection

Figure 2 illustrates the PRISMA diagram, which outlines the detailed process followed in this scoping review. All articles related to VG analysis in various brain disorders were identified from the selected databases. Subsequently, the abstracts of these articles were carefully reviewed to assess their suitability based on the predefined inclusion criteria. To ensure reliability and consistency, two researchers independently screened the abstracts. In cases where there were differences in opinion, the researchers engaged in thorough discussions and deliberations until a consensus was reached. During the abstract screening phase, the researchers scrutinized the content of each paper to determine its alignment with the inclusion criteria established for this study. This step was crucial in ensuring that only relevant articles were considered for further analysis. By involving multiple advisors and facilitating discussions, the review process aimed to minimize bias and enhance the reliability of article selection. The PRISMA diagram provides a visual representation of this systematic approach, demonstrating the systematic identification and screening of articles, as well as the collaborative decision-making process employed by the researchers. The diagram serves as a comprehensive overview of the study selection process, showcasing the rigorous and meticulous methodology followed in this scoping review.

Figure 2. A PRISMA flow diagram illustrating the process of selecting relevant articles from reputable scientific databases. The variable “n” represents the number of articles at each stage.

4. Results and discussion

This section provides an overview of research findings on VG analysis in the brain, examined from various perspectives. To enhance the understanding and applicability of the results, we have utilized different visualization methods to present the findings in a concise and comprehensible manner within a compact format. Table 2 provides a summary of the section’s content, highlighting the comprehensive coverage of diverse aspects. The aim is to present the results in a manner that maximizes their utility and facilitates further investigation.

Table 2. Overview of the research results, summarizing the captions of the tables and figures and their visualization type.

Figure 3 provides a timeline of the analyzed articles by year and country; the first article in this field was published in 2010. The chart shows that more than 85% of the published articles are from 2016 onward, indicating growing interest in using VG analysis for brain networks. China contributes the largest share of researchers in this field, accounting for approximately 10.14% of the article publications and emerging as the leading country, followed by Australia, India, and the United States, among others. The distribution of records is also noticeably skewed toward these few countries, which therefore have the greatest influence on the overall picture.

Figure 3. Number of published articles by year and country.

The statistics in Figure 3 also underscore the growing importance of VG analysis in the field of neuroscience and emphasize the need for continued investment in research and development to further explore this powerful analytical technique. The overall increasing slope of the growth line for VG analysis in brain networks indicates that this field is gaining momentum and will become even more important in the future, with the year-by-year trend pointing to sustained interest and activity in this area. Additionally, the emergence of the United States as a significant contributor to the field of graph analysis in recent years highlights the importance of continued investment in research and development in this area for developed countries.

The leading universities with a high number of publications, exceeding three papers each, include Tianjin University in China, Izmir University in Turkey, and the University of Southern Queensland in Australia. Although the individual publication counts for each university may not be substantial, it is important to highlight the diverse representation of countries involved. This diversity, together with the range of departments and research fields of the contributing institutions, suggests a promising future for the proliferation of research and the emergence of interdisciplinary collaborations among researchers in areas such as economics, medicine, and engineering. In addition, Jiang Wang, Aydin Akan, and Yan Li have made the greatest contributions to publishing articles related to the analysis of brain VGs. This suggests that these authors are prominent in this field and that their work has had a significant impact on the direction of research within the area of brain analysis. A statistical analysis of the publication data of these authors may provide additional insights into popular research directions. Consequently, researchers can use tools like bibliometric analysis to evaluate the productivity and impact of different authors, institutions, and fields of study, which can facilitate better decision-making in planning research strategies and collaborations.

Before proceeding, it is worth examining the distribution of publications across different journals specifically pertaining to VG research within the realm of neuroscience. The data presented in Figure 4 provide valuable insights into the distribution of publications in the field of brain graph analysis, highlighting the importance of different publication types and the impact of different publishers in this area. The pie chart in Figure 4A depicts the percentage of publications in journals and conferences. The fact that 21 of the 51 published articles were conference papers suggests that conference proceedings are an important venue for disseminating research in this field. In addition to identifying the most popular journals for publishing research in brain graph analysis, the data also provide insights into the overall distribution of publications across different publication types. The dominance of IEEE, Springer, and Elsevier publications in this area highlights the importance of these publishers in the field of brain graph analysis ( Figure 4B ). This may be due to a variety of factors, such as the quality of the peer-review process, the reputation of the journals, or the accessibility of the publications. As more research is conducted in this field, it will be interesting to see how the distribution of publications evolves and how different publishers and publication types continue to contribute to the growth and development of this area. A deeper investigation reveals that IEEE Xplore among conference venues, and Physica A: Statistical Mechanics and its Applications among journals, are the top publishers in this area.

Figure 4. (A) Diversity of published articles in two categories: journal and conference. (B) Top publishers and journals in the field of brain graph analysis.

A statistical investigation of publication trends in VG analysis of brain networks, based on the number of authors and citations per article, reveals an average author count of 3.69 per article, indicating a relatively low level of collaboration among researchers in this multidisciplinary field. However, it is noteworthy that the maximum number of citations for a single paper is 323, which demonstrates the high level of interest in and relevance of the topic. The impact of VG analysis in neuroscience research is exemplified by the study conducted by Zhu et al. (2014a) , which utilized this approach to analyze sleep states; the resulting article has been cited a remarkable 323 times, underscoring the potential influence and significance of this method in the field. Moreover, the potential of VG analysis to bring researchers together in a single team is demonstrated by another study focused on Alzheimer’s disease identification using the WVG approach and fuzzy learning ( Yu et al., 2020 ), which involved seven authors, highlighting the collaborative nature of research efforts in this area. In summary, the citation and co-authorship data emphasize the need for continued investment in research and development to further explore the potential of this powerful analytical technique.

Since VG analysis has been widely used in various fields for different purposes in processing time series data of brain activity, we classified the application types and found that this approach is mostly used for three distinct purposes: analysis, followed by diagnosis and prediction, plus other specialized applications. Furthermore, the top categories of brain diseases studied using VG analysis were identified as Epilepsy, Sleep state, Alzheimer’s disease, Depression, Autism, Alcoholism, Fatigue, Down syndrome, and other conditions ( Sengupta et al., 2013 ; Pei et al., 2014 ; Ahmadi and Pechenizkiy, 2016 ; Ji et al., 2016 ; Zhu et al., 2018 ; Samanta et al., 2019 ; Ozel et al., 2020 ; Cui et al., 2021 ; Varley and Sporns, 2021 ; Huang-Jing et al., 2023 ). Accordingly, Figure 5 illustrates the classification of these applications. It is evident that most of the research effort has been directed toward the diagnosis of epilepsy, with relatively less attention paid to prediction and other related areas of brain disorders. Furthermore, after Epilepsy, VG analysis has been most commonly applied in the context of Alzheimer’s disease. These findings demonstrate the potential of VG analysis to contribute to the diagnosis and treatment of various brain-related diseases. Moreover, the results suggest that further research efforts are needed to explore the full potential of this analytical technique in other areas of brain disease, such as depression and fatigue.

Figure 5. General categorization of visibility graph analysis applications and associated brain disorders.

To learn more about the VG types that have been used to study each of the common brain disorders, one can create a table that lists each type of VG in a row, along with the brain disorders that have been studied using that type of graph and the related references. This would be similar to Table 3 , but it would also include less common VG types that have only been used for special applications, grouped under a category named “other,” such as WHVG-TE 1 ( Isfahani et al., 2022 ), WLPVG 2 ( Pei et al., 2014 ), 2DHVG 3 ( Huang-Jing et al., 2023 ), etc.

By examining the VG analysis applications closely, we can clearly see that Epilepsy and Alzheimer’s disease collectively account for approximately half of all investigated disorders ( Figure 6 ). Epilepsy is a complex medical condition that has been the subject of extensive analysis and diagnosis by experts since 2010. It is evident from statistical data that Epilepsy continues to affect a significant number of individuals worldwide, leading to high morbidity rates and an overall decrease in quality of life. In addition, according to the Alzheimer’s Association, Alzheimer’s disease is the most common cause of dementia and accounts for 60–80% of dementia cases ( Alzheimer’s Association, 2023 ). As such, researchers have been actively seeking ways to improve diagnosis and treatment options for these neurological disorders.

Figure 6. Ratio of disorders examined using visibility graph analysis.

Moreover, the study of sleep disorders and autism has gained momentum in the research community due to their potential impact on cognitive function and overall health. With advancements in technology and increased awareness, experts hope to gain a better understanding of these conditions and develop effective interventions to improve patient outcomes ( Distefano et al., 2023 ). Overall, the field of neurology continues to evolve rapidly, with new discoveries and innovations paving the way for improved healthcare delivery and patient care.

A more detailed perspective stemming from Figure 5 involves examining brain disorders from various viewpoints, such as analysis, detection, prediction, and other aspects. Figure 7 , which adopts this visualization approach, demonstrates that while analysis and detection have been the subject of ongoing research across different diseases over the years, prediction and other differentiated research on VG analysis have received comparatively less attention. This disparity in focus could be attributed to the nature of brain time series data, where prediction may be considered less feasible than analysis and diagnosis. Alternatively, it could signify the need for further exploration and investigation of the potential in the realm of prediction and other related areas.

Figure 7. Time distribution of the different applications of visibility graph analysis associated with brain disorders.

Interestingly, VG analysis for the brain can be divided into several categories depending on the type of network used. Through our analysis of 53 research papers, we have identified the top eight most commonly used types of VGs, described in section 2.1, along with their applications, shown in Figure 8 . The natural VG has attracted the most interest among researchers due to its simple and direct formulation, and it accounts for the largest share of practical applications, approximately 36.36% of all papers. In addition, the HVG has been widely used due to its simple graph substructure. VGS, followed by LPVG and WVG, have been the next most popular types of VGs in the field of brain applications. While epilepsy has been investigated mainly with the VG and HVG, most researchers in the field of Alzheimer’s disease have used VG, PSVG, LPVG, WVG, and HVG, which are among the top eight graph types. Additionally, researchers have focused on the use of VG, PSVG, and MVG in graph-theoretic analyses of autism spectrum disorder.

Figure 8. Different visibility graph shares related to each brain disorder.

During our review, we found that among the metrics used for network analysis, those listed in Table 1 were the most commonly used. Figure 9 shows a tree map view of these metrics based on their usage frequency. For brain VG analysis, degree, clustering coefficient, and degree distribution are the three most frequently leveraged metrics, because they provide important information about the nature of the network in an easy-to-understand manner. These metrics can provide valuable insights into the structure and behavior of a network and are essential tools for understanding complex systems. They have also been widely studied and applied in various fields: degree centrality has been used to study social, communication, and biological networks; the clustering coefficient has been used to analyze brain networks and power grids; and the degree distribution has been used in the study of technological, transportation, and financial networks. Overall, the use of these metrics is critical for gaining a deeper understanding of complex systems and has become an essential part of network analysis.

Figure 9. Commonly used network analysis metrics for brain visibility graph analysis.

Researchers generally use statistical metrics to evaluate models. These statistical metrics are essential for evaluating the performance and effectiveness of models in network analysis. For example, accuracy is often used to measure the degree of agreement between predicted and actual values, while ANOVA is used to analyze variance between groups. Other statistical metrics such as correlation coefficients, regression analysis, and hypothesis testing can also provide valuable insights into the relationships between variables and the overall structure of the network. Accordingly, Figure 10 below shows 13 of the most commonly used statistical approaches in network analysis. Over 22 articles have used accuracy as a metric, while ANOVA has been used as a statistical metric in 19 articles. Additionally, 12 articles have employed statistical mean to evaluate their approach and model in their research. Some researchers have used more than one statistical measure simultaneously in one paper. These statistical techniques provide a powerful toolset for understanding networks and can help us gain new insights into the underlying mechanisms driving complex systems. Overall, statistical metrics are an indispensable component of network analysis and play a vital role in advancing our understanding of complex systems.

Figure 10. Commonly used evaluation measures for brain visibility graph analysis based on their frequency.

There are several popular machine learning classification algorithms that have gained significant attention and success in domains such as brain research. One such algorithm is the support vector machine (SVM), which aims to find an optimal hyperplane that separates different classes by maximizing the margin between them. SVMs are known for their ability to handle high-dimensional data and can effectively handle both linear and non-linear classification tasks through the use of kernel functions. Another widely used algorithm is the Random Forest, which combines multiple decision trees to create a robust and accurate classifier. By aggregating the predictions of individual trees, Random Forest can handle complex datasets, cope with missing values, and provide feature importance rankings. Additionally, Logistic Regression is a simple yet powerful algorithm commonly used for binary classification tasks. It models the relationship between input features and the probability of belonging to a particular class using a logistic function. Logistic Regression is computationally efficient, interpretable, and can handle large datasets. Moreover, K-nearest neighbors (KNN) is a non-parametric machine learning algorithm that classifies new data points based on their proximity to the labeled examples in the training set. These are just a few examples of the popular machine learning classification algorithms that have proven their effectiveness in solving a wide range of classification problems ( Dash et al., 2021 ).

During our literature review, we collected information on the machine learning methods used in brain visibility network analysis ( Figure 11 ). In 22.22% of the articles, SVM was used as the machine learning approach for classification tasks, while KNN was used in 12.5% of the articles and RBF (radial basis function) classifiers in 6.94%. It is interesting to note that dimensionality reduction techniques such as LDA (linear discriminant analysis) and PCA (principal component analysis) were also used in some of the articles. These methods are crucial for reducing the complexity of high-dimensional data and extracting meaningful features from it. Overall, the use of machine learning and dimensionality reduction techniques in brain VG analysis has provided new avenues for exploring complex systems and has significantly influenced the field’s research direction, especially for diagnostic applications in brain disorders.
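As an illustration of the kind of pipeline these figures imply (a hedged sketch with synthetic features and labels, not a reconstruction of any reviewed study), VG-derived graph metrics are typically assembled into per-subject feature vectors, optionally reduced with PCA, and then classified with an SVM or KNN:

```python
# Sketch of a typical VG-based classification pipeline: graph metrics as features,
# PCA for dimensionality reduction, and an SVM (or KNN) classifier.
# The feature matrix here is synthetic; in practice each row would hold metrics
# (degree, clustering coefficient, path length, ...) computed from one subject's VG.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 8))          # 60 subjects x 8 VG metrics (synthetic)
y = rng.integers(0, 2, size=60)       # 0 = control, 1 = patient (synthetic labels)

svm_pipeline = make_pipeline(StandardScaler(), PCA(n_components=4), SVC(kernel="rbf"))
knn_pipeline = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))

print("SVM accuracy:", cross_val_score(svm_pipeline, X, y, cv=5, scoring="accuracy").mean())
print("KNN accuracy:", cross_val_score(knn_pipeline, X, y, cv=5, scoring="accuracy").mean())
```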

Figure 11. Most commonly utilized machine learning algorithms in visibility graph analysis.

During the review process, the most frequently used key terms were “Visibility Graph,” “EEG,” “Complex Networks,” and “Epilepsy,” in that order, as indicated in Figure 12 . This suggests that these concepts are currently dominant and prominent in research related to the field being studied. Statistical analysis of the frequency of these terms can provide insights into the trends and patterns of research in this area and help researchers identify important topics and areas for further exploration. Understanding the relationships between these key terms can also provide valuable information about the underlying structures and mechanisms of the phenomena being studied. This type of statistical analysis serves as a useful tool for researchers seeking to stay up-to-date with the latest trends and developments in their field.

Figure 12. Most commonly used keywords in reviewed articles.

Clearly, the data used in research on VG analysis of brain diseases can provide valuable clues. The sample size is a crucial factor in any study or analysis, as it can affect the accuracy and reliability of the results; therefore, researchers must carefully consider their sample sizes to ensure the validity of their findings. Our investigation shows that the highest number of data samples in brain VG analysis, 295, was used for sleep-related studies ( Xiong et al., 2019 ), while the lowest, 5, was used for epilepsy studies ( Modak et al., 2020 ). Likewise, the youngest participant in the sampled studies was 1.5 years old, in an epilepsy dataset, while the oldest participant was 85 years old, in an Alzheimer’s dataset. It is also interesting to note that the maximum number of female participants was 28, in an article ( Yu et al., 2020 ) focusing on Alzheimer’s disease detection.

In comparison, the brain VG is more commonly applied to EEG time series data than to fMRI data. This is because EEG data is a continuous signal recorded over time, while fMRI data is a discrete signal collected at a series of time points. The continuous nature of EEG data makes it more suitable for the construction of VGs, as it allows us to capture the temporal dynamics of brain activity.

Furthermore, EEG data is less expensive and more portable than fMRI data, making it more accessible to researchers. This has led to a wider use of EEG data in studies of brain connectivity, including the use of VGs. However, VGs can also be used with fMRI data. In fact, there are some studies that have shown that VGs can be used to identify functional connectivity in fMRI data ( Sannino et al., 2017 ). However, these studies are still in their early stages, and more research is needed to determine the effectiveness of VGs for fMRI data.

5. Conclusion and future works

Writing a scoping review for brain VG analysis is of paramount importance for several reasons. Firstly, it provides a comprehensive overview of the existing literature, identifying the breadth and depth of the research conducted in this field. This helps to identify gaps in the current knowledge and areas that require further investigation. Secondly, it aids in clarifying key concepts and theories used in brain VG analysis, thereby enhancing understanding and facilitating communication among researchers. Thirdly, it helps to identify the methodologies and tools used in previous studies, which can guide future research design. Lastly, a scoping review can help to identify the potential impacts and applications of brain VG analysis in various fields such as neuroscience, psychology, and clinical medicine, thereby informing policy and practice.

This article presents a comprehensive review of VG analysis in order to determine the practical and research diversity in the field of brain analysis. The study examined 51 articles and identified a significant number of publications in this area; on average, four articles were published each year over the course of 13 years. VG analysis can improve the diagnosis and prediction of brain diseases. The most common use of this method has been for diagnosing epilepsy; however, other brain diseases, particularly Parkinson’s disease and attention deficit hyperactivity disorder (ADHD), have received little or no attention in graph analysis research. Future studies may identify the reasons for these gaps and address them, or start new studies on brain disorders that have not yet been investigated.

The article acknowledges two limitations in its findings. First, it is possible that relevant published articles were excluded because they were not captured by our search terms. Second, other databases such as PubMed, ScienceDirect, Semantic Scholar, Web of Science, and IEEE Xplore were not used.

Future work in this field can focus on the integration of deep learning and graph machine learning techniques into brain VG analysis. The complexity and non-linear nature of brain networks necessitate the use of advanced machine learning methods that can capture these characteristics. Deep learning, with its ability to learn hierarchical representations, can be used to extract meaningful features from brain VGs. These features can then be used to classify different brain states or to predict outcomes in neurological disorders. On the other hand, graph machine learning, which is designed to work with graph data, can be used as an advanced way of processing the VG that incorporates additional, value-added information. The combination of these two approaches can lead to a more comprehensive understanding of brain signals and their role in health and disease. This will require the development of new algorithms and computational tools, as well as the collection and analysis of large-scale brain network data. The results of this research could have significant implications for the diagnosis and treatment of neurological disorders. Eventually, the image visibility graph (IVG), which is a novel approach for transforming images into VGs in a distinctive way ( Iacovacci and Lacasa, 2020 ), could potentially find application in the realm of neuroscience, particularly given the abundance of neuroimaging data accessible. This technique might offer a fresh perspective for analyzing brain images, enabling the construction of a network representation that captures the interconnections and communication patterns present within these images of the brain. Furthermore, IVG analysis may help identify disruptions or abnormalities in these images, aiding in the diagnosis and understanding of neurological disorders. In summary, the utilization of image VGs in brain research has the potential to revolutionize our comprehension of brain function and dysfunction, offering new avenues for both basic neuroscience research and clinical applications.

Author contributions

ZS: Data curation, Writing—original draft, Visualization. SS: Conceptualization, Validation, Formal analysis, Writing—review and editing, Supervision.

Funding Statement

The author(s) declare that no financial support was received for the research, authorship, and/or publication of this article.

1 Weighted horizontal visibility graph-transferable entropy.

2 Wavelet limited penetrable visibility graph.

3 Two-dimensional horizontal visibility graph.

Conflict of interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Publisher’s note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

  • Ahmadi N., Pechenizkiy M. (2016). “Application of horizontal visibility graph as a robust measure of neurophysiological signals synchrony,” in Proceedings of the 2016 IEEE 29th International Symposium on Computer-Based Medical Systems (CBMS) (Belfast: IEEE), 273–278.
  • Ahmadlou M., Adeli H., Adeli A. (2010). New diagnostic EEG markers of the Alzheimer’s disease using visibility graph. J. Neural Transm. 117, 1099–1109. doi: 10.1007/s00702-010-0450-3
  • Ahmadlou M., Adeli H., Adeli A. (2012). Improved visibility graph fractality with application for the diagnosis of autism spectrum disorder. Phys. A Stat. Mech. Appl. 391, 4720–4726.
  • Ahmadlou M., Adeli H. (2012). Visibility graph similarity: A new measure of generalized synchronization in coupled dynamic systems. Phys. D Nonlinear Phenomena 241, 326–332. doi: 10.1016/J.PHYSD.2011.09.008
  • Ahmadlou M., Gharib M., Hemmati S., Vameghi R., Sajedi F. (2013). Disrupted small-world brain network in children with Down syndrome. Clin. Neurophysiol. 124, 1755–1764.
  • Alzheimer’s Association (2023). Alzheimer’s disease facts and figures. Available online at: https://www.alz.org/media/Documents/alzheimers-facts-and-figures.pdf (accessed July 23, 2023).
  • Artameeyanant P., Sultornsanee S., Chamnongthai K. (2017). Electroencephalography-based feature extraction using complex network for automated epileptic seizure detection. Expert Syst. 34:e12211. doi: 10.1111/EXSY.12211
  • Bashiri F., Mokhtarpour A. (2022). Depression classification and recognition by graph-based features of EEG signals. Int. J. Med. Eng. Inform. 14, 252–263.
  • Bhaduri S., Ghosh D. (2015). Electroencephalographic data analysis with visibility graph technique for quantitative assessment of brain dysfunction. Clin. EEG Neurosci. 46, 218–223. doi: 10.1177/1550059414526186
  • Cai L., Deng B., Wei X., Wang R., Wang J. (2018). “Analysis of spontaneous EEG activity in Alzheimer’s disease using weighted visibility graph,” in Proceedings of the 2018 40th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC) (Honolulu, HI: IEEE), 3100–3103. doi: 10.1109/EMBC.2018.8513010
  • Cai L., Wang J., Cao Y., Deng B., Yang C. (2016). “LPVG analysis of the EEG activity in Alzheimer’s disease patients,” in Proceedings of the 2016 12th World Congress on Intelligent Control and Automation (WCICA) (Guilin: IEEE), 934–938.
  • Cui X., Liu M., Zhang N., Zhang J., Wei N., Li K. (2021). “Brain functional networks analysis of five fingers grasping in virtual reality environment,” in Proceedings of the 2021 43rd Annual International Conference of the IEEE Engineering in Medicine & Biology Society (EMBC) (Mexico: IEEE), 804–807. doi: 10.1109/EMBC46164.2021.9630128
  • Dash S. S., Nayak S. K., Mishra D. (2021). A review on machine learning algorithms. Smart Innov. Syst. Technol. 153, 495–507. doi: 10.1007/978-981-15-6202-0_51
  • Distefano G., Calderoni S., Apicella F., Cosenza A., Igliozzi R., Palermo G., et al. (2023). Impact of sleep disorders on behavioral issues in preschoolers with autism spectrum disorder. Front. Psychiatry 14:1181466. doi: 10.3389/FPSYT.2023.1181466
  • Ebenezer Rajadurai T., Valliyammai C. (2018). “Epileptic seizure prediction using weighted visibility graph,” in Proceedings of the International Conference on Soft Computing Systems (Cham: Springer), 453–461.
  • Gao Z.-K., Cai Q., Yang Y.-X., Dang W.-D., Zhang S.-S. (2016). Multiscale limited penetrable horizontal visibility graph for analyzing nonlinear time series. Sci. Rep. 6:35622.
  • Gao Z.-K., Guo W., Cai Q., Ma C., Zhang Y.-B., Kurths J. (2019). Characterization of SSMVEP-based EEG signals using multiplex limited penetrable horizontal visibility graph. Chaos 29:073119. doi: 10.1063/1.5108606
  • Hao C., Chen Z., Zhao Z. (2016a). “Analysis and prediction of epilepsy based on visibility graph,” in Proceedings of the 2016 3rd International Conference on Information Science and Control Engineering (ICISCE) (Beijing: IEEE), 1271–1274.
  • Hao C., Li W., Du S. (2016b). “Classification of EEG in eyes-open and eyes-closed state based on limited penetrable visibility graph,” in Proceedings of the 2016 IEEE International Conference on Cyber Technology in Automation, Control, and Intelligent Systems (CYBER) (Chengdu), 448–451.
  • Huang-Jing N., Ruo-Yu D., Lei L., Ling-Ling H., Li-Hua Z., Jiao-Long Q. (2023). Two-dimensional horizontal visibility graph analysis of human brain aging on gray matter. Chin. Phys. B 32:078501.
  • Iacovacci J., Lacasa L. (2020). Visibility graphs for image processing. IEEE Trans. Pattern Anal. Mach. Intell. 42, 974–987. doi: 10.1109/TPAMI.2019.2891742
  • Isfahani P. P., Kharajinezhadian F., Songhorzadeh M. (2022). Evaluation of the flow information using WHVG-TE, in epilepsy. Kerala: Research Square. doi: 10.21203/rs.3.rs-2240585/v1
  • Ji H., Xu T., Wu W., Wang J. (2016). “Visibility graph analysis on EEG signal,” in Proceedings of the 2016 9th International Congress on Image and Signal Processing, BioMedical Engineering and Informatics (CISP-BMEI) (Datong: IEEE), 1557–1561.
  • Lacasa L., Luque B., Ballesteros F., Luque J., Nuno J. C. (2008). From time series to complex networks: The visibility graph. Proc. Natl. Acad. Sci. U.S.A. 105, 4972–4975.
  • Lacasa L., Luque B., Luque J., Nuno J. C. (2009). The visibility graph: A new method for estimating the Hurst exponent of fractional Brownian motion. Europhys. Lett. 86:30001.
  • Liu Z., Sun J., Zhang Y., Rolfe P. (2016). Sleep staging from the EEG signal using multi-domain feature extraction. Biomed. Signal Process. Control 30, 86–97. doi: 10.1016/j.neures.2021.03.012
  • Luque B., Lacasa L., Ballesteros F., Luque J. (2009). Horizontal visibility graphs: Exact results for random time series. Phys. Rev. E Stat. Nonlin. Soft Matter Phys. 80:046103.
  • Mathur P., Chakka V. K. (2020). “Graph signal processing of EEG signals for detection of epilepsy,” in Proceedings of the 2020 7th International Conference on Signal Processing and Integrated Networks (SPIN) (Noida: IEEE), 839–843.
  • Matthews P. M., Jezzard P. (2004). Functional magnetic resonance imaging. J. Neurol. Neurosurg. Psychiatry 75, 6–12.
  • Modak S., Roy S. S., Samanta K., Chatterjee S., Dey S., Bhowmik R., et al. (2020). “Detection of focal EEG signals employing weighted visibility graph,” in Proceedings of the 2020 International Conference on Computer, Electrical & Communication Engineering (ICCECE) (Kolkata: IEEE), 1–5.
  • Munn Z., Peters M. D. J., Stern C., Tufanaru C., Mcarthur A., Aromataris E. (2018). Systematic review or scoping review? Guidance for authors when choosing between a systematic or scoping review approach. BMC Med. Res. Methodol. 18:143. doi: 10.1186/s12874-018-0611-x
  • Nicosia V., Lacasa L., Latora V. (2014). From multivariate time series to multiplex visibility graphs. arXiv [preprint]. doi: 10.48550/arXiv.1408.0925
  • Olamat A., Ozel P., Akan A. (2022). Synchronization analysis in epileptic EEG signals via state transfer networks based on visibility graph technique. Int. J. Neural Syst. 32:2150041.
  • Olamat A., Shams P., Akan A. (2017). “State transfer network of time series based on visibility graph analysis for classifying and prediction of epilepsy seizures,” in Proceedings of the 2017 Medical Technologies National Congress (TIPTEKNO) (Trabzon: IEEE), 1–4.
  • Olamat A., Shams P., Akan A. (2018). “Synchronization analysis of EEG epilepsy by visibility graph similarity,” in Proceedings of the 2018 Medical Technologies National Congress (TIPTEKNO) (Magusa: IEEE), 1–4.
  • Ozel P., Karaca A., Olamat A., Akan A., Ozcoban M. A., Tan O. (2020). Intrinsic synchronization analysis of brain activity in obsessive–compulsive disorders. Int. J. Neural Syst. 30:2050046.
  • Paranjape P. N., Dhabu M. M., Deshpande P. S. (2023). “A novel weighted visibility graph approach for alcoholism detection through the analysis of EEG signals,” in Proceedings of the Advanced Network Technologies and Intelligent Computing: Second International Conference, ANTIC 2022, Varanasi, India, December 22–24, 2022, Part II (Cham: Springer), 16–34.
  • Pei X., Wang J., Deng B., Wei X., Yu H. (2014). WLPVG approach to the analysis of EEG-based functional brain network under manual acupuncture. Cogn. Neurodyn. 8, 417–428. doi: 10.1007/s11571-014-9297-x
  • Samanta K., Chatterjee S., Bose R. (2019). Cross-subject motor imagery tasks EEG signal classification employing multiplex weighted visibility graph and deep feature extraction. IEEE Sens. Lett. 4, 1–4.
  • Sannino S., Stramaglia S., Lacasa L., Marinazzo D. (2017). Visibility graphs for fMRI data: Multiplex temporal graphs and their modulations across resting-state networks. Netw. Neurosci. 1, 208–221. doi: 10.1162/NETN_A_00012
  • Schindler K., Rummel C., Andrzejak R. G., Goodfellow M., Zubler F., Abela E., et al. (2016). Ictal time-irreversible intracranial EEG signals as markers of the epileptogenic zone. Clin. Neurophysiol. 127, 3051–3058. doi: 10.1016/j.clinph.2016.07.001
  • Sengupta A., Routray A., Kar S. (2013). “Complex brain networks using Visibility Graph synchronization,” in Proceedings of the 2013 Annual IEEE India Conference (INDICON) (Mumbai: IEEE), 1–4. doi: 10.1016/j.clinph.2013.03.004
  • Sengupta A., Routray A., Kar S. (2014). “Estimation of fatigue in drivers by analysis of brain networks,” in Proceedings of the 2014 Fourth International Conference of Emerging Applications of Information Technology (Kolkata: IEEE), 289–293. doi: 10.3390/e22070787
  • Sporns O., Honey C. J. (2006). Small worlds inside big brains. Proc. Natl. Acad. Sci. U.S.A. 103, 19219–19220. doi: 10.1073/pnas.0609523103
  • Supriya S., Siuly S., Wang H., Zhang Y. (2018). EEG sleep stages analysis and classification based on weighed complex network features. IEEE Trans. Emerg. Top. Comput. Intell. 5, 236–246.
  • Supriya S., Siuly S., Wang H., Cao J., Zhang Y. (2016a). Weighted visibility graph with complex network features in the detection of epilepsy. IEEE Access 4, 6554–6566.
  • Supriya S., Wang H., Zhuo G., Zhang Y. (2016). “Analyzing EEG signal data for detection of epileptic seizure: Introducing weight on visibility graph with complex network feature,” in Proceedings of the Databases Theory and Applications: 27th Australasian Database Conference, ADC 2016, Sydney, NSW, September 28-29, 2016, Proceedings 27 (Sydney, NSW: Springer), 56–66.
  • Teymourlouei A., Gentili R. J., Reggia J. (2023). “Decoding EEG signals with visibility graphs to predict varying levels of mental workload,” in Proceedings of the 2023 57th Annual Conference on Information Sciences and Systems (CISS) (Baltimore, MD: IEEE), 1–6.
  • Ting-Ting Z., Ning-De T., Zhong-Ke G., Yue-Bin L. (2012). Limited penetrable visibility graph for establishing complex network from time series. Acta Phys. Sin. 61:030506. doi: 10.1038/srep35622
  • Tiwari M., Wong Y.-M. (2022). Identification of topological measures of visibility graphs for analyzing transitions in complex time series. Int. J. Mod. Phys. B 36:2240080.
  • Tricco A. C., Lillie E., Zarin W., O’Brien K. K., Colquhoun H., Levac D., et al. (2018). PRISMA extension for scoping reviews (PRISMA-ScR): Checklist and explanation. Ann. Intern. Med. 169, 467–473.
  • van den Heuvel M. P., Hulshoff Pol H. E. (2010). Exploring the brain network: A review on resting-state fMRI functional connectivity. Eur. Neuropsychopharmacol. 20, 519–534. doi: 10.1016/j.euroneuro.2010.03.008
  • Varley T. F., Sporns O. (2021). Network analysis of time series: Novel approaches to network neuroscience. Front. Neurosci. 15:787068. doi: 10.3389/fnins.2021.787068
  • Villa Padilla R. V., Rodríguez Vázquez K., Vázquez Hernández M., Sandoval Bonilla B. A., Sánchez Dueñas J. J. (2023). “Graph analysis of functional connectivity Rs-FMRI in healthy and epileptic brain using visibility algorithm,” in Proceedings of the Congreso Nacional de Ingeniería Biomédica (Cham: Springer), 27–36.
  • Wadhera T., Kakkar D. (2020). Multiplex temporal measures reflecting neural underpinnings of brain functional connectivity under cognitive load in Autism Spectrum Disorder. Neurol. Res. 42, 327–337. doi: 10.1080/01616412.2020.1726586
  • Wadhera T., Kakkar D. (2021). Analysis of simultaneous visual and complex neural dynamics during cognitive learning to diagnose ASD. Phys. Eng. Sci. Med. 44, 1081–1094. doi: 10.1007/s13246-021-01045-8
  • Wang H., Zhuo G., Zhang Y. (2016a). “Analyzing EEG signal data for detection of epileptic seizure: Introducing weight on visibility graph with complex network feature,” in Proceedings of the 27th Australasian Database Conference (Sydney: Springer), 56–66.
  • Wang J., Yang C., Wang R., Yu H., Cao Y., Liu J. (2016b). Functional brain networks in Alzheimer’s disease: EEG analysis based on limited penetrable visibility graph and phase space method. Phys. A Stat. Mech. Appl. 460, 174–187.
  • Wang L., Long X., Arends J. B. A. M., Aarts R. M. (2017). EEG analysis of seizure patterns using visibility graphs for detection of generalized seizures. J. Neurosci. Methods 290, 85–94.
  • Wang R., Yang Z., Wang J., Shi L. (2018). “An improved visibility graph analysis of EEG signals of Alzheimer brain,” in Proceedings of the 2018 11th International Congress on Image and Signal Processing, BioMedical Engineering and Informatics (CISP-BMEI), Beijing, China, 13–15 October 2018, Beijing, 1–5. doi: 10.1038/s41598-023-32664-8
  • Wang S., Li Y., Wen P., Lai D. (2016c). Data selection in EEG signals classification. Australas. Phys. Eng. Sci. Med. 39, 157–165.
  • Wang Y., Long X., van Dijk J. P., Aarts R. M., Wang L., Arends J. B. A. M. (2020). False alarms reduction in non-convulsive status epilepticus detection via continuous EEG analysis. Physiol. Meas. 41:055009. doi: 10.1088/1361-6579/ab8cb3
  • Xiong H., Shang P., Hou F., Ma Y. (2019). Visibility graph analysis of temporal irreversibility in sleep electroencephalograms. Nonlinear Dyn. 96, 1–11. doi: 10.1007/s11071-019-04768-2
  • Xu H., Dai J., Li J., Wang J., Hou F. (2018). “Research of EEG signal based on permutation entropy and limited penetrable visibility graph,” in Proceedings of the 2018 11th International Congress on Image and Signal Processing, BioMedical Engineering and Informatics (CISP-BMEI) (Beijing: IEEE), 1–5.
  • Yu H., Zhu L., Cai L., Wang J., Liu J., Wang R., et al. (2020). Identification of Alzheimer’s EEG with a WVG network-based fuzzy learning approach. Front. Neurosci. 14:641. doi: 10.3389/fnins.2020.00641
  • Zhang B., Wei D., Yan G., Lei T., Cai H., Yang Z. (2022). Feature-level fusion based on spatial-temporal of pervasive EEG for depression recognition. Comput. Methods Programs Biomed. 226:107113. doi: 10.1016/j.cmpb.2022.107113
  • Zhang X., Landsness E. C., Chen W., Miao H., Tang M., Brier L. M., et al. (2022). Automated sleep state classification of wide-field calcium imaging data via multiplex visibility graphs and deep learning. J. Neurosci. Methods 366:109421. doi: 10.1016/J.JNEUMETH.2021.109421
  • Zhu G., Li Y., Wen P. (2012). “Analysing epileptic EEGs with a visibility graph algorithm,” in Proceedings of the 2012 5th International Conference on Biomedical Engineering and Informatics (Piscataway, NJ: IEEE), 432–436. doi: 10.1063/5.0140579
  • Zhu G., Li Y., Wen P. (2014a). Analysis and classification of sleep stages based on difference visibility graphs from a single-channel EEG signal. IEEE J. Biomed. Health Inform. 18:1813. doi: 10.1109/JBHI.2014.2303991
  • Zhu G., Li Y., Wen P. P. (2014b). Epileptic seizure detection in EEGs signals using a fast weighted horizontal visibility algorithm. Comput. Methods Programs Biomed. 115, 64–75. doi: 10.1016/j.cmpb.2014.04.001
  • Zhu G., Li Y., Wen P. P., Wang S. (2014c). Analysis of alcoholic EEG signals based on horizontal visibility graph entropy. Brain Inform. 1, 19–25.
  • Zhu L., Haghani S., Najafizadeh L. (2018). “Spatiotemporal characterization of brain function via multiplex visibility graph,” in Proceedings of the Optics and the Brain-OSA Technical Digest, JTh3A-54 (Washington, DC: Optica Publishing Group).


Open Access

Peer-reviewed

Research Article

Exploratory graph analysis: A new approach for estimating the number of dimensions in psychological research

  • Hudson F. Golino (Department of Psychology, University of Virginia, Charlottesville, VA, United States of America; Graduate School of Psychology, Universidade Salgado de Oliveira, Rio de Janeiro, Brazil)
  • Sacha Epskamp (University of Amsterdam, Amsterdam, Netherlands)

  • Published: June 8, 2017
  • https://doi.org/10.1371/journal.pone.0174035

Abstract

The estimation of the correct number of dimensions is a long-standing problem in psychometrics. Several methods have been proposed, such as parallel analysis (PA), Kaiser-Guttman’s eigenvalue-greater-than-one rule, the minimum average partial procedure (MAP), maximum-likelihood approaches that use fit indexes such as BIC and EBIC, and the less used and studied approach called very simple structure (VSS). In the present paper a new approach to estimate the number of dimensions is introduced and compared via simulation to the traditional techniques listed above. The proposed approach is called exploratory graph analysis (EGA), since it is based on the graphical lasso with the regularization parameter specified using EBIC. The number of dimensions is verified using the walktrap, a random walk algorithm used to identify communities in networks. In total, 32,000 data sets were simulated to fit known factor structures, varying across different criteria: number of factors (2 and 4), number of items per factor (5 and 10), sample size (100, 500, 1000 and 5000) and correlation between factors (orthogonal, .20, .50 and .70), resulting in 64 different conditions. For each condition, 500 data sets were simulated using lavaan. The results show that EGA performs comparably to parallel analysis, BIC, EBIC and the Kaiser-Guttman rule in a number of situations, especially when the number of factors was two. However, EGA was the only technique able to correctly estimate the number of dimensions in the four-factor structure when the correlation between factors was .70, showing an accuracy of 100% for a sample size of 5,000 observations. Finally, EGA was used to estimate the number of factors in a real dataset, in order to compare its performance with the other six techniques tested in the simulation study.

Citation: Golino HF, Epskamp S (2017) Exploratory graph analysis: A new approach for estimating the number of dimensions in psychological research. PLoS ONE 12(6): e0174035. https://doi.org/10.1371/journal.pone.0174035

Editor: Martin Voracek, University of Vienna, AUSTRIA

Received: April 5, 2016; Accepted: March 2, 2017; Published: June 8, 2017

Copyright: © 2017 Golino, Epskamp. This is an open access article distributed under the terms of the Creative Commons Attribution License , which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: All relevant data are within the paper and its Supporting Information files, especially the R code used in the manuscript. The real dataset used in the last section is available on figshare: https://figshare.com/articles/TDRI_dataset_csv/3142321 .

Funding: The authors received no specific funding for this work.

Competing interests: The authors have declared that no competing interests exist.

Introduction

Estimating the number of dimensions in psychological and educational instruments is a long-standing problem in psychometrics [ 1 , 2 , 3 ]. Dimensions can be defined as the small set of features, derived from a large set of correlated variables, that collectively explains most of the variability in the original set [ 4 ], or as the underlying source of the variability present in multivariate data [ 5 ]. Two main general traditions, within psychology, can be identified among the methods that have been proposed [ 2 ]. The first examines patterns of eigenvalues, determining the number of factors based on some specified stopping rule. Two of the best-known methods in this tradition are the Kaiser-Guttman eigenvalue greater than one rule [ 6 , 7 ] and Horn’s parallel analysis [ 8 ]. The second general tradition compares the fit of structural models with varying numbers of factors and determines the number of factors to be retained based on the minimum average partial procedure (MAP) [ 9 ] or fit indexes such as the Bayesian information criterion (BIC) [ 10 ] and the extended Bayesian information criterion (EBIC) [ 11 ]. In addition to these traditions, there is an alternative, less used and studied, approach called very simple structure (VSS) [ 12 ]. This approach assesses whether the original correlation matrix can be reproduced by a simplified pattern matrix, in which the highest loading for each item is retained and the other loadings are set to zero.

In this paper, we introduce a new approach to estimate the number of dimensions to be retained. We term this approach exploratory graph analysis (EGA), since it is based on estimating a graphical model [ 13 , 14 , 15 ] followed by cluster detection to estimate the number of dimensions in psychological data. EGA has the additional benefit over the above-mentioned procedures that it not only estimates the number of dimensions but also indicates which items belong to each dimension. We will compare this approach via simulation to the traditional factor retention techniques: VSS, MAP, the Kaiser-Guttman rule, parallel analysis, and the fit of different numbers of factors via BIC and EBIC. Finally, we have implemented EGA [ 16 ] in a free-to-use software package for the statistical programming language R.

Assessing dimensionality in psychological data

Nowadays, psychology has at its disposal an impressive number of statistical procedures, with complex and flexible models carefully developed to deal with a multitude of problems. One may wonder whether estimating the number of dimensions using factor analysis still plays the role in research that it did some decades ago. The use of factor models is still very much present as an early step in the process of construct validation [ 17 ], being considered “inexorably linked to the development of intelligence tests and to intelligence theory” (p. 37) [ 18 ]. A quick search in Science Direct, an Elsevier web database for scientific publications, using the keywords “exploratory factor analysis” from 1990 to 2016 in journals from the fields of Arts and Humanities, Psychology and Social Sciences, yielded 40,132 results. Of this total, 73.79% were published in the last ten years. So, as this brief and non-systematic search shows, and in line with previous papers [ 19 ], factor analysis is still widely used and broadly applied. However, reviews show that from 22% to 28% of papers published using exploratory factor analysis fail to report the specific extraction method used [ 20 ]. This is a serious issue, because the extraction method can affect the number of dimensions estimated. As will be shown in the next paragraphs, each technique has its benefits and pitfalls, so reporting which method was used is extremely important.

Why does psychology need a new way to estimate the number of dimensions? The answer lies in the several studies published about the performance of parallel analysis [ 18 , 21 , 22 , 23 , 24 , 25 ], the MAP [ 18 , 24 , 26 ], the BIC [ 27 , 28 , 29 , 30 ] and the Kaiser-Guttman eigenvalue rule [ 2 , 18 , 24 , 25 , 31 ] in estimating the correct number of factors. This line of research has shown that parallel analysis and the MAP work quite well when there is a low or moderate correlation between factors, when the sample size is equal to or greater than 500 and when the factor loadings are moderate to high [ 18 , 21 , 22 ]. However, they tend to underestimate the number of factors when the correlations between factors are high, when the sample size is small and when there is a small number of indicators per factor [ 2 , 18 , 21 , 22 ].

The Kaiser-Guttman rule is the default method for choosing the number of factors in many commercial software packages [ 20 ]. However, simulation studies show that this method overestimates the number of factors, especially with a large number of items and a large sample size [ 2 , 18 , 24 , 25 , 31 ]. Ruscio and Roche [ 2 ] provided startling evidence in this direction: the Kaiser-Guttman rule overestimated the number of factors in 89.87% of the 10,000 simulated datasets, generated with different numbers of factors, sample sizes, numbers of items, numbers of response categories per item and strengths of correlation between factors. In light of the evidence from the simulation studies, some researchers strongly recommend against using this method [ 20 , 24 ].

Regarding the BIC, the evidence is mixed. Preacher, Zhang, Kim and Mels [ 29 ] showed that BIC performs well when the sample size is small, but tends to overestimate the number of factors in large datasets. However, Dziak, Coffman, Lanza and Li [ 27 ] showed that BIC underestimates less often and estimates the number of factors correctly more often when the sample size is greater than 200 cases. It is important to point out that, to the best of our knowledge, there is no study showing how the very simple structure approach behaves under different conditions. These simulation studies highlight a very complicated problem within psychology, since it is very common to find areas in which the correlation between factors is high, especially in the intelligence field [ 18 ]. Thus, in such situations, parallel analysis, MAP and comparing different numbers of factors via BIC perform badly, on average, showing that estimating the number of factors is still a non-trivial task, in spite of the past decades’ developments [ 32 ]. It seems that Kaiser’s dictum remains valid: “a solution to the number-of-factors problem in factor analysis is easy… But the problem, of course is to find the solution” [ 33 ].

The next section introduces a new approach to estimate the number of dimensions, called exploratory graph analysis (EGA). EGA will be compared to parallel analysis, MAP, BIC, EBIC, the Kaiser-Guttman rule and VSS in a simulation study with 32,000 simulated data sets, generated under 64 conditions varying in four different criteria: number of factors (2 and 4), number of items per factor (5 and 10), sample size (100, 500, 1000 and 5000) and correlation between factors (orthogonal, .20, .50 and .70). In the last section, EGA will be used to estimate the number of factors from a real dataset with an empirically established factor structure. This will enable the comparison of EGA with the six techniques tested in the simulation study.

Network psychometrics

Recent literature has focused on applying undirected network models, so-called Markov random fields [ 13 ], to psychological datasets. In these network models, nodes represent random variables (as opposed to, e.g., people in social networks) which are connected by edges or links indicating the level of interaction between these variables. These models focus on the estimation of direct relationships between observed variables rather than modeling observed variables as functions of latent common causes. Such models have shown great promise in diverse psychological fields such as psychopathology [ 34 , 35 , 36 , 37 , 38 ], attitude formation [ 39 ], quality of life research [ 40 ] and developmental psychology [ 41 ]. Forming a network structure on psychological data, however, is not an easy task. The field of network psychometrics [ 42 ] emerged in response to the challenges involved in estimating such network models.

The network model we will utilize in this paper is termed the Gaussian graphical model (GGM) [ 13 ], which models multivariate normally distributed data directly through the inverse covariance matrix. Each element of the inverse covariance matrix corresponds to a connection, an edge, in the network, linking two variables, nodes, if they feature a pairwise interaction. These edges can be standardized, visualized and more easily interpreted as partial correlation coefficients between two variables after conditioning on all other variables in the dataset. Partial correlation coefficients of exactly zero indicate that there is no edge between two nodes. Thus, in a GGM, if two variables are not connected, they are conditionally independent given all other variables in the network.
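To make the mapping between the inverse covariance matrix and partial correlations concrete, here is a minimal base-R sketch (not code from the paper; the four mtcars variables are arbitrary placeholders):

```r
# Minimal sketch: GGM edge weights (partial correlations) from the inverse covariance matrix.
S <- cov(mtcars[, c("mpg", "disp", "hp", "wt")])  # sample covariance of four arbitrary variables
K <- solve(S)                                     # precision (inverse covariance) matrix

# Standardize: pcor_ij = -K_ij / sqrt(K_ii * K_jj)
pcor <- -K / sqrt(diag(K) %o% diag(K))
diag(pcor) <- 1

round(pcor, 2)  # (near-)zero off-diagonal entries suggest conditional independence given the rest
```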

While a GGM can be estimated directly by inverting the sample variance-covariance matrix, doing so can lead to large standard errors and unstable parameter estimates in relatively small datasets (i.e., typical sample sizes in psychological research) due to overfitting. A popular technique for estimating GGMs is therefore not to invert the variance-covariance matrix directly but to estimate the model using penalized maximum likelihood estimation. In particular, the least absolute shrinkage and selection operator (LASSO) [ 43 ] can be used to estimate a GGM while guarding against overfitting. By using the LASSO many parameters can be estimated to be exactly zero, indicating conditional independence and increasing the interpretability of the network structure. Because of these properties, LASSO estimation has become the go-to estimation method for network models on psychological datasets, e.g. [ 38 , 40 , 44 ]. When using LASSO estimation one needs to set a tuning parameter that loosely controls the sparsity of the resulting network structure. A typical way of setting this tuning parameter is to estimate the model under 100 different tuning parameter values and select the value that minimizes some criterion. For GGM estimation, minimizing the extended Bayesian information criterion [ 11 ] has been shown to work well in retrieving the true network structure [ 15 ]. This methodology has been implemented in the qgraph R package [ 45 , 46 ] for easy usage on psychometric datasets.
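As a rough illustration of this EBIC-tuned LASSO workflow with qgraph, consider the sketch below; `items` is a hypothetical data frame of item responses (an assumption, not an object from the paper), and the settings simply mirror the defaults described here:

```r
library(qgraph)

# `items`: hypothetical data frame of item responses (placeholder, not from the paper).
S <- cor_auto(items)   # Pearson or polychoric correlations, chosen automatically
n <- nrow(items)       # sample size, needed for the EBIC

# LASSO-regularized GGM: 100 values of the tuning parameter are tried internally and
# the network minimizing the EBIC (hyperparameter gamma = 0.5) is returned.
ggm <- EBICglasso(S, n = n, gamma = 0.5, nlambda = 100)

qgraph(ggm, layout = "spring")  # visualize the sparse partial correlation network
```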

Exploratory graph analysis

The modeling of psychological datasets through network models originates with the work of van der Maas et al. [ 41 ], who showed that a dataset corresponding to a general factor model can also be simulated using a fully connected network model. A section of a network in which all nodes are fully connected is termed a clique, and a section in which many nodes are connected with each other is termed a cluster. Such clusters are of particular interest to psychometrics, as it is argued that clusters of nodes will lead to data comparable to that of a latent variable model, or, depending on one’s assumptions about the underlying causal structure, that influence due to latent variables will manifest in network structures as such clusters or even cliques in which all nodes interact with each other. For instance, in the psychopathological literature it is argued that clusters of nodes representing symptoms correspond to psychopathological disorders [ 34 , 35 ]. Similar arguments have been made for stable personality traits, which routinely come up as clusters in an estimated network structure [ 45 , 47 , 48 ].

The relationship between latent variables on the one hand and network clusters on the other goes deeper than mere philosophical speculation and empirical findings. It can directly be seen that if a latent variable model is the true underlying causal model, we would expect indicators in a network model to form strongly connected clusters for each latent variable. Since edges correspond to partial correlation coefficients between two variables after conditioning on all other variables in the network, and two indicators cannot become independent after conditioning on observed variables given that they are both caused by a latent variable, the edge strength between two indicators should not be zero. In fact, from a mathematical point of view, network models can be shown to be equivalent under certain conditions to latent variable models in both binary [ 42 , 49 ] and Gaussian datasets [ 50 ], in which case each latent variable is represented by a rank-1 cluster. Thus, when defining a cluster as a group of connected nodes regardless of edge weight, we can state the following relationship as a fundamental rule of network psychometrics: clusters in the network = latent variables.

It should be noted that when multiple correlated latent variables underlie distinct sets of indicators, none of the edges should be missing, as we technically cannot condition on any observed variable to make two indicators independent. However, we would expect the partial correlation between two indicators of the same latent variable to be much stronger than the partial correlation between two indicators of different latent variables. Furthermore, when using LASSO estimation, we would expect these already small edge weights to be pushed more easily to zero simply due to the penalization. As such, we expect an algorithm that detects weighted network clusters to group together indicators of the same latent variable.

[Equation display omitted in the source: the factor-model decomposition of the GGM precision matrix K in terms of Λ, ψ and Θ.]
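What follows is a hedged reconstruction of that omitted display, under the assumption that it was the standard decomposition obtained by applying the Woodbury identity to the factor-model covariance; the symbols Θ (diagonal residual covariance), Λ (loadings), ψ (latent covariance), X and K match the definitions used in the next paragraph.

```latex
% Hedged reconstruction (not verbatim from the source): implied covariance of a factor
% model and its inverse K (the GGM precision matrix) via the Woodbury identity.
\Sigma = \Lambda \psi \Lambda^{\top} + \Theta,
\qquad
K = \Sigma^{-1}
  = \Theta^{-1} - \Theta^{-1} \Lambda
    \underbrace{\left( \psi^{-1} + \Lambda^{\top} \Theta^{-1} \Lambda \right)^{-1}}_{X}
    \Lambda^{\top} \Theta^{-1}.
```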

Since Θ is diagonal, so is Θ⁻¹, leading Θ⁻¹Λ to be block diagonal and ΛᵀΘ⁻¹Λ to be diagonal. Let X = (ψ⁻¹ + ΛᵀΘ⁻¹Λ)⁻¹. Then K becomes a block matrix in which every block is constructed from the inner product of factor loadings and inverse residual variances; every diagonal block is scaled by diagonal elements of X and every off-diagonal block is scaled by off-diagonal values of X.

Since ψ must be positive definite, it follows that X must be positive definite as well. Typically in factor analysis the first factor loadings or the latent variances-covariances are fixed to 1 to identify the model. We can, however, without loss of information, also constrain the diagonal of X to equal 1. It then follows that every absolute off-diagonal value of X must be smaller than 1. From the form of X it follows that the off-diagonal values of X equal zero if the latent factors are orthogonal. Hence, the above decomposition shows that:

  • If the latent factors are orthogonal, the resulting GGM consists of unconnected clusters.
  • Assuming factor loadings and residual variances are reasonably on the same scale for every item, the off-diagonal blocks of K will be scaled closer to zero than the diagonal blocks of K . Hence, the resulting GGM will contain weighted clusters for each factor.

This line of reasoning led us to develop exploratory graph analysis (EGA), in which we first estimate the correlation matrix of the observable variables, then use graphical LASSO estimation to obtain the sparse inverse covariance matrix, with the regularization parameter defined via EBIC over 100 different values. Finally, the walktrap algorithm [ 52 ] is used to find the number of dense subgraphs (communities or clusters) of the partial correlation matrix computed in the previous step. The walktrap algorithm provides a measure of similarity between vertices based on random walks, which can capture the community/cluster structure in a graph [ 52 ]. The number of clusters identified is taken as the number of latent factors in a given dataset.

In sum, we expect EGA to show high accuracy in estimating the number of dimensions in psychology-like datasets due to the use of the LASSO technique [ 43 ]. Partial correlation is one of the methods used to estimate network models, but it suffers from an important issue: even when two variables are conditionally independent, the estimated partial correlation coefficient is not zero due to sampling variation [ 46 ]. In other words, partial correlations can reflect spurious associations. This issue can be addressed using regularization techniques such as the LASSO [ 43 ], which is one of the most prominent methods for network estimation on psychological datasets [ 38 , 40 , 44 ]. When the LASSO is used to estimate a network, it avoids overfitting by shrinking the partial correlation coefficients, so small coefficients are estimated to be exactly zero, indicating conditional independence and making the network structure easier to interpret [ 46 ]. Since the LASSO can be used to control spurious connections, it is reasonable to expect that it will provide highly accurate estimates of the underlying structure of the data when combined with a community detection algorithm such as the walktrap [ 52 ].
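A minimal sketch of the three EGA steps just described, chaining qgraph and igraph directly; this follows the description in the text rather than the EGA package source, and `items` is again a placeholder data frame:

```r
library(qgraph)
library(igraph)

# Step 1: correlation matrix of the observed variables (`items` is a placeholder data frame).
S <- cor_auto(items)

# Step 2: EBIC-tuned graphical LASSO -> sparse partial correlation (GGM) matrix.
pcor <- EBICglasso(S, n = nrow(items), gamma = 0.5)

# Step 3: walktrap community detection on the weighted network, using absolute partial
# correlations as edge weights; the number of communities is the estimated number of
# dimensions. (The paper refers to igraph's older walktrap.community alias.)
g  <- graph_from_adjacency_matrix(abs(pcor), mode = "undirected", weighted = TRUE, diag = FALSE)
wc <- cluster_walktrap(g)

length(unique(membership(wc)))  # estimated number of dimensions
membership(wc)                  # item-to-dimension assignment
```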

Simulation study

Thirty-two thousand data sets were simulated to fit known factor structures, with the data sets varying across different criteria. The data generation design manipulated four variables, number of factors (2 and 4), number of items per factor (5 and 10), sample size (100, 500, 1000 and 5000) and correlation between factors (orthogonal, .20, .50 and .70), in a total of 64 different conditions in a 2x2x4x4 design. For each condition, 500 data sets were simulated using the R [ 53 ] package lavaan [ 54 ], resulting in the above-mentioned 32,000 data sets. The simulated data came from a centered multivariate normal distribution, with factor loadings and variances set to unity, and every item artificially dichotomized at its theoretical mean of zero.

Each factor was composed of five or ten dichotomous items. The choice of this kind of item is justified by the dichotomous nature of a significant number of intelligence test items, especially those requiring respondents to perform some task with only one correct answer, such as Raven’s progressive matrices [ 55 ], the Wiener Matrizen-Test 2 [ 56 ] or the more recent tests from the International Cognitive Ability Resource [ 57 , 58 ]. Since high correlations between factors are often found in intelligence research, we intended to mimic the nature of the field, so that the comparison between the proposed exploratory graph analysis and the traditional techniques is easier to understand and to interpret.
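For concreteness, here is a hedged sketch of one simulation cell along the lines described above (two factors, five dichotomous items each, factor correlation .70, n = 500); the population values are illustrative placeholders, not the paper’s exact script:

```r
library(lavaan)

# Illustrative population model (placeholder values): two correlated factors, five items each.
pop.model <- '
  f1 =~ 0.7*x1 + 0.7*x2 + 0.7*x3 + 0.7*x4 + 0.7*x5
  f2 =~ 0.7*y1 + 0.7*y2 + 0.7*y3 + 0.7*y4 + 0.7*y5
  f1 ~~ 0.7*f2
'

set.seed(2017)
dat <- simulateData(pop.model, sample.nobs = 500, standardized = TRUE)

# Dichotomize every item at its theoretical mean of zero, as in the design described above.
dat.bin <- as.data.frame(lapply(dat, function(x) as.integer(x > 0)))
```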

Data analysis.

The simulated data sets were submitted to seven different methods to estimate the number of dimensions (factors): (1) very simple structure (VSS) [ 12 ] with complexity 1; (2) the minimum average partial procedure (MAP) [ 9 ]; (3) the fit of different numbers of factors, from 1 to 10, via BIC; (4) the fit of different numbers of factors, from 1 to 10, via EBIC; (5) Horn’s Parallel Analysis (PA) [ 8 ] using the generalized weighted least squares factor method; (6) the Kaiser-Guttman eigenvalue greater than one rule [ 6 , 7 ]; and (7) exploratory graph analysis. The first five methods were implemented using the R package psych [ 59 ]. Since the items are dichotomous, PA was applied using tetrachoric correlations for both the real and the simulated data. The eigenvalue greater than one rule was applied to the observed eigenvalues calculated during the PA procedure.
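A hedged sketch of how several of these estimators can be run with psych on one simulated dichotomous dataset; `dat.bin` is the placeholder dataset from the previous sketch, and the exact settings used in the paper may differ:

```r
library(psych)

# Tetrachoric correlations, since the items are dichotomous.
R <- tetrachoric(dat.bin)$rho
n <- nrow(dat.bin)

# VSS (complexity 1), MAP and BIC-type statistics for 1 to 8 factors are all reported by vss().
vss.out <- vss(R, n = 8, n.obs = n, plot = FALSE)

# Parallel analysis on the raw dichotomous data, using tetrachoric correlations.
pa.out <- fa.parallel(dat.bin, cor = "tet", fa = "fa", plot = FALSE)
pa.out$nfact               # number of factors suggested by parallel analysis
sum(pa.out$pc.values > 1)  # Kaiser-Guttman rule: observed eigenvalues greater than one
```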

The exploratory graph analysis was applied using the R package EGA [ 16 ]. This package has a function named EGA with two arguments: data and plot.EGA. The first is used to specify the dataset, and the second is a logical argument which, if TRUE, plots a network showing the estimated dimensions. The EGA function returns a list with five elements: ndim (the number of dimensions estimated), correlation (a matrix of zero-order correlations between the items), glasso (a matrix with the partial correlations estimated using EBICglasso, from qgraph), wc (the walktrap community membership of the items), and dim.variables (a dataframe with two columns: the items and their respective estimated dimension). The EGA function first calculates the polychoric correlations via the cor_auto function of the qgraph package [ 45 ]. Secondly, the function uses EBICglasso from the qgraph package [ 45 ] to estimate the sparse inverse covariance matrix with the graphical lasso technique. The EBICglasso function runs one hundred values of the regularization parameter, generating one hundred graphs; the EBIC is computed and the graph with the smallest EBIC is selected. Finally, the EGA function uses the walktrap algorithm [ 52 ] to find the number of dense subgraphs (communities) of the partial correlation matrix computed in the previous step, via the walktrap.community function available in igraph [ 60 ]. The walktrap algorithm provides a measure of similarity between vertices based on random walks, which can capture the community structure in a graph [ 52 ].
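Based on this description of the package interface, a usage sketch might look like the following; the argument and element names are taken from the paragraph above, and the currently maintained successor package, EGAnet, may expose a slightly different interface:

```r
library(EGA)  # the package described above (its successor on CRAN is EGAnet)

# `items`: placeholder data frame of item responses.
res <- EGA(data = items, plot.EGA = TRUE)

res$ndim                 # estimated number of dimensions
res$wc                   # walktrap community (dimension) membership of each item
head(res$dim.variables)  # items paired with their estimated dimension
```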

Three indexes were recorded for each of the 32,000 datasets, following Garrido, Abad and Posada [ 17 ]. The first index is the accuracy in correctly recovering the number of factors. For example, in the four-factor structure the accuracy equals one if four factors are estimated and zero otherwise, making it possible to compute descriptive statistics based on the accuracy of each method in each group of 500 simulated data sets, for each condition. The second index, bias error, is the difference between the number of factors estimated and the true number of factors. A positive bias error indicates that the method is overestimating the number of factors, a negative bias error indicates that the method is underestimating the number of factors, and a bias error of zero indicates a complete lack of bias. The mean bias error (MBE) is calculated as the sum of the bias errors divided by the number of datasets generated for each condition. The third index is the absolute error, which is the absolute value of the bias error. The mean absolute error (MAE) is calculated as the sum of the absolute errors divided by the number of datasets generated for each condition. As pointed out by Garrido, Abad and Posada [ 17 ], the bias error cannot be used alone to verify the precision of a method for estimating the number of factors, since errors of under- and overfactoring can compensate for each other. This does not happen with the accuracy index or with the absolute error index. A mean absolute error of zero indicates perfect accuracy, while higher values are evidence of deviation from the correct number of dimensions.
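The three indexes are straightforward to compute once the estimates are collected; a small sketch with hypothetical values (not results from the paper):

```r
# Hypothetical estimates from ten replications of a condition whose true dimensionality is 4.
true.nfact <- 4
est.nfact  <- c(4, 4, 3, 4, 5, 4, 4, 2, 4, 4)

accuracy <- mean(est.nfact == true.nfact)      # proportion of exactly correct estimates
mbe      <- mean(est.nfact - true.nfact)       # mean bias error: sign shows over/underfactoring
mae      <- mean(abs(est.nfact - true.nfact))  # mean absolute error: 0 means perfect accuracy
```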

Structure with two factors

Table 1 shows the mean accuracy and its standard deviation for each method, in each condition. When the correlation between factors was zero (orthogonal), the methods presented a mean accuracy ranging from 98% to 100%, except for the VSS method, which presented a mean accuracy of 31% (SD = 46%). As the sample size increased, the mean accuracy of VSS decreased from 76% (sample size of 100) to 3% (sample size of 5,000). In contrast, all the other methods achieved a mean accuracy of 100% for sample sizes of 500, 1000 and 5000. Exactly the same pattern appeared when the correlation was .2. When the correlation between factors was .5, BIC, EBIC, the Kaiser-Guttman eigenvalue rule, PA and EGA presented an overall mean accuracy greater than 90%, while VSS presented a mean accuracy of 22% and MAP of 67%. When the correlation was high (.7), only EBIC, PA and EGA presented an overall mean accuracy greater than 90%. In general, the increase in the number of items per factor led to an increase in the mean accuracy and a decrease in the standard deviation, especially in the high correlation scenario (Table 1).

Table 1.

VSS = Very Simple Structure; BIC = Bayesian Information Criteria; EBIC = Extended Bayesian Information Criteria; MAP = Minimum Average Partial procedure; Kaiser = Kaiser-Guttman eigenvalue greater than one rule; PA = Parallel Analysis; EGA = Exploratory Graph Analysis. Low correlation = .2; Moderate Correlation = .5; High Correlation = .7. The rows show the aggregate mean and standard deviation for each level of correlation (bold), sample size (bold and italicized) and number of items per factor (non-italicized).

https://doi.org/10.1371/journal.pone.0174035.t001

Fig 1 presents the mean accuracies and their 95% confidence intervals by correlation (top left panel), number of items per factor (top right panel), sample size (bottom left panel) and by all conditions combined (bottom right panel). In general, the mean accuracies spread out as the correlation between factors increases from zero to .7, with the Kaiser-Guttman rule, PA and EGA presenting the highest accuracies (Fig 1, top left panel). The mean accuracies are higher (between 90% and 100%) when the number of items increases from 5 to 10, except for the VSS (Fig 1, top right panel). As the sample size increases, the mean accuracies of BIC, EBIC, PA and EGA also increase, attaining their maximum from sample sizes of 500 onward (Fig 1, bottom left panel). The Kaiser-Guttman rule is the technique least affected by the variability in sample size. Finally, the bottom right panel of Fig 1 shows clearly that the worst scenario appears when the correlation is high (.7), the number of items is small (5 per factor) and the sample size is 100. In this case, as the sample size increases from 100 to 500, 1,000 or 5,000, BIC, EBIC, PA and EGA increase their accuracies up to 100% (Table 1).

Fig 1.

VSS = Very Simple Structure; BIC = Bayesian Information Criteria; EBIC = Extended Bayesian Information Criteria; MAP = Minimum Average Partial procedure; Kaiser = Kaiser-Guttman eigenvalue greater than one rule; PA = Parallel Analysis; EGA = Exploratory Graph Analysis. Low correlation = .2; Moderate Correlation = .5; High Correlation = .7.

https://doi.org/10.1371/journal.pone.0174035.g001

Bias error and absolute error.

In terms of mean bias error (Fig 2), i.e. the mean difference between the estimated and the correct number of factors, VSS presented a very high error, indicating overestimation when the factors are orthogonal (MBE = 2.96, SD = 2.95). As the correlation increases, the mean bias error of VSS, the Kaiser-Guttman rule, BIC and MAP decreases, while PA and EBIC remain relatively stable, and EGA increases its MBE from .03 (SD = .02) when the factors are orthogonal to .19 (SD = .28) when the correlation is high (Fig 2, top left panel). From five to ten items per factor, VSS also decreases its mean bias error, while the other techniques remain stable or increase their MBE to values close to zero (Fig 2, top right panel). Considering the sample size, the highest MBE variability is found when the sample equals 100 cases (Fig 2, bottom left panel), with EGA presenting a mean bias error of .30 (SD = 1.06), while the MBE of VSS was .05 (SD = 1.54), of PA .03 (SD = .37), EBIC -.10 (SD = .31), Kaiser-Guttman -.10 (SD = .35), MAP -.25 (SD = .43), and BIC -.31 (SD = .46). The increase in sample size sharply increases the MBE of VSS. When the sample size was equal to or greater than 500, EGA, PA, BIC and EBIC presented an MBE of zero (Fig 2, bottom left panel). Analyzing all the conditions together (Fig 2, bottom right panel), it is clear that VSS is the technique with the most pronounced overestimation problem, while MAP and the Kaiser-Guttman rule tend to underestimate the number of factors when the correlation is high and the number of items per factor is five (Fig 2, bottom right panel). In terms of mean absolute error (Fig 3), i.e. the mean absolute difference between the estimated and the correct number of factors, the scenario is very similar to that described above.

Fig 2.

https://doi.org/10.1371/journal.pone.0174035.g002

Fig 3.

https://doi.org/10.1371/journal.pone.0174035.g003

Structure with four factors

Table 2 shows the mean accuracy and its standard deviation for each method, in each condition, for the four-factor structure. When the correlation between factors was zero (orthogonal), BIC, the Kaiser-Guttman rule, PA and EGA achieved accuracies greater than 90%, while MAP presented a mean accuracy of only 32% (SD = 47%). The increase in sample size improved the mean accuracies, except for MAP. The same scenario appeared when the correlation between factors was low. However, when the correlation was moderate, only EGA achieved a mean accuracy greater than 90%, irrespective of the sample size or the number of items per factor. In the high correlation scenario, EGA showed the highest overall accuracy (Mean = 71%, SD = 46%). However, as the sample size and the number of items per factor increased, BIC, EBIC, Kaiser-Guttman and PA were able to achieve mean accuracies greater than 90%.

Table 2.

VSS = Very Simple Structure; BIC = Bayesian Information Criteria; EBIC = Extended Bayesian Information Criteria; MAP = Minimum Average Partial procedure; Kaiser = Kaiser-Guttman eigenvalue greater than one rule; PARAN = Parallel Analysis; EGA = Exploratory Graph Analysis. Low correlation = .2; Moderate Correlation = .5; High Correlation = .7. The rows show the aggregate mean and standard deviation for each level of correlation (bold), sample size (bold and italicized) and number of items per factor (non-italicized).

https://doi.org/10.1371/journal.pone.0174035.t002

Fig 4 presents the mean accuracies and their 95% confidence intervals by correlation (top left panel), number of items per factor (top right panel), sample size (bottom left panel) and by all conditions combined (bottom right panel) for the four-factor structure. In general, the mean accuracies decrease as the correlation between factors increases from zero to .7, with EGA presenting the highest mean accuracy (Fig 4, top left panel). On the other hand, the mean accuracies increase when the number of items goes from 5 to 10 (Fig 4, top right panel) and with the increase of the sample size (Fig 4, bottom left panel), except for MAP, whose accuracy is inversely related to sample size. Finally, the bottom right panel of Fig 4 shows, again, that the worst scenario appears when the correlation between factors is high (.7) and the number of items is small (5 per factor). In this case, only EGA was able to correctly estimate the number of dimensions, presenting a mean accuracy of 100% for a sample size of 5,000 (Fig 4, bottom right panel). However, increasing the number of items per factor from five to ten sharply increases the mean accuracy of the methods (Fig 4, bottom right panel).

Fig 4.

https://doi.org/10.1371/journal.pone.0174035.g004

In terms of mean bias error, Fig 5 shows that VSS overestimated the number of dimensions when the correlation between factors was zero (orthogonal) or low. When the correlation between factors was moderate, MAP, BIC, EBIC, VSS, Kaiser-Guttman and PA underestimated the number of dimensions. All techniques presented a mean bias error lower than zero, indicating a tendency to underestimate the number of factors, in the high correlation scenario (Fig 5, top left panel). The top right panel of Fig 5 also shows a very clear tendency: except for EGA, all the methods increased their mean bias error as the number of items per factor increased. The sample size also affects the mean bias error (Fig 5, bottom left panel). When the sample size was 100, BIC, VSS, MAP, EBIC and PA presented the lowest mean bias errors. As the sample size increases, the mean bias errors of the techniques tend to move closer to zero, except for VSS, since the increase in sample size implies an increase in its overestimation. Finally, the bottom right panel of Fig 5 shows what happens when the correlation between factors is high and the number of items is five: the methods tend to underestimate the number of dimensions. In terms of mean absolute error (Fig 6), i.e. the mean absolute difference between the estimated and the correct number of factors, the scenario is very similar to that described above. In general, the absolute error increased as the correlation between factors became stronger, and decreased when the number of items went from five to ten and when the sample size increased (except for VSS). EGA was the only technique to present a mean absolute error of zero for a sample size of 5,000.

Fig 5.

https://doi.org/10.1371/journal.pone.0174035.g005

Fig 6.

https://doi.org/10.1371/journal.pone.0174035.g006

High order interactions

The final analysis aimed to verify how each condition investigated, and their combinations, impacted the accuracy of each technique in identifying the correct number of dimensions. To do so, an analysis of variance (ANOVA) was performed for each technique, with accuracy as the dependent variable and the correlation between factors, sample size, number of items per factor and number of factors as the independent variables (Table 3). Only the partial η² (eta squared) effect size will be reported, since the goal is to verify the magnitude of the difference between groups of conditions for each technique. Partial η² values equal to or greater than .14 can be considered large effect sizes [ 61 ]. The VSS technique presented a large effect size for correlation, number of factors and for the two-way interaction of sample size X number of factors. The MAP method presented a large effect size for correlation, number of factors and for the two-way interaction of correlation X items per factor. BIC and EBIC, on the other hand, presented large effect sizes for correlation, sample size, items per factor and number of factors. BIC also presented large effect sizes for every two-way interaction involving correlation, plus the two-way interaction of items per factor X number of factors and the three-way interaction of correlation X items per factor X number of factors. The EBIC technique also presented large effect sizes for correlation X items per factor, correlation X number of factors, sample size X number of factors and for the four-way interaction of correlation X sample size X items per factor X number of factors. In turn, the Kaiser-Guttman rule presented large effect sizes for correlation, items per factor, and correlation X items per factor. Parallel analysis presented large effect sizes for all isolated conditions, plus the two-way interactions of correlation X items per factor and correlation X number of factors, as well as the three-way interaction of correlation X items per factor X number of factors and the four-way interaction of correlation X sample size X items per factor X number of factors. Finally, EGA presented a large effect size only for sample size, being the technique whose accuracy was least affected by the conditions investigated in this paper.
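A hedged sketch of one such ANOVA, with partial eta squared computed by hand from the aov() sums of squares; `results` and its column names are placeholders for the simulation output table, not the authors’ code:

```r
# `results`: placeholder data frame, one row per simulated dataset, with the 0/1 accuracy
# and the four design factors (correlation, sample size, items per factor, number of factors).
fit <- aov(accuracy ~ correlation * n * items * nfactors, data = results)

tab      <- summary(fit)[[1]]
ss       <- tab[, "Sum Sq"]
effect   <- trimws(rownames(tab))
ss.resid <- ss[effect == "Residuals"]

# Partial eta squared for every effect: SS_effect / (SS_effect + SS_residual).
partial.eta2 <- ss / (ss + ss.resid)
names(partial.eta2) <- effect
round(partial.eta2[effect != "Residuals"], 3)
```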

Table 3. VSS = Very Simple Structure; BIC = Bayesian Information Criteria; EBIC = Extended Bayesian Information Criteria; MAP = Minimum Average Partial procedure; Kaiser = Kaiser-Guttman eigenvalue greater than one rule; PA = Parallel Analysis; EGA = Exploratory Graph Analysis. Large effect sizes [61] are shown in bold and underlined.

https://doi.org/10.1371/journal.pone.0174035.t003

Using EGA in a real dataset

The dataset we are using in this section was published by Golino and Gomes [62]. It presents data from 1,803 Brazilians (52.5% female), with age varying from 5 to 85 years (M = 15.75; SD = 12.21), who answered the Inductive Reasoning Developmental Test (IRDT, 3rd version) [62], a pencil-and-paper instrument with 56 items designed to assess developmentally sequenced and hierarchically organized inductive reasoning. The dataset can be downloaded for reproducibility purposes at the following link: https://figshare.com/articles/TDRI_dataset_csv/3142321. The sequence of IRDT items was constructed to measure seven developmental stages based on the Model of Hierarchical Complexity [63, 64] and on Fischer's Dynamic Skill Theory [65, 66], two neo-Piagetian theories of development. Golino and Gomes [62] showed that two structures can be used to describe the IRDT items. The first one is a seven correlated factors model [χ²(1463) = 764.28; p = 0.00; CFI = 1.00; RMSEA = 0.00; NFI = 0.99; NNFI = 1.00], in which each factor represents one stage and explains a group of eight items (Fig 7). The other is a bifactor (Schmid-Leiman) model with seven specific first-order factors (Fig 8), each one representing one stage and explaining a group of eight items, plus a general first-order factor directly explaining the IRDT's 56 items [χ²(1428) = 2768.36; p = 0.00; CFI = 0.98; RMSEA = 0.04; NFI = 0.95; NNFI = 0.98]. The authors showed that the two models are not significantly different via Satorra and Bentler's [67] scaled chi-square test [Δχ² = -99.87; ΔDF = 35; p = 1]. Figs 7 and 8 show the standardized factor loadings and correlations of both models, and were created using semPlot [68].

Fig 7. The factors correspond to the stages the instrument intended to measure: Prp = Pre-Operational; Prm = Primary; Cnc = Concrete; Abs = Abstract; Frm = Formal; Sys = Systematic; Met = Metasystematic.

https://doi.org/10.1371/journal.pone.0174035.g007

Fig 8. The specific, first-order factors correspond to the stages the instrument intended to measure: Prp = Pre-Operational; Prm = Primary; Cnc = Concrete; Abs = Abstract; Frm = Formal; Sys = Systematic; Met = Metasystematic. The general first-order factor (G) is the general factor of inductive reasoning.

https://doi.org/10.1371/journal.pone.0174035.g008

EGA was applied to the IRDT data and suggested seven dimensions (Fig 9), along with their respective items. The nodes represent the items, and the communities (factors or dimensions) are indicated by color. The seven dimensions estimated by EGA correspond exactly to the seven first-order factors investigated in the original publication [62]. Parallel analysis, the Kaiser-Guttman rule, MAP, VSS, BIC and EBIC were used to estimate the number of dimensions in the IRDT data via the psych [59] package. Table 4 shows the statistics by number of factors, from one to ten, for each method. As highlighted in bold in Table 4, VSS suggests two factors, the Kaiser-Guttman eigenvalue rule suggests six factors, MAP seven, BIC and EBIC ten factors, and parallel analysis four factors. Among these methods, only MAP suggested the same number of factors investigated in the original publication for the IRDT data.
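For illustration only, the sketch below implements a bare-bones parallel-analysis-style check in Python: the eigenvalues of the observed correlation matrix are compared against the mean eigenvalues obtained from random normal data of the same size. This is not the psych or EGA routine used in the paper, and the simulated two-factor data (loadings, noise level, sample size) are arbitrary assumptions.

```python
import numpy as np

def parallel_analysis(data, n_sim=100, seed=0):
    """Retain components whose observed eigenvalue exceeds the mean
    eigenvalue from random normal data of the same shape (rough sketch)."""
    rng = np.random.default_rng(seed)
    n, p = data.shape
    obs = np.sort(np.linalg.eigvalsh(np.corrcoef(data, rowvar=False)))[::-1]
    sims = np.empty((n_sim, p))
    for i in range(n_sim):
        random_data = rng.standard_normal((n, p))
        sims[i] = np.sort(np.linalg.eigvalsh(np.corrcoef(random_data, rowvar=False)))[::-1]
    return int(np.sum(obs > sims.mean(axis=0)))

# Illustrative two-factor data: 10 items, 5 per factor, loadings of .7
rng = np.random.default_rng(1)
factor_scores = rng.standard_normal((500, 2))
loadings = np.kron(np.eye(2), np.ones((5, 1))) * 0.7
items = factor_scores @ loadings.T + 0.5 * rng.standard_normal((500, 10))
print("Suggested number of factors:", parallel_analysis(items))
```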

Fig 9: https://doi.org/10.1371/journal.pone.0174035.g009

Table 4. VSS = Very Simple Structure; BIC = Bayesian Information Criteria; EBIC = Extended Bayesian Information Criteria; MAP = Minimum Average Partial procedure; Kaiser = Kaiser-Guttman eigenvalue rule. The number of factors is chosen as follows: the highest value of the VSS statistic, the lowest value of the MAP, BIC and EBIC statistics, and the last observed eigenvalue greater than the simulated eigenvalue in the parallel analysis.

https://doi.org/10.1371/journal.pone.0174035.t004

Discussion

Estimating the correct number of dimensions in psychological and educational instruments is challenging [1, 2, 3]. We proposed a new method for assessing the number of dimensions in psychological data, derived from the growing field of network psychometrics, in which network models are used to model the covariance structure. We term this method exploratory graph analysis (EGA), and showed in simulation studies that the method performed comparably to parallel analysis in most cases, and better with multiple strongly correlated latent factors. In addition, EGA automatically identifies which items indicate the retrieved dimensions. We showcased EGA on an empirical dataset of the Inductive Reasoning Developmental Test.

As shown in our simulation study, EGA performed comparably to parallel analysis, BIC, eBIC and the Kaiser-Guttman rule in a number of situations, especially when the number of factors was two. However, EGA outperformed all methods when the number of items per factor was five and the correlation between factors was high in the four-factor structure. In general, EGA outperformed the other methods in the four-factor structure, with a general mean accuracy of 89%, and was the technique whose accuracy was least affected by the conditions investigated in this paper, as shown by the ANOVA's partial eta squared effect sizes in Table 3. The large differences in the four factors X high correlation X five indicators condition are remarkable, especially compared to the results of the 10-indicator condition. Future simulation studies should confirm whether these results can be replicated. Setting this condition aside, EGA performs comparably to PA over all other conditions, with the added benefit of returning which items indicate each dimension.

One surprising finding emerged from our results: the Kaiser-Guttman eigenvalue greater-than-one rule performed better than some researchers would expect [20, 24]. It presented the third best mean accuracy for the two-factor structure (Mean = 86%; SD = 35%) and for the four-factor structure (Mean = 76%, SD = 43%), losing only to parallel analysis (Mean Two-Factors = 97%, SD Two-Factors = 16%; Mean Four-Factors = 80%, SD Four-Factors = 40%) and EGA (Mean Two-Factors = 97%, SD Two-Factors = 19%; Mean Four-Factors = 89%, SD Four-Factors = 31%). However, the Kaiser-Guttman rule suffers from the same issues as parallel analysis: its accuracy is very low when the correlation between factors is high and the number of items per factor is low. Another result worth pointing out is the poor performance of VSS, which was the least accurate technique for estimating the number of factors. It should be noted that when choosing a method to investigate the number of underlying dimensions of a given dataset or instrument, one needs to consider the strengths and weaknesses of each technique, reviewing the scientific literature to see in which conditions they work best and in which conditions they fail, as well as considering the assumptions of each method. For example, VSS seeks a very simple structure, making very rigid assumptions that will be met only in a limited number of cases. Both the results of simulation studies and a careful analysis of the underlying assumptions of each method should be considered in order to make a substantiated decision regarding which technique to use.

It is important to note that we have adopted a very pragmatic approach in our study, since the goal was to investigate whether different procedures can detect the number of simulated dimensions. This is an important part of the development of new quantitative methods aiming to identify the number of dimensions or factors underlying a given instrument or dataset. It is also relevant for detecting the conditions in which the available techniques work best, the conditions in which they should be used carefully, and the circumstances under which they fail. However, detecting the correct number of factors is only possible for simulated data. Real data allow for several solutions, often similar, especially if one varies the decision criterion. The role of quantitative techniques is to support the quest for understanding the data, together with careful theoretical analysis, in order to arrive at a solution that is robust both from a quantitative and from a theoretical point of view.

As this is the first study presenting EGA and comparing it to other methods, it has important limitations that should be addressed in future research. Future research should investigate the robustness of EGA in estimating the correct number of dimensions when the data are not multivariate normal, and should compare it to the well-known and widely used scree plot technique. It would also be important to verify the accuracy of other community detection algorithms, besides the walktrap algorithm currently used in the EGA procedure, in the identification of clusters in undirected weighted networks. A similar investigation was published by Yang, Algesheimer and Tessone [69], which showed the walktrap algorithm to be one of the most accurate. However, Yang, Algesheimer and Tessone [69] investigated the accuracy of community detection algorithms for very large undirected weighted networks (with more than 1,000 nodes), which is not the usual number of variables in psychological or educational research involving tests and/or questionnaires.

There are at least four other issues to investigate further. The first two are how EGA works for different levels of factor loadings and for different types of items (polytomous and continuous). It is also important to investigate whether the findings of the current paper can be replicated in scenarios involving only one factor. Finally, future research should investigate both the communalities and the proportion of explained variance of the dimensional structure suggested by EGA, especially when using real data. We expect that, in spite of the relevant open questions briefly pointed out above, EGA can be used in real datasets. It outperformed the other methods, including the very well-known and widely used parallel analysis and minimum average partial procedure, when the number of factors was four, the number of items per factor was five and the correlation between factors was high. In a nutshell, EGA can help with an issue that has been challenging researchers since the beginning of scientific psychological testing. The findings of the current paper may be the solution that Keith, Caemmerer and Reynolds [18] were looking for when they investigated whether the available methods underestimate or overestimate the number of factors in intelligence research. In the face of the problems with parallel analysis and MAP, they pointed out that a possible solution could be found in formal and informal theory in research with cognitive tests. We argue that a possible solution is the use of EGA on intelligence-like data.

Supporting information

S1 File. Scripts used in the current simulation study.

https://doi.org/10.1371/journal.pone.0174035.s001

Author Contributions

  • Conceptualization: HFG SE.
  • Formal analysis: HFG.
  • Methodology: HFG.
  • Software: HFG.
  • Validation: HFG.
  • Visualization: HFG.
  • Writing – original draft: HFG SE.
  • Writing – review & editing: HFG SE.
  • 4. James G, Witten D, Hastie T, Tibshirani R. An Introduction to Statistical Learning. New York: Springer; 2013.
  • 5. Friedman J, Hastie T, Tibshirani R. The Elements of Statistical Learning. Berlin: Springer Series in Statistics; 2001.
  • 13. Lauritzen SL. Graphical Models. Clarendon Press; 1996.
  • 15. Foygel R, Drton M. Extended Bayesian information criteria for Gaussian graphical models. In: Advances in Neural Information Processing Systems; 2010. pp. 604–612.
  • 16. Golino HF. EGA package. Available at: github.com/hfgolino/EGA.
  • 20. Bandalos DL, Boehm-Kaufman MR. Four common misconceptions in exploratory factor analysis. In: Statistical and Methodological Myths and Urban Legends: Doctrine, Verity and Fable in the Organizational and Social Sciences; 2009. pp. 61–87.
  • 24. Velicer WF, Eaton CA, Fava JL. Construct explication through factor or component analysis: A review and evaluation of alternative procedures for determining the number of factors or components. In: Problems and Solutions in Human Assessment. Springer US; 2000. pp. 41–71.
  • 27. Dziak JJ, Coffman DL, Lanza ST, Li R. Sensitivity and specificity of information criteria. The Methodology Center and Department of Statistics, Penn State, The Pennsylvania State University; 2012.
  • 42. Epskamp S, Maris G, Waldorp LJ, Borsboom D. Network psychometrics. In: Handbook of Psychometrics. New York: Wiley; 2015.
  • 46. Epskamp S, Fried EI. Estimating regularized psychological networks using qgraph. arXiv preprint arXiv:1607.01367; 2016.
  • 50. Chandrasekaran V, Parrilo PA, Willsky AS. Latent variable graphical model selection via convex optimization. In: 48th Annual Allerton Conference on Communication, Control, and Computing (Allerton). IEEE; 2010. pp. 1610–1613.
  • 51. Epskamp S, Rhemtulla M, Borsboom D. Generalized network psychometrics: Combining network and latent variable models. arXiv preprint arXiv:1605.09288; 2016.
  • 52. Pons P, Latapy M. Computing communities in large networks using random walks. In: International Symposium on Computer and Information Sciences. Springer Berlin Heidelberg; 2005. pp. 284–293.
  • 53. R Core Team. R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing; 2016. URL http://www.R-project.org/.
  • 55. Raven JC. Progressive Matrices: A Perceptual Test of Intelligence. London: HK Lewis; 1938.
  • 56. Formann AK, Waldherr K, Piswanger K. Wiener Matrizen-Test 2 (WMT-2): Ein Rasch-skalierter sprachfreier Kurztest zur Erfassung der Intelligenz. Beltz Test; 2011.
  • 59. Revelle W. psych: Procedures for personality and psychological research. Evanston: Northwestern University; 2014. R package.
  • 61. Cohen J. Statistical Power Analysis for the Behavioral Sciences. Lawrence Erlbaum Associates; 1988.
  • 62. Golino HF, Gomes CM. Investigando estágios de desenvolvimento do raciocínio indutivo usando a análise fatorial confirmatória, o Modelo Logístico Simples de Rasch e o modelo de teste logístico linear (Rasch Estendido). In: Golino HF, Gomes CM, Amantes A, Coelho G, editors. Psicometria Contemporânea: Compreendendo os Modelos Rasch. São Paulo: Casa do Psicólogo/Pearson; 2015. pp. 283–331.
  • 68. Epskamp S. semPlot: Path diagrams and visual analysis of various SEM packages' output (Version 1.0.0). http://CRAN.R-project.org/package=semPlot; 2014.


The Beginner's Guide to Statistical Analysis | 5 Steps & Examples

Statistical analysis means investigating trends, patterns, and relationships using quantitative data . It is an important research tool used by scientists, governments, businesses, and other organizations.

To draw valid conclusions, statistical analysis requires careful planning from the very start of the research process . You need to specify your hypotheses and make decisions about your research design, sample size, and sampling procedure.

After collecting data from your sample, you can organize and summarize the data using descriptive statistics . Then, you can use inferential statistics to formally test hypotheses and make estimates about the population. Finally, you can interpret and generalize your findings.

This article is a practical introduction to statistical analysis for students and researchers. We’ll walk you through the steps using two research examples. The first investigates a potential cause-and-effect relationship, while the second investigates a potential correlation between variables.

Table of contents

  • Step 1: Write your hypotheses and plan your research design
  • Step 2: Collect data from a sample
  • Step 3: Summarize your data with descriptive statistics
  • Step 4: Test hypotheses or make estimates with inferential statistics
  • Step 5: Interpret your results

To collect valid data for statistical analysis, you first need to specify your hypotheses and plan out your research design.

Writing statistical hypotheses

The goal of research is often to investigate a relationship between variables within a population . You start with a prediction, and use statistical analysis to test that prediction.

A statistical hypothesis is a formal way of writing a prediction about a population. Every research prediction is rephrased into null and alternative hypotheses that can be tested using sample data.

While the null hypothesis always predicts no effect or no relationship between variables, the alternative hypothesis states your research prediction of an effect or relationship.

  • Null hypothesis: A 5-minute meditation exercise will have no effect on math test scores in teenagers.
  • Alternative hypothesis: A 5-minute meditation exercise will improve math test scores in teenagers.
  • Null hypothesis: Parental income and GPA have no relationship with each other in college students.
  • Alternative hypothesis: Parental income and GPA are positively correlated in college students.

Planning your research design

A research design is your overall strategy for data collection and analysis. It determines the statistical tests you can use to test your hypothesis later on.

First, decide whether your research will use a descriptive, correlational, or experimental design. Experiments directly influence variables, whereas descriptive and correlational studies only measure variables.

  • In an experimental design , you can assess a cause-and-effect relationship (e.g., the effect of meditation on test scores) using statistical tests of comparison or regression.
  • In a correlational design , you can explore relationships between variables (e.g., parental income and GPA) without any assumption of causality using correlation coefficients and significance tests.
  • In a descriptive design , you can study the characteristics of a population or phenomenon (e.g., the prevalence of anxiety in U.S. college students) using statistical tests to draw inferences from sample data.

Your research design also concerns whether you’ll compare participants at the group level or individual level, or both.

  • In a between-subjects design , you compare the group-level outcomes of participants who have been exposed to different treatments (e.g., those who performed a meditation exercise vs those who didn’t).
  • In a within-subjects design , you compare repeated measures from participants who have participated in all treatments of a study (e.g., scores from before and after performing a meditation exercise).
  • In a mixed (factorial) design , one variable is altered between subjects and another is altered within subjects (e.g., pretest and posttest scores from participants who either did or didn’t do a meditation exercise).
Example: Experimental research design. First, you'll take baseline test scores from participants. Then, your participants will undergo a 5-minute meditation exercise. Finally, you'll record participants' scores from a second math test. In this experiment, the independent variable is the 5-minute meditation exercise, and the dependent variable is the math test score from before and after the intervention.

Example: Correlational research design. In a correlational study, you test whether there is a relationship between parental income and GPA in graduating college students. To collect your data, you will ask participants to fill in a survey and self-report their parents' incomes and their own GPA.

Measuring variables

When planning a research design, you should operationalize your variables and decide exactly how you will measure them.

For statistical analysis, it’s important to consider the level of measurement of your variables, which tells you what kind of data they contain:

  • Categorical data represents groupings. These may be nominal (e.g., gender) or ordinal (e.g. level of language ability).
  • Quantitative data represents amounts. These may be on an interval scale (e.g. test score) or a ratio scale (e.g. age).

Many variables can be measured at different levels of precision. For example, age data can be quantitative (8 years old) or categorical (young). If a variable is coded numerically (e.g., level of agreement from 1–5), it doesn’t automatically mean that it’s quantitative instead of categorical.

Identifying the measurement level is important for choosing appropriate statistics and hypothesis tests. For example, you can calculate a mean score with quantitative data, but not with categorical data.

In a research study, along with measures of your variables of interest, you’ll often collect data on relevant participant characteristics.


Population vs sample

In most cases, it’s too difficult or expensive to collect data from every member of the population you’re interested in studying. Instead, you’ll collect data from a sample.

Statistical analysis allows you to apply your findings beyond your own sample as long as you use appropriate sampling procedures . You should aim for a sample that is representative of the population.

Sampling for statistical analysis

There are two main approaches to selecting a sample.

  • Probability sampling: every member of the population has a chance of being selected for the study through random selection.
  • Non-probability sampling: some members of the population are more likely than others to be selected for the study because of criteria such as convenience or voluntary self-selection.

In theory, for highly generalizable findings, you should use a probability sampling method. Random selection reduces several types of research bias , like sampling bias , and ensures that data from your sample is actually typical of the population. Parametric tests can be used to make strong statistical inferences when data are collected using probability sampling.

But in practice, it’s rarely possible to gather the ideal sample. While non-probability samples are more at risk for biases like self-selection bias, they are much easier to recruit and collect data from. Non-parametric tests are more appropriate for non-probability samples, but they result in weaker inferences about the population.

If you want to use parametric tests for non-probability samples, you have to make the case that:

  • your sample is representative of the population you’re generalizing your findings to.
  • your sample lacks systematic bias.

Keep in mind that external validity means that you can only generalize your conclusions to others who share the characteristics of your sample. For instance, results from Western, Educated, Industrialized, Rich and Democratic samples (e.g., college students in the US) aren’t automatically applicable to all non-WEIRD populations.

If you apply parametric tests to data from non-probability samples, be sure to elaborate on the limitations of how far your results can be generalized in your discussion section .

Create an appropriate sampling procedure

Based on the resources available for your research, decide on how you’ll recruit participants.

  • Will you have resources to advertise your study widely, including outside of your university setting?
  • Will you have the means to recruit a diverse sample that represents a broad population?
  • Do you have time to contact and follow up with members of hard-to-reach groups?

Example: Sampling (experimental study). Your participants are self-selected by their schools. Although you’re using a non-probability sample, you aim for a diverse and representative sample.

Example: Sampling (correlational study). Your main population of interest is male college students in the US. Using social media advertising, you recruit senior-year male college students from a smaller subpopulation: seven universities in the Boston area.

Calculate sufficient sample size

Before recruiting participants, decide on your sample size either by looking at other studies in your field or by using statistics. A sample that’s too small may be unrepresentative of the population, while a sample that’s too large will be more costly than necessary.

There are many sample size calculators online. Different formulas are used depending on whether you have subgroups or how rigorous your study should be (e.g., in clinical research). As a rule of thumb, a minimum of 30 units or more per subgroup is necessary.

To use these calculators, you have to understand and input these key components (a short power-analysis sketch follows the list):

  • Significance level (alpha): the risk of rejecting a true null hypothesis that you are willing to take, usually set at 5%.
  • Statistical power : the probability of your study detecting an effect of a certain size if there is one, usually 80% or higher.
  • Expected effect size : a standardized indication of how large the expected result of your study will be, usually based on other similar studies.
  • Population standard deviation: an estimate of the population parameter based on a previous study or a pilot study of your own.
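As a rough illustration of how these components feed into a sample size calculation, the sketch below uses the power module of statsmodels to solve for the per-group sample size of an independent-samples t test. The effect size of 0.5 is an assumed placeholder, not a recommendation.

```python
from statsmodels.stats.power import TTestIndPower

# Solve for the per-group sample size given alpha, power, and an assumed effect size
analysis = TTestIndPower()
n_per_group = analysis.solve_power(effect_size=0.5,        # assumed expected effect size
                                   alpha=0.05,             # significance level
                                   power=0.80,             # statistical power
                                   alternative="two-sided")
print(f"Required sample size per group: {n_per_group:.0f}")
```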

Once you’ve collected all of your data, you can inspect them and calculate descriptive statistics that summarize them.

Inspect your data

There are various ways to inspect your data, including the following:

  • Organizing data from each variable in frequency distribution tables .
  • Displaying data from a key variable in a bar chart to view the distribution of responses.
  • Visualizing the relationship between two variables using a scatter plot .

By visualizing your data in tables and graphs, you can assess whether your data follow a skewed or normal distribution and whether there are any outliers or missing data.

A normal distribution means that your data are symmetrically distributed around a center where most values lie, with the values tapering off at the tail ends.

Mean, median, mode, and standard deviation in a normal distribution

In contrast, a skewed distribution is asymmetric and has more values on one end than the other. The shape of the distribution is important to keep in mind because only some descriptive statistics should be used with skewed distributions.

Extreme outliers can also produce misleading statistics, so you may need a systematic approach to dealing with these values.

Calculate measures of central tendency

Measures of central tendency describe where most of the values in a data set lie. Three main measures of central tendency are often reported:

  • Mode : the most popular response or value in the data set.
  • Median : the value in the exact middle of the data set when ordered from low to high.
  • Mean : the sum of all values divided by the number of values.

However, depending on the shape of the distribution and level of measurement, only one or two of these measures may be appropriate. For example, many demographic characteristics can only be described using the mode or proportions, while a variable like reaction time may not have a mode at all.

Calculate measures of variability

Measures of variability tell you how spread out the values in a data set are. Four main measures of variability are often reported:

  • Range : the highest value minus the lowest value of the data set.
  • Interquartile range : the range of the middle half of the data set.
  • Standard deviation : the average distance between each value in your data set and the mean.
  • Variance : the square of the standard deviation.

Once again, the shape of the distribution and level of measurement should guide your choice of variability statistics. The interquartile range is the best measure for skewed distributions, while standard deviation and variance provide the best information for normal distributions.
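The short sketch below computes the central tendency and variability measures described above for a small, purely illustrative set of test scores.

```python
import numpy as np
from statistics import mode

scores = np.array([12, 15, 15, 17, 18, 20, 21, 21, 21, 24])  # illustrative test scores

# Central tendency
print("Mean:  ", scores.mean())
print("Median:", np.median(scores))
print("Mode:  ", mode(scores.tolist()))

# Variability
print("Range: ", scores.max() - scores.min())
q1, q3 = np.percentile(scores, [25, 75])
print("IQR:   ", q3 - q1)
print("SD:    ", scores.std(ddof=1))   # sample standard deviation
print("Var:   ", scores.var(ddof=1))   # sample variance
```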

Using your table, you should check whether the units of the descriptive statistics are comparable for pretest and posttest scores. For example, are the variance levels similar across the groups? Are there any extreme values? If there are, you may need to identify and remove extreme outliers in your data set or transform your data before performing a statistical test.

Example: Descriptive statistics (experimental study). From this table, we can see that the mean score increased after the meditation exercise, and the variances of the two scores are comparable. Next, we can perform a statistical test to find out if this improvement in test scores is statistically significant in the population.

Example: Descriptive statistics (correlational study). After collecting data from 653 students, you tabulate descriptive statistics for annual parental income and GPA.

It’s important to check whether you have a broad range of data points. If you don’t, your data may be skewed towards some groups more than others (e.g., high academic achievers), and only limited inferences can be made about a relationship.

A number that describes a sample is called a statistic , while a number describing a population is called a parameter . Using inferential statistics , you can make conclusions about population parameters based on sample statistics.

Researchers often use two main methods (simultaneously) to make inferences in statistics.

  • Estimation: calculating population parameters based on sample statistics.
  • Hypothesis testing: a formal process for testing research predictions about the population using samples.

You can make two types of estimates of population parameters from sample statistics:

  • A point estimate : a value that represents your best guess of the exact parameter.
  • An interval estimate : a range of values that represent your best guess of where the parameter lies.

If your aim is to infer and report population characteristics from sample data, it’s best to use both point and interval estimates in your paper.

You can consider a sample statistic a point estimate for the population parameter when you have a representative sample (e.g., in a wide public opinion poll, the proportion of a sample that supports the current government is taken as the population proportion of government supporters).

There’s always error involved in estimation, so you should also provide a confidence interval as an interval estimate to show the variability around a point estimate.

A confidence interval uses the standard error and the z score from the standard normal distribution to convey where you’d generally expect to find the population parameter most of the time.
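A minimal sketch of a point estimate with a 95% confidence interval, using the standard error and the z score of 1.96; the sample values are illustrative.

```python
import numpy as np

sample = np.array([72, 75, 78, 80, 68, 74, 77, 79, 73, 76])  # illustrative scores

point_estimate = sample.mean()
standard_error = sample.std(ddof=1) / np.sqrt(len(sample))
z = 1.96  # z score for a 95% confidence level

lower = point_estimate - z * standard_error
upper = point_estimate + z * standard_error
print(f"Point estimate: {point_estimate:.2f}")
print(f"95% CI: [{lower:.2f}, {upper:.2f}]")
```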

Hypothesis testing

Using data from a sample, you can test hypotheses about relationships between variables in the population. Hypothesis testing starts with the assumption that the null hypothesis is true in the population, and you use statistical tests to assess whether the null hypothesis can be rejected or not.

Statistical tests determine where your sample data would lie on an expected distribution of sample data if the null hypothesis were true. These tests give two main outputs:

  • A test statistic tells you how much your data differs from the null hypothesis of the test.
  • A p value tells you the likelihood of obtaining your results if the null hypothesis is actually true in the population.

Statistical tests come in three main varieties:

  • Comparison tests assess group differences in outcomes.
  • Regression tests assess cause-and-effect relationships between variables.
  • Correlation tests assess relationships between variables without assuming causation.

Your choice of statistical test depends on your research questions, research design, sampling method, and data characteristics.

Parametric tests

Parametric tests make powerful inferences about the population based on sample data. But to use them, some assumptions must be met, and only some types of variables can be used. If your data violate these assumptions, you can perform appropriate data transformations or use alternative non-parametric tests instead.

A regression models the extent to which changes in a predictor variable result in changes in an outcome variable (or variables).

  • A simple linear regression includes one predictor variable and one outcome variable.
  • A multiple linear regression includes two or more predictor variables and one outcome variable.

Comparison tests usually compare the means of groups. These may be the means of different groups within a sample (e.g., a treatment and control group), the means of one sample group taken at different times (e.g., pretest and posttest scores), or a sample mean and a population mean.

  • A t test is for exactly 1 or 2 groups when the sample is small (30 or less).
  • A z test is for exactly 1 or 2 groups when the sample is large.
  • An ANOVA is for 3 or more groups.

The z and t tests have subtypes based on the number and types of samples and the hypotheses:

  • If you have only one sample that you want to compare to a population mean, use a one-sample test .
  • If you have paired measurements (within-subjects design), use a dependent (paired) samples test .
  • If you have completely separate measurements from two unmatched groups (between-subjects design), use an independent (unpaired) samples test .
  • If you expect a difference between groups in a specific direction, use a one-tailed test .
  • If you don’t have any expectations for the direction of a difference between groups, use a two-tailed test .

The only parametric correlation test is Pearson’s r . The correlation coefficient ( r ) tells you the strength of a linear relationship between two quantitative variables.

However, to test whether the correlation in the sample is strong enough to be important in the population, you also need to perform a significance test of the correlation coefficient, usually a t test, to obtain a p value. This test uses your sample size to calculate how much the correlation coefficient differs from zero in the population.

You use a dependent-samples, one-tailed t test to assess whether the meditation exercise significantly improved math test scores. The test gives you the following results (a short SciPy sketch follows the list):

  • a t value (test statistic) of 3.00
  • a p value of 0.0028
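A hedged SciPy sketch of this kind of dependent (paired) samples, one-tailed t test; the pretest/posttest scores are made up for illustration and will not reproduce the t = 3.00 result above. The alternative argument requires SciPy 1.6 or later.

```python
import numpy as np
from scipy import stats

# Illustrative pretest/posttest scores from the same participants (within-subjects design)
pre  = np.array([60, 72, 55, 81, 63, 70, 68, 59])
post = np.array([66, 75, 58, 84, 70, 71, 72, 65])

# Dependent (paired) samples t test, one-tailed: we expect post > pre
t_stat, p_value = stats.ttest_rel(post, pre, alternative="greater")
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```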

Although Pearson’s r is a test statistic, it doesn’t tell you anything about how significant the correlation is in the population. You also need to test whether this sample correlation coefficient is large enough to demonstrate a correlation in the population.

A t test can also determine how significantly a correlation coefficient differs from zero based on sample size. Since you expect a positive correlation between parental income and GPA, you use a one-sample, one-tailed t test. The t test gives you the following results (a short sketch follows the list):

  • a t value of 3.08
  • a p value of 0.001
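In practice, the correlation coefficient and its p value are usually obtained in one call; the sketch below uses scipy.stats.pearsonr on made-up income and GPA values, so the numbers will not match the worked example above.

```python
import numpy as np
from scipy import stats

# Illustrative parental income (in $1,000s) and GPA values
income = np.array([35, 48, 52, 61, 75, 82, 90, 110, 125, 140])
gpa    = np.array([2.8, 3.0, 2.9, 3.2, 3.3, 3.1, 3.5, 3.6, 3.4, 3.8])

r, p_value = stats.pearsonr(income, gpa)  # correlation coefficient and two-tailed p value
print(f"Pearson's r = {r:.2f}, p = {p_value:.4f}")
```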


The final step of statistical analysis is interpreting your results.

Statistical significance

In hypothesis testing, statistical significance is the main criterion for forming conclusions. You compare your p value to a set significance level (usually 0.05) to decide whether your results are statistically significant or non-significant.

Statistically significant results are considered unlikely to have arisen solely due to chance. There is only a very low chance of such a result occurring if the null hypothesis is true in the population.

Example: Interpret your results (experimental study). This means that you believe the meditation intervention, rather than random factors, directly caused the increase in test scores.

Example: Interpret your results (correlational study). You compare your p value of 0.001 to your significance threshold of 0.05. With a p value under this threshold, you can reject the null hypothesis. This indicates a statistically significant correlation between parental income and GPA in male college students.

Note that correlation doesn’t always mean causation, because there are often many underlying factors contributing to a complex variable like GPA. Even if one variable is related to another, this may be because of a third variable influencing both of them, or indirect links between the two variables.

Effect size

A statistically significant result doesn’t necessarily mean that there are important real life applications or clinical outcomes for a finding.

In contrast, the effect size indicates the practical significance of your results. It’s important to report effect sizes along with your inferential statistics for a complete picture of your results. You should also report interval estimates of effect sizes if you’re writing an APA style paper .

Example: Effect size (experimental study). With a Cohen’s d of 0.72, there’s medium to high practical significance to your finding that the meditation exercise improved test scores.

Example: Effect size (correlational study). To determine the effect size of the correlation coefficient, you compare your Pearson’s r value to Cohen’s effect size criteria.
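A small sketch of how Cohen's d can be computed for two independent groups using the pooled standard deviation; the group scores are illustrative and are not the data behind the d = 0.72 in the example above.

```python
import numpy as np

def cohens_d(group1, group2):
    """Cohen's d for two independent groups, using the pooled standard deviation."""
    n1, n2 = len(group1), len(group2)
    pooled_var = ((n1 - 1) * np.var(group1, ddof=1) +
                  (n2 - 1) * np.var(group2, ddof=1)) / (n1 + n2 - 2)
    return (np.mean(group1) - np.mean(group2)) / np.sqrt(pooled_var)

# Illustrative posttest scores for a meditation group vs. a control group
meditation = np.array([78, 82, 74, 88, 80, 85, 79, 83])
control    = np.array([72, 75, 70, 81, 74, 78, 73, 76])
print(f"Cohen's d = {cohens_d(meditation, control):.2f}")
```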

Decision errors

Type I and Type II errors are mistakes made in research conclusions. A Type I error means rejecting the null hypothesis when it’s actually true, while a Type II error means failing to reject the null hypothesis when it’s false.

You can aim to minimize the risk of these errors by selecting an optimal significance level and ensuring high power . However, there’s a trade-off between the two errors, so a fine balance is necessary.

Frequentist versus Bayesian statistics

Traditionally, frequentist statistics emphasizes null hypothesis significance testing and always starts with the assumption of a true null hypothesis.

However, Bayesian statistics has grown in popularity as an alternative approach in the last few decades. In this approach, you use previous research to continually update your hypotheses based on your expectations and observations.

Bayes factor compares the relative strength of evidence for the null versus the alternative hypothesis rather than making a conclusion about rejecting the null hypothesis or not.


Data Analysis in Research: Types & Methods


Content Index

  • Why analyze data in research?
  • Types of data in research
  • Finding patterns in the qualitative data
  • Methods used for data analysis in qualitative research
  • Preparing data for analysis
  • Methods used for data analysis in quantitative research
  • Considerations in research data analysis
  • What is data analysis in research?

Definition of research in data analysis: According to LeCompte and Schensul, research data analysis is a process used by researchers to reduce data to a story and interpret it to derive insights. The data analysis process helps reduce a large chunk of data into smaller fragments that make sense.

Three essential things occur during the data analysis process. The first is data organization. The second is data reduction through summarization and categorization, which helps find patterns and themes in the data for easy identification and linking. The third and last is data analysis itself, which researchers perform in both top-down and bottom-up fashion.


On the other hand, Marshall and Rossman describe data analysis as a messy, ambiguous, and time-consuming but creative and fascinating process through which a mass of collected data is brought to order, structure and meaning.

We can say that “the data analysis and data interpretation is a process representing the application of deductive and inductive logic to the research and data analysis.”

Researchers rely heavily on data as they have a story to tell or research problems to solve. It starts with a question, and data is nothing but an answer to that question. But, what if there is no question to ask? Well! It is possible to explore data even without a problem – we call it ‘Data Mining’, which often reveals some interesting patterns within the data that are worth exploring.

Regardless of the type of data researchers explore, their mission and their audience's vision guide them to find the patterns that shape the story they want to tell. One of the essential things expected from researchers while analyzing data is to stay open and remain unbiased toward unexpected patterns, expressions, and results. Remember, sometimes data analysis tells the most unforeseen yet exciting stories that were not expected when the analysis began. Therefore, rely on the data you have at hand and enjoy the journey of exploratory research.


Every kind of data has the quality of describing things once a specific value has been assigned to it. For analysis, you need to organize these values, processing and presenting them in a given context, to make them useful. Data can come in different forms; here are the primary data types.

  • Qualitative data: When the data presented has words and descriptions, we call it qualitative data. Although you can observe this data, it is subjective and harder to analyze in research, especially for comparison. Example: anything describing taste, experience, texture, or an opinion is considered qualitative data. This type of data is usually collected through focus groups, personal qualitative interviews, qualitative observation, or open-ended questions in surveys.
  • Quantitative data: Any data expressed in numbers or numerical figures is called quantitative data. This type of data can be distinguished into categories, grouped, measured, calculated, or ranked. Example: questions about age, rank, cost, length, weight, scores, etc. all produce this type of data. You can present such data in graphical format or charts, or apply statistical analysis methods to it. The OMS (Outcomes Measurement Systems) questionnaires in surveys are a significant source of numeric data.
  • Categorical data: This is data presented in groups. However, an item included in the categorical data cannot belong to more than one group. Example: a person responding to a survey by indicating their living style, marital status, smoking habit, or drinking habit provides categorical data. A chi-square test is a standard method used to analyze this data (a minimal sketch follows this list).
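As a minimal illustration of a chi-square test on categorical data, the sketch below builds a small contingency table (the counts are invented) and tests whether the two groupings are independent.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Illustrative contingency table: smoking habit (rows) by marital status (columns)
observed = np.array([
    [30, 10],   # smokers: married, single
    [60, 50],   # non-smokers: married, single
])

chi2, p_value, dof, expected = chi2_contingency(observed)
print(f"chi2 = {chi2:.2f}, p = {p_value:.4f}, dof = {dof}")
```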


Data analysis in qualitative research

Data analysis in qualitative research works a little differently from that of numerical data, as qualitative data is made up of words, descriptions, images, objects, and sometimes symbols. Getting insight from such complicated information is a complicated process; hence it is typically used for exploratory research and data analysis.

Although there are several ways to find patterns in textual information, a word-based method is the most relied-upon and widely used technique for research and data analysis. Notably, the data analysis process in qualitative research is largely manual. Here the researchers usually read the available data and find repetitive or commonly used words.

For example, while studying data collected from African countries to understand the most pressing issues people face, researchers might find  “food”  and  “hunger” are the most commonly used words and will highlight them for further analysis.
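A word-frequency pass like the one described above can be sketched in a few lines of Python; the responses and stop-word list below are invented for illustration.

```python
import re
from collections import Counter

# Illustrative open-ended survey responses
responses = [
    "Food prices keep rising and hunger is getting worse",
    "Access to food is the main problem in my village",
    "Hunger affects the children the most",
]

words = re.findall(r"[a-z]+", " ".join(responses).lower())
stopwords = {"the", "and", "is", "in", "to", "my", "of", "are"}
counts = Counter(w for w in words if w not in stopwords)
print(counts.most_common(5))  # "food" and "hunger" rise to the top
```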


The keyword context is another widely used word-based technique. In this method, the researcher tries to understand the concept by analyzing the context in which the participants use a particular keyword.  

For example , researchers conducting research and data analysis for studying the concept of ‘diabetes’ amongst respondents might analyze the context of when and how the respondent has used or referred to the word ‘diabetes.’

The scrutiny-based technique is also one of the highly recommended text analysis methods used to identify patterns in qualitative data. Compare and contrast is the most widely used method under this technique; it examines how a specific text is similar to or different from another.

For example: to find out the “importance of a resident doctor in a company,” the collected data is divided into people who think it is necessary to hire a resident doctor and those who think it is unnecessary. Compare and contrast is the best method for analyzing polls with single-answer question types.

Metaphors can be used to reduce the data pile and find patterns in it so that it becomes easier to connect data with theory.

Variable Partitioning is another technique used to split variables so that researchers can find more coherent descriptions and explanations from the enormous data.


There are several techniques to analyze the data in qualitative research, but here are some commonly used methods,

  • Content Analysis: It is widely accepted and the most frequently employed technique for data analysis in research methodology. It can be used to analyze documented information from text, images, and sometimes physical items. When and where to use this method depends on the research questions.
  • Narrative Analysis: This method is used to analyze content gathered from various sources such as personal interviews, field observation, and surveys. Most of the time, the stories or opinions shared by people are focused on finding answers to the research questions.
  • Discourse Analysis: Similar to narrative analysis, discourse analysis is used to analyze interactions with people. Nevertheless, this particular method considers the social context under which, or within which, the communication between the researcher and respondent takes place. In addition, discourse analysis also focuses on the lifestyle and day-to-day environment while deriving any conclusion.
  • Grounded Theory: When you want to explain why a particular phenomenon happened, using grounded theory to analyze qualitative data is the best resort. Grounded theory is applied to study data about a host of similar cases occurring in different settings. When researchers use this method, they may alter their explanations or produce new ones until they arrive at a conclusion.


Data analysis in quantitative research

The first stage in research and data analysis is to prepare the data for analysis so that nominal data can be converted into something meaningful. Data preparation consists of the phases below.

Phase I: Data Validation

Data validation is done to understand whether the collected data sample meets the pre-set standards or is a biased data sample. It is divided into four different stages:

  • Fraud: To ensure an actual human being records each response to the survey or the questionnaire
  • Screening: To make sure each participant or respondent is selected or chosen in compliance with the research criteria
  • Procedure: To ensure ethical standards were maintained while collecting the data sample
  • Completeness: To ensure that the respondent has answered all the questions in an online survey. Else, the interviewer had asked all the questions devised in the questionnaire.

Phase II: Data Editing

More often than not, an extensive research data sample comes loaded with errors. Respondents sometimes fill in some fields incorrectly or skip them accidentally. Data editing is a process wherein the researchers confirm that the provided data is free of such errors. They conduct the necessary checks, including outlier checks, to edit the raw data and make it ready for analysis.

Phase III: Data Coding

Out of all three, this is the most critical phase of data preparation, associated with grouping and assigning values to the survey responses. If a survey is completed with a sample size of 1,000, the researcher will create age brackets to distinguish the respondents based on their age. It then becomes easier to analyze small data buckets rather than deal with the massive data pile (a short pandas sketch follows).
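A hedged pandas sketch of this kind of data coding, bucketing respondents into age brackets; the ages and bracket boundaries are illustrative assumptions.

```python
import pandas as pd

# Illustrative survey responses with respondent age
df = pd.DataFrame({"respondent": range(1, 9),
                   "age": [19, 23, 31, 37, 44, 52, 61, 70]})

# Code ages into brackets so responses can be analyzed per group
bins = [18, 25, 35, 50, 65, 100]
labels = ["18-24", "25-34", "35-49", "50-64", "65+"]
df["age_bracket"] = pd.cut(df["age"], bins=bins, labels=labels, right=False)
print(df.groupby("age_bracket", observed=True).size())
```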


After the data is prepared for analysis, researchers are open to using different research and data analysis methods to derive meaningful insights. Statistical analysis is by far the most favored way to analyze numerical data. In statistical analysis, distinguishing between categorical data and numerical data is essential, as categorical data involves distinct categories or labels, while numerical data consists of measurable quantities. The method is again classified into two groups: first, descriptive statistics, used to describe the data; second, inferential statistics, which help in comparing the data and generalizing beyond it.

Descriptive statistics

This method is used to describe the basic features of versatile types of data in research. It presents the data in such a meaningful way that patterns in the data start making sense. Nevertheless, descriptive analysis does not go beyond summarizing the data; the conclusions are again based on the hypotheses researchers have formulated so far. Here are a few major types of descriptive analysis methods.

Measures of Frequency

  • Count, Percent, Frequency
  • It is used to denote how often a particular event occurs.
  • Researchers use it when they want to showcase how often a response is given.

Measures of Central Tendency

  • Mean, Median, Mode
  • The method is widely used to demonstrate distribution by various points.
  • Researchers use this method when they want to showcase the most commonly or averagely indicated response.

Measures of Dispersion or Variation

  • Range, Variance, Standard deviation
  • The range is the difference between the highest and lowest values.
  • The standard deviation is the average difference between each observed score and the mean; the variance is the square of the standard deviation.
  • These measures are used to identify the spread of scores by stating intervals.
  • Researchers use this method to show how spread out the data is. It helps them identify how far the data is spread out and the extent to which that spread affects the mean.

Measures of Position

  • Percentile ranks, Quartile ranks
  • It relies on standardized scores helping researchers to identify the relationship between different scores.
  • It is often used when researchers want to compare scores with the average count.

For quantitative research, descriptive analysis often gives absolute numbers, but those numbers alone are not sufficient to demonstrate the rationale behind them. Nevertheless, it is necessary to think of the best method for research and data analysis suiting your survey questionnaire and the story researchers want to tell. For example, the mean is the best way to demonstrate students' average scores in schools. It is better to rely on descriptive statistics when the researchers intend to keep the research or outcome limited to the provided sample without generalizing it. For example, when you want to compare the average votes cast in two different cities, descriptive statistics are enough.

Descriptive analysis is also called a ‘univariate analysis’ since it is commonly used to analyze a single variable.

Inferential statistics

Inferential statistics are used to make predictions about a larger population after research and data analysis of a sample representing that population. For example, you can ask some 100-odd audience members at a movie theater if they like the movie they are watching. Researchers then use inferential statistics on the collected sample to reason that about 80-90% of people like the movie.

Here are two significant areas of inferential statistics.

  • Estimating parameters: It takes statistics from the sample research data and demonstrates something about the population parameter.
  • Hypothesis test: It’s about sampling research data to answer the survey research questions. For example, researchers might be interested to understand whether a newly launched shade of lipstick is good or not, or whether multivitamin capsules help children perform better at games.

These are sophisticated analysis methods used to showcase the relationship between different variables instead of describing a single variable. It is often used when researchers want something beyond absolute numbers to understand the relationship between variables.

Here are some of the commonly used methods for data analysis in research.

  • Correlation: When researchers are not conducting experimental or quasi-experimental research but still want to understand the relationship between two or more variables, they opt for correlational research methods.
  • Cross-tabulation: Also called contingency tables, cross-tabulation is used to analyze the relationship between multiple variables. Suppose the data has age and gender categories presented in rows and columns; a two-dimensional cross-tabulation shows the number of males and females in each age category at a glance (see the sketch after this list).
  • Regression analysis: To understand how strongly two variables are related, researchers usually reach first for regression analysis, which is also a form of predictive analysis. In this method there is one essential factor, the dependent variable, along with one or more independent variables, and the goal is to estimate the impact of the independent variables on the dependent variable. The values of both the independent and dependent variables are assumed to be measured without systematic error.
  • Frequency tables: These summarize how often each value or category occurs in the data, making it easy to see which responses dominate and to compare distributions across groups.
  • Analysis of variance (ANOVA): This statistical procedure tests the degree to which two or more groups vary or differ in an experiment. A considerable degree of variation suggests the research findings are significant. In many contexts, ANOVA testing and variance analysis are treated as synonymous.
  • Researchers must have the necessary skills to analyze and handle the data, and should be trained to demonstrate a high standard of research practice. Ideally, researchers should possess more than a basic understanding of why one statistical method is selected over another in order to obtain better data insights.
  • Research and data analytics projects usually differ by scientific discipline; therefore, getting statistical advice at the beginning of the analysis helps in designing the survey questionnaire, selecting data collection methods, and choosing samples.
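As a minimal sketch of cross-tabulation, correlation, and simple regression, the snippet below uses pandas and numpy on a small invented data set (all column names and values are hypothetical).

```python
import numpy as np
import pandas as pd

# Invented respondent data, purely for illustration.
df = pd.DataFrame({
    "gender": ["F", "M", "F", "M", "F", "M", "F", "M"],
    "age_group": ["18-29", "18-29", "30-44", "30-44", "45-59", "45-59", "60+", "60+"],
    "ad_spend": [10, 12, 15, 18, 20, 22, 25, 30],   # independent variable
    "purchases": [3, 4, 5, 6, 6, 7, 8, 9],          # dependent variable
})

# Cross-tabulation: counts of gender within each age category.
print(pd.crosstab(df["age_group"], df["gender"]))

# Correlation between ad spend and purchases.
print(df["ad_spend"].corr(df["purchases"]))

# Simple linear regression (least squares) of purchases on ad spend.
slope, intercept = np.polyfit(df["ad_spend"], df["purchases"], 1)
print(f"purchases ≈ {slope:.2f} * ad_spend + {intercept:.2f}")
```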


  • The primary aim of research and data analysis is to derive insights that are unbiased. Any mistake, or any bias, in collecting the data, selecting an analysis method, or choosing the audience sample is likely to lead to a biased inference.
  • No amount of sophistication in the analysis can rectify poorly defined objectives or outcome measurements. Whether the design is at fault or the intentions are unclear, the resulting lack of clarity can mislead readers, so avoid the practice.
  • The motive behind data analysis in research is to present accurate and reliable data. As far as possible, avoid statistical errors, and find ways to deal with everyday challenges such as outliers, missing data, data alteration, data mining, and building graphical representations.

The sheer amount of data generated daily is staggering, especially now that data analysis has taken center stage; in 2018 alone, the total data supply amounted to 2.8 trillion gigabytes. It is clear that enterprises that want to survive in a hypercompetitive world must be able to analyze complex research data, derive actionable insights, and adapt to new market needs.


QuestionPro is an online survey platform that empowers organizations in data analysis and research and provides them a medium to collect data by creating appealing surveys.


Market Research and Analysis - Part 3: Market Trend Analysis

Greg Morris


This article (and the next) focuses on trends in the market—an explanation as to why markets trend, reasons why it is good to know that markets trend, and finally a large research section into how much markets trend. This analysis will initially be shown on 109 market indices that involve domestic, international, and commodity sectors. Following that, the full list of S&P GICS sectors, industry groups, and industries is shown in the same format. There is a great amount of data in these two sections. I try to slice through it with simple analysis, keeping in mind that lots of data does not equate to information.

Why Markets Trend

Trends in markets are generally caused by short-term supply-and-demand imbalances with a heavy overdose of human emotion. When you buy a stock, you know that someone had to sell it to you. If the market has been rising recently, then you know you will probably pay a higher price for it, and the seller also knows he can get a higher price for it. The buying enthusiasm is much greater than the selling enthusiasm.

I hate it when the financial media makes a comment when the market is down by saying that there are more sellers than buyers. They clearly do not understand how these markets work. Based on shares, there are always the same number of buyers and sellers; it is the buying and selling enthusiasm that changes.

Trending is a positive feedback process. Even Isaac Newton believed in trends with his first law of motion, which stated that an object at rest stays at rest, while an object in motion stays in motion, with the same speed and in the same direction unless acted on by an unbalanced force. Hey, an apple will continue to fall until it hits the ground. Positive feedback is the direct result of an investor's confidence in the price trend. When prices rise, investors confidently buy into higher and higher prices.

Supply and Demand

A buyer of a stock, which is the demand, bids for a certain amount of stock at a certain price. A seller, which is the supply, offers a certain amount at a certain price. I think it is fair to say that one buys a stock with the anticipation that they can sell it later to someone at a higher price. Not an unreasonable desire, and probably what drives most investors. The buyer has no idea who will sell it to him, or why they would sell it to him. He may assume that he and the seller have a complete disagreement on the future value of that stock. And that might be correct; however, the buyer will never know. In fact, the buyer just might be the very person the seller was counting on to buy it from him at a higher price.

The reasons for buying and selling stock are complex and impossible to quantify. However, when they eventually agree, what is it that they agreed on? Was it the earnings of the company? Was it the products the company produces? Was it the management team? Was it the amount of the stock's dividend? Was it the sales revenues? As it turns out, it was none of those things; the transaction was settled because they agreed on the price of the stock, and that alone determines profit or loss. Changes in supply and demand are reflected immediately in price, which is an instantaneous assessment of supply and demand.

What Do You Know about This Chart?

In Figure 10.1, I have removed the price scale, the dates, and the name of this issue; now let me ask you some questions about this issue.


  • Is this a chart of daily prices, weekly prices, or 30-minute prices?
  • Is this a chart of a stock, a commodity, or a market index? (Okay, I'll give you this much, it is a daily price chart of a stock over a period of about six years.)
  • During this period of time, there were 11 earnings announcements. Can you show me where one of those announcements occurred and, if you could, whether the earnings report was considered good or bad?
  • Also during the period of time for this chart, there were seven Federal Open Market Committee (FOMC) announcements. Can you tell me where one of them occurred, and whether the announcement was considered good or bad?
  • Does this stock pay a dividend?
  • Hurricane Katrina occurred during this period displayed on this chart; can you tell me where it is?
  • Finally, would you want to buy this stock at the beginning of the period displayed and then sell it at the end of the period (right side of chart)?

I doubt, in fact, I know you cannot answer most of the above questions with any tool other than guessing. The point of this exercise is to point out that there is always and ever noise in stock prices. This noise comes in hundreds of different colors, sizes, shapes, and media formats. The bottom line is that it is just noise. The financial media bombards us all day long with noise. I do not think they do it maliciously; they do it because they believe they are giving you valuable information to help you make investment decisions. Nothing could be further from the truth.

Of course, question number 7 is the one question that most can answer, because from the chart a buy-and-hold investment during the data displayed clearly resulted in no investment growth.

However, let me tell you what I see as shown in Figure 10.2. I see two really good uptrends and, if I had a trend-following methodology that could capture 65 percent to 75 percent of those uptrends, I would be happy. I also see two good downtrends, and if I had a methodology that could avoid about 75 percent of them, I would also be happy. If you could do that for the amount of time shown in Figure 10.2, then you would come out considerably better off than the buy-and-hold investor. I generally only participate in the long side of the market and move to cash or cash equivalents when defensive. However, a long-short strategy could possibly derive even greater profit.


Trend vs. Mean Reversion

I prefer to use a market analysis methodology called trend following. Sometimes it should be called trend continuation. Why? Trend analysis works on the thoroughly researched concept that once a trend is identified, it has a reasonable probability of continuing. I know that is the case because, most of the time, markets are trending, and I see no reason to adopt a different strategy during a period of mean reversion, such as the market experiences from time to time.

You can think of trend following as a positive feedback mechanism. Mean reverting measures are those that oscillate between predetermined parameters; oftentimes the selection of those parameters is the problem. Mean reversion strategies are clearly superior during those volatile sideways times, but the implementation of a mean reverting process requires a level of guessing that I refuse to be a part of. You can think of mean reversion as a negative feedback mechanism.

In technical analysis, there are many mean reverting measures that could be used. They are the ones where you frequently hear the terms overbought and oversold. Overbought means the measurement shows that prices have moved upward to a limit that is predefined. Oversold means the opposite—prices have moved down to a predetermined level. The problem with that type of indicator or measurement is that a parameter needs to be set beforehand to know what the overbought and oversold levels are. Also, if you believe something mean reverts, you will probably have difficulty in determining the rate of reversion. For mean reversion to be relevant, there must be a meaning tied to average (mean) and, since most market data does not adhere to normal distributions, the mean isn't as meaningful (sic). Kind of like charting net worth and removing billionaires to make the data less skewed and therefore a more meaningful average.

Clearly, mean reverting measurements would work better in highly volatile markets, such as we witness from time to time. One might ask the question: Why don't you incorporate both into your model? A fair question, but one that shows the inquirer is forgetting that hindsight is not an analysis tool that will serve you well. When do you switch from one strategy (trend following) to the other (mean reversion)? Therein lies the problem.

Another question that might be asked is why not use adaptive measures to help identify the two types of markets. Again, another fair question! I think the lag between the two types of markets and the fact that often there is no clear period of delineation is the issue. It is a natural instinct to want to change the strategy in order to respond more quickly from one to the other. Natural instincts are what we are trying to avoid, simply because they are generally wrong, and painfully wrong at the worst times.

The transition from trend following to mean reversion can be difficult to see except with 20/20 hindsight. For example, when you view a chart which clearly has gone from trending to reversion, from that point, if we had used a simple mean reverting measurement, we would have looked like geniuses. However, in reality, periods like that have existed many times in the past in overall trending markets. Then the next problem becomes when to move away from a mean reverting strategy back to a trend following one. Again, hindsight always gives the precise answer, but in reality it is extremely difficult to implement in real time.

The bottom line is that with markets that generally trend most of the time, keeping a set of rules and stop loss levels in place will probably always win over the long-term. Sharpshooting the process is the beginning of the end. Trend following is somewhat similar to a momentum strategy except for two significant differences: one, momentum strategies generally rank past performance for selection, and two, often they do not utilize stop-loss methods, instead moving in and out of top performers. They both rely on the persistence of price behavior.

Trend Analysis

If one is going to be a trend follower, what is the first thing that must be done (rhetorical)? In order to be a trend follower, you must first determine the minimum length trend you want to identify. You cannot follow every little up and down move in the market; you must decide what the minimum trend length is that you want to follow. Once this is done, you can then develop trend-following indicators using parameters that will help identify trends in the market based on the minimum length you have decided on.

Figure 10.3 is an example of various trend-following periods. The top plot is the Nasdaq Composite index. The second plot is a filtered wave showing the trend analysis for a fairly short-term-oriented trend system. This is for traders and those who want to try to capture every small up and down move in the market; a process not adopted by this author. The third plot is the ideal trend system, where it is obvious that you buy at the long-term bottom and sell at the long-term top. You must realize that this trend analysis can only be done with perfect 20/20 hindsight, and is probably even more difficult than the short-term process shown in the second plot. The bottom plot is a trend analysis process that is at the heart of the concepts discussed in this book. It is a trend-following process that recognizes you cannot participate in every small up and down move, but tries to capture most of the up moves and avoid most of the down moves.


There is a concept developed by the late Arthur Merrill called Filtered Waves. A filtered wave is a measurement of price movement in which only moves that exceed a predetermined percentage are counted. You also need to decide which price component to use for the filtered wave: just the closing prices, or a combination of the high and low prices. In the latter case, while prices are rising the high is used, and while prices are falling the low price is used. I personally prefer the high and low prices, as they more fully reflect the price movements, whereas closing prices alone discard some of the data.
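As a rough sketch of the idea (not the author's exact implementation), the function below builds a simplified close-only filtered wave: a swing point is confirmed only after price reverses by at least the chosen filter percentage.

```python
def filtered_wave(closes, pct=0.05):
    """Return the indices of swing points for a close-only filtered wave.

    A new swing point is confirmed only after price reverses by at least
    `pct` (0.05 = 5%) from the current leg's extreme. The final, still
    developing leg is not confirmed (the exception noted in the text).
    """
    pivots = [0]        # confirmed swing-point indices
    direction = 0       # +1 rising leg, -1 falling leg, 0 not yet determined
    extreme = 0         # index of the current leg's extreme price

    for i in range(1, len(closes)):
        change = closes[i] / closes[extreme] - 1.0
        if direction >= 0 and change <= -pct:
            pivots.append(extreme)          # prior extreme becomes a swing high
            direction, extreme = -1, i
        elif direction <= 0 and change >= pct:
            pivots.append(extreme)          # prior extreme becomes a swing low
            direction, extreme = +1, i
        elif (direction >= 0 and closes[i] > closes[extreme]) or \
             (direction <= 0 and closes[i] < closes[extreme]):
            extreme = i                     # extend the current leg's extreme
    return sorted(set(pivots))

# Example: a 5% filtered wave on a short, invented price series.
prices = [100, 103, 99, 104, 110, 104, 98, 103, 109, 115]
print(filtered_wave(prices, pct=0.05))
```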

For example, in Figure 10.4, the background plot is the S&P 500 Index with both the close-only (C) and the high-low (H-L) filtered waves overlaid on the prices. You can see that the H-L filtered wave technique picks up more of the data; in fact, it shows a move of 5 percent in the middle of the plot that the close-only version did not show. In this particular example, the zigzag line uses a filter of 5 percent, which means that each time it changes direction, it had previously moved at least 5 percent in the opposite direction. There is one exception to this, and that is the last move of the zigzag line (there is a similar discussion in an earlier chapter). It merely moves to the most recent close regardless of the percentage moved, so it must be ignored.


The bottom plot in Figure 10.5 shows the filtered wave broken down into up moves and down moves, counting the number of periods in each move. There are three horizontal lines on that plot; the middle one is at zero, which is where the filtered wave changes direction. In this example, the top and bottom lines are at +21 and -21 periods, which means that anytime the filtered wave exceeds those lines above or below, the trend has lasted at least 21 periods. Notice that, in this example, there was a period at the beginning (highlighted) where the market moved up and down in 5% or greater moves with high frequency, but the moves never lasted long enough to exceed the 21-period boundaries. Then, in the second half of the chart, there were two good moves that did exceed the 21-period boundaries. This is a good example of a chart where there was a trendless market (first half) and a trending market (second half). I used the high-low filtered wave of 5 percent and 21 days for the minimum length because that is what I prefer to use for most trend analysis.


The following research was conducted using the high-low filtered wave with various percentages and various trend-length measures. The research was conducted on a wide variety of market prices, including most domestic indices, most foreign indices, and all of the S&P sectors and industry groups; 109 issues in all. I offer commentary throughout so you can see that this was a robust process. Any index or price series that is missing was probably omitted because of an inadequate amount of data, as you need a few years of data to determine a series' trendiness. The goal of this research was to determine whether markets generally trend and whether some markets trend better than others. Following this large section, the trend analysis will be shown using the S&P GICS data on sectors, industry groups, and industries.

Table 10.1 is the complete list of indices used in this study along with the beginning date of the data.


I did multiple sets of data runs, but will explain the process by showing just one of them. Table 10.2 is the data run through all 109 indices for the 5% filtered wave and 21 days for the trend to be identified. The first column is the name of the index (they are in alphabetical order), while the next four columns are the results of the data runs for the total trend percentage, the uptrend percentage, the downtrend percentage, and the ratio of uptrends to downtrends.

The total reflects the amount of time, relative to all the data available, that the index was in a trend mode as defined by the filtered wave and minimum trend length; in the case below, a trend had to last at least 21 days and involve a move of 5% or greater. The up measure is just the percentage of the uptrend time relative to the amount of data. Similarly, the downtrend is the percentage of the downtrend time relative to the amount of data. If you add the uptrend and downtrend, you will get the total trend.

The last column is the U/D Ratio, which is merely the uptrend percentage divided by the downtrend percentage. If you look at the first entry in Table 10.2, the AMEX Composite trends 71.18 percent of the time, with 56.16% of the time in an uptrend and 15.03% of the time in a downtrend. The U/D Ratio is 3.74, which means the AMEX Composite trends up almost 4 (3.74) times more than it trends down. You can check the amount of data in the indices date table shown earlier to see whether it was adequate for trend analysis. It is not shown, but the complement of the total would give you the amount of time the index was trendless.
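A minimal sketch of how those columns could be derived from filtered-wave legs follows; the leg data is invented, and the real tables average many runs with different filters and trend lengths.

```python
# Each filtered-wave leg: (direction, length_in_days); invented example data.
legs = [("up", 35), ("down", 12), ("up", 50), ("down", 25), ("up", 8), ("down", 30)]

MIN_TREND_DAYS = 21
total_days = sum(length for _, length in legs)

# Only legs that lasted at least the minimum trend length count as trends.
up_days = sum(length for d, length in legs if d == "up" and length >= MIN_TREND_DAYS)
down_days = sum(length for d, length in legs if d == "down" and length >= MIN_TREND_DAYS)

up_pct = 100 * up_days / total_days            # uptrend percentage
down_pct = 100 * down_days / total_days        # downtrend percentage
total_pct = up_pct + down_pct                  # total trend percentage
ud_ratio = up_days / down_days if down_days else float("inf")
trendless_pct = 100 - total_pct                # complement: time spent trendless

print(f"Total {total_pct:.1f}%  Up {up_pct:.1f}%  Down {down_pct:.1f}%  "
      f"U/D {ud_ratio:.2f}  Trendless {trendless_pct:.1f}%")
```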


At the bottom of each table is a grouping of statistical measures for the various columns. Here are the definitions of those statistics:

Mean. In statistics, this is the arithmetic average of the selected cells. In Excel, this is the Average function (go figure). It is a good measure as long as there are no large outliers in the data being analyzed.

Average deviation. This is a function that returns the average of the absolute deviations of data points from their mean. It can be thought of as a measure of the variability of the data.

Median. This function measures central tendency, which is the location of the center of a group of numbers in a statistical distribution. It is the middle number of a group of numbers; that is, half the numbers have values that are greater than the median, and half the numbers have values that are less than the median. For example, the median of 2, 3, 3, 5, 7, and 10 is 4. If there is a wide range of values or there are outliers, the median is a better measure than the mean.

Minimum. Shows the minimum value of the selected cells.

Maximum. Shows the maximum value of the selected cells.

Sigma. Also known as standard deviation. It is a measure of how widely values are dispersed from their mean (average).

Geometric mean. First of all, it is only valid for positive numbers and is typically used to measure growth rates. It will never be larger than the arithmetic mean.

Harmonic mean. The reciprocal of the arithmetic mean of the reciprocals. It is never greater than the geometric mean and, like the geometric mean, can only be calculated on positive numbers; it is generally used for rates and ratios.

Kurtosis. This function characterizes the relative peakedness or flatness of a distribution compared with the normal distribution (bell curve). If the distribution is "tall", then it reflects positive kurtosis, while a relatively flat or short distribution (relative to normal) reflects a negative kurtosis.

Skewness. This characterizes the degree of asymmetry of a distribution about its mean. Positive skewness reflects a distribution with an asymmetric tail extending toward more positive values, while negative skewness reflects a tail extending toward more negative values.

Trimmed mean (20 percent). This is a great function. It is the same as the Mean, but you can select any number or percentage of numbers (sample size) to be eliminated at the extremes. A great way to eliminate the outliers in a data set.
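For readers who want to reproduce these summary rows, the sketch below computes the same statistics with numpy and scipy on an invented column of total-trend percentages (the values are hypothetical).

```python
import numpy as np
from scipy import stats

# Hypothetical column of total-trend percentages from one data run.
values = np.array([71.2, 65.4, 58.9, 80.1, 47.3, 69.8, 74.5, 62.0])

summary = {
    "mean": np.mean(values),
    "average deviation": np.mean(np.abs(values - np.mean(values))),
    "median": np.median(values),
    "minimum": values.min(),
    "maximum": values.max(),
    "sigma (std dev)": np.std(values, ddof=1),
    "geometric mean": stats.gmean(values),        # positive values only
    "harmonic mean": stats.hmean(values),         # positive values only
    "kurtosis": stats.kurtosis(values),           # excess kurtosis vs. normal
    "skewness": stats.skew(values),
    "trimmed mean (20%)": stats.trim_mean(values, 0.2),  # cuts 20% from each tail
}
for name, val in summary.items():
    print(f"{name:>20}: {val:.3f}")
```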

Trendiness Determination Method One

This methodology for trend determination looks at the average of multiple sets of raw data. An example of just one set of the data was shown previously in Table 10.2, which looks at a filtered wave of 5% and a minimum trend length of 21 days. Following Table 10.3 is an explanation of the column headers for Trendiness One in the analysis tables that follow.


Trendiness average. This is the simple average of all the total trending expressed as a percentage. The components that make up this average are the total trendiness of all the raw data tables, in which the total average is the average of the uptrends and downtrends as a percentage of the total data in the series.

Rank. This is just a numerical ranking of the trendiness average, with the largest total average equal to a rank of 1.

Avg. U/D. This is the average of all the raw data tables' ratio of uptrends to downtrends. Note: If the value of the Avg. U/D is equal to 1, it means that the uptrends and downtrends were equal. If it is less than 1, then there were more downtrends.

Uptrendiness WtdAvg. This is the product of column Trendiness Average and column Avg. U/D. Here the Total Trendiness (sum of up and down) is multiplied by their ratio, which gives a weighted portion to the upside when the ratio is high. If the average of the total trendiness is high and the uptrendiness is considerably larger than the downtrendiness, then this value (WtdAvg) will be high.

Rank. This is a numerical ranking of the Up Trendiness WtdAvg, with the largest value equal to a rank of 1.
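A small sketch of how the Trendiness One columns and ranks could be assembled; the index names and values below are invented for illustration, with each row representing averages across all raw-data runs for one index.

```python
import pandas as pd

# One row per index: averages across all raw-data runs (values invented).
df = pd.DataFrame({
    "index": ["Index A", "Index B", "Index C"],
    "trendiness_avg": [71.2, 63.5, 68.0],   # average total trend %
    "avg_ud": [3.7, 1.2, 2.4],              # average uptrend/downtrend ratio
}).set_index("index")

# Rank 1 = largest trendiness average.
df["rank"] = df["trendiness_avg"].rank(ascending=False).astype(int)

# Uptrendiness WtdAvg: total trendiness weighted by the U/D ratio.
df["uptrendiness_wtd_avg"] = df["trendiness_avg"] * df["avg_ud"]
df["up_rank"] = df["uptrendiness_wtd_avg"].rank(ascending=False).astype(int)

print(df)
```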

Table 10.4 shows the complete results using Trendiness One methodology.


Trendiness Determination Method Two

The second method of trend determination uses the raw data averages. For example, the up value is calculated by comparing the raw data up average to the raw data total average, which means it uses only the portion of the data that is trending, not the full data set of the series. This way, the results deal only with the trending portion of the index; and if you think about it, when the minimum trend length is high and the filtered wave is low, there might not be that much trending. Table 10.5 shows the column headers followed by their definitions.


Up. This is the average of the raw data Up Trends as a percentage of the Total Trends.

Down. This is the average of the raw data Down Trends as a percentage of the Total Trends.

Up rank. This is the numerical ranking of the Up column, with the largest value equal to a rank of 1.

Table 10.6 shows the results using Trendiness Two methodology.


Comparison of the Two Trendiness Methods

Figure 10.6 compares the rankings using both "Trendiness" methods. Keep in mind we are only using uptrends, downtrends, and a derivative of them, which is up over down ratio. The plot below is informally called a scatter plot and deals with the relationships between two sets of paired data.

The equation of the regression line is familiar from high school algebra and follows the expression y = mx + b, where m is the slope and b is the y-intercept (where the line crosses the y axis); x is known as the independent or predictor variable and y is the dependent or response variable. The expression that defines the regression (linear least squares) shows that the slope of the line (m) is 0.8904. The line crosses the y (vertical) axis at 6.027, which is b. R^2, also known as the coefficient of determination, is 0.7928. From R^2, we can easily see that the correlation R is 0.8904 (the square root of R^2). We know this is a highly positive correlation because we can visually verify it from the orientation of the slope. We can interpret b as the value of y when x is zero, and m as the amount that y increases when x increases by one. From all of this, one can determine the amount that one variable influences the other.

Sorry, I beat this to death; you can probably find simpler explanations in a high school statistics textbook.
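A quick sketch of how those regression statistics can be computed from two rank series; the paired values below are invented, not the book's data.

```python
import numpy as np

# Paired rankings from the two trendiness methods (invented values).
x = np.array([1, 5, 12, 20, 33, 47, 58, 72, 90, 105], dtype=float)
y = np.array([4, 9, 10, 25, 30, 50, 55, 80, 85, 100], dtype=float)

m, b = np.polyfit(x, y, 1)       # least-squares slope and intercept of y = m*x + b
r = np.corrcoef(x, y)[0, 1]      # correlation coefficient R
r_squared = r ** 2               # coefficient of determination R^2

print(f"y = {m:.4f}x + {b:.4f},  R = {r:.4f},  R^2 = {r_squared:.4f}")
```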


Trendless Analysis

 This is a rather simple but complementary (intentional spelling) method that helps to validate the other two processes. This method focuses on the lack of a trend, or the amount of trendless time that is in the data. The first two methods focused on trending, and this one is focused on nontrending, all using the same raw data. Determining markets that do not trend will serve two purposes. One is to not use conventional trend-following techniques on them, and the other is that it can be good for mean reversion analysis. Table 10.7 shows the column headers; the definitions follow.


Up. This is the Total Trend average from Trendiness One multiplied by the Up Total from Trendiness Two.

Down. This is the Total Trend average from Trendiness One multiplied by the Down Total from Trendiness Two.

Trendless. This is the complement of the sum of the Up and Down values (1 – (Up + Down)).

Rank. This is the numerical rank of the Trendless column with the largest value equal to a rank of 1.
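A minimal sketch of the Trendless calculation for one index, using invented Trendiness One and Trendiness Two values as inputs.

```python
# Inputs per index (invented): Trendiness One total-trend average, and the
# Trendiness Two up/down shares of the trending time.
total_trend = 0.712     # 71.2% of all data was in a trend (Trendiness One)
up_share = 0.79         # 79% of trending time was uptrend (Trendiness Two)
down_share = 0.21       # 21% of trending time was downtrend (Trendiness Two)

up = total_trend * up_share          # Up column
down = total_trend * down_share      # Down column
trendless = 1.0 - (up + down)        # complement of total trending time

print(f"Up {up:.1%}  Down {down:.1%}  Trendless {trendless:.1%}")
```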

Table 10.8 shows the results using the Trendless methodology.


Comparison of Trendiness One Rank and Trendless Rank

Although I think this was quite obvious, Figure 10.7 shows that the analysis math is consistent and acceptable. These two series should essentially be inversely correlated, and they are, with a coefficient of determination equal to one.


The following tables take the data from the full 109 indices and subdivide it into sectors, international, domestic, and time frames to ensure there is robustness across a variety of data. Many indices appear in most of these tables, but keeping that data for comparison with others that are not so widely diversified enhances the research.

These tables show all three trend method results. This first table consists of all the index data. The remaining ones contain subsets of the All table, such as Domestic, International, Commodities, Sectors, Data > 2000, Data > 1990, and Data > 1980. The reason for the data subsets is to ensure there is a robust analysis in place across various lengths of data, which means multiple bull-and-bear cyclical markets are considered in addition to secular markets. The Data > 2000 means that the data starts sometime prior to 2000 and therefore totally contains the secular bear market that began in 2000.

All Trendiness Analysis

Table 10.9 contains data from all of the 109 indices in the analysis. The first column contains letters identifying the subcategory for each issue as follows:

I – International

C – Commodity

Blank – Domestic


Trend Table Selective Analysis

In this section, I will demonstrate more details on selected issues from Table 10.9 to show how the data can be utilized.

Using the Trendiness One Rank, you can see that the U.S. Dollar Index is number one. You can also see it is the worst for being Trendless (last column), which one would expect. However, if you look at the Trendiness One and Trendiness Two Up Ranks, you see that it did not rank well. This can only be interpreted to mean that the U.S. Dollar Index is a good downtrending issue, but not a good uptrending one, based on this relative analysis of 109 various indices. This is made clear by the long trendline drawn from the first data point to the last, which is clearly in a downtrend.

Figure 10.8 shows the U.S. Dollar Index with a 5% filtered wave overlaid on it. The lower plot shows the filtered wave of 5% measuring the number of days during each up and down move. The two horizontal lines are at +21 and -21, which means that movements inside that band are not counted in the trendiness or trendless calculations. The only difference between what this chart shows and what the table data measures is the fact that the table is averaging a number of different filtered waves and trend lengths.


Let's now look at the worst trendiness index and see what we can find out about it (Table 10.9). The Trendiness One rank and the Trendless rank confirm that this is not a good trending index. Furthermore, the Up Trendiness ranks show that it ranks low (109 and 81) in Trendiness One, which measures trendiness based on all the data, while its rank in Trendiness Two is high (4). Remember that Trendiness Two looks only at the trending data, not all of the data. Therefore, you can say that when this index is in a trending mode it tends to trend up well; the problem is that it isn't in a trending mode often (see Table 10.11).


Figure 10.9 shows the Turkey ISE National-100 index in the same format as the earlier analysis. Notice that it is generally in an uptrend based on the long-term trend line. From the bottom plot, you can see that there is very little movement of trends outside of the +21 and -21 day bands. The bottom line is that this index doesn't trend well and is quite volatile in its price movements; if you are a trend follower, don't waste your time with this one. A question that might arise: it is also clear from the top plot that the index is in a long-term uptrend, so wouldn't a larger filtered wave and/or a different trend length yield different results? My response is simply: of course it would; you can fit the analysis to get any results you want, especially with all this wonderful hindsight. That is a bad approach to successful trend following.


Using the same data table, let's look at an index that ranks high in the uptrend rankings (Table 10.9). From the table, it ranks roughly middle of the road based on the Trendiness One and Trendless ranks. However, the Up Trendiness One rank and the Trendiness Two Up rank are both high (both are 5). This means that most of the trendiness is to the upside, with only moderate downtrends (see Table 10.12).


Figure 10.10 shows the Norway Oslo Index clearly in an uptrend. The bottom plot shows that most of the spikes of trend length are above the +21 band level and very few are below the -21 band level. This confirms the data in the table.


In order to carry this analysis to fruition, let's look at the index with the worst uptrend rank (Table 10.9). From the table, the Trendiness One and Two Up ranks are dead last (109). The Trendiness One overall rank is 104, which is almost last, and the trendless rank is 6, which confirms that data (see Table 10.13).


Figure 10.11 shows that the Hanoi SE Index is clearly in a downtrend; however, the bottom plot shows that very few trends are outside the bands. And the ones that move well outside the bands are the downtrends. As before, one can change the analysis and get desired results, but that is not how it should be done. One note, however, is that this index does not have a great deal of data compared to most of the others and this should be a consideration in the overall analysis.


Thanks for reading this far. I intend to publish one article in this series every week. Can't wait? The book is for sale  here .
