Data Quality Case Studies: How We Saved Clients Real Money Thanks to Data Validation

MichalFracek

  • April 7, 2019 at 12:21 am

Machine learning models grow more powerful every week, but the earliest models and the most recent state-of-the-art models share the exact same dependency: data quality. The maxim “garbage in, garbage out,” coined decades ago, continues to apply today. Recent examples of data verification shortcomings abound, including JP Morgan/Chase’s 2013 fiasco and this lovely list of Excel snafus. Brilliant people make data collection and entry errors all of the time, and that isn’t just our opinion (although we have plenty of personal experience with it): Kaggle did a survey of data scientists and found that “dirty data” is the number one barrier for data scientists.

Before we create a machine learning model, before we create a Shiny R dashboard, we evaluate the dataset for the project. Data validation is a complicated, multi-step process, and maybe it’s not as sexy as talking about the latest ML models, but as the data science consultants of Appsilon we live and breathe data governance and offer solutions. And it is not only about data format: data can be corrupted at different levels of abstraction. We distinguish three levels:

  • Data structure and format
  • Qualitative & business logic rules
  • Expert logic rules

Level One: structure and format

For every project, we must verify:

  • Is the data structure consistent? A given dataset should have the same structure all of the time, because the ML model or app expects the same format. The names of columns/fields, the number of columns/fields, and the field data types (integers? strings?) must remain consistent.
  • Are we working with multiple datasets, or a single merged one?
  • Do we have duplicate entries? Do they make sense in this context, or should they be removed?
  • Do we have correct, consistent data types (e.g. integers, floating point numbers, strings) in all entries?
  • Do we have a consistent format for floating point numbers? Are we using a comma or a period as the decimal separator?
  • What is the format of other data types, such as e-mail addresses, dates, zip codes, and country codes, and is it consistent?

It sounds obvious, but there are always problems, and these checks must be performed every time. The right questions must be asked.
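
To make this concrete, here is a minimal sketch of level-one checks written with the assertr R package (the package our validation workflow is built on, described later in this post). The orders data frame and its columns (order_id, price, email) are hypothetical, used purely for illustration:

```r
library(dplyr)
library(assertr)

orders %>%
  chain_start() %>%
  # structure: the expected columns are present
  verify(has_all_names("order_id", "price", "email")) %>%
  # format: consistent data types
  verify(is.numeric(price)) %>%
  verify(is.character(email)) %>%
  # format: a rough e-mail pattern check
  verify(grepl("^[^@]+@[^@]+\\.[^@]+$", email)) %>%
  # duplicates: every order appears exactly once
  assert(is_uniq, order_id) %>%
  chain_end()
```

If any rule fails, the chained report lists every violation instead of silently passing bad data downstream.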

Level Two: qualitative and business logic rules

We must check the following every time (a short code sketch follows this list):

  • Is the price parameter (if applicable) always non-negative? (This rule stopped several of our retail customers from recommending the wrong discounts; they saved significant sums and prevented serious problems… more on that later.)
  • Do we have any unrealistic values?  For data related to humans, is age always a realistic number?
  • For data related to machines, does the status parameter always take a correct value from a defined set, e.g. only “FINISHED” or “RUNNING” for a machine status?
  • Can we have “Not Applicable” (NA), null, or empty values? What do they mean?
  • Do we have several values that mean the same thing? For example, users might enter their residence in different ways — “NEW YORK”, “Nowy Jork”, “NY, NY” or just “NY” for a city parameter. Should we standardize them?
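
The rules above translate directly into assertr predicates. A minimal sketch, assuming a hypothetical records data frame with price, age, status, and city columns; the thresholds and allowed values are examples rather than universal rules:

```r
library(dplyr)
library(assertr)

records %>%
  chain_start() %>%
  # prices must be present and non-negative
  assert(not_na, price) %>%
  assert(within_bounds(0, Inf), price) %>%
  # ages must be realistic for humans
  assert(within_bounds(0, 120), age) %>%
  # a machine status must come from a defined set
  assert(in_set("FINISHED", "RUNNING"), status) %>%
  # city names should already be standardized
  assert(in_set("NEW YORK", "BOSTON", "CHICAGO"), city) %>%
  chain_end()
```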

Level Three: expert rules

Expert rules govern something different from format and values: they check whether the story behind the data makes sense. This requires business knowledge about the data, and it is the data scientist’s responsibility to be curious, to explore, and to challenge the client with the right questions so that logical problems with the data are avoided.

Expert Rules Case Studies 

I’ll illustrate with a couple of true stories.

Story #1: Is this machine teleporting itself?

We were tasked with analyzing the history of a company’s machines. The question was how much time each machine worked at a given location. We had the following entries in our database:

We can see that the format and values are correct. But why did machine #1234 change its location every day? Is that even possible? We raised the question with our client and found that it was not physically possible for the machine to switch sites so often. After some investigation, we discovered that the software installed on the machine had a duplicated ID number: there were in fact two machines on different sites with the same ID. Once we learned what was physically possible, we added data validation rules for it and ensured that the issue would not happen again.

Expert rules can be developed only through close cooperation between data scientists and the business. This is not a step that can be automated by “data cleaning tools,” which are great for hobbyists but are not suitable for anything remotely serious.

Story #2: A negative sign could have changed all prices in the store

One of our retail clients was already far along in their project journey when we began working with them. They had a data scientist on staff and had developed their own price optimization models. Our role was to take the output from those models and display recommendations in an R Shiny dashboard to be used by their salespeople. We had some assumptions about the format of the data that the application would receive from those models, so we wrote validation rules describing what the application should expect when it reads the data.

We reasoned that the price should be:

  • non-negative
  • an integer
  • not an empty value or a string
  • within a reasonable range for the given product

As the model was being developed over the course of several weeks, we suddenly observed that prices were coming back far too high. The check happened automatically: we did not spot this in production, we spotted the problem before the data even landed in the application. After we saw the result in the validation report, we asked their team why it happened. It turned out that a new developer had assumed that discounts could be expressed as negative numbers, because why not? He didn’t realize that some applications depended on that output and assumed that they would subtract the value instead of adding it. Thanks to the automatic data validation, we prevented those errors from being loaded into production. We then worked with their data scientists to improve the model. It was a quick, no-brainer fix, but the end result was that they saved real money.

Data Validation Report for all stakeholders

Here is a sample data validation report that our workflow produces for all stakeholders in the project:

Data Verification Report

The intent is that the data verification report is readable by all stakeholders, not just data scientists and software engineers. After years of working on data science projects, we have observed that multiple people within an organization know the realistic parameters for data values, such as price points. There is usually more than one expert in a community, and different people are knowledgeable about different things. New data is often added at a constant rate, and parameters can change. So why not allow multiple people to add and edit rules when verifying data? With our Data Verification workflow, anyone on the team of stakeholders can add or edit a data verification rule.

Our Data Verification workflow is built on the assertr package (for the R enthusiasts out there). The workflow runs validation rules automatically, after every update to the data. This is exactly the same process as writing unit tests for software: like unit testing, our data verification workflow allows you to identify problems and catch them early, and fixing problems at an earlier stage is much more cost effective.
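
To push the unit-testing analogy further, the rule set can be wrapped in a function and executed by a test runner on every data refresh. The file name and the validate_prices() wrapper below are hypothetical:

```r
library(testthat)

test_that("the latest price extract passes all validation rules", {
  prices <- read.csv("prices_latest.csv")  # assumed location of the refreshed data
  # validate_prices() is assumed to wrap the assertr rule chain and
  # throw an error whenever any rule is violated
  expect_no_error(validate_prices(prices))
})
```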

Finally, what do validation rules look like at the code level in practice? We can’t show the code created for clients, so here is an example using data from the City of Warsaw public transportation system (requested from a public API). Let’s say that we want a real-time check on the location and status of all the vehicles in the transit system fleet.

In this example, we want to ensure that the Warsaw buses and trams are operating within the borders of the city, so we check the latitude and longitude. If a vehicle is outside the city limits, we certainly want to know about it! We want real-time updates, so we write a rule that “data is not older than 5 minutes.” In a real project, we would write hundreds of such rules in partnership with the client. Again, we typically run this workflow BEFORE we build a model or a software solution for the client, but as the examples above show, there is tremendous value in running the Data Validation Workflow even late in the production process. One of our clients remarked that they saved more money with the Data Validation Workflow than with some of the machine learning models that had previously been built for them.
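
Since we cannot show the original code, here is a minimal sketch of what such rules can look like in assertr. The vehicles data frame, its column names, and the bounding box for Warsaw are illustrative assumptions:

```r
library(dplyr)
library(assertr)

vehicles %>%
  chain_start() %>%
  # expert rule: vehicles must stay within the (approximate) city limits
  assert(within_bounds(52.09, 52.37), lat) %>%
  assert(within_bounds(20.85, 21.28), lon) %>%
  # expert rule: positions must be fresh (no older than 5 minutes)
  verify(difftime(Sys.time(), timestamp, units = "mins") <= 5) %>%
  # business rule: status must come from a defined set (values assumed)
  assert(in_set("RUNNING", "STOPPED"), status) %>%
  chain_end()
```

In a real engagement, each of these rules would be agreed with the client, and the list would grow into the hundreds mentioned above.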

Sharing our data validation workflow with the community

Data quality must be verified in every project to produce the best results. There are a number of potential errors that seem obvious and simplistic but that, in our experience, tend to occur often.

After working on numerous projects with Fortune 500 companies, we came up with a solution to the three-level problem described above. Since multiple people within an organization know the realistic parameters for datasets, such as price points, why not allow multiple people to add and edit rules when verifying data? We recently shared our workflow at a hackathon sponsored by the Ministry of Digitization here in Poland. We took third place in the competition, but more importantly, sharing the workflow reflects one of the core values of our company: to share our best practices with the data science community.

Pawel and Krystian accept an award at the Ministry of Digital Affairs Hackathon.

I hope that you can put these takeaways in your toolbox:

  • Validate your data early and often, covering all assumptions.
  • Engage a data science professional early in the process  
  • Leverage the expertise of your workforce in data governance strategy
  • Data quality issues are extremely common

In the midst of designing new products, manufacturing, marketing, sales planning and execution, and the thousands of other activities that go into operating a successful business, companies sometimes forget about data dependencies and how small errors can have a significant impact on profit margins.  

We draw out your expertise about your organization or business by asking the right questions, and then we teach the workflow to check for it constantly. We take your expertise and leverage it repeatedly.

data validation infographic

You can find me on Twitter at  @pawel_appsilon .

Originally posted on Data Science Blog .

Maintaining Data Quality from Multiple Sources Case Study


There is a wealth of data within the healthcare industry that can be used to drive innovation, direct care, change the way systems function, and create solutions to improve patient outcomes. But with all this information coming in from multiple unique sources that all have their own ways of doing things, ensuring data quality is more important than ever.

The COVID-19 pandemic highlighted the breakthroughs in data sharing and interoperability made in the past few years. However, that does not mean that there aren’t challenges when it comes to data quality.

“As we have seen, many organizations have created so many amazing solutions around data,” said Mujeeb Basit, MD, associate chief medical informatics officer and associate director of the Clinical Informatics Center, University of Texas Southwestern Medical Center. “COVID really highlighted the innovations and what you can do with sophisticated data architectures and how that flow of data really helps us understand what's happening in our communities. Data has become even more important.”

Dr. Basit shared some of his organization’s experiences in creating strategies to improve data quality while making the process as seamless as possible for all stakeholders.

The medical center had four groups working together on solution co-development, including quality, clinical operations, information resources and analytics.

“It is the synergy of working together and aligning our goals that really helps us develop singular data pipelines as well as workflows and outcomes that we're all vested in,” Dr. Basit said.

Finding Errors

One of the problems the organization previously faced was that errors would slowly accumulate in their systems because of complicated processes or frequent updates. When an error was found, Dr. Basit noted, it was usually fixed as a single entity, and sometimes a backlog was fixed.

“But what happens is, over time, this error rate redevelops. How do we take this knowledge gained in this reported error event and then make that a sustainable solution long term? And this becomes exceedingly hard because that relationship may be across multiple systems,” Dr. Basit said.

He shared an example of how this had happened while adding procedures into their system that become charges, which then get translated into claim files.

“But if that charge isn't appropriately flagged, we actually don't get that,” Dr. Basit said. “This is missing a rate and missing a charge, and therefore, we will not get revenue associated with it. So, we need to make sure that this flag is appropriately set and this code is appropriately captured.”

His team created a workaround for this data quality issue, using a user story in their development environment to fix the error, but this is just a band-aid solution to the problem.

“As additional analysts are hired, they may not know this requirement, and errors can reoccur. So how do you solve this globally and sustain that solution over time? And for us, the outcome is significantly lost work, lost reimbursement, as well as denials, and this is just unnecessary work that is creating a downstream problem for us,” Dr. Basit said.

Their solution? Apply analysis at regular intervals to keep error rates low. 

“This is not sustainable by applying people to it, but it is by applying technology to it. We approach it as an early detection problem. No repeat failures, automate it so we don't have to apply additional resources for it, and therefore, it scales very, very well, as well as reduced time to resolution, and it is a trackable solution for us,” Dr. Basit said.

To accomplish this, they utilized a framework for integrated tests (FIT) and built a SQL Server solution that runs intermittently to look for new errors. When one is found, a message is sent to an analyst to determine a solution.
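
The details of that implementation are not public; purely as an illustration of the idea (a scheduled query that looks for newly introduced errors and alerts an analyst), a sketch in R using DBI might look like the following, with hypothetical connection details, table, and column names:

```r
library(DBI)

con <- dbConnect(odbc::odbc(), dsn = "billing_dw")  # hypothetical data source

# Look for procedure charges that are missing a rate or a billing flag,
# the kind of error that would otherwise silently reduce revenue.
errors <- dbGetQuery(con, "
  SELECT charge_id, procedure_code
  FROM charges
  WHERE rate IS NULL OR billing_flag IS NULL
")

if (nrow(errors) > 0) {
  # In a real deployment this would open a ticket or notify an analyst.
  message(nrow(errors), " charge records failed the automated check")
}

dbDisconnect(con)
```

Run on a schedule, a check like this catches regressions before they accumulate.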

“We have two types of automated testing. You have reactive where someone identifies the problem and puts in the error for a solution, and we have preventative,” Dr. Basit said.

This solution is saving them time and money, something the leadership within the University of Texas Southwestern Medical Center has taken notice of. They are now requesting FIT tests to ensure errors do not reoccur.

“This has now become a part of their vocabulary as we have a culture of data-driven approaches and quality,” Dr. Basit said.

Applying the Data Efficiently

Another challenge they faced was streamlining different types of information coming in through places like the patient portal and EHR while maintaining data quality.

“You can't guarantee 100% consistency in a real-time capture system. They would require a lot of guardrails in order to do that, and the clinicians will probably get enormously frustrated,” Dr. Basit said. “So we go for reasonable accuracy of the data. And then we leverage our existing technologies to drive this.”

He used an example from his organization of a rheumatology assessment designed to capture the day-to-day life of someone with the condition. They use a patient questionnaire to create a scoring system, and providers also conduct an assessment.

“Those two data elements get linked together during the visit so that we can then get greater insight on it. From that, we're able to use alerting mechanisms to drive greater responsiveness to the patient,” Dr. Basit said.

Applying this data quality technology at scale was a challenge, but Dr. Basit and his colleagues utilized the Agile methodology to help.

“We didn't have sufficient staff to complete our backlog. What would happen is somebody would propose a problem, and by the time we finally got to solve it, they'd not be interested anymore, or that faculty member has left, or that problem is no longer an issue, and we have failed our population,” Dr. Basit said. “So for us, success is really how quickly can we get that solution implemented, and how many people will actually use it, and how many patients will it actually benefit. And this is a pretty large goal.”

 The Agile methodology focused on:

  • Consistency
  • Minimizing documentation
  • Incremental work products that can be used as a single entity

They began backlog sprint planning, doing two-week sprints at a time.

“We want to be able to demonstrate that we're able to drive value and correct those problems that we talked about earlier in a very rapid framework. The key to that is really this user story, the lightweight requirement gathering to improve our workflow,” Dr. Basit said.  “So you really want to focus as a somebody, and put yourself in the role of the user who's having this problem.”

An example of this would be a rheumatologist wanting to know if their patient is not on a disease-modifying anti-rheumatic drug (DMARD) so that their patient can receive optimal therapy for their rheumatoid arthritis.

“This is really great for us, and what we do is we take this user story and we digest it. And especially the key part here is everything that comes out for the ‘so that,’ and that really tells us what our success measures are for this project. This should only take an hour or two, but it tells so much information about what we want to do,” Dr. Basit said.

Acceptance criteria they look for include:

  • Independent
  • Estimatable

“And we try to really stick to this, and that has driven us to success in terms of leveraging our data quality and improving our overall workflow as much as possible,” Dr. Basit said.

With the rheumatology project, they were able to show that increased compliance with DMARDs was accompanied by an increase in low-acuity disease and a decrease in high-acuity disease.

“That's what we really want to go for. These are small changes but could be quite significant to those people's lives who it impacted,” Dr. Basit said.

In the end, the systems he and his team have created are high-value solutions that clinicians and executives at their medical center use often.

“And over time we have built a culture where data comes first. People always ask, ‘What does the data say?’ Instead of sitting and wasting time on speculating on that solution,” Dr. Basit said.

The views and opinions expressed in this content or by commenters are those of the author and do not necessarily reflect the official policy or position of HIMSS or its affiliates.


Advances in deep learning using artificial neural networks (ANN) 1 , 2 have resulted in the increased use of AI for healthcare applications 3 , 4 , 5 . One of the most successful examples of deep learning has been the application of convolutional neural network (CNN) algorithms for medical image analysis to support clinical assessment 6 . AI models are trained with labeled or annotated data (medical images) and learn complex features of the images that relate to a clinical outcome, which can then be applied to classify new unseen medical images. Applications of this technology in healthcare span a wide range of domains including but not limited to dermatology 7 , 8 , radiology 9 , 10 , ophthalmology 11 , 12 , 13 , pathology 14 , 15 , 16 , and embryo quality assessment in IVF 17 .

Despite the enormous potential of AI to improve healthcare outcomes, AI performance can often be sub-optimal as it is crucially dependent on the quality of the data. While AI practitioners often focus on the quantity of data as the driver of performance, even fractional amounts of poor-quality data can substantially hamper AI performance. Good-quality data is therefore needed to train models that are both accurate and generalizable, and which can be relied upon by clinics and patients globally. Furthermore, because measuring AI performance on poor-quality test data can mislead or obfuscate the true performance of the AI, good-quality data is also important for benchmark test sets used in performance reporting, which clinics and patients rely on for clinical decisioning.

We define two types of poor-quality data:

Incorrect data : Mislabeled data, for example an image of a dog incorrectly labeled as a cat. This also includes adversarial attacks by intentionally inserting errors in data labels (especially detrimental to online machine learning methods 18 ).

Noisy data : Data itself is of poor quality (e.g. out-of-focus image), making it ambiguous or uninformative, with insufficient information or distinguishing features to correlate with any label.

In healthcare, clinical data can be inherently poor quality due to subjectivity and clinical uncertainty. An example of this is pneumonia detection from chest X-ray images. The labeling of a portion of the image can be somewhat subjective in terms of clinical assessment, often without a known ground truth outcome, and is highly dependent on the quality of the X-ray image taken. In some cases, the ground truth outcome might also involve clinical data that is not present in the dataset used for analysis, such as undiagnosed conditions in a patient, or effects that cannot be seen from images and records provided for the assessment. This kind of clinical uncertainty can contribute to both the incorrect and noisy data categories. Therefore, poor-quality data cannot always be reliably detected, even by clinical experts. Furthermore, due to data privacy, manual visual verification of private patient data is not always possible.

Several methods exist to account for these sources of reader variability and bias. One method 19 uses a so-called Direct Uncertainty Prediction to provide an unbiased estimate of label uncertainty for medical images, which can be used to draw attention to images requiring a second medical opinion. This technique relies on training a model to identify cases with high potential for expert disagreement. Other methods model uncertainty in poor-quality data through Bayesian techniques 20 . However, such techniques require significant amounts of annotated data from multiple experts, which is often not readily available. Some methods assume that erroneous label distribution is conditionally independent of the data instance given the true label 21 , which is an assumption that does not hold true in the settings considered in this article. Other techniques 22 relax this assumption by using domain-adapted generative models to explain the process that generates poor-quality data, though these techniques typically require additional clean data to generate good priors for learning. This is an issue in medical domains such as the assessment of embryo viability, where reader variability is significant 17 and ground truth labels may be impossible to ascertain, so there is no way of identifying data as poor quality a priori. There is a need for better approaches to cleanse poor-quality data automatically and effectively, in and beyond healthcare.

In this paper, a novel technique is presented for automated data cleansing which can identify poor data quality without requiring a cleansed dataset with known ground truth labels. The technique is called Untrainable Data Cleansing (UDC) and is described in the Methods section. UDC essentially identifies and removes a subset of the data (i.e. cleanses the data) that AI models are unable to correctly label (classify) during the AI training process. From a machine learning perspective, the two types of poor-quality data described above are realized through: (1) identifying data that are highly correlated to the opposite label of what would reasonably be expected, based on the classification of most of the data in the dataset (incorrect data); or (2) identifying data that have no distinguishing features that correlate with any label (noisy data). Results show that UDC can consistently and accurately identify poor-quality data, and that removal of UDC-identified poor-quality data, and thus “cleansing” of the data, ultimately leads to higher performing and more reliable AI models for healthcare applications.

Validation of UDC

UDC was first validated using two types of datasets, cats and dogs for binary classification, and vehicles for multi-classification. These datasets were used because the ground truth can be manually confirmed, and therefore incorrect labels could be synthetically injected.

Binary classification using cats and dogs

A benchmark (Kaggle) dataset of 37,500 cat and dog images was used to validate UDC for binary classification. This dataset was chosen because the ground truth outcomes (labels) could be manually determined with certainty, and synthetic incorrect labels could be readily introduced by flipping the correct label to an incorrect label. Synthetic incorrect labels were added to this dataset to test UDC under various amounts of poor-quality data.

A total of 24,916 images (12,453 cats, 12,463 dogs) were used for training and 12,349 images (6143 cats, 6206 dogs) used as a separate blind test set. Synthetic errors (incorrect labels) were added to the training dataset (but not the test set), which was split 80/20 into training and validation sets. A single round of UDC was applied to the training dataset, poor-quality data identified by UDC was removed, and a new AI model was trained on the UDC-cleansed dataset. The highest balanced AI accuracy achievable on the blind test dataset was reported. Results are shown in Table 1 .
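
The error-injection step can be sketched as follows; the train data frame and its label column are illustrative assumptions rather than the original implementation:

```r
# Flip a given fraction of labels in each class to a randomly chosen other class.
flip_labels <- function(labels, rates, seed = 42) {
  set.seed(seed)
  classes  <- names(rates)
  original <- labels
  for (cls in classes) {
    idx    <- which(original == cls)            # index against the original labels
    n_flip <- round(length(idx) * rates[[cls]])
    if (n_flip > 0) {
      flip_idx <- sample(idx, n_flip)
      others   <- setdiff(classes, cls)
      labels[flip_idx] <- sample(others, n_flip, replace = TRUE)
    }
  }
  labels
}

# Example: the asymmetric (35%, 5%) study for cats and dogs.
train$label <- flip_labels(train$label, rates = c(cat = 0.35, dog = 0.05))
```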

Results show that UDC is resilient to even extreme levels of errors, functioning in datasets with up to 50% incorrect labels in one class (while the other class remained relatively clean), and with significant symmetric errors of up to 30% incorrect labels in both classes. Visual assessment verified removal of both incorrect data and a minor proportion of noisy data, where, for example, dogs looked like cats (see Supplementary Figure S1 online). Improvements of greater than 20% were achieved in all cases. Compared to cases with uniform levels of incorrect labels, those with asymmetric incorrect labels, e.g. (35%, 5%), achieved a higher balanced accuracy of over 98% after just one round of UDC. This is expected, since in the asymmetric cases, one class remains as a true correct class, allowing UDC to be more confident in identifying incorrectly labeled samples.

In the uniform cases a slightly lower balanced accuracy of 94.7% was achieved after one round of UDC, which was found to identify and remove 88% of the intentionally mislabeled images. A second round of UDC improved upon the results of a single application, successively increasing the predictive power of the dataset by removing the remaining incorrect data. The accuracy achieved after a second round of UDC on the symmetric case (30%, 30%) was 99.7%, an improvement even over the baseline accuracy (99.2%) obtained on datasets with 0% synthetic error. Further tests would be required to confirm the statistical significance of this uplift, but it is plausible that UDC filtered out noisy data present in the original clean dataset (since the baseline set itself is not guaranteed to be free of poor-quality data), helping models not only recover but surpass the accuracy of those trained on the baseline datasets.

For the symmetrical (50%, 50%) case (not shown), UDC simply learns to treat one entire class as incorrect and the other as correct, thereby discarding all samples from the opposite class as data errors. Therefore, as might be expected, UDC fails when data error levels in both classes approach 50%, because there is no longer a majority of good-quality data to allow UDC to confidently identify the minority of poor-quality data. In this case, the dataset is deemed unsuitable for AI training. To screen out datasets that are so noisy as to have virtually no learnable features, an important step prior to implementing a UDC workflow is to conduct a hyperparameter search to determine parameter spaces in which predictive power can be achieved on a given dataset. The hyperparameter search is implemented by selecting a range of architectures, learning rates, momentum values and optimizers, and measuring the accuracy of each model on a validation set at each epoch while training, over a range of 200–300 epochs. Hyperparameter combinations are eligible for inclusion in the UDC workflow if their associated training runs achieve a high mean accuracy across the training epochs, and are screened out if they cannot achieve a statistically significant accuracy above 50%.
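
As a concrete illustration of that screening criterion for a balanced binary problem, one could test whether a candidate configuration's validation accuracy is statistically above chance; the counts below are made up:

```r
# Does this hyperparameter combination beat 50% accuracy by more than chance?
passes_screen <- function(n_correct, n_total, alpha = 0.05) {
  binom.test(n_correct, n_total, p = 0.5, alternative = "greater")$p.value < alpha
}

passes_screen(560, 1000)  # TRUE: clearly above chance, keep this configuration
passes_screen(510, 1000)  # FALSE: indistinguishable from chance, screen it out
```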

Multi-classification using vehicles

An open image database of 27,363 vehicles was used to validate UDC for multi-classification. The dataset comprised four classes of vehicles: airplanes, boats, motorcycles, and trucks. A total of 18,991 images (7244 airplanes, 5018 boats, 3107 motorcycles, 3622 trucks) were used for training, and 8372 images (3101 airplanes, 2194 boats, 1424 motorcycles, 1653 trucks) used as a separate blind test set. As in the previous section, this dataset was chosen because the ground truth outcomes (labels) could be manually ascertained, and synthetic incorrect labels could be readily introduced. Synthetic incorrect labels were uniformly added to each class in the training dataset in increments of 10% to test UDC under various amounts of poor-quality data. UDC results are shown in Supplementary Figure S2 online. This figure shows, for a given data error rate (0–90%), the number of images that are correctly predicted by x constituent UDC models, where the x -axis ranges from 0 to the total number of models used in the UDC workflow. Each UDC histogram is thus quantized by the total number of models, and each bin shows a different total number of images, depending on the choice of data error rate.

Results are summarized in Table 2 , which shows the percentage improvement for all cases after only a single round of UDC, removing both noisy and incorrect labels. Errors are calculated as the standard error on the mean for results obtained from eight models overall to reduce bias on a particular validation set (four model architectures based on Residual Convolutional Neural Network (ResNet) 23 and Dense Convolutional Network (DenseNet) 24 , each trained on two cross-validation phases).

For the multi-class case, results show that UDC is resilient and can identify poor-quality data and improve AI accuracy even at more extreme levels of errors (i.e. 30–70%) compared with the binary case. UDC fails when the percentage of incorrect labels in each class approaches 80%. This is because when 80% or more of a class’ labels are distributed into three other classes, this results in fewer correct labels than incorrect labels for that class, making model training impossible as the model is pulled away from convergence by a larger number of incorrectly vs. correctly labeled data, and making such data uncleansable .

Taken together, these results suggest that UDC creates cleansed datasets that can be used to develop high performing AI models that are both accurate and generalizable using fewer training data, and reduced training time and cost. Near baseline-level performance was achieved using datasets containing up to 70% fewer training data. We showed that 97% accuracy could be achieved on datasets with up to 60% fractional incorrect labels for all classes, using less than 30% of the original amount of training data. In an even more extreme case with 70% incorrect labels, UDC had a higher false positive rate (correct labels images identified as incorrect), which resulted in the removal of 95% of the original dataset, but models trained on the remaining 5% still achieved over 92% accuracy on a blind test set. Finally, application of UDC gave greater stability and accuracy during the training process (across epochs), which means that AI model selection for deployment can be automated because the selection is not hyper-dependent on a training epoch once a threshold of accuracy is achieved.

Application of UDC

The UDC technique was then applied to two healthcare problems, pediatric chest X-ray images for identification of pneumonia, and embryo images to identify likelihood of pregnancy (viability) for in vitro fertilization (IVF). Finally, UDC was also shown to be able to cleanse benchmark test datasets themselves to enable a truer and more realistic representation of AI performance.

Assessment of pediatric chest X-rays for pneumonia detection

A publicly available dataset of pediatric chest X-ray images with associated labels of “Pneumonia” or “Normal” from Kaggle 25 was used. The labels were determined by multiple expert physicians. There were 5232 images in the training set and 624 images in the test set. UDC was applied to all 5856 images. Approximately 200 images were identified as noisy, while no labels were identified as incorrect. This suggests there were no suspected labeling errors in the dataset, but the images identified by UDC were considered poor-quality or uninformative. Poor-quality images in this dataset mean that labels of “normal” or “pneumonia” were not easily identifiable with certainty from the X-ray.

Figure 1. Cohen’s kappa test for Noisy and Correct labels shows that images with Correct labels lead to a significantly higher level of agreement than random chance, and significantly higher than those with Noisy labels.

To verify the result, an independent expert radiologist assessed 200 X-ray images from this dataset, including 100 that were identified as noisy by UDC, and 100 that were identified as correct. The radiologist was only provided the image, and not the image label nor the UDC assessment. Images were assessed in random order, and the radiologist’s assessment of the label for each image recorded. Results showed that the reader consensus between the radiologist’s label and the original label was significantly higher for the correct images compared with the noisy images. Applying Cohen’s kappa test 26 on the results gives levels of agreement for noisy ( \(\kappa \approx 0.05\) ) and correct ( \(\kappa \approx 0.65\) ) labels (refer to Fig. 1 ). This confirms that for noisy images detected by UDC, there is insufficient information in the image alone to conclusively (or easily) make an assessment of pneumonia by either the radiologist or the AI. UDC could therefore prove beneficial as a screening tool for radiologists that could help triage difficult to read or suspicious (noisy) images that warrant further in-depth evaluation or additional tests to support a definitive diagnosis.

We then compared AI performance when trained using the original uncleansed X-ray training dataset versus UDC-cleansed X-ray training dataset with noisy images removed. Results are shown in Fig. 2 . The blue bar in the figure represents a theoretical maximum accuracy possible on the test dataset. It is obtained by testing every trained AI model on the test dataset to find the maximum accuracy that can be achieved. The orange bar is the actual (generalized) accuracy of the AI obtained using standard practice for training and selecting a final AI model using the validation set, then testing AI performance on the test set. The difference between the blue bar and orange bar indicates the generalizability of the AI, i.e. the ability of the AI to reliably apply to unseen data. Results show that training the AI on a UDC-cleansed dataset increases both the accuracy and generalizability of the AI. Additionally, the AI trained using a UDC-cleansed dataset achieved 95% generalized accuracy. This exceeds the 92% accuracy reported for other models in the literature using this same chest X-ray dataset 27 .

Figure 2. Balanced accuracy before and after UDC. The orange bar represents the AI accuracy on the test dataset using standard AI training practice. The blue bar represents the theoretical maximum AI accuracy possible on the test dataset. The discrepancy between these two values is indicative of the generalizability of the model.

Lastly, we investigated application of UDC on the test dataset of X-ray images to assess its quality. This is vital because the test dataset is used by AI practitioners to assess and report on AI performance. Too much poor-quality data in a test set means the AI accuracy is not a true representation of AI performance. To evaluate this, we injected the uncleansed test dataset into the training dataset used to train the AI to determine the maximum accuracy that could be obtained on the validation dataset. Figure 3 shows reduced performance of AI trained using the aggregated dataset (training dataset plus the noisy test dataset) compared with the AI trained only using the cleansed training set. This suggests that the level of poor-quality data in the test dataset is significant, and thus care should be taken when AI performance is measured using this particular test dataset.

Figure 3. The colors of the bars represent the performance of the model on the validation set, with (orange) and without (blue) the test set included in the training set. AI performance drops when the uncleansed blind test set is included in the training set, indicating a considerable level of poor-quality data in the test set.

Figure 4. Performance metrics of the AI model predicting clinical pregnancy, trained on original (left section) and UDC-cleansed (right section) training data. Both graphs show results on the validation set (green), and the corresponding original test set (blue) and UDC-cleansed test set (orange).

Assessment of embryo quality for IVF

Finally, UDC was successfully applied to the problem of assessing embryo viability in IVF. UDC was a core technique in developing a commercial AI healthcare product, which is currently being used in IVF clinics globally 17 . The AI model analyzes images of embryos at Day 5 of development to identify which ones are viable and likely to lead to a clinical pregnancy.

Clinical pregnancy is measured by the presence of a fetal heartbeat at the first ultrasound scan approximately 6 weeks after the embryo is transferred to an IVF patient. An embryo is labeled viable if it led to pregnancy, and non-viable if a pregnancy did not occur. Although there is certainty in the outcome (pregnancy or no pregnancy), there is uncertainty in the labels, because there may be patient medical conditions or other factors beyond embryo quality that prevent pregnancy (e.g. endometriosis) 17 . Therefore, an embryo that is viable may be incorrectly labeled as non-viable. These incorrect labels impact the performance of the AI if not identified and removed.

UDC was applied to images of embryos to identify incorrect labels. These were predominantly in the training dataset’s non-viable class, as expected, as they included embryos that appeared viable but were labeled as unviable. Performance results are shown in Fig. 4 . AI models trained using a UDC-cleansed training dataset achieved an increase in accuracy, from 59.7 to 61.1%, when reported on the standard uncleansed test dataset. This small increase in accuracy was not statistically significant, but could potentially be misleading, as the uncleansed test set itself may comprise a significant portion of incorrectly labeled non-viable embryo images, thus reducing specificity as the AI model improves. For the predominantly clean viable class, sensitivity increased by a larger amount, from 76.8 to 80.6%. When a UDC-cleansed test set is utilized, AI models trained using a UDC-cleansed training dataset achieved an increase in accuracy from 73.5 to 77.1%. This larger increase is a truer representation of the AI performance, and while the uplift is just at the \(1-\sigma\) level, it is noted that a medical dataset may require multiple rounds of UDC to fully cleanse the training set.

Effect sizes before and after UDC are represented using Cohen’s d , as shown in Table 3 , along with p -values. Effect sizes larger than 0.6 are considered “large”, meaning that for all test sets (including validation), UDC has a large effect on training and inference (test) performance, except for specificity results for both cleansed and uncleansed (expected due to the large proportion of incorrectly labeled non-viable embryos) test sets. This can be interpreted as there being a significant pair-wise uplift in sensitivity without much cost to specificity. In all cases, there is very large ( \(d>1.4\) ) effect on overall performance. Taken together these results suggest that using UDC to cleanse training datasets can improve the accuracy of the AI even in clinical datasets with a high level of mislabeled, poor-quality data.

This study characterizes a novel technique, Untrainable Data Cleansing (UDC), that serves to automatically identify, and thus allow removal of, poor-quality data to improve AI performance and reporting. In the clinical setting, accurate AI could mean the difference between life and death, or early diagnosis versus missed diagnosis. Thus it is critical that poor-quality data are identified and removed so as not to confuse the AI training process and impact clinical outcomes. It can be difficult for even the most experienced clinicians to identify poor-quality data, particularly when the clinical outcome is uncertain, or the quality and integrity of the data does not allow for a definitive labeling of the image. Furthermore, due to data privacy laws, it may not even be possible to manually assess data quality of private medical datasets. Because UDC can be “shipped” to the secure location of private data, it offers an automated way of addressing data quality concerns while respecting data privacy laws.

UDC was validated across two problem sets, (1) cats vs. dogs, and (2) vehicles, or binary and multi-classification problems, respectively, because image labels could be manually verified. In both cases UDC was effective at identifying synthetically introduced incorrect labels. Training AI models following removal of poor-quality data significantly improved the AI performance, in terms of both accuracy and generalizability. UDC was applied to two medical problem sets, one for pneumonia detection in chest X-rays, and the other for embryo selection in IVF. Both are challenging clinical assessment areas due to varying degrees of noisy or incorrectly labeled data. In both studies UDC was effective (as measured on double blind datasets) at identifying poor quality data, and yielded significant improvements in accuracy and generalizability.

In UDC, the use of a variety of model architectures as well as k -fold cross-validation serves to mitigate overfitting on smaller datasets. Though there may always be a trivial lower bound on dataset size, the behavior of UDC as total training images decreases was found to be stable after training on (cleansed) datasets as low as 5% the size of the initial training datasets. Nevertheless, to further alleviate the effect of overfitting, all models are trained using data augmentation, dropout, weight balancing, learning rate scheduling, weight decay, and early stopping.

In the same way that poor-quality training data can impact AI training, the reporting (or testing) of AI performance done so on a poor-quality test dataset (that contains noisy or incorrectly labeled data) can lead to inaccurate performance reporting. Inaccurate reporting can mislead clinicians and patients on the true performance and reliability of the AI, with potential real-world consequences for those that may rely on AI results. We assessed the utility of UDC for cleansing test datasets and showed that the accuracy of the AI reported on test datasets cleansed with UDC was different to that reported on uncleansed test datasets. The reporting of AI accuracy on a UDC-cleansed test set was shown to be a truer representation of the AI performance based on independent assessment.

Since UDC relies on pooling or ensembling various model architectures, its advantages (namely, that it is relatively agnostic to initial feature-crafting and allows for automated implementation) could be enhanced with the application of discriminant analysis techniques to provide richer context when identifying noisy or incorrect data. Future research is possible investigating how the granularity of Principal Component Analysis-Linear Discriminant Analysis (PCA-LDA) or related techniques, which cannot be applied directly to images themselves, but to model activation (or feature) layers, could be applied to more precisely explain why individual samples were identified as poor-quality by UDC. Such methods may be able to identify which features in each image are not in agreement with the received label.

Finally, we showed that UDC was able to identify noisy data, which in the case of the pneumonia X-rays neither the AI nor the radiologist could consistently classify. The ability for UDC to identify these cases suggests it can be used as a triage tool and direct clinicians to those cases that warrant new tests or additional in-depth clinical assessment. This study demonstrates that the performance of AI for clinical applications is highly dependent on the quality of clinical data, and the utility of a method like UDC that can automatically cleanse otherwise poor-quality clinical data cannot be overstated.

UDC algorithm

Untrainable Data Cleansing (UDC) can identify three categories of image-label pairs:

Correct —strongly expected to be correct (i.e. label matches the ground-truth).

Incorrect —strongly expected to be incorrect (i.e. label does not match ground-truth).

Noisy —data is ambiguous or uninformative to classify with certainty (i.e. label may or may not match ground-truth).

UDC delineates between correct, incorrect, and noisy images using the method described in Algorithm 1, which combines sampling over n different model architectures with sampling across the data using k-fold cross-validation (KFXV). These \(n\times k\) models vote on each image-label pair in order to reduce bias and increase robustness to outliers.


Prior to training each AI model configuration, the images are minimally pre-processed by auto-scaling their resolutions to \(224 \times 224\) pixels, and normalizing the color levels using the standard ImageNet mean RGB values of (0.485, 0.456, 0.406) and standard deviations of (0.229, 0.224, 0.225), respectively.
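
A minimal sketch of that normalization step, assuming img is already a 224 x 224 x 3 array with channel values scaled to [0, 1] (the resizing itself would be handled by an image library):

```r
imagenet_normalize <- function(img,
                               channel_means = c(0.485, 0.456, 0.406),
                               channel_sds   = c(0.229, 0.224, 0.225)) {
  # subtract the per-channel mean and divide by the per-channel standard deviation
  for (ch in 1:3) {
    img[, , ch] <- (img[, , ch] - channel_means[ch]) / channel_sds[ch]
  }
  img
}

normalized <- imagenet_normalize(array(runif(224 * 224 * 3), dim = c(224, 224, 3)))
```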

We describe UDC as “ turning AI onto itself ”, as it uses the AI training process to identify poor-quality data. Multiple AI models using different architectures and parameters are trained using the data (to be cleansed), then the AI models are applied back on the same training data to infer their labels. Data which cannot be consistently classified correctly are identified as poor-quality (i.e. incorrect or noisy).

The idea behind UDC is that if data cannot be consistently correctly classified within the AI training process itself, which is where AI models are likely to find the strongest correlations and correct classifications on the dataset used to train/create it, then the data is likely to be poor-quality.

The intuition behind using UDC to subdivide image-label pairs into three types of labels is based on probability theory. A correct label can be thought of as a positively weighted coin ( \(p\gg 0.5\) ), where p is the probability of being correctly predicted by a certain model. In contrast, an incorrect label can be thought of as a negatively weighted coin ( \(p\ll 0.5\) )—very likely to be incorrectly predicted. A noisy label can be thought of as a fair coin ( \(p\approx 0.5\) )—equally likely to be correctly or incorrectly predicted. To illustrate how this intuition applies to UDC, we consider a hypothetical dataset of N image-label pairs. Algorithm 1 is applied to this dataset to produce a number of successful predictions ( \(s_j\) ) for each image j . A histogram of \(s_j\) values (with increasing s on the x -axis) then shows how correct, incorrect, and noisy labels tend to cluster at high, low, and medium values of s , respectively.
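
To make the voting procedure concrete, the following model-agnostic sketch counts the successful predictions \(s_j\) per sample by training several architectures across k folds and applying every model back to the full training set, then buckets samples by that count. The training and prediction functions, cut-offs, and k are placeholders rather than the paper's exact configuration:

```r
udc_votes <- function(data, label_col, train_fns, predict_fn, k = 5, seed = 1) {
  set.seed(seed)
  n     <- nrow(data)
  folds <- sample(rep(seq_len(k), length.out = n))  # k-fold assignment of samples
  votes <- integer(n)                               # s_j: successful predictions per sample
  for (train_fn in train_fns) {                     # sample over model architectures
    for (fold in seq_len(k)) {                      # sample over the data (k-fold CV)
      model <- train_fn(data[folds != fold, ])      # fit on k - 1 folds
      preds <- predict_fn(model, data)              # infer labels on ALL training samples
      votes <- votes + as.integer(preds == data[[label_col]])
    }
  }
  votes
}

# Bucket samples into the three UDC categories; the cut-offs are illustrative.
udc_category <- function(votes, n_models, hi = 0.8, lo = 0.2) {
  frac <- votes / n_models
  ifelse(frac >= hi, "correct", ifelse(frac <= lo, "incorrect", "noisy"))
}
```

With n architectures and k folds there are n x k votes per sample; high counts cluster as correct, low counts as incorrect, and intermediate counts as noisy, mirroring the histogram described above.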

Synthetic errors (incorrect labels) were added to the training dataset of 24,916 images, which was split 80/20 into training and validation sets. For each study \(t\), \(\mathbf{n}^{(t)}=\left(n_{cat}^{(t)},\,n_{dog}^{(t)}\right)\), with \(0\le n(\%) \le 100\), contains the fractional levels of incorrect labels for the cat and dog classes, respectively (see Table 4); RN stands for ResNet 23 and DN stands for DenseNet 24.

Synthetic errors (incorrect labels) were added to the training dataset of 18,991 images, which was split 80/20 (\(k=5\)) into training and validation sets. For each study \(t\), \(\mathbf{n}^{(t)}=\left(n_{A}^{(t)},\,n_{B}^{(t)},\,n_{M}^{(t)},\,n_{T}^{(t)}\right)\), with \(0\le n(\%) \le 100\), represents the fractional levels of incorrect labels for the airplane (A), boat (B), motorcycle (M), and truck (T) classes, respectively (see Table 5); R stands for ResNet 23 and D stands for DenseNet 24. Note that in Table 5 the fractional level of incorrect labels was kept constant across classes in each study, so only one value is shown.

Reader consensus between radiologists for correct vs. noisy labels

Images identified by UDC to have noisy labels are suspected to have inconsistencies rendering their annotation (or labeling) more difficult. As such, we expect the reader consensus of Pneumonia/Normal assessments between different radiologists to be lower for images with noisy labels than for those with correct labels that are easily identified by the AI model and for which we expect a relatively high reader consensus between radiologists. The following two hypotheses are formulated and can be directly tested using the (Cohen’s) kappa test:

\(\mathcal{H}_0^{(1)}\): The level of agreement between radiologists for noisy labels is different from random chance.

\(\mathcal{H}_a^{(1)}\): The level of agreement between radiologists for noisy labels is no different from random chance.

\(\mathcal{H}_0^{(2)}\): The level of agreement between radiologists for correct labels is no greater than random chance.

\(\mathcal{H}_a^{(2)}\): The level of agreement between radiologists for correct labels is greater than random chance.

We prepare an experimental dataset by splitting the data into correct and noisy labels as follows, where the two subsets are used in a clinical study to test the above hypotheses and validate UDC:

A dataset \(\mathcal{D}\) with 200 elements \(z_j = \left(\mathbf{x}_j,\,\hat{y}_j\right)\) has images \(\mathbf{x}_j\) and (noisy) annotated labels \(\hat{y}_j\). This dataset is split into two equal subsets of 100 images each:

\(\mathcal{D}_{clean}\): labels identified as correct by UDC, with the following breakdown: 52 Pneumonia (39 Bacterial / 13 Viral).

\(\mathcal{D}_{noisy}\): labels identified as noisy by UDC, with the following breakdown: 49 Pneumonia (14 Bacterial / 35 Viral).

The dataset \(\mathcal {D}\) is randomized to create a new dataset \(\hat{\mathcal {D}}\) for an expert radiologist to label, and to indicate confidence or certainty in those labels (Low, Medium, High). This randomization is to address fatigue bias and any bias related to the ordering of the images.

The reader consensus between the expert radiologist and the original labels is calculated using Cohen’s kappa test 26 and is compared between the \(\mathcal{D}_{clean}\) and \(\mathcal{D}_{noisy}\) subsets.
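
A minimal sketch of that comparison, assuming two character vectors of labels ("Pneumonia"/"Normal") over the same 200 images, original from the dataset and radiologist from the expert re-read, plus index vectors clean_idx and noisy_idx marking the two UDC subsets:

```r
# Cohen's kappa computed from the confusion table of two raters.
cohens_kappa <- function(rater_a, rater_b) {
  lvls <- union(unique(rater_a), unique(rater_b))
  tab  <- table(factor(rater_a, lvls), factor(rater_b, lvls))
  n    <- sum(tab)
  po   <- sum(diag(tab)) / n                      # observed agreement
  pe   <- sum(rowSums(tab) * colSums(tab)) / n^2  # agreement expected by chance
  (po - pe) / (1 - pe)
}

kappa_clean <- cohens_kappa(original[clean_idx], radiologist[clean_idx])
kappa_noisy <- cohens_kappa(original[noisy_idx], radiologist[noisy_idx])
```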

Figure 1 provides visual evidence showing that both null hypotheses, \(\mathcal{H}_0^{(1)}\) and \(\mathcal{H}_0^{(2)}\), are rejected with very high confidence (\(>99.9\%\)) and effect size (\(>0.85\)). Therefore, both alternate hypotheses are accepted: \(\mathcal{H}_a^{(1)}\), stating that labels identified as noisy have levels of agreement no different from random chance, and \(\mathcal{H}_a^{(2)}\), stating that labels identified by UDC as correct have levels of agreement greater than random chance.

LeCun, Y., Bengio, Y. & Hinton, G. Deep learning. Nature 521 , 436–444 (2015).

Article   ADS   CAS   Google Scholar  

Goodfellow, I., Bengio, Y. & Courville, A. Deep Learning (MIT Press, 2016).

MATH   Google Scholar  

Esteva, A. et al. A guide to deep learning in healthcare. Nat. Med. 25 , 24–29 (2019).

Article   CAS   Google Scholar  

Fourcade, A. & Khonsari, R. H. Deep learning in medical image analysis: A third eye for doctors. J. Stomatol. Oral Maxillofac. Surg. 120 , 279–288. https://doi.org/10.1016/j.jormas.2019.06.002 (2019).

Article   CAS   PubMed   Google Scholar  

Lundervold, A. S. & Lundervold, A. An overview of deep learning in medical imaging focusing on MRI. Z. Med. Phys. 29 , 102–127. https://doi.org/10.1016/j.zemedi.2018.11.002 (2019).

Article   PubMed   Google Scholar  

Litjens, G. et al. A survey on deep learning in medical image analysis. Med. Image Anal. 42 , 60–88 (2017).

Article   Google Scholar  

Esteva, A. et al. Dermatologist-level classification of skin cancer with deep neural networks. Nature 542 , 115–118 (2017).

Haenssle, H. A. et al. Man against machine: diagnostic performance of a deep learning convolutional neural network for dermoscopic melanoma recognition in comparison to 58 dermatologists. Ann. Oncol. 29 , 1836–1842 (2018).

Cheng, J.-Z. et al. Computer-aided diagnosis with deep learning architecture: applications to breast lesions in us images and pulmonary nodules in CT scans. Sci. Rep. 6 , 1–13 (2016).

Kooi, T. et al. Large scale deep learning for computer aided detection of mammographic lesions. Med. Image Anal. 35 , 303–312 (2017).

Gulshan, V. et al. Development and validation of a deep learning algorithm for detection of diabetic retinopathy in retinal fundus photographs. JAMA 316 , 2402–2410 (2016).

Poplin, R. et al. Prediction of cardiovascular risk factors from retinal fundus photographs via deep learning. Nat. Biomed. Eng. 2 , 158 (2018).

De Fauw, J. et al. Clinically applicable deep learning for diagnosis and referral in retinal disease. Nat. Med. 24 , 1342–1350 (2018).

Ciresan, D. C., Giusti, A., Gambardella, L. M. & Schmidhuber, J. Mitosis Detection in Breast Cancer Histology Images with Deep Neural Networks 411–418 (Springer, 2013).

Charoentong, P. et al. Pan-cancer immunogenomic analyses reveal genotype-immunophenotype relationships and predictors of response to checkpoint blockade. Cell Rep. 18 , 248–262 (2017).

Beck, A. H. et al. Systematic analysis of breast cancer morphology uncovers stromal features associated with survival. Sci. Transl. Med. 3 , 108ra113-108ra113 (2011).

VerMilyea, M. et al. Development of an artificial intelligence-based assessment model for prediction of embryo viability using static images captured by optical light microscopy during ivf. Hum. Reprod. (2020).

Zhang, X. & Lessard, L. Online data poisoning attacks (2020).

Raghu, M. et al. Direct uncertainty prediction for medical second opinions. arXiv:1807.01771 [cs, stat] (2019).

Kendall, A. & Gal, Y. What uncertainties do we need in Bayesian deep learning for computer vision? 5574–5584 (2017).

Natarajan, N., Dhillon, I. S., Ravikumar, P. K. & Tewari, A. Learning with Noisy Labels 1196–1204 (Curran Associates Inc, 2013).

Xiao, T., Xia, T., Yang, Y., Huang, C. & Wang, X. Learning from massive noisy labeled data for image classification. Proc. IEEE Conf. Comput. Vis. Pattern Recogn. 20 , 2691–2699 (2015).

He, K., Zhang, X., Ren, S. & Sun, J. Deep residual learning for image recognition.. Proc. IEEE Conf. Comput. Vis. Pattern Recogn. 20 , 770–778 (2016).

Huang, G., Liu, Z., Van Der Maaten, L. & Weinberger, K. Q. Densely connected convolutional networks. Proc. IEEE Conf. Comput. Vis. Pattern Recogn. 20 , 4700–4708 (2017).

Mooney, P. Chest X-ray images (pneumonia).

Sim, J. & Wright, C. C. The kappa statistic in reliability studies: Use, interpretation, and sample size requirements. Phys. Ther. 85 , 257–268 (2005).

Kermany, D. S. et al. Identifying medical diagnoses and treatable diseases by image-based deep learning. Cell 172 , 1122–1131 (2018).

Author information

Authors and affiliations

Presagen, Adelaide, SA, 5000, Australia

M. A. Dakka, T. V. Nguyen, J. M. M. Hall, S. M. Diakiw, M. Perugini & D. Perugini

School of Mathematical Sciences, The University of Adelaide, Adelaide, SA, 5000, Australia

  • M. A. Dakka

School of Computing and Information Technology, University of Wollongong, Wollongong, NSW, 2522, Australia

  • T. V. Nguyen

Australian Research Council Centre of Excellence for Nanoscale BioPhotonics, Adelaide, SA, 5000, Australia

  • J. M. M. Hall

Ovation Fertility, Austin, TX, 78731, USA

M. VerMilyea

Texas Fertility Center, Austin, TX, 78731, USA

Department of Medical Imaging-SAMI, Women’s and Children’s Hospital Campus, Adelaide, SA, 5000, Australia

Adelaide Medical School, The University of Adelaide, Adelaide, SA, 5000, Australia

M. Perugini

Contributions

D.P. invented the concept, M.A.D. designed the algorithm, M.A.D. and J.M.M.H. and T.V.N. and D.P. conceived the experiments, M.A.D. and J.M.M.H. and T.V.N. conducted the experiments, M.V. and R.L. provided clinical data and clinical review, D.P. and M.A.D. and J.M.M.H. and T.V.N. and S.M.D. and M.P. drafted the manuscript and provided critical review of the results.

Corresponding authors

Correspondence to J. M. M. Hall or D. Perugini .

Ethics declarations

Competing interests.

J.M.M.H., D.P., and M.P. are co-owners of Presagen. S.M.D., T.V.N., and M.A.D. are employees of Presagen.

Cite this article.

Dakka, M.A., Nguyen, T.V., Hall, J.M.M. et al. Automated detection of poor-quality data: case studies in healthcare. Sci Rep 11 , 18005 (2021). https://doi.org/10.1038/s41598-021-97341-0

Received : 19 April 2021

Accepted : 23 August 2021

Published : 09 September 2021

DOI : https://doi.org/10.1038/s41598-021-97341-0

COVID-19 surveillance data quality issues: a national consecutive case series

BMJ Open, Volume 11, Issue 12

  • Cristina Costa-Santos 1, 2,
  • Ana Luisa Neves 1, 2, 3,
  • Ricardo Correia 1, 2,
  • Paulo Santos 1, 2,
  • Matilde Monteiro-Soares 1, 2, 4,
  • Alberto Freitas 1, 2,
  • Ines Ribeiro-Vaz 1, 2, 5,
  • Teresa S Henriques 1, 2,
  • Pedro Pereira Rodrigues 1, 2,
  • Altamiro Costa-Pereira 1, 2,
  • Ana Margarida Pereira 1, 2,
  • Joao A Fonseca 1, 2
  • 1 Department of Community Medicine, Information and Health Decision Sciences (MEDCIDS), Faculty of Medicine, University of Porto, Porto, Portugal
  • 2 Centre for Health Technology and Services Research (CINTESIS), Faculty of Medicine, University of Porto, Porto, Portugal
  • 3 Patient Safety Translational Research Centre, Institute of Global Health Innovation, Imperial College London, London, UK
  • 4 Escola Superior de Saúde da Cruz Vermelha Portuguesa, Lisbon, Portugal
  • 5 Porto Pharmacovigilance Centre, Faculty of Medicine, University of Porto, 4200-450 Porto, Portugal
  • Correspondence to Dr Cristina Costa-Santos; csantos.cristina{at}gmail.com

Objectives High-quality data are crucial for guiding decision-making and practising evidence-based healthcare, especially if previous knowledge is lacking. Nevertheless, data quality frailties have been exposed worldwide during the current COVID-19 pandemic. Focusing on a major Portuguese epidemiological surveillance dataset, our study aims to assess COVID-19 data quality issues and suggest possible solutions.

Settings On 27 April 2020, the Portuguese Directorate-General of Health (DGS) made available a dataset (DGSApril) for researchers, upon request. On 4 August, an updated dataset (DGSAugust) was also obtained.

Participants All COVID-19-confirmed cases notified through the medical component of the National System for Epidemiological Surveillance until the end of June.

Primary and secondary outcome measures Data completeness and consistency.

Results DGSAugust did not follow the same data format and variables as DGSApril, and a significant number of missing data and inconsistencies were found (eg, 4075 cases from DGSApril were apparently not included in DGSAugust). Several variables also showed a low degree of completeness and/or changed their values from one dataset to another (eg, the variable 'underlying conditions' had more than half of cases showing different information between datasets). There were also significant inconsistencies between the number of cases and deaths due to COVID-19 shown in DGSAugust and in the daily public reports provided by DGS.

Conclusions Important quality issues of the Portuguese COVID-19 surveillance datasets were described. These issues can limit the usability of surveillance data to inform good decisions and perform useful research. Major improvements in surveillance datasets are therefore urgently needed (for example, simplification of data entry processes, constant monitoring of data, and increased training and awareness of healthcare providers), as low data quality may lead to deficient pandemic control.

  • information management
  • health informatics
  • epidemiology
  • public health
  • statistics & research methods

Data availability statement

Data may be obtained from a third party and are not publicly available. The data related to this study (deidentified COVID-19 cases information) were made available by Portuguese Directorate-General of Health to authors upon request and after submission of a research proposal and documented approval by an ethical committee. All data are available from the corresponding author on a reasonable request and after authorisation from Portuguese Directorate-General of Health.

https://doi.org/10.1136/bmjopen-2020-047623

Strengths and limitations of this study

As accurate data in an epidemic are crucial to guide public health policies, this study identifies quality issues of the COVID-19 surveillance datasets.

This study only examined the quality issues of COVID-19 surveillance datasets from one country, Portugal.

Several strategies for improving the quality of healthcare data were recommended.

Introduction

The availability of accurate data in an epidemic is crucial to guide public health measures and policies. 1 During outbreaks, making epidemiological data openly available, in real time, allows researchers with different backgrounds to use diverse analytical methods to build evidence 2 3 in a fast and efficient way. This evidence can then be used to support adequate decision-making which is one of the goals of epidemiological surveillance systems. 4 To ensure that high-quality data are collected and stored, several factors are needed, including robust information systems that promote reliable data collection, 5 adequate and clear methods for data collection and integration from different sources, as well as strategic data curation procedures. Epidemiological surveillance systems need to be designed having data quality as a high priority and thus promoting, rather than relying on, users’ efforts to ensure data quality. 6 Only timely, high-quality data can provide valid and useful evidence for decision-making and pandemic management. On the contrary, using datasets without carefully examining the metadata and documentation that describes the overall context of data can be harmful. 7

The low data quality of epidemiological surveillance systems has been a matter of concern worldwide. In fact, Boes and colleagues assessed the German surveillance system for acute hepatitis B infections. They concluded that although timeliness improved over the evaluation period, data quality in terms of completeness of information decreased considerably. Authors also stress that as improved data completeness is required to adequately design prevention activities, reasons for this decrease should further be explored. 8 On the other hand, other authors assessed timeliness and data quality of Italy’s surveillance system for acute viral hepatitis and concluded that this system collects high-quality data, but wide reporting delays exist. 9 Another study evaluated the quality of the influenza-like illness surveillance system in Tunisia and concluded that to better monitor influenza, the quality of data collected by this system should be closely monitored and improved. 10 Visa and colleagues, in Nigeria, evaluated the Kano State malaria surveillance system and recommended strategies to improve data quality. 11 Regarding COVID-19 pandemic, a recent study that evaluated the accuracy of COVID-19 data collection by the Chinese Center for Disease Control and Prevention, WHO, and European Centre for Disease Prevention and Control showed noticeable and increasing measurement errors in the three datasets as more countries contributed data for the official repositories. 7

At the moment, producing these high-quality datasets within a pandemic is nearly impossible without a broad collaboration between health authorities, health professionals and researchers from different fields. The urgency to produce scientific evidence to manage the COVID-19 pandemic contributes to lower quality datasets that may jeopardise the validity of results, generating biased evidence. The potential consequences are suboptimal decision-making or even not using data at all to drive decisions. Methodological challenges associated with analysing COVID-19 data during the pandemic, including access to high-quality health data, have been recognised 12 and some data quality concerns were described. 7 Nevertheless, to our knowledge, there is no study performing a structured assessment of data quality issues from the datasets provided by the National Surveillance Systems for research purposes during the COVID-19 pandemic. Although this is a worldwide concern, this study will use Portuguese data as a case study.

The Portuguese systems to input COVID-19 data and the data flows

In early March, the first cases of COVID-19 were diagnosed in Portugal. 13 The Portuguese surveillance system for mandatory reporting of communicable diseases is named SINAVE (National System for Epidemiological Surveillance) and operates under the Directorate-General of Health (DGS). COVID-19 is included in the list of communicable diseases that must be notified through this system, either by medical doctors (through SINAVE MED) or by laboratories (SINAVE LAB). A COVID-19-specific platform (Trace COVID-19) was created for the clinical management of patients with COVID-19 and contact tracing. However, data from both SINAVE and Trace COVID-19 are not integrated in the electronic health record (EHR). Thus, healthcare professionals need to register similar data, several times, for the same suspected or confirmed case of COVID-19, increasing the burden on healthcare professionals and potentially leading to data entry errors and missing data.

The SINAVE notification form includes a large number of variables, with few or no features to help data input. Some examples include: (1) within general demographic characteristics, patient occupation is chosen from a drop-down list with hundreds of options and with no free text available; (2) the 15 questions regarding individual symptoms need to be individually filled using a three-response-option drop-down list, even for asymptomatic patients; (3) in the presence of at least one comorbidity, 10 specific questions on comorbidities need to be filled; and (4) there are over 20 questions to characterise clinical findings, disease severity, and use of healthcare resources, including details on hospital isolation. Other examples of the suboptimal design are (5) the inclusion of two questions on autopsy findings among symptoms and clinical signs, although no previous question ascertains whether the patient has died; (6) the lack of a specific question on disease outcome (only hospital discharge date); (7) the lack of validation rules, which allows, for example, a disease diagnosis prior to the birth date or a discharge before the date of hospital admission; and (8) no mandatory data fields, allowing the user to proceed without completing any data. Furthermore, a global assessment of disease severity is included with the options 'unknown', 'severe', 'moderate' and 'not applicable', without a readily available definition and without the possibility to classify the disease as mild. This unfriendly system may impair the quality of COVID-19 surveillance data.

The problems described have existed for a long time in SINAVE and are usually solved by personal contact with the local health authorities. However, in the current COVID-19 pandemic scenario, and due to the pressure of the huge number of new cases reported daily, this is not happening at the moment.

There is more than one possible data flow from the moment the data are introduced until the dataset is made available to researchers. Figure 1 is an example of the information flow from data introduced by public health professionals until the analysis of data.

Figure 1 Example of one possible information flow from the moment the data are introduced until the dataset is made available to researchers. The ⊗ symbol means that data are not sent and are therefore not present in the research database (DB). The dashed line represents a cumbersome manual process that is often executed by public health professionals and is very susceptible to errors. DGS, Directorate-General of Health.

Since the beginning of the pandemic, several research groups in Portugal stated their willingness to contribute by producing knowledge and improving data systems and data quality. 14 Researchers requested access to healthcare-disaggregated data related to COVID-19, in order to produce, in a timely manner, scientific knowledge to help evidence-based decision-making during the pandemic. In April, DGS made a document publicly available with the description and metadata of the dataset to be provided and a form to be filled by researchers to request this dataset. 15 16 A research protocol and a documented approval by an ethical committee were also necessary. A metadata document was available and the researchers knew what variables they would receive if they formally requested the dataset. The variables available did not include, for instance, clinical presentation or specific disease symptoms. The variable formats were described in the provided metadata but were not discussed or adjusted based on researchers' opinions or needs. In the metadata, the coded value list was described (eg, Y=Yes, N=No, Unk=Unknown) but not the coding mechanism, that is, how the form answer (given by the healthcare professional) was coded in the dataset. Therefore, although there was an 'agreed' dataset specification document, it was not complete enough to fully understand the provided data. Along with the data request form and metadata document, researchers were informed that they would receive weekly data updates.

On 27 April 2020, the DGS sent the described dataset (DGSApril), collected through SINAVE MED and in accordance with the metadata document made available beforehand. At least 50 research groups received the data and started their dataset analyses. Weekly dataset updates were not provided; only on 4 August 2020 did DGS send an updated dataset (DGSAugust) to the research groups who had requested the first dataset, including the COVID-19 cases already in the initial dataset plus new cases diagnosed during May and June 2020. This updated dataset did not respect the metadata document initially provided and had an inconsistent manifest, with some variables presented in a different format or absent. For example, instead of a variable with the outcome of the patient, the second dataset presented two dates, death and recovery date, and this new version did not distinguish between deaths due to COVID-19 and deaths due to other causes. The updated dataset also used different definitions (for example, the variable age was defined as the age at the time of COVID-19 onset in the first dataset and as the age at the time of COVID-19 notification in the second). The variable on preconditions also had different categories: for example, in the first dataset the comorbidities variable had the category 'cardiac disease', which was not present in the updated version of the dataset. All these aspects raised concerns regarding the use of the updated dataset to replicate analyses made with the first version of the data and, consequently, regarding its use for valid research.

We aimed to assess data quality issues of COVID-19 surveillance data and suggest solutions to overcome them, using the Portuguese surveillance datasets as an example.

The data provided by DGS included all COVID-19-confirmed cases notified through SINAVE MED, thus excluding those only reported by laboratories (SINAVE LAB).

The DGSApril dataset was provided on 27 April 2020 and the updated one (DGSAugust) on 4 August 2020. The available variables in both datasets are described in online supplemental file 1 .

Supplemental material

There was a variable named ‘outcome’, with the information on the outcome of the case, present in DGSApril dataset that was not available in the DGSAugust dataset. On the other hand, there were also some variables (dead, recovery, diagnosis and discharge dates) present in DGSAugust dataset that were not available in the DGSApril dataset.

The quality of the data was assessed through the analysis of data completeness and consistency between the DGSApril and DGSAugust datasets. For the evaluation of data completeness, missing information was classified as 'system missing' when no information was provided (blank cells) and as 'coded as unknown' when the value 'unknown' was coded. Regarding consistency, both datasets were compared in order to evaluate whether data quality improved with the update sent 4 months later. As many data entry errors could be avoided using an optimised information system, the potential data entry errors in DGSAugust were also described.

The main outcome measures were: the frequency of cases with missing information, the frequency of cases with unmatched information between the datasets and its update, and the frequency of cases with wrong data entry (considered impossible values) for each variable.
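As a rough illustration of how these measures can be computed, the sketch below (hypothetical file and column names; it assumes blank cells represent 'system missing' and the code 'Unk' represents 'coded as unknown') counts missing information per variable and field-level mismatches for cases sharing the same unique case identifier:

```python
# Minimal sketch of the completeness and consistency checks described above.
# File names, the "case_id" column and the "Unk" code are hypothetical.
import pandas as pd

april = pd.read_csv("dgs_april.csv", dtype=str).set_index("case_id")
august = pd.read_csv("dgs_august.csv", dtype=str).set_index("case_id")

def completeness(df):
    """Per-variable counts of system missing (blank cells) and coded-as-unknown."""
    system_missing = df.isna() | (df == "")
    coded_unknown = df == "Unk"
    return pd.DataFrame({"system_missing": system_missing.sum(),
                         "coded_unknown": coded_unknown.sum()})

print(completeness(april))
print(completeness(august))

# Consistency: compare cases present in both releases, variable by variable.
common_ids = april.index.intersection(august.index)
shared_cols = april.columns.intersection(august.columns)
a = april.loc[common_ids, shared_cols].fillna("")
b = august.loc[common_ids, shared_cols].fillna("")
print((a != b).sum().sort_values(ascending=False))  # unmatched cases per variable
```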

The number of COVID-19 cases and the number of deaths due to COVID-19 were also compared with the daily public report by the Portuguese DGS. 17 We highlight that the daily numbers of cases and deaths reported publicly are not expected to coincide with the numbers obtained in the datasets made available to researchers, as these datasets included only the COVID-19 cases notified through SINAVE MED (excluding those only reported by laboratories). However, calculating this difference is important to estimate the potential bias of the datasets (DGSApril and DGSAugust) provided by DGS to researchers. This comparison is only possible for the DGSAugust dataset, as the variable date of diagnosis was not available in the DGSApril dataset.

Statistical methods

Descriptive statistics are presented as absolute and relative frequencies.

Data handling and analyses were performed using IBM SPSS Statistics V.26 and R V.4.0.3.

Patient and public involvement

As this study used secondary data, it was not possible to involve the participants in the study, in the design or in the recruitment and conduct of the study. However, the results have been and will continue to be disseminated not only to DGS but also to patients and the whole community through the media.

Cases included and omitted

From the 20 293 COVID-19 cases included in the DGSApril dataset, only 80% (n=16 218) had the same unique case identifier in the DGSAugust dataset. There were 4075 cases in the DGSApril dataset that were not included in the DGSAugust dataset or, alternatively, had changed the unique case identifier. The DGSAugust dataset provided a total of 38 545 COVID-19 cases, including 22 327 that were not available in DGSApril dataset: 5713 diagnosed until 27 April but that presumably were not included in the DGSApril dataset, 16 609 diagnosed after the period included in the DGSApril dataset and 5 cases with missing information on diagnosis date ( figure 2 ).
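The reconciliation behind these counts reduces to simple set operations on the unique case identifiers; a minimal sketch (hypothetical file and column names) is shown below:

```python
# Minimal sketch: which unique case identifiers appear only in the April
# release, only in the August release, or in both.
import pandas as pd

april_ids = set(pd.read_csv("dgs_april.csv", dtype=str)["case_id"])
august_ids = set(pd.read_csv("dgs_august.csv", dtype=str)["case_id"])

print("in both releases:", len(april_ids & august_ids))
print("only in April (dropped or re-identified):", len(april_ids - august_ids))
print("only in August (new or previously omitted):", len(august_ids - april_ids))
```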

Figure 2 Number of unique case identifiers presented in the datasets of COVID-19 cases diagnosed since the start of the pandemic until 27 April (the date when the first database was made available) and after 27 April. DGS, Directorate-General of Health.

Considering the 5713 cases made available only in the DGSAugust and diagnosed before 27 April that, presumably, were not included in the DGSApril dataset, the majority (58%) were diagnosed in the 2 weeks immediately prior to 27 April (the date on which this database was made available). However, 42% were diagnosed more than 2 weeks before the DGSApril dataset was made available ( figure 2 ).

Data completeness of both datasets

Several variables showed a low degree of completeness. For example, two variables (‘date of first positive laboratory result’ and ‘case required care in an intensive care unit’) had more than 90% of cases with missing information in DGSApril dataset—coded as unknown or system missing. In the DGSAugust dataset, the variable ‘case required care in an intensive care unit’ reduced the proportion of incomplete information to 26% of system missing and no cases were coded as unknown. However, the variable ‘date of first positive laboratory result’ still had 90% system missing in the DGSAugust dataset. Table 1 provides detailed information about missing information for each available variable.

Table 1 Data completeness (number and percentage of missing information) of each variable available in the DGSApril and DGSAugust datasets with COVID-19 cases provided by DGS

Data consistency between DGSApril and DGSAugust datasets

The consistency of the information for cases identified with the same unique case identifier in both datasets (n=16 218) was further evaluated ( figure 1 ).

Table 2 presents the number and percentage of cases with different information, for each variable.

Table 2 Number and percentage of COVID-19 cases presented in both datasets (n=16 218) with information that did not match, for each variable

Since the beginning of the pandemic, the COVID-19 report within SINAVE has kept the same data structure. A few variables related to specific symptoms progressively described in relation to COVID-19 infection (eg, anosmia or dysgeusia) were added, but these variables (symptoms) were not included in the analysed datasets (DGSApril and DGSAugust). However, some inconsistencies may be due to differences in the data format made available to researchers, and owing to the lack of metadata information related to DGSAugust it is not possible to harmonise such inconsistencies in the data analysis. Some inconsistencies may be due to updates made to the data in the meantime; however, many inconsistencies are difficult to understand because there is often information filled in the first dataset that is not filled in the updated dataset.

The variable ‘underlying conditions’ was the one showing a higher percentage of inconsistencies between both datasets, with more than half of cases showing different information when comparing the information from both datasets ( table 2 ). Most of the inconsistencies were due to the cases recorded as ‘no underlying conditions’ in the DGSApril dataset and corrected to ‘unknown if the case has underlying conditions’ or ‘missing’ in the updated dataset (DGSAugust) (42%, n=6851). There were 1952 cases (12%) recorded as ‘no underlying conditions’ in the first dataset and corrected to ‘yes—underlying conditions’ in the second one. There were also 99 (1%) cases with underlying conditions in the first dataset corrected to ‘no underlying conditions’ in the second one.

The variable 'age' also had more than half of cases showing different information when comparing the two datasets ( table 2 ). In all but one of the cases with different information, the difference was 1 year. The definition of 'age' differed between the datasets: in DGSApril it is the age at the time of COVID-19 onset and, in DGSAugust, the age at the time of COVID-19 notification.

The variable ‘hospitalisation’ had 16% of cases (n=253) with unmatched information ( table 2 ). One hundred and twenty-five cases were recorded as ‘unknown if the case was hospitalised’ in the DGSApril dataset and corrected to ‘no hospitalisation’ in the DGSAugust. Sixty-two cases were recorded as ‘no hospitalisation’ and corrected to ‘hospitalised’ or ‘unknown information’ in DGSApril and DGSAugust datasets, respectively. Fifty-five cases were recorded as hospitalised patients and corrected to ‘no hospitalisation’ or ‘unknown information’ in DGSApril and DGSAugust datasets, respectively. Only 11 cases changed from ‘unknown if the case was hospitalised’ to ‘hospitalisation’.

The variable ‘date of disease onset’ had 12% of cases (n=2008) with unmatched information ( table 2 ). In 1445 cases, information about the date of disease onset was provided only in DGSApril and 563 cases had dates in both datasets but the dates did not match.

The variable ‘date of the first positive laboratory result’ did not match in both datasets in 6% of the cases (n=962). In 5 cases, there was a date available in both datasets but the dates did not match; in 74 cases, the date was available only in the DGSApril dataset; and in 883 cases, the date was available only in the DGSAugust dataset.

The variable patient outcome (variable ‘outcome’) was not present in the DGSAugust dataset which instead presents the variables ‘date of recovery’ and ‘date of death’ (not presented in DGSApril) ( table 1 ). In the DGSApril dataset, there were 1134 cases coded as ‘alive, recovered and cured’, but only 83% of those (n=947) had recovery date in the updated dataset (DGSAugust), which may be due to the lack of information on a specific date, despite knowing that the case result is alive, recovered and cured. In fact, 177 patients recorded as ‘alive, recovered and cured’ in the DGSApril did not have any date in the DGSAugust dataset. However, 10 patients recorded as ‘alive, recovered and cured’ in the DGSApril had a date of death in the DGSAugust dataset. Seven of these were dates of death before April 19, which is incongruent. Among the 455 cases coded as ‘died because of COVID-19’ in the DGSApril dataset, 7 (2%) did not have a date of death in the second dataset.

Data entry errors in the updated dataset (DGSAugust)

The age of one patient is probably wrong (more than 130 years old). There were also male patients and elderly women registered as pregnant. There was a wrong diagnosis date (50-05-2020) and 19 patients had registered dates of diagnosis before the first official case of COVID-19 was diagnosed in Portugal. There were also two patients with a negative length of stay in hospital.

Of the 38 545 cases included in the dataset, 6772 had '3 April' recorded in the recovery date variable, 1032 had '25 May' and 242 had '26 May'. The remaining 30 499 cases had no information registered in this variable.
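Errors of this kind can be surfaced with a handful of rule-based checks. The sketch below is illustrative only: column names are hypothetical, the 120-year age cut-off is an example threshold, and the date used for Portugal's first official case is an assumption taken from the 'early March' statement above.

```python
# Minimal sketch of rule-based checks for the impossible values described above.
import pandas as pd

df = pd.read_csv("dgs_august.csv", dtype=str)

diagnosis = pd.to_datetime(df["diagnosis_date"], errors="coerce")  # '50-05-2020' -> NaT
admission = pd.to_datetime(df["admission_date"], errors="coerce")
discharge = pd.to_datetime(df["discharge_date"], errors="coerce")
age = pd.to_numeric(df["age"], errors="coerce")

checks = {
    "implausible age (over 120)": age > 120,
    "male recorded as pregnant": (df["sex"] == "M") & (df["pregnant"] == "Y"),
    "elderly woman recorded as pregnant": (age > 60) & (df["pregnant"] == "Y"),
    "unparseable diagnosis date": df["diagnosis_date"].notna() & diagnosis.isna(),
    "diagnosis before first official case": diagnosis < pd.Timestamp("2020-03-02"),
    "negative length of hospital stay": (discharge - admission).dt.days < 0,
}
for rule, flagged in checks.items():
    print(f"{rule}: {int(flagged.sum())} cases")
```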

Number of COVID-19 cases and deaths provided by DGSAugust dataset and by daily public report

Table 3 shows the number of COVID-19 cases and deaths due to COVID-19 reported in the DGSAugust dataset and in the daily public report. The DGSAugust dataset included 38 520 COVID-19 cases diagnosed between March and June, 4003 fewer cases (9%) than the daily public report provided by the Portuguese DGS. However, when looking at data from March, the DGSAugust dataset reported 669 more cases (8%) than the daily public report. In April, May and June, the DGS dataset reported 17%, 8% and 12% fewer cases, respectively, than the public report.

Table 3 Number of COVID-19 cases and deaths due to COVID-19 reported by the DGSAugust dataset and by the daily public report

The DGSAugust dataset reported 1155 deaths due to COVID-19 until the end of June, 424 fewer deaths (27%) than the daily public report provided by the Portuguese DGS. However, in March, the DGSAugust dataset reported five more deaths due to COVID-19 (3%) than the daily public report. In April, May and June, the DGS dataset reported 8%, 49% and 100% fewer deaths, respectively, than the public report.

Bias estimation

The most important problem in the first dataset is the potential underestimation of comorbidities due to the misclassification of cases with unknown information about preconditions as 'absence of precondition'. To estimate the potential systematic error identified by Costa-Santos and colleagues 18 in the study by Nogueira and colleagues, 19 who analysed the first dataset, we estimated the prevalence of each precondition with the first dataset (the values presented in Nogueira and colleagues' study 19 ) and with the second dataset (where cases with unknown information about preconditions were classified as missing information for that variable and not as 'precondition absent').

As table 4 shows, the first dataset (DGSApril) presented a bias in the prevalence estimation of almost all preconditions, probably due to the misclassification of cases with unknown information about preconditions as 'absence of precondition'. Almost all the comorbidities in DGSApril were greatly underestimated relative to the second dataset. Even in the updated dataset (DGSAugust), the prevalence of preconditions may be underestimated. For example, the estimated prevalence of asthma in the Portuguese population is 6.8% (95% CI 6.0% to 7.7%). 20 According to Quinaz Romana and colleagues, the percentage of people in the Portuguese population who have at least one precondition is 58%. 21 The population infected with COVID-19 is unlikely to have a lower prevalence of comorbidities than the general Portuguese population.

Table 4 Prevalence estimation for each precondition by DGSApril (used in Nogueira and colleagues' study 19 ) and by the updated dataset
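The mechanism behind this bias is easy to demonstrate: the same raw field yields very different prevalence estimates depending on whether 'unknown' is folded into 'absent' or excluded from the denominator. A minimal sketch (hypothetical column name and codes) follows:

```python
# Minimal sketch of the bias mechanism: treating "unknown" as "no precondition"
# (DGSApril-style coding) vs excluding it from the denominator (DGSAugust-style).
import pandas as pd

df = pd.read_csv("dgs_august.csv", dtype=str)
asthma = df["asthma"]  # values assumed to be "Y", "N", "Unk" or blank

biased = (asthma == "Y").mean()            # unknown/blank implicitly counted as "no"
known = asthma.isin(["Y", "N"])
unbiased = (asthma[known] == "Y").mean()   # denominator restricted to known status

print(f"prevalence counting unknown as absent: {biased:.1%}")
print(f"prevalence among cases with known status: {unbiased:.1%}")
```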

The production of scientific evidence to help manage the COVID-19 pandemic is an urgency worldwide. However, if the quality of datasets is low, the evidence produced may be inaccurate and, therefore, have limited applicability. This problem may be particularly critical when low-quality datasets provided by official organisations lead to the replication of biased conclusions in different studies.

The problem of using datasets with suboptimal quality for research purposes during the COVID-19 pandemic probably occurs in a large number of countries. This study, using the Portuguese surveillance data, reports a high number of inconsistencies and incompleteness of data that may interfere with scientific conclusions. To date, we could identify three scientific papers reporting analysis of these data 19 22 23 that may have been affected by the low quality of the datasets. 21 Table 5 presents data quality issues identified in the provided datasets and possible solutions.

Table 5 Most frequent data quality issues and possible solutions

The issue of ‘missing’ versus ‘absent’ variable coding seems to be present in the findings of Nogueira and colleagues’ study. 19 The reduction of the risk of death in relation with comorbidities observed in the analysis of the first dataset is underestimated if we assume that the updated dataset is the correct one. 21 In fact, these cases were registered as having no underlying conditions in the first dataset but corrected in the second dataset to ‘unknown if the case has underlying conditions’ or system missing. This problem might be due to the way these data were collected and/or were recorded in the database sent to the researchers. In the form used to collect COVID-19 surveillance data, comorbidities are recorded one by one after a general question assessing the presence of any comorbidity and the field is not mandatory. From a clinical point of view, it might be enough to register only positive data perceived as relevant (eg, the presence of a specific diagnosis, but not its absence), especially in a high-burden context as the ongoing pandemic. In the context of clinical research, however, the lack of registered comorbidity data cannot be interpreted as the absence of comorbidities. A similar bias can be found in the other two studies reporting analysis of DGSApril dataset. 22 23

Another data quality issue is related to the discrepancies in the cases included in the two datasets. In fact, only 80% of cases included in the DGSApril dataset had the same unique case identifier in the DGSAugust dataset, and only 74% of cases diagnosed until 27 April included in DGSAugust had the same unique case identifier in DGSApril. Alternatively, the unique case identifier may have been changed. We do not know whether the unique identifier is generated in each data download or recorded in the database; the latter option would be the safest. Moreover, until 19 June, it was not mandatory to fill in the national health service user number in order to have a standard unique patient identifier. That may have led to not identifying duplicate SINAVE MED entries for the same patient and increased the difficulty in adequately merging data from SINAVE LAB, SINAVE MED and other data sources.

The high percentage of incomplete data in several variables may also produce biases whose dimensions and directions are not possible to estimate. In fact, as our results showed, half of the variables available in the DGSAugust dataset had more than one-third of information missing. Furthermore, that dataset was already incomplete, since it only provides COVID-19 cases from the medical component of SINAVE, totalling 90% of the cases reported by health authorities until the end of June 2020. 17 It is unclear, however, why the updated version of the dataset reported 669 more COVID-19 cases and 5 more deaths in March than the public report (which would be expected to be more complete). Moreover, there were no reported dates of death in June in the DGSAugust dataset, despite the 155 deaths reported in the public report during this month.

The consistency of variables in different updates of datasets is also an important quality issue. In fact, our results show that the variable ‘age’ was calculated differently in the two datasets: in the DGSApril dataset it was the age at the time of COVID-19 onset and in the DGSAugust dataset it was age at the time of COVID-19 notification. Despite this change in definition, the difference of 1 year in half of the cases does not seem to be completely justified only by this fact, since the two dates should be relatively close. Still related to this problem of inconsistent information and variables, we realised that some information may have been lost in the second dataset sent (DGSAugust). In fact, the outcome of the COVID-19 case is not presented in the second dataset. DGSAugust dataset only presents the recovery and death dates. It would be possible to reconstruct in the second dataset some of the information on the outcome variable presented in the first one. However, it would only be possible to directly recode those with ‘date of recovery’ as ‘alive, recovered and cured’; all other categories (‘died of COVID-19’; ‘died of other cause’; ‘cause of death unknown’; ‘still on medical treatment’) are impossible to obtain from the dates of recovery or death. In fact, using only the variable ‘date of death’, it is not possible to determine if the patient died because of COVID-19, died of another cause or if the cause of death is unknown as in the DGSApril dataset. Moreover, 17% of cases coded as ‘alive, recovered and cured’ in the first dataset did not have the variable ‘date of recovery’ filled in the updated one. While the recovery date (when available) can be used as a proxy of the patient outcome, if this date is unknown in spite of a known recovery, we miss the whole outcome information.

In fact, in the DGSAugust dataset, it is assumed that missing information about the recovery date implies that the case had not yet recovered. Also, the 'recovery date' variable contained only three distinct dates, even though it refers to a 4-month period.

All the described errors, inconsistencies, data incompleteness, and changes in the variables' definitions and formats may lead to unreproducible methods and analyses. While it is important to start working on data analysis as fast as possible at the beginning of a pandemic, it is also crucial that the models and analyses developed with the first data are validated a posteriori and confirmed with the updated data. It is thus fundamental that subsequent datasets follow the same metadata and are preferably more complete, with fewer inconsistencies and errors.

Quality of healthcare data can be improved through several strategies. First, data entry processes must be simplified, avoiding duplications and reusing the data already in the system, since the need to input the same information in different systems is time-consuming, frustrating for the user, and can negatively impact both data completeness and accuracy. Data interoperability can also be a powerful approach to minimise the number of interactions with the system. 24 Second, data need to be constantly monitored and tracked 25 : organisations must develop processes to evaluate data patterns, and establish report systems based on data quality metrics. Even before data curation, simple validation procedures and rules in information systems can help detect and prevent many errors (ie, male patients classified as ‘pregnant’, or a patient aged 134 years old) and inconsistencies, and improve data completeness.
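As an illustration of such entry-time validation, the sketch below (hypothetical field names and rules, not the SINAVE implementation) rejects a notification at the point of data entry instead of leaving the error to be found months later during analysis:

```python
# Minimal sketch of entry-time validation rules for a notification form.
from datetime import date

def validate_notification(record: dict) -> list[str]:
    """Return a list of rule violations; an empty list means the record passes."""
    errors = []
    for field in ("case_id", "sex", "diagnosis_date"):      # mandatory fields
        if not record.get(field):
            errors.append(f"missing mandatory field: {field}")
    if record.get("sex") == "M" and record.get("pregnant") == "Y":
        errors.append("male patient recorded as pregnant")
    birth, diagnosis = record.get("birth_date"), record.get("diagnosis_date")
    if birth and diagnosis and diagnosis < birth:
        errors.append("diagnosis date before birth date")
    if diagnosis and diagnosis > date.today():
        errors.append("diagnosis date in the future")
    return errors

print(validate_notification({"case_id": "A1", "sex": "M", "pregnant": "Y",
                             "birth_date": date(1950, 5, 1),
                             "diagnosis_date": date(2020, 4, 2)}))
```

Rules like these mirror the checks described above for DGSAugust, but applied before the record is stored rather than months later.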

Finally, we need to establish the value proposition for both creators and observers. 26 This includes ensuring that healthcare providers understand the importance of data, receive feedback about their analysis and how it may improve both the assistance to the patient and the whole organisation, and have received adequate training for better performance.

The adoption of these strategies should pave the way to high-quality, accurate healthcare datasets that can generate accurate knowledge to timely inform health policies, and the readaptation of healthcare systems to new challenges.

We acknowledge that our study has some limitations. One such limitation is the lack of clarification by the data provider on the issues found in the datasets. In fact, despite repeated requests, we did not receive from DGS complete answers that could clarify the issues described in the manuscript. Therefore, the analysis of the Portuguese surveillance data quality was done exclusively through the analysis of the databases provided by DGS to researchers and with our external knowledge about how the information flows from the moment the data are introduced by health professionals until the dataset can be used for data analysis. Another limitation is the fact that we only studied the quality issues of COVID-19 data from one country, Portugal. However, our results seem to be in line with the findings of Ashofteh and Bravo, 7 who analysed and compared the quality of official datasets available for COVID-19, including data from the Chinese Center for Disease Control and Prevention, the WHO, and the European Centre for Disease Prevention and Control. In fact, they also found noticeable and increasing measurement errors in the three datasets as the pandemic outbreak expanded and more countries contributed data to the official repositories.

We describe some important quality issues of the Portuguese COVID-19 surveillance datasets, relevant enough to force the discussion about the validity of the published findings arising from these and similar data.

The availability of official data by the National Health Authorities to researchers is an enormous asset, allowing data analysis, modelling and prediction that may support better decisions for the patient and the community as a whole. However, to fully embrace this potential, it is crucial that these data are accurate and reliable.

System interoperability would be needed to allow the connection with all the different EHRs that are in use in Portugal. Most EHRs collect data using unstructured data fields that would be difficult to correctly extract to a form like the one in the National Surveillance Systems.

It is also urgent to define and implement major improvements in the processes and systems behind surveillance datasets: simplification of data entry processes, constant monitoring of data, raising awareness among healthcare providers of the importance of good data, and providing them with adequate training.

Data curation processes, capitalising on effective and multidisciplinary collaborations between healthcare providers and data analysts, play a critical role to ensure minimum quality standards. Once these processes are fully optimised, the reliability of results and the quality of the scientific evidence produced can be greatly improved.

Ethics statements

Patient consent for publication.

Not required.

Ethics approval

The data used in this work were anonymised and made available by the Portuguese Directorate-General of Health (DGS), under the scope of article 39th of the decree law 2-B/2020, from 2 April. The study was approved on 17 April 2020 by the Health Ethics Committee of Centro Hospitalar Universitário de São João and Faculty of Medicine, University of Porto (number not available).

  • Kraemer MUG ,
  • Gutierrez B , Open COVID-19 Data Curation Group
  • Yozwiak NL ,
  • Schaffner SF ,
  • German RR ,
  • Horan JM , et al
  • Santos JV ,
  • Pinto M , et al
  • Wang N , et al
  • Ashofteh A ,
  • Houareau C ,
  • Altmann D , et al
  • de Waure C , et al
  • Bouguerra H , et al
  • Ajumobi O ,
  • Bamgboye E , et al
  • Wolkewitz M ,
  • Direção Geral da Saúde
  • ↵ Carta aberta AO Conselho Nacional de Saúde Pública: Um contributo pessoal acerca dA epidemia de Covid-19, em Portugal , 2020 . Available: https://sigarra.up.pt/fmup/pt/noticias_geral.noticias_cont?p_id=F307210300/CartaAberta_COVID19_11.03.2020_.pdf [Accessed 17 Aug 2020 ].
  • Costa-Santos C ,
  • Ribeiro-Vaz I ,
  • Monteiro-Soares M
  • Nogueira PJ ,
  • de Araújo Nobre M ,
  • Costa A , et al
  • Sa-Sousa A ,
  • Morais-Almeida M ,
  • Azevedo LF , et al
  • Quinaz Romana G ,
  • Kislaya I ,
  • Salvador MR , et al
  • Peixoto R ,
  • D'Amore J ,
  • Bouhaddou O ,
  • Mitchell S , et al
  • IOM Roundtable on Value & Science-Driven Care
  • Institute of Medicine

Supplementary materials

Supplementary data.

  • Data supplement 1

AMP and JAF contributed equally.

Contributors CC-S conceptualised the study, contributed to design, analysis and interpretation of data, and drafted the manuscript. CC-S was also the guarantor and accepts full responsibility for the finished work and the conduct of the study, had access to the data, and controlled the decision to publish.

ALN, RC, PS, MM-S, AF, IR-V, PPR, AC-P, AMP and JAF made substantial contributions to the conception and design of the study and revised the draft critically for important intellectual content. TSH made substantial contributions to the analysis and interpretation of data and revised the draft critically for important intellectual content.

Funding This work was supported by National Funds through FCT - Fundação para a Ciência e a Tecnologia,I.P., within CINTESIS, R&D Unit (reference UIDB/4255/2020).

Competing interests None declared.

Provenance and peer review Not commissioned; externally peer reviewed.

Building a case for data quality: What is it and why is it important

  • Written by Ehsan Elahi
  • May 9, 2022

According to an IDC study , 30-50% of organizations encounter a gap between their data expectations and reality. A deeper look at this statistic shows that:

  • 45% of organizations see a gap in data lineage and content ,
  • 43% of organizations see a gap in data completeness and consistency ,
  • 41% of organizations see a gap in data timeliness ,
  • 31% of organizations see a gap in data discovery , and
  • 30% of organizations see a gap in data accountability and trust .

These data dimensions are commonly termed as data quality metrics – something that helps us to measure the fitness of data for its intended purpose – also known as data quality.

What is data quality?

The degree to which data satisfies the requirements of its intended purpose.

If an organization is unable to use the data for the reason it is stored and managed for, then it’s said to be of poor quality. This definition implies that data quality is subjective and it means something different for every organization, depending on how they intend to use it. For example, in some cases, data accuracy is more important than data completeness , while in other cases, the opposite may be true.

Another interesting way of describing data quality is:

The absence of intolerable defects in a dataset.

Meaning, data cannot be completely free of defects and that is fine. It just has to be free of defects that are intolerable for the purpose it is used across the organization. Usually, data quality is monitored to see that the datasets contain the needed information (in terms of attributes and entities), and that information is as accurate (or defect-free) as possible.
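As a minimal illustration, the sketch below (hypothetical dataset, column names and tolerance thresholds) scores a customer table on three common data quality metrics and flags those whose defects would be intolerable for the intended purpose:

```python
# Minimal sketch: completeness, uniqueness and validity scores vs tolerances.
import pandas as pd

df = pd.read_csv("customers.csv", dtype=str)

completeness = 1 - df["email"].isna().mean()             # share of non-missing emails
uniqueness = 1 - df["customer_id"].duplicated().mean()   # share of non-duplicate IDs
validity = df["email"].str.match(r"[^@\s]+@[^@\s]+\.[^@\s]+", na=False).mean()

report = pd.Series({"completeness": completeness,
                    "uniqueness": uniqueness,
                    "validity": validity})
tolerances = pd.Series({"completeness": 0.98, "uniqueness": 1.00, "validity": 0.95})

print(report.round(3))
print("intolerable defects:", list(report[report < tolerances].index))
```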

How to build a case for data quality?

Having delivered data solutions to Fortune 500 clients for over a decade, we usually find data professionals spending more than 50 hours a week on their job responsibilities. The added hours are a result of duplicate work, unsuccessful results, and lack of data knowledge. On further analysis, we often find data quality to be the main culprit behind most of these data issues. The absence of a centralized data quality engine that consistently validates and fixes data quality problems is costing experienced data professionals more time and effort than necessary.

When something silently eats away at your team productivity and produces unreliable results, it becomes crucial to bring it to the attention of necessary stakeholders so that corrective measures can be taken in time. These measures should also be integrated as part of the business process so that they can be exercised as a habit and not just a one-time act.

In this blog, we will cover three important points:

  • The quickest and easiest way to prove the importance of data quality.
  • A bunch of helpful resources that discuss different aspects of data quality.
  • How data quality benefits the six main pillars of an organization.

Let’s get started.

1. Design data flaw – business risk matrix

To prove the importance of data quality, you need to highlight how data quality problems increase business risks and impact business efficiency. This requires some research and discussion amongst data leaders and professionals, and then they can share the results and outcomes with necessary stakeholders.

We oftentimes encounter minor and major issues in our datasets, but we rarely evaluate them deep enough to see the kind of business impact they can have. In a recent blog, I talked about designing the data flaw – business risk matrix : a template that helps you to relate data flaws to business impact and resulting costs. In a nutshell, this template helps you to relate different types of misinformation present in your dataset to business risks.

For example, a misspelled customer name or incorrect contact information can lead to duplicate records in a dataset for the same customer. This, in turn, increases the number of inbound calls, decreases customer satisfaction, as well as impacts audit demand. These mishaps take a toll on a business in terms of increased staff time, reduced orders due to customer dissatisfaction, and increased cash flow volatility, etc.
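The sketch below illustrates that chain on made-up records: an exact comparison misses the misspelled duplicate, while a simple fuzzy comparison with Python's standard difflib flags the pair for review before it can inflate call volumes and mailing costs (the 0.85 threshold is an arbitrary example):

```python
# Minimal sketch: flagging a likely duplicate created by a misspelled name.
from difflib import SequenceMatcher
from itertools import combinations

customers = [
    (101, "Katherine O'Neill", "12 High St"),
    (102, "Katharine ONeil", "12 High Street"),   # same person, misspelled
    (103, "Marcus Webb", "7 Elm Road"),
]

def similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] between lower-cased, alphanumeric-only strings."""
    norm = lambda s: "".join(ch for ch in s.lower() if ch.isalnum())
    return SequenceMatcher(None, norm(a), norm(b)).ratio()

for (id1, name1, _), (id2, name2, _) in combinations(customers, 2):
    score = similarity(name1, name2)
    if score > 0.85:  # review threshold; tune to your tolerance for false positives
        print(f"possible duplicate: record {id1} vs record {id2} (score {score:.2f})")
```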

But if you can get this information on paper, where something as small as a misspelled customer name is traced to something as big as losing customers, it can be the first step in building a case for the importance of data quality.
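To make the template more tangible, below is a small, hypothetical sketch of a data flaw – business risk matrix expressed in code, linking each flaw to the risks it creates and a rough annual cost estimate. The flaw names, risks, and cost figures are invented for illustration; a real matrix would be filled in from your own incident history.

```python
from dataclasses import dataclass

@dataclass
class DataFlaw:
    """One row of a data flaw - business risk matrix (illustrative only)."""
    flaw: str
    affected_process: str
    business_risks: list[str]
    estimated_annual_cost: float  # assumed figures, used only to frame the discussion

flaw_risk_matrix = [
    DataFlaw(
        flaw="Misspelled customer names",
        affected_process="Customer service",
        business_risks=["Duplicate customer records", "More inbound calls",
                        "Lower customer satisfaction"],
        estimated_annual_cost=120_000,
    ),
    DataFlaw(
        flaw="Incorrect contact information",
        affected_process="Billing",
        business_risks=["Returned mail", "Delayed payments", "Cash-flow volatility"],
        estimated_annual_cost=80_000,
    ),
]

# Summarize the matrix for stakeholders: total exposure plus the risks per flaw.
total_exposure = sum(f.estimated_annual_cost for f in flaw_risk_matrix)
print(f"Estimated annual exposure from known data flaws: ${total_exposure:,.0f}")
for f in flaw_risk_matrix:
    print(f"- {f.flaw} ({f.affected_process}): {', '.join(f.business_risks)}")
```

Even rough numbers like these are usually enough to move the conversation from "the data has typos" to "the data is costing us money."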

2. Utilize helpful data quality resources

We have a bunch of content on our data quality hub that discusses data quality from different angles and perspectives. You will probably find something there that fits your needs and helps you convince your team or managers of the importance and role of data quality in any data-driven initiative.

A list of such resources is given below:

  • The impact of poor data quality: Risks, challenges, and solutions
  • Data quality measurement: When should you worry?
  • Building a data quality team: Roles and responsibilities to consider
  • 8 best practices to ensure data quality at enterprise-level
  • Data quality dimensions – 10 metrics you should be measuring
  • 5 data quality processes to know before designing a DQM framework
  • Designing a framework for data quality management
  • The definitive buyer’s guide to data quality tools
  • The definitive guide to data matching

3. Present the benefits of data quality across main pillars

In this section, we will see how end-to-end data quality testing and fixing can benefit you across the six main pillars of an organization (business, finance, customer, competition, team, and technology).

a. Business

A business uses its data as fuel across all departments and functions. Not being able to trust the authenticity and accuracy of your data can derail any data initiative. Although all business areas benefit from good data quality, the core ones include:

i. Decision making

Instead of relying on intuition and guesswork, organizations use business intelligence and analytics to make concrete decisions. Whether these decisions are made at an individual or a corporate level, data is used throughout the company to find patterns in past information so that accurate inferences can be made about the future. A lack of quality data can skew the results of your analysis, causing this approach to do more harm than good.

Read more at Improving analytics and business intelligence with clean data.

ii. Operations

Various departments such as sales, marketing, and product depend on data for the effective operation of business processes. Whether you are putting product information on your website, using prospect lists in marketing campaigns, or using sales data to calculate yearly revenue, data is part of every small and large operation. Good quality data therefore boosts the operational efficiency of your business, ensuring accurate results and leaving less room for error.

Read more at Key components that should be part of operational efficiency goals.

iii. Compliance

Data compliance standards such as GDPR, HIPAA, and CCPA are compelling businesses to revisit and revise their data management strategies. Under these standards, companies are obliged to protect the personal data of their customers and to ensure that data owners (the customers themselves) have the right to access, change, or erase their data.

Apart from these rights granted to data owners, the standards also hold companies responsible for following the principles of transparency, purpose limitation, data minimization, accuracy, storage limitation, security, and accountability. Implementing such principles in time becomes much easier with clean and reliable data. Hence, quality data can help you conform to the compliance standards that matter to your business.

Read more at The importance of data cleansing and matching for data compliance.

b. Finances

A company’s finances include an abundance of customer, employee, and vendor information, as well as the history of all transactions with these entities. Bank records, invoices, credit card details, balance sheets, and customer information are confidential data that leave no room for error. For this reason, consistent, accurate, and available data help ensure that:

  • Timely payments are made whenever due,
  • Cases of underpay and overpay are avoided,
  • Transactions to incorrect recipients are avoided,
  • The chances of fraud arising from duplicate entity records are reduced, and so on.

Read more at The impact of data matching on the world of finance.
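As one example of the kind of automated check behind the points above, the sketch below flags potential duplicate payments (same vendor, invoice number, and amount) in a transactions extract. The field names and sample rows are hypothetical assumptions, not a prescribed finance schema.

```python
import pandas as pd

# Hypothetical payments extract; field names are illustrative assumptions.
payments = pd.DataFrame({
    "vendor_id":  ["V001", "V001", "V002", "V003", "V001"],
    "invoice_no": ["INV-17", "INV-17", "INV-40", "INV-08", "INV-22"],
    "amount":     [1200.00, 1200.00, 560.50, 89.99, 310.00],
})

# Flag rows that share vendor, invoice number, and amount with another row --
# a common symptom of duplicate entity records that can lead to double payment.
dupes = payments[payments.duplicated(
    subset=["vendor_id", "invoice_no", "amount"], keep=False
)]

print(f"{len(dupes)} potentially duplicated payment rows:")
print(dupes.sort_values(["vendor_id", "invoice_no"]))
```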

c. Customer

In this era, customers seek personalization. The only way to convince them to buy from you and not a competitor is to offer them an experience that is special to them. Make them feel they are seen, heard, and understood. To achieve this, businesses use a ton of customer-generated data to understand their behavior and preferences. If this data has serious defects, you will obviously end up inferring wrong details about your customers or potential buyers. This can lead to reduced customer satisfaction and brand loyalty.

Having quality data, on the other hand, increases the probability of discovering relevant buyers and leads: people who are genuinely interested in doing business with you. Allowing poor-quality data into your datasets, by contrast, adds noise and can make you lose sight of the potential leads out there in the market.

Read more at Your complete guide to obtaining a 360 customer view.

d. Competition

Good data quality can help you to identify potential opportunities in the market for cross-selling and upselling. Similarly, accurate market data and understanding can help you effectively strategize your brand and product according to market needs.

If your competitors leverage quality data to infer trends about market growth and consumer behavior, they will leave you behind and convert potential customers faster. On the other hand, if incorrect data is used for such analysis, your business can be misled into making inaccurate decisions, costing you time, money, and resources.

Read more at How you can leverage your data as a competitive advantage?

e. Team

Managing data and its quality is the core responsibility of the data team, but almost everyone reaps the benefits of clean and accurate data. With good quality data, your team does not have to correct data quality issues every time before using the data in their routine tasks. Because people no longer waste time on rework caused by errors and gaps in datasets, the team’s productivity and efficiency improve, and they can focus their efforts on the task at hand.

Read more at Building a data quality team: Roles and responsibilities to consider.

f. Technology

Data quality can be a deal-breaker while digitizing any aspect of your organization through technology. It is quite easy to digitize a process when the data involved is structured, organized, and meaningful. On the other hand, bad or poor data quality can be the biggest roadblock in process automation and technology adoption in most companies.

Whether you are rolling out a new CRM or business intelligence platform, or automating marketing campaigns, you will not get the expected results if the data contains errors and is not standardized. To get the most out of your applications and databases, their content must conform to acceptable data quality standards.

Read more at The definitive buyer’s guide to data quality tools.

And there you have it: a whole lot of information that can help you build a case for data quality in front of stakeholders or line managers. This piece presents the benefits of data quality a little differently. Instead of highlighting six or ten areas that can be improved with quality data, I wanted to draw attention to a more crucial point: data quality impacts the main pillars of your business across many different dimensions.

Business leaders need to realize that having and using data is not even half the game. The ability to trust and rely on that data to produce consistent and accurate results is the main concern now. For this reason, companies often adopt stand-alone data quality tools for cleaning and standardizing their datasets so that the data can be trusted and used whenever and wherever needed.


Case Study: Using Data Quality and Data Management to Improve Patient Care

Mismatched patient data is the third leading cause of preventable death in the United States, according to healthIT.gov, and a 2016 survey by the Ponemon Institute revealed that 86 percent of all healthcare practitioners know of an error caused by incorrect patient data. Patient misidentification is also responsible for 35 percent of denied insurance claims, […]


Melanie Mecca, Director of Data Management Products & Services for CMMI Institute, calls this situation “a classic Master Data and Data Quality problem.” A multitude of different vendors is one of the causes, she said, but “there’s really no standard at all for this data.”

The Health and Human Services Office of the National Coordinator (HHS-ONC) wants to make it safer for patients needing health care by improving those numbers.

“They’re trying to lower the number of duplicates and overlays in the patient identification data – the demographic data – so that they can have fewer instances of record confusion and ensure that records can be matched with patients as close as possible to a hundred percent,” she said.

In the article Improving Patient Data Quality, Part 1: Introduction to the PDDQ Framework, Mecca remarked that “duplicate patient records are a symptom of a deeper and more pervasive issue – the lack of industry-wide adoption of fundamental Data Management practices.” Sources for this case study also include a presentation by Mecca and Jim Halcomb, Strategy Consultant at CMMI, as well as the Patient Demographic Data Quality (PDDQ) Framework, v.7.

The Challenge

Government sources (and CMMI) estimate that the average hospital has 8-12 percent duplicate records and that as many as 10 percent of incoming patients are misidentified. Sharing patient data across disparate providers increases the likelihood of duplicates during health information exchanges because of defects in Master Patient Indexes.

Preventable medical errors may include:

  • Misdiagnosis and incorrect treatment procedures
  • Incorrect or wrong dose of medication
  • Incorrect blood type
  • Allergies not known
  • Repeated diagnostic tests

In addition, inaccurate and duplicate records can increase the risk of lawsuits, and can cause claims to be rejected. The cost to correct a duplicate patient record is estimated at $1000.
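To put these figures together, a quick back-of-the-envelope calculation shows how duplicate rates translate into remediation cost. The total record count used here is an assumed input for illustration; the duplicate rates and per-record correction cost come from the estimates quoted above.

```python
# Rough cost-of-duplicates estimate using the figures quoted above.
# total_records is a hypothetical assumption for illustration.
total_records = 500_000
cost_per_correction = 1_000          # dollars per duplicate record, as cited above

for duplicate_rate in (0.08, 0.12):  # 8-12 percent duplicate records
    duplicates = int(total_records * duplicate_rate)
    cost = duplicates * cost_per_correction
    print(f"{duplicate_rate:.0%} duplicates -> {duplicates:,} records, "
          f"~${cost:,} to correct")
```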

Previous attempts have been made to address these issues using algorithms that search for data fragments, but due to a lack of standardized practices:

“Algorithms alone have failed to provide a sustainable solution. Patient record-matching algorithms are necessary, but they are reactive, and do not address the root cause, which is the lack of industry-wide standards for capturing, storing, maintaining, and transferring patient data,” she said.
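For readers unfamiliar with what a record-matching algorithm does, here is a minimal, hypothetical sketch of weighted field-by-field comparison of demographic attributes. It is not the method used by any vendor or framework discussed here; the fields, weights, and threshold are illustrative assumptions.

```python
from difflib import SequenceMatcher

# Illustrative demographic records; field names, weights, and threshold are assumptions.
FIELD_WEIGHTS = {"last_name": 0.35, "first_name": 0.25, "dob": 0.3, "zip": 0.1}
MATCH_THRESHOLD = 0.85  # assumed cut-off for flagging a likely duplicate

def similarity(a: str, b: str) -> float:
    """Normalized string similarity in [0, 1]."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()

def match_score(rec_a: dict, rec_b: dict) -> float:
    """Weighted similarity across the demographic fields."""
    return sum(w * similarity(rec_a[f], rec_b[f]) for f, w in FIELD_WEIGHTS.items())

patient_a = {"last_name": "Smith", "first_name": "Jon", "dob": "1980-02-11", "zip": "97201"}
patient_b = {"last_name": "Smyth", "first_name": "John", "dob": "1980-02-11", "zip": "97201"}

score = match_score(patient_a, patient_b)
print(f"match score = {score:.2f}; likely duplicate: {score >= MATCH_THRESHOLD}")
```

The point of the quote above is that matching of this kind is reactive: it finds likely duplicates after the fact but does not fix the inconsistent capture practices that create them.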


Finding a Solution

According to the HHS-ONC website, the Office of the National Coordinator for Health Information Technology (ONC) is located within the U.S. Department of Health and Human Services (HHS). The ONC:

“Serves as a resource to the entire health system, promoting nationwide health information exchange to improve health care. ONC is the principal federal entity charged with coordination of nationwide efforts to implement and use the most advanced health information technology and the electronic exchange of health information.”

In line with this mission, HHS-ONC decided to craft a solution to the patient data problem with built-in participation and support across all areas of health care. To this end, ONC assembled a community of practice that included 25 organizations ranging from health Data Management associations such as AHIMA (American Health Information Management Association), to government offices like OSHA (Occupational Safety and Health Administration), and large health care organizations such as Kaiser Permanente, as well as other health care providers.

This community was charged with finding a set of standards and practices that could be used to evaluate existing patient Data Management processes, and a comprehensive tool for bringing organizations into compliance with those standards. “What they were looking for was a Data Management Framework that was complementary to what they were trying to accomplish,” said Mecca.

They chose CMMI’s Data Management Maturity (DMM)℠ Model as the best approximation of what they were looking to accomplish. “The DMM’s fact-based approach, enterprise focus, and built-in path for capability growth aligned exactly with the healthcare industry’s need for a comprehensive standard,” she said.

Developing a Tool

CMMI, as a sub-contractor with health information technology company Audacious Inquiry, then used the Data Management Maturity model to determine which practices were essential “specifically for patient demographic Data Quality,” Mecca said. Out of that process came the Patient Demographic Data Quality Framework (PDDQ). The PDDQ offered the HHS-ONC a health care-focused, “sustainable solution for building proactive, defined processes that lead to improved and sustained Data Quality.”

The Patient Demographic Data Quality (PDDQ) Framework: The Solution

The PDDQ Framework:

“Allows organizations to evaluate themselves against key questions designed to foster collaborative discussion and consensus among all involved stakeholders. Its content reflects the typical path that most organizations follow when building proactive, defined processes to influence positive behavioral changes in the management of patient demographic data.”

The PDDQ is composed of 76 questions, organized into five categories with three to five process areas each, representing the broad topics a health care organization needs to examine to understand its current practices and determine which activities need to be established, enhanced, and followed.

The questions are supported by contextual information specific to health care providers.

“Data Governance is highly accented, as is the Business Glossary – the business terms used in registration, and terms that providers, and claims, and billing have to agree on, like patient status,” Mecca said.

Examples include illustrative scenarios, such as how a patient name should be entered, or what to enter if a patient has three middle names. The questions and supporting context are intended to serve as an “encouraging and helpful mechanism for discovery.” While the framework encourages good Data Management,

“It does not prescribe how an organization should achieve these capabilities. It should be used by organizations both to assess their current state of capabilities, and as input to a customized roadmap for data management implementation.”

One of the features of the PDDQ is its flexibility to address Data Quality in a variety of environments. It is designed for any organization creating, managing or aggregating patient data, such as hospitals, health systems, Health Information Exchange (HIE) vendors, Master Data Management (MDM) solution vendors, and Electronic Health Record (EHR) vendors. “An organization can implement any combination of categories or process areas, and obtain baseline conclusions about their capabilities,” she said.

The organization can focus on a single process area, a set of process areas, a category, a set of categories, or any combination up to and including the entire PDDQ Framework. This allows flexible application to meet specific organizational needs and to accommodate resource and time constraints.

Using the PDDQ, organizations can quickly assess their current state of Data Management practices, discover gaps, and formulate actionable plans and initiatives to improve management of the organization’s data assets across functional, departmental, and geographic boundaries.

“The PDDQ Framework is designed to serve as both a proven yardstick against which progress can be measured as well as an accelerator for an organization-wide approach to improving Data Quality. Its key questions stimulate knowledge sharing, surface issues, and provide an outline of what the organization should be doing next to more effectively manage this critical data.”
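As a rough illustration of how an organization might tally its answers during such a self-assessment, the sketch below scores a small, invented subset of questions by category and process area and flags gaps. The scoring scale and sample answers are assumptions made for illustration; the actual PDDQ defines its own questions and ships with an evaluation scoring tool (linked at the end of this article).

```python
from collections import defaultdict

# Hypothetical self-assessment extract: (category, process area, question, answer).
# Answers use an assumed 0-2 scale: 0 = not performed, 1 = partial, 2 = performed.
answers = [
    ("Data Governance", "Governance Management",
     "Is ownership of patient demographic data assigned?", 2),
    ("Data Governance", "Business Glossary",
     "Are key terms such as 'patient status' defined and agreed?", 1),
    ("Data Quality", "Data Profiling",
     "Is demographic data profiled for defects on a schedule?", 0),
    ("Data Quality", "Data Quality Assessment",
     "Are duplicate-record rates measured and tracked?", 1),
]

# Group answers by (category, process area) and average them.
scores = defaultdict(list)
for category, area, _question, answer in answers:
    scores[(category, area)].append(answer)

print("Process-area scores (0-2 scale) and gaps:")
for (category, area), vals in scores.items():
    avg = sum(vals) / len(vals)
    flag = "  <-- gap" if avg < 1 else ""
    print(f"{category} / {area}: {avg:.1f}{flag}")
```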

The PDDQ assessment can deliver actionable results within three weeks, leading directly to the implementation phase. For HHS-ONC, Kaiser ran pilots in Oregon, going on site to profile and cleanse patient records. “During this effort, they used several of the process areas that we wrote for the PDDQ Framework, and they applied them to the organizations.”

A month later, when Kaiser checked with the pilot sites, all had made improvements in the way they were managing that data, she said. “And it showed because the matching algorithms had lower incidence of duplicates.”

According to the presentation by Mecca and Halcomb, use of the PDDQ leads to decreased operational risk through improvements to the quality of patient demographic data. Specifically, patient safety is protected and quality in the delivery of patient care improves due to:

  • Increased operational efficiency, requiring less manual effort to fix data issues, fewer duplicate test orders for patients, and adoption of standard data representations.
  • Improved interoperability and data integration through adopting data standards and data management practices that are followed by staff across the patient lifecycle.
  • Improved staff productivity by expending fewer hours on detecting and remediating data defects to perform their tasks.
  • Increased staff awareness for contributing to and following processes that improve patient identity integrity.

Patient data is a common thread throughout the health care system, she said. Capturing or modifying patient data differently in different places magnifies the potential for duplication. The PDDQ helps uncover unexamined processes around patient data, so health care organizations don’t have to guess about how they’re managing it. It clearly identifies gaps, creates awareness of individual responsibility for the quality of patient data, engenders cooperation and participation, and sets a baseline for monitoring progress.

“Once gaps and strengths have been identified, organizations can quickly establish timelines for new capabilities and objectives,” she said.

Steps Moving Forward

According to Mecca, managing data is “first and foremost a people problem, not a system problem.” No one individual knows everything about the patient data. Adoption of consistent data standards industry-wide would increase interoperability and minimize duplicates. The PDDQ provides organizational guidance and “an embedded path for successive improvements along with a concentrated education for everyone dealing with patient demographic data,” she said.

“ Health care Data Management consultants can employ the PDDQ for their client organizations as a powerful tool to quickly identify gaps, leverage accomplishments, focus priorities, and develop an improvement roadmap with the confidence that all factors have been examined and that consensus has been reached.”

Access the PDDQ

  • The PDDQ and evaluation scoring tool are available at the following location: https://www.healthit.gov/playbook/pddq-framework/
  • A condensed version of the PDDQ, the Ambulatory Guide, contains a core set of questions aimed at very small health care practices, to help them get started in improving Data Quality.  It is available at: https://www.healthit.gov/playbook/ambulatory-guide/


Strategies for Master Data Management: A Case Study of an International Hearing Healthcare Company

  • Published: 03 October 2022
  • Volume 25, pages 1903–1923 (2023)


  • Anders Haug (ORCID: orcid.org/0000-0001-6173-6925)¹
  • Aleksandra Magdalena Staskiewicz²
  • Lars Hvam²


Data quality (DQ) issues consume a significant part of many companies’ administrative and operational costs. To reduce such costs, companies often engage in master data management (MDM) initiatives to improve their DQ. MDM is, however, not a silver bullet that solves all data quality issues, but typically there are trade-offs between different data management strategies. Such trade-offs have not received much attention in the academic literature, and conceptualizations of the strategies for implementing MDM are sparse. Thus, with the aim of conceptualizing MDM strategies and understanding their consequences, this paper conducts a longitudinal case study at a large international hearing healthcare company in which an MDM initiative was implemented. The analysis identifies consequences of centralized and decentralized MDM approaches, on which basis four distinct strategies for data management are defined.



Author information

Authors and Affiliations

Department of Entrepreneurship and Relationship Management, University of Southern Denmark, Universitetsparken 1, 6000, Kolding, Denmark

Anders Haug

Department of Management Engineering, Technical University of Denmark, Akademivej 358, 2800, Kgs. Lyngby, Denmark

Aleksandra Magdalena Staskiewicz & Lars Hvam


Corresponding author

Correspondence to Anders Haug.

Ethics declarations

Conflict of interest.

The authors declare that they have no conflict of interest.


Appendix A: Interview guide (Phase II)

1. Please mention the master data quality improvements and problems produced by the MDM project in relation to the data that you are working with.

[For each data quality improvement, ask the respondent]

2. Which data does the data quality improvement concern?

3. Which initiatives produced the data quality improvement?

4. What were the consequences of the data quality improvement?

[For each data quality problem, ask the respondent]

5. Which data does the data quality problem concern?

6. Which initiatives produced the data quality problem?

7. What were the consequences of the data quality problem?

[When the respondent cannot come up with more problems or improvements, ask the following questions. For each improvement/problem mentioned, go to question 2 or 5]

8. Can you recall other master data quality improvements or problems regarding the accessibility of data? [problems figuring out who holds some needed data; problems figuring out which system holds needed data; problems getting access to the systems that hold data; problems getting people to provide data]

9. Can you recall other master data quality improvements or problems regarding the representation of data? [redundant data; data described in an unclear (hard to understand) manner; different terms used to describe the same data; different data formats used for the same data]

10. Can you recall other master data quality improvements or problems regarding the usefulness of data? [untimely (i.e., outdated) data; irrelevant data]

11. Can you recall other master data quality improvements or problems regarding the intrinsic quality of data? [incomplete data; inaccurate/incorrect data]

12. Can you recall other types of master data quality improvements or problems produced by the MDM project?


About this article

Haug, A., Staskiewicz, A.M. & Hvam, L. Strategies for Master Data Management: A Case Study of an International Hearing Healthcare Company. Inf Syst Front 25, 1903–1923 (2023). https://doi.org/10.1007/s10796-022-10323-z


Accepted: 08 August 2022

Published: 03 October 2022

Issue Date: October 2023

DOI: https://doi.org/10.1007/s10796-022-10323-z


Keywords

  • Master data management
  • Data governance
  • Data management
  • Data quality
  • Centralization-decentralization


Big Data Quality Case Study Preliminary Findings


A set of four case studies related to data quality in the context of the management and use of Big Data is being performed; each is reported separately, and the set will also be compiled into a summary overview report. The report herein documents one of those four case studies.

The purpose of this document is to present information about the various data quality issues related to the design, implementation, and operation of a specific data initiative, the U.S. Army's Medical Command (MEDCOM) Medical Operational Data System (MODS) project. While MODS is not currently a Big Data initiative, potential future Big Data requirements under consideration (in the areas of geospatial data, document and records data, and textual data) could easily move MODS into the realm of Big Data. Each of these areas has its own data quality issues that must be considered. By better understanding the data quality issues in these Big Data growth areas, we hope to explore how the nature and type of Big Data quality problems differ from what is typically experienced in traditionally sized datasets. This understanding should facilitate the acquisition of the MODS data warehouse through improvements in the requirements and downstream design efforts. It should also enable the crafting of better strategies and tools for profiling, measurement, assessment, and action processing of Big Data quality problems.
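Since the passage above closes on profiling, measurement, and assessment, the sketch below shows what a very basic column-level profile could look like on a small tabular extract: missing-value share, distinct counts, and unparseable dates. The columns and sample rows are hypothetical assumptions, and a genuine Big Data profile would run on a distributed engine rather than in-memory pandas.

```python
import pandas as pd

# Hypothetical tabular extract; in a Big Data setting the same profile would be
# computed with a distributed engine (e.g., Spark) rather than in-memory pandas.
records = pd.DataFrame({
    "soldier_id": ["A1", "A2", "A2", "A4", None],
    "unit":       ["1BN", "1BN", "1bn", "3BN", "3BN"],
    "visit_date": ["2015-01-03", "2015-02-30", "2015-02-10", None, "2015-03-07"],
})

def profile(df: pd.DataFrame) -> pd.DataFrame:
    """Column-level profile: missing share, distinct count, and date parse failures."""
    rows = []
    for col in df.columns:
        s = df[col]
        row = {
            "column": col,
            "missing_pct": round(s.isna().mean() * 100, 1),
            "distinct": s.nunique(dropna=True),
        }
        if "date" in col:
            parsed = pd.to_datetime(s, errors="coerce")
            # Values that were present but could not be parsed as dates.
            row["unparseable_dates"] = int(parsed.isna().sum() - s.isna().sum())
        rows.append(row)
    return pd.DataFrame(rows)

print(profile(records))
```

A profile like this is the measurement step only; deciding what to do with an invalid date or an inconsistent unit code is the "action processing" the report refers to.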


The Effects of Climate Change

The effects of human-caused global warming are happening now, are irreversible for people alive today, and will worsen as long as humans add greenhouse gases to the atmosphere.


  • We already see effects scientists predicted, such as the loss of sea ice, melting glaciers and ice sheets, sea level rise, and more intense heat waves.
  • Scientists predict global temperature increases from human-made greenhouse gases will continue. Severe weather damage will also increase and intensify.

Earth Will Continue to Warm and the Effects Will Be Profound


Global climate change is not a future problem. Changes to Earth’s climate driven by increased human emissions of heat-trapping greenhouse gases are already having widespread effects on the environment: glaciers and ice sheets are shrinking, river and lake ice is breaking up earlier, plant and animal geographic ranges are shifting, and plants and trees are blooming sooner.

Effects that scientists had long predicted would result from global climate change are now occurring, such as sea ice loss, accelerated sea level rise, and longer, more intense heat waves.

The magnitude and rate of climate change and associated risks depend strongly on near-term mitigation and adaptation actions, and projected adverse impacts and related losses and damages escalate with every increment of global warming.


Intergovernmental Panel on Climate Change

Some changes (such as droughts, wildfires, and extreme rainfall) are happening faster than scientists previously assessed. In fact, according to the Intergovernmental Panel on Climate Change (IPCC) — the United Nations body established to assess the science related to climate change — modern humans have never before seen the observed changes in our global climate, and some of these changes are irreversible over the next hundreds to thousands of years.

Scientists have high confidence that global temperatures will continue to rise for many decades, mainly due to greenhouse gases produced by human activities.

The IPCC’s Sixth Assessment report, published in 2021, found that human emissions of heat-trapping gases have already warmed the climate by nearly 2 degrees Fahrenheit (1.1 degrees Celsius) since 1850-1900 [1]. The global average temperature is expected to reach or exceed 1.5 degrees C (about 3 degrees F) within the next few decades. These changes will affect all regions of Earth.

The severity of effects caused by climate change will depend on the path of future human activities. More greenhouse gas emissions will lead to more climate extremes and widespread damaging effects across our planet. However, those future effects depend on the total amount of carbon dioxide we emit. So, if we can reduce emissions, we may avoid some of the worst effects.

The scientific evidence is unequivocal: climate change is a threat to human wellbeing and the health of the planet. Any further delay in concerted global action will miss the brief, rapidly closing window to secure a liveable future.

Here are some of the expected effects of global climate change on the United States, according to the Third and Fourth National Climate Assessment Reports:

Future effects of global climate change in the United States:


U.S. Sea Level Likely to Rise 1 to 6.6 Feet by 2100

Global sea level has risen about 8 inches (0.2 meters) since reliable record-keeping began in 1880. By 2100, scientists project that it will rise at least another foot (0.3 meters), but possibly as high as 6.6 feet (2 meters) in a high-emissions scenario. Sea level is rising because of added water from melting land ice and the expansion of seawater as it warms. Image credit: Creative Commons Attribution-Share Alike 4.0


Climate Changes Will Continue Through This Century and Beyond

Global climate is projected to continue warming over this century and beyond. Image credit: Khagani Hasanov, Creative Commons Attribution-Share Alike 3.0


Hurricanes Will Become Stronger and More Intense

Scientists project that hurricane-associated storm intensity and rainfall rates will increase as the climate continues to warm. Image credit: NASA


More Droughts and Heat Waves

Droughts in the Southwest and heat waves (periods of abnormally hot weather lasting days to weeks) are projected to become more intense, and cold waves less intense and less frequent. Image credit: NOAA


Longer Wildfire Season

Warming temperatures have extended and intensified wildfire season in the West, where long-term drought in the region has heightened the risk of fires. Scientists estimate that human-caused climate change has already doubled the area of forest burned in recent decades. By around 2050, the amount of land consumed by wildfires in Western states is projected to further increase by two to six times. Even in traditionally rainy regions like the Southeast, wildfires are projected to increase by about 30%.

Changes in Precipitation Patterns

Climate change is having an uneven effect on precipitation (rain and snow) in the United States, with some locations experiencing increased precipitation and flooding, while others suffer from drought. On average, more winter and spring precipitation is projected for the northern United States, and less for the Southwest, over this century. Image credit: Marvin Nauman/FEMA


Frost-Free Season (and Growing Season) will Lengthen

The length of the frost-free season, and the corresponding growing season, has been increasing since the 1980s, with the largest increases occurring in the western United States. Across the United States, the growing season is projected to continue to lengthen, which will affect ecosystems and agriculture.


Global Temperatures Will Continue to Rise

Summer of 2023 was Earth's hottest summer on record, 0.41 degrees Fahrenheit (F) (0.23 degrees Celsius (C)) warmer than any other summer in NASA’s record and 2.1 degrees F (1.2 C) warmer than the average summer between 1951 and 1980. Image credit: NASA


Arctic Is Very Likely to Become Ice-Free

Sea ice cover in the Arctic Ocean is expected to continue decreasing, and the Arctic Ocean will very likely become essentially ice-free in late summer if current projections hold. This change is expected to occur before mid-century.

U.S. Regional Effects

Climate change is bringing different types of challenges to each region of the country. Some of the current and future impacts are summarized below. These findings are from the Third [3] and Fourth [4] National Climate Assessment Reports, released by the U.S. Global Change Research Program.

  • Northeast. Heat waves, heavy downpours, and sea level rise pose increasing challenges to many aspects of life in the Northeast. Infrastructure, agriculture, fisheries, and ecosystems will be increasingly compromised. Farmers can explore new crop options, but these adaptations are not cost- or risk-free. Moreover, adaptive capacity, which varies throughout the region, could be overwhelmed by a changing climate. Many states and cities are beginning to incorporate climate change into their planning.
  • Northwest. Changes in the timing of peak flows in rivers and streams are reducing water supplies and worsening competing demands for water. Sea level rise, erosion, flooding, risks to infrastructure, and increasing ocean acidity pose major threats. Increasing wildfire incidence and severity, heat waves, insect outbreaks, and tree diseases are causing widespread forest die-off.
  • Southeast. Sea level rise poses widespread and continuing threats to the region’s economy and environment. Extreme heat will affect health, energy, agriculture, and more. Decreased water availability will have economic and environmental impacts.
  • Midwest. Extreme heat, heavy downpours, and flooding will affect infrastructure, health, agriculture, forestry, transportation, air and water quality, and more. Climate change will also worsen a range of risks to the Great Lakes.
  • Southwest. Climate change has caused increased heat, drought, and insect outbreaks. In turn, these changes have made wildfires more numerous and severe. The warming climate has also caused a decline in water supplies, reduced agricultural yields, and triggered heat-related health impacts in cities. In coastal areas, flooding and erosion are additional concerns.

1. IPCC 2021, Climate Change 2021: The Physical Science Basis , the Working Group I contribution to the Sixth Assessment Report, Cambridge University Press, Cambridge, UK.

2. IPCC, 2013: Summary for Policymakers. In: Climate Change 2013: The Physical Science Basis. Contribution of Working Group I to the Fifth Assessment Report of the Intergovernmental Panel on Climate Change [Stocker, T.F., D. Qin, G.-K. Plattner, M. Tignor, S.K. Allen, J. Boschung, A. Nauels, Y. Xia, V. Bex and P.M. Midgley (eds.)]. Cambridge University Press, Cambridge, United Kingdom and New York, NY, USA.

3. USGCRP 2014, Third Climate Assessment .

4. USGCRP 2017, Fourth Climate Assessment .

Related Resources


A Degree of Difference

So, the Earth's average temperature has increased about 2 degrees Fahrenheit during the 20th century. What's the big deal?


What’s the difference between climate change and global warming?

“Global warming” refers to the long-term warming of the planet. “Climate change” encompasses global warming, but refers to the broader range of changes that are happening to our planet, including rising sea levels; shrinking mountain glaciers; accelerating ice melt in Greenland, Antarctica and the Arctic; and shifts in flower/plant blooming times.


Is it too late to prevent climate change?

Humans have caused major climate changes to happen already, and we have set in motion more changes still. However, if we stopped emitting greenhouse gases today, the rise in global temperatures would begin to flatten within a few years. Temperatures would then plateau but remain well-elevated for many, many centuries.



COMMENTS

  1. Data Quality Case Studies: How We Saved Clients Real Money Thanks to

    Machine learning models grow more powerful every week, but the earliest models and the most recent state-of-the-art models share the exact same dependency: data quality. The maxim "garbage in - garbage out" coined decades ago, continues to apply today. Recent examples of data verification shortcomings abound, including JP Morgan/Chase's 2013 fiasco and this lovely list of Excel snafus ...

  2. PDF A Framework for Data Quality: Case Studies October 2023

    A Framework for Data Quality: Case Studies . October 2023 . Recommended citation: Mirel LB, Singpurwalla D, Hoppe T, Liliedahl E, Schmitt R, Weber J. ... All data have quality problems, whether the data come from surveys, administrative records, sensors, or a blend of multiple sources. Our challenge as creators and users of data is to minimize ...

  3. Maintaining Data Quality from Multiple Sources Case Study

    Learn how University of Texas Southwestern Medical Center improved data quality and efficiency by using technology, Agile methodology and user stories. The case study covers challenges such as error detection, data integration and data analysis.

  4. Data Governance and Data Quality Use Cases

    Although Data Quality and Data Governance are often used interchangeably, they are very different, both in theory and practice. While Data Quality Management at an enterprise happens both at the front (incoming data pipelines) and back ends (databases, servers), the whole process is defined, structured, and implemented through a well-designed framework.

  5. Data Quality Roadmap. Part II: Case Studies

    Airbnb's case study. This is a case study for Airbnb, compiled based on public information and made by authors of the roadmap. The roadmap is based on their description of data quality (part 1 ...

  6. Data Quality Analysis and Improvement: A Case Study of a Bus ...

    Due to the rapid development of the mobile Internet and the Internet of Things, the volume of generated data keeps growing. The topic of data quality has gained increasing attention recently. Numerous studies have explored various data quality (DQ) problems across several fields, with corresponding effective data-cleaning strategies being researched. This paper begins with a comprehensive and ...

  7. PDF Data Quality Issues in 10 Studies

    CASE STUDY 1 26% Of data quality issues found across 10 studies had potential to delay drug approval Prevent Data Quality Issues That Derail Drug Approvals The Challenge Nearly 50 percent of new molecular entities (NME) submissions fail their first FDA approval, and 32 percent of these failures are attributed to data quality, data

  8. Automated detection of poor-quality data: case studies in healthcare

    Scientific Reports - Automated detection of poor-quality data: case studies in healthcare. ... UDC was validated across two problem sets, (1) cats vs. dogs, and (2) vehicles, or binary and multi ...

  9. (PDF) Data Quality Information and Decision Making: A Healthcare Case Study

    This case study addresses the development of a data quality evaluation framework for the NZ health sector. It discusses a data quality strategy that underpins the application of the framework and defines a vision for data quality management in the health sector. It discusses how the framework and strategy combine to increase intelligence density.

  10. How Can Interactive Process Discovery Address Data Quality Issues in

    The second step, i.e., the case study, enables a broader and more generalised analysis of how IPD can address data quality issues in complex real-life contexts, like healthcare. This combined approach can be particularly useful when evaluating the suitability of BPM technologies and techniques for specific projects [61], [62].

  11. COVID-19 surveillance data quality issues: a national consecutive case

    Objectives: High-quality data are crucial for guiding decision-making and practising evidence-based healthcare, especially if previous knowledge is lacking. Nevertheless, data quality frailties have been exposed worldwide during the current COVID-19 pandemic. Focusing on a major Portuguese epidemiological surveillance dataset, our study aims to assess COVID-19 data quality issues and suggest ...

  12. Assessing data quality in open data: A case study

    One of the main theories in the data quality area is Total Data Quality Management (TDQM) (for an overview of existing methodologies see Batini et al., 2009). According to TDQM, data quality has many ...

  13. (PDF) Big Data Quality Case Study Preliminary Findings--U.S. Army ...

    Case Study Findings: This section presents findings about data quality for MEDCOM MODS Big Data. Finding 1: Data warehouses should not cleanse data (Reference Section 7.1). Because the data warehouse does not typically "own" the data, it should not change the data (a narrow definition of the phrase "to cleanse it").

  14. Drilling data quality improvement and information extraction with case

    The data quality issues have been identified, improvement approaches have been investigated, and the results have then been analysed to verify the enhancement of data quality. Although one case study that utilizes laboratory data may not directly reflect the data quality situation of a standard rig operating in the field (due to the involvement of ...

  15. Building a case for data quality: What is it and why is it important

    According to an IDC study, 30-50% of organizations encounter a gap between their data expectations and reality. A deeper look at this statistic shows that: 45% of organizations see a gap in data lineage and content; 43% of organizations see a gap in data completeness and consistency; 41% of organizations see a gap in data timeliness; 31% of organizations see a gap in data discovery; and ...

  16. (PDF) Improving Data Quality in Practice: A Case Study in the Italian

    Problems with data quality tend to fall into two categories. The first category is related to inconsistency among systems such as format, syntax and semantic inconsistencies. The second category ...

  17. Data Quality Case Study

    Improving Data Warehouse and Business Information Quality. New York: Wiley. Johns, M.L. 1997. Information Management for Health Professions, The Health Information Management Series. Albany, NY: Delmar. Orr, K. 1998. "Data Quality and Systems." Communications of the ACM 41(2):66-71.

  18. Case Study: Using Data Quality and Data Management to ...

    Patient misidentification is also responsible for 35 percent of denied insurance claims, costing hospitals up to $1.2 million annually. Melanie Mecca, Director of Data Management Products & Services for CMMI Institute, calls this situation "a classic Master Data and Data Quality problem." A multitude of different vendors is one of the ...

  19. (PDF) Data Quality Challenge Case Study

    It also reveals important opportunities that can be further exploited by trading partners to enhance their processes for data management and data quality, so that GDS could move from being a technical project to being a business project. Carrefour also wanted to have fewer problems with inaccurate data.

  20. Strategies for Master Data Management: A Case Study of an ...

    Data quality (DQ) issues consume a significant part of many companies' administrative and operational costs. To reduce such costs, companies often engage in master data management (MDM) initiatives to improve their DQ. MDM is, however, not a silver bullet that solves all data quality issues, but typically there are trade-offs between different data management strategies. Such trade-offs have ...

  21. Big Data Quality Case Study Preliminary Findings

    Big Data Quality Case Study Preliminary Findings. Oct 1, 2013. By David Becker, Patricia King, William McMullen, Lisa Lalis, David Bloom, Dr. Ali Obaidi, Donna Fickett. A set of four case studies related to data quality in the context of the management and use of Big Data are being performed and reported separately; these will also be ...

  22. (PDF) Data Quality

    ... illustrates data quality issues in real-life examples, focusing on the health sciences, to give ... measures in digital questionnaires (see Section 3.2 for a case study) ...

  23. (PDF) Data Quality—Concepts and Problems

    ... to identify the possible sources of data quality problems and the starting points for data quality assurance ... (see Section 3.2 for a case study). However, the individual activities must be coordinated with each other in comprehensive data quality management.
