Data Quality Case Studies: How We Saved Clients Real Money Thanks to Data Validation

MichalFracek

  • April 7, 2019 at 12:21 am

Machine learning models grow more powerful every week, but the earliest models and the most recent state-of-the-art models share the exact same dependency: data quality. The maxim "garbage in, garbage out," coined decades ago, continues to apply today. Recent examples of data verification shortcomings abound, including JP Morgan/Chase's 2013 fiasco and this lovely list of Excel snafus. Brilliant people make data collection and entry errors all the time, and that isn't just our opinion (although we have plenty of personal experience with it): Kaggle did a survey of data scientists and found that "dirty data" is the number one barrier for data scientists.

Before we create a machine learning model, before we create a Shiny R dashboard, we evaluate the dataset for a project. Data validation is a complicated, multi-step process, and maybe it's not as sexy as talking about the latest ML models, but as data science consultants at Appsilon we live and breathe data governance and offer solutions. And it is not only about data format: data can be corrupted at different levels of abstraction. We can distinguish three levels:

  • Data structure and format
  • Qualitative & business logic rules
  • Expert logic rules

Level One: structure and format

For every project, we must verify:

  • Is the data structure consistent? A given dataset should have the same structure all of the time, because the ML model or app expects the same format. Names of columns/fields, number of columns/fields, and field data types (integers? strings?) must remain consistent.
  • Are we working with multiple datasets, or with a single merged dataset?
  • Do we have duplicate entries? Do they make sense in this context or should they be removed?
  • Do we have correct, consistent data types (e.g. integers, floating point numbers, strings) in all entries?
  • Do we have a consistent format for floating point numbers? Are we using a comma or a period?
  • What is the format of other data types, such as e-mail addresses, dates, zip codes, country codes and is it consistent?

It sounds obvious, but there are always problems, and these checks must be performed every time. The right questions must be asked.
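To make this concrete, here is a minimal sketch of a few Level One checks in R with the assertr package (which our workflow uses, as described later in this post). The orders data frame and its column names are hypothetical, invented purely for illustration.

```r
library(assertr)
library(dplyr)

# Hypothetical incoming dataset; column names are invented for illustration
orders <- data.frame(
  order_id = c(1L, 2L, 3L),
  price    = c("10.5", "11.2", "9.9"),
  email    = c("a@b.com", "c@d.org", "e@f.net"),
  stringsAsFactors = FALSE
)

orders %>%
  # Structure: every expected column is present
  verify(has_all_names("order_id", "price", "email")) %>%
  # No duplicate identifiers
  assert(is_uniq, order_id) %>%
  # Consistent decimal format: a period, never a comma
  assert(function(x) !grepl(",", x, fixed = TRUE), price) %>%
  # Basic e-mail format check
  assert(function(x) grepl("^[^@]+@[^@]+\\.[^@]+$", x), email)
```

If any rule fails, assertr stops with a report of the offending rows instead of silently passing bad data downstream.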

Level Two: qualitative and business logic rules

We must check the following every time:

  • Is the price parameter (if applicable) always non-negative?  (We stopped several of our retail customers from recommending the wrong discounts thanks to this rule. They saved significant sums and prevented serious problems thanks to this step… More on that later).
  • Do we have any unrealistic values?  For data related to humans, is age always a realistic number?
  • For data related to machines, does the status parameter always have a correct value from a defined set, e.g. only "FINISHED" or "RUNNING" for a machine status?
  • Can we have “Not Applicable” (NA), null, or empty values? What do they mean?
  • Do we have several values that mean the same thing? For example, users might enter their residence in different ways — “NEW YORK”, “Nowy Jork”, “NY, NY” or just “NY” for a city parameter. Should we standardize them?
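As a rough sketch of how a few of these qualitative rules might be written in R with assertr, consider the example below. The data frame is contrived to combine several of the checks above in one place; the column names and bounds are illustrative only.

```r
library(assertr)
library(dplyr)

# Contrived example rows; columns and bounds are illustrative only
records <- data.frame(
  price  = c(19.99, 5.00, 12.50),
  age    = c(34, 51, 28),
  status = c("FINISHED", "RUNNING", "FINISHED"),
  stringsAsFactors = FALSE
)

records %>%
  # Prices must be non-negative
  assert(within_bounds(0, Inf), price) %>%
  # Ages must be realistic for humans
  assert(within_bounds(0, 120), age) %>%
  # Status must come from the defined set of values
  assert(in_set("FINISHED", "RUNNING"), status) %>%
  # None of these columns may contain NA values
  assert(not_na, price, age, status)
```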

Level Three: expert rules

Expert rules govern something different from format and values: they check whether the story behind the data makes sense. This requires business knowledge about the data, and it is the data scientist's responsibility to be curious, to explore, and to challenge the client with the right questions in order to avoid logical problems with the data. The right questions must be asked.

Expert Rules Case Studies 

I’ll illustrate with a couple of true stories.

Story #1: Is this machine teleporting itself?

We were tasked with analyzing the history of a company's machines. The question was: how much time did each machine work at a given location? We had entries in our database recording, for each machine, its ID, its site, and a date.

The format and values were correct. But why did machine #1234 change its location every day? Is that even possible? This is the kind of question we should ask the client. In this case, we found that it was not physically possible for the machine to switch sites so often. After some investigation, we discovered that the software installed on the machine had a duplicated ID number: there were in fact two machines on different sites reporting the same ID. Once we learned what was physically possible, we set data validation rules for it and ensured that this issue would not happen again.

Expert rules can be developed only through close cooperation between data scientists and the business. This is not a part that can easily be automated by "data cleaning tools," which are great for hobbyists but not suitable for anything remotely serious.

Story #2: A negative sign could have changed all prices in the store

One of our retail clients was pretty far along in their project journey when we began to work with them. They already had a data scientist on staff and had already developed their own price optimization models. Our role was to take the output from those models and display recommendations in an R Shiny dashboard to be used by their salespeople. We had some assumptions about the format of the data the application would receive from their models, so we wrote validation rules describing what we thought the application should expect when reading the data.

We reasoned that the price should be (sketched as validation rules after this list):

  • non-negative
  • an integer
  • not an empty value or a string
  • within a reasonable range for the given product
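Expressed with assertr, those four rules might look roughly like the sketch below; the data frame, column names, and the price range are hypothetical stand-ins, not the client's actual values.

```r
library(assertr)
library(dplyr)

# Hypothetical output from the price optimization model
recommended <- data.frame(
  product_id = c("A1", "B2"),
  price      = c(199, 249)
)

recommended %>%
  assert(not_na, price) %>%                    # not an empty value
  assert(is.numeric, price) %>%                # not a string
  assert(within_bounds(0, Inf), price) %>%     # non-negative
  assert(function(x) x %% 1 == 0, price) %>%   # an integer
  assert(within_bounds(50, 500), price)        # plausible range (made-up bounds)
```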

As this model was being developed over the course of several weeks, we suddenly observed that prices were coming back too high. The issue was caught automatically by the validation: we didn't spot it in production, we spotted it before the data even landed in the application. After we saw the result in the report, we asked their team why it happened. It turned out that they had a new developer who assumed that discounts could be represented as negative numbers, because why not? He didn't realize that some applications actually depended on that output and assumed the value would be subtracted rather than added. Thanks to the automatic data validation, we prevented those errors from being loaded into production, and we worked with their data scientists to improve the model. It was a very quick fix, of course, a no-brainer. But the end result was that they saved real money.

Data Validation Report for all stakeholders

Here is a sample data validation report that our workflow produces for all stakeholders in the project:

Data Verification Report

The intent is that the data verification report is readable by all stakeholders, not just data scientists and software engineers. After years of experience working on data science projects, we observed that multiple people within an organization know realistic parameters for data values, such as price points. There is usually more than one expert in a community, and people are knowledgeable about different things. New data is often added at a constant rate, and parameters can change. So why not allow multiple people to add and edit rules when verifying data? With our Data Verification workflow, anyone from the team of stakeholders can add or edit a data verification rule.

Our Data Verification workflow works with the assertr package (for the R enthusiasts out there). The workflow runs validation rules automatically, after every update to the data. This is exactly the same process as writing unit tests for software. Like unit testing, our data verification workflow allows you to identify problems more easily and catch them early; and of course fixing problems at an earlier stage is much more cost effective.

Finally, what do validation rules look like on the code level?  We can’t show you code created for clients, so here is an example using data from the City of Warsaw public transportation system (requested from a public API).  Let’s say that we want a real-time check on the location and status of all the vehicles in the transit system fleet.

In this example, we want to ensure that the Warsaw buses and trams are operating within the borders of the city, so we check the latitude and longitude. If a vehicle is outside the city limits, then we certainly want to know about it! We want real-time updates, so we write a rule that "data is not older than 5 minutes." In a real project, we would probably write hundreds of such rules in partnership with the client. Again, we typically run this workflow BEFORE we build a model or a software solution for the client, but as you can see from the examples above, there is still tremendous value in running the Data Validation Workflow late in the production process! One of our clients even remarked that they saved more money with the Data Validation Workflow than with some of the machine learning models that were previously built for them.
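A minimal sketch of what such rules could look like in assertr is shown below. The vehicles data frame, the column names, and the Warsaw bounding-box coordinates are illustrative only, not the exact rules from the project.

```r
library(assertr)
library(dplyr)

# Hypothetical snapshot of vehicle positions pulled from the public API
vehicles <- data.frame(
  vehicle_id = c("1001", "1002"),
  lat        = c(52.23, 52.18),
  lon        = c(21.01, 20.95),
  timestamp  = Sys.time() - c(60, 120)   # observations from 1 and 2 minutes ago
)

vehicles %>%
  # Vehicles must stay inside a rough bounding box around Warsaw (approximate values)
  assert(within_bounds(52.10, 52.37), lat) %>%
  assert(within_bounds(20.85, 21.27), lon) %>%
  # Data must not be older than 5 minutes
  verify(difftime(Sys.time(), timestamp, units = "mins") < 5)
```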

Sharing our data validation workflow with the community

Data quality must be verified in every project to produce the best results. There are a number of potential errors that seem obvious and simplistic, but in our experience they tend to occur often.

After working on numerous projects with Fortune 500 companies, we came up with a solution to the three-level problem described above. Since multiple people within an organization know realistic parameters for datasets, such as price points, why not allow multiple people to add and edit rules when verifying data? We recently shared our workflow at a hackathon sponsored by the Ministry of Digitization here in Poland. We took third place in the competition, but more importantly, sharing it reflects one of the core values of our company: to share our best practices with the data science community.

Photo: Pawel and Krystian accept an award at the Ministry of Digital Affairs Hackathon.

I hope that you can put these takeaways in your toolbox:

  • Validate your data early and often, covering all assumptions.
  • Engage a data science professional early in the process  
  • Leverage the expertise of your workforce in data governance strategy
  • Data quality issues are extremely common

In the midst of designing new products, manufacturing, marketing, sales planning and execution, and the thousands of other activities that go into operating a successful business, companies sometimes forget about data dependencies and how small errors can have a significant impact on profit margins.  

We unleash your expertise about your organization or business by asking the right questions, then we teach the workflow to check for it constantly.  We take your expertise and we leverage it repeatedly.


You can find me on Twitter at  @pawel_appsilon .

Originally posted on Data Science Blog .


Maintaining Data Quality from Multiple Sources Case Study


There is a wealth of data within the healthcare industry that can be used to drive innovation, direct care, change the way systems function, and create solutions to improve patient outcomes. But with all this information coming in from multiple unique sources that all have their own ways of doing things, ensuring data quality is more important than ever.

The COVID-19 pandemic highlighted breakthroughs in data sharing and interoperability advances in the past few years. However, that does not mean that there aren’t challenges when it comes to data quality.

“As we have seen, many organizations have created so many amazing solutions around data,” said Mujeeb Basit, MD, associate chief medical informatics officer and associate director, Clinical Informatics Center, University of Texas Southwestern Medical Center. “COVID really highlighted the innovations and what you can do with sophisticated data architectures and how that flow of data really helps us understand what's happening in our communities. Data has become even more important.”

Dr. Basit shared some of his organization’s experiences in creating strategies to improve data quality while making the process as seamless as possible for all stakeholders.

The medical center had four groups working together on solution co-development, including quality, clinical operations, information resources and analytics.

“It is the synergy of working together and aligning our goals that really helps us develop singular data pipelines as well as workflows and outcomes that we're all vested in,” Dr. Basit said.

Finding Errors

One of the problems the organization previously faced was that errors would slowly accumulate in their systems because of complicated processes or frequent updates. When an error was found, Dr. Basit noted, it was usually fixed as a single entity, and sometimes a backlog of similar errors was fixed as well.

“But what happens is, over time, this error rate redevelops. How do we take this knowledge gained in this reported error event and then make that a sustainable solution long term? And this becomes exceedingly hard because that relationship may be across multiple systems,” Dr. Basit said.

He shared an example of how this had happened while adding procedures into their system; the procedures become charges, which then get translated into claim files.

“But if that charge isn't appropriately flagged, we actually don't get that,” Dr. Basit said. “This is missing a rate and missing a charge, and therefore, we will not get revenue associated with it. So, we need to make sure that this flag is appropriately set and this code is appropriately captured.”

His team created a workaround for this data quality issue where they will use a user story in their development environment and fix the error, but this is just a band-aid solution to the problem.

“As additional analysts are hired, they may not know this requirement, and errors can reoccur. So how do you solve this globally and sustain that solution over time? And for us, the outcome is significantly lost work, lost reimbursement, as well as denials, and this is just unnecessary work that is creating a downstream problem for us,” Dr. Basit said.

Their solution? Apply analysis at regular intervals to keep error rates low. 

“This is not sustainable by applying people to it, but it is by applying technology to it. We approach it as an early detection problem. No repeat failures, automate it so we don't have to apply additional resources for it, and therefore, it scales very, very well, as well as reduced time to resolution, and it is a trackable solution for us,” Dr. Basit said.

To accomplish this, they utilized a framework for integrated tests (FIT) and built a SQL server solution that intermittently runs to look for new errors. When one is found, a message is sent to an analyst to determine a solution.
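Purely as an illustration (this is not UT Southwestern's actual FIT implementation), a recurring check of this kind might be expressed as a query against the billing database, with any hits handed off to an analyst. The connection DSN, table, and column names below are hypothetical.

```r
library(DBI)
library(odbc)

# Hypothetical connection to the SQL Server instance holding charge data
con <- dbConnect(odbc::odbc(), dsn = "billing_dw")

# Look for newly added procedures whose charge flag or rate is missing
new_errors <- dbGetQuery(con, "
  SELECT procedure_id, procedure_code, charge_flag, rate
  FROM   charges
  WHERE  charge_flag IS NULL OR rate IS NULL
")

# If anything is found, hand it to an analyst (here: write a file a downstream alerting job watches)
if (nrow(new_errors) > 0) {
  write.csv(new_errors, "charge_flag_errors.csv", row.names = FALSE)
  message(nrow(new_errors), " suspect charge records flagged for analyst review")
}

dbDisconnect(con)
```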

“We have two types of automated testing. You have reactive where someone identifies the problem and puts in the error for a solution, and we have preventative,” Dr. Basit said.

The outcome of this solution means they are saving time and money—something the leadership within the University of Texas Southwestern Medical Center has taken notice of. They are now requesting FIT tests to ensure errors do not reoccur.

“This has now become a part of their vocabulary as we have a culture of data-driven approaches and quality,” Dr. Basit said.

Applying the Data Efficiently

Another challenge they faced was streamlining different types of information coming in through places like the patient portal and EHR while maintaining data quality.

“You can't guarantee 100% consistency in a real-time capture system. They would require a lot of guardrails in order to do that, and the clinicians will probably get enormously frustrated,” Dr. Basit said. “So we go for reasonable accuracy of the data. And then we leverage our existing technologies to drive this.”

He used an example from his organization of a rheumatology assessment designed to capture the day-to-day life of someone with the condition. They use a patient questionnaire to create a scoring system, and providers also conduct an assessment.

“Those two data elements get linked together during the visit so that we can then get greater insight on it. From that, we're able to use alerting mechanisms to drive greater responsiveness to the patient,” Dr. Basit said.

Running this data quality technology at scale was a challenge, but Dr. Basit and his colleagues utilized the Agile methodology to help.

“We didn't have sufficient staff to complete our backlog. What would happen is somebody would propose a problem, and by the time we finally got to solve it, they'd not be interested anymore, or that faculty member has left, or that problem is no longer an issue, and we have failed our population,” Dr. Basit said. “So for us, success is really how quickly can we get that solution implemented, and how many people will actually use it, and how many patients will it actually benefit. And this is a pretty large goal.”

 The Agile methodology focused on:

  • Consistency
  • Minimizing documentation
  • Incremental work products that can be used as a single entity

They began backlog sprint planning, doing two-week sprints at a time.

“We want to be able to demonstrate that we're able to drive value and correct those problems that we talked about earlier in a very rapid framework. The key to that is really this user story, the lightweight requirement gathering to improve our workflow,” Dr. Basit said.  “So you really want to focus as a somebody, and put yourself in the role of the user who's having this problem.”

An example of this would be a rheumatologist wanting to know if their patient is not on a disease-modifying anti-rheumatic drug (DMARD) so that their patient can receive optimal therapy for their rheumatoid arthritis.

“This is really great for us, and what we do is we take this user story and we digest it. And especially the key part here is everything that comes out for the ‘so that,’ and that really tells us what our success measures are for this project. This should only take an hour or two, but it tells so much information about what we want to do,” Dr. Basit said.

Acceptance criteria they look for include:

  • Independent
  • Estimatable

“And we try to really stick to this, and that has driven us to success in terms of leveraging our data quality and improving our overall workflow as much as possible,” Dr. Basit said.

With the rheumatology project, they were able to reveal that increased compliance to DMARD showed an increase in low acuity disease and a decrease in high acuity.

“That's what we really want to go for. These are small changes but could be quite significant to those people's lives who it impacted,” Dr. Basit said.

In the end, the systems he and his team have created are high-value solutions that clinicians and executives at their medical center use often.

“And over time we have built a culture where data comes first. People always ask, ‘What does the data say?’ Instead of sitting and wasting time on speculating on that solution,” Dr. Basit said.



Given the high volume of erroneous data floating across the enterprise, it is increasingly difficult for industry operators to make business decisions or plan any course of action based on such poor-quality data. Moreover, what’s the point of applying advanced analytics or BI technologies on data that is flawed?


Data Quality and Data Governance: What’s the Connection?

Although Data Quality and Data Governance are often used interchangeably, they are very different, both in theory and in practice. While Data Quality Management at an enterprise happens at both the front end (incoming data pipelines) and the back end (databases, servers), the whole process is defined, structured, and implemented through a well-designed framework. This framework for managing enterprise data may be thought of as a Data Governance framework, where rules and policies related to data ownership, data processes, and the data technologies used in the framework are clearly defined. So, Data Governance provides the framework for managing Data Quality.

InCountry Launches Data Residency-As-A-Service for Multinational Organizations discusses a one-stop regulatory solution for all multinational or country-specific business operators grappling with newly emerging compliance laws and policies. This solution enables businesses to store data locally, thus avoiding cross-border compliance issues.

Use of DQ and DG Strategies in Financial Services

Another Data Governance use case is sharply visible in financial services. In the digital-banking industry, DQ and DG have been exploited to transform entire business models. The banks that judiciously leveraged data platforms to reduce risks, streamline expenses, and boost revenues have impacted their bottom lines by 15 to 20 percent. The financial services industry leadership has now realized that a strong Data Strategy , which includes Data Quality and Data Governance, is the answer to developing efficient business models.

The major drivers of this transformation are, of course, the explosive volume of data, the dramatically reduced cost of data storage, and high-speed processing. The increased focus on regulatory compliance in financial services has necessitated the use of Data Quality and Data Governance strategies to reinvent traditional financial services.

One of the SAS user groups conducted a case study on the National Bank of Canada, where the SAS system was used to design a credit-risk management system. National Bank of Canada's Financial Group provides "financial services to retail, commercial, corporate and institutional clients." They found the SAS system proactive, fast, and adaptable.

The Data Quality Dimension "Coverage" is the Most Prominent for AI Outcomes describes how "coverage," used as a DQ dimension, can prevent bad or wrong data from surfacing in ML use cases for the financial services sector.

Use Case for Data Governance: Risk Analysis

A widely used Data Governance application is risk management. Data breaches are common, and industry leaders are well aware of their adverse consequences. The Top Five Data Governance Use Cases and Drivers describes how IT departments are proactively managing their "data-related risks" by adopting a Data Governance 2.0 approach. According to the Trends in Data Stewardship and Data Governance Report, almost 98 percent of organizations have accepted the importance of Data Governance in assessing and managing data-driven risks.

Data Governance for Avoiding Swamps in Data Lakes

The data lake has become a data-storage repository of choice because it can hold very high volumes of multi-format (structured, semi-structured, and unstructured) data. Data Governance allows the data to be "tagged," which helps users uncover context easily while searching for relevant data for a specific purpose. This tagging mechanism also helps verify the quality, view a sample, and get a historical account of past actions on the data. To avoid a swamp, the data also needs to be strictly governed in terms of ownership, accountability, sharing, and usage.

Data Quality and Data Governance for AI Outcomes

In most AI systems, the efficiency and impact of the predictive models depend on the scale and diversity of the data as well as on its cleanliness. Even the most powerful AI system may fail to deliver the expected results if the data used is not adequately governed and passed through quality checks. Thus, all AI-enabled business analytics systems must also be subject to sound Data Quality and Data Governance frameworks to operate at maximum efficiency.

Data Quality & Data Governance can Maximize Your AI Outcomes describes how Data Quality and Data Governance can enhance the “predictive efficiency” of ML algorithms.

Data Quality Use Cases

A Talend blog post describes the use of Talend Data Quality solutions in six different industry verticals, which suggests that in the coming years, DQ platforms and tools will penetrate global markets in a big way.

Common Applications of Data Quality Tools

Data Quality Study Guide — A Review of Use Cases & Trends states that "It appears there is an abundance of data, but a scarcity of trust, and the need for data literacy."

According to figures available from Gartner, "C-Level executives believe that 33% of their data is inaccurate." To gain the trust of both employees and customers, these enterprises must turn to Data Quality tools.

A Syncsort blog post describes very common situations where Data Quality checks are performed without most people being aware of it:

  • Erroneous Addresses in Databases: In many cases, hardcopy forms may be used to collect the address data, which leads to handwritten and erroneous data. Sometimes, even online forms have many mistakes, generating a collection of low-quality data.
  • Incomplete Phone Numbers: Phone numbers received directly from consumers are very often provided in haste and usually incomplete. This happens when the information provider does not know which components of the number (country code, area code, etc.) they are supposed to provide.
  • Missing Field Entries: This happens very often when filling out online forms. Users either miss certain fields or enter field data in an incorrect format. Form designers have to take particular care to ensure that the fields provide information on the entry format and that empty fields are flagged during form submission.
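A rough sketch of how such routine checks are often expressed in code is shown below (in R, with invented column names and deliberately simple rules):

```r
# Invented customer records for illustration
customers <- data.frame(
  address = c("12 Main St, Springfield", ""),
  phone   = c("+1 555 010 0100", "55501"),
  email   = c("a@b.com", NA),
  stringsAsFactors = FALSE
)

# Addresses must not be empty
bad_address <- is.na(customers$address) | customers$address == ""

# Phone numbers must contain at least 10 digits (a crude completeness check)
digits    <- gsub("\\D", "", customers$phone)
bad_phone <- nchar(digits) < 10

# Required fields must not be missing
bad_email <- is.na(customers$email) | customers$email == ""

# Rows needing correction before they enter downstream systems
customers[bad_address | bad_phone | bad_email, ]
```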

In the above cases, Data Quality tools are used to catch and rectify errors. Your Data Quality Situation is Unique (But it Really isn't) talks about data profiling, which is how businesses choose to "organize, maintain, and utilize" their data. Data profiling makes every business's DQ situation unique, and business users need to be aware of it. Data Quality Use Cases describes the SAS data-cleanup solution used in at least five different situations.

As global businesses continue to rely on data-driven solutions and AI systems for enhancing their competitiveness, Data Quality and Data Governance platforms will assume increased importance in the business landscape. The scale and volume of data-management solutions with a focus on quality and governance flooding the markets in the next few years may be surprising.



  • Open access
  • Published: 09 September 2021

Automated detection of poor-quality data: case studies in healthcare

M. A. Dakka, T. V. Nguyen, J. M. M. Hall, S. M. Diakiw, M. VerMilyea, R. Linke, M. Perugini & D. Perugini

Scientific Reports, volume 11, Article number: 18005 (2021)


Subjects: Mathematics and computing; Medical imaging

The detection and removal of poor-quality data in a training set is crucial to achieve high-performing AI models. In healthcare, data can be inherently poor-quality due to uncertainty or subjectivity, but as is often the case, the requirement for data privacy restricts AI practitioners from accessing raw training data, meaning manual visual verification of private patient data is not possible. Here we describe a novel method for automated identification of poor-quality data, called Untrainable Data Cleansing. This method is shown to have numerous benefits including protection of private patient data; improvement in AI generalizability; reduction in time, cost, and data needed for training; all while offering a truer reporting of AI performance itself. Additionally, results show that Untrainable Data Cleansing could be useful as a triage tool to identify difficult clinical cases that may warrant in-depth evaluation or additional testing to support a diagnosis.


Advances in deep learning using artificial neural networks (ANN) 1 , 2 have resulted in the increased use of AI for healthcare applications 3 , 4 , 5 . One of the most successful examples of deep learning has been the application of convolutional neural network (CNN) algorithms for medical image analysis to support clinical assessment 6 . AI models are trained with labeled or annotated data (medical images) and learn complex features of the images that relate to a clinical outcome, which can then be applied to classify new unseen medical images. Applications of this technology in healthcare span a wide range of domains including but not limited to dermatology 7 , 8 , radiology 9 , 10 , ophthalmology 11 , 12 , 13 , pathology 14 , 15 , 16 , and embryo quality assessment in IVF 17 .

Despite the enormous potential of AI to improve healthcare outcomes, AI performance can often be sub-optimal as it is crucially dependent on the quality of the data. While AI practitioners often focus on the quantity of data as the driver of performance, even fractional amounts of poor-quality data can substantially hamper AI performance. Good-quality data is therefore needed to train models that are both accurate and generalizable, and which can be relied upon by clinics and patients globally. Furthermore, because measuring AI performance on poor-quality test data can mislead or obfuscate the true performance of the AI, good-quality data is also important for benchmark test sets used in performance reporting, which clinics and patients rely on for clinical decisioning.

We define two types of poor-quality data:

Incorrect data: Mislabeled data, for example an image of a dog incorrectly labeled as a cat. This also includes adversarial attacks that intentionally insert errors into data labels (especially detrimental to online machine learning methods 18).

Noisy data: Data itself is of poor quality (e.g. an out-of-focus image), making it ambiguous or uninformative, with insufficient information or distinguishing features to correlate with any label.

In healthcare, clinical data can be inherently poor quality due to subjectivity and clinical uncertainty. An example of this is pneumonia detection from chest X-ray images. The labeling of a portion of the image can be somewhat subjective in terms of clinical assessment, often without a known ground truth outcome, and is highly dependent on the quality of the X-ray image taken. In some cases, the ground truth outcome might also involve clinical data that is not present in the dataset used for analysis, such as undiagnosed conditions in a patient, or effects that cannot be seen from images and records provided for the assessment. This kind of clinical uncertainty can contribute to both the incorrect and noisy data categories. Therefore, poor-quality data cannot always be reliably detected, even by clinical experts. Furthermore, due to data privacy, manual visual verification of private patient data is not always possible.

Several methods exist to account for these sources of reader variability and bias. One method 19 uses a so-called Direct Uncertainty Prediction to provide an unbiased estimate of label uncertainty for medical images, which can be used to draw attention to images requiring a second medical opinion. This technique relies on training a model to identify cases with high potential for expert disagreement. Other methods model uncertainty in poor-quality data through Bayesian techniques 20 . However, such techniques require significant amounts of annotated data from multiple experts, which is often not readily available. Some methods assume that erroneous label distribution is conditionally independent of the data instance given the true label 21 , which is an assumption that does not hold true in the settings considered in this article. Other techniques 22 relax this assumption by using domain-adapted generative models to explain the process that generates poor-quality data, though these techniques typically require additional clean data to generate good priors for learning. This is an issue in medical domains such as the assessment of embryo viability, where reader variability is significant 17 and ground truth labels may be impossible to ascertain, so there is no way of identifying data as poor quality a priori. There is a need for better approaches to cleanse poor-quality data automatically and effectively, in and beyond healthcare.

In this paper, a novel technique is presented for automated data cleansing which can identify poor data quality without requiring a cleansed dataset with known ground truth labels. The technique is called Untrainable Data Cleansing (UDC) and is described in the Methods section. UDC essentially identifies and removes a subset of the data (i.e. cleanses the data) that AI models are unable to correctly label (classify) during the AI training process. From a machine learning perspective, the two types of poor-quality data described above are realized through: (1) identifying data that are highly correlated to the opposite label of what would reasonably be expected, based on the classification of most of the data in the dataset (incorrect data); or (2) identifying data that have no distinguishing features that correlate with any label (noisy data). Results show that UDC can consistently and accurately identify poor-quality data, and that removal of UDC-identified poor-quality data, and thus “cleansing” of the data, ultimately leads to higher performing and more reliable AI models for healthcare applications.

Validation of UDC

UDC was first validated using two types of datasets, cats and dogs for binary classification, and vehicles for multi-classification. These datasets were used because the ground truth can be manually confirmed, and therefore incorrect labels could be synthetically injected.

Binary classification using cats and dogs

A benchmark (Kaggle) dataset of 37,500 cat and dog images was used to validate UDC for binary classification. This dataset was chosen because the ground truth outcomes (labels) could be manually determined with certainty, and synthetic incorrect labels could be readily introduced by flipping the correct label to an incorrect one. Synthetic incorrect labels were added to this dataset to test UDC under various amounts of poor-quality data.

A total of 24,916 images (12,453 cats, 12,463 dogs) were used for training, and 12,349 images (6143 cats, 6206 dogs) were used as a separate blind test set. Synthetic errors (incorrect labels) were added to the training dataset (but not the test set), which was split 80/20 into training and validation sets. A single round of UDC was applied to the training dataset, poor-quality data identified by UDC was removed, and a new AI model was trained on the UDC-cleansed dataset. The highest balanced AI accuracy achievable on the blind test dataset was reported. Results are shown in Table 1.

Results show that UDC is resilient to even extreme levels of errors, functioning in datasets with up to 50% incorrect labels in one class (while the other class remained relatively clean), and with significant symmetric errors of up to 30% incorrect labels in both classes. Visual assessment verified removal of both incorrect data and a minor proportion of noisy data, where, for example, dogs looked like cats (see Supplementary Figure S1 online). Improvements of greater than 20% were achieved in all cases. Compared to cases with uniform levels of incorrect labels, those with asymmetric incorrect labels, e.g. (35%, 5%), achieved a higher balanced accuracy of over 98% after just one round of UDC. This is expected, since in the asymmetric cases, one class remains as a true correct class, allowing UDC to be more confident in identifying incorrectly labeled samples.

In the uniform cases a slightly lower balanced accuracy of 94.7% was achieved after one round of UDC, which was found to identify and remove 88% of the intentionally mislabeled images. A second round of UDC improved upon the results of a single application of UDC, successively increasing the predictive power of the dataset by removing the remaining incorrect data. The accuracy achieved after a second round of UDC (99.7%) to the symmetric case (30%, 30%) showed an improvement even when compared to the baseline accuracy (99.2%) on datasets with 0% synthetic error. Further tests would be required to confirm the statistical significance of this uplift, but it is not unreasonable that the UDC could filter out noisy data that may be present in the original clean dataset (since the baseline set itself is not guaranteed to be free of poor-quality data), therefore helping to not only recover but surpass the accuracy of models trained on the baseline datasets.

For the symmetrical (50%, 50%) case (not shown), UDC simply learns to treat one entire class as incorrect and the other as correct, thereby discarding all samples from the opposite class as data errors. Therefore, as might be expected, UDC fails when data error levels in both classes approach 50%, because there is no longer a majority of good-quality data to allow UDC to confidently identify the minority of poor-quality data. In this case, the dataset is deemed unsuitable for AI training. To address datasets that are so noisy as to have virtually no learnable features, an important screening process prior to implementing a UDC workflow is to conduct a hyperparameter search to determine parameter spaces wherein predictive power can be achieved on a given dataset. The hyperparameter search is implemented by selecting a range of architectures, learning rates, momentum values and optimizers, and measuring the accuracy of each model on a validation set, at each epoch during training, for a range of 200–300 epochs. Hyperparameter combinations are eligible for inclusion in the UDC workflow if their associated training runs achieve a high mean accuracy across the training epochs, and are screened out if they are not able to achieve a statistically significant accuracy above 50%.
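As a sketch of that screening criterion (not the authors' code), one could test whether a configuration's validation accuracy is significantly above the 50% chance level with a simple binomial test; the counts below are invented:

```r
# Hypothetical screening of one hyperparameter configuration:
# n_val validation images, of which n_correct were classified correctly
n_val     <- 2000
n_correct <- 1080

# Is the observed accuracy significantly above the 50% chance level?
test <- binom.test(n_correct, n_val, p = 0.5, alternative = "greater")

# Keep this configuration for the UDC workflow only if the result is significant
eligible <- test$p.value < 0.05
eligible
```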

Multi-classification using vehicles

An open image database of 27,363 vehicles was used to validate UDC for multi-classification. The dataset comprised four classes of vehicles: airplanes, boats, motorcycles, and trucks. A total of 18,991 images (7244 airplanes, 5018 boats, 3107 motorcycles, 3622 trucks) were used for training, and 8372 images (3101 airplanes, 2194 boats, 1424 motorcycles, 1653 trucks) used as a separate blind test set. As in the previous section, this dataset was chosen because the ground truth outcomes (labels) could be manually ascertained, and synthetic incorrect labels could be readily introduced. Synthetic incorrect labels were uniformly added to each class in the training dataset in increments of 10% to test UDC under various amounts of poor-quality data. UDC results are shown in Supplementary Figure S2 online. This figure shows, for a given data error rate (0–90%), the number of images that are correctly predicted by x constituent UDC models, where the x -axis ranges from 0 to the total number of models used in the UDC workflow. Each UDC histogram is thus quantized by the total number of models, and each bin shows a different total number of images, depending on the choice of data error rate.

Results are summarized in Table 2 , which shows the percentage improvement for all cases after only a single round of UDC, removing both noisy and incorrect labels. Errors are calculated as the standard error on the mean for results obtained from eight models overall to reduce bias on a particular validation set (four model architectures based on Residual Convolutional Neural Network (ResNet) 23 and Dense Convolutional Network (DenseNet) 24 , each trained on two cross-validation phases).

For the multi-class case, results show that UDC is resilient and can identify poor-quality data and improve AI accuracy even at more extreme levels of errors (i.e. 30–70%) compared with the binary case. UDC fails when the percentage of incorrect labels in each class approaches 80%. This is because when 80% or more of a class's labels are distributed into the three other classes, there are fewer correct labels than incorrect labels for that class, making model training impossible (the model is pulled away from convergence by a larger number of incorrectly vs. correctly labeled data) and making such data uncleansable.

Taken together, these results suggest that UDC creates cleansed datasets that can be used to develop high-performing AI models that are both accurate and generalizable using fewer training data, and with reduced training time and cost. Near baseline-level performance was achieved using datasets containing up to 70% fewer training data. We showed that 97% accuracy could be achieved on datasets with up to 60% fractional incorrect labels for all classes, using less than 30% of the original amount of training data. In an even more extreme case with 70% incorrect labels, UDC had a higher false positive rate (correctly labeled images identified as incorrect), which resulted in the removal of 95% of the original dataset, but models trained on the remaining 5% still achieved over 92% accuracy on a blind test set. Finally, application of UDC gave greater stability and accuracy during the training process (across epochs), which means that AI model selection for deployment can be automated because the selection is not hyper-dependent on a training epoch once a threshold of accuracy is achieved.

Application of UDC

The UDC technique was then applied to two healthcare problems, pediatric chest X-ray images for identification of pneumonia, and embryo images to identify likelihood of pregnancy (viability) for in vitro fertilization (IVF). Finally, UDC was also shown to be able to cleanse benchmark test datasets themselves to enable a truer and more realistic representation of AI performance.

Assessment of pediatric chest X-rays for pneumonia detection

A publicly available dataset of pediatric chest X-ray images with associated labels of “Pneumonia” or “Normal” from Kaggle 25 was used. The labels were determined by multiple expert physicians. There were 5232 images in the training set and 624 images in the test set. UDC was applied to all 5856 images. Approximately 200 images were identified as noisy, while no labels were identified as incorrect. This suggests there were no suspected labeling errors in the dataset, but the images identified by UDC were considered poor-quality or uninformative. Poor-quality images in this dataset mean that labels of “normal” or “pneumonia” were not easily identifiable with certainty from the X-ray.

Figure 1. Cohen's kappa test for Noisy and Correct labels shows that images with Correct labels lead to a significantly higher level of agreement than random chance, and significantly higher than those with Noisy labels.

To verify the result, an independent expert radiologist assessed 200 X-ray images from this dataset, including 100 that were identified as noisy by UDC, and 100 that were identified as correct. The radiologist was only provided the image, and not the image label nor the UDC assessment. Images were assessed in random order, and the radiologist’s assessment of the label for each image recorded. Results showed that the reader consensus between the radiologist’s label and the original label was significantly higher for the correct images compared with the noisy images. Applying Cohen’s kappa test 26 on the results gives levels of agreement for noisy ( \(\kappa \approx 0.05\) ) and correct ( \(\kappa \approx 0.65\) ) labels (refer to Fig. 1 ). This confirms that for noisy images detected by UDC, there is insufficient information in the image alone to conclusively (or easily) make an assessment of pneumonia by either the radiologist or the AI. UDC could therefore prove beneficial as a screening tool for radiologists that could help triage difficult to read or suspicious (noisy) images that warrant further in-depth evaluation or additional tests to support a definitive diagnosis.
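For reference, Cohen's kappa for a two-reader comparison such as this can be computed directly from the 2×2 agreement table. The counts below are invented for illustration and are not the study's data:

```r
# Invented 2 x 2 agreement table: radiologist's label vs. original label
tab <- matrix(c(40, 12,
                14, 34),
              nrow = 2, byrow = TRUE,
              dimnames = list(radiologist = c("Normal", "Pneumonia"),
                              original    = c("Normal", "Pneumonia")))

n  <- sum(tab)
po <- sum(diag(tab)) / n                         # observed agreement
pe <- sum(rowSums(tab) * colSums(tab)) / n^2     # agreement expected by chance
kappa <- (po - pe) / (1 - pe)
kappa
```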

We then compared AI performance when trained using the original uncleansed X-ray training dataset versus UDC-cleansed X-ray training dataset with noisy images removed. Results are shown in Fig. 2 . The blue bar in the figure represents a theoretical maximum accuracy possible on the test dataset. It is obtained by testing every trained AI model on the test dataset to find the maximum accuracy that can be achieved. The orange bar is the actual (generalized) accuracy of the AI obtained using standard practice for training and selecting a final AI model using the validation set, then testing AI performance on the test set. The difference between the blue bar and orange bar indicates the generalizability of the AI, i.e. the ability of the AI to reliably apply to unseen data. Results show that training the AI on a UDC-cleansed dataset increases both the accuracy and generalizability of the AI. Additionally, the AI trained using a UDC-cleansed dataset achieved 95% generalized accuracy. This exceeds the 92% accuracy reported for other models in the literature using this same chest X-ray dataset 27 .

Figure 2. Balanced accuracy before and after UDC. The orange bar represents the AI accuracy on the test dataset using the standard AI training practice. The blue bar represents the theoretical maximum AI accuracy possible on the test dataset. The discrepancy between these two values is indicative of the generalizability of the model.

Lastly, we investigated application of UDC on the test dataset of X-ray images to assess its quality. This is vital because the test dataset is used by AI practitioners to assess and report on AI performance. Too much poor-quality data in a test set means the AI accuracy is not a true representation of AI performance. To evaluate this, we injected the uncleansed test dataset into the training dataset used to train the AI to determine the maximum accuracy that could be obtained on the validation dataset. Figure 3 shows reduced performance of AI trained using the aggregated dataset (training dataset plus the noisy test dataset) compared with the AI trained only using the cleansed training set. This suggests that the level of poor-quality data in the test dataset is significant, and thus care should be taken when AI performance is measured using this particular test dataset.

Figure 3. The colors of the bars represent the performance of the model on the validation set, with (orange) and without (blue) the test set included in the training set. AI performance drops when the uncleansed blind test set is included in the training set, indicating a considerable level of poor-quality data in the test set.

Figure 4. Performance metrics of the AI model predicting clinical pregnancy, trained on original (left section) and UDC-cleansed (right section) training data. Both graphs show results on the validation set (green), and the corresponding original test set (blue) and UDC-cleansed test set (orange).

Assessment of embryo quality for IVF

Finally, UDC was successfully applied to the problem of assessing embryo viability in IVF. UDC was a core technique in developing a commercial AI healthcare product, which is currently being used in IVF clinics globally 17 . The AI model analyzes images of embryos at Day 5 of development to identify which ones are viable and likely to lead to a clinical pregnancy.

Clinical pregnancy is measured by the presence of a fetal heartbeat at the first ultrasound scan approximately 6 weeks after the embryo is transferred to an IVF patient. An embryo is labeled viable if it led to pregnancy, and non-viable if a pregnancy did not occur. Although there is certainty in the outcome (pregnancy or no pregnancy), there is uncertainty in the labels, because there may be patient medical conditions or other factors beyond embryo quality that prevent pregnancy (e.g. endometriosis) 17 . Therefore, an embryo that is viable may be incorrectly labeled as non-viable. These incorrect labels impact the performance of the AI if not identified and removed.

UDC was applied to images of embryos to identify incorrect labels. These were predominantly in the training dataset's non-viable class, as expected, as they included embryos that appeared viable but were labeled non-viable. Performance results are shown in Fig. 4. AI models trained using a UDC-cleansed training dataset achieved an increase in accuracy, from 59.7 to 61.1%, when reported on the standard uncleansed test dataset. This small increase in accuracy was not statistically significant, but could potentially be misleading, as the uncleansed test set itself may comprise a significant portion of incorrectly labeled non-viable embryo images, thus reducing specificity as the AI model improves. For the predominantly clean viable class, sensitivity increased by a larger amount, from 76.8 to 80.6%. When a UDC-cleansed test set is utilized, AI models trained using a UDC-cleansed training dataset achieved an increase in accuracy from 73.5 to 77.1%. This larger increase is a truer representation of the AI performance, and while the uplift is just at the \(1-\sigma\) level, it is noted that a medical dataset may require multiple rounds of UDC to fully cleanse the training set.

Effect sizes before and after UDC are represented using Cohen’s d , as shown in Table 3 , along with p -values. Effect sizes larger than 0.6 are considered “large”, meaning that for all test sets (including validation), UDC has a large effect on training and inference (test) performance, except for specificity results for both cleansed and uncleansed (expected due to the large proportion of incorrectly labeled non-viable embryos) test sets. This can be interpreted as there being a significant pair-wise uplift in sensitivity without much cost to specificity. In all cases, there is very large ( \(d>1.4\) ) effect on overall performance. Taken together these results suggest that using UDC to cleanse training datasets can improve the accuracy of the AI even in clinical datasets with a high level of mislabeled, poor-quality data.

This study characterizes a novel technique, Untrainable Data Cleansing (UDC), that serves to automatically identify, and thus allow removal of, poor-quality data to improve AI performance and reporting. In the clinical setting, accurate AI could mean the difference between life and death, or early diagnosis versus missed diagnosis. Thus it is critical that poor-quality data are identified and removed so as not to confuse the AI training process and impact clinical outcomes. It can be difficult for even the most experienced clinicians to identify poor-quality data, particularly when the clinical outcome is uncertain, or the quality and integrity of the data does not allow for a definitive labeling of the image. Furthermore, due to data privacy laws, it may not even be possible to manually assess data quality of private medical datasets. Because UDC can be “shipped” to the secure location of private data, it offers an automated way of addressing data quality concerns while respecting data privacy laws.

UDC was validated across two problem sets, (1) cats vs. dogs, and (2) vehicles, or binary and multi-classification problems, respectively, because image labels could be manually verified. In both cases UDC was effective at identifying synthetically introduced incorrect labels. Training AI models following removal of poor-quality data significantly improved the AI performance, in terms of both accuracy and generalizability. UDC was applied to two medical problem sets, one for pneumonia detection in chest X-rays, and the other for embryo selection in IVF. Both are challenging clinical assessment areas due to varying degrees of noisy or incorrectly labeled data. In both studies UDC was effective (as measured on double blind datasets) at identifying poor quality data, and yielded significant improvements in accuracy and generalizability.

In UDC, the use of a variety of model architectures as well as k -fold cross-validation serves to mitigate overfitting on smaller datasets. Though there may always be a trivial lower bound on dataset size, the behavior of UDC as total training images decreases was found to be stable after training on (cleansed) datasets as low as 5% the size of the initial training datasets. Nevertheless, to further alleviate the effect of overfitting, all models are trained using data augmentation, dropout, weight balancing, learning rate scheduling, weight decay, and early stopping.

In the same way that poor-quality training data can impact AI training, the reporting (or testing) of AI performance done so on a poor-quality test dataset (that contains noisy or incorrectly labeled data) can lead to inaccurate performance reporting. Inaccurate reporting can mislead clinicians and patients on the true performance and reliability of the AI, with potential real-world consequences for those that may rely on AI results. We assessed the utility of UDC for cleansing test datasets and showed that the accuracy of the AI reported on test datasets cleansed with UDC was different to that reported on uncleansed test datasets. The reporting of AI accuracy on a UDC-cleansed test set was shown to be a truer representation of the AI performance based on independent assessment.

Since UDC relies on pooling or ensembling various model architectures, its advantages (namely, that it is relatively agnostic to initial feature-crafting and allows for automated implementation) could be enhanced with the application of discriminant analysis techniques to provide richer context when identifying noisy or incorrect data. Future research could investigate how Principal Component Analysis-Linear Discriminant Analysis (PCA-LDA) or related techniques, applied not to the images themselves but to model activation (or feature) layers, could more precisely explain why individual samples were identified as poor-quality by UDC. Such methods may be able to identify which features in each image are not in agreement with the received label.

Finally, we showed that UDC was able to identify noisy data that, in the case of the pneumonia X-rays, neither the AI nor the radiologist could consistently classify. The ability of UDC to identify these cases suggests it can be used as a triage tool to direct clinicians to those cases that warrant new tests or additional in-depth clinical assessment. This study demonstrates that the performance of AI for clinical applications is highly dependent on the quality of clinical data, and the utility of a method like UDC that can automatically cleanse otherwise poor-quality clinical data cannot be overstated.

UDC algorithm

Untrainable Data Cleansing (UDC) can identify three categories of image-label pairs:

Correct —strongly expected to be correct (i.e. label matches the ground-truth).

Incorrect —strongly expected to be incorrect (i.e. label does not match ground-truth).

Noisy —data is ambiguous or uninformative to classify with certainty (i.e. label may or may not match ground-truth).

UDC delineates between correct, incorrect, and noisy images using the method described in Algorithm 1, which combines sampling across different ( n ) model architectures with sampling across the data using k -fold cross-validation (KFXV). These \(n\times k\) models vote on each image-label pair, reducing bias and increasing robustness to outliers.

Algorithm 1: the UDC voting procedure (presented as a figure in the original article).
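To make the voting scheme concrete, the following is a rough sketch of the same idea using simple scikit-learn classifiers in place of deep CNNs; the two "architectures", the vote thresholds, and the synthetic data are illustrative assumptions, not the authors' implementation.

```python
# Sketch of an n-architecture x k-fold voting scheme: each model votes on whether
# each sample's label is predicted correctly; vote totals (s_j) are then thresholded.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

architectures = [
    lambda: LogisticRegression(max_iter=1000),
    lambda: RandomForestClassifier(n_estimators=100, random_state=0),
]  # n = 2 stand-in "architectures"
k = 5
votes = np.zeros(len(y))  # s_j: number of successful predictions per sample

for make_model in architectures:
    skf = StratifiedKFold(n_splits=k, shuffle=True, random_state=0)
    for train_idx, _ in skf.split(X, y):
        model = make_model().fit(X[train_idx], y[train_idx])
        # Apply the trained model back on the full (to-be-cleansed) dataset
        votes += (model.predict(X) == y).astype(int)

n_models = len(architectures) * k
# Illustrative thresholds: many successful votes -> correct, few -> incorrect, middle -> noisy
correct = votes >= 0.8 * n_models
incorrect = votes <= 0.2 * n_models
noisy = ~correct & ~incorrect
print(f"correct: {correct.sum()}, incorrect: {incorrect.sum()}, noisy: {noisy.sum()}")
```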

Prior to training each AI model configuration, the images are minimally pre-processed by auto-scaling their resolutions to \(224 \times 224\) pixels, and normalizing the color levels using the standard ImageNet mean RGB values of (0.485, 0.456, 0.406) and standard deviations of (0.229, 0.224, 0.225), respectively.
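A minimal sketch of this pre-processing step, assuming a torchvision-style pipeline (the library choice is an assumption; only the target resolution and normalization constants come from the text):

```python
# Resize to 224x224 and normalize with the stated ImageNet statistics.
from torchvision import transforms

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),  # scales pixel values to [0, 1]
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])
```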

We describe UDC as “ turning AI onto itself ”, as it uses the AI training process to identify poor-quality data. Multiple AI models using different architectures and parameters are trained using the data (to be cleansed), then the AI models are applied back on the same training data to infer their labels. Data which cannot be consistently classified correctly are identified as poor-quality (i.e. incorrect or noisy).

The idea behind UDC is that if data cannot be consistently classified correctly within the AI training process itself, where AI models are likely to find the strongest correlations and correct classifications on the very dataset used to train them, then the data is likely to be poor-quality.

The intuition behind using UDC to subdivide image-label pairs into three types of labels is based on probability theory. A correct label can be thought of as a positively weighted coin ( \(p\gg 0.5\) ), where p is the probability of being correctly predicted by a certain model. In contrast, an incorrect label can be thought of as a negatively weighted coin ( \(p\ll 0.5\) )—very likely to be incorrectly predicted. A noisy label can be thought of as a fair coin ( \(p\approx 0.5\) )—equally likely to be correctly or incorrectly predicted. To illustrate how this intuition applies to UDC, we consider a hypothetical dataset of N image-label pairs. Algorithm 1 is applied to this dataset to produce a number of successful predictions ( \(s_j\) ) for each image j . A histogram of \(s_j\) values (with increasing s on the x -axis) then shows how correct, incorrect, and noisy labels tend to cluster at high, low, and medium values of s , respectively.
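The clustering described above can be illustrated with a small simulation in which \(s_j\) counts are drawn from binomial distributions with p well above, near, and well below 0.5; the sample sizes and probabilities below are arbitrary assumptions used only to show the three clusters.

```python
# Simulate s_j counts for correct (p >> 0.5), noisy (p ~ 0.5), and incorrect (p << 0.5) labels.
import numpy as np

rng = np.random.default_rng(0)
n_models = 10  # n x k models voting on each image-label pair

s_correct = rng.binomial(n_models, 0.9, size=800)
s_noisy = rng.binomial(n_models, 0.5, size=150)
s_incorrect = rng.binomial(n_models, 0.1, size=50)

s_all = np.concatenate([s_correct, s_noisy, s_incorrect])
counts = np.bincount(s_all, minlength=n_models + 1)
for s, c in enumerate(counts):
    print(f"s = {s:2d}: {'#' * (c // 10)}")  # crude text histogram: three clusters emerge
```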

Synthetic errors (incorrect labels) were added to the training dataset of 24,916 images, which was split 80/20 into training and validation sets. For each study t , \({\varvec{\mathscr {n}}}^{(t)}=\left( {\varvec{\mathscr {n}}}_{cat}^{(t)},{\varvec{\mathscr {n}}}_{dog}^{(t)}\right)\) , with \(0\le {\varvec{\mathscr {n}}}(\%) \le 100\) , contains the fractional level of incorrect labels for the cat and dog classes, respectively (see Table 4 ). RN stands for ResNet 23 and DN stands for DenseNet 24 .

Synthetic errors (incorrect labels) were added to the training dataset of 18,991 images, which was split 80/20 ( \(k=5\) ) into training and validation sets. For each study t , \({\varvec{\mathscr {n}}}^{(t)}=\left( {\varvec{\mathscr {n}}}_{A}^{(t)},{\varvec{\mathscr {n}}}_{B}^{(t)},{\varvec{\mathscr {n}}}_{M}^{(t)},{\varvec{\mathscr {n}}}_{T}^{(t)}\right)\) represents the fractional levels of incorrect labels for the airplane ( A ), boat ( B ), motorcycle ( M ), and truck ( T ) classes, respectively (see Table 5 ), with \(0\le {\varvec{\mathscr {n}}}(\%) \le 100\) ; R stands for ResNet 23 and D stands for DenseNet 24 . Note that in Table 5 the fractional level of incorrect labels was kept constant across classes in each study, so only one value is shown.
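The following is a small sketch of how such synthetic label errors might be injected, assuming a simple per-class random relabeling scheme; the noise fraction, class encoding, and helper function name are illustrative assumptions.

```python
# Flip a chosen fraction of labels in each class to a different class.
import numpy as np

def inject_label_noise(labels, noise_fraction, n_classes, rng):
    """Return a copy of `labels` with `noise_fraction` of each class relabeled."""
    noisy = labels.copy()
    for c in range(n_classes):
        idx = np.where(labels == c)[0]
        n_flip = int(noise_fraction * len(idx))
        flip_idx = rng.choice(idx, size=n_flip, replace=False)
        for i in flip_idx:
            # Reassign to a different, randomly chosen class
            noisy[i] = rng.choice([k for k in range(n_classes) if k != c])
    return noisy

rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=1000)  # e.g. 0 = cat, 1 = dog
noisy_labels = inject_label_noise(labels, noise_fraction=0.25, n_classes=2, rng=rng)
print("flipped labels:", int((labels != noisy_labels).sum()))
```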

Reader consensus between radiologists for correct vs. noisy labels

Images identified by UDC to have noisy labels are suspected to have inconsistencies that make their annotation (or labeling) more difficult. As such, we expect the reader consensus of Pneumonia/Normal assessments between different radiologists to be lower for images with noisy labels than for images with correct labels, which are easily identified by the AI model and for which we expect relatively high reader consensus. The following two pairs of hypotheses are formulated and can be directly tested using Cohen's kappa test:

\({\varvec{\mathcal {H}}}_\mathbf{0} ^\mathbf {(1)}\) : The level of agreement between radiologists for noisy labels is different from random chance.

\({\varvec{\mathcal {H}}}_\mathbf{a} ^\mathbf {(1)}\) : The level of agreement between radiologists for noisy labels is no different from random chance.

\({\varvec{\mathcal {H}}}_\mathbf{0} ^\mathbf {(2)}\) : The level of agreement between radiologists for correct labels is no greater than random chance.

\({\varvec{\mathcal {H}}}_\mathbf{a} ^\mathbf {(2)}\) : The level of agreement between radiologists for correct labels is greater than random chance.

We prepare an experimental dataset by splitting the data into correct and noisy labels as follows, where the two subsets are used in a clinical study to test the above hypotheses and validate UDC:

A dataset \(\mathcal {D}\) with 200 elements \({\varvec{\mathscr {z}}}_j = \left( \mathbf {x}_j,\hat{y}_j\right)\) has images \(\mathbf {x}_j\) and (noisy) annotated labels \(\hat{y}_j\) . This dataset is split into two equal subsets of 100 images each:

\(\mathcal {D}_{clean}\) —labels identified as correct by UDC, with the following breakdown:

52 Pneumonia (39 Bacterial / 13 Viral) and 48 Normal

\(\mathcal {D}_{noisy}\) —labels identified as noisy by UDC, with the following breakdown:

49 Pneumonia (14 Bacterial / 35 Viral) and 51 Normal

The dataset \(\mathcal {D}\) is randomized to create a new dataset \(\hat{\mathcal {D}}\) for an expert radiologist to label, and to indicate confidence or certainty in those labels (Low, Medium, High). This randomization is to address fatigue bias and any bias related to the ordering of the images.

The reader consensus between the expert radiologist and the original labels is calculated using Cohen's kappa test 26 and is compared between the datasets \(\mathcal {D}_{clean}\) and \(\mathcal {D}_{noisy}\) .
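A minimal sketch of this comparison, using scikit-learn's cohen_kappa_score on synthetic placeholder labels (the agreement levels are contrived to mimic the expected clean vs. noisy behavior, not the study's data):

```python
# Compare reader agreement (Cohen's kappa) on the UDC-clean vs. UDC-noisy subsets.
import numpy as np
from sklearn.metrics import cohen_kappa_score

rng = np.random.default_rng(0)

original_clean = rng.integers(0, 2, size=100)
reader_clean = original_clean.copy()
reader_clean[rng.choice(100, size=5, replace=False)] ^= 1   # high agreement: only 5 disagreements

original_noisy = rng.integers(0, 2, size=100)
reader_noisy = rng.integers(0, 2, size=100)                 # near-chance agreement

print("kappa (clean):", cohen_kappa_score(original_clean, reader_clean))
print("kappa (noisy):", cohen_kappa_score(original_noisy, reader_noisy))
```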

Figure 1 provides visual evidence that both null hypotheses, \({\varvec{\mathcal {H}}}_\mathbf{0} ^\mathbf {(1)}\) and \({\varvec{\mathcal {H}}}_\mathbf{0} ^\mathbf {(2)}\) , are rejected with very high confidence ( \(>99.9\%\) ) and effect size ( \(>0.85\) ). Therefore, both alternate hypotheses are accepted: \({\varvec{\mathcal {H}}}_\mathbf{a} ^\mathbf {(1)}\) , stating that labels identified as noisy have levels of agreement no different from random chance, and \({\varvec{\mathcal {H}}}_\mathbf{a} ^\mathbf {(2)}\) , stating that labels identified by UDC as correct have levels of agreement greater than random chance.

References

1. LeCun, Y., Bengio, Y. & Hinton, G. Deep learning. Nature 521, 436–444 (2015).
2. Goodfellow, I., Bengio, Y. & Courville, A. Deep Learning (MIT Press, 2016).
3. Esteva, A. et al. A guide to deep learning in healthcare. Nat. Med. 25, 24–29 (2019).
4. Fourcade, A. & Khonsari, R. H. Deep learning in medical image analysis: A third eye for doctors. J. Stomatol. Oral Maxillofac. Surg. 120, 279–288. https://doi.org/10.1016/j.jormas.2019.06.002 (2019).
5. Lundervold, A. S. & Lundervold, A. An overview of deep learning in medical imaging focusing on MRI. Z. Med. Phys. 29, 102–127. https://doi.org/10.1016/j.zemedi.2018.11.002 (2019).
6. Litjens, G. et al. A survey on deep learning in medical image analysis. Med. Image Anal. 42, 60–88 (2017).
7. Esteva, A. et al. Dermatologist-level classification of skin cancer with deep neural networks. Nature 542, 115–118 (2017).
8. Haenssle, H. A. et al. Man against machine: diagnostic performance of a deep learning convolutional neural network for dermoscopic melanoma recognition in comparison to 58 dermatologists. Ann. Oncol. 29, 1836–1842 (2018).
9. Cheng, J.-Z. et al. Computer-aided diagnosis with deep learning architecture: applications to breast lesions in US images and pulmonary nodules in CT scans. Sci. Rep. 6, 1–13 (2016).
10. Kooi, T. et al. Large scale deep learning for computer aided detection of mammographic lesions. Med. Image Anal. 35, 303–312 (2017).
11. Gulshan, V. et al. Development and validation of a deep learning algorithm for detection of diabetic retinopathy in retinal fundus photographs. JAMA 316, 2402–2410 (2016).
12. Poplin, R. et al. Prediction of cardiovascular risk factors from retinal fundus photographs via deep learning. Nat. Biomed. Eng. 2, 158 (2018).
13. De Fauw, J. et al. Clinically applicable deep learning for diagnosis and referral in retinal disease. Nat. Med. 24, 1342–1350 (2018).
14. Ciresan, D. C., Giusti, A., Gambardella, L. M. & Schmidhuber, J. Mitosis Detection in Breast Cancer Histology Images with Deep Neural Networks 411–418 (Springer, 2013).
15. Charoentong, P. et al. Pan-cancer immunogenomic analyses reveal genotype-immunophenotype relationships and predictors of response to checkpoint blockade. Cell Rep. 18, 248–262 (2017).
16. Beck, A. H. et al. Systematic analysis of breast cancer morphology uncovers stromal features associated with survival. Sci. Transl. Med. 3, 108ra113 (2011).
17. VerMilyea, M. et al. Development of an artificial intelligence-based assessment model for prediction of embryo viability using static images captured by optical light microscopy during IVF. Hum. Reprod. (2020).
18. Zhang, X. & Lessard, L. Online data poisoning attacks (2020).
19. Raghu, M. et al. Direct uncertainty prediction for medical second opinions. arXiv:1807.01771 [cs, stat] (2019).
20. Kendall, A. & Gal, Y. What uncertainties do we need in Bayesian deep learning for computer vision? 5574–5584 (2017).
21. Natarajan, N., Dhillon, I. S., Ravikumar, P. K. & Tewari, A. Learning with Noisy Labels 1196–1204 (Curran Associates Inc, 2013).
22. Xiao, T., Xia, T., Yang, Y., Huang, C. & Wang, X. Learning from massive noisy labeled data for image classification. Proc. IEEE Conf. Comput. Vis. Pattern Recogn. 2691–2699 (2015).
23. He, K., Zhang, X., Ren, S. & Sun, J. Deep residual learning for image recognition. Proc. IEEE Conf. Comput. Vis. Pattern Recogn. 770–778 (2016).
24. Huang, G., Liu, Z., Van Der Maaten, L. & Weinberger, K. Q. Densely connected convolutional networks. Proc. IEEE Conf. Comput. Vis. Pattern Recogn. 4700–4708 (2017).
25. Mooney, P. Chest X-ray images (pneumonia).
26. Sim, J. & Wright, C. C. The kappa statistic in reliability studies: Use, interpretation, and sample size requirements. Phys. Ther. 85, 257–268 (2005).
27. Kermany, D. S. et al. Identifying medical diagnoses and treatable diseases by image-based deep learning. Cell 172, 1122–1131 (2018).


Author information

Authors and Affiliations

Presagen, Adelaide, SA, 5000, Australia
M. A. Dakka, T. V. Nguyen, J. M. M. Hall, S. M. Diakiw, M. Perugini & D. Perugini

School of Mathematical Sciences, The University of Adelaide, Adelaide, SA, 5000, Australia
M. A. Dakka

School of Computing and Information Technology, University of Wollongong, Wollongong, NSW, 2522, Australia
T. V. Nguyen

Australian Research Council Centre of Excellence for Nanoscale BioPhotonics, Adelaide, SA, 5000, Australia
J. M. M. Hall

Ovation Fertility, Austin, TX, 78731, USA
M. VerMilyea

Texas Fertility Center, Austin, TX, 78731, USA

Department of Medical Imaging-SAMI, Women's and Children's Hospital Campus, Adelaide, SA, 5000, Australia

Adelaide Medical School, The University of Adelaide, Adelaide, SA, 5000, Australia
M. Perugini


Contributions

D.P. invented the concept; M.A.D. designed the algorithm; M.A.D., J.M.M.H., T.V.N., and D.P. conceived the experiments; M.A.D., J.M.M.H., and T.V.N. conducted the experiments; M.V. and R.L. provided clinical data and clinical review; D.P., M.A.D., J.M.M.H., T.V.N., S.M.D., and M.P. drafted the manuscript and provided critical review of the results.

Corresponding authors

Correspondence to J. M. M. Hall or D. Perugini .

Ethics declarations

Competing interests.

J.M.M.H., D.P., and M.P. are co-owners of Presagen. S.M.D., T.V.N., and M.A.D. are employees of Presagen.

Additional information

Publisher's note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article

Cite this article.

Dakka, M.A., Nguyen, T.V., Hall, J.M.M. et al. Automated detection of poor-quality data: case studies in healthcare. Sci Rep 11 , 18005 (2021). https://doi.org/10.1038/s41598-021-97341-0


Received : 19 April 2021

Accepted : 23 August 2021

Published : 09 September 2021

DOI : https://doi.org/10.1038/s41598-021-97341-0




Building a case for data quality: What is it and why is it important

  • Written by Ehsan Elahi
  • May 9, 2022

According to an IDC study , 30-50% of organizations encounter a gap between their data expectations and reality. A deeper look at this statistic shows that:

  • 45% of organizations see a gap in data lineage and content ,
  • 43% of organizations see a gap in data completeness and consistency ,
  • 41% of organizations see a gap in data timeliness ,
  • 31% of organizations see a gap in data discovery , and
  • 30% of organizations see a gap in data accountability and trust .

These data dimensions are commonly termed data quality metrics: measures that help us assess the fitness of data for its intended purpose, also known as data quality.

What is data quality?

The degree to which data satisfies the requirements of its intended purpose.

If an organization is unable to use its data for the purpose it is stored and managed for, the data is said to be of poor quality. This definition implies that data quality is subjective and means something different for every organization, depending on how it intends to use the data. For example, in some cases, data accuracy is more important than data completeness , while in other cases the opposite may be true.

Another interesting way of describing data quality is:

The absence of intolerable defects in a dataset.

In other words, data can never be completely free of defects, and that is fine. It just has to be free of defects that are intolerable for the purpose it is used for across the organization. Usually, data quality is monitored to confirm that datasets contain the needed information (in terms of attributes and entities) and that the information is as accurate (or defect-free) as possible.
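As a simple illustration, a couple of such checks can be scripted directly against a table; the DataFrame below and the chosen metrics (column completeness and duplicate rate) are hypothetical examples, not a prescribed toolset.

```python
# Minimal sketch of monitoring basic data quality metrics on a customer table.
import pandas as pd

df = pd.DataFrame({
    "customer_id": [1, 2, 2, 4, None],
    "email": ["a@x.com", None, "b@y.com", "c@z.com", "c@z.com"],
})

completeness = df.notna().mean()                       # fraction of non-missing values per column
duplicate_rate = df.duplicated(subset=["customer_id"]).mean()

print("completeness per column:\n", completeness)
print("duplicate rate on customer_id:", duplicate_rate)
```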

How to build a case for data quality?

Having delivered data solutions to Fortune 500 clients for over a decade, we usually find data professionals spending more than 50 hours a week on their job responsibilities. The added hours are a result of duplicate work, unsuccessful results, and lack of data knowledge. On further analysis, we often find data quality to be the main culprit behind most of these data issues. The absence of a centralized data quality engine that consistently validates and fixes data quality problems is costing experienced data professionals more time and effort than necessary.

When something silently eats away at your team's productivity and produces unreliable results, it becomes crucial to bring it to the attention of the necessary stakeholders so that corrective measures can be taken in time. These measures should also be integrated into the business process so that they are exercised as a habit and not as a one-time act.

In this blog, we will cover three important points:

  • The quickest and easiest way to prove the importance of data quality.
  • A bunch of helpful resources that discuss different aspects of data quality.
  • How data quality benefits the six main pillars of an organization.

Let’s get started.

1. Design data flaw – business risk matrix

To prove the importance of data quality, you need to highlight how data quality problems increase business risks and impact business efficiency. This requires some research and discussion amongst data leaders and professionals, and then they can share the results and outcomes with necessary stakeholders.

We often encounter minor and major issues in our datasets, but we rarely evaluate them deeply enough to see the kind of business impact they can have. In a recent blog, I talked about designing the data flaw – business risk matrix : a template that helps you relate data flaws to business impact and the resulting costs. In a nutshell, this template helps you relate different types of misinformation present in your dataset to business risks.

For example, a misspelled customer name or incorrect contact information can lead to duplicate records in a dataset for the same customer. This, in turn, increases the number of inbound calls, decreases customer satisfaction, and impacts audit demand. These mishaps take a toll on a business in terms of increased staff time, reduced orders due to customer dissatisfaction, increased cash flow volatility, and so on.
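As a rough illustration of how a misspelled name surfaces as a near-duplicate record, the sketch below compares customer names with a simple string-similarity ratio; the names and the 0.85 threshold are made-up examples, not a recommended matching rule.

```python
# Flag likely duplicate customer records caused by misspelled names.
from difflib import SequenceMatcher

customers = ["John Smith", "Jon Smith", "Jane Doe", "John Smyth"]

for i in range(len(customers)):
    for j in range(i + 1, len(customers)):
        similarity = SequenceMatcher(None, customers[i].lower(), customers[j].lower()).ratio()
        if similarity > 0.85:
            print(f"possible duplicate: {customers[i]!r} ~ {customers[j]!r} ({similarity:.2f})")
```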

But if you can get this information on paper, where something as small as a misspelled customer name is linked to something as big as losing customers, it can prove to be the first step in building a case for the importance of data quality.

2. Utilize helpful data quality resources

We have a bunch of content on our data quality hub that discusses data quality from different angles and perspectives. You will probably find something there that fulfils your requirements – something that helps you to convince your team or managers about the importance and role of data quality for any data-driven initiative.

A list of such resources is given below:

  • The impact of poor data quality: Risks, challenges, and solutions
  • Data quality measurement: When should you worry?
  • Building a data quality team: Roles and responsibilities to consider
  • 8 best practices to ensure data quality at enterprise-level
  • Data quality dimensions – 10 metrics you should be measuring
  • 5 data quality processes to know before designing a DQM framework
  • Designing a framework for data quality management
  • The definitive buyer’s guide to data quality tools
  • The definitive guide to data matching

3. Present the benefits of data quality across main pillars

In this section, we will see how end-to-end data quality testing and fixing can benefit you across the six main pillars of an organization (business, finance, customer, competition, team, and technology).

a. Business

A business uses its data as a fuel across all departments and functions. Not being able to trust the authenticity and accuracy of your data can be one of the biggest disasters in any data initiative. Although all business areas benefit from good data quality, the core ones include:

i. Decision making

Instead of relying on intuitions and guesses, organizations use business intelligence and analytics results to make concrete decisions. Whether these decisions are made at an individual or a corporate level, data is utilized throughout the company to find patterns in past information so that accurate inferences can be made for the future. A lack of quality data can skew the results of your analysis, causing this approach to do more harm than good.

Read more at Improving analytics and business intelligence with clean data .

ii. Operations

Various departments such as sales, marketing, and product depend on data for effective operation of business processes. Whether you are putting product information on your website, using prospect lists in marketing campaigns, or using sales data to calculate yearly revenue, data is part of every small and big operation. Hence, good quality data can boost operational efficiency of your business, while ensuring results accuracy and reducing gaps for potential errors.

Read more at Key components that should be part of operational efficiency goals .

iii. Compliance

Data compliance standards (such as GDPR, HIPAA, and CCPA, etc.) are compelling businesses to revisit and revise their data management strategies. Under these data compliance standards, companies are obliged to protect the personal data of their customers and ensure that data owners (the customers themselves) have the right to access, change, or erase their data.

Apart from these rights granted to data owners, the standards also hold companies responsible for following the principles of transparency, purpose limitation, data minimization, accuracy, storage limitation, security, and accountability. Timely implementation of such principles becomes much easier with clean and reliable data. Hence, quality data can help you conform to essential compliance standards.

Read more at The importance of data cleansing and matching for data compliance .

b. Finances

A company's finances include an abundance of customer, employee, and vendor information, as well as the history of all transactions with these entities. Bank records, invoices, credit cards, bank sheets, and customer information are confidential data that leave no room for error. For this reason, consistent, accurate, and available data help ensure that:

  • Timely payments are made whenever due,
  • Cases of underpay and overpay are avoided,
  • Transactions to incorrect recipients are avoided,
  • The chances of fraud due to duplicate entity records are reduced, and so on.

Read more at The impact of data matching on the world of finance .

c. Customer

In this era, customers seek personalization. The only way to convince them to buy from you and not a competitor is to offer them an experience that is special to them. Make them feel they are seen, heard, and understood. To achieve this, businesses use a ton of customer-generated data to understand their behavior and preferences. If this data has serious defects, you will obviously end up inferring wrong details about your customers or potential buyers. This can lead to reduced customer satisfaction and brand loyalty.

On the other hand, having quality data increases the probability of discovering relevant buyers or leads – people who are interested in doing business with you – whereas allowing poor-quality data into your datasets adds noise and can make you lose sight of potential leads out there in the market.

Read more at Your complete guide to obtaining a 360 customer view .

d. Competition

Good data quality can help you to identify potential opportunities in the market for cross-selling and upselling. Similarly, accurate market data and understanding can help you effectively strategize your brand and product according to market needs.

If your competition leverages quality data to infer trends about market growth and consumer behavior, they will definitely leave you behind and convert potential customers more quickly. On the other hand, if wrong or incorrect data is used for such analysis, your business can be misled into making inaccurate decisions – costing you a lot of time, money, and resources.

Read more at How you can leverage your data as a competitive advantage?

e. Team

Managing data and its quality is the core responsibility of the data team, but almost everyone reaps the benefits of clean and accurate data. With good quality data, your team doesn't have to spend time correcting data quality issues every time before they can use the data in their routine tasks. Since people do not waste time on rework due to errors and gaps present in datasets, the team's productivity and efficiency improve, and they can focus their efforts on the task at hand.

Read more at Building a data quality team: Roles and responsibilities to consider .

f. Technology

Data quality can be a deal-breaker when digitizing any aspect of your organization through technology. It is quite easy to digitize a process when the data involved is structured, organized, and meaningful. On the other hand, poor data quality can be the biggest roadblock to process automation and technology adoption in most companies.

Whether you are employing a new CRM, business intelligence, or automating marketing campaigns, you won’t get the expected results if the data contains errors and is not standardized. To get the most out of your web applications or designed databases, the content of the data must conform to acceptable data quality standards.

Read more at The definitive buyer’s guide to data quality tools .

And there you have it – we went through a whole lot of information that can help you build a case for data quality in front of stakeholders or line managers. This piece is a bit different in how the benefits of data quality were presented. The reason is that, instead of highlighting six or ten areas that can be improved with quality data, I wanted to draw attention to a more crucial point: data quality impacts the main pillars of your business across many different dimensions.

Business leaders need to realize that having and using data is not even half the game. The ability to trust and rely on that data to produce consistent and accurate results is the main concern now. For this reason, companies often adopt stand-alone data quality tools for cleaning and standardizing their datasets so that the data can be trusted and used whenever and wherever needed.


Top 10 real-world data science case studies.

Data Science Case Studies

Aditya Sharma

Aditya is a content writer with 5+ years of experience writing for various industries including Marketing, SaaS, B2B, IT, and Edtech among others. You can find him watching anime or playing games when he’s not writing.

Frequently Asked Questions

Real-world data science case studies differ significantly from academic examples. While academic exercises often feature clean, well-structured data and simplified scenarios, real-world projects tackle messy, diverse data sources with practical constraints and genuine business objectives. These case studies reflect the complexities data scientists face when translating data into actionable insights in the corporate world.

Real-world data science projects come with common challenges. Data quality issues, including missing or inaccurate data, can hinder analysis. Domain expertise gaps may result in misinterpretation of results. Resource constraints might limit project scope or access to necessary tools and talent. Ethical considerations, like privacy and bias, demand careful handling.

Lastly, as data and business needs evolve, data science projects must adapt and stay relevant, posing an ongoing challenge.

Real-world data science case studies play a crucial role in helping companies make informed decisions. By analyzing their own data, businesses gain valuable insights into customer behavior, market trends, and operational efficiencies.

These insights empower data-driven strategies, aiding in more effective resource allocation, product development, and marketing efforts. Ultimately, case studies bridge the gap between data science and business decision-making, enhancing a company's ability to thrive in a competitive landscape.

Key takeaways from these case studies for organizations include the importance of cultivating a data-driven culture that values evidence-based decision-making. Investing in robust data infrastructure is essential to support data initiatives. Collaborating closely between data scientists and domain experts ensures that insights align with business goals.

Finally, continuous monitoring and refinement of data solutions are critical for maintaining relevance and effectiveness in a dynamic business environment. Embracing these principles can lead to tangible benefits and sustainable success in real-world data science endeavors.

Data science is a powerful driver of innovation and problem-solving across diverse industries. By harnessing data, organizations can uncover hidden patterns, automate repetitive tasks, optimize operations, and make informed decisions.

In healthcare, for example, data-driven diagnostics and treatment plans improve patient outcomes. In finance, predictive analytics enhances risk management. In transportation, route optimization reduces costs and emissions. Data science empowers industries to innovate and solve complex challenges in ways that were previously unimaginable.


Data Quality Case Study


  • Wendy Carter &
  • Cynthia Fodor  

Part of the book series: Health Informatics ((HI))


The poor quality of organizational data is a growing concern, one that is affecting all industries. In health care, this issue is particularly salient. Data are no longer used for only internal purposes; healthcare organizations now depend on externally generated data for clinical decision making, quality assurance, cost and utilization analysis, and benchmarking.




Editor information

Editors and Affiliations

HealthCPR.com, 6313 Fox Hunt Road, Alexandria, VA, 22307, USA
Peter Ramsaroop, MBA (Chairman and Founder, Consultant)

First Consulting Group, Baltimore, MD, 21210, USA
Peter Ramsaroop, MBA (Chairman and Founder, Consultant), Marion J. Ball, EdD (Adjunct Professor, Vice President) & Judith V. Douglas, MA, MHS (Adjunct Lecturer, Associate)

School of Nursing, Johns Hopkins University, Baltimore, MD, 21205, USA
Marion J. Ball, EdD (Adjunct Professor, Vice President) & Judith V. Douglas, MA, MHS (Adjunct Lecturer, Associate)

First Consulting Group, Avon, CT, 06001, USA
David Beaulieu (Vice President, Managing Director of the Government Practice)


Copyright information

© 2001 Springer Science+Business Media New York

About this chapter

Carter, W., Fodor, C. (2001). Data Quality Case Study. In: Ramsaroop, P., Ball, M.J., Beaulieu, D., Douglas, J.V. (eds) Advancing Federal Sector Health Care. Health Informatics. Springer, New York, NY. https://doi.org/10.1007/978-1-4757-3439-3_15

Download citation

DOI : https://doi.org/10.1007/978-1-4757-3439-3_15

Publisher Name : Springer, New York, NY

Print ISBN : 978-1-4419-2877-1

Online ISBN : 978-1-4757-3439-3

eBook Packages : Springer Book Archive



Quality Risk Management for Biopharmaceuticals

In the dynamic and highly regulated world of biopharmaceutical manufacturing, maintaining and ensuring quality is a critical success factor. An effective quality risk management (QRM) system is a key component in the overall quality management infrastructure of biopharmaceutical organizations. It offers a structured, scientific, and risk-based approach to decision-making, addressing potential quality issues during manufacturing. High performing organizations effectively implement QRM into overall quality policies and procedures to enhance and streamline decision-making.

Implementing a robust QRM system is more than just a compliance requirement. It fundamentally contributes to the organization’s commitment to patient safety, product quality, and data integrity. A robust QRM system consists of key characteristics with clearly defined processes that contribute to the system’s success.

Reviewing the Risk Compliance Data

The following graphical data show the relative compliance risk for pharmaceutical manufacturing organizations based on US Food and Drug Administration (FDA) regulatory activity (see Figure 1). Monitoring regulatory trends based on actual FDA activity provides useful insight for evaluating internal quality management system performance and proactively identifying areas of opportunity to improve overall compliance. Six major pharmaceutical regulation subparts are charted, with relative annual activity increasing significantly from 2016 to 2020. During that period, the Building and Facilities, Laboratory Controls, and Production and Process Controls subparts were the largest areas, receiving the most FDA Form 483 observations during regulatory inspections.

Those areas present increased compliance risk that would benefit from a formal review, gap analysis, and remediation to improve overall quality system performance and serve as priorities for time and resources. Regulatory risk during the COVID-19 pandemic decreased dramatically as the FDA performed few, if any, on-site investigations. However, activity during 2022 represents renewed on-site investigations with associated regulatory risk.

Figure 2 provides an annual trend of the top-cited pharmaceutical regulations from Form 483 observations issued during regulatory investigations. These regulations come from the subparts identified in Figure 1 and are the largest contributors to pharmaceutical regulatory risk. Any effort to evaluate biopharmaceutical risk should consider the specific requirements identified in these regulations and address gaps identified during formal review and gap analysis as part of a QRM plan.

A Case Study

This case study concerns a major biopharmaceutical organization that specializes in producing monoclonal antibodies (mAbs) used in the treatment of various autoimmune diseases. As part of their commitment to quality and regulatory compliance, they have implemented a robust QRM system.

The organization’s production engineers identified a potential risk in their manufacturing process. The risk was related to variability in the cell culture phase, which could potentially lead to inconsistencies in the final product’s efficacy and safety.

Risk Identification

The QRM team initiated the risk identification process using failure mode and effects analysis (FMEA) and brainstorming sessions with cross-functional teams. They identified key risk factors, such as pH imbalances, temperature fluctuations, and contamination risks during the cell culture phase.

Figure 1: Data from total Title 21 CFR Part 211 key pharmaceutical subpart citations showing relative compliance risk for pharmaceutical manufacturing organizations.

Risk Assessment

Using a risk matrix, the team assessed the potential impact and likelihood of each identified risk. They determined that temperature fluctuations posed the highest risk due to their high likelihood and potential to significantly impact product quality.
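As a simple illustration of this kind of scoring, the sketch below ranks the identified risks by the product of likelihood and impact ratings; the specific 1-5 ratings assigned to each risk are hypothetical and are shown only to demonstrate the risk-matrix logic.

```python
# Rank identified risks by a simple risk-matrix score (likelihood x impact, each rated 1-5).
risks = {
    "temperature fluctuations": {"likelihood": 4, "impact": 5},
    "pH imbalances":            {"likelihood": 3, "impact": 4},
    "contamination":            {"likelihood": 2, "impact": 5},
}

scored = sorted(
    ((name, r["likelihood"] * r["impact"]) for name, r in risks.items()),
    key=lambda item: item[1],
    reverse=True,
)
for name, score in scored:
    print(f"{name}: risk score {score}")
```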

Risk Control

The organization decided to implement additional control measures to mitigate this risk:

  • Enhanced monitoring: Installing advanced temperature monitoring systems with automatic alerts for deviations.
  • Process improvement: Optimizing the cell culture process to be more robust against minor temperature changes.
  • Employee training: Conducting extensive training for staff on the importance of maintaining optimal temperature conditions.

Risk Communication

The QRM team communicated the identified risks, their potential impact, and the planned control measures to all relevant stakeholders, including the manufacturing team, quality assurance department, and senior management.

Implementation

The proposed measures were implemented, and their effectiveness was closely monitored. This included regular review meetings and updates to the risk management plan.

The new control measures led to a significant reduction in temperature-related variability in the cell culture process. As a result, the consistency and quality of the mAbs improved, leading to enhanced patient safety and regulatory compliance.

Lessons Learned

The proactive approach to identifying and managing a critical risk in their manufacturing process demonstrated the importance of a dynamic and integrated QRM system. The case also highlighted the need for continuous monitoring and improvement in risk management practices.

Case Study Conclusion

This case study exemplifies the application of a structured QRM process in the biopharmaceutical industry. It illustrates the importance of identifying, assessing, controlling, and communicating risks in a systematic manner to ensure the production of high-quality biopharmaceutical products.

Characteristics of a Biopharmaceutical QRM System

Identification of risk is a cross-functional effort that begins in the late development stages prior to technology transfer. In the early stages, research and development (R&D) is the main contributor in the risk identification process, which is facilitated by quality and manufacturing who are participants. As manufacturing develops detailed knowledge of the new process and technology, it provides a strong perspective on potential issues and risks that may exist in day-to-day manufacturing. At this time, all teams must compromise to ensure the final technology and process transfer meet the strategic goals of launching a new product.

Once the technology transfer is complete, manufacturing takes the lead in monitoring risk, along with quality. The manufacturing team also proposes any potential changes, which are reviewed by R&D, quality, and, possibly, commercial participants. Performance metrics are developed jointly between quality and manufacturing and used to periodically report to cross-functional leaders.

The main characteristics of a robust QRM system for biopharmaceutical manufacturers are identified in the following sections.

Risk Identification

The initial step in any QRM system is the identification of potential risks. It is necessary to understand what could potentially go wrong in the manufacturing process in order to manage and mitigate these risks effectively. Elevated performance in risk identification is demonstrated by organizations that conduct risk identification with input from cross-functional subject matter experts.

This typically involves brainstorming sessions with relevant stakeholders, analysis of historical data and problem reports, and reviews of process documentation. Clear guidelines should be established for what constitutes a risk, and all identified risks should be documented and maintained in a risk register.

In addition to brainstorming sessions and historical data analysis, other tools such as FMEA, hazard identification, or process hazard analysis can be implemented for a systematic approach. Expert opinions and predictive models can also be used. A successful process should also involve reassessing the risk landscape periodically and after any significant changes. Changes requiring revalidation are a notable trigger to update risk profiles.

A robust biopharmaceutical QRM system recognizes that the process of risk identification is continuous and dynamic, adjusting to changes in procedures, equipment, materials, and the overall business environment. It also considers both internal and external sources of risk.

Risk Assessment

After identifying risks, it is crucial to evaluate them in terms of their potential impact on product quality and the probability of their occurrence. This allows the company to prioritize its risk management efforts.

Risk assessment usually involves qualitative or quantitative methods. Qualitative methods might include rating risks on a scale from low to high, whereas quantitative methods might involve statistical analysis or simulation. Risk assessment is about creating an informed understanding of the risk and considering the severity of the impact, the likelihood of occurrence, and the detectability of the risk. This aids in prioritizing resources and efforts for risk control.

The process should include risk ranking or scoring systems that can objectively evaluate and compare different risks. Detailed risk maps or matrices can be created to visualize the risk landscape. Risk assessments should be periodically reviewed and updated, especially when new information becomes available.

Risk Control

This step involves deciding on and implementing measures to mitigate the identified risks; without it, the risk management process would be incomplete. Risk control involves not only mitigating risks but also deciding whether to accept, transfer, or avoid certain risks. Risk control measures should be proportional to the significance of the risk.

Risk control could involve anything from making changes to the manufacturing process to training employees in new procedures. A key part of this step is documenting the control measures and monitoring their effectiveness over time. After devising risk control measures, a pilot test can be conducted for complex or high-stake measures to ensure their effectiveness before full-scale implementation. The measures should also be reviewed and updated regularly, and particularly after any significant incidents.

Communication and Consultation

Effective communication ensures all relevant stakeholders are aware of the risks and the steps being taken to control them. This not only fosters a culture of risk awareness, but also ensures risk management efforts are coordinated across the organization. Effective communication promotes a shared understanding of risks, risk management practices, and individual roles and responsibilities in managing risk. It should involve all levels of the organization, as well as external stakeholders when appropriate.

This could involve regular meetings, reports, or automated notifications. The key is to ensure that the right information reaches the right people at the right time. The communication process should be a two-way street, allowing feedback from all stakeholders. In addition to meetings and reports, knowledge management systems or collaboration platforms could be used to facilitate communication. Clear protocols should be established for escalation of high-priority risks.

Continuous Monitoring and Review

The risk landscape can change over time, with new risks emerging and old ones disappearing or changing in severity. Continuous monitoring and review ensure that a QRM system stays relevant and effective. Incorporating accurate trend data based on regulatory activity provides an additional level of input elevating the effectiveness of risk management activities.

This can involve regular risk assessments, audits, and reviews of risk control measures. Any changes should be documented and communicated to relevant stakeholders. Monitoring and review processes should include the risks themselves and the effectiveness of the QRM system, changes in context, and the identification of emerging risks.

Risk Management Integration

Risk management should be an integral part of all organizational processes—not a separate activity. This ensures risk considerations are a part of all decisions, rather than being an afterthought. Integrating risk management with other business processes ensures risk management is proactive rather than reactive. It allows risks to be addressed before they can cause problems.

This could involve incorporating risk management into existing process documentation, training employees on risk management, or establishing a risk management committee. This could involve the use of integrated management systems or embedding risk management into standard operating procedures. Cross-functional teams or committees could be established to oversee the integration. Key performance indicators related to risk management should be established and monitored. Audits and reviews should be scheduled regularly and triggered by significant changes or incidents. Feedback from these activities should be used to drive continuous improvement.

Root Cause Analysis

Understanding the root cause of a problem allows for more effective risk management. It helps avoid merely treating the symptoms of a problem, which can lead to recurrence. The goal of root cause analysis is to prevent recurrence of problems by addressing their underlying causes, not just the symptoms. It allows for more efficient use of resources and improves process understanding.

Techniques such as the five whys and fishbone diagrams, among others, can be used to identify root causes. Once identified, these root causes should be addressed in the risk control measures. When conducting root cause analysis, it is important to ensure a blame-free environment where all ideas are considered. Tools such as Pareto charts could be used to prioritize root causes. Root cause prioritization may also reference regulatory trends based on current regulatory activity. Corrective and preventive actions should be devised to address the root causes.

Data-Driven Decision-Making

Decisions about risk management should be based on data, not on gut feelings or intuition. This leads to more objective and effective decisions. The use of data promotes objectivity, consistency, and efficiency in decision-making. It also allows for tracking and demonstrating the performance of the QRM system. For example, high-performing organizations use available newsletters, visualizations, and regulatory trend-tracking data to gain accurate insight into compliance risk.

This might involve collecting and analyzing data on process performance, product quality, and the effectiveness of risk control measures. Decision-making tools such as decision trees or Bayesian networks can also be used. An effective process should include not only collection and analysis of data, but also data management practices to ensure data integrity and usability. Advanced data analytics or artificial intelligence could be used for predictive risk modeling. Regulatory trends and current regulatory activity are also indicators providing insight into predictive risks of regulatory audits.

Quality Culture

A strong culture of quality fosters individual accountability, intrinsic motivation, and proactive behavior in managing risk, and it ensures risk management is not the responsibility of just the quality department. Successful organizations building a strong culture of quality and compliance have notable focus and support from executive leadership. A quality culture can only succeed with outstanding support from organizational executives: the "tone at the top" significantly drives the organization's performance and adherence to quality principles.

This could involve training, recognition programs, or changes to organizational structure. It is important to regularly assess the culture of quality and adjust as needed. Activities to foster a quality culture could include workshops, training sessions, recognition programs, and team-building activities. Regular culture assessments could be conducted through surveys or interviews and the findings used to inform culture improvement initiatives.


Flexibility and Adaptability

As the organization and its external environment change, the QRM system needs to be able to adapt. A rigid system that cannot handle change will quickly become ineffective. A flexible and adaptable QRM system allows the organization to respond effectively to changes and challenges, turning them into opportunities rather than threats. It helps ensure the system’s resilience and long-term sustainability. Using data-driven metrics and tracking tools facilitates effective management of quality and compliance risk.

This can involve regular reviews of the QRM system and a defined process for making changes to it. Feedback from stakeholders should be actively sought and incorporated, and scenario analysis or stress testing can be used to evaluate and improve the system’s adaptability. A change management process should be established to handle changes in a systematic and controlled manner, with risk assessment and evaluation of any proposed change built into that process.
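
To make the idea of scenario analysis or stress testing more concrete, the sketch below compares a simple expected-loss indicator under baseline and stressed assumptions using Monte Carlo sampling. All probabilities and impact values are assumptions chosen purely for illustration.

```python
# Toy scenario analysis: expected risk exposure under baseline vs. stressed assumptions.
# All failure probabilities and impact ranges are illustrative assumptions.
import random

random.seed(0)

def simulate(prob_failure: float, impact_low: float, impact_high: float, runs: int = 10_000) -> float:
    """Return the mean simulated loss per run; impact is counted only when a failure occurs."""
    total = 0.0
    for _ in range(runs):
        if random.random() < prob_failure:
            total += random.uniform(impact_low, impact_high)
    return total / runs

baseline = simulate(prob_failure=0.02, impact_low=10_000, impact_high=50_000)
stressed = simulate(prob_failure=0.06, impact_low=20_000, impact_high=80_000)  # e.g., supplier disruption scenario

print(f"Baseline expected loss per batch: {baseline:,.0f}")
print(f"Stressed expected loss per batch: {stressed:,.0f}")
```

Comparing the two figures shows how much headroom the current controls leave if a plausible adverse scenario materializes, which is exactly the question a stress test is meant to answer.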

Compliance with Regulations

Biopharmaceutical companies operate in a heavily regulated environment. Compliance with regulations avoids legal problems and ensures products are safe and effective. It also promotes trust and credibility among stakeholders, and it provides a baseline for risk management practices.

Compliance can be ensured by keeping up to date with regulatory changes, incorporating these changes into the QRM system, and regularly auditing for compliance. Regular training should be provided to keep staff current on regulatory requirements. Regulatory intelligence activities could be conducted to anticipate and prepare for upcoming changes.

Compliance checks should be integrated into the risk assessment and review processes. Compliance reports and newsletters summarizing regulatory activity also provide valuable insight into compliance risks and trends for the life sciences, and extending this compliance data over longer time horizons reveals longer-term patterns and puts current trends into historical perspective.
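
As a small illustration of looking at compliance data over a broader time horizon, the sketch below computes a three-year rolling average of yearly regulatory-observation counts. The counts are placeholders, not real regulatory data.

```python
# Toy trend view of regulatory observations over a longer time horizon.
# Yearly counts below are fabricated placeholders for illustration only.
observations_by_year = {
    2018: 14, 2019: 11, 2020: 17, 2021: 22, 2022: 19, 2023: 25,
}

years = sorted(observations_by_year)
print("Year  Count  3-yr rolling avg")
for i, year in enumerate(years):
    window = years[max(0, i - 2): i + 1]          # up to three most recent years
    rolling = sum(observations_by_year[y] for y in window) / len(window)
    print(f"{year}  {observations_by_year[year]:>5}  {rolling:>16.1f}")
```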

Traceability and Documentation

Documentation provides evidence that the QRM system is functioning and enables traceability, which is crucial for root cause analysis, incident investigation, process validation, and demonstrating compliance with regulations. Proper documentation also preserves institutional knowledge and allows the team to learn from past experience.

Documentation should be maintained for all risk management activities, including risk identification, assessment, and control, and kept in a format that is easily accessible and understandable. Traceability can be maintained through unique identifiers for risks and control measures, links between related documents, traceability matrices, or dedicated software systems. A document management system can be used to manage and control documents; it should support version control, approval workflows, and easy retrieval.
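
A minimal sketch of how such traceability might be represented is shown below: each risk record links, via unique identifiers, to its control measures and supporting documents. The identifiers and records are hypothetical, and a production system would live in a validated document management or QMS platform rather than an in-memory dictionary.

```python
# Minimal traceability sketch: link risk IDs to control measures and supporting documents.
# All IDs and records are hypothetical examples.
from dataclasses import dataclass, field

@dataclass
class RiskRecord:
    risk_id: str
    description: str
    controls: list = field(default_factory=list)    # e.g., ["CTRL-007"]
    documents: list = field(default_factory=list)   # e.g., ["SOP-112 rev 4"]

register = {
    "RSK-001": RiskRecord("RSK-001", "Bioreactor temperature excursion",
                          controls=["CTRL-007"], documents=["SOP-112 rev 4", "DEV-2024-031"]),
    "RSK-002": RiskRecord("RSK-002", "Raw material identity mix-up",
                          controls=["CTRL-012", "CTRL-015"], documents=["SOP-089 rev 2"]),
}

# Simple traceability query: which controls and documents support a given risk?
record = register["RSK-001"]
print(f"{record.risk_id}: controls={record.controls}, documents={record.documents}")
```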

An effective QRM system in biopharmaceutical manufacturing is multifaceted, involving the identification and assessment of potential risks, robust control mechanisms, effective communication strategies, and regular monitoring and review procedures. The QRM system should be flexible and adaptable, grounded in data-driven decision-making, and deeply integrated within the organization’s culture and processes. The risk management process, which includes risk identification and mitigation, is a cross-functional effort requiring participation from R&D, quality, and manufacturing, with metrics for monitoring and reporting process effectiveness to cross-functional leaders.

Organizations performing at elevated levels consistently demonstrate an ability to incorporate risk criteria into daily operations using various tools to evaluate risk according to product and patient impact. Compliance with regulations and maintaining detailed traceability and documentation are also of paramount importance. Although implementing such a comprehensive system can be complex, the benefits of ensuring product quality and safety, and ultimately patient health, are profound. The successful deployment of QRM necessitates a continual commitment to each of these characteristics, fostering a culture of quality that permeates every aspect of the organization.
