The Semantic Data Dictionary – An Approach for Describing and Annotating Data

Sabbir M. Rashid

a Rensselaer Polytechnic Institute, Troy, NY, 12180, USA

James P. McCusker

Paulo Pinheiro, Marcello P. Bax

b Universidade Federal de Minas Gerais, Belo Horizonte, MG, 31270-901, BR

Henrique Santos

Jeanette A. Stingone

c Columbia University, Mailman School of Public Health, New York, NY, 10032, USA

Amar K. Das

d IBM Research, Cambridge, MA 02142, USA

Deborah L. McGuinness

ORCID: https://orcid.org/0000-0001-8469-4043

ORCID: https://orcid.org/0000-0003-0503-3031

ORCID: https://orcid.org/0000-0002-2110-6416

ORCID: https://orcid.org/0000-0003-3508-8260

ORCID: https://orcid.org/0000-0003-3556-0844

ORCID: https://orcid.org/0000-0001-7037-4567

Abstract

It is common practice for data providers to include text descriptions for each column when publishing datasets in the form of data dictionaries. While these documents are useful in helping an end-user properly interpret the meaning of a column in a dataset, existing data dictionaries typically are not machine-readable and do not follow a common specification standard. We introduce the Semantic Data Dictionary, a specification that formalizes the assignment of a semantic representation of data, enabling standardization and harmonization across diverse datasets. In this paper, we present the Semantic Data Dictionary in the context of our work with biomedical data; however, the approach can be and has been used in a wide range of domains. The rendition of data in this form helps promote improved discovery, interoperability, reuse, traceability, and reproducibility. We present the associated research and describe how the Semantic Data Dictionary can help address existing limitations in the related literature. We discuss our approach, present an example by annotating portions of the publicly available National Health and Nutrition Examination Survey dataset, present modeling challenges, and describe the use of this approach in sponsored research, including our work on a large NIH-funded exposure and health data portal and in the RPI-IBM collaborative Health Empowerment by Analytics, Learning, and Semantics project. We evaluate this work in comparison with traditional data dictionaries, mapping languages, and data integration tools.

1. Introduction

With the rapid expansion of data-driven applications and the growth of data science research over the past decade, data providers and users alike have relied on datasets as a means of recording and accessing information from a variety of distinct domains. Datasets are composed of distinct structures that require additional information to help users understand the meaning of the data. A common approach used by data providers involves providing descriptive information for a dataset in the form of a data dictionary, defined as a "centralized repository of information about data such as meaning, relationships to other data, origin, usage, and format" [1]. Data dictionaries are useful for many data management tasks, including aiding users in data conversion processes, testing data generation, validating data, and storing data usage criteria [2].

When storing data in a system that adheres to the structure of a particular data dictionary, that document can be used to aid validation, both when new data are entered into the system and when existing data are updated. By including additional information about the dataset itself, data dictionaries can be used to store data usage criteria. Additionally, data conversion is aided by the inclusion of the format and units of the data points, which allows users to apply conversion formulae to transform the data into another format or unit. When considering these benefits, we see that the use of data dictionaries has had a significant impact on data use and reuse. Nevertheless, we argue that data dictionaries can be improved by leveraging emerging Semantic Web technologies.

The use of data dictionaries to record descriptions of datasets and their elements has become widely adopted by data providers, often with the intent of aiding reusability. These data dictionaries are useful to data users in reducing ambiguity when interpreting dataset content. Considering the structure and annotations that traditional data dictionaries comprise, we find that, for each column header in a dataset, these documents often contain a label that is more informative than the column name, as well as a comment describing the column header. Such annotations in themselves are essential for an end-user to understand the data, as column names are often arbitrary or encoded. Existing data dictionaries often contain structural information about a dataset column, such as the format of the data, the data type, or the associated units of measurement. As this information is required for the proper analysis of data, we commend data providers for including it in their data dictionaries. For datasets that contain categorical codes, data providers have done well to document the possible values and include descriptive labels for each category.

While many publicly available datasets include documents resembling data dictionaries, we find that, across institutions, these documents do not adhere to a common metadata standard. Metadata, defined as "structured data about data" [3], should be processable by software. Existing data dictionary standards are typically aimed at human consumption and do not subscribe to models that are machine-understandable, and thus lack support for formal semantics. Consequently, tasks involving the combination of data from multiple datasets that are described using data dictionaries are not easily automated.

1.1. A need for semantics

From the dataset production perspective, datasets can convey much more information than the data itself. Dataset entries often correspond to physical observations, such as the weight of a sample, an event duration, or a person’s gender. Traditional data dictionaries do well in describing these measurements but cannot represent the measured objects. There is a need to annotate these implicit concepts (representing the measured objects) that are indispensable to a complete understanding of the data but do not correspond to columns in the dataset. Annotations of both explicit and implicit concepts allow for the conversion of a tabular format of data into a semantically richer graphical representation.

There may be a variety of ways that a data user can benefit from a semantic representation of data, such as enhanced provenance attributions, query capabilities, and the ability to infer new knowledge. We argue for the applicability of the Semantic Data Dictionary (SDD) as a standard model for representing machine-readable metadata for datasets. The SDD comprises a set of specifications formalizing the assignment of a semantic representation to data by annotating dataset columns and their values using concepts from best practice vocabularies and ontologies. It is a collection of individual documents, where each plays a role in creating a concise and consistent knowledge representation. Each of these components, described in Section 3 , is implemented using tables. In Appendix B , we provide the specifications for each of the SDD tables. Throughout the remainder of this article, we describe modeling methods, include informative examples from projects employing this approach, discuss modeling challenges, and evaluate our approach against traditional data dictionaries, mapping languages, and data integration tools.

As science moves towards a more open approach, priority has been given to publishing scientific data in a way that is Findable, Accessible, Interoperable, and Reusable (FAIR) [4]. The FAIR principles are used to evaluate the quality of published datasets or the workflow that is used to produce data. As part of our approach to evaluating our methodology, we examine adherence to the FAIR guiding principles. While we have considered these guidelines in designing our approach, and they have been adopted for many projects, the FAIR principles are not without limitations. For example, methods for the facilitation of data sharing are not specified, which may result in error perpetuation from differing interpretations of design choices, and privacy concerns need to be addressed more rigorously [5]. The use of the FAIR guidelines and traditional data integration approaches alone does not guarantee enough granularity of representation to support the pooling of data across studies, thereby limiting the potential impact for more significant statistical analyses. However, this capability has been demonstrated using the SDD approach for the Children's Health Exposure Analysis Resource (CHEAR) project [6].

1.2. Supporting biomedical research

While the SDD approach can be and has been used for the semantic annotation of data in multiple domains, we will limit our examples in this paper to the field of biomedicine. The application of semantic technologies in areas like healthcare or the life sciences has the potential to facilitate scientific research in these fields. Many vocabularies and ontologies that define concepts and relationships in a formal graphical structure have been created to describe critical terms related to anatomy, genetics, diseases, and pharmaceuticals [7, 8]. Best practice ontologies should be leveraged for the annotation of biomedical and clinical data to create knowledge representations that align with existing semantic technologies, services, and workflows. Ideally, the desired representation model would allow for improved data discovery, interoperability, and reuse, while supporting provenance, trust, traceability, and reproducibility.

Challenges arise for biomedical researchers who are unfamiliar with approaches for performing semantic annotation. Existing methods to provide machine-understandable interpretations of data are difficult for most researchers to learn [9]. The biomedical community has traditionally used data dictionaries to provide information regarding the use of a dataset. While such documents are useful for a human interpreter, they generally cannot be used by themselves to automate the creation of a structured knowledge representation of the corresponding data. We recognize the need for an approach to annotating biomedical data that feels familiar to domain scientists while adhering to Semantic Web standards and remaining machine-understandable. Since SDDs consist of tabular documents that resemble traditional data dictionaries, they can be used by biomedical scientists to annotate data naturally. In order to aid researchers who do not have a computer science background, we leverage the traits of SDDs, being both machine-readable and unambiguous, to provide interpretation software 1 that can be used to create a knowledge model that meets the desired semantic representation characteristics mentioned above.

1.3. Motivation

In Section 2.1 , we consider institutions that provide guidelines for the use of data dictionaries to record descriptive content for a dataset. While existing guidelines have helped create human-understandable documents, we believe that there is room for improvement by introducing a formalization that is machine-readable. With the current advances in Artificial Intelligence technologies, there is an increased need for data users to have annotated data that adhere to Semantic Web standards [ 10 , 11 ]. We consider the benefits of combining data from disparate sources in such a way that it can be used in a unified manner. Harmonization across datasets allows for the comparison between similar columns, using a controlled vocabulary. The ability to combine data from various sources and formats into a single cohesive knowledge base allows for the implementation of innovative applications, such as faceted browsers or data visualizers.

Data and provenance understanding refer respectively to data interpretability and the ability to discern provenance attributions, both by humans and machines. This level of knowledge is necessary for the reuse of data and the reproduction of scientific experiments. Annotation of data improves query and integration capabilities [ 12 ], and the use of Semantic Web standards enhances the ability to find the data through a web search [ 13 ]. Unfortunately, it is difficult for data users, who have a second-hand understanding of the data compared to data providers, to create these annotations themselves. As an example, a study related to data dissemination revealed that three researchers, independently analyzing a single dataset and using similar approaches, arrived at noticeably dissimilar interpretive conclusions [ 14 ].

Additionally, difficulties arise for someone without a technology background in developing competence in technical approaches, due to challenges associated with technological semantics, such as research problems being defined, clarified, and communicated in a way that is perceptible to a general audience [15]. Therefore, the desire to create a standard for people from a wide variety of domains, including those who are untrained in Computer Science and semantic technologies, is an additional motivation. Easing the semantic annotation process for these users is a significant challenge. A machine-readable standard for dataset metadata can improve data harmonization, integration, reuse, and reproducibility.

1.4. Claims

We claim that the formalism of the Semantic Data Dictionary addresses some of the limitations of existing data dictionary approaches. Traditional data dictionaries provide descriptions about the columns of a dataset, which typically represent physical measurements or characteristics, but omit details about the described entities. Existing data dictionaries do not acknowledge the notion that the data values are instances of concepts that may have relationships with other instances of concepts, such as entity-entity, attribute-attribute, or entity-attribute relations.

In contrast, the SDD approach allows for the direct annotation of concepts implicitly referenced in a dataset. Existing data dictionaries focus on the structure of the data, such as value ranges, formats, and data types, rather than its inherent meaning. Further information about the data, including the units, meaning, and associated objects, is provided in text descriptions that are not machine-interpretable. The SDD, on the other hand, focuses on the semantics of the data and includes the above information in a way that can be readily processed. The SDD consists of an intrinsic model with relationships that can be further customized, allowing the annotator to describe relationships between both explicit and implicit concepts inherent in the dataset. By considering these characteristics of SDDs, we argue that a standardized machine-readable representation for recording dataset metadata and column information is achieved.

We also claim that the SDD approach presents a level of abstraction over methodologies that use mapping languages. This is achieved by separating the annotation portion of the approach from the software component, which reduces the programming knowledge required. As a result, the SDD approach improves the ease of use for a domain scientist over other semantic tools. Additionally, by presenting the annotation component in a form that resembles traditional data dictionaries, this approach provides a bridge between the conventional data dictionary approaches used by domain scientists and the formal techniques used by Semantic Web researchers.

2. Related Work

The SDD approach leverages state-of-the-art advancements in many data and knowledge related areas: traditional data dictionaries, data integration, mapping languages, semantic extract, transform, and load (ETL) methods, and metadata standards. In this section, we present related work in each of those extensive areas by highlighting their accomplishments and discussing their limitations.

2.1. Data Dictionaries

There are several patents relating to the use of dictionaries to organize metadata [ 16 , 17 , 18 ]. However, published articles mentioning data dictionaries tend to refrain from including the associated formalism. Thus, we expanded our scope to search for data dictionaries that included standards published on the web, several of which are discussed below.

The Stony Brook Data Governance Council recommendations list required elements and present principles associated with data dictionaries. 2 However, they do not provide a way to semantically represent the data. Additionally, while data columns can be explicitly described, this approach does not allow for the description of implicit concepts that are described by the dataset, a capability we refer to as object elicitation. The ability to annotate implicit concepts (described in Section 3.2) is one of the distinguishing features of our work. The Open Science Framework 3 and the United States Government (USG) Statistical Community of Practice and Engagement (SCOPE) 4 also guide the creation of a data dictionary that includes required, recommended, and optional entries. These data dictionaries support the specification of data types and categorical values, but minimally allow for the incorporation of semantics and do not leverage existing ontologies or vocabularies. The data dictionary specifications for the Biosystematic Database of World Diptera include both general and domain-specific elements [19]. Nevertheless, use of this data dictionary outside of the biological domain appears improbable. Based on the Data Catalog Vocabulary (DCAT [20]), the Project Open Data Metadata Schema provides a data dictionary specification. 5 Of the data dictionary recommendations examined, the Project Open Data Metadata Schema was the most general and the only one to use Semantic Web standards.

There are many recommendations for constructing data dictionaries; however, we found that most are project- or domain-specific, and we find no clear evidence that they are consistently applied by users outside of these individual groups. The exploration of these data dictionaries reveals the need for a standard formalization that can be used across institutions and projects.

2.2. Data Integration Approaches

Data integration is a technique that utilizes data from multiple sources to construct a unified view of the combined data [ 21 ]. Here we consider existing approaches that have been employed to address data integration challenges.

The Semantic Web Integration Tool (SWIT) can be used to perform transformation and integration of heterogeneous data through a web interface in a manner that adheres to the Linked Open Data (LOD) principles [ 22 ]. While the writing of mapping rules is simplified through the use of a web interface, the use of this approach may still prove difficult for users without a Semantic Web background. Neo4j is designed as a graph database (GDB) system that supports data integration based on the labeled property graph (LPG) model, which consists of attributed nodes with directed and labeled edges [ 23 ]. Despite being implemented using an LPG model rather than RDF, Neo4j can read and write RDF, and by using GraphScale [ 24 ], it can further employ reasoning capabilities [ 25 ]. Nevertheless, data integration capabilities, such as using ontologies to semantically annotate data schema concepts and the associated objects, are limited.

To provide an integrated view of data collected on moving entities in geographical locations, RDF-Gen was developed as a means of SPARQL-based knowledge graph generation from heterogeneous streaming and archival data sources [26]. While this approach is promising and does support the representation of implicit objects, we find, due to the requirement of creating SPARQL-based graph transformation mappings, that it would likely be difficult for domain scientists to use. DataOps is an integration toolkit that supports the combination of data in a variety of formats, including relational databases, CSV, Excel, and others, that can be accessed via R [27]. While existing user interface components can be used to ease the annotation process and the use of DataOps in industry is expanding, the expertise required to use this approach presents a steep learning curve. OpenRefine is a standalone, open-source tool capable of cleaning and transforming large datasets [28]. Some limitations of this approach pertain to difficulties in performing subset selection, cell-based operations, and dataset merging.

It is important to note that most data integration approaches fall short when eliciting objects and relations to comprehensively characterize the semantics of the data. We continue this discussion on data integration by considering mapping languages and semantic extract, transform, and load applications.

2.2.1. Mapping Languages

In this section, we introduce mapping languages that can be used to convert a relational database (RDB), tabular file, or hierarchical structure to an RDF format and their related tool support.

The RDB to RDF Mapping Language (R2RML) is a W3C standard language for expressing mappings from relational databases to RDF datasets [29]. R2RML mappings contain properties to define the components of the mapping, including the source table, columns retrieved using SQL queries, relationships between columns, and a template for the desired output URI structure. The limitations of R2RML stem from the requirement of writing the mapping in an RDF format, the need to be familiar with the R2RML vocabulary to write mappings, and the support for only relational databases. R2RML extensions exist to address these limitations. The RDF Mapping Language (RML) extends the R2RML vocabulary to support a broader set of possible input data formats, including CSV, XML, and JSON [30]. In this regard, RML extends the R2RML logical table class to be instead defined as a logical source, which allows the user to specify the source URI, reference, reference formulation, and iterator. RML is supported by a tool to define mappings called the RMLEditor, which allows users to make edits to heterogeneous data source mappings using a graphical user interface (GUI) [31]. Both R2RML and RML are robust and provide a solid cornerstone for general RDF generation from tabular data. Still, they fall short when dealing with some particularities of our problem scenario, including the creation of implicit relationships for elicited objects and the annotation of categorical data values. The xR2RML language leverages RML to expand the R2RML vocabulary to support several additional RDF data formats as well as the mapping of non-relational databases [32]. With the use of R2RML mappings, the OpenLink Virtuoso Universal Server has an extension to import relational databases or CSV files that can then be transformed into RDF [33]. Due to the requirement of using a mapping language to specify graph transformations, a domain scientist may be reluctant to employ the above approaches.

KR2RML is an extension to R2RML addressing several of its limitations, including support for multiple input and output data formats, new serialization formats, transformations and modeling that do not rely on knowledge about domain-specific languages, and scalability when handling large amounts of data [ 34 ]. KR2RML is implemented in an open-source application called Karma. Karma is a system that uses semantics to integrate data by allowing users to import data from a variety of sources, clean and normalize the data, and create semantic descriptions for each of the data sources used [ 35 ]. Karma includes a visual interface that helps automate parts of the modeling process by suggesting proposed mappings based on semantic type assignments, and hence reduces some of the usage barriers associated with other mapping language methodologies. Nevertheless, some distinguishing factors between this and our approach include the following: when using the SDD approach, there is no need to write mapping transformation rules, and through the use of the Codebook (described in Section 3.3 ), the SDD approach supports cell value annotation.

CSV2RDF is a W3C standard for converting tabular data into RDF [ 36 ]. Introduced to address the limitation of R2RML that only relational data could be mapped, CSV2RDF extends R2RML to allow the mapping of additional structured data formats, such as CSV, TSV, XML and JSON [ 37 ]. The applicability of CSV2RDF for converting large amounts of data has been demonstrated using publicly available resources from a data portal [ 38 ]. CSV2RDF has also been used in an approach to automatically convert tabular data to RDF [ 39 ].

The Sparqlification Mapping Language (SML) progresses towards a formal model for RDB2RDF mappings, maintaining the same expressiveness as R2RML while simplifying usage by providing a more concise syntax, achieved by combining traditional SQL CREATE VIEW statements with SPARQL CONSTRUCT queries [ 40 ]. SML is intended to be a more human-readable mapping language than R2RML. The R2R Mapping Language, also based on SPARQL, is designed for writing dataset mappings represented as RDF using dereferenceable URIs [ 41 ]. While it is possible for the user to specify metadata about each mapping, the possible mappings that can be specified correspond to direct translations between the data and the vocabulary being used, rather than allowing for detailed object elicitation.

Another mapping language based on SPARQL is Tarql, where databases are referenced in FROM clauses, mappings can be specified using a SELECT or ASK clause, and RDF can be generated using a CONSTRUCT clause [42]. One limitation of this approach is that it uses SPARQL notation for tasks that were not originally intended by the grammar, rather than extending SPARQL with additional keywords. D2RQ is a declarative mapping language that allows mapped databases to be queried using SPARQL or the RDF Data Query Language (RDQL), published on the Semantic Web with the RDF Net API, reasoned over using the Jena ontology API, and accessed through the Jena model API [43]. Limitations of D2RQ include the lack of integration capabilities over multiple databases, of write operations such as CREATE, DELETE, or UPDATE, and of support for Named Graphs [44].

While many of the mapping languages above focus on the conversion of RDBs to knowledge graphs, RDB2OWL is a high-level declarative RDB-to-RDF/OWL mapping language used to generate ontologies from RDBs [45]. This is achieved by mapping the target ontology to the database structure. RDB2OWL supports the reuse of RDB table column and key information, includes an intuitive human-readable syntax for mapping expressions, allows for both built-in and user-defined functions, incorporates advanced mapping definition primitives, and allows for the utilization of auxiliary structures defined at the SQL level [45].

In addition to the difficulties associated with writing mapping transformations, we find that mapping-language-based methodologies have limited object and relation elicitation capabilities, and cell value annotation is typically not permitted. These limitations are addressed in the SDD approach.

2.2.2. Semantic Extract, Transform, & Load

Extract, transform, and load (ETL) operations refer to processes that read data from a source database, convert the data into another format, and write the data into a target database. In this section, we examine several ETL approaches that leverage semantic technologies. LinkedPipes ETL (LP-ETL) is a lightweight, linked data preparation tool supporting SPARQL queries, including debug capabilities, and can be integrated into external platforms [ 46 ]. LP-ETL contains both back-end software for performing data transformations, as well as a front-end web application that includes a pipeline editor and an execution monitor. A pipeline is defined as “a repeatable data transformation process consisting of configurable components, each responsible for an atomic data transformation task” [ 46 ]. As transformations in this approach are typically written as SPARQL construct statements, this methodology would be difficult to employ for someone who is unfamiliar with SPARQL. Semantic Extract, Transform, and Load-er (SETLr) is a scalable tool that uses the JSON-LD Template (JSLDT) language 6 for the creation of RDF from a variety of data formats [ 47 ]. This approach permits the inclusion of conditionals and loops (written in JSLDT) within the mapping file, allowing for the transformation process to iterate through the input data in interesting ways. Nevertheless, there may be a steep learning curve for researchers without a programming background to adopt this approach.

Eureka! Clinical Analytics is a web application that performs ETL on Excel spreadsheets containing phenotype data [48]. Since this application was designed for use on clinical projects, it cannot easily be generalized for use in domains outside of biomedicine. The open-source Linked Data Integration Framework (LDIF) leverages Linked Data to provide both data translation and identity resolution capabilities [49]. LDIF uses runtime environments to manage data flow between a set of pluggable modules that correspond to data access, transformation, and output components. Improvements in the framework extended the importer capabilities to allow for input in the form of RDF/XML, N-Triples, and Turtle, to import data by crawling RDF links through the use of LDspider, and to replicate data through SPARQL construct queries [50]. One limitation of LDIF is that the runtime environment that supports RDF is slower than the in-memory and cluster environment implementations that do not support RDF. Other approaches use existing semantic technologies to perform ETL [51, 52, 53]. These approaches, however, face a similar hurdle for adoption, in that they are often perceived as challenging by those unfamiliar with Semantic Web vocabularies and standards. SDDs provide a means of performing Semantic ETL without requiring the writing of complex transformation scripts.

2.3. Metadata Standards

The collection of SDD specifications that we discuss in Section 3 serves to provide a standard guideline for semantically recording the metadata associated with the dataset being annotated. In this section, we examine existing metadata standards for describing data that incorporate semantics. The ISO/IEC 11179 standard includes several components, including the (1) framework, (2) conceptual model for managing classification schemes, (3) registry metamodel and basic attributes, (4) formulation of data definitions, (5) naming and identification principles, (6) registration instructions, and (7) registry specification for datasets. 7 This standard is intended to address the semantics, representation, and registration of data. Nevertheless, a limitation of ISO/IEC 11179 is that it mainly focuses on the lifecycle management of the metadata describing data elements rather than on events associated with the data values [54]. The Cancer Data Standards Repository (caDSR) implements the ISO/IEC 11179 standard to organize a set of common data elements (CDEs) used in cancer research [55]. The Clinical Data Interchange Standards Consortium (CDISC) has produced several Unified Modeling Language (UML) models that provide schemas for expressing clinical data for research purposes [56]. However, as these schemas are based on the Health Level 7 (HL7) reference implementation model (RIM), which focuses on representing information records instead of things in the world, semantic concepts are used as codes that tag records rather than to provide types for entities.

3. The Semantic Data Dictionary

The Semantic Data Dictionary approach provides a way to create semantic annotations for the columns in a dataset, as well as for categorical or coded cell values. This is achieved by encoding mappings to terms in an appropriate ontology or set of ontologies, resulting in an aggregation of knowledge formed into a graphical representation. A well-formed SDD contains information about the objects and attributes represented or referred to by each column in a dataset, utilizing the relevant ontology URIs to convey this information in a manner that is both machine-readable and unambiguous.

The main outputs of interpreting SDDs are RDF graphs that we refer to as knowledge graph fragments, since they can be included as part of a larger knowledge graph. Knowledge graphs, or structured graph-based representations that encode information, are variably defined but often contain a common set of characteristics: (i) real world entities and their interrelations are described, (ii) classes and relations of entities are defined, (iii) interrelating of entities is allowed, and (iv) diverse domains are able to be covered [57]. We have published a number of SDD resources, such as tutorials, documentation, complete examples, and the resulting knowledge graph fragments. 8 Full sets of annotated SDDs for several public datasets are also available there.

To support the modularization and ease of adoption of the annotation process, we implement the SDD as a collection of tabular data that can be written as Excel spreadsheets or as Comma Separated Value (CSV) files. The SDD is organized into several components to help modularize the annotation process. We introduce the components here and go into further detail on each throughout the remainder of this section. A document called the Infosheet is used to specify the location of each of the SDD component tables. Furthermore, the user can record descriptive metadata about the dataset or SDD in this document. The Dictionary Mapping (DM) is used to specify mappings for the columns in the dataset that is being annotated. If only this component is included with the SDD, an interpreter can still be used to convert the data into an RDF representation. Therefore, we focus the majority of our discussion in this section on the DM table. We also briefly describe the remaining SDD components that allow for richer annotation capabilities and ease the annotation process. The Codebook is used to interpret categorical cell values, allowing the user to assign mappings for data points in addition to just the column headers. The Code Mapping table is used to specify shorthand notations to help streamline the annotation process. For example, the user can specify 'mm' to be the shorthand notation for uo:0000016, 9 the class in the Units of Measurement Ontology (UO [58]) for millimeter. The Timeline table is used to include detailed annotations for events or time intervals. Finally, the Properties table allows the user to specify custom predicates employed during the mapping process. We use Small Caps font when referring to columns in an SDD table and italics when referring to properties from ontologies. Further information on the SDD modeling process is available on the SDD documentation website. 10

3.1. Infosheet

To organize the collection of tables in the SDD, we use the Infosheet ( Appendix Table B.1 ), which contains location references for the Dictionary Mapping, Code Mapping, Timeline, Codebook, and Properties tables. The Infosheet allows for the use of absolute, relative, or web resource locations. In addition to location references, the Infosheet is used to include supplemental metadata ( Appendix Table B.2 ) associated with the SDD, such as a title, version information, description, or keywords. In this regard, the Infosheet serves as a configuration document, weaving together each of the individual pieces of the Semantic Data Dictionary and storing the associated dataset-level metadata.

The properties that are included support distribution level dataset descriptions based on the Health Care and the Life Sciences (HCLS) standards, 11 as well as the Data on the Web Best Practices (DWBP). 12 The HCLS standards contain a set of metadata concepts that should be used to describe dataset attributes. While the resulting document was developed by stakeholders working in health related domains, the properties included are general enough to be used for datasets in any domain. The DWBP were developed by a working group to better foster communications between data publishers and users, improve data management consistency, and promote data trust and reuse. The associated document lists 35 best practices that should be followed when publishing data on the web, each of which includes an explanation for why the practice is relevant, the intended outcome, possible implementation and testing strategies, and potential benefits of applying the practice.

In Section 4 , we provide an example of using the SDD approach to annotate the National Health and Nutrition Examination Survey (NHANES). An example Infosheet for the Demographics table of this dataset is provided in Appendix Table C.1 .
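As a concrete illustration, the following is a minimal sketch of how an interpreter might read an Infosheet to locate the remaining component tables; it assumes a hypothetical two-column layout with 'Attribute' and 'Value' headers and hypothetical row labels, rather than the exact specification of Appendix Table B.1.

```python
import pandas as pd

# Hypothetical row labels for the component-table location entries.
COMPONENT_ROWS = ("Dictionary Mapping", "Codebook", "Code Mapping",
                  "Timeline", "Properties")

def load_infosheet(path):
    """Split a two-column Infosheet into component-table locations
    (absolute paths, relative paths, or web URLs) and supplemental metadata."""
    sheet = pd.read_csv(path)
    entries = dict(zip(sheet["Attribute"], sheet["Value"]))
    locations = {k: v for k, v in entries.items()
                 if k in COMPONENT_ROWS and pd.notna(v)}
    # Everything else is treated as dataset-level metadata,
    # e.g. Title, Version, Description, Keywords.
    metadata = {k: v for k, v in entries.items() if k not in COMPONENT_ROWS}
    return locations, metadata

# Example usage with a hypothetical file name:
# locations, metadata = load_infosheet("nhanes_demographics_infosheet.csv")
# dm = pd.read_csv(locations["Dictionary Mapping"])
```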

3.2. Dictionary Mapping

The Dictionary Mapping (DM) table includes a row for each column in the dataset being annotated (referred to as explicit entries), and columns corresponding to specific annotation elements, such as the type of the data (Attribute, Entity), 13 label (Label), unit (Unit), format (Format), time point (Time), relations to other data columns (inRelationTo, Relation), and provenance information (wasDerivedFrom, wasGeneratedBy). Figure 1 shows the conceptual diagram of the DM. Such a representation is similar to the structure of general science ontologies, such as the Semanticscience Integrated Ontology (SIO) [59] or the Human-Aware Science Ontology (HAScO) [60]. We use SIO properties for the mapping of many of the DM columns, as shown in the Dictionary Mapping specification in Appendix Table B.3, while also leveraging the PROV-O ontology [61] to capture provenance information. Despite specifying this default set of mappings, we note that the Properties table of the SDD can be used to determine the set of predicates used in the mapping process, allowing the user to customize the foundational representation model.

Figure 1. A conceptual diagram of the Dictionary Mapping that allows for a representation model that aligns with existing scientific ontologies. The Dictionary Mapping is used to create a semantic representation of data columns. Each box, along with the "Relation" label, corresponds to a column in the Dictionary Mapping table. Blue rounded boxes correspond to columns that contain resource URIs, while white boxes refer to entities that are generated on a per-row/column basis. The actual cell value in concrete columns is, if there is no Codebook for the column, mapped as the hasValue object of the column object, which is generally either an attribute or an entity.

In addition to allowing for the semantic annotation of dataset columns, unlike traditional mapping approaches, the SDD supports the annotation of implicit concepts referenced by the data. These concepts, referred to as implicit entries, are typically used to represent the measured entity or the time of measurement. For example, for a column in a dataset for a subject’s age, the concept of age is explicitly included, while the idea that the age belongs to a human subject is implicit. These implicit entries can then be described to have a type, a role, relationships, and provenance information in the same manner as the explicit entries. For example, to represent the subject that had their age measured, we could create an implicit entry, ??subject. 14
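As a minimal sketch of how an explicit age entry and the implicit ??subject entry might be rendered in RDF, the snippet below uses rdflib with stand-in namespaces and readable stand-in predicate names (isAttributeOf, hasValue, hasUnit) in place of the actual properties fixed in the specification; the class and instance IRIs are illustrative.

```python
from rdflib import Graph, Literal, Namespace, RDF, RDFS
from rdflib.namespace import XSD

# Stand-in namespaces for illustration only; a real SDD would reference
# the resolved ontology IRIs (e.g., SIO, UO, CHEAR) chosen in the annotation.
EX = Namespace("http://example.org/nhanes/")      # generated instances
ONT = Namespace("http://example.org/ontology/")   # stand-in ontology classes
P = Namespace("http://example.org/properties/")   # stand-in predicates

g = Graph()
g.bind("ex", EX)

# One data row: participant 12345 with an age value of 34 (years).
row_id, age_value = "12345", 34

subject = EX[f"subject-{row_id}"]   # implicit entry ??subject, elicited per row
age = EX[f"age-{row_id}"]           # explicit entry for the age column

g.add((subject, RDF.type, ONT.Human))             # Entity column of ??subject
g.add((age, RDF.type, ONT.Age))                   # Attribute column of the age entry
g.add((age, RDFS.label, Literal("age at screening")))
g.add((age, P.isAttributeOf, subject))            # attributeOf
g.add((age, P.hasValue, Literal(age_value, datatype=XSD.integer)))  # cell value
g.add((age, P.hasUnit, ONT.year))                 # Unit

print(g.serialize(format="turtle"))
```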

3.2.1. Attributes and Entities

Attribute and Entity are included in the DM to allow for the type assignment of an entry. While both of these columns map to the property rdf:type, 15 they are both included as it may be semantically significant to distinguish between characteristics and objects. If an entry describes a characteristic, Attribute should be populated with an appropriate ontology class. The entity that contains the characteristic described, which can be either explicit or implicit, should be referenced in attributeOf. While columns in a dataset typically describe an observed characteristic, this is not always the case. If an entry describes an object, such as a person, place, thing, or event, Entity should be populated with an appropriate ontology class.

3.2.2. Annotation Properties and Provenance

A set of annotation properties, including comments, labels, or definitions, allows for the description of an explicit or implicit entry in further detail. While Label is the only column included in the DM Specification for an annotation property, if support for comments and definitions is included in an SDD interpreter, we recommend the use of the rdfs:comment and skos:definition predicates, respectively. In terms of including provenance, wasDerivedFrom can be used to reference pre-existing entities that are relevant in the construction of the entry, and wasGeneratedBy can be used to describe the generation activity associated with the entry.

3.2.3. Additional Dictionary Mapping Columns

The Role, Relation, and inRelationTo columns of the DM are used to specify roles and relationships associated with entries. A reference to the objects or attributes an entry is related to should be populated in inRelationTo. By populating Role, the sio:hasRole property is used to assign the specified role to the entry. Custom relationships using properties that are not included in the SDD can be specified using Relation. Events in the form of time instances or intervals associated with an entry should be referenced in Time. The unit of measurement of the data value can be specified in Unit. In general, we recommend the use of concepts in the Units of Measurement Ontology (UO) for the annotation of units, as many existing vocabularies in various domains leverage this ontology. A W3C XML Schema Definition Language (XSD) primitive data type 16 can be included in Format to specify the data type associated with the data value.

3.2.4. Dictionary Mapping Formalism

We define a formalism for the mapping of DM columns to an RDF serialization. The notation we use for formalizing the SDD tables is based on an approach for translating constraints into first-order predicate logic [62]. While most of the DM columns have one-to-one mappings, we can see the interrelation of the mapping of Role, Relation, and inRelationTo. In the formalism included below, 'Value' represents the cell value of the data point that is being mapped.
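To make the interplay among these three columns concrete, here is a minimal sketch, with stand-in predicates, of the kind of conditional logic an interpreter might apply to a single DM row; it is an illustration rather than a restatement of the formalism.

```python
from rdflib import Graph, BNode, Namespace, RDF

P = Namespace("http://example.org/properties/")   # stand-ins for hasRole/inRelationTo
g = Graph()

def map_relations(entry, role=None, relation=None, in_relation_to=None):
    """Sketch of how Role, Relation, and inRelationTo might interact for one DM row."""
    if role is not None and in_relation_to is not None:
        # The entry carries a role that is realized in relation to the target.
        role_node = BNode()
        g.add((entry, P.hasRole, role_node))
        g.add((role_node, RDF.type, role))
        g.add((role_node, P.inRelationTo, in_relation_to))
    elif relation is not None and in_relation_to is not None:
        # A custom predicate from the Relation column links entry and target directly.
        g.add((entry, relation, in_relation_to))
    elif in_relation_to is not None:
        # With no Role or Relation given, fall back to a generic relatedness link.
        g.add((entry, P.inRelationTo, in_relation_to))
```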

3.3. Codebook

The Codebook table of the SDD allows for the annotation of individual data values that correspond to categorical codes. The Codebook table contains the possible values of the codes in Code, their associated labels in Label, and a corresponding ontology concept assignment in Class. If the user wishes to map a Codebook value to an existing web resource or instance of an ontology class, rather than a reference to a concept in an ontology, Resource can be populated with the corresponding URI. We recommend that the class assigned to each code for a given column be a subclass of the attribute or entity assigned to that column. A conceptual diagram of the Codebook is shown in Figure 2(a). The Codebook Specification is provided in Appendix Table B.4. The formalism for mapping the Codebook is included below.

Figure 2. (a) A conceptual diagram of the Codebook, which can be used to assign ontology classes to categorical concepts. Unlike other mapping approaches, the use of the Codebook allows for the annotation of cell values, rather than just columns. (b) A conceptual diagram of the Timeline, which can be used to represent complex time-associated concepts, such as time intervals.
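The snippet below is a hedged sketch of Codebook interpretation, in which a coded cell value is resolved to an ontology class and asserted as the type of the generated column object; the column name, codes, labels, and class IRIs are illustrative stand-ins.

```python
from rdflib import Graph, Literal, Namespace, RDF, RDFS

EX = Namespace("http://example.org/nhanes/")
ONT = Namespace("http://example.org/ontology/")
g = Graph()

# Hypothetical Codebook rows for an education-level column:
# keyed by (Column, Code), valued by (Class, Label).
codebook = {
    ("EducationLevel", "1"): (ONT.LessThanNinthGrade, "Less than 9th grade"),
    ("EducationLevel", "2"): (ONT.NinthToEleventhGrade, "9th to 11th grade"),
}

def annotate_cell(entry, column, code):
    """Type the generated column object with the class assigned to its code,
    and attach the Codebook label for readability."""
    cls, label = codebook[(column, code)]
    g.add((entry, RDF.type, cls))
    g.add((entry, RDFS.label, Literal(label)))

annotate_cell(EX["education-12345"], "EducationLevel", "1")
```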

3.4. Code Mapping

The Code Mapping table contains mappings of abbreviated terms or units to their corresponding ontology concepts. This aids the human annotator by allowing the use of shorthand notations instead of repeating a search for the URI of the ontology class. The set of mappings used in the CHEAR project is useful for a variety of domains and is available online. 17
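A minimal sketch of how such shorthand notations might be expanded during interpretation is shown below; it assumes a simple code-to-URI dictionary and reuses the 'mm' example from above.

```python
from rdflib import URIRef

# Assumed Code Mapping entries: shorthand notation -> ontology concept URI.
code_mapping = {
    "mm": URIRef("http://purl.obolibrary.org/obo/UO_0000016"),  # UO millimeter
}

def expand(term):
    """Replace a Code Mapping shorthand with its full URI; anything not found
    (e.g., an already-resolved URI) is returned unchanged."""
    return code_mapping.get(term, term)

unit = expand("mm")  # -> the UO class for millimeter
```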

3.5. Timeline

If an implicit entry for an event included in the DM corresponds to a time interval, the implicit entry can be specified with greater detail in the Timeline table. Timeline annotations include the corresponding class of the time associated entry, the units of the entry, start and end times associated with an event entry, and a connection to other entries that the Timeline entry may be related to. Shown in Figure 2(b) is a conceptual diagram of the Timeline. The Timeline Specification is provided in Appendix Table B.5 . The formalism for mapping the Timeline is included below.

3.6. Property Customization

The Semantic Data Dictionary approach creates a linked representation of the class or collection of datasets it describes. The default model provided is based on SIO, which can be used to express a wide variety of objects using a fixed set of terms, incorporates annotation properties from RDFS and SKOS, and uses provenance predicates from PROV-O. Shown in Appendix Table B.6 are the default set of properties that we recommend.

By specifying the associated properties with specific columns of the Dictionary Mapping Table, the properties used in generating the knowledge graph can be customized. This means that it is possible to use an alternate knowledge representation model, thus making this approach ontology-agnostic. Nevertheless, we urge the user to practice caution when customizing the properties used to ensure that the resulting graph is semantically consistent (for example, not to replace an object property with a datatype property).

In the formalism presented above and the DM, CB, and TL specifications of Appendix Tables B.3, B.4, & B.5, fourteen distinct predicates are used. 18 Fourteen of the sixteen rows of the Properties Table are included to allow the alteration of any of these predicates. The two additional rows pertain to Attribute and Entity, which, like Type, by default map to rdf:type, but can be customized to use an alternate predicate if the user wishes. In this way, by allowing for the complete customization of the predicates that are used to write the formalism, the SDD approach is ontology-agnostic. Note that the predicates used in the Infosheet Metadata Supplement of Table B.2, which are based on the best practices described in Section 3.1, are not included in the Properties Specification.
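The sketch below illustrates the customization mechanism: a default column-to-predicate map (shown here with readable stand-ins rather than the exact defaults of Appendix Table B.6) is overridden by whatever rows the annotator supplies in the Properties table.

```python
from rdflib import Namespace, RDF

P = Namespace("http://example.org/properties/")     # stand-ins for the defaults
CUSTOM = Namespace("http://example.org/custom/")

# Illustrative subset of default column-to-predicate assignments.
default_properties = {
    "Attribute": RDF.type,
    "Entity": RDF.type,
    "attributeOf": P.isAttributeOf,
    "inRelationTo": P.inRelationTo,
    "Unit": P.hasUnit,
    "wasDerivedFrom": P.wasDerivedFrom,
}

# Hypothetical Properties table overriding a single predicate.
overrides = {"attributeOf": CUSTOM.characteristicOf}

properties = {**default_properties, **overrides}
# An interpreter would now emit `entry CUSTOM.characteristicOf target`
# wherever the attributeOf column is populated.
```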

4. Example – The National Health and Nutrition Examination Survey

The National Health and Nutrition Examination Survey (NHANES) contains publicly available demographic and biomedical information. A challenge in creating a knowledge representation from this dataset is determining how to represent the implicit entities referenced by the data, such as a participant of the study or the household that they live in. Additionally, information about a participant may be dispersed throughout multiple tables that consequently need to be integrated, resulting in difficulties when following traditional mapping approaches.

NHANES data dictionaries include a variable list that contains names and descriptions for the columns in a given dataset component, as well as a documentation page that consists of a component description, data processing and editing information, analytic notes, and a codebook. Unfortunately, the dataset description provided is textual and is therefore not readily processed.

We find that neither the data documentation nor the codebooks included in NHANES incorporate mappings to ontology concepts. Thus, we provide a simple example of how several columns from the NHANES Demographics dataset would be represented using the SDD approach. The terms in this example are annotated using the CHEAR, SIO, and National Cancer Institute Thesaurus (NCIT) ontologies. Shown in Tables 1, 2, and 3 are portions of the SDD we encoded for the NHANES Demographics dataset, in which we respectively present a subset of the explicit DM entries, the implicit DM entries, and the Codebook entries. An example Infosheet for the NHANES Demographics dataset is provided in Appendix Table C.1. The complete sets of explicit and implicit entries are provided in Appendix Table C.3 and Appendix Table C.2, respectively. An expanded codebook is included in Appendix Table C.4. Additional NHANES tables not included in this article were also encoded as part of this annotation effort. 19

Table 1. Subset of Explicit Entries identified in NHANES Demographics Data

Table 2. Subset of Implicit Entries identified in NHANES Demographics Data

Table 3. Subset of NHANES Demographic Codebook Entries

In Table 1 , we provide the explicit entries that would be included in the DM. The data column SEQN corresponds to the identifier of the participant. The resource created from this column can be used to align any number of NHANES tables, helping address the data integration problem. Another column included is the categorical variable that corresponds to education level. Also included are two variables that correspond to the age of the participant taking the survey and the age of the specified reference person of the household, defined as the person who owns or pays rent for the house. We see how the use of implicit entries, as well as the use of specified Code Mapping units, helps differentiate the two ages. The corresponding implicit entries referenced by the explicit entries are annotated in Table 2 .

In Table 3, we include a subset of the Codebook for this example. The SDD Codebook here is similar to the original NHANES codebook, with the addition of Column, so that multiple codebooks do not have to be created to correspond to each categorical variable, and Class, used to specify a concept from an ontology to which the coded value maps.
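As noted above, the resource created from SEQN can serve as the alignment point across NHANES tables; the following is a hedged rdflib sketch of that idea, with illustrative namespaces, predicates, attribute classes, and values.

```python
from rdflib import Graph, Literal, Namespace, RDF
from rdflib.namespace import XSD

EX = Namespace("http://example.org/nhanes/")
ONT = Namespace("http://example.org/ontology/")
P = Namespace("http://example.org/properties/")

g = Graph()

def subject_uri(seqn):
    """Both tables mint the same subject URI from SEQN, so their
    knowledge graph fragments merge on the shared participant node."""
    return EX[f"participant-{seqn}"]

# From the Demographics table: age of participant 12345.
demo_subject = subject_uri("12345")
age = EX["age-12345"]
g.add((age, RDF.type, ONT.Age))
g.add((age, P.isAttributeOf, demo_subject))
g.add((age, P.hasValue, Literal(34, datatype=XSD.integer)))

# From a hypothetical examination table: height of the same participant.
exam_subject = subject_uri("12345")            # identical URI -> same node
height = EX["height-12345"]
g.add((height, RDF.type, ONT.Height))
g.add((height, P.isAttributeOf, exam_subject))
g.add((height, P.hasValue, Literal("162.5", datatype=XSD.decimal)))

assert demo_subject == exam_subject            # fragments integrate automatically
```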

5. Current Use

In this section, we provide a case study on projects that have leveraged the SDD for health-related use cases. We focus on work done for the Health Empowerment by Analytics, Learning, and Semantics (HEALS) project, while also briefly discussing efforts in other programs. In our funded research, our sponsors often desire the representation of their data in a semantically consistent way that supports their intended applications. They wish to play a role in the annotation process by contributing their subject matter expertise. We find that the SDD approach is more accessible to domain scientists than other, more programming-intensive approaches. Additionally, they appreciate that the ability to reuse SDDs limits the number of updates needed when, for example, a data schema changes.

5.1. Health Empowerment by Analytics, Learning, and Semantics

As part of the RPI and IBM collaborative Health Empowerment by Analytics, Learning, and Semantics (HEALS) project, 20 SDDs have been used to aid in semantic representation tasks for use cases involving breast cancer and electronic health record (EHR) data.

5.1.1. Breast Cancer Use Case

For the creation of an application used for the automatic re-staging of breast cancer patients, the SDD approach was used to create a knowledge representation of patient data from the Surveillance, Epidemiology, and End Results (SEER) program [63]. In order to integrate treatment recommendations associated with a given biomarker into the application, an SDD for the Clinical Interpretation of Variants in Cancer (CIViC) database was also created. By applying the SDD approach to help solve this problem, seamless data integration between these two distinct sources was demonstrated, which would have been more difficult to achieve using some of the methods described in Section 2.2. For example, if any of the mapping language or Semantic ETL approaches were applied, writing a script that requires an intrinsic understanding of the dataset would be necessary, rather than just filling out the SDD tables. While this approach still requires an understanding of the dataset, using the SDD approach to describe the datasets mentioned above greatly reduces the data apprehension requirement placed on the user. Another advantage demonstrated by using this approach was that, since a limited set of properties is leveraged in the semantic model that was created, the cost of implementing the application, in terms of programming resources and overhead, was reduced. A subset of the explicit entries from the SEER DM is shown in Table 4.

Table 4. Subset of Explicit Entries identified in SEER

Additional cancer-related work for the HEALS project involves the annotation of a subset of The Cancer Genome Atlas (TCGA) through the NCI Genomic Data Commons (GDC) portal. While these SDDs are not included here, they are openly available on our SDD resources web-page. The clinical subset of the TCGA data that was annotated contains patient demographic and tumor information, and the methylation portion contains genetic information. By using the same ontology classes that were used for the SEER dataset to annotate these concepts, we are able to leverage TCGA data to further enrich the cancer staging application described above.

5.1.2. Electronic Health Record Data

To create a knowledge representation from electronic health record (EHR) data, we annotated the Medical Information Mart for Intensive Care III (MIMIC-III) dataset using SDDs. While this effort involved annotating 26 relational tables, we only include a subset of the Dictionary Mapping of the admission table in Table 5 . Using this approach, we can represent implicit concepts associated with the data. The inclusion of implicit concepts provides connection points for linking the various EHR data tables into a single coherent knowledge representation model that reflects the reality recorded by the data. This would be difficult to accomplish using many of the alternate approaches we examined that do not support object elicitation.

Table 5. Subset of the Dictionary Mapping for the MIMIC-III Admission table

5.2. Additional Use Cases

Several institutions are employing the Semantic Data Dictionary approach for a variety of projects. The Icahn School of Medicine at Mount Sinai uses SDDs for the NIH CHEAR project and the follow-on HHEAR project to annotate data related to demographics, anthropometry, birth outcomes, pregnancy characteristics, and biological responses. The Lighting Enabled Systems & Applications (LESA) Center is using SDDs to annotate sensor data. SDDs are being used in Brazil for the Big Data Ceara project, through Universidade de Fortaleza, and the Global Burden of Disease project, through Universidade Federal de Minas Gerais.

5.3. Remarks

In this section, we discussed how SDDs help represent knowledge for a variety of projects that involve collaborative efforts with domain scientists, exhibiting the applicability of this approach for researchers in a variety of specializations. For the HEALS project, we have shown DMs for use cases that involve breast cancer and EHR data. In addition to patient demographic characteristics from the SEER data, we encode the size of the patient's tumor, the number of lymph nodes affected, whether or not the cancer metastasized, and several genetic biomarkers. Using these data, the successful automation of re-staging breast cancer patients was accomplished. While we only show a single DM for the MIMIC-III dataset, this use case involves the annotation of multiple relational data tables and demonstrates how data integration can be performed using SDDs.

6. Modeling Challenges for Domain Scientists

An initial strategy of training followed by qualitative evaluation was used to examine the difficulties experienced by researchers who do not have a Semantic Web background when first using the Semantic Data Dictionary. Domain scientists, including epidemiologists and biostatisticians, were given initial training by a Semantic Web expert. Supporting materials were developed in collaboration with a domain expert and were then made available to provide guidance and examples to facilitate the domain scientists' use of the Semantic Data Dictionary.

First, a template for completing the Semantic Data Dictionary, with pre-populated fields for common demographic concepts such as age, race, and gender, was provided to domain scientists to use for each study. Second, a help document was created that included instructions and representations of more complex concepts, including measurements of environmental samples, measurements of biological samples, and measurements taken at specific time points. Third, a practical workshop was held in which a Semantic Web expert trained the domain scientists in semantic representation. Following the workshop and distribution of supporting materials, domain scientists completed at least one Semantic Data Dictionary for an epidemiologic study and were then asked about the challenges they faced. Although the training and workshop were conducted in a context related to epidemiology and health, the key takeaways yielded general lessons.

The first identified challenge was the representation of implicit objects implied by the features in the dataset, an uncommon representation in the public health domain. While the modeling of simple concepts may be intuitive (e.g., maternal age has a clear implicit reference to a mother), the representation of complex ideas, such as fasting blood glucose levels, proves more difficult because the implicit object, and the relationships between concepts, are not as intuitive for domain scientists. A second modeling challenge involved discussions on how to represent the time-associated concepts that power the ontology-enabled tools and allow domain scientists to harmonize data across studies. Additionally, when a concept was not found in a supporting ontology, questions arose about how best to represent it in a semantically appropriate way. In many cases, these challenges required going back to a Semantic Web expert for clarification.

To alleviate these challenges, we have refined and expanded the set of publicly available resources, which includes documentation, step-by-step modeling methods, tutorials, demonstrations, and informative examples. We increased the complexity of the examples and incorporated time-associated concepts into the initial templates and help documents. To facilitate further communication, a web-based Q&A document has been shared between the Semantic Web experts and the domain scientists to enable timely feedback and answers to specific questions on the representation of concepts and the need to generate new concepts.

In addition to the solutions presented above, we plan for future training events to explicitly demonstrate the use of the Semantic Data Dictionary. We will provide an overview of the semantic representation, as well as guidelines for using the corresponding documentation and training materials.

7. Evaluation

To evaluate the Semantic Data Dictionary approach, we draw on metrics from earlier evaluations of mapping languages [ 64 , 65 ] and on requirements of data integration frameworks. In addition to evaluating the SDD for adherence to these metrics, we survey similar work to determine the extent to which it meets the metrics in comparison. We organize the evaluation metrics into four categories, related respectively to data, semantics, the FAIR principles, and generality.

To measure the degree to which an approach meets each metric, we provide a value of 0, 0.5, or 1, depending on the extent to which an approach responds to an evaluation parameter. In general, if an approach does not meet a metric, it is given a score of 0. If it meets a metric partially, we assign a score of 0.5. We also assign this score to approaches that meet a metric by omission, such as being ontology-agnostic by not supporting the use of ontologies at all. If an approach completely meets the metric, it is given a score of 1. We list the criteria used for the assignment of numerical values below (refer to Table 6 for the complete list of categorized metrics).

High-level comparison of Semantic Data Dictionaries, Traditional Data Dictionaries, approaches involving Mapping Languages, and general Data Integration Tools

7.1. Data integration capabilities

In this category, we consider whether the approach can harmonize and ingest data, allows for the selection of data subsets, and permits data type assignment. We evaluate whether the approach is harmonizable in the sense that it can create a cohesive representation for similar concepts across columns or datasets in general. We check that knowledge generated across datasets can be compared using similar terms from a controlled set of vocabularies. For this metric, we assign a score of 0, 0.5, or 1 if harmonization is not supported, somewhat supported, or wholly supported, respectively.

Next, we consider whether the approach is ingestible, outputting data in a standard format that can be uploaded and stored (ingested) while supporting inputs of varying formats. We assign a score of 1 if the resulting data representation can be stored in a database or triplestore and if the approach can input data of varying formats. If only one of the two features is supported, we assign a score of 0.5. If neither is supported, we assign a score of 0.

Furthermore, we consider a subset selection metric, where we check whether the approach allows the user to select a subset of the data, in terms of columns and rows, on which to perform the annotation. For this metric, a score of 0 is assigned if this capability is not included in the approach. We assign a score of 0.5 if either a subset of the rows or a subset of the columns can be specified for annotation, but not both. If the approach allows the selection of both a subset of rows and a subset of columns to be annotated, we assign a score of 1.

Finally, we include the data type assignment metric, measuring the extent to which XML data types can be assigned to attributes when mapping data. We assign a score of 0 for this metric if the approach does not allow for the assignment of data types when mapping data. If the assignment of a limited set of data types that are not based on XML standards is incorporated, a score of 0.5 is assigned. If the approach allows the assignment of XML data types, a score of 1 is given.

7.2. Formal semantics capabilities

In this category, we consider whether the approach allows for object or relation elicitation, as well as value, time, or space annotation. We also check whether the resulting data representation is queryable and whether the approach supports both domain-specific and general ontology foundations. Graph materialization is the last assessment metric we apply. Data usually consist of values attributed to observations, measurements, or survey results. Dataset descriptions contain metadata, but often omit details about the objects that the values describe. For a complete semantic representation, one must also consider the ability to represent the implicit objects associated with the data points, which we measure using the object elicitation metric. If the approach does not include the ability to represent implicit objects, a score of 0 is assigned. If implicit objects are considered but not annotated in detail, we assign a score of 0.5. We assign a score of 1 if implicit objects can be represented and richly annotated.

In addition to being able to represent implicit concepts, we consider relation elicitation, where relationships between implicitly elicited objects can be represented. A score of 0 is assigned if an approach does not allow for the representation of relationships between elicited objects. If relationships between elicited objects can be represented, but not annotated in detail, a score of 0.5 is assigned. We assign a score of 1 if relationships between elicited objects can be represented and richly annotated.

Next, we consider whether the resulting representation is queryable, so that specific data points can be easily retrieved using a query language. A score of 0 is assigned for this metric if specific content from the knowledge representation cannot be queried. If it can be queried using a relational querying method, such as SQL, but not a graph querying method, a score of 0.5 is assigned. If content can be queried using a graph querying method, such as SPARQL, we assign a score of 1.
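
As an illustration of this metric, the snippet below runs a SPARQL query over a graph produced by a sketch like the illustrative MIMIC-III example earlier in this article; the file name and the sio:isAttributeOf / sio:hasValue pattern are assumptions carried over from that example, not a prescribed query interface.

```python
# Minimal sketch: querying an SDD-generated graph with SPARQL via rdflib.
# "admissions-demo.ttl" and the property pattern are hypothetical placeholders.
from rdflib import Graph

g = Graph()
g.parse("admissions-demo.ttl", format="turtle")

results = g.query("""
    PREFIX sio: <http://semanticscience.org/resource/>
    SELECT ?object ?value WHERE {
        ?attribute sio:isAttributeOf ?object ;
                   sio:hasValue ?value .
    }
""")
for obj, value in results:
    print(obj, value)
```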

We further consider the annotation of cell values, rather than just column headers, using the value annotation metric. This covers the ability to annotate categorical cell values, assign units to non-categorical cell values, and specify attribute mappings of object properties related to cell values. If the approach does not allow for the annotation of cell values at all, or allows only a limited set of annotations for cell values, we assign scores of 0 and 0.5, respectively. We assign a score of 1 if the approach can annotate categorical cell values, assign units to non-categorical cell values, and specify attribute mappings of object properties related to cell values.
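
The short sketch below illustrates the distinction this metric draws: a Codebook-style mapping resolves a categorical code to an ontology class, while a non-categorical cell receives an XSD datatype and a unit. The column names, codes, class IRIs, and unit IRI are hypothetical placeholders, not entries from any SDD discussed in this article.

```python
# Hypothetical sketch of value annotation: categorical codes resolved via a
# codebook-style mapping, non-categorical values given a datatype and a unit.
from rdflib import Graph, Literal, Namespace, RDF, XSD

SIO = Namespace("http://semanticscience.org/resource/")  # label-style terms for readability
OBO = Namespace("http://purl.obolibrary.org/obo/")
EX = Namespace("http://example.org/demo/")                # hypothetical instance namespace

# Placeholder codebook: (column, code) -> ontology class.
codebook = {
    ("GENDER", "1"): OBO.PATO_0000384,  # illustrative mapping to a "male" class
    ("GENDER", "2"): OBO.PATO_0000383,  # illustrative mapping to a "female" class
}

g = Graph()
# Categorical cell: the raw code is replaced by the class it maps to.
g.add((EX["gender-0"], RDF.type, codebook[("GENDER", "1")]))
# Non-categorical cell: the value keeps an XSD datatype and is paired with a unit.
g.add((EX["age-0"], SIO.hasValue, Literal("63", datatype=XSD.integer)))
g.add((EX["age-0"], SIO.hasUnit, EX.year))                # placeholder unit IRI
print(g.serialize(format="turtle"))
```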

We consider the ability to represent specific scientific concepts, including time and space. Using the time annotation metric, we check for the ability to use timestamps to annotate time-series values, as well as named time instances to annotate cell values. A score of 0 is assigned for this metric if an approach does not allow for the representation of time. If the approach allows for the representation of time, but does not permit detailed annotations, we assign a score of 0.5. We assign a score of 1 if the approach allows for detailed annotation of time, such as the use of timestamps to annotate time-series values and named time instances to annotate cell values.

The space annotation metric is added to check for the use of semantic coordinate systems to annotate the acquisition location of measurements. We assign a score of 0 if an approach does not allow for the representation of space. If it allows for the representation of space but does not permit detailed annotations, we assign a score of 0.5. A score of 1 is assigned if the use of semantic coordinate systems to annotate the acquisition location of measurements is supported.

We examine domain knowledge support by checking whether the approach permits the design of mappings using pre-existing domain-specific ontologies or controlled vocabularies. A score of 0 is assigned for this metric if the approach does not permit the design of reusable mappings driven by domain knowledge. We assign a score of 0.5 if it permits the design of reusable mappings using either pre-existing ontologies or controlled vocabularies, but not both. If annotations from both pre-existing ontologies and controlled vocabularies are allowed, we assign a score of 1.

Using the top-level ontology foundation metric, we consider the ability to use general upper ontologies as a foundation for the resulting model. If an approach cannot specify mapping rules based on foundation ontologies, a score of 0 is assigned for this metric. If a subset of mapping rules based on general foundation ontologies can be specified, we assign a score of 0.5. A score of 1 is assigned if the approach allows for the specification of all mapping rules based on general foundation ontologies. Essentially, we are checking whether the semantic model that results from the annotation approach is structured based on a given ontology. While we recommend the use of well-known upper ontologies such as SIO or the Basic Formal Ontology (BFO [ 66 ]), in evaluating this metric we allow the approach to leverage any ontology.

Finally, with the graph materialization metric, we assess the persistence of the generated knowledge graph into an accessible endpoint or file. If the approach does not allow for the materialization of the generated graph, a score of 0 is assigned. If the generated graph is persisted into an accessible endpoint or a downloadable file, but not both, a score of 0.5 is assigned. If materialization into both an accessible endpoint and a downloadable file is supported, we assign a score of 1.
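
A minimal sketch of the downloadable-file side of this metric, assuming the hypothetical graph file from the earlier examples; loading into a live endpoint would instead use the triplestore's bulk-load or SPARQL UPDATE facilities, which vary by product.

```python
# Minimal sketch: persisting a generated graph as a downloadable Turtle file.
from rdflib import Graph

g = Graph()
g.parse("admissions-demo.ttl", format="turtle")  # hypothetical generated graph
g.serialize(destination="admissions-demo-export.ttl", format="turtle")
```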

In the FAIR category, we consider the metrics associated with the FAIR guiding principles, including whether the approach and resulting artifacts are findable, accessible, interoperable, and reusable. Furthermore, we also consider the related metrics of reproducibility and transparency, which are not included in the FAIR acronym. While several of the metrics we measure in the other categories of our evaluation aid in the creation of FAIR data, such as the incorporation of provenance or the inclusion of documentation as discussed in Section 7.3.1 , we include these six metrics in the FAIR category since they are directly associated with the intent of the principles in enhancing data reuse and are explicitly discussed in the introductory article on the FAIR principles [ 4 ].

For the findable metric, we consider the use of unique persistent identifiers, such as URLs, as well as the inclusion of web searchable metadata so that the knowledge is discoverable on the web. If the knowledge representation is neither persistent nor discoverable, we assign a score of 0 for this metric. If the knowledge representation is one of the two, we assign a score of 0.5. A score of 1 is assigned if the knowledge representation is both persistent and discoverable.

We consider a knowledge representation to be accessible if resources are openly available using standardized communication protocols, with the consideration that data that cannot be made publicly available are accessible through authentication. Accessibility also includes the persistence of metadata, so that even if data are retired or made unavailable, their description still exists on the Web. As an additional consideration for evaluating accessibility, we examine whether or not the associated software for an approach is free and publicly available. If resources and metadata are not published openly, a score of 0 is assigned for this metric. If some resources and metadata are persistent and openly available, we assign a score of 0.5. A score of 1 is assigned if all of the resources and metadata from a given approach are both persistent and openly available using standardized communication protocols.

For the interoperable metric, we consider the use of structured vocabularies, such as best practice ontologies, that are RDF compliant. Mainly, we are checking to see if the knowledge representation is published using an RDF serialization. If the knowledge representation does not use a structured vocabulary, a score of 0 is assigned. If it uses structured vocabularies that are not RDF compliant, we assign a score of 0.5. A score of 1 is assigned if the knowledge representation uses formal vocabularies or ontologies that are RDF compliant.

To test whether an approach or the resulting knowledge representation is reusable, we consider the inclusion of a royalty-free license that permits unrestricted reuse, and the availability of consent or terms-of-agreement documents when applicable. We also consider whether the included metadata about the resource are detailed enough for a new user to understand. A score of 0 is assigned for this metric if an approach does not include a royalty-free license. If a royalty-free license that permits unrestricted use of some portions of the tool is included, a score of 0.5 is assigned. We assign a score of 1 if the approach includes a royalty-free license that permits unrestricted use of all portions of the tool.

We examine whether an approach is reproducible in terms of the scientific activities introduced within a given methodology, such that experiments can be independently conducted and verified by an outside party. If the approach creates a knowledge representation that cannot be reproduced, a score of 0 is assigned. If the knowledge representation can be reproduced by an outside party only with the help of the involved party, rather than entirely independently, we assign a score of 0.5. A score of 1 is assigned if the approach for creating a knowledge representation can be reproduced independently.

Finally, we consider whether data and software are transparent, such that there are no “black boxes” in the process of creating a knowledge representation. Transparency is readily achieved by making the software openly available. If the associated code for a given approach is not openly accessible, we assign a score of 0. We assign a score of 0.5 if some of the associated code is open while other portions are not openly available; this generally applies to approaches that offer both free and paid versions of the software. If all of the associated code for an approach is open source, a score of 1 is given.

7.3.1. Generality assessment

To evaluate the generality of an approach, we investigate whether or not the method is domain-agnostic, is ontology-agnostic, and adheres to existing best practices. We weigh whether the method incorporates provenance attributions, is machine-understandable, and contains documents to aid the user, such as documentation, tutorials, or demonstrations.

We analyze whether an approach is domain-agnostic, in that its usage is not restricted to a particular domain. A score of 0 is assigned for this metric if the approach only applies to a single field of study. If the approach applies to multiple fields of study but does not work for specific domains, a score of 0.5 is assigned. We assign a score of 1 if the approach can be generalized to any area of study.

In a similar vein, we judge whether the method is ontology-agnostic, where usage is not limited to a particular ontology or set of ontologies. If the approach depends on a particular ontology or set of ontologies, a score of 0 is assigned. If the dependence on particular ontologies is unclear from the examined literature and documentation, we assign a score of 0.5. A score of 1 is assigned for this metric if the approach is independent of any particular ontology.

We examine the literature and documentation associated with a given approach or knowledge representation to see if it leverages best practices. In particular, we consider the applicable best practices related to the HCLS and DWBP guidelines. Among the practices we test for are the ability of the approach to incorporate descriptive metadata, license and provenance information, version indicators, and standardized vocabularies, and to use locale-neutral data representations. A score of 0 is assigned if the literature associated with an approach does not acknowledge or adhere to existing best practice standards. If existing standards are acknowledged but not adhered to, or only partially adhered to, we assign a score of 0.5. If the literature acknowledges and adheres to existing best practices, a score of 1 is assigned.

We consider the inclusion of provenance, involving the capture of source information, such as attributions for how a data point was measured or derived. A score of 0 is assigned for this metric if the approach does not include attributions to source or derivation information. If attribution information that does not use Semantic Web standards is included, we assign a score of 0.5. If the approach records attributions using a Semantic Web vocabulary, such as the PROV-O ontology, a score of 1 is assigned.

In terms of documentation, we further search for the inclusion of assistive documents, tutorials, and demonstrations. We assign a score of 0 for this metric if no more than one of documentation, tutorials, or demonstrations is included. If two or all three of the above are provided, we assign scores of 0.5 or 1, respectively.

Finally, we consider the machine-readable metric, determining whether the resulting knowledge representation from an approach is discernible by software. In addition to considering the machine-readability of output artifacts, such as the produced knowledge graphs, we also examine input artifacts, such as the document that contains the set of semantic mappings. If neither input nor output artifacts can be parsed using software, a score of 0 is assigned for this metric. If either input or output artifacts can be parsed, but not both, a score of 0.5 is assigned. We assign a score of 1 if both input and output artifacts are machine-readable.
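
The sketch below shows how this check can be applied in practice to both SDD artifacts, under the assumption of hypothetical file names: the input Dictionary Mapping is an ordinary tabular file, and the output knowledge graph is standard RDF, so both parse with off-the-shelf libraries.

```python
# Sketch of the machine-readability check on both artifacts; file names are hypothetical.
import pandas as pd
from rdflib import Graph

dm = pd.read_csv("mimic_admissions_dm.csv")                    # input artifact: the DM table
graph = Graph().parse("admissions-demo.ttl", format="turtle")  # output artifact: the generated graph

print(len(dm.index), "DM rows;", len(graph), "triples in the generated graph")
```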

In Table 6 , we provide a high-level comparison between the Semantic Data Dictionary, traditional data dictionaries, mapping languages and the semantic approaches that leverage them, and data integration tools. Of the conventional data dictionaries examined in Section 2.1 , we use the Project Open Data Metadata Schema data dictionary for comparison, since it was the only reviewed guideline that used a standard linked data vocabulary. Of the mapping languages, we use R2RML for comparison, as it is a standard that is well adopted by the Semantic Web community. Of the data integration tools we surveyed, we use Karma for this evaluation, as it is an example of a data integration approach that was designed with both the FAIR principles and ease of use for the end-user in mind. Rather than using only these approaches in conducting the evaluation, we treat these examples as guidelines and consider traditional data dictionaries, mapping languages, and data integration tools in general when assigning numerical scores.

We have demonstrated the benefits of using a standardized machine-readable representation for recording dataset metadata and column information, which is achieved through SDDs, over earlier data dictionary formats. Furthermore, we demonstrate that the SDD approach presents a level of abstraction over methodologies that use mapping languages, allowing improved ease of use for a domain scientist over other semantic tools. In this regard, SDDs provide a bridge between the conventional data dictionary approaches used by domain scientists and the formal semantic approaches used by Semantic Web researchers, thereby accommodating both user groups. We recognize that existing RDF mapping tools are also intended to provide such a bridge by reducing the manual mapping or knowledge graph creation work that would otherwise be necessary, but acknowledge that they may remain difficult for domain scientists to use.

9. Discussion

In presenting this work, we consider two general types of users: those using SDDs to semantically annotate data, and those using SDDs in place of traditional data dictionaries in order to understand the data being described. For the first group of users, the benefits of using SDDs include that the annotation process is accessible to users outside of the Semantic Web domain and that existing SDDs can be reused to ease the creation of new annotations. Benefits for the second group include that (i) whereas data descriptions in existing data dictionaries traditionally can be understood only by humans, SDDs can be interpreted by machines as well, (ii) SDDs are written using fixed vocabularies, which reduces ambiguity, and (iii) the SDD provides a standard specification that can be used to interpret existing annotations.

By including a fixed set of tables for the annotator to fill out, which are interpreted and converted using a standard set of rules, the SDD framework provides consistency by creating a formal semantic representation using direct RDF mappings, increasing the likelihood that diverse annotators will create similar representations. This is in contrast with other mapping approaches, where multiple annotators are much less likely to produce similar results when addressing the same dataset. The SDD approach reduces such representational biases because it abstracts structural modeling decisions away from the user, both cultivating scalability of production and lowering the barrier to entry, since not all of the authors have to be computer scientists. Moreover, the vocabulary used in an SDD can easily be updated by replacing terms in any of the tables, whereas similar updates are much more difficult when using standard mapping methods. An advantage of these features of the SDD is that users can focus on their topic of specialization rather than on the RDF, reducing the need for domain scientists to also become ontology experts. Given a recommended set of ontologies to use, any user should be able to create their own SDD for a given dataset.

From the evaluation in Section 7 , we find that in the data category, SDDs perform much better than traditional data dictionaries and on par with mapping languages and data integration tools. SDDs outperform the three other approaches in the semantics category. In terms of semantics, a notable impact of this work is our approach to object and relation elicitation, where detailed annotations for objects implicitly referenced by the data can be included. SDDs and mapping languages perform equally well in the FAIR category, surpassing the scores of data integration tools and traditional data dictionaries. SDDs, mapping languages, and data integration tools tie for the best performance in the generality category, greatly outperforming traditional data dictionaries. While traditional DDs performed the worst over all four categories, they do outperform mapping languages and data integration tools in the value annotation metric.

10. Conclusion

While the use of SDDs addresses many of the shortcomings associated with the prior art, we acknowledge several limitations of this approach. In Section 6 , we mention several challenges faced by epidemiologists in creating SDDs. We found that the domain scientists had difficulties representing complex ideas, implicit concepts, and time associations. Additionally, determining the best ontology term to use when creating annotations was not always clear. These challenges relate to the limitation that this approach relies to some extent on the annotator having knowledge of relevant ontologies in the domain of discourse. Several steps to help alleviate these challenges are discussed in Section 6 .

Another limitation of this approach is that it currently supports only the annotation of tabular data. Adopting techniques from some of the methods discussed in Section 2.2.2 can help with a future extension to support XML data. Support for the annotation of unstructured text data is beyond the scope of this work. Finally, we acknowledge that the annotation process discussed in this article is mostly manual. This limitation decreases the likelihood of adoption by those wishing to streamline the annotation process or incorporate the approach as part of a larger workflow. While automated annotation is not yet supported, ongoing research on an SDD editor, conducted by members of the Tetherless World Constellation (TWC), involves the incorporation of Natural Language Processing (NLP) techniques to suggest concepts from ontologies based on text descriptions.

Our approach was outperformed in a few of the evaluation metrics, including space annotation, domain knowledge support, and the leveraging of best practices. Space annotation is supported to some degree through the use of implicit entries and property customization. Nevertheless, the SDD approach received a 0.5 rather than a 1 for this metric because it does not readily allow for the incorporation of longitude and latitude coordinates, unlike Karma, which supports the annotation of geospatial data and provides tutorials on annotating such data as well as tools developed specifically for geospatial data integration [ 67 , 68 , 69 ]. While the SDD approach allows the use of domain ontologies during the annotation process, a score of 0.5 was assigned to the domain knowledge support metric since we have not developed tools that suggest to the user the most appropriate domain concept to use. Nevertheless, as mentioned above, ongoing work on an SDD editor will leverage NLP techniques to provide this capability. Finally, while many of the DWBP and HCLS recommendations are incorporated into our approach, a score of 0.5 was received in terms of leveraging best practices because additional standards from these guidelines have yet to be incorporated. Additionally, further alignment with the standards mentioned in Section 2.3 should be achieved. The best practices relevant to our approach have been a subject of much discussion; further incorporation of these recommendations will be included in future revisions.

An ideal knowledge model promotes improved discovery, interoperability, reuse, traceability, and reproducibility. The knowledge model resulting from the SDD approach adheres to Semantic Web standards, resulting in improved discovery on the web, as well as interoperability with systems that also use RDF data serializations. These artifacts are reusable, as SDD tables created for one dataset can be reused to annotate another similar dataset. Scientific studies involving SDDs are traceable and reproducible by design, as the artifacts designed during the modeling process can be published and shared, helping to ensure consistency for other researchers attempting to examine the studies.

In this work, we advance the state of the art of dataset metadata capture by improving on existing standards with the formalization of the Semantic Data Dictionary specification, which produces machine-readable knowledge representations by leveraging Semantic Web technologies. This is achieved by formalizing the assignment of a semantic representation of data and annotating dataset columns and their values using concepts from best practice ontologies. We provide resources such as documentation, examples, tutorials, and modeling guidelines to aid those who wish to create their own Semantic Data Dictionaries. We claim that this approach and the resulting artifacts are FAIR, help address limitations of traditional data dictionaries, and provide a bridge between the representation methods used by domain scientists and semantic mapping approaches. We evaluate this work by defining metrics over several relevant categories and scoring the Semantic Data Dictionary, traditional data dictionaries, mapping languages, and data integration tools on each metric. As we provide a methodology to aid in scientific workflows, this work eases the semantic annotation process for data providers and users alike.

Acknowledgements

This work is supported by the National Institute of Environmental Health Sciences (NIEHS) Award 0255-0236-4609 / 1U2CES026555-01, IBM Research AI through the AI Horizons Network, and the CAPES Foundation Senior Internship Program Award 88881.120772 / 2016-01. We acknowledge the members of the Tetherless World Constellation (TWC) and the Institute for Data Exploration and Applications (IDEA) at Rensselaer Polytechnic Institute (RPI) for their contributions, including Rebecca Cowan, John Erickson, and Oshani Seneviratne.

Author Biography

Sabbir M. Rashid is a Ph.D. student at Rensselaer Polytechnic Institute working with Professor Deborah McGuinness on research related to data annotation and harmonization, ontology engineering, knowledge representation, and various forms of reasoning. Prior to attending RPI, Mr. Rashid completed a double major at Worcester Polytechnic Institute, where he received B.S. degrees in both Physics and Electrical & Computer Engineering. Much of his graduate studies at RPI have involved the research discussed in this article. His current work includes the application of deductive and abductive inference techniques over linked health data, such as in the context of chronic diseases like diabetes.

James P. McCusker is the Director of Data Operations at the Tetherless World Constellation at Rensselaer Polytechnic Institute. He works with Deborah McGuinness on using knowledge graphs to further scientific research, especially in biomedical domains. He has worked on applying semantics to numerous projects, including drug repurposing using systems biology, cancer genome resequencing, childhood health and environmental exposure, analysis of sea ice conditions, and materials science. He is the architect of the open source Whyis knowledge graph development and management framework, which has been used across many of these domains.

Paulo Pinheiro is a data scientist and software engineer managing projects at the frontier between artificial intelligence and databases. His areas of expertise include data policies and information assurance, such as security and privacy; data operations, including curation, quality monitoring, semantic integration, provenance management, and uncertainty assessment; data visualization; and data analytics, including automated reasoning. Paulo holds a Ph.D. in Computer Science from the University of Manchester, UK.

Marcello P. Bax is a professor and researcher in the Postgraduate Program in Knowledge Management and Organization (PPG-GOC) at the School of Information Science at the Federal University of Minas Gerais, Brazil. Prior to joining the School of Information Science, Dr. Bax was a postdoctoral fellow in the Computer Science Department at UFMG, a leading Computer Science research group in Latin America. Dr. Bax spent a year on sabbatical with Professor McGuinness’ group and the Tetherless World Constellation at RPI, during which he worked with the coauthors on the research described in this article. His research seeks to develop methods for the curation of scientific data, with a focus on semantic annotation and the goal of building curatorial repositories for data reuse and the reproduction of scientific research results.

Henrique Santos is a Research Scientist in the Tetherless World Constellation at Rensselaer Polytechnic Institute, where he researches and applies Semantic Web technologies in multidisciplinary domains for supporting more flexible, more efficient, and improved solutions in comparison with traditional approaches. His research interests include data integration, knowledge representation, domain-specific reasoning, and explainable artificial intelligence. He has over 10 years of experience working with Semantic Web technologies and holds a Ph.D. in Applied Informatics from Universidade de Fortaleza.

Jeanette A. Stingone is an Assistant Professor in the Department of Epidemiology at Columbia University’s Mailman School of Public Health. She couples data science techniques with epidemiologic methods to address research questions in children’s environmental health. She currently leads the Data Science Translation and Engagement Group of the Human Health and Exposure Analysis Resource Data Center. In this role, she supports the use of metadata standards and ontologies for data harmonization efforts across disparate studies of environmental health. Dr. Stingone’s interests also include the use of collective science initiatives to advance public health research.

Amar K. Das is the Program Director of Integrated Care Research at IBM Research and an Adjunct Associate Professor of Biomedical Data Science at Dartmouth College. His research activities include the development of biomedical ontologies and Semantic Web technologies for clinical decision support, information retrieval, and machine learning. In his role in the RPI-IBM HEALS initiative, Dr. Das is the IBM technical lead for advancing knowledge representation and reasoning in healthcare. Dr. Das holds an M.D. and Ph.D. in Biomedical Informatics from Stanford University, and has completed a residency in Psychiatry and a postdoctoral fellowship in Clinical Epidemiology at Columbia University/New York State Psychiatric Institute.

Deborah L. McGuinness is the Tetherless World Senior Constellation Chair and Professor of Computer and Cognitive Science. She is also the founding director of the Web Science Research Center at Rensselaer Polytechnic Institute. Dr. McGuinness has been recognized as a fellow of the American Association for the Advancement of Science (AAAS) for contributions to the Semantic Web, knowledge representation, and reasoning environments, and as the recipient of the Robert Engelmore award from the Association for the Advancement of Artificial Intelligence (AAAI) for leadership in Semantic Web research and in bridging Artificial Intelligence (AI) and eScience, significant contributions to deployed AI applications, and extensive service to the AI community. Deborah is a leading authority on the Semantic Web and has been working in knowledge representation and reasoning environments for over 30 years; she leads the research group that designed and implemented the research presented in this paper.

Appendix A. Namespace Prefixes

Namespace Prefixes and IRIs for Relevant Ontologies

Appendix B. Specifications

Due to the subjective nature of deciding the importance of each component, the rows in each of the specifications are shown in alphabetical order rather than in a meaningful sequence.

Appendix B.1. Infosheet Specification

Infosheet Specification

Infosheet Metadata Supplement

Appendix B.2. Dictionary Mapping Specification

Dictionary Mapping Specification

Appendix B.3. Codebook Specification

Codebook Specification

Appendix B.4. Timeline Specification

Timeline Specification

Appendix B.5. Properties Specification

Properties Specification

Appendix C. National Health and Nutrition Examination Survey Annotations

NHANES Demographics Infosheet

NHANES Demographic Implicit Entries

NHANES Demographic Explicit Entries

Expanded NHANES Demographic Codebook Entries

1 https://github.com/tetherless-world/SemanticDataDictionary

2 https://www.stonybrook.edu/commcms/irpe/about/data_governance/_files/DataDictionaryStandards.pdf

3 https://help.osf.io/hc/en-us/articles/360019739054-How-to-Make-a-Data-Dictionary

4 https://github.com/USG-SCOPE/data-dictionary/blob/gh-pages/Metadata-Scheme-for-Data-Dictionaries.md

5 https://project-open-data.cio.gov/v1.1/schema/

6 https://github.com/tetherless-world/setlr/wiki/JSLDT-Template-Language

7 http://metadata-standards.org/11179/

8 https://tetherless-world.github.io/sdd/resources

9 A listing of ontology prefixes used in this article is provided in Appendix Table A.1 .

10 https://tetherless-world.github.io/sdd/

11 https://www.w3.org/TR/hcls-dataset/

12 https://www.w3.org/TR/dwbp/

13 When referencing columns from any of the SDD tables, the Small Caps typeface is used.

14 When including implicit entries in an SDD table, the prefix “??” is used as a distinguishing labeling feature. The typewriter typeface is used in this article when referring to instances of implicit entries.

15 The italics typeface is used when a property from an ontology is mentioned.

16 https://www.w3.org/TR/xmlschema11-2/

17 https://github.com/tetherless-world/chear-ontology/blob/master/code_mappings.csv

18 rdf:type, sio:isAttributeOf, rdfs:comment, skos:definition, sio:hasStartTime, sio:existsAt, sio:hasEndTime, sio:inRelationTo, rdfs:label, sio:hasRole, sio:hasUnit, sio:hasValue, prov:wasDerivedFrom, & prov:wasGeneratedBy

19 https://tetherless-world.github.io/sdd/resources

20 See https://science.rpi.edu/biology/news/ibm-and-rensselaer-team-research-chronic-diseases-cognitive-computing or https://idea.rpi.edu/research/projects/heals for more information.

Advertisement

Issue Cover

  • Next Article

1. INTRODUCTION

2. related work, 3. the semantic data dictionary, 4. example – the national health and nutrition examination survey, 5. current use, 6. modeling challenges for domain scientists, 7. evaluation, acknowledgements, appendix a. namespace prefixes, appendix b. specifications, appendix c. national health and nutrition examination survey annotations, the semantic data dictionary – an approach for describing and annotating data.

  • Cite Icon Cite
  • Open the PDF for in another window
  • Permissions
  • Article contents
  • Figures & tables
  • Supplementary Data
  • Peer Review
  • Search Site

Sabbir M. Rashid , James P. McCusker , Paulo Pinheiro , Marcello P. Bax , Henrique O. Santos , Jeanette A. Stingone , Amar K. Das , Deborah L. McGuinness; The Semantic Data Dictionary – An Approach for Describing and Annotating Data. Data Intelligence 2020; 2 (4): 443–486. doi: https://doi.org/10.1162/dint_a_00058

Download citation file:

  • Ris (Zotero)
  • Reference Manager

It is common practice for data providers to include text descriptions for each column when publishing data sets in the form of data dictionaries. While these documents are useful in helping an end-user properly interpret the meaning of a column in a data set, existing data dictionaries typically are not machine-readable and do not follow a common specification standard. We introduce the Semantic Data Dictionary, a specification that formalizes the assignment of a semantic representation of data, enabling standardization and harmonization across diverse data sets. In this paper, we present our Semantic Data Dictionary work in the context of our work with biomedical data; however, the approach can and has been used in a wide range of domains. The rendition of data in this form helps promote improved discovery, interoperability, reuse, traceability, and reproducibility. We present the associated research and describe how the Semantic Data Dictionary can help address existing limitations in the related literature. We discuss our approach, present an example by annotating portions of the publicly available National Health and Nutrition Examination Survey data set, present modeling challenges, and describe the use of this approach in sponsored research, including our work on a large National Institutes of Health (NIH)-funded exposure and health data portal and in the RPI-IBM collaborative Health Empowerment by Analytics, Learning, and Semantics project. We evaluate this work in comparison with traditional data dictionaries, mapping languages, and data integration tools.

With the rapid expansion of data-driven applications and the expansion of data science research over the past decade, data providers and users alike have relied on data sets as a means for recording and accessing information from a variety of distinct domains. Data sets are composed of distinct structures that require additional information to help users understand the meaning of the data. A common approach used by data providers involves providing descriptive information for a data set in the form of a data dictionary, defined as a “centralized repository of information about data such as meaning, relationships to other data, origin, usage, and format” [ 1 ]. Data dictionaries are useful for many data management tasks, including aiding users in data conversion processes, testing data generation, validating data, and storing data usage criteria [ 2 ].

When storing data into a system that adheres to the structure of a particular data dictionary, that document can be used to aid in validation both when inputting new data into the system or updating existing data. By including additional information about a data set itself, data dictionaries can be used to store data usage criteria. Additionally, data conversion is aided by the inclusion of the format and units of the data points, which allows users to use conversion formulae to convert the data into another format or unit. When considering these benefits, we see that the use of data dictionaries has had a significant impact on data use and reuse. Nevertheless, we argue that data dictionaries can be improved by leveraging emerging Semantic Web technologies.

The use of data dictionaries to record descriptions about data sets and their elements has become widely adopted by data providers, often with the intent of aiding reusability. These data dictionaries are useful to data users in reducing ambiguity when interpreting data set content. Considering the structure and annotations that traditional data dictionaries are comprised of, we find that for each column header in a data set, these documents often contain a label that is more informative than the column name, as well as a comment describing the column header. Such annotations in themselves are essential for an end-user to understand the data, as column names are often arbitrary or encoded. Existing data dictionaries often contain structural information about a data set column, such as the format of the data, the data type, or the associated units of measurement. As this information is required for the proper analysis of data, we commend data providers for including it in their data dictionaries. For data sets that contain categorical codes, data providers have done well to document the possible values and include descriptive labels for each category.

While many publicly available data sets include documents resembling data dictionaries, we find that, across institutions, these documents do not adhere to a common metadata standard. Metadata, defined as “structured data about data” [ 3 ], should be able to be processed using software. Existing data dictionary standards typically are aimed at human consumption and do not subscribe to models that are machine-understandable, and thus lack support for formal semantics. Consequently, tasks involving the combination of data from multiple data sets that are described using data dictionaries are not easily automated.

1.1 A Need for Semantics

From the data set production perspective, data sets can convey much more information than the data themselves. Data set entries often correspond to physical observations, such as the weight of a sample, an event duration, or a person's gender. Traditional data dictionaries do well in describing these measurements but cannot represent the measured objects. There is a need to annotate these implicit concepts (representing the measured objects) that are indispensable to a complete understanding of the data but do not correspond to columns in the data set. Annotations of both explicit and implicit concepts allow for the conversion of a tabular format of data into a semantically richer graphical representation.

There may be a variety of ways that a data user can benefit from a semantic representation of data, such as enhanced provenance attributions, query capabilities, and the ability to infer new knowledge. We argue for the applicability of the Semantic Data Dictionary (SDD) as a standard model for representing machine-readable metadata for data sets. The SDD comprises a set of specifications formalizing the assignment of a semantic representation to data by annotating data set columns and their values using concepts from best practice vocabularies and ontologies. It is a collection of individual documents, where each plays a role in creating a concise and consistent knowledge representation. Each of these components, described in Section 3, is implemented using tables. In   Appendix B , we provide the specifications for each of the SDD tables. Throughout the remainder of this article, we describe modeling methods, include informative examples from projects employing this approach, discuss modeling challenges, and evaluate our approach against traditional data dictionaries, mapping languages, and data integration tools.

As science moves towards a more open approach, priority has been given to publishing scientific data in a way that is Findable, Accessible, Interoperable, and Reusable (FAIR) [ 4 ]. The FAIR principles are used to evaluate the quality of published data sets or the workflow that is used to produce data. As part of our approach to evaluating our methodology, we examine adherence to the FAIR guiding principles. While we have considered guidelines in designing our approach, and they have been adopted for many projects, the FAIR principles are not without limitations. For example, methods for the facilitation of data sharing are not specified, which may result in error perpetuation from differing interpretations of design choices, and more vigorous privacy concerns need to be addressed [ 5 ]. The use of the FAIR guidelines and traditional data integration approaches alone do not guarantee enough granularity of representation to support the pooling of data across studies, thereby limiting the potential impact for more significant statistical analyses. However, this capability has been demonstrated using the SDD approach for the Children's Health Exposure Analysis Resource (CHEAR) project [ 6 ].

1.2 Supporting Biomedical Research

While the SDD approach can and has been used for the semantic annotation of data in multiple domains, we will limit our examples in this paper to the field of biomedicine. The application of semantic technologies in areas like healthcare or the life sciences has the potential to facilitate scientific research in these fields. Many vocabularies and ontologies that define concepts and relationships in a formal graphical structure have been created to describe critical terms related to anatomy, genetics, diseases, and pharmaceuticals [ 7 , 8 ]. Best practice ontologies should be leveraged for the annotation of biomedical and clinical data to create knowledge representations that align with existing semantic technologies, services, and workflows. Ideally, the desired representation model would allow for improved data discovery, interoperability, and reuse, while supporting provenance, trust, traceability, and reproducibility.

Challenges arise for biomedical researchers who are unfamiliar with approaches for performing semantic annotation. Existing methods to provide machine-understandable interpretations of data are difficult for most researchers to learn [ 9 ]. The biomedical community has traditionally used data dictionaries to provide information regarding the use of a data set. While such documents are useful for a human interpreter, they generally cannot be used by themselves to automate the creation of a structured knowledge representation of the corresponding data. We recognize the need for an approach for annotating biomedical data that feel familiar to domain scientists while adhering to Semantic Web standards and machine-understandability. Since SDDs consist of tabular documents that resemble traditional data dictionaries, they can be used by biomedical scientists to annotate data naturally. In order to aid researchers who do not have a computer science background, we leverage the traits of SDDs, being both machine-readable and unambiguous, to provide interpretation software ① that can be used to create a knowledge model that meets the desired semantic representation characteristics mentioned above.

1.3 Motivation

In Section 2.1, we consider institutions that provide guidelines for the use of data dictionaries to record descriptive content for a data set. While existing guidelines have helped create human-understandable documents, we believe that there is room for improvement by introducing a formalization that is machine-readable. With the current advances in Artificial Intelligence technologies, there is an increased need for data users to have annotated data that adhere to Semantic Web standards [ 10 , 11 ]. We consider the benefits of combining data from disparate sources in such a way that it can be used in a unified manner. Harmonization across data sets allows for the comparison between similar columns, using a controlled vocabulary. The ability to combine data from various sources and formats into a single cohesive knowledge base allows for the implementation of innovative applications, such as faceted browsers or data visualizers.

Data and provenance understanding refer respectively to data interpretability and the ability to discern provenance attributions, both by humans and machines. This level of knowledge is necessary for the reuse of data and the reproduction of scientific experiments. Annotation of data improves query and integration capabilities [ 12 ], and the use of Semantic Web standards enhances the ability to find the data through a Web search [ 13 ]. Unfortunately, it is difficult for data users, who have a second-hand understanding of the data compared to data providers, to create these annotations themselves. As an example, a study related to data dissemination revealed that three researchers, independently analyzing a single data set and using similar approaches, arrived at noticeably dissimilar interpretive conclusions [ 14 ]. Additionally, difficulties arise for someone without a technology background to develop competence in technical approaches, due to challenges associated with technological semantics, such as research problems being defined, clarified, and communicated in a way that is perceptable by a general audience [ 15 ]. Therefore, the desire to create a standard for people from a wide variety of domains, including those who are untrained in Computer Science and semantic technologies, is an additional motivation. Easing the semantic annotation process for these users is a significant challenge. A machine-readable standard for data set metadata can improve data harmonization, integration, reuse, and reproducibility.

We claim that the formalism of the Semantic Data Dictionary addresses some of the limitations of existing data dictionary approaches. Traditional data dictionaries provide descriptions about the columns of a data set, which typically represent physical measurements or characteristics, but omit details about the described entities. Existing data dictionaries do not acknowledge the notion that the data values are instances of concepts that may have relationships with other instances of concepts, such as entity-entity, attribute-attribute, or entity-attribute relations.

In contrast, the SDD approach allows for the direct annotation of concepts implicitly referenced in a data set. Existing data dictionaries focus on the structure of the data rather than the inherent meaning, including value ranges, formats, and data types. Further information about the data, including the units, meaning, and associated objects, is provided in text descriptions that are not machine-interpretable. The SDD, on the other hand, focuses on the semantics of the data and includes the above information in a way that is readily able to be processed. The SDD consists of an intrinsic model with relationships that can be further customized, allowing the annotator to describe relationships between both explicit and implicit concepts inherent in the data set. By considering these characteristics of SDDs, we argue that a standardized machine-readable representation for recording data set metadata and column information is achieved.

We also claim that the SDD approach presents a level of abstraction over methodologies that use mapping languages. This is achieved by simplifying the programming knowledge requirements by separating the annotation portion of the approach from the software component. As a result, the SDD approach improves the ease of use for a domain scientist over other semantic tools. Additionally, by presenting the annotation component in a form that resembles traditional data dictionaries, this approach provides a bridge between the conventional data dictionary approaches, used by domain scientists, and the formal techniques used by Semantic Web researchers.

The SDD approach leverages state-of-the-art advancements in many data and knowledge related areas: traditional data dictionaries, data integration, mapping languages, semantic extract-transform-load (ETL) methods, and metadata standards. In this section, we present related work in each of those extensive areas by highlighting their accomplishments and discussing their limitations.

2.1 Data Dictionaries

There are several patents relating to the use of dictionaries to organize metadata [ 16 , 17 , 18 ]. However, published articles mentioning data dictionaries tend to refrain from including the associated formalism. Thus, we expanded our scope to search for data dictionaries that included standards published on the Web, several of which are discussed below.

The Stony Brook Data Governance Council recommendations list required elements and presented principles associated with data dictionaries ② . However, the ability to semantically represent the data is not permitted. Additionally, while data columns can be explicitly described, this approach does not allow the description of implicit concepts that are being described by the data set, which we refer to as object elicitation. The ability to annotate implicit concepts (described in Section 3.2) is one of the distinguishing features of our work. The Open Science Framework ③ and the United States Government (USG) Statistical Community of Practice and Engagement (SCOPE) ④ also guide the creation of a data dictionary that includes required, recommended, and optional entries. These data dictionaries support the specification of data types and categorical values, but minimally allow for the encorporation of semantics and do not leverage existing ontologies or vocabularies. The data dictionary specifications for the Biosystematic Database of World Diptera include both general and domain-specific elements [ 19 ]. Nevertheless, use of this data dictionary outside of the biological domain appears improbable. Based on the Data Catalog Vocabulary (DCAT [ 20 ]), the Project Open Data Metadata Schema provides a data dictionary specification ⑤ . Of the data dictionaries' recommendations examined, the Project Open Data Metadata Schema was the most general and the only one to use Semantic Web standards.

There are many recommendations for constructing data dictionaries; however, we found that most are project- or domain-specific, and we find no clear evidence that they are consistently applied by users outside of these individual groups. The exploration of these data dictionaries reveals the need for a standard formalization that can be used across institutions and projects.

2.2 Data Integration Approaches

Data integration is a technique that utilizes data from multiple sources to construct a unified view of the combined data [ 21 ]. Here we consider existing approaches that have been employed to address data integration challenges.

The Semantic Web Integration Tool (SWIT) can be used to perform transformation and integration of heterogeneous data through a Web interface in a manner that adheres to the Linked Open Data (LOD) principles [ 22 ]. While the writing of mapping rules is simplified through the use of a Web interface, the use of this approach may still prove difficult for users without a Semantic Web background. Neo4j is designed as a graph database (GDB) system that supports data integration based on the labeled property graph (LPG) model, which consists of attributed nodes with directed and labeled edges [ 23 ]. Despite being implemented using an LPG model rather than Resource Description Framework (RDF), Neo4j can read and write RDF, and by using GraphScale [ 24 ], it can further employ reasoning capabilities [ 25 ]. Nevertheless, data integration capabilities, such as using ontologies to semantically annotate data schema concepts and the associated objects, are limited.

To provide an integrated view of data collected on moving entities in geographical locations, RDF-Gen was developed as a means of SPARQL-based knowledge graph generation from heterogeneous streaming and archival data sources [ 26 ]. While this approach is promising and does support the representation of implicit objects, we find that the requirement of creating SPARQL-based graph transformation mappings would likely make it difficult for domain scientists to use. DataOps is an integration toolkit that supports the combination of data in a variety of formats, including relational databases, Comma Separated Value (CSV), Excel, and others, which can be accessed via R [ 27 ]. While existing user interface components can be used to ease the annotation process and the use of DataOps in industry is expanding, the expertise required to use this approach presents a steep learning curve. OpenRefine is a standalone, open-source tool capable of cleaning and transforming large data sets [ 28 ]. Some limitations of this approach pertain to difficulties in performing subset selection, cell-based operations, and data set merging.

It is important to note that most data integration approaches fall short when eliciting objects and relations to comprehensively characterize the semantics of the data. We continue this discussion on data integration by considering mapping languages and semantic extract-transform-load (ETL) applications.

2.2.1 Mapping Languages

In this section, we introduce mapping languages that can be used to convert a relational database (RDB), tabular file, or hierarchical structure to an RDF format and their related tool support.

The RDB to RDF Mapping Language (R2RML) is a W3C standard language for expressing mappings from relational databases to RDF data sets [ 29 ]. R2RML mappings contain properties to define the components of the mapping, including the source table, columns retrieved using SQL queries, relationships between columns, and a template for the desired output Uniform Resource Identifier (URI) structure. R2RML's limitations stem from the requirement that mappings be written in RDF, the need to be familiar with the R2RML vocabulary in order to write mappings, and its support for relational databases only. R2RML extensions exist to address these limitations. The RDF Mapping Language (RML) extends the R2RML vocabulary to support a broader set of possible input data formats, including CSV, XML, and JSON [ 30 ]. In this regard, RML generalizes the R2RML logical table class to a logical source, which allows the user to specify the source URI, reference, reference formulation, and iterator. RML is supported by a mapping definition tool called the RMLEditor, which allows users to edit heterogeneous data source mappings using a graphical user interface (GUI) [ 31 ]. Both R2RML and RML are robust and provide a solid cornerstone for general RDF generation from tabular data. Still, they fall short when dealing with some particularities of our problem scenario, including the creation of implicit relationships for elicited objects and the annotation of categorical data values. The xR2RML language leverages RML to expand the R2RML vocabulary to support additional RDF data formats as well as the mapping of non-relational databases [ 32 ]. Using R2RML mappings, the OpenLink Virtuoso Universal Server provides an extension to import relational databases or CSV files, which can then be transformed into RDF [ 33 ]. Because a mapping language must be used to specify graph transformations, a domain scientist may be reluctant to employ the above approaches.
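
To illustrate the authoring burden that such languages place on the user, the following is a minimal, hypothetical R2RML mapping for a relational table of patients; the table, column, and vocabulary names are assumptions made for this sketch, and rdflib is used only to show that the mapping itself must be written as an RDF document using the R2RML vocabulary.

```python
# A minimal, hypothetical R2RML mapping (table, column, and class names are
# illustrative) parsed with rdflib to show that the mapping must be authored
# as an RDF document using the R2RML vocabulary.
from rdflib import Graph

r2rml_mapping = """
@prefix rr: <http://www.w3.org/ns/r2rml#> .
@prefix ex: <http://example.org/vocab/> .

<http://example.org/mapping/PatientMap>
    rr:logicalTable [ rr:tableName "patient" ] ;
    rr:subjectMap [
        rr:template "http://example.org/patient/{id}" ;
        rr:class ex:Patient
    ] ;
    rr:predicateObjectMap [
        rr:predicate ex:age ;
        rr:objectMap [ rr:column "age" ]
    ] .
"""

g = Graph()
g.parse(data=r2rml_mapping, format="turtle")
print(f"{len(g)} mapping triples parsed")
```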

KR2RML is an extension to R2RML addressing several of its limitations, including support for multiple input and output data formats, new serialization formats, transformations and modeling that do not rely on knowledge about domain-specific languages, and scalability when handling large amounts of data [ 34 ]. KR2RML is implemented in an open-source application called Karma. Karma is a system that uses semantics to integrate data by allowing users to import data from a variety of sources, clean and normalize the data, and create semantic descriptions for each of the data sources used [ 35 ]. Karma includes a visual interface that helps automate parts of the modeling process by suggesting proposed mappings based on semantic type assignments, and hence reduces some of the usage barriers associated with other mapping language methodologies. Nevertheless, some distinguishing factors between this and our approach include the following: when using the SDD approach, there is no need to write mapping transformation rules, and through the use of the Codebook (described in Section 3.3), the SDD approach supports cell value annotation.

CSV2RDF is a W3C standard for converting tabular data into RDF [ 36 ]. Introduced to address the limitation of R2RML that only relational data could be mapped, CSV2RDF extends R2RML to allow the mapping of additional structured data formats, such as CSV, TSV, XML and JSON [ 37 ]. The applicability of CSV2RDF for converting large amounts of data has been demonstrated using publicly available resources from a data portal [ 38 ]. CSV2RDF has also been used in an approach to automatically convert tabular data to RDF [ 39 ].

The Sparqlification Mapping Language (SML) progresses towards a formal model for RDB2RDF mappings, maintaining the same expressiveness as R2RML while simplifying usage by providing a more concise syntax, achieved by combining traditional SQL CREATE VIEW statements with SPARQL CONSTRUCT queries [ 40 ]. SML is intended to be a more human-readable mapping language than R2RML. The R2R Mapping Language, also based on SPARQL, is designed for writing data set mappings represented as RDF using “dereferenceable” URIs [ 41 ]. While it is possible for the user to specify metadata about each mapping, the possible mappings that can be specified correspond to direct translations between the data and the vocabulary being used, rather than allowing for detailed object elicitation.

Another mapping language based on SPARQL is Tarql, where databases are referenced in FROM clauses, mappings can be specified using a SELECT or ASK clause, and RDF can be generated using a CONSTRUCT clause [ 42 ]. One limitation of this approach is that it uses SPARQL notation for tasks that the grammar was not originally intended to support, rather than extending SPARQL with additional keywords. The D2RQ mapping language is a declarative language that allows for querying mapped databases with SPARQL, querying through the use of the RDF Data Query Language (RDQL), publication of a database on the Semantic Web with the RDF Net API, reasoning over database content using the Jena ontology API, and accessing database information through the Jena model API [ 43 ]. Some limitations of D2RQ include the lack of integration capabilities over multiple databases, of write operations such as CREATE, DELETE, or UPDATE, and of support for Named Graphs [ 44 ].

While many of the mapping languages above focus on the conversion of RDBs to knowledge graphs, RDB2OWL is a high-level declarative RDB-to-RDF/OWL mapping language used to generate ontologies from RDBs [ 45 ]. This is achieved by mapping the target ontology to the database structure. RDB2OWL supports the reuse of RDB table column and key information, includes an intuitive human-readable syntax for mapping expressions, allows for both built-in and user-defined functions, incorporates advanced mapping definition primitives, and allows for the utilization of auxiliary structures defined at the SQL level [ 45 ].

In addition to the difficulties associated with writing mapping transformations, we find that mapping-language-based methodologies have limited object and relation elicitation capabilities, and cell value annotation is typically not permitted. These limitations are addressed in the SDD approach.

2.2.2 Semantic Extract-Transform-Load

Extract-transform-load (ETL) operations are processes that read data from a source database, convert the data into another format, and write the data into a target database. In this section, we examine several ETL approaches that leverage semantic technologies. LinkedPipes ETL (LP-ETL) is a lightweight, linked data preparation tool supporting SPARQL queries, including debug capabilities, and can be integrated into external platforms [ 46 ]. LP-ETL contains both back-end software for performing data transformations and a front-end Web application that includes a pipeline editor and an execution monitor. A pipeline is defined as “a repeatable data transformation process consisting of configurable components, each responsible for an atomic data transformation task” [ 46 ]. As transformations in this approach are typically written as SPARQL CONSTRUCT statements, this methodology would be difficult to employ for someone who is unfamiliar with SPARQL. Semantic extract-transform-load-er (SETLr) is a scalable tool that uses the JSON-LD Template (JSLDT) language ⑥ for the creation of RDF from a variety of data formats [ 47 ]. This approach permits the inclusion of conditionals and loops (written in JSLDT) within the mapping file, allowing the transformation process to iterate through the input data in flexible ways. Nevertheless, there may be a steep learning curve for researchers without a programming background to adopt this approach.
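
As an illustration of the kind of transformation such pipelines expect users to author, the following sketch runs a small SPARQL CONSTRUCT query with rdflib; the raw: predicates are hypothetical, and readable stand-ins are used in place of SIO's numeric term IRIs.

```python
# A sketch of the kind of SPARQL CONSTRUCT transformation that pipeline tools
# such as LP-ETL expect users to write. The raw: predicates are hypothetical;
# rdflib is used here only to execute the query over a toy source graph.
from rdflib import Graph

source = Graph()
source.parse(data="""
    @prefix raw: <http://example.org/raw/> .
    raw:row1 raw:age "34" ; raw:participant raw:p1 .
""", format="turtle")

result = source.query("""
    PREFIX sio: <http://semanticscience.org/resource/>
    PREFIX raw: <http://example.org/raw/>
    CONSTRUCT {
        ?person sio:hasAttribute ?row .
        ?row sio:hasValue ?age .
    }
    WHERE {
        ?row raw:age ?age ;
             raw:participant ?person .
    }
""")

print(result.graph.serialize(format="turtle"))
```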

Eureka! Clinical Analytics is a Web application that performs ETL on Excel spreadsheets containing phenotype data [ 48 ]. Since this application was designed for use on clinical projects, it cannot easily be generalized for use in domains outside of biomedicine. The open-source Linked Data Integration Framework (LDIF) leverages Linked Data to provide both data translation and identity resolution capabilities [ 49 ]. LDIF uses runtime environments to manage data flow between a set of pluggable modules that correspond to data access, transformation, and output components. Improvements to the framework extended the importer capabilities to allow input in the form of RDF/XML, N-Triples, and Turtle, to import data by crawling RDF links through the use of LDspider, and to replicate data through SPARQL CONSTRUCT queries [ 50 ]. One limitation of LDIF is that the runtime environment that supports RDF is slower than the in-memory and cluster environments, which do not support RDF. Other approaches use existing semantic technologies to perform ETL [ 51 , 52 , 53 ]. These approaches, however, face a similar hurdle for adoption, in that they are often perceived as challenging by those unfamiliar with Semantic Web vocabularies and standards. SDDs provide a means of performing Semantic ETL without requiring the writing of complex transformation scripts.

2.3 Metadata Standards

The collection of SDD specifications that we discuss in Section 3 serves to provide a standard guideline for semantically recording the metadata associated with the data set being annotated. In this section, we examine existing metadata standards for describing data that incorporate semantics. The ISO/IEC 11179 standard includes several components, including the (1) framework, (2) conceptual model for managing classification schemes, (3) registry metamodel and basic attributes, (4) formulation of data definitions, (5) naming and identification principles, (6) registration instructions, and (7) registry specification for data sets ⑦ . This standard is intended to address the semantics, representation, and registration of data. Nevertheless, a limitation of ISO/IEC 11179 is that it mainly focuses on the lifecycle management of the metadata describing data elements rather than of events associated with the data values [ 54 ]. The Cancer Data Standards Repository (caDSR) implements the ISO/IEC 11179 standard to organize a set of common data elements (CDEs) used in cancer research [ 55 ]. The Clinical Data Interchange Standards Consortium (CDISC) has produced several Unified Modeling Language (UML) models that provide schemas for expressing clinical data for research purposes [ 56 ]. However, as these schemas are based on the Health Level 7 (HL7) reference implementation model (RIM), which focuses on representing information records instead of things in the world, semantic concepts are used as codes that tag records rather than to provide types for entities.

3 The Semantic Data Dictionary Approach

The Semantic Data Dictionary approach provides a way to create semantic annotations for the columns in a data set, as well as for categorical or coded cell values. This is achieved by encoding mappings to terms in an appropriate ontology or set of ontologies, resulting in an aggregation of knowledge in the form of a graph-based representation. A well-formed SDD contains information about the objects and attributes represented or referred to by each column in a data set, utilizing the relevant ontology URIs to convey this information in a manner that is both machine-readable and unambiguous.

The main outputs of interpreting SDDs are RDF graphs that we refer to as knowledge graph fragments, since they can be included as part of a larger knowledge graph. Knowledge graphs, or structured graph-based representations that encode information, are variably defined but often share a common set of characteristics: (i) real world entities and their interrelations are described, (ii) classes and relations of entities are defined, (iii) entities are allowed to be interrelated, and (iv) diverse domains are able to be covered [ 57 ]. We have published a number of SDD resources, such as tutorials, documentation, complete examples, and the resulting knowledge graph fragments ⑧ . Full sets of annotated SDDs for several public data sets are also available there.

To support the modularization and ease of adoption of the annotation process, we implement the SDD as a collection of tabular data that can be written as Excel spreadsheets or as CSV files. The SDD is organized into several components to help modularize the annotation process. We introduce the components here and go into further detail on each throughout the remainder of this section. A document called the Infosheet is used to specify the location of each of the SDD component tables. Furthermore, the user can record descriptive metadata about the data set or SDD in this document. The Dictionary Mapping (DM) is used to specify mappings for the columns in the data set that is being annotated. If only this component is included with the SDD, an interpreter can still be used to convert the data into an RDF representation. Therefore, we focus the majority of our discussion in this section on the DM table. We also briefly describe the remaining SDD components, which allow for richer annotation capabilities and ease the annotation process. The Codebook is used to interpret categorical cell values, allowing the user to assign mappings for data points in addition to just the column headers. The Code Mapping table is used to specify shorthand notations to help streamline the annotation process. For example, the user can specify ‘mm’ to be the shorthand notation for uo:0000016 ⑨ , the class in the Units of Measurement Ontology (UO [ 58 ]) for millimeter. The Timeline table is used to include detailed annotations for events or time intervals. Finally, the Properties table allows the user to specify custom predicates employed during the mapping process. We use SmallCaps font when referring to columns in an SDD table and italics when referring to properties from ontologies. Further information on the SDD modeling process is available on the SDD documentation website ⑩ .

3.1 Infosheet

To organize the collection of tables in the SDD, we use the Infosheet ( Appendix Table B.1 ), which contains location references for the Dictionary Mapping, Code Mapping, Timeline, Codebook, and Properties tables. The Infosheet allows for the use of absolute, relative, or Web resource locations. In addition to location references, the Infosheet is used to include supplemental metadata ( Appendix Table B.2 ) associated with the SDD, such as a title, version information, description, or keywords. In this regard, the Infosheet serves as a configuration document, weaving together each of the individual pieces of the Semantic Data Dictionary and storing the associated data set-level metadata.

The properties that are included support distribution-level data set descriptions based on the Health Care and Life Sciences (HCLS) standards ⑪ , as well as the Data on the Web Best Practices (DWBP) ⑫ . The HCLS standards contain a set of metadata concepts that should be used to describe data set attributes. While the resulting document was developed by stakeholders working in health-related domains, the properties included are general enough to be used for data sets in any domain. The DWBP were developed by a working group to better foster communication between data publishers and users, improve data management consistency, and promote data trust and reuse. The associated document lists 35 best practices that should be followed when publishing data on the Web, each of which includes an explanation for why the practice is relevant, the intended outcome, possible implementation and testing strategies, and potential benefits of applying the practice.

In Section 4, we provide an example of using the SDD approach to annotate the National Health and Nutrition Examination Survey (NHANES). An example Infosheet for the demographics table of this data set is provided in Appendix Table C.1 .
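
As a concrete illustration of the Infosheet's role as a configuration document, the following is a minimal, hypothetical Infosheet written as a two-column attribute/value table and loaded with pandas; the file names and metadata values are assumptions made for this sketch, and the full set of recognized attributes is given in the Infosheet specification.

```python
# A minimal, hypothetical Infosheet: a two-column attribute/value table that
# points an SDD interpreter at the component tables and records data set
# metadata. File names and metadata values below are illustrative.
import io
import pandas as pd

infosheet_csv = """Attribute,Value
Dictionary Mapping,nhanes_demographics_dm.csv
Codebook,nhanes_demographics_codebook.csv
Code Mapping,code_mappings.csv
Title,NHANES Demographics Semantic Data Dictionary
Version,1.0
"""

infosheet = pd.read_csv(io.StringIO(infosheet_csv)).set_index("Attribute")["Value"]
print(infosheet["Dictionary Mapping"])  # location of the DM table
```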

3.2 Dictionary Mapping

The Dictionary Mapping (DM) table includes a row for each column in the data set being annotated (referred to as explicit entries), and columns corresponding to specific annotation elements, such as the type of the data (Attribute, Entity) ⑬ , label (Label), unit (Unit), format (Format), time point (Time), relations to other data columns (inRelationTo, Relation), and provenance information (wasDerivedFrom, wasGeneratedBy). Figure 1 shows the conceptual diagram of the DM. Such a representation is similar to the structure of general science ontologies, such as the Semanticscience Integrated Ontology (SIO) [ 59 ] or the Human-Aware Science Ontology (HAScO) [ 60 ]. We use SIO properties for the mapping of many of the DM columns, as shown in the Dictionary Mapping specification in Appendix Table B.3 , while also leveraging the PROV-O ontology [ 61 ] to capture provenance information. Despite specifying this default set of mappings, we note that the Properties table of the SDD can be used to determine the set of predicates used in the mapping process, allowing the user to customize the foundational representation model.

Figure 1. A conceptual diagram of the Dictionary Mapping that allows for a representation model that aligns with existing scientific ontologies. The Dictionary Mapping is used to create a semantic representation of data columns. Each box, along with the “Relation” label, corresponds to a column in the Dictionary Mapping table. Blue rounded boxes correspond to columns that contain resource URIs, while white boxes refer to entities that are generated on a per-row/column basis. The actual cell value in concrete columns is, if there is no Codebook for the column, mapped to the “has value” object of the column object, which is generally either an attribute or an entity.

In addition to allowing for the semantic annotation of data set columns, the SDD, unlike traditional mapping approaches, supports the annotation of implicit concepts referenced by the data. These concepts, referred to as implicit entries, are typically used to represent the measured entity or the time of measurement. For example, for a column in a data set for a subject's age, the concept of age is explicitly included, while the idea that the age belongs to a human subject is implicit. These implicit entries can then be described to have a type, a role, relationships, and provenance information in the same manner as the explicit entries. For example, to represent the subject that had their age measured, we could create an implicit entry, ??subject ⑭ .
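
As a concrete (and deliberately minimal) sketch of how an interpreter might process these entries, the following shows one explicit entry for an age column together with the implicit ??subject entry it refers to, and the RDF generated for a single data row; the column name, namespaces, and readable SIO-style property names are illustrative assumptions rather than the official mappings.

```python
# A minimal sketch of interpreting one explicit and one implicit Dictionary
# Mapping entry for a single data row. The column name (AGE), namespaces, and
# readable property names are illustrative; this is not the official interpreter.
from rdflib import Graph, Literal, Namespace, RDF, RDFS, XSD

SIO = Namespace("http://semanticscience.org/resource/")
EX = Namespace("http://example.org/study/")

dictionary_mapping = [
    # Explicit entry: a data column describing an attribute of an implicit entity.
    {"Column": "AGE", "Attribute": SIO.Age, "attributeOf": "??subject",
     "Label": "age of the participant", "Format": "integer"},
    # Implicit entry: the subject whose age was measured (object elicitation).
    {"Column": "??subject", "Entity": SIO.Human, "Label": "study participant"},
]

def annotate_row(row_id, row):
    g = Graph()
    # One instance per DM entry per data row, e.g. ex:AGE-1 and ex:subject-1.
    instances = {e["Column"]: EX[f"{e['Column'].lstrip('?')}-{row_id}"]
                 for e in dictionary_mapping}
    for entry in dictionary_mapping:
        node = instances[entry["Column"]]
        g.add((node, RDFS.label, Literal(entry["Label"])))
        if "Attribute" in entry:
            g.add((node, RDF.type, entry["Attribute"]))
            g.add((node, SIO.isAttributeOf, instances[entry["attributeOf"]]))
            g.add((node, SIO.hasValue,
                   Literal(row[entry["Column"]], datatype=XSD[entry["Format"]])))
        if "Entity" in entry:
            g.add((node, RDF.type, entry["Entity"]))
    return g

print(annotate_row(1, {"AGE": 34}).serialize(format="turtle"))
```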

3.2.1 Attributes and Entities

Attribute and Entity are included in the DM to allow for the type assignment of an entry. While both of these columns map to the property rdf:type ⑮ , they are both included as it may be semantically significant to distinguish between characteristics and objects. If an entry describes a characteristic, Attribute should be populated with an appropriate ontology class. The entity that contains the characteristic described, which can be either explicit or implicit, should be referenced in attributeOf. While columns in a data set typically describe an observed characteristic, this is not always the case. If an entry describes an object, such as a person, place, thing, or event, Entity should be populated with an appropriate ontology class.

3.2.2 Annotation Properties and Provenance

A set of annotation properties, including comments, labels, or definitions, allows for the description of an explicit or implicit entry in further detail. While Label is the only column included in the DM Specification for an annotation property, if support for comments and definitions is included in an SDD interpreter, we recommend the use of the rdfs:comment and skos:definition predicates, respectively. In terms of including provenance, wasDerivedFrom can be used to reference pre-existing entities that are relevant in the construction of the entry, and wasGeneratedBy can be used to describe the generation activity associated with the entry.

3.2.3 Additional Dictionary Mapping Columns

The Role, Relation, and inRelationTo columns of the DM are used to specify roles and relationships associated with entries. A reference to objects or attributes an entry is related to should be populated in inRelationTo. By populating Role, the sio:hasRole property is used to assign the specified role to the entry. Custom relationships using properties that are not included in the SDD can be specified using Relation. Events in the form of time instances or intervals associated with an entry should be referenced in Time. The unit of measurement of the data value can be specified in Unit. In general, we recommend the use of concepts in the Units of Measurement Ontology (UO) for the annotation of units, as many existing vocabularies in various domains leverage this ontology. A W3C XML Schema Definition Language (XSD) primitive data type ⑯ can be included in Format to specify the data type associated with the data value.

3.2.4 Dictionary Mapping Formalism

We define a formalism for the mapping of DM columns to an RDF serialization. The notation we use for formalizing the SDD tables is based on an approach for translating constraints into first-order predicate logic [ 62 ]. While most of the DM columns have one-to-one mappings, the mappings of Role, Relation, and inRelationTo are interrelated. In the formalism included below, ‘Value’ represents the cell value of the data point that is being mapped.

[Figure: formalism for mapping the Dictionary Mapping columns]
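
To convey the flavor of these rules, the core mapping for an explicit entry with an Attribute annotation can be sketched as follows; this is a simplified, illustrative reading that uses shortened predicate names and omits the Role, Relation, and Codebook cases, rather than a reproduction of the exact notation of the formalism.

```latex
% Illustrative sketch only (not the exact notation of the formalism): for each
% row r and annotated column c with Attribute class A and attributeOf entity e,
% a fresh attribute instance a_{r,c} is created, typed with A, attached to the
% per-row instance e_r of the elicited entity, and given the cell value v.
\begin{align*}
\text{cell}(r, c, v) \wedge \text{Attribute}(c, A) \wedge \text{attributeOf}(c, e)
  \rightarrow{}& \text{rdf:type}(a_{r,c}, A) \\
  {}\wedge{}   & \text{isAttributeOf}(a_{r,c}, e_{r}) \\
  {}\wedge{}   & \text{hasValue}(a_{r,c}, v)
\end{align*}
```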

3.3 Codebook

The Codebook table of the SDD allows for the annotation of individual data values that correspond to categorical codes. The Codebook table contains the possible values of the codes in Code, their associated labels in Label, and a corresponding ontology concept assignment in Class. If the user wishes to map a Codebook value to an existing Web resource or instance of an ontology class, rather than a reference to a concept in an ontology, Resource can be populated with the corresponding URI. We recommend that the class assigned to each code for a given column be a subclass of the attribute or entity assigned to that column. A conceptual diagram of the Codebook is shown in Figure 2(a) . The Codebook Specification is provided in Appendix Table B.4 . The formalism for mapping the Codebook is included below.

Figure 2. (a) A conceptual diagram of the Codebook, which can be used to assign ontology classes to categorical concepts. Unlike other mapping approaches, the use of the Codebook allows for the annotation of cell values, rather than just columns. (b) A conceptual diagram of the Timeline, which can be used to represent complex time-associated concepts, such as time intervals.

[Figure: formalism for mapping the Codebook]
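
In the same illustrative style as above (again a simplified sketch rather than the exact notation of the formalism), the effect of a Codebook entry can be read as replacing the literal value assignment with a type assignment drawn from the entry's Class column:

```latex
% Illustrative sketch only: if column c has a Codebook entry mapping code v to
% the ontology class K, the generated instance a_{r,c} is typed with K rather
% than receiving v as a literal value.
\[
\text{cell}(r, c, v) \wedge \text{Codebook}(c, v, K) \rightarrow \text{rdf:type}(a_{r,c}, K)
\]
```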

3.4 Code Mapping

The Code Mapping table contains mappings of abbreviated terms or units to their corresponding ontology concepts. This aids the human annotator by allowing the use of short-hand notations instead of repeating a search for the URI of the ontology class. The set of mappings used in the CHEAR project is useful for a variety of domains and is available online ⑰ .

3.5 Timeline

If an implicit entry for an event included in the DM corresponds to a time interval, the implicit entry can be specified in greater detail in the Timeline table. Timeline annotations include the corresponding class of the time-associated entry, the units of the entry, start and end times associated with an event entry, and a connection to other entries that the Timeline entry may be related to. Shown in Figure 2(b) is a conceptual diagram of the Timeline. The Timeline Specification is provided in Appendix Table B.5 . The formalism for mapping the Timeline is included below.

[Figure: formalism for mapping the Timeline]

3.6 Property Customization

The Semantic Data Dictionary approach creates a linked representation of the class or collection of data sets it describes. The default model is based on SIO, which can be used to express a wide variety of objects using a fixed set of terms; it also incorporates annotation properties from RDFS and the Simple Knowledge Organization System (SKOS) and uses provenance predicates from PROV-O. Shown in Appendix Table B.6 are the default sets of properties that we recommend.

By specifying the associated properties with specific columns of the Dictionary Mapping Table, the properties used in generating the knowledge graph can be customized. This means that it is possible to use an alternate knowledge representation model, thus making this approach ontology-agnostic. Nevertheless, we urge the user to practice caution when customizing the properties used to ensure that the resulting graph is semantically consistent (for example, not to replace an object property with a datatype property).

In the formalism presented above and the DM, CB, and TL specifications of Appendix Tables B.3 , B.4 , and B.5 , 14 distinct predicates are used ⑱ . Fourteen of the 16 rows of the Properties Table are included to allow the alteration of any of these predicates. The two additional rows pertain to Attribute and Entity, which, like Type, by default map to rdf:type, but can be customized to use an alternate predicate if the user wishes. In this way, by allowing for the complete customization of the predicates that are used to write the formalism, the SDD approach is ontology-agnostic. Note that the predicates used in the Infosheet Metadata Supplement of Table B.2, which are based on the best practices described in Section 3.1, are not included in the Properties Specification.

4 Annotating the NHANES Data Set

The National Health and Nutrition Examination Survey (NHANES) contains publicly available demographic and biomedical information. A challenge in creating a knowledge representation from this data set is determining how to represent the implicit entities referenced by the data, such as a participant of the study or the household that they live in. Additionally, information about a participant may be dispersed throughout multiple tables that consequently need to be integrated, resulting in difficulties when following traditional mapping approaches.

NHANES data dictionaries include a variable list that contains names and descriptions for the columns in a given data set component, as well as a documentation page that consists of a component description, data processing and editing information, analytic notes, and a Codebook. Unfortunately, the data set description provided is textual and therefore not readily machine-processable.

We find that neither the data documentation nor the codebooks included in NHANES incorporate mappings to ontology concepts. Thus, we provide a simple example of how several columns from the NHANES demographics data set would be represented using the SDD approach. The terms in this example are annotated using the CHEAR, SIO, and National Cancer Institute Thesaurus (NCIT) ontologies. Tables 1 , 2 , and 3 show a portion of the SDD we encoded for the NHANES demographics data set, in which we respectively present a subset of the explicit DM entries, implicit DM entries, and the Codebook entries. An example Infosheet for the NHANES demographic data set is provided in Appendix Table C.1 . The complete sets of explicit and implicit entries are provided in Appendix Table C.3 and Appendix Table C.2 , respectively. An expanded Codebook is included in Appendix Table C.4 . Additional NHANES tables not included in this article were also encoded as part of this annotation effort ⑲ .

Subset of explicit entries identified in NHANES demographics data.

Subset of implicit entries identified in NHANES demographics data.

Subset of NHANES demographic Codebook entries.

In Table 1 , we provide the explicit entries that would be included in the DM. The data column SEQN corresponds to the identifier of the participant. The resource created from this column can be used to align any number of NHANES tables, helping address the data integration problem. Another column included is the categorical variable that corresponds to education level. Also included are two variables that correspond to the age of the participant taking the survey and the age of the specified reference person of the household, referred to as the head of the household (HH in Table 1 ), defined as the person who owns or pays rent for the house. We see how the use of implicit entries, as well as the use of specified Code Mapping units, helps differentiate the two ages. The corresponding implicit entries referenced by the explicit entries are annotated in Table 2 .

In Table 3 , we include a subset of the Codebook for this example. The SDD Codebook here is similar to the original NHANES Codebook, with the addition of Column, so that multiple codebooks do not have to be created to correspond to each categorical variable, and Class, used to specify a concept from an ontology to which the coded value maps.

5 Use Cases

In this section, we provide a case study on projects that have leveraged the SDD for health-related use cases. We focus on work done for the Health Empowerment by Analytics, Learning, and Semantics (HEALS) project, while also briefly discussing efforts in other programs. In our funded research, our sponsors often desire the representation of their data in a semantically consistent way that supports their intended applications. They wish to play a role in the annotation process by contributing their subject matter expertise. We find that the SDD approach is more accessible to domain scientists than other programming-intensive approaches. Additionally, they appreciate that the reusability of SDDs limits the number of future updates needed when, for example, a data schema changes.

5.1 Health Empowerment by Analytics, Learning and Semantics

As part of the RPI and IBM collaborative Health Empowerment by Analytics, Learning, and Semantics (HEALS) project ⑳ , SDDs have been used to aid in semantic representation tasks for use cases involving breast cancer and electronic health record (EHR) data.

5.1.1 Breast Cancer Use Case

For the creation of an application used for the automatic re-staging of breast cancer patients, the SDD approach was used to create a knowledge representation of patient data from the Surveillance, Epidemiology, and End Results (SEER) program [ 63 ]. In order to integrate treatment recommendations associated with a given biomarker into the application, an SDD for the Clinical Interpretation of Variants in Cancer (CIViC) database was also created. Applying the SDD approach to this problem demonstrated seamless data integration between these two distinct sources, which would have been more difficult to achieve using some of the methods described in Section 2.2. For example, if any of the mapping language or Semantic ETL approaches were applied, a script requiring an intrinsic understanding of the data set would need to be written, rather than just filling out the SDD tables. While the SDD approach still requires an understanding of the data set, using it to describe the data sets mentioned above greatly reduces the data comprehension burden placed on the user. Another advantage demonstrated by using this approach was that, since a limited set of properties is leveraged in the semantic model that was created, the cost of implementing the application, in terms of programming resources and overhead, was reduced. A subset of the explicit entries from the SEER DM is shown in Table 4 .

Subset of explicit entries identified in SEER.

Additional cancer-related work for the HEALS project involves the annotation of a subset of The Cancer Genome Atlas (TCGA) through the NCI Genomic Data Commons (GDC) portal. While these SDDs are not included here, they are openly available on our SDD resources webpage. The clinical subset of the TCGA data that was annotated contains patient demographic and tumor information, and the methylation portion contains genetic information. By using the same ontology classes that were used for the SEER data set to annotate these concepts, we are able to leverage TCGA data to further enrich the cancer staging application described above.

5.1.2 Electronic Health Record Data

To create a knowledge representation from electronic health record (EHR) data, we annotated the Medical Information Mart for Intensive Care III (MIMIC-III) data set using SDDs. While this effort involved annotating 26 relational tables, we only include a subset of the Dictionary Mapping of the admission table in Table 5 . Using this approach, we can represent implicit concepts associated with the data. The inclusion of implicit concepts provides connection points for linking the various EHR data tables into a single coherent knowledge representation model that reflects the reality recorded by the data. This would be difficult to accomplish using many alternate approaches we examined that do not support object elicitation.

Subset of Dictionary Mapping for the MIMIC-III Admission table.

5.2 Additional Use Cases

Several institutions are employing the Semantic Data Dictionary approach for a variety of projects. The Icahn School of Medicine at Mount Sinai uses SDDs for the NIH CHEAR and the follow-on HHEAR projects to annotate data related to demographics, anthropometry, birth outcomes, pregnancy characteristics, and biological responses. The Lighting Enabled Systems & Applications (LESA) Center is using SDDs to annotate sensor data. SDDs are being used in Brazil for the Big Data Ceara project, through Universidade de Fortaleza, and the Global Burden of Disease project, through Universidade Federal de Minas Gerais.

5.3 Remarks

In this section, we discussed how SDDs help represent knowledge for a variety of other projects that involve collaborative efforts with domain scientists, exhibiting the applicability of this approach for researchers in a variety of specializations. For the HEALS project, we have shown DMs for use cases that involve breast cancer and EHR data. In addition to patient demographic characteristics from the SEER data, we encode the size of the patient's tumor, the number of lymph nodes affected, whether or not the cancer metastasized, and several genetic biomarkers. Using this data, the successful automation of re-staging breast cancer patients was accomplished. While we only show a single DM for the MIMIC-III data set, this use case involves the annotation of multiple relational data tables and demonstrates how data integration can be performed using SDDs.

6 Modeling Challenges

To examine the difficulty experienced by researchers without a Semantic Web background when first using the Semantic Data Dictionary, we used an initial strategy of training followed by qualitative evaluation. Domain scientists, including epidemiologists and biostatisticians, were presented with initial training by a Semantic Web expert. Supporting materials were developed in collaboration with a domain expert and then made available to provide guidance and examples to facilitate domain scientists' use of the Semantic Data Dictionary.

First, a template for completing the Semantic Data Dictionary that included pre-populated fields for common demographic concepts, such as age, race, and gender, was provided to domain scientists to use for each study. Second, a help document was created that included instructions and representations of more complex concepts, including measurements of environmental samples, measurements of biological samples, and measurements taken at specific time-points. Third, a practical workshop was held where a semantic scientist provided training in semantic representation to the domain scientists. Following the workshop and distribution of supporting materials, domain scientists completed at least one Semantic Data Dictionary for an epidemiologic study and were then asked about the challenges they faced. Although the training and workshop were conducted in a context related to epidemiology and health, the key takeaways yielded general lessons learned.

The first identified challenge was the representation of implicit objects implied by the features in the data set. This is an uncommon representation in the public health domain. While the modeling of simple concepts may be intuitive (e.g. maternal age has a clear implicit reference to mother), the representation of complex ideas, such as fasting blood glucose levels, proves to be more difficult, as the implicit object and the relationships between concepts are not as intuitive for domain scientists. A second modeling challenge involved discussions on how to represent time-associated concepts that power the ontology-enabled tools and allow domain scientists to harmonize data across studies. Additionally, when a concept was not found in a supporting ontology, there were questions of how to best represent the concept in a semantically appropriate way. In many cases, these challenges resulted in a need to go back to a Semantic Web expert for clarification.

To alleviate these challenges, we have refined and expanded the set of publicly available resources, which includes documentation, step-by-step modeling methods, tutorials, demonstrations, and informative examples. We increased the complexity of the examples and incorporated time-associated concepts into the initial templates and help documents. To facilitate further communication, a Web-based Q&A document has been shared between the Semantic Web experts and the domain scientists to enable timely feedback and answers to specific questions on the representation of concepts and the need to generate new concepts.

In addition to the solutions presented above, we plan for future training events to explicitly demonstrate the use of the Semantic Data Dictionary. We will provide an overview of the semantic representation, as well as guidelines for using the corresponding documentation and training materials.

7 Evaluation

To evaluate the Semantic Data Dictionary approach, we categorize metrics from earlier evaluations of mapping languages [ 64 , 65 ] and requirements of data integration frameworks. In addition to evaluating the SDD for adherence to these metrics, we survey similar work to determine the extent to which they meet the metrics in comparison. We include a set of evaluation metrics that we organized into four categories, respectively related to data, semantics, the FAIR principles, and generality.

To measure the degree to which an approach meets each metric, we provide a value of 0, 0.5, or 1, depending on the extent to which an approach responds to an evaluation parameter. In general, if an approach does not meet a metric, it is given a score of 0. If it meets a metric partially, we assign a score of 0.5. We also assign this score to approaches that meet a metric by omission, such as being ontology-agnostic by not supporting the use of ontologies at all. If an approach completely meets the metric, it is given a score of 1. We list the criteria used for the assignment of numerical values below (refer to Table 6 for the complete list of categorized metrics).

High-level comparison of semantic data dictionaries, traditional data dictionaries, approaches involving mapping languages, and general data integration tools.

7.1 Data Integration Capabilities

In this category, we consider whether the approach can harmonize and ingest data, allows for the selection of data subsets, and permits data type assignment. We evaluate whether the approach is harmonizable in the sense that it has the capability of creating a cohesive representation for similar concepts across columns or data sets in general. We check that knowledge generated across data sets can be compared using similar terms from a controlled set of vocabularies. For this metric, we respectively assign a score of 0, 0.5, or 1 if data integration capabilities are not supported, somewhat supported, or wholly supported.

Next, we consider whether the approach is ingestible, outputting data in a standard format that can be uploaded and stored (ingested) and supports inputs of varying formats. We assign a score of 1 if the resulting data representation can be stored in a database or triplestore, and if it can input data of varying formats. If one of the two features is supported, we assign a score of 0.5. If neither is supported, we assign a score of 0.

Furthermore, we consider a subset selection metric, where we check if the approach allows the user to select a subset of the data, either in terms of columns and rows, on which to perform the annotation. For this metric, a score of 0 is assigned if this capability is not included in the approach. We assign a score of 0.5 if either a subset of the rows or the columns can be specified for annotation, but not both. If the approach allows for the selection of both a subset of rows or of columns to be annotated, we assign a score of 1.

Finally, we include the data type assignment metric, measuring the extent to which XML data types can be assigned to attributes when mapping data. We assign a score of 0 for this metric if the approach does not allow for the assignment of data types when mapping data. If the assignment of a limited set of data types that are not based on XML standards is incorporated, a score of 0.5 is assigned. If the approach allows the assignment of XML data types, a score of 1 is given.

7.2 Formal Semantics Capabilities

In this category, we consider if the approach allows for object or relation elicitation, as well as value, time, or space annotation. We also check if the resulting data representation is queryable and if the approach supports both domain-specific and general ontology foundations. Graph materialization is the last assessment metric we apply in this category. Data usually consist of values attributed to observations, measurements, or survey results. Data set descriptions contain metadata, but often omit details on the objects that the values describe. For a complete semantic representation, one must also consider the ability to represent implicit objects that are associated with the data points, which we measure using the object elicitation metric. If the approach does not include the ability to represent implicit objects, a score of 0 is assigned. If implicit objects are considered but not annotated in detail, we assign a score of 0.5. We assign a score of 1 if implicit objects can be represented and richly annotated.

In addition to being able to represent implicit concepts, we consider relation elicitation, where relationships between implicitly elicited objects can be represented. A score of 0 is assigned if an approach does not allow for the representation of relationships between elicited objects. If relationships between elicited objects can be represented, but not annotated in detail, a score of 0.5 is assigned. We assign a score of 1 if relationships between elicited objects can be represented and richly annotated.

Next, we consider if the resulting representation is queryable, so that specific data points can be easily retrieved using a query language. A score of 0 is assigned for this metric if specific content from the knowledge representation cannot be queried. If it can be queried using a relational querying method, such as SQL, but not a graph querying method, a score of 0.5 is assigned. If content can be queried using a graph querying method, such as SPARQL, we assign a score of 1.

We further consider the annotation of cell values, rather than just column headers, using the value annotation metric. This covers the ability to annotate categorical cell values, assign units to annotate non-categorical cell values, and specify attribute mappings of object properties related to cell values. If the approach does not allow for the annotation of cell values at all, or allows for a limited set of annotations for cell values, we assign scores of 0 and 0.5, respectively. We assign a score of 1 if an approach includes the ability to annotate categorical cell values, assigns units to annotate non-categorical cell values, and specifies attribute mappings of object properties related to cell values.

We consider the ability to represent specific scientific concepts, including time and space. Using the time annotation metric, we check for the ability to use timestamps to annotate time-series values, as well as named time instances to annotate cell values. A score of 0 is assigned for this metric if an approach does not allow for the representation of time. If the approach allows for the representation of time, but does not permit detailed annotations, we assign a score of 0.5. We assign a score of 1 if the approach allows for detailed annotation of time, such as the use of timestamps to annotate time-series values and named time instances to annotate cell values.

The space annotation metric is added to check for the use of semantic coordinate systems to annotate the acquisition location of measurements. We assign a score of 0 if an approach does not allow for the representation of space. If it allows for the representation of space, but does not permit detailed annotations, we assign a score of 0.5. A score of 1 is assigned if the use of semantic coordinate systems to annotate the acquisition location of measurements is supported.

We examine domain knowledge support by checking if the approach permits the design of mappings using pre-existing domain-specific ontologies or controlled vocabularies. A score of 0 is assigned for this metric if the approach does not permit the design of reusable mappings driven by domain knowledge. We assign a score of 0.5 if it permits the design of reusable mappings using either pre-existing ontologies or controlled vocabularies, but not both. If annotations from both pre-existing ontologies and controlled vocabularies are allowed, we assign a score of 1.

Using the top-level ontology foundation metric, we consider the ability to use general upper ontologies as a foundation for the resulting model. If an approach cannot specify mapping rules based on foundation ontologies, a score of 0 is assigned for this metric. If a subset of mapping rules based on general foundation ontologies can be specified, we assign a score of 0.5. A score of 1 is assigned if the approach allows for the specification of all mapping rules based on general foundation ontologies. Essentially, we are checking if the semantic model that results from the annotation approach is structured based on a given ontology. While we recommend the use of well-known upper ontologies such as SIO or Basic Formal Ontology (BFO [ 66 ]), in evaluating this metric we allow the approach to leverage any ontology.

Finally, with the graph materialization metric, we assess the persistence of the generated knowledge graph into an accessible endpoint or file. If the approach does not allow for the materialization of the generated graph, a score of 0 is assigned. If the generated graph is reified into an accessible endpoint or downloadable file, but not both, a score of 0.5 is assigned. If both materializations into an accessible endpoint and a downloadable file are supported, we assign a score of 1.

In the FAIR category, we consider the metrics associated with the FAIR guiding principles, including whether the approach and resulting artifacts are findable, accessible, interoperable, and reusable. Furthermore, we also consider the related metrics of reproducibility and transparency, which are not included in the FAIR acronym. While several of the metrics we measure in the other categories of our evaluation aid with the creation of FAIR data, such as the incorporation of provenance or the inclusion of documentation as discussed in Section 7.3.1, we include these six metrics in the FAIR category since they are directly associated with the intent of the principles in enhancing data reuse and are explicitly discussed in the introductory article on the FAIR principles [ 4 ].

For the findable metric, we consider the use of unique persistent identifiers, such as URLs, as well as the inclusion of Web searchable metadata so that the knowledge is discoverable on the Web. If the knowledge representation is neither persistent nor discoverable, we assign a score of 0 for this metric. If the knowledge representation is one of the two, we assign a score of 0.5. A score of 1 is assigned if the knowledge representation is both persistent and discoverable.

We consider a knowledge representation to be accessible if resources are openly available using standardized communication protocols, with the consideration that data that cannot be made publicly available are accessible through authentication. Accessibility also includes the persistence of metadata, so that even if data are retired or made unavailable, their description still exists on the Web. As an additional consideration for evaluating accessibility, we examine whether or not the associated software for an approach is free and publicly available. If resources and metadata are not published openly, a score of 0 is assigned for this metric. If some resources and metadata are persistent and openly available, we assign a score of 0.5. A score of 1 is assigned if all of the resources and metadata from a given approach are both persistent and openly available using standardized communication protocols.

For the interoperable metric, we consider the use of structured vocabularies, such as best practice ontologies, that are RDF compliant. Mainly, we are checking to see if the knowledge representation is published using an RDF serialization. If the knowledge representation does not use a structured vocabulary, a score of 0 is assigned. If it uses structured vocabularies that are not RDF compliant, we assign a score of 0.5. A score of 1 is assigned if the knowledge representation uses formal vocabularies or ontologies that are RDF compliant.

To test if an approach or the resulting knowledge representation is reusable, we consider the inclusion of a royalty-free license that permits unrestricted reuse, and that consent or terms of agreement documents are available when applicable. We also discuss if included metadata about the resource is detailed enough for a new user to understand. A score of 0 is assigned for this metric if an approach does not include a royalty-free license. If a royalty-free license that permits unrestricted use of some portions of the tool is included, a score of 0.5 is assigned. We assign a score of 1 if the approach includes a royalty-free license that permits unrestricted use of all portions of the tool.

We examine if an approach is reproducible in terms of the scientific activities introduced within a given methodology, such that experiments can be independently conducted and verified by an outside party. If the approach creates a knowledge representation that cannot be reproduced, a score of 0 is assigned. If the knowledge representation can be produced by an outside party only with the help of the involved party, rather than entirely independently, we assign a score of 0.5. A score of 1 is assigned if the approach for creating a knowledge representation can be independently reproduced.

Finally, we consider if data and software are transparent, such that there are no “black boxes” used in the process of creating a knowledge representation. Transparency is readily achieved by making sure that software is made openly available. If the associated code for a given approach is not openly accessible, we assign a score of 0. We assign a score of 0.5 if some of the associated code is open, while other portions are not openly available; this generally applies to approaches that have both free and paid versions of the software. If all of the associated code for an approach is open source, a score of 1 is given.

7.3.1 Generality Assessment

To evaluate the generality of an approach, we investigate whether or not the method is domain-agnostic, is ontology-agnostic, and adheres to existing best practices. We weigh whether the method incorporates provenance attributions, is machine-understandable, and contains documents to aid the user, such as documentation, tutorials, or demonstrations.

We analyze whether an approach is domain-agnostic, in that its usage is not restricted to a particular domain. A score of 0 is assigned for this metric if the approach only applies to a single field of study. If the approach applies to multiple fields of study but does not generalize to all domains, a score of 0.5 is assigned. We assign a score of 1 if the approach can be generalized to any area of study.

In a similar vein, we judge whether the method is ontology-agnostic, meaning that its usage is not limited to a particular ontology or set of ontologies. If the approach depends on a particular ontology or set of ontologies, a score of 0 is assigned. If the dependence on particular ontologies is unclear from the examined literature and documentation, we assign a score of 0.5. A score of 1 is assigned for this metric if the approach is independent of any particular ontology.

We examine the literature and documentation associated with a given approach or knowledge representation to see if it leverages best practices. In particular, we consider the applicable best practices related to the HCLS and DWBP guidelines. The practices we test for include the ability of the approach to incorporate descriptive metadata, license and provenance information, version indicators, and standardized vocabularies, and to use locale-neutral data representations. A score of 0 is assigned if the literature associated with an approach does not acknowledge or adhere to existing best practice standards. If existing standards are acknowledged but are only partially adhered to or not adhered to at all, we assign a score of 0.5. If the literature acknowledges and adheres to existing best practices, a score of 1 is assigned.

We consider the inclusion of provenance, involving the capture of source information, such as attribution information for how a data point was measured or derived. A score of 0 is assigned for this metric if the approach does not include attributions to source or derivation information. If attribution information that does not use Semantic Web standards is included, we assign a score of 0.5. If the approach supports attributions recorded using a Semantic Web vocabulary, such as the PROV-O ontology, a score of 1 is assigned. In terms of documentation, we further search for the inclusion of assistive documents, tutorials, and demonstrations. We assign a score of 0 for this metric if at most one of documentation, tutorials, or demonstrations is included. If two or all three of the above are provided, we assign scores of 0.5 or 1, respectively.

Finally, we consider the machine-readable metric, determining whether the resulting knowledge representation from an approach is discernable by software. In addition to the consideration of the machine-readability of output artifacts such as produced knowledge graphs, we also examine input artifacts, such as the document that contains the set of semantic mappings. If neither input nor output artifacts can be parsed using software, a score of 0 is assigned for this metric. If either input or output artifacts can be parsed, but not both, a score of 0.5 is assigned. We assign a score of 1 if both input and output artifacts are machine-readable.
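As a concrete illustration of how the 0, 0.5, and 1 rubric described above rolls up into per-category scores, the following minimal Python sketch encodes a hypothetical set of metric scores and sums them by category. The metric names and example values are our own illustration and are not taken from the published evaluation; they simply demonstrate the aggregation.

# Minimal sketch of aggregating the 0 / 0.5 / 1 rubric into category scores.
# Metric names and example values are illustrative only, not actual results.
ALLOWED_SCORES = {0, 0.5, 1}

scores = {
    "FAIR": {
        "accessible": 1, "interoperable": 1, "reusable": 0.5,
        "reproducible": 1, "transparent": 1,
    },
    "generality": {
        "domain-agnostic": 1, "ontology-agnostic": 1, "best practices": 0.5,
        "provenance": 1, "documentation": 1, "machine-readable": 1,
    },
}

def category_totals(score_table):
    """Validate each metric score and sum the scores within each category."""
    totals = {}
    for category, metrics in score_table.items():
        for metric, value in metrics.items():
            if value not in ALLOWED_SCORES:
                raise ValueError(f"{metric}: scores must be 0, 0.5, or 1")
        totals[category] = sum(metrics.values())
    return totals

print(category_totals(scores))  # {'FAIR': 4.5, 'generality': 5.5}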

In Table 6, we provide a high-level comparison between the Semantic Data Dictionary, traditional data dictionaries, mapping languages and the semantic approaches that leverage them, and data integration tools. Of the conventional data dictionaries examined in Section 2.1, we use the Project Open Data Metadata Schema data dictionary for comparison, since it was the only reviewed guideline that used a standard linked data vocabulary. Of the mapping languages, we use R2RML for comparison, as it is a standard that is well adopted by the Semantic Web community. Of the data integration tools we surveyed, we use Karma for this evaluation, as it is an example of a data integration approach that was designed with both the FAIR principles and ease of use for the end-user in mind. Rather than only using these specific approaches in conducting the evaluation, we treat these examples as guidelines and consider traditional data dictionaries, mapping languages, and data integration tools in general when assigning numerical scores.

We have demonstrated the benefits of the standardized, machine-readable representation of data set metadata and column information achieved through SDDs over earlier data dictionary formats. Furthermore, we have shown that the SDD approach presents a level of abstraction over methodologies that use mapping languages, allowing improved ease of use for a domain scientist compared to other semantic tools. In this regard, SDDs tend to provide a bridge between the conventional data dictionary approaches used by domain scientists and the formal semantic approaches used by Semantic Web researchers, thereby accommodating both user groups. We recognize that existing RDF mapping tools are also intended to provide a bridge by reducing manual mapping or KG creation work that would otherwise be necessary, but acknowledge that they may be unusable for domain scientists.

9. DISCUSSION

In presenting this work, we consider two general types of users: those using SDDs to semantically annotate data, and those using SDDs in place of traditional data dictionaries in order to understand the data being described. For the first group, the benefits of using SDDs include that the annotation process is accessible to users outside of the Semantic Web domain and that existing SDDs can be reused to ease the creation of new annotations. Benefits for the second group include that (i) while data descriptions in traditional data dictionaries can typically only be understood by humans, SDDs can also be interpreted by machines, (ii) SDDs are written using fixed vocabularies, which reduces ambiguity, and (iii) the SDD provides a standard specification that can be used to interpret existing annotations.

By including a fixed set of tables for the annotator to fill out, which are interpreted and converted using a standard set of rules, the SDD framework provides consistency: it creates a formal semantic representation using direct RDF mappings, increasing the likelihood that diverse annotators will create similar representations. This is in contrast with other mapping approaches, where multiple annotators are much less likely to produce similar results when addressing the same data set. The SDD approach reduces such representational biases by abstracting structural modeling decisions away from the user, both cultivating scalability of production and lowering the barrier of entry, since annotators do not all have to be computer scientists. Moreover, the vocabulary used in an SDD can easily be updated by replacing terms in any of the tables, whereas similar updates are much harder to make with standard mapping methods. An advantage of these features is that users can focus on their topic of specialization rather than on the RDF, reducing the need for domain scientists to also become ontology experts. Given a recommended set of ontologies to use, any user should be able to create their own SDD for a given data set.
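To make the nature of this direct mapping concrete, the sketch below (written with the rdflib Python library) shows how a single, hypothetical Dictionary Mapping row, a column labeled AGE declared as an attribute of an implicit ??participant entry, might be expanded into RDF triples. The namespace IRIs, local names, node identifiers, and cell value are stand-ins chosen for illustration; they mirror the prefixes used in this article but are not the output of the actual SDD conversion tooling.

from rdflib import Graph, Literal, Namespace, RDF, RDFS

# Illustrative namespaces; local names mirror the CURIEs used in this article
# but are not guaranteed to resolve to the official ontology IRIs.
SIO = Namespace("http://semanticscience.org/resource/")
EX = Namespace("http://example.org/nhanes#")

g = Graph()
g.bind("sio", SIO)
g.bind("ex", EX)

# Hypothetical Dictionary Mapping row: Column = AGE, attributeOf = ??participant.
age_node = EX["age-row-1"]          # node minted for one cell of the AGE column
participant = EX["participant-1"]   # node minted for the implicit ??participant entry

g.add((age_node, RDFS.label, Literal("AGE")))
g.add((age_node, SIO.isAttributeOf, participant))   # link the value to its implicit object
g.add((age_node, SIO.hasValue, Literal(34)))        # the cell value from the data file
g.add((age_node, SIO.hasUnit, EX["year"]))          # unit taken from the Unit column
g.add((participant, RDF.type, EX["Participant"]))   # type declared for the implicit entry

print(g.serialize(format="turtle"))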

From the evaluation of Section 7, we find that in the data category, SDDs perform much better than traditional data dictionaries and as well as mapping languages and data integration tools. SDDs outperform the three other approaches in the semantics category. In terms of semantics, a notable impact of this work is our approach to object and relation elicitation, where detailed annotations for objects implicitly referenced by the data can be included. SDDs and mapping languages perform equally well in the FAIR category, surpassing the scores of data integration tools and traditional data dictionaries. SDDs, mapping languages, and data integration tools tied for the best performance in the generality category, greatly outperforming traditional data dictionaries. While traditional data dictionaries performed the worst over all four categories, they do outperform mapping languages and data integration tools in the value annotation metric.

10. CONCLUSION

While the use of SDDs addresses many of the shortcomings associated with the prior art, we do acknowledge several limitations of this approach. In Section 6, we mention several challenges faced by epidemiologists in creating SDDs. We found that the domain scientists had difficulties representing complex ideas, implicit concepts, and time associations. Additionally, determining the best ontology term to use when creating annotations was not always clear. These challenges relate to the limitation that this approach relies to some degree on the annotator having knowledge of relevant ontologies in the domain of discourse. Several steps to help alleviate these challenges are discussed in Section 6.

Another limitation of this approach is that it currently only supports the annotation of tabular data. Adopting techniques from some of the methods discussed in Section 2.2.2 can help with a future extension to support XML data. Additions to support the annotation of unstructured text data are beyond the scope of this work. Finally, we acknowledge that the annotation process discussed in this article is mostly done manually. This limitation decreases the likelihood of the adoption of this approach by those wishing to streamline the annotation process or incorporate the approach as part of a larger workflow. While automated annotation is not yet supported, ongoing research on an SDD editor being conducted by members of the Tetherless World Constellation (TWC) involves the incorporation of Natural Language Processing (NLP) techniques to suggest concepts from ontologies based on text descriptions.

Our approach was outperformed in a few of the evaluation metrics, including space annotation, domain knowledge support, and the leveraging of best practices. Space annotation is supported to some degree through the use of implicit entries and property customization. Nevertheless, the SDD approach received a 0.5 rather than a 1 for this metric because it does not readily allow for the incorporation of latitude and longitude coordinates, unlike Karma, which supports the annotation of geospatial data, contains tutorials on how to annotate such data, and provides tools developed specifically for geospatial data integration [67, 68, 69]. While the SDD approach allows the use of domain ontologies during the annotation process, a score of 0.5 was assigned to the domain knowledge support metric since we have not developed tools that suggest the most appropriate domain concept to the user. Nevertheless, as mentioned above, ongoing work on an SDD editor will leverage NLP techniques to provide this capability. Finally, while many of the DWBP and HCLS recommendations are incorporated into our approach, a score of 0.5 was received in terms of leveraging best practices because additional standards from these guidelines have yet to be incorporated. Additionally, further alignment with the standards mentioned in Section 2.3 should be achieved. The relevant best practices associated with our approach have been a subject of much discussion; further incorporation of these recommendations will be included in future revisions.

An ideal knowledge model promotes improved discovery, interoperability, reuse, traceability, and reproducibility. The knowledge model resulting from the SDD approach adheres to Semantic Web standards, resulting in improved discovery on the Web, as well as interoperability with systems that also use RDF data serializations. These artifacts are reusable, as SDD tables created for one data set can be reused to annotate another similar data set. Scientific studies involving SDDs are traceable and reproducible by design, as the artifacts designed during the modeling process can be published and shared, helping to ensure consistency for other researchers attempting to examine the studies.

In this work, we advance the state of the art of metadata capture of data sets by improving on existing standards with the formalization of the Semantic Data Dictionary specification, which produces machine-readable knowledge representations by leveraging Semantic Web technologies. This is achieved by formalizing the assignment of a semantic representation of data and annotating data set columns and their values using concepts from best practice ontologies. We provide resources such as documentation, examples, tutorials, and modeling guidelines to aid those who wish to create their own Semantic Data Dictionaries. We claim that this approach and the resulting artifacts are FAIR, help address limitations of traditional data dictionaries, and provide a bridge between representation methods used by domain scientists and semantic mapping approaches. We evaluate this work by defining metrics over several relevant categorizations, and scoring the Semantic Data Dictionary, traditional data dictionaries, mapping languages, and data integration tools for each metric. As we provide a methodology to aid in scientific workflows, this work eases the semantic annotation process for data providers and users alike.

AUTHOR CONTRIBUTIONS

S.M. Rashid ([email protected]), in drafting the paper, introduced the research, motivation, and claims of this article in Section 1, conducted the majority of the literature review presented in Section 2, summarized the methodology associated with the approach in Section 3, formulated the example of Section 4, detailed the case studies presented in Section 5, performed the evaluation of Sections 7 and 8, helped with the discussion in Section 9, and summarized the conclusions of the article in Section 10. J.P. McCusker ([email protected]) contributed to the content of Section 3 and aided in the formulation of the evaluation of Section 7. P. Pinheiro ([email protected]) helped scope the example of Section 4. M.P. Bax ([email protected]) helped conduct the literature review of Section 2 and aided in the formulation of the evaluation of Section 7. H.O. Santos ([email protected]) helped synthesize the related literature in Section 2 and presented some limitations of our approach in Section 10. J.A. Stingone ([email protected]) conducted the experiment and drafted the content presented in Section 6. A.K. Das ([email protected]) led the proposal of the research problems associated with the HEALS projects mentioned in Section 5. D.L. McGuinness ([email protected]) has guided the overall direction of this research. All the authors have made meaningful and valuable contributions in revising and proofreading the resulting manuscript.

This work is supported by the National Institute of Environmental Health Sciences (NIEHS) Award 0255-0236-4609/1U2CES026555-01, IBM Research AI through the AI Horizons Network, and the CAPES Foundation Senior Internship Program Award 88881.120772/2016-01. We acknowledge the members of the Tetherless World Constellation (TWC) and the Institute for Data Exploration and Applications (IDEA) at Rensselaer Polytechnic Institute (RPI) for their contributions, including Rebecca Cowan, John Erickson, and Oshani Seneviratne.

https://github.com/tetherless-world/SemanticDataDictionary

https://www.stonybrook.edu/commcms/irpe/about/data_governance/_files/DataDictionaryStandards.pdf

https://help.osf.io/hc/en-us/articles/360019739054-How-to-Make-a-Data-Dictionary

https://github.com/USG-SCOPE/data-dictionary/blob/gh-pages/Metadata-Scheme-for-Data-Dictionaries.md

https://project-open-data.cio.gov/v1.1/schema/

https://github.com/tetherless-world/setlr/wiki/JSLDT-Template-Language

http://metadata-standards.org/11179/

https://tetherless-world.github.io/sdd/resources

A listing of ontology prefixes used in this article is provided in Appendix Table A.1.

https://tetherless-world.github.io/sdd/

https://www.w3.org/TR/hcls-dataset/

https://www.w3.org/TR/dwbp/

When referencing columns from any of the SDD tables, the Small Caps typeface is used.

When including implicit entries in an SDD table, the prefix “??” is used as a distinguishing labeling feature. The typewriter typeface is used in this article when referring to instances of implicit entries.

The italics typeface is used when a property from an ontology is mentioned.

https://www.w3.org/TR/xmlschema11-2/

https://github.com/tetherless-world/chear-ontology/blob/master/code_mappings.csv

rdf:type, sio:isAttributeOf, rdfs:comment, skos:definition, sio:hasStartTime, sio:existsAt, sio:hasEndTime, sio:inRelationTo, rdfs:label, sio:hasRole, sio:hasUnit, sio:hasValue, prov:wasDerivedFrom, and prov:wasGeneratedBy

See https://science.rpi.edu/biology/news/ibm-and-rensselaer-team-research-chronic-diseases-cognitive-computing or https://idea.rpi.edu/research/projects/heals for more information.

Namespace prefixes and IRIs for relevant ontologies.

Due to the subjective nature of deciding the importance of each component, the rows in each of the specifications are shown in alphabetical order rather than in a meaningful sequence.

Infosheet specification.

Infosheet metadata supplement.

Dictionary mapping specification.

Codebook specification.

Timeline specification.

Properties specification.

The tables in this appendix correspond to annotations created for the National Health and Nutrition Examination Survey (NHANES). For more details on each of the annotated columns, we recommend that the reader visit the NHANES website at https://www.cdc.gov/nchs/nhanes/index.htm.

NHANES demographics Infosheet.

NHANES demographic implicit entries.

NHANES demographic explicit entries.

Expanded NHANES demographic Codebook entries.


UC Merced Library logo

  • What Is a Data Dictionary?

A Data Dictionary Definition

A Data Dictionary is a collection of names, definitions, and attributes about data elements that are being used or captured in a database, information system, or part of a research project. It describes the meanings and purposes of data elements within the context of a project, and provides guidance on interpretation, accepted meanings and representation. A Data Dictionary also provides metadata about data elements. The metadata included in a Data Dictionary can assist in defining the scope and characteristics of data elements, as well the rules for their usage and application. 

Why Use a Data Dictionary?

Data Dictionaries are useful for a number of reasons. In short, they:

  • Assist in avoiding data inconsistencies across a project
  • Help define conventions that are to be used across a project
  • Provide consistency in the collection and use of data across multiple members of a research team
  • Make data easier to analyze
  • Enforce the use of Data Standards

What Are Data Standards and Why Should I Use Them?

Data Standards are rules that govern the way data are collected, recorded, and represented. Standards provide a commonly understood reference for the interpretation and use of data sets.

By using standards, researchers in the same disciplines will know that the way their data are being collected and described will be the same across different projects. Using Data Standards as part of a well-crafted Data Dictionary can help increase the usability of your research data, and will ensure that data will be recognizable and usable beyond the immediate research team.

Resources and Examples

Northwest Environmental Data Network, Best Practices for Data Dictionary Definitions and Usage

USGS: Data Dictionaries and Metadata

If you'd like more information on research data curation and management, please schedule a consultation:

Schedule Appointment

  • New? Start Here
  • Research Guides
  • Citing Sources
  • Meet with a Librarian
  • Library DIY Tutorials
  • Workshop Recordings
  • Library Tour
  • Starting Your Research Series
  • Graduate Students
  • Digital Curation and Scholarship
  • Metadata and Documentation
  • File and Folder Organization
  • Storage and Preservation
  • Version Control
  • Data Carpentry
  • Research Data Management Toolkit
  • Research Data Management Glossary
  • Data Management Plans
  • Scholarly Publishing
  • GIS Services and Support
  • Schedule a Research Appointment
  • Library Instruction Services
  • Schedule an Instruction Session
  • Course Resources
  • By Request Workshops

University of California, Merced

  • Skip to Guides Search
  • Skip to breadcrumb
  • Skip to main content
  • Skip to footer
  • Skip to chat link
  • Report accessibility issues and get help
  • Go to Penn Libraries Home
  • Go to Franklin catalog
  • Penn Libraries
  • Research Data & Digital Scholarship

Data Management Resources

  • Codebooks & Data Dictionaries
  • Data Management Plans
  • File Organization
  • Spreadsheets
  • Metadata & Standards
  • ReadMe Files
  • Repositories
  • Storage & Backups
  • Sustainable File Types
  • Citing Data

Codebook and Data Dictionary Resources

  • Codebook Cookbook  - Patrick Belisle of McGill University
  • How to Create a Codebook with SPSS - Kent State University Libraries
  • How to Make a Data Dictionary  - OSF Support
  • Data Management: Data Dictionaries Video [6:30] - University of Wisconsin Data Services with Kristin Briney

Codebooks & Data Dictionaries

Data dictionaries and codebooks are essential documentation of the variables, structure, content, and layout of your datasets. A good dictionary/codebook has enough information about each variable for it to be self explanatory and interpreted properly by someone outside of your research group. The terms are often used interchangeably, but codebooks tend to for survey data and allow the reader to follow the structured format of the survey and possible response value. 

Data dictionaries and codebooks should include:

  • Examples: H40-SF12-2, FLJ36031Y, DOB
  • Example: SF12 - ASSESSMENT OF R'S GENERAL HEALTH
  • Example: Unified Medical Language System (UMLS)
  • Example: In general, would you say your health is . . .
  • Examples: Nominal, ordinary, scale, ratio, interval, none (such as for qualitative variables)
  • Example: Likert scale  - 1, 2, 3, 4, 5, temperature reading - 100.4
  • Example: Excellent, Very Good, Good, Fair, Poor
  • Summary statistics : Where appropriate and depending on the type of variable, provide unweighted summary statistics for quick reference. For categorical variables, for instance, frequency counts showing the number of times a value occurs and the percentage of cases that value represents for the variable are appropriate. For continuous variables, minimum, maximum, and median values are relevant.
  • Example: Refusal (-1), Missing due to instrument calibration issue (-9)
  • Example: Default Next Question: H00035.00
  • Example: 2007-04-05T14:30-04:00
  • Notes : Additional notes, remarks, or comments that contextualize the information conveyed in the variable or relay special instructions. For measures or questions from copyrighted instruments, the notes field is the appropriate location to cite the source.

Developed out of ICPSR, What is a Codebook?  and SAMHDA, What is a Codebook? . 

Data Dictionary Blank Template

Looking for a place to start when creating a new data dictionary? You can feel free to download and use the Data Dictionary Blank Template that we have created. The first sheet on the Excel file provides you with commonly required columns that are necessary to fully define your data. The second sheet in the Excel file is where you define the column headers and possible values that can be entered. There is an example in the first row that can be deleted for you to enter in your own data. 

This template is build off of the Ag Data Commons " Data Dictionary - Blank Template " from the United States Department of Agriculture [no longer accessible online as of 2023-12-18].

  • << Previous: ReadMe Files
  • Next: Sharing >>
  • Last Updated: Apr 16, 2024 9:47 AM
  • URL: https://guides.library.upenn.edu/datamgmt

U.S. flag

An official website of the United States government

Here's how you know

Official websites use .gov A .gov website belongs to an official government organization in the United States.

Secure .gov websites use HTTPS A lock ( ) or https:// means you’ve safely connected to the .gov website. Share sensitive information only on official, secure websites.

Home

  •   Facebook
  •   Twitter
  •   Linkedin
  •   Digg
  •   Reddit
  •   Pinterest
  •   Email

Latest Earthquakes |    Chat Share Social Media  

Data Dictionaries

A data dictionary is used to catalog and communicate the structure and content of data, and provides meaningful descriptions for individually named data objects.

Data Dictionaries & Metadata

CSV of data dictionary used to create structured metadata record

Data dictionary information can be used to fill in entity & attribute section or feature catalog of formal metadata. If you are working with data dictionary information within formal metadata, there are a number of tools that can help.

Table of Contents

  • What's in a Data Dictionary?
  • How Data Dictionaries are Used
  • Data Dictionaries are for Sharing
  • Keep Your Data Dictionary Up to Date
  • Data Dictionaries Can Reveal Poor Design Decisions
  • Making a Data Dictionary
  • What the U.S. Geological Survey Manual Requires

What's in a Data Dictionary?  

Data dictionaries store and communicate metadata about data in a database, a system, or data used by applications. A useful introduction to data dictionaries is provided in this  video . Data dictionary contents can vary but typically include some or all of the following:

  • A listing of data objects (names and definitions)
  • Detailed properties of data elements (data type, size, nullability, optionality, indexes)
  • Entity-relationship (ER) and other system-level diagrams
  • Reference data (classification and descriptive domains)
  • Missing data and quality-indicator codes
  • Business rules, such as for validation of a schema or data quality

How Data Dictionaries are Used  

  • Documentation  - provide data structure details for users, developers, and other stakeholders
  • Communication  - equip users with a common vocabulary and definitions for shared data, data standards, data flow and exchange, and help developers gage impacts of schema changes
  • Application Design  - help application developers create forms and reports with proper data types and controls, and ensure that navigation is consistent with data relationships
  • Systems Analysis  - enable analysts to understand overall system design and data flow, and to find where data interact with various processes or components
  • Data Integration  - clear definitions of data elements provide the contextual understanding needed when deciding how to map one data system to another, or whether to subset, merge, stack, or transform data for a specific use
  • Decision Making  - assist in planning data collection, project development, and other collaborative efforts

Data Dictionaries are for Sharing  

For groups of people working with similar data, having a shared data dictionary facilitates standardization by documenting common data structures and providing the precise vocabulary needed for discussing specific data elements. Shared dictionaries ensure that the meaning, relevance, and quality of data elements are the same for all users. Data dictionaries also provide information needed by those who build systems and applications that support the data. Lastly, if there is a common, vetted, and documented data resource, it is not necessary to produce separate documentation for each implementation.

Examples of Shared USGS Data Dictionaries

  • EarthExplorer USGS Landsat Data Dictionary
  • Data Dictionary for Surficial Sediment Data from the Gulf of Maine, Georges Bank, and Vicinity GIS Compilation (USGS Open-File Report 03-001)
  • Aerial Photo Single Frames Data Dictionary
  • National Elevation Dataset (NED) Data Dictionary [PDF] (Example only - updated NED Data Dictionary will be available soon)
  • National Hydrography Dataset Data Dictionary  

Examples of non-USGS Data Dictionaries

  • Planetary Science Dictionary  (NASA)
  • MODIS Level 1B Products Data Dictionary  (NASA)
  • Data Dictionary for Organic Carbon Sorption and Decomposition in Selected Global Soils  (ORNL)
  • Human Health Risk Assessment Data Dictionary  (ORNL)
  • Climate and Forecast Conventions Standard Name Table
  • Data Dictionary for the National Database of Deep-Sea Corals  (NOAA)
  • JPL Planetary Data System Data Dictionary

Keep Your Data Dictionary Up to Date  

Plan ahead  for storing data at the start of any project by developing a schema or data model as a guide to data requirements. As required and optional data elements are identified, add them to the data dictionary. When data structures change, update the dictionary. Try to use naming conventions appropriate to the system or subject area. The easiest path is to adopt and cite a data standard, thus avoiding the need to provide and manage your own documentation.

The Alaska Science Center  Research Data Management Plan [PDF]  has excellent examples of a Data Description Form and other forms to capture metadata before, during, and at the end of a project.

Data Dictionaries Can Reveal Poor Design Decisions  

For both data reviewers and data users, the data dictionary can reveal potential credibility problems within the data. Poor table organization and object naming can severely limit data understandability and ease-of-use, incomplete data definitions can render otherwise stellar data virtually useless, and failure to keep the dictionary up to date with the actual data structures suggests a lack of data stewardship. Although getting critical feedback about their data may be initially troublesome for some data creators, developing good data design and description habits is worth the effort and ultimately benefits everyone who will use the data.

Learn more about naming conventions and find guides to writing column descriptions at  Best Practices for Data Dictionary Definitions and Usage  and  Captain Obvious' Guide to Column Descriptions - Data Dictionary Best Practices .

Making a Data Dictionary  

Most database management systems (DBMS) have built-in, active data dictionaries and can generate documentation as needed ( SQL Server ,  Oracle ,  mySQL ). The same is true when designing data systems using  CASE tools  (Computer-aided software engineering). The open source  Analyzer tool  for MS Access can be used to document Access databases and Access-connected data (SQL Server, Oracle, and others). Finally, use the  Data Dictionary - Blank Template  for manually creating a simple 'data dictionary' in Excel.

For information on creating a data dictionary in a formal metadata file (Entity and Attribute section) refer to the  Metadata page .

What the U.S. Geological Survey Manual Requires  

The USGS Survey Manual Chapter  502.7 – Fundamental Science Practices: Metadata for USGS Scientific Information Products Including Data  requires that data metadata records include information such as who produced the data and why, methodologies and citations, collection and processing methods, definitions of entities and attributes, geographic location, and any access or use constraints, all of which facilitate evaluation of the data and information for use.

Related Topics  

  • Data Acquisition Methods  - check the data dictionary when acquiring data from external sources
  • Data and File Formats  - capture file, table, and field names and properties in a data dictionary
  • Data Modeling - gather data requirements and use design standards to help build data dictionaries
  • Data Standards  - use a standard that includes a fully defined data structure
  • Data Templates  - use a template for a predefined schema and data dictionary
  • Domains  - include domains (reference lists, lookup tables) as part of the dictionary information
  • Naming Conventions - apply a consistent approach to create meaningful table and field names; consider a similar naming convention for files and folders
  • Organize Files and Data  - include the name and description of data files in the metadata and associate the file names with tables in the data dictionary
  • DOI. 2008.  Data Quality Management Guide [PDF] .
  • USGS Science Analytics and Synthesis (SAS) -  Biocomplexity Thesaurus .
  • Northwest Environmental Data-Network.  Best Practices for Data Dictionary Definitions and Usage [PDF] .

Examples, Tools and Templates

  • Entity/Attribute  metadata  for: Knight, R.R., Cartwright, J.M., and Ladd, D.E., 2016, Streamflow and fish community diversity data for use in developing ecological limit functions for the Cumberland Plateau, northeastern Middle Tennessee and southwestern Kentucky, 2016: U.S. Geological Survey Data Release:  https://doi.org/10.5066/F7JH3J83 .
  • JPL, 2008, Planetary Science Data Dictionary, JPL D-7116, Rev. F (Corresponds to Database Build pdscat1r71),  https://mirrors.asun.co/climate-mirror/pds/pds.nasa.gov/documents/psdd/PSDDmain_1r71.pdf .
  • National Water Information System (NWIS).  Search Criteria and Codes .
  • USDA, Ag Data Commons Data Submission Manual v1.3.  Data Dictionary Blank Template .
  • 53 Data Dictionary Tools .

Page last updated 1/2/24.

Data Topics

  • Data Architecture
  • Data Literacy
  • Data Science
  • Data Strategy
  • Data Modeling
  • Governance & Quality
  • Data Education
  • Enterprise Information Management
  • Information Management Articles

The Data Dictionary Demystified

Understanding Big Data and Data Governance goes hand in hand with the concept of a Data Dictionary. Data Dictionaries have been integral to business functions. This article will demystify and help to clarify the Data Dictionary model. What is a Data Dictionary?  A Data Dictionary provides the ingredients and steps needed to create relevant business […]

data dictionary in research paper

What is a Data Dictionary?

  A Data Dictionary provides the ingredients and steps needed to create relevant business reports from a database. The UCMerced Library simply states in “What is A Data Dictionary” that a Data Dictionary is a “collection of names, definitions and attributes about elements that are being used or captured in a database.” This array, describing a database, needs to provides guidelines, as users enter, edit and delete data in real time. A  Database Administrator may likely deal with fluid data. In this case, an Active Data Dictionary, as defined by Gartner’s IT Glossary, provides a “facility for storing dynamically accessible and modifiable information.”

The International Standards Organization (ISO) proposes, in Understanding the Data Dictionary, three categories: Business Concepts, Data Types and Message Concepts. Business Concepts define a business Metadata layer, as described by Zaino, as the “definitions for the physical data that people will access in business terms.” Data Types describe formats for data elements to be considered valid. Message Concepts a shared understanding between institutions and companies to ensure business communications are within the same context. These three Data Dictionary items: Business Concepts, Data Types, and Message Concepts interrelate to one another.

Advantages of a Data Dictionary

  A Data Dictionary helps change to be possible. It saves the extra time figuring out what the data means and how it interrelates. Advantages of a Data Dictionary include:

  • Consistent Use of Vocabulary: Meaningful information requires instructions on how vocabulary is used and understanding of the context. For example, take the “contact” data element. In a college’s Corporate Relations office, a contact may mean a person, in a private corporation, who would be willing to fund college research and scholarships. To an Admissions Department, a contact data element consists of student’s parents or an alumnus. To the person just hired as an admin assistant, a contact data element may mean a person whom he or she has telephoned or emailed. Without clear definition, in a Data Dictionary, the data entered could take any one of the meanings.
  • Useful reports: As the University of Michigan’s Information and Technology Services states “if you don’t understand how the data is structured, the links between tables, and which BusinessObjects folders to use, your report results may be incorrect.” Add the need to generate reports in a dynamic environment, and <Data Dictionaries> become essential.
  • Easier Data Document Management: Making a Data Dictionary responsive to change requires simply, access to a computer program with word processing or pen with paper. Blaha states in Documenting Data Models that a Data Dictionary can be easily printed. Such a resource” is simple to receive and requires no modeling tool skill. There is no tool cost” or special software needed to access such information.
  • Smoother Database Upgrades: Like the Windows OS, database software, such as that from Oracle, needs to be periodically upgraded. To do this a Data Dictionary is crucial and is a built in aspect of the program. For example Oracle Financial Services Analytical Applications (OFSAA) as well as the Oracle Financial Services Data Foundation (OFSDF) detail how to generate Data Dictionary documentation “to account for site-specific changes as well as release-specific changes from Oracle.”
  • More Meaningful Metadata: To have accessible data it needs to be “properly collected and stored.” Metadata provides information about the “ context, content, quality, provenance, and/or accessibility of a set of data.” Data Dictionaries provides a centralized location to describe Metadata about the database. As mentioned by AHIMA, having an established Data (AHIMA, 2016). This includes the Metadata pertaining to a database. Just as in the health industry, a Data Dictionary maps any businesses data use by keeping everyone on the same page about the data’s function.

Alternatives to a Data Dictionary

Data Dictionaries do have some draw backs. First, it can be time consuming and cumbersome for a business to maintain and use a comprehensive Data Dictionary. For example, it would be inconvenient for a customer to learn Metadata in order to places an order. Likewise, a Business Analyst, under a tight deadline, may not have time to update or consult Data Dictionary documentation. A start-up environment may not have the information necessary to start a Data Dictionary. Consider these alternatives to a Data Dictionary:

  • Captions and Prompts in Forms and Reports: Define Data Elements as they are needed. For example, go to the Address section of a typical e commerce site. A “Select” caption, by State or Provence, instructs a user to choose from a pull down list. Options only include specific menu selections, depending on a particular country chosen. This prevents customers from entering bad data and keeps data consistent. Should a business analyst need to report on the revenue from a particular state, a similar prompts and a pull down box can be used. This verbal prompting may be used along with other data elements to keep business elements consistent.
  • User Stories: In Agile development, user stories form the basis to creating a new or updating a product, including a database. “A user story is an artifact describing that an agent (the who) wants to do a specific action    (the what) for a specific purpose (the why). It also specifies what steps are required to show or measure (the how).”

As project managers and participants hash out how a program functions and what a customer needs, they define data elements in terms of business context, format, and message. Add specifics about what needs to be captured or used in the database to the story and make the collection of user stories searchable by business context for future sprints. Voila, the objectives in creating a common understanding and vocabulary of data elements happen concurrently with the objectives in the Agile development process.

While captions, prompting or user stories may provide an immediate fix to defining databases, it probably is not a good long-term strategy. Over time businesses grow and the databases evolve. Also the data elements needed to report on how business contacts benefit a business or the number of doctor’s visits needs, becomes murky and complex. Spending the extra time constructing a Data Dictionary would allow for clarification sooner than later.

Data Dictionaries: A Case Study

To look at the value of the Data Dictionary consider the Human Genome Project (HGP). International researchers have worked for years on Human Genome Project to construct a genetic map of humans, account for different genetic variations, and to make this genetic information available for use and analysis . Support for a Data Dictionary type of resource became a crucial requirement for the HGP and was thus created. An established Data Dictionary led to the success of the HGP. According to the National Institutes of Health (NIH) this includes:

  • Completion of the Human Genome Project, under budget and more than two years ahead of schedule, in April 2003.
  • Discovery of more than 1,800 disease genes, as of today
  • Identification of a genetic cause for a disease assessed in a matter of days, from many years.
  • More than 2000 genetic tests, enabling patients to learn about their risks.

As the HGP demonstrates, quicker results at a low cost come in part, from an excellent Data Dictionary with a shared vocabulary. This continuing legacy shows how Metadata in a database works hand in hand towards business’s success with Big Data.

Leave a Reply Cancel reply

You must be logged in to post a comment.

Book cover

Encyclopedia of Mathematical Geosciences pp 1–5 Cite as

Data Dictionary

  • Madhurima Panja 7 ,
  • Tanujit Chakraborty 7 &
  • Uttam Kumar 7  
  • Living reference work entry
  • Later version available View entry history
  • First Online: 10 November 2022

11 Accesses

Part of the book series: Encyclopedia of Earth Sciences Series ((EESS))

Data dictionary stores catalog information about schemas and constraints, design decisions, usage standards, application program description, user information, etc., that can be accessed directly by users or the database administrators when needed (Elmasri and Navathe 2000 ). Such a system is also called an information repository and is a record of the objects in the database (Raschka and Mirjalili 2019 ). In technological domain, these objects are referred to as metadata. As per the IBM Dictionary of Computing, data dictionary is defined as “centralized repository containing information of the data in the database such that the meaning, relationship, source of the data, where it will be used and the format is clearly mentioned or specified” (McDaniel 1994 ). In other words, a data dictionary is often termed as a data definition matrix, which is a textual description of data objects and their interrelationships. It is commonly used in confirming data requirements and for...

This is a preview of subscription content, log in via an institution .

Baccianella, S., Esuli, A. & Sebastiani, F., 2010. Sentiwordnet 3.0: an enhanced lexical resource for sentiment analysis and opinion mining. In Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10).

Google Scholar  

Barclay J, Starn J, Briggs M, Helton A (2020) Improved prediction of management-relevant groundwater discharge characteristics throughout river networks. Water Resour Res

Batini, C., Di Battista, G. & Santucci, G., 1990. A methodology for the design of data dictionaries. Ninth Annual International Phoenix Conference on Computers and Communications (pp. 706 – 707 ). IEEE Computer Society

Date CJ (2000) An introduction to database systems, 7th edn. Pearson Education Inc

Elmasri R, Navathe S (2000) Fundamentals of database systems. Addison-Wesley

Gatti L, Guerini M, Turchi M (2015) SentiWords: deriving a high precision and high coverage lexicon for sentiment analysis. IEEE Trans Affect Comput 7(4):409–421

Article   Google Scholar  

Hutto C, Gilbert E (2014) Vader: a parsimonious rule-based model for sentiment analysis of social media text. Proc Int AAAI Conf Web Soc Media 8(1):216–225

Long Z, Xinqing W (2014) General geo-spatial database construction method based on data dictionary. Remote Sens Land Resour 26(1):173–178

McDaniel G (1994) IBM dictionary of computing. McGraw-Hill, Inc., New York

Medhat W, Hassan A, Korashy H (2014) Sentiment analysis algorithms and applications: a survey. Ain Shams Eng J:1093–1113

Needham, M., n.d. International data corporation. [online] Available at: https://www.idc.com/getdoc.jsp?containerId=prUS48165721

Ramakrishna R, Gehrke J (2000) Database management systems, 2nd edn. McGraw-Hill

Raschka S, Mirjalili V (2019) Python machine learning: machine learning and deep learning with python, scikit-learn, and TensorFlow 2, 3rd edn. Packt Publishing Ltd.

Silberschatz A, Korth HF, Sudarshan S (2006) Database system concepts. McGraw-Hill, Singapore

Download references

Author information

Authors and affiliations.

Spatial Computing Laboratory, Center for Data Sciences, International Institute of Information Technology Bangalore (IIITB), Bangalore, India

Madhurima Panja, Tanujit Chakraborty & Uttam Kumar

You can also search for this author in PubMed   Google Scholar

Corresponding author

Correspondence to Uttam Kumar .

Editor information

Editors and affiliations.

System Science & Informatics Unit, Indian Statistical Institute- Bangalore Centre, Bangalore, India

B. S. Daya Sagar

Insititue of Earth Sciences, China University of Geosciences, Beijing, China

Qiuming Cheng

School of Natural and Built Environment, Queen's University Belfast, Belfast, UK

Jennifer McKinley

Canada Geological Survey, Ottawa, ON, Canada

Frits Agterberg

Rights and permissions

Reprints and permissions

Copyright information

© 2022 Springer Nature Switzerland AG

About this entry

Cite this entry.

Panja, M., Chakraborty, T., Kumar, U. (2022). Data Dictionary. In: Daya Sagar, B.S., Cheng, Q., McKinley, J., Agterberg, F. (eds) Encyclopedia of Mathematical Geosciences. Encyclopedia of Earth Sciences Series. Springer, Cham. https://doi.org/10.1007/978-3-030-26050-7_75-1

Download citation

DOI : https://doi.org/10.1007/978-3-030-26050-7_75-1

Received : 25 February 2022

Accepted : 25 April 2022

Published : 10 November 2022

Publisher Name : Springer, Cham

Print ISBN : 978-3-030-26050-7

Online ISBN : 978-3-030-26050-7

eBook Packages : Springer Reference Earth and Environm. Science Reference Module Physical and Materials Science Reference Module Earth and Environmental Sciences

  • Publish with us

Policies and ethics

Chapter history

DOI: https://doi.org/10.1007/978-3-030-26050-7_75-2

DOI: https://doi.org/10.1007/978-3-030-26050-7_75-1

  • Find a journal
  • Track your research

Subscribe to the PwC Newsletter

Join the community, edit social preview.

data dictionary in research paper

Add a new code entry for this paper

Remove a code repository from this paper, mark the official implementation from paper authors, add a new evaluation result row.

  • DATA AUGMENTATION
  • MULTI-TASK LEARNING
  • NATURAL LANGUAGE INFERENCE

Remove a task

data dictionary in research paper

Add a method

Remove a method, edit datasets, dke-research at semeval-2024 task 2: incorporating data augmentation with generative models and biomedical knowledge to enhance inference robustness.

14 Apr 2024  ·  Yuqi Wang , Zeqiang Wang , Wei Wang , Qi Chen , Kaizhu Huang , Anh Nguyen , Suparna De · Edit social preview

Safe and reliable natural language inference is critical for extracting insights from clinical trial reports but poses challenges due to biases in large pre-trained language models. This paper presents a novel data augmentation technique to improve model robustness for biomedical natural language inference in clinical trials. By generating synthetic examples through semantic perturbations and domain-specific vocabulary replacement and adding a new task for numerical and quantitative reasoning, we introduce greater diversity and reduce shortcut learning. Our approach, combined with multi-task learning and the DeBERTa architecture, achieved significant performance gains on the NLI4CT 2024 benchmark compared to the original language models. Ablation studies validate the contribution of each augmentation method in improving robustness. Our best-performing model ranked 12th in terms of faithfulness and 8th in terms of consistency, respectively, out of the 32 participants.

Code Edit Add Remove Mark official

Tasks edit add remove, datasets edit, results from the paper edit, methods edit add remove.

To revisit this article, visit My Profile, then View saved stories .

  • Backchannel
  • Newsletters
  • WIRED Insider
  • WIRED Consulting

Amanda Hoover

Students Are Likely Writing Millions of Papers With AI

Illustration of four hands holding pencils that are connected to a central brain

Students have submitted more than 22 million papers that may have used generative AI in the past year, new data released by plagiarism detection company Turnitin shows.

A year ago, Turnitin rolled out an AI writing detection tool that was trained on its trove of papers written by students as well as other AI-generated texts. Since then, more than 200 million papers have been reviewed by the detector, predominantly written by high school and college students. Turnitin found that 11 percent may contain AI-written language in 20 percent of its content, with 3 percent of the total papers reviewed getting flagged for having 80 percent or more AI writing. (Turnitin is owned by Advance, which also owns Condé Nast, publisher of WIRED.) Turnitin says its detector has a false positive rate of less than 1 percent when analyzing full documents.

ChatGPT’s launch was met with knee-jerk fears that the English class essay would die . The chatbot can synthesize information and distill it near-instantly—but that doesn’t mean it always gets it right. Generative AI has been known to hallucinate , creating its own facts and citing academic references that don’t actually exist. Generative AI chatbots have also been caught spitting out biased text on gender and race . Despite those flaws, students have used chatbots for research, organizing ideas, and as a ghostwriter . Traces of chatbots have even been found in peer-reviewed, published academic writing .

Teachers understandably want to hold students accountable for using generative AI without permission or disclosure. But that requires a reliable way to prove AI was used in a given assignment. Instructors have tried at times to find their own solutions to detecting AI in writing, using messy, untested methods to enforce rules , and distressing students. Further complicating the issue, some teachers are even using generative AI in their grading processes.

Detecting the use of gen AI is tricky. It’s not as easy as flagging plagiarism, because generated text is still original text. Plus, there’s nuance to how students use gen AI; some may ask chatbots to write their papers for them in large chunks or in full, while others may use the tools as an aid or a brainstorm partner.

Students also aren't tempted by only ChatGPT and similar large language models. So-called word spinners are another type of AI software that rewrites text, and may make it less obvious to a teacher that work was plagiarized or generated by AI. Turnitin’s AI detector has also been updated to detect word spinners, says Annie Chechitelli, the company’s chief product officer. It can also flag work that was rewritten by services like spell checker Grammarly, which now has its own generative AI tool . As familiar software increasingly adds generative AI components, what students can and can’t use becomes more muddled.

Detection tools themselves have a risk of bias. English language learners may be more likely to set them off; a 2023 study found a 61.3 percent false positive rate when evaluating Test of English as a Foreign Language (TOEFL) exams with seven different AI detectors. The study did not examine Turnitin’s version. The company says it has trained its detector on writing from English language learners as well as native English speakers. A study published in October found that Turnitin was among the most accurate of 16 AI language detectors in a test that had the tool examine undergraduate papers and AI-generated papers.

Airchat Is Silicon Valley’s Latest Obsession

Lauren Goode

Donald Trump Poses a Unique Threat to Truth Social, Says Truth Social

William Turton

The Paradox That's Supercharging Climate Change

Eric Ravenscraft

Schools that use Turnitin had access to the AI detection software for a free pilot period, which ended at the start of this year. Chechitelli says a majority of the service’s clients have opted to purchase the AI detection. But the risks of false positives and bias against English learners have led some universities to ditch the tools for now. Montclair State University in New Jersey announced in November that it would pause use of Turnitin’s AI detector. Vanderbilt University and Northwestern University did the same last summer.

“This is hard. I understand why people want a tool,” says Emily Isaacs, executive director of the Office of Faculty Excellence at Montclair State. But Isaacs says the university is concerned about potentially biased results from AI detectors, as well as the fact that the tools can’t provide confirmation the way they can with plagiarism. Plus, Montclair State doesn’t want to put a blanket ban on AI, which will have some place in academia. With time and more trust in the tools, the policies could change. “It’s not a forever decision, it’s a now decision,” Isaacs says.

Chechitelli says the Turnitin tool shouldn’t be the only consideration in passing or failing a student. Instead, it’s a chance for teachers to start conversations with students that touch on all of the nuance in using generative AI. “People don’t really know where that line should be,” she says.

Computer Science > Computers and Society

Title: Synthetic Census Data Generation via Multidimensional Multiset Sum

Abstract: The US Decennial Census provides valuable data for both research and policy purposes. Census data are subject to a variety of disclosure avoidance techniques prior to release in order to preserve respondent confidentiality. While many are interested in studying the impacts of disclosure avoidance methods on downstream analyses, particularly with the introduction of differential privacy in the 2020 Decennial Census, these efforts are limited by a critical lack of data: The underlying "microdata," which serve as necessary input to disclosure avoidance methods, are kept confidential. In this work, we aim to address this limitation by providing tools to generate synthetic microdata solely from published Census statistics, which can then be used as input to any number of disclosure avoidance algorithms for the sake of evaluation and carrying out comparisons. We define a principled distribution over microdata given published Census statistics and design algorithms to sample from this distribution. We formulate synthetic data generation in this context as a knapsack-style combinatorial optimization problem and develop novel algorithms for this setting. While the problem we study is provably hard, we show empirically that our methods work well in practice, and we offer theoretical arguments to explain our performance. Finally, we verify that the data we produce are "close" to the desired ground truth.
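To make the abstract’s framing more concrete, below is a minimal, hypothetical sketch of the simplest version of the problem it describes: generating one synthetic record per person so that the records’ marginal counts exactly match two published totals. The attribute names, the counts, and the shuffle-and-pair strategy are illustrative assumptions only; the paper defines a principled distribution over microdata and knapsack-style sampling algorithms, which this toy example does not attempt to reproduce.

```python
import random
from collections import Counter


def sample_microdata(row_totals, col_totals, seed=None):
    """Toy sketch: build synthetic "microdata" (one record per person)
    whose marginal counts exactly match two published totals.

    row_totals and col_totals are dicts of published counts over two
    attributes; both must sum to the same population size.
    """
    if sum(row_totals.values()) != sum(col_totals.values()):
        raise ValueError("published totals must describe the same population")

    rng = random.Random(seed)
    # Expand each marginal into one label per person...
    rows = [label for label, n in row_totals.items() for _ in range(n)]
    cols = [label for label, n in col_totals.items() for _ in range(n)]
    # ...then pair the two lists at random. Any pairing satisfies both
    # margins by construction; shuffling just picks one consistent table.
    rng.shuffle(cols)
    return list(zip(rows, cols))


if __name__ == "__main__":
    # Hypothetical published statistics: an age breakdown and a housing-tenure
    # breakdown for the same 200-person block.
    records = sample_microdata(
        {"under_18": 40, "18_to_64": 130, "65_plus": 30},
        {"renter": 90, "owner": 110},
        seed=42,
    )
    # Joint cell counts vary from run to run, but the margins always match.
    print(Counter(records))
```

Even in this two-marginal toy case, the published totals leave the joint cell counts (for example, how many renters are under 18) undetermined, which hints at why the full problem, with many interacting statistics across geographic levels, becomes the hard combinatorial optimization the authors describe.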
