Face Recognition by Humans and Machines: Three Fundamental Advances from Deep Learning

Alice J. O'Toole

1 School of Behavioral and Brain Sciences, The University of Texas at Dallas, Richardson, Texas 75080, USA;

Carlos D. Castillo

2 Department of Electrical and Computer Engineering, Johns Hopkins University, Baltimore, Maryland 21218, USA;

Deep learning models currently achieve human levels of performance on real-world face recognition tasks. We review scientific progress in understanding human face processing using computational approaches based on deep learning. This review is organized around three fundamental advances. First, deep networks trained for face identification generate a representation that retains structured information about the face (e.g., identity, demographics, appearance, social traits, expression) and the input image (e.g., viewpoint, illumination). This forces us to rethink the universe of possible solutions to the problem of inverse optics in vision. Second, deep learning models indicate that high-level visual representations of faces cannot be understood in terms of interpretable features. This has implications for understanding neural tuning and population coding in the high-level visual cortex. Third, learning in deep networks is a multistep process that forces theoretical consideration of diverse categories of learning that can overlap, accumulate over time, and interact. Diverse learning types are needed to model the development of human face processing skills, cross-race effects, and familiarity with individual faces.

1. INTRODUCTION

The fields of vision science, computer vision, and neuroscience are at an unlikely point of convergence. Deep convolutional neural networks (DCNNs) now define the state of the art in computer-based face recognition and have achieved human levels of performance on real-world face recognition tasks ( Jacquet & Champod 2020 , Phillips et al. 2018 , Taigman et al. 2014 ). This behavioral parity allows for meaningful comparisons of representations in two successful systems. DCNNs also emulate computational aspects of the ventral visual system ( Fukushima 1988 , Krizhevsky et al. 2012 , LeCun et al. 2015 ) and support surprisingly direct, layer-to-layer comparisons with primate visual areas ( Yamins et al. 2014 ). Nonlinear, local convolutions, executed in cascaded layers of neuron-like units, form the computational engine of both biological and artificial neural networks for human and machine-based face recognition. Enormous numbers of parameters, diverse learning mechanisms, and high-capacity storage in deep networks enable a wide variety of experiments at multiple levels of analysis, from reductionist to abstract. This makes it possible to investigate how systems and subsystems of computations support face processing tasks.

Our goal is to review scientific progress in understanding human face processing with computational approaches based on deep learning. As we proceed, we bear in mind wise words written decades ago in a paper on science and statistics: “All models are wrong, but some are useful” ( Box 1979 , p. 202) (see the sidebar titled Perspective: Theories and Models of Face Processing and the sidebar titled Caveat: Iteration Between Theory and Practice ). Since all models are wrong, in this review, we focus on what is useful. For present purposes, computational models are useful when they give us insight into the human visual and perceptual system. This review is organized around three fundamental advances in understanding human face perception, using knowledge generated from deep learning models. The main elements of these advances are as follows.

PERSPECTIVE: THEORIES AND MODELS OF FACE PROCESSING

Box (1976) reminds us that scientific progress comes from motivated iteration between theory and practice. In understanding human face processing, theories should be used to generate the questions, and machines (as models) should be used to answer the questions. Three elemental concepts are required for scientific progress. The first is flexibility. Effective iteration between theory and practice requires feedback between what the theory predicts and what the model reveals. The second is parsimony. Because all models are wrong, excessive elaboration will not find the correct model. Instead, economical descriptions of a phenomenon should be preferred over complex descriptions that capture less fundamental elements of human perception. Third, Box (1976 , p. 792) cautions us to avoid “worrying selectivity” in model evaluation. As he puts it, “since all models are wrong, the scientist must be alert to what is importantly wrong.”

These principles represent a scientific ideal, rather than a reality in the field of face perception by humans and machines. Applying scientific principles to computational modeling of human face perception is challenging for diverse reasons (see the sidebar titled Caveat: Iteration Between Theory and Practice below). We argue, as Cichy & Kaiser (2019) have, that although the utility of scientific models is usually seen in terms of prediction and explanation, their function for exploration should not be underrated. As scientific models, DCNNs carry out high-level visual tasks in neurally inspired ways. They are at a level of development that is ripe for exploring computational and representational principles that actually work but are not understood. This is a classic problem in reverse engineering—yet the use of deep learning as a model introduces a dilemma. The goal of reverse engineering is to understand how a functional but highly complex system (e.g., the brain and human visual system) solves a problem (e.g., recognizes a face). To accomplish this, a well-understood model is used to test hypotheses about the underlying mechanisms of the complex system. A prerequisite of reverse engineering is that we understand how the model works. Failing that, we risk using one poorly understood system to test hypotheses about another poorly understood system. Although deep networks are not black boxes (every parameter is knowable) ( Hasson et al. 2020 ), we do not fully understand how they recognize faces ( Poggio et al. 2020 ). Therefore, the primary goal should be to understand deep networks for face recognition at a conceptual and representational level.

CAVEAT: ITERATION BETWEEN THEORY AND PRACTICE

Box (1976) noted that scientific progress depends on motivated iteration between theory and practice. Unfortunately, a motivation to iterate between theory and practice is not a reasonable expectation for the field of computer-based face recognition. Automated face recognition is big business, and the best models were not developed to study human face processing. DCNNs provide a neurally inspired, but not copied, solution to face processing tasks. Computer scientists formulated DCNNs at an abstract level, based on neural networks from the 1980s (Fukushima 1988). Current DCNN-based models of human face processing are computationally refined, scaled-up versions of these older networks. Algorithm developers make design and training decisions for performance and computational efficiency. In using DCNNs to model human face perception, researchers must choose between smaller, controlled models and larger-scale, uncontrolled networks (see also Richards et al. 2019). Controlled models are easier to analyze but can be limited in computational power and training data diversity. Uncontrolled models better emulate real neural systems but may be intractable. The easy availability of cutting-edge pretrained face recognition models, with a variety of architectures, has been the deciding factor for many research labs with limited resources and expertise to develop networks. Given the widespread use of these models in vision science, brain-similarity metrics for artificial neural networks have been developed (Schrimpf et al. 2018). These produce a Brain-Score made up of a composite of neural and behavioral benchmarks. Some large-scale (uncontrolled) network architectures used in modeling human face processing (see Section 2.1) score well on these metrics.

A promising long-term strategy is to increase the neural accuracy of deep networks ( Grill-Spector et al. 2018 ). The ventral visual stream and DCNNs both enable hierarchical and feedforward processing. This offers two computational benefits consistent with DCNNs as models of human face processing. First, the universal approximation theorem ( Hornik et al. 1989 ) ensures that both types of networks can approximate any complex continuous function relating the input (visual image) to the output (face identity). Second, linear and nonlinear feedforward connections enable fast computation consistent with the speed of human facial recognition ( Grill-Spector et al. 2018 , Thorpe et al. 1996 ). Although current DCNNs lack other properties of the ventral visual system, these can be implemented as the field progresses.

  • Deep networks force us to rethink the universe of possible solutions to the problem of inverse optics in vision. The face representations that emerge from deep networks trained for identification operate invariantly across changes in image and appearance, but they are not themselves invariant.
  • Computational theory and simulation studies of deep learning prompt a reconsideration of a long-standing axiom in vision science: that face or object representations can be understood in terms of interpretable features. Instead, in deep learning models, the concept of a nameable deep feature, localized in an output unit of the network or in the latent variables of the space, should be reevaluated.
  • Natural environments provide highly variable training data that can structure the development of face processing systems using a variety of learning mechanisms that overlap, accumulate over time, and interact. It is no longer possible to invoke learning as a generic theoretical account of a behavioral or neural phenomenon.

We focus on deep learning findings that are relevant for understanding human face processing—broadly construed. The human face provides us with diverse information, including identity, gender, race or ethnicity, age, and emotional state. We use the face to make inferences about a person’s social traits ( Oosterhof & Todorov 2008 ). As we discuss below, deep networks trained for identification retain much of this diverse facial information (e.g., Colón et al. 2021 , Dhar et al. 2020 , Hill et al. 2019 , Parde et al. 2017 , Terhörst et al. 2020 ). The use of face recognition algorithms in applied settings (e.g., law enforcement) has spurred detailed performance comparisons between DCNNs and humans (e.g., Phillips et al. 2018 ). For analogous reasons, the problem of human-like race bias in DCNNs has also been studied (e.g., Cavazos et al. 2020 ; El Khiyari & Wechsler 2016 ; Grother et al. 2019 ; Krishnapriya et al. 2019 , 2020 ). Developmental data on infants’ exposure to faces in the first year(s) of life offer insight into how to structure the training of deep networks ( Smith & Slone 2017 ). These topics are within the scope of this review. Although we consider general points of comparison between DCNNs and neural responses in face-selective areas of the primate inferotemporal (IT) cortex, a detailed discussion of this topic is beyond the scope of this review. (For a review of primate face-selective areas that considers computational perspectives, see Hesse & Tsao 2020 ). In this review, we focus on the computational and representational principles of neural coding from a deep learning perspective.

The review is organized as follows. We begin with a brief review of where machine performance on face identification stands relative to humans in quantitative terms. Qualitative performance comparisons on identification and other face processing tasks (e.g., expression classification, social perception, development) are integrated into Sections 2 – 4 . These sections consider advances in understanding human face processing from deep learning approaches. We close with a discussion of where the next steps might lead.

1.1. Where We Are Now: Human Versus Machine Face Recognition

Deep learning models of face identification map widely variable images of a face onto a representation that supports identification accuracy comparable to that of humans. The steady progress of machines over the past 15 years can be summarized in terms of the increasingly challenging face images that they can recognize ( Figure 1 ). By 2007, the best algorithms surpassed humans on a task of identity matching for unfamiliar faces in frontal images taken indoors ( O’Toole et al. 2007 ). By 2012, well-established algorithms exceeded human performance on frontal images with moderate changes in illumination and appearance ( Kumar et al. 2009 , Phillips & O’Toole 2014 ). Machine ability to match identity for in-the-wild images appeared with the advent of DCNNs in 2013–2014. Human face recognition was marginally more accurate than DeepFace ( Taigman et al. 2014 ), an early DCNN, on the Labeled Faces in the Wild (LFW) data set ( Huang et al. 2008 ). LFW contains in-the-wild images taken mostly from the front. DCNNs now fare well on in-the-wild images with significant pose variation (e.g., Maze et al. 2018 , data set). Sengupta et al. (2016) found parity between humans and machines on frontal-to-frontal identity matching but human superiority on frontal-to-profile matching.

Figure 1: The progress of computer-based face recognition systems can be tracked by their ability to recognize faces with increasing levels of image and appearance variability. In 2006, highly controlled, cropped face images with moderate variability, such as the images of the same person shown, were challenging (images adapted with permission from Sim et al. 2002). In 2012, algorithms could tackle moderate image and appearance variability (the top four images are extreme examples adapted with permission from Huang et al. 2012; the bottom two images are adapted with permission from Phillips et al. 2011). By 2018, deep convolutional neural networks (DCNNs) began to tackle wide variation in image and appearance (images adapted with permission from the database in Maze et al. 2018). In the 2012 and 2018 images, all side-by-side images show the same person except the bottom pair of 2018 panels.

Identity matching: process of determining if two or more images show the same identity or different identities; this is the most common task performed by machines.

Human face recognition: the ability to determine whether a face is known.

1.2. Expert Humans and State-of-the-Art Machines Work Together

DCNNs can sometimes even surpass normal human performance. Phillips et al. (2018) compared humans and machines matching the identity of faces in high-quality frontal images. Although this is generally considered an easy task, the images tested were chosen to be highly challenging based on previous human and machine studies. Four DCNNs developed between 2015 and 2017 were compared to human participants from five groups: professional forensic face examiners, professional forensic face reviewers, superrecognizers (Noyes et al. 2017, Russell et al. 2009), professional fingerprint examiners, and students. Face examiners, reviewers, and superrecognizers performed more accurately than fingerprint examiners, and fingerprint examiners performed more accurately than students. Machine performance, from 2015 to 2017, tracked human skill levels. The 2015 algorithm (Parkhi et al. 2015) performed at the level of the students; the 2016 algorithm (Chen et al. 2016) performed at the level of the fingerprint examiners (Ranjan et al. 2017c); and the two 2017 algorithms (Ranjan et al. 2017a,c) performed at the level of professional face reviewers and examiners, respectively. Notably, combining the judgments of individual professional face examiners with those of the best algorithm (Ranjan et al. 2017a) yielded perfect performance. This suggests a degree of strategic diversity for the face examiners and the DCNN and demonstrates the potential for effective human–machine collaboration (Phillips et al. 2018).
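
To make the fusion result concrete, the sketch below shows one simple way to combine human and algorithm identity-matching judgments at the score level. The data and the equal-weight averaging rule are hypothetical stand-ins, not the exact procedure of Phillips et al. (2018).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical identity-matching scores for 100 face-image pairs
# (higher values lean toward "same person").
examiner_scores = rng.normal(loc=1.0, scale=2.0, size=100)   # one examiner
algorithm_scores = rng.normal(loc=1.5, scale=1.0, size=100)  # one DCNN

def zscore(x):
    """Standardize a rater's scores so different scales are comparable."""
    return (x - x.mean()) / x.std()

# Score-level fusion: average the standardized judgments.
fused = (zscore(examiner_scores) + zscore(algorithm_scores)) / 2

# A pair is called "same identity" when the fused score clears a threshold.
decisions = fused > 0.0
print(decisions[:10])
```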

Combined, the data indicate that machine performance has improved from a level comparable to that of a person recognizing unfamiliar faces to one comparable to that of a person recognizing more familiar faces ( Burton et al. 1999 , Hancock et al. 2000 , Jenkins et al. 2011 ) (see Section 4.1 ).

2. RETHINKING INVERSE OPTICS AND FACE REPRESENTATIONS

Deep networks force us to rethink the universe of possible solutions to the problem of inverse optics in vision. These networks operate with a degree of invariance to image and appearance that was unimaginable by researchers less than a decade ago. Invariance refers to the model’s ability to consistently identify a face when image conditions (e.g., viewpoint, illumination) and appearance (e.g., glasses, facial hair) vary. The nature of the representation that accomplishes this is not well understood. The inscrutability of DCNN codes is due to the enormous number of computations involved in generating a face representation from an image and the uncontrolled training data. To create a face representation, millions of nonlinear, local convolutions are executed over tens (to hundreds) of layers of units. Researchers exert little or no control over the training data, but instead source face images from the web with the goal of finding as much labeled training data as possible. The number of images per identity and the types of images (e.g., viewpoint, expression, illumination, appearance, quality) are left (mostly) to what is found through web scraping. Nevertheless, DCNNs produce a surprisingly structured and rich face representation that we are beginning to understand.

2.1. Mining the Face Identity Code in Deep Networks

The face representation generated by DCNNs for the purpose of identifying a face also retains detailed information about the characteristics of the input image (e.g., viewpoint, illumination) and the person pictured (e.g., gender, age). As shown below, this unified representation can solve multiple face processing tasks in addition to identification.

2.1.1. Image characteristics.

Face representations generated by deep networks both are and are not invariant to image variation. These codes can identify faces invariantly over image change, but they are not themselves invariant. Instead, face representations of a single identity vary systematically as a function of the characteristics of the input image. The representations generated by DCNNs are, in fact, representations of face images.

Work to dissect face identity codes draws on the metaphor of a face space (Valentine 1991) adapted to representations generated by a DCNN. Visualization and simulation analyses demonstrate that identity codes for face images retain ordered information about the input image (Dhar et al. 2020, Hill et al. 2019, Parde et al. 2017). Viewpoint (yaw and pitch) can be predicted accurately from the identity code, as can media source (still image or video frame) (Parde et al. 2017). Image quality (blur, usability, occlusion) is also available as the identity code norm (vector length). Poor-quality images produce face representations centered in the face space, creating a DCNN "garbage dump." This organizational structure was replicated in two DCNNs with different architectures, one developed by Chen et al. (2016) with seven convolutional layers and three fully connected layers and another developed by Sankaranarayanan et al. (2016) with 11 convolutional layers and one fully connected layer. Image quality estimates can also be optimized directly in a DCNN using human ratings (Best-Rowden & Jain 2018).
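
A minimal sketch of this kind of read-out analysis follows, with hypothetical random descriptors standing in for real DCNN outputs (for which the regression fit would be far better than chance): a linear probe predicts yaw from the identity code, and the vector norm serves as an image-quality proxy.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical stand-ins: 512-D face descriptors and the yaw (degrees)
# of the image each descriptor came from.
rng = np.random.default_rng(1)
descriptors = rng.normal(size=(1000, 512))
yaw = rng.uniform(-90.0, 90.0, size=1000)

# Linear read-out of viewpoint from the identity code: if yaw is
# predictable, the "invariant" code still carries image information.
probe = LinearRegression().fit(descriptors[:800], yaw[:800])
print("held-out R^2:", probe.score(descriptors[800:], yaw[800:]))

# Image-quality proxy: the norm (vector length) of each representation;
# poor-quality images tend to land near the center of the face space.
quality = np.linalg.norm(descriptors, axis=1)
```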

Face space: representation of the similarity of faces in a multidimensional space.

For a closer look at the structure of DCNN face representations, Hill et al. (2019) examined the representations of highly controlled face images in a face space generated by a deep network trained with in-the-wild images. The network processed images of three-dimensional laser scans of human heads rendered from five viewpoints under two illumination conditions (ambient, harsh spotlight). Visualization of these representations in the resulting face space showed a highly ordered pattern (see Figure 2 ). Consistent with the network’s high accuracy at face identification, images clustered by identity. Identity clusters separated into regions of male and female faces (see Section 2.1.2 ). Within each identity cluster, the images separated by illumination condition—visible in the face space as chains of images. Within each illumination chain, the image representations were arranged in the space by viewpoint, which varied systematically along the image chain. To further probe the coding of identity, Hill et al. (2019) processed images of caricatures of the 3D heads (see also Blanz & Vetter 1999 ). Caricature representations were centered in each identity cluster, indicating that the network perceived a caricature as a good likeness of the identity.

Figure 2: Visualization of the top-level deep convolutional neural network (DCNN) similarity space for all images from Hill et al. (2019). (a–f) Points are colored according to different variables. Grey polygonal borders are for illustration purposes only and show the convex hull of all images of each identity. These convex hulls are expanded by a margin for visibility. The network separates identities accurately. In panels a and d, the space is divided into male and female sections. In panels b and e, illumination conditions subdivide within identity groupings. In panels c and f, the viewpoint varies sequentially within illumination clusters. Dotted-line boxes in panels a–c show areas enlarged in panels d–g. Figure adapted with permission from Hill et al. (2019).

DCNN face representation: output vector produced for a face image processed through a deep network trained for faces.

All results from Hill et al. (2019) were replicated using two networks with starkly different architectures. The first, developed by Ranjan et al. (2019) , was based on a ResNet-101 with 101 layers and skip connections; the second, developed by Chen et al. (2016) , had 15 convolution and pooling layers, a dropout layer, and one fully connected top layer. As measured using the brain-similarity metrics developed in Brain-Score ( Schrimpf et al. 2018 ), one of these architectures (ResNet-101) was the third most brain-like of the 25 networks tested. The ResNet-101 network scored well on both neural (V4 and IT cortex) and behavioral predictability for object recognition. Hill et al.’s (2019) replication of this face space using a shallower network ( Chen et al. 2016 ), however, suggests that network architecture may be less important than computational capacity in understanding high-level visual codes for faces (see Section 3.2 ).

Brain-Score: neural and behavioral benchmarks that score an artificial neural network on its similarity to brain mechanisms for object recognition.

Returning to the issue of human-like view invariance in a DCNN, Abudarham & Yovel (2020) compared the similarity of face representations computed within and across identities and viewpoints. Consistent with view-invariant performance, same-identity, different-view face pairs were more similar than different-identity, same-view face pairs. Consistent with a noninvariant face representation, correlations between similarity scores across head view decreased monotonically with increasing view disparity. These results support the characterization of DCNN codes as being functionally view invariant but with a view-specific code. Notably, earlier layers in the network showed view specificity, whereas higher layers showed view invariance.
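
The comparison logic is easy to state in code. The sketch below, with hypothetical descriptors standing in for network outputs, computes the two pair types that Abudarham & Yovel (2020) contrasted; functional view invariance predicts a higher mean similarity for the first.

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical descriptors: emb[identity][view] -> 512-D vector.
rng = np.random.default_rng(2)
emb = {i: {v: rng.normal(size=512) for v in ("frontal", "profile")}
       for i in range(50)}

# Same identity, different view.
same_id_diff_view = [cosine(emb[i]["frontal"], emb[i]["profile"])
                     for i in range(50)]
# Different identity, same view.
diff_id_same_view = [cosine(emb[i]["frontal"], emb[j]["frontal"])
                     for i in range(50) for j in range(i + 1, 50)]

print(np.mean(same_id_diff_view), np.mean(diff_id_same_view))
```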

It is worth digressing briefly to consider invariance in the context of neural approaches to face processing. An underlying assumption of neural approaches is that "a major purpose of the face patches is thus to construct a representation of individual identity invariant to view direction" (Hesse & Tsao 2020, p. 703). Ideas about how this is accomplished have evolved. Freiwald & Tsao (2010) posited the progressive computation of invariance via the pooling of neurons across face patches, as follows. In early patches, a neuron responds to a specific identity from specific views; in middle face patches, greater invariance is achieved by pooling the responses of mirror-symmetric views of an identity; in later face patches, each neuron pools inputs representing all views of the same individual to create a fully view-invariant representation. More recently, Chang & Tsao (2017) proposed that the brain computes a view-invariant face code using shape and appearance parameters analogous to those used in a computer graphics model of face synthesis (Cootes et al. 1995) (see the sidebar titled Neurons, Neural Tuning, Population Codes, Features, and Perceptual Constancy). This code retains information about the face, but not about the particular image viewed.

NEURONS, NEURAL TUNING, POPULATION CODES, FEATURES, AND PERCEPTUAL CONSTANCY

Barlow (1972 , p. 371) wrote, “Results obtained by recording from single neurons in sensory pathways…obviously tell us something important about how we sense the world around us; but what exactly have we been told?” In answer, Barlow (1972 , p. 371) proposed that “our perceptions are caused by the activity of a rather small number of neurons selected from a very large population of predominantly silent cells. The activity of each single cell is thus an important perceptual event and it is thought to be related quite simply to our subjective experience.” Although this proposal is sometimes caricatured as the grandmother cell doctrine (see also Gross 2002 ), Barlow simply asserts that single-unit activity can be interpreted in perceptual terms, and that the responses of small numbers of units, in combination, underlie subjective perceptual experience. This proposal reflects ideas gleaned from studies of early visual areas that have been translated, at least in part, to studies of high-level vision.

Over the past decade, single neurons in face patches have been characterized as selective for facial features (e.g., aspect ratio, hair length, eyebrow height) ( Freiwald et al. 2009 ), face viewpoint and identity ( Freiwald & Tsao 2010 ), eyes ( Issa & DiCarlo 2012 ), and shape or appearance parameters from an active appearance model of facial synthesis ( Chang & Tsao 2017 ). Neurophysiological studies of face and object processing also employ techniques aimed at understanding neural population codes. Using the pattern of neural responses in a population of neurons (e.g., IT), linear classifiers are used often to predict subjective percepts (commonly defined as the image viewed). For example, Chang & Tsao (2017) showed that face images viewed by a macaque could be reconstructed using a linear combination of the activity of just 205 face cells in face patches ML–MF and AM. This classifier provides a real neural network model of the face-selective cortex that can be interpreted in simple terms.

Population code models generated from real neural data (a few hundred units), however, differ substantially in scale from the face- and object-selective cortical regions that they model (1 mm 3 of the cerebral cortex contains approximately 50,000 neurons and 300 million adjustable parameters; Azevedo et al. 2009 , Kandel et al. 2000 , Hasson et al. 2020 ). This difference in scale is at the core of a tension between model interpretability and real-world task generalizability ( Hasson et al. 2020 ). It also creates tension between the neural coding hypotheses suggested by deep learning and the limitations of current neuroscience techniques for testing these hypotheses. To model neural function, an electrode gives access to single neurons and (with multi-unit recordings) to relatively small numbers of neurons (a few hundred). Neurocomputational theory based on direct fit models posits that overparameterization (i.e., the extremely high number of parameters available for neural computation) is critical to the brain’s solution to real-world problems (see Section 3.2 ). Bridging the gap between the computational and neural scale of these perspectives remains an ongoing challenge for the field.

Deep networks suggest an alternative that is largely consistent with neurophysiological data but interprets the data in a different light. Neurocomputational theory posits that the ventral visual system untangles face identity information from image parameters (DiCarlo & Cox 2007). The idea is that visual processing starts in the image domain, where identity and viewpoint information are entangled. With successive levels of neural processing, manifolds corresponding to individual identities are untangled from image variation. This creates a representational space where identities can be separated with hyperplanes. Image information is not lost, but rather, is rearranged (for object recognition results, see Hong et al. 2016). The retention of image and identity information in DCNN face representations is consistent with this theory. It is also consistent with basic neuroscience findings indicating the emergence of a representation dominated by identity that retains sensitivity to image features (see Section 2.2).

2.1.2. Appearance and demographics.

Faces can be described using what computer vision researchers have called attributes or soft biometrics (hairstyle, hair color, facial hair, and accessories such as makeup and glasses). The definition of attributes in the computational literature is vague and can include demographics (e.g., gender, age, race) and even facial expression. Identity codes from deep networks retain a wide variety of face attributes. For example, Terhörst et al. (2020) built a massive attribute classifier (MAC) to test whether 113 attributes could be predicted from the face representations produced by deep networks [ArcFace ( Deng et al. 2019 ) or FaceNet ( Schroff et al. 2015 )] for images from in-the-wild data sets ( Huang et al. 2008 , Liu et al. 2015 ). The MAC learned to map from DCNN-generated face representations to attribute labels. Cross-validated results showed that 39 of the attributes were easily predictable, and 74 of the 113 were predictable at reliable levels. Hairstyle, hair color, beard, and accessories were predicted easily. Attributes such as face geometry (e.g., round), periocular characteristics (e.g., arched eyebrows), and nose were moderately predictable. Skin and mouth attributes were not well predicted.
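
The logic of an attribute read-out is straightforward; the sketch below trains a small classification head on frozen identity descriptors, in the spirit of the MAC, with hypothetical data and a single binary attribute standing in for the 113 studied.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

# Hypothetical data: frozen 512-D face descriptors and one binary
# attribute label per image (e.g., "wears glasses").
rng = np.random.default_rng(3)
X = rng.normal(size=(5000, 512))
y = rng.integers(0, 2, size=5000)

# The identity network is never retrained; only this read-out head
# learns the mapping from identity codes to the attribute.
head = MLPClassifier(hidden_layer_sizes=(128,), max_iter=200, random_state=0)
head.fit(X[:4000], y[:4000])
print("held-out accuracy:", head.score(X[4000:], y[4000:]))
```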

The continuous shuffling of identity, attribute, and image information across layers of the network was demonstrated by Dhar et al. (2020). They tracked the expressivity of attributes (identity, sex, age, pose) across layers of a deep network. Expressivity was defined as the degree to which a feature vector, from any given layer of a network, specified an attribute. Dhar et al. (2020) computed expressivity using a second neural network that estimated the mutual information between attributes and DCNN features. Expressivity order in the final fully connected layer of both networks (ResNet-101 and Inception ResNet v2; Ranjan et al. 2019) indicated that identity was most expressed, followed by age, sex, and yaw. Identity expressivity increased dramatically from the final pooling layer to the last fully connected layer. This echoes the progressive increase in the detectability of view-invariant face identity representations seen across face patches in the macaque (Freiwald & Tsao 2010). It also raises the computational possibility of undetected viewpoint sensitivity in these neurons (see Section 3.1).

Mutual information: a statistical term from information theory that quantifies the codependence of information between two random variables.
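
Dhar et al. (2020) estimated mutual information with a second, learned network; as a rough, non-equivalent stand-in, the sketch below scores per-dimension mutual information between one layer's features and a categorical attribute using an off-the-shelf estimator, on hypothetical data.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

# Hypothetical features from one network layer, plus a categorical
# attribute (e.g., sex) for each image.
rng = np.random.default_rng(4)
layer_features = rng.normal(size=(2000, 128))
attribute = rng.integers(0, 2, size=2000)

# Crude expressivity proxy: average per-dimension mutual information
# between the layer's features and the attribute.
mi_per_dim = mutual_info_classif(layer_features, attribute, random_state=0)
print("expressivity proxy:", mi_per_dim.mean())
```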

2.1.3. Social traits.

People make consistent (albeit invalid) inferences about a person’s social traits based on their face ( Todorov 2017 ). These judgments have profound consequences. For example, competence judgments about faces predict election success at levels far above chance ( Todorov et al. 2005 ). The physical structure of the face supports these trait inferences ( Oosterhof & Todorov 2008 , Walker & Vetter 2009 ), and thus it is not surprising that deep networks retain this information. Using face representations produced by a network trained for face identification ( Sankaranarayanan et al. 2016 ), 11 traits (e.g., shy, warm, impulsive, artistic, lazy), rated by human participants, were predicted at levels well above chance ( Parde et al. 2019 ). Song et al. (2017) found that more than half of 40 attributes were predicted accurately by a network trained for object recognition (VGG-16; Simonyan & Zisserman 2014 ). Human and machine trait ratings were highly correlated.

Other studies show that deep networks can be optimized to predict traits from images. Lewenberg et al. (2016) crowd-sourced large numbers of objective (e.g., hair color) and subjective (e.g., attractiveness) attribute ratings from faces. DCNNs were trained to classify images for the presence or absence of each attribute. They found highly accurate classification for the objective attributes and somewhat less accurate classification for the subjective attributes. McCurrie et al. (2017) trained a DCNN to classify faces according to trustworthiness, dominance, and IQ. They found significant accord with human ratings, with higher agreement for trustworthiness and dominance than for IQ.

2.1.4. Facial expressions.

Facial expressions are also detectable in face representations produced by identity-trained deep networks. Colón et al. (2021) found that expression classification was well above chance for face representations of images from the Karolinska data set ( Lundqvist et al. 1998 ), which includes seven facial expressions (happy, sad, angry, surprised, fearful, disgusted, neutral) seen from five viewpoints (frontal and 90- and 45-degree left and right profiles). Consistent with human data, happiness was classified most accurately, followed by surprise, disgust, anger, neutral, sadness, and fear. Notably, accuracy did not vary across viewpoint. Visualization of the identities in the emergent face space showed a structured ordering of similarity in which viewpoint dominated over expression.

2.2. Functional Invariance, Useful Variability

The emergent code from identity-trained DCNNs can be used to recognize faces robustly, but it also retains extraneous information that is of limited, or no, value for identification. Although demographic and trait information offers weak hints to identity, image characteristics and facial expression are not useful for identification. Attributes such as glasses, hairstyle, and facial hair are, at best, weak identity cues and, at worst, misleading cues that will not remain constant over extended time periods. In purely computational terms, the variability of face representations for different images of an identity can lead to errors. Although this is problematic in security applications, coincidental features and attributes can be diagnostic enough to support acceptably accurate identification performance in day-to-day face recognition ( Yovel & O’Toole 2016 ). (For related arguments based on adversarial images for object recognition, see Ilyas et al. 2019 , Xie et al. 2020 , Yuan et al. 2020 .) A less-than-perfect identification system in computational terms, however, can be a surprisingly efficient, multipurpose face processing system that supports identification and the detection of visually derived semantic information [called attributes by Bruce & Young (1986) ].

What do we learn from these studies that can be useful in understanding human visual processing of faces? First, we learn that it is computationally feasible to accommodate diverse information about faces (identity, demographics, visually derived semantic information), images (viewpoint, illumination, quality), and emotions (expression) in a unified representation. Furthermore, this diverse information can be accessed selectively from the representation. Thus, identity, image parameters, and attributes are all untangled when learning prioritizes the difficult within-category discrimination problem of face identification.

Second, we learn that to understand high-level visual representations for faces, we need to think in terms of categorical codes unbound from a spatial frame of reference. Although remnants of retinotopy and image characteristics remain in high-level visual areas (e.g., Grill-Spector et al. 1999 , Kay et al. 2015 , Kietzmann et al. 2012 , Natu et al. 2010 , Yue et al. 2010 ), the expressivity of spatial layout weakens dramatically from early visual areas to categorically structured areas in the IT cortex. Categorical face representations should capture what cognitive and perceptual psychologists call facial features (e.g., face shape, eye color). Indeed, altering these types of features in a face affects identity perception similarly for humans and deep networks ( Abudarham et al. 2019 ). However, neurocomputational theory suggests that finding these features in the neural code will likely require rethinking the interpretation of neural tuning and population coding (see Section 3.2 ).

Third, if the ventral stream untangles information across layers of computations, then we should expect traces of identity, image data, and attributes at many, if not all, neural network layers. These may variously dominate the strength of the neural signal at different layers (see Section 3.1 ). Thus, various layers in the network will likely succeed in predicting several types of information about the face and/or image, though with differing accuracy. For now, we should not ascribe too much importance to findings about which specific layer(s) of a particular network predict specific attributes. Instead, we should pay attention to the pattern of prediction accuracy across layers. We would expect the following pattern. Clearly, for the optimized attribute (identity), the output offers the clearest access. For subject-related attributes (e.g., demographics), this may also be the case. For image-related attributes, we would expect every layer in the network to retain some degree of prediction ability. Exactly how, where, and whether the neural system makes use of these attributes for specific tasks remain open questions.

3. RETHINKING VISUAL FEATURES: IMPLICATIONS FOR NEURAL CODES

Deep learning models force us to rethink the definition and interpretation of facial features in high-level representations. Theoretical ideas about the brain’s solution to complex real-world tasks such as face recognition must be reconciled at the level of neural units and representational spaces. Deep learning models can be used to test hypotheses about how faces are stored in the high-dimensional representational space defined by the pattern of responses of large numbers of neurons.

3.1. Units Confound Information that Separates in the Representation Space

Insight into interpreting facial features comes from deep network simulations aimed at understanding the relationship between unit responses and the information retained in the face representation. Parde et al. (2021) compared identification, gender classification, and viewpoint estimation in subspaces of a DCNN face space. Using an identity-trained network capable of all three tasks, they tested performance on the tasks using randomly sampled subsets of output units. Beginning at full dimensionality (512 units) and progressively decreasing sample size, they found no notable decline in identification accuracy for more than 3,000 in-the-wild faces until the sample size reached 16 randomly chosen units (3% of full dimensionality). Correlations between unit responses across representations were near zero, indicating that individual units captured nonredundant identity cues. Statistical power for identification (i.e., separating identities) was uniformly high for all output units, demonstrating that units used their entire response range to separate identities. A unit firing at its maximum provided no more, and no less, information than any other response value. This distinction may seem trivial, but it is not. The data suggest that every output unit acts to separate identities to the maximum degree possible. As such, all units participate in coding all identities. In information theory terms, this is an ideal use of neural resources.
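
A sketch of this subsampling logic follows, with hypothetical descriptors in place of real network outputs (for real identity codes, accuracy stays high down to very small unit samples; with random vectors it hovers near chance throughout). Identification is scored as how well cosine similarity separates same-identity from different-identity image pairs.

```python
import numpy as np

rng = np.random.default_rng(5)

# Hypothetical 512-D descriptors for 3,000 images of 100 identities.
emb = rng.normal(size=(3000, 512))
labels = rng.integers(0, 100, size=3000)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def identification_score(emb, labels, n_per_type=200):
    """Fraction of (same-identity, different-identity) score pairs that
    are correctly ordered -- a pairwise AUC."""
    same, diff = [], []
    while len(same) < n_per_type or len(diff) < n_per_type:
        i, j = rng.integers(0, len(emb), size=2)
        if i == j:
            continue
        s = cosine(emb[i], emb[j])
        (same if labels[i] == labels[j] else diff).append(s)
    same = np.array(same[:n_per_type])
    diff = np.array(diff[:n_per_type])
    return (same[:, None] > diff[None, :]).mean()

# Progressively smaller random subsets of output units.
for k in (512, 128, 32, 16, 4):
    units = rng.choice(512, size=k, replace=False)
    print(k, "units:", identification_score(emb[:, units], labels))
```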

For gender classification and viewpoint estimation, performance declined at a much faster rate than for identification as units were deleted ( Parde et al. 2021 ). Statistical power for predicting gender and viewpoint was strong in the distributed code but weak at the level of the unit. Prediction power for these attributes was again roughly equivalent for all units. Thus, individual units contributed to coding all three attributes, but identity modulated individual unit responses far more strongly than did gender or viewpoint. Notably, a principal component (PC) analysis of representations in the full-dimensional space revealed subspaces aligned with identity, gender, and viewpoint ( Figure 3 ). Consistent with the strength of the categorical identity code in the representation, identity information dominated PCs explaining large amounts of variance, gender dominated the middle range of PCs, and viewpoint dominated PCs explaining small amounts of variation.

Figure 3: Illustration of the separation of task-relevant information into subspaces for an identity-trained deep convolutional neural network (DCNN). Each plot shows the similarity (cosine) between principal components (PCs) of the face space and directional vectors in the space that are diagnostic of identity (top), gender (middle), and viewpoint (bottom). Figure adapted with permission from Parde et al. (2021).
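
The subspace analysis in Figure 3 can be sketched compactly: compute the face-space PCs, define a direction diagnostic of an attribute (here, simply the unit vector joining the two class means of a hypothetical binary gender label), and measure the cosine between each PC and that direction.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical face descriptors and binary gender labels.
rng = np.random.default_rng(6)
emb = rng.normal(size=(2000, 512))
gender = rng.integers(0, 2, size=2000)

# A simple diagnostic direction: the vector joining the class means.
direction = emb[gender == 1].mean(axis=0) - emb[gender == 0].mean(axis=0)
direction /= np.linalg.norm(direction)

# PCA components are unit vectors, so the dot product is the cosine.
pcs = PCA(n_components=100).fit(emb).components_
cosines = np.abs(pcs @ direction)
print("PC most aligned with the gender direction:", cosines.argmax())
```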

The emergence and effectiveness of these codes in DCNNs suggest that caution is needed in ascribing significance only to stimuli that drive a neuron to high rates of response. Small-scale modulations of neural responses can also be meaningful. Let us consider a concrete example. A neurophysiologist probing the network used by Parde et al. (2021) would find some neurons that respond strongly to a few identities. Interpreting this as identity tuning, however, would be an incorrect characterization of a code in which all units participate in coding all identities. Concomitantly, few units in the network would appear responsive to viewpoint or gender variations because unit firing rates would modulate only slightly with changes in viewpoint or gender. Thus, the distributed coding of view and gender across units would likely be missed. The finding that neurons in macaque face patch AM respond selectively (i.e., with high response rates) to identity over variable views ( Freiwald & Tsao 2010 ) is consistent with DCNN face representations. It is possible, however, that these units also encode other face and image attributes, but with differential degrees of expressivity. This would be computationally consistent with the untangling theory and with DCNN codes.

Macaque face patches: regions of the macaque cortex that respond selectively to faces, including the posterior lateral (PL), middle lateral (ML), middle fundus (MF), anterior lateral (AL), anterior fundus (AF), and anterior medial (AM).

Another example comes from the use of generative adversarial networks and related techniques to characterize the response properties of single (or multiple) neuron(s) in the primate visual cortex ( Bashivan et al. 2019 , Ponce et al. 2019 , Yuan et al. 2020 ). These techniques have examined neurons in areas V4 ( Bashivan et al. 2019 ) and IT ( Ponce et al. 2019 , Yuan et al. 2020 ). The goal is to progressively evolve images that drive neurons to their maximum response or that selectively (in)activate subsets of neurons. Evolved images show complex mosaics of textures, shapes, and colors. They sometimes show animals or people and sometimes reveal spatial patterns that are not semantically interpretable. However, these techniques rely on two strong assumptions. First, they assume that a neuron’s response can be characterized completely in terms of the stimuli that activate it maximally, thereby discounting other response rates as noninformative. The computational utility of a unit’s full response range in DCNNs suggests that reconsideration of this assumption is necessary. Second, these techniques assume that a neuron’s response properties can be visualized accurately as a two-dimensional image. Given the categorical, nonretinotopic nature of representations in high-level visual areas, this seems problematic. If the representation under consideration is not in the image or pixel domain, then image-based visualization may offer limited, and possibly misleading, insight into the underlying nature of the code.

3.2. Direct-Fit Models and Deep Learning

In rethinking visual features at a theoretical level, direct-fit models of neural coding appear to best explain deep learning findings in multiple domains (e.g., face recognition, language) (Hasson et al. 2020). These models posit that neural computation fits densely sampled data from the environment. Implementation is accomplished using "overparameterized optimization algorithms that increase predictive (generalization) power, without explicitly modeling the underlying generative structure of the world" (Hasson et al. 2020, p. 418). Hasson et al. (2020) begin with an ideal model in a small-parameter space (Figure 4). When the underlying structure of the world is simple, a small-parameter model will find the underlying generative function, thereby supporting generalization via interpolation and extrapolation. Despite decades of effort, small-parameter functions have not solved real-world face recognition with performance anywhere near that of humans.

Figure 4: (a) A model with too few parameters fails to fit the data. (b) The ideal-fit model fits with a small number of parameters and has generative power that supports interpolation and extrapolation. (c) An overfit function can model noise in the training data. (d) An overparameterized model generalizes well to new stimuli within the scope of the training samples. Figure adapted with permission from Hasson et al. (2020).

When the underlying structure of the world is complex and multivariate, direct-fit models offer an alternative to models based on small-parameter functions. With densely sampled real-world training data, each new observation can be placed in the context of past experience. More formally, direct-fit models solve the problem of generalization to new exemplars by experience-scaffolded interpolation ( Hasson et al. 2020 ). This produces face recognition performance in the range of that of humans. A fundamental element of the success of deep networks is that they model the environment with big data, which can be structured in overparameterized spaces. The scale of the parameterization and the requirement to operate on real-world data are pivotal. Once the network is sufficiently parameterized to fit the data, the exact details of its architecture are not important. This may explain why starkly different network architectures arrive at similarly structured representations ( Hill et al. 2019 , Parde et al. 2017 , Storrs et al. 2020 ).
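
The interpolation-versus-extrapolation point can be made with a toy, one-dimensional "world": a small-parameter model misses the structure, while a direct fit over densely sampled experience interpolates well inside the training range and fails outside it. Everything below is an illustrative construction, not a model from the literature.

```python
import numpy as np

rng = np.random.default_rng(7)

# Densely sampled "experience" of a complex 1-D world.
x_train = np.sort(rng.uniform(-3.0, 3.0, size=500))
y_train = np.sin(3.0 * x_train) + 0.3 * x_train**2 + rng.normal(0, 0.05, 500)

# Small-parameter model: a straight line cannot capture the structure.
slope, intercept = np.polyfit(x_train, y_train, deg=1)

def direct_fit(x_new, k=5):
    """Overparameterized, direct-fit stand-in: average the k nearest
    training samples (interpolation scaffolded by dense experience)."""
    idx = np.argsort(np.abs(x_train - x_new))[:k]
    return y_train[idx].mean()

# Inside the training range, interpolation tracks the true function;
# outside it, the direct fit just echoes the nearest edge samples.
print("interpolate at x=0.5:", direct_fit(0.5))
print("extrapolate at x=10 :", direct_fit(10.0))
```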

Returning to the issue of features, in neurocomputational terms, the strength of connectivity between neurons at synapses is the primary locus of information, just as weights between units in a deep network comprise information. We expect features, whatever they are, to be housed in the combination of connection strengths among units, not in the units themselves. In a high-dimensional multivariate encoding space, they are hyperplane directions through the space. Thus, features are represented across many computing elements, and each computing element participates in encoding many features ( Hasson et al. 2020 , Parde et al. 2021 ). If features are directions in a high-dimensional coding space ( Goodfellow et al. 2014 ), then units act as an arbitrary projection surface from which this information can be accessed—albeit in a nontransparent form.

A downside of direct-fit models is that they cannot generalize via extrapolation. The other-race effect is an example of how face recognition may fail due to limited experience (Malpass & Kravitz 1969) (see Section 4.3.2). The extrapolation limit may be countered, however, by the capacity of direct-fit models to acquire expertise within the confines of experience. For example, in human perception, category experience selectively structures representations as new exemplars are learned. Collins & Behrmann (2020) show that this occurs in a way that reflects humans' greater experience with faces relative to computer-generated objects from a novel made-up category, which the authors call YUFOs. They tracked the perceived similarity of pairs of other-race faces and YUFOs as people learned novel exemplars of each. Experience changed perceived similarities more selectively for faces than for YUFOs, enabling more nuanced discrimination of exemplars from the experienced category of faces.

In summary, direct-fit models offer a framework for thinking about high-level visual codes for faces in a way that unifies disparate data on single units and high-dimensional coding spaces. These models are fueled by the rich experience that we (models) gain from learning (training on) real-world data. They solve complex visual tasks with interpolated solutions that elude transparent semantic interpretation.

4. RETHINKING LEARNING IN HUMANS AND DEEP NETWORKS

Deep network models of human face processing force us to consider learning as a complex and diverse set of mechanisms that can overlap, accumulate over time, and interact. Learning in both humans and artificial neural networks can refer to qualitatively different phenomena. In both cases, learning involves multiple steps. For DCNNs, these steps are fundamental to a network’s ability to recognize faces across image and appearance variation. Human visual learning is likewise diverse and unfolds across the developmental lifespan in a process governed by genetics and environmental input ( Goodman & Shatz 1993 ). The stepwise implementation of learning is one way that DCNNs differ from previous face recognition networks. Considered as manipulable modeling tools, the learning steps in DCNNs force us to think in concrete and nuanced ways about how humans learn faces.

In this section, we outline the learning layers in human face processing ( Section 4.1 ), introduce the layers of learning used in training machines ( Section 4.2 ), and consider the relationship between the two in the context of human behavior ( Section 4.3.1 ). The human learning layers support a complex, biologically realized face processing system. The machine learning layers can be thought of as building blocks that can be combined in a variety of ways to model human behavioral phenomena. At the outset, we note that machine learning is designed to maximize performance—not to model the development of the human face processing system ( Smith & Slone 2017 ). Concomitantly, the sequential presentation of training data in DCNNs differs from the pattern of exposure that infants and young children have with faces and objects ( Jayaraman et al. 2015 ). The machine learning steps, however, can be modified to model human learning more closely. In practical terms, fully trained DCNNs, available on the web, are used (almost exclusively) to model human neural systems (see the sidebar titled Caveat: Iteration Between Theory and Practice ). It is important, therefore, to understand how (and why) these models are configured as they are and to understand the types of learning tools available for modeling human face processing. These steps may provide computational grounding for basic learning mechanisms hypothesized in humans.

4.1. Human Learning for Face Processing

To model human face processing, researchers need to consider the following types of learning. The most specific form of learning is familiar face recognition. People learn the faces of specific familiar individuals (e.g., friends, family, celebrities). Familiar faces are recognized robustly over challenging changes in appearance and image characteristics. The second-most specific is local population tuning. People recognize own-race faces more accurately than other-race faces, a phenomenon referred to as the other-race effect (e.g., Malpass & Kravitz 1969). This likely results from tuning to the statistical properties of the faces that we see most frequently—typically faces of our own race. The third-most specific is unfamiliar face recognition. People can differentiate unfamiliar faces perceptually. Unfamiliar refers to faces that a person has not encountered previously or has encountered infrequently. Unfamiliar face recognition is less robust to image and appearance change than is familiar face recognition. The least specific form of learning is object recognition. At a fundamental level of analysis, faces are objects, and both share early visual processing wetware.

4.2. How Deep Convolutional Neural Networks Learn Face Identification

Training DCNNs for face recognition involves a sequence of learning stages, each with a concrete objective. Unlike human learning, machine learning stages are executed in strict sequence. The goal across all stages of training is to build an effective method for converting images of faces into points in a high-dimensional space. The resulting high-dimensional space allows for easy comparison among faces, search, and clustering. In this section, we sketch out the engineering approach to learning, working forward from the most general to the most specific form of learning. This follows the implementation order used by engineers.

4.2.1. Object classification (between-category learning): Stage 1.

Deep networks for face identification are commonly built on top of DCNNs that have been pretrained for object classification. Pretraining is carried out using large data sets of objects, such as those available in ImageNet ( Russakovsky et al. 2015 ), which contains more than 14 million images of over 1,000 classes of objects (e.g., volcanoes, cups, chihuahuas). The object categorization training procedure involves adjusting the weights on all layers of the network. For training to converge, a large training set is required. The loss function optimized in this procedure typically uses the well-understood cross-entropy loss + Softmax combination. Most practitioners do not execute this step because it has been performed already in a pretrained model downloaded from a public repository in a format compatible with DCNN software libraries [e.g., PyTorch ( Paszke et al. 2019 ), TensorFlow ( Abadi et al. 2016 )]. Networks trained for object recognition have proven better for face identification than networks that start with a random configuration ( Liu et al. 2015 , Yi et al. 2014 ).
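
In code, Stage 1 usually amounts to downloading a pretrained backbone rather than training one. A minimal PyTorch sketch, assuming a torchvision ResNet-50 as the (illustrative) object-classification network:

```python
import torchvision.models as models

# Stage 1 in practice: start from a backbone already trained for
# ImageNet object classification instead of a random configuration.
backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)

# The 1,000-way object-category head lives in backbone.fc; the layers
# beneath it are the generic visual front end that Stage 2 will reuse.
print(backbone.fc)  # Linear(in_features=2048, out_features=1000)
```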

4.2.2. Face recognition (within-category learning): Stage 2.

Face recognition training is implemented in a second stage. First, the last fully connected layer from Stage 1, which connects to the object-category nodes (e.g., volcanoes, cups), is removed. In its place, a new fully connected layer is attached that maps to the face identities available for training. Depending on the size of the face training set, the weights of either all layers or all but a few early layers of the network are updated; the former is common when very large numbers of face identities are available. In academic laboratories, data sets include 5–10 million face images of 40,000–100,000 identities. In industry, far larger data sets are often used (Schroff et al. 2015). A technical difficulty in converting an object classification network into a face recognition network is the large increase in the number of categories involved (approximately 1,000 objects versus 50,000+ faces). Special loss functions can address this issue [e.g., L2-Softmax/crystal loss (Ranjan et al. 2017), NormFace (Wang et al. 2017), angular Softmax (Li et al. 2018), additive Softmax (Wang et al. 2018), additive angular margins (Deng et al. 2019)].
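A hedged sketch of this surgery and of the L2-Softmax (crystal loss) idea follows: the object-category layer is removed, a new identity layer is attached, and features are L2-normalized and rescaled before classification with cross-entropy. The identity count and scale parameter here are illustrative assumptions.

```python
# Sketch of Stage 2: replace the object head with a face-identity head and
# apply an L2 constraint on features (after Ranjan et al. 2017, L2-Softmax).
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision.models as models

class FaceIdNet(nn.Module):
    def __init__(self, num_identities: int = 50_000, scale: float = 40.0):
        super().__init__()
        backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
        backbone.fc = nn.Identity()            # remove object-category layer
        self.backbone = backbone               # now yields a 2048-d feature
        self.scale = scale                     # illustrative crystal-loss scale
        self.id_layer = nn.Linear(2048, num_identities)  # new identity layer

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        features = self.backbone(images)
        features = self.scale * F.normalize(features, dim=1)  # L2 constraint
        return self.id_layer(features)         # train with cross-entropy

model = FaceIdNet()
logits = model(torch.randn(2, 3, 224, 224))    # two dummy face crops
```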

When Stage 2 face training is complete, the last fully connected layer, which connects to the 50,000+ face identity nodes, is removed, leaving below it a relatively low-dimensional (128- to 5,000-unit) layer of output units. This layer can be thought of as the face representation. Critically, this output represents a face image, not a face identity. At this point in training, any arbitrary face image from any identity (known or unknown to the network) can be processed by the DCNN to produce a compact face image descriptor across the units of this layer. If the network functioned perfectly, it would produce identical codes for all images of the same person, amounting to perfect image and appearance generalization. This is not usually achieved, even when the network is highly accurate (see Section 2).
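The sketch below illustrates descriptor extraction at this point. An untrained backbone stands in for a Stage 2-trained network, and the layer dimensions are assumptions.

```python
# Sketch: after Stage 2, the identity layer is discarded and any face image
# yields a compact descriptor at the (now final) representation layer.
import torch
import torch.nn as nn
import torchvision.models as models

backbone = models.resnet50(weights=None)   # stand-in for a Stage 2-trained net
backbone.fc = nn.Identity()                # identity layer removed
backbone.eval()

with torch.no_grad():
    image_a = torch.randn(1, 3, 224, 224)  # any face image, known or unknown
    image_b = torch.randn(1, 3, 224, 224)  # a second image of the same person
    desc_a = backbone(image_a)             # 2048-d face image descriptor
    desc_b = backbone(image_b)             # equals desc_a only under perfect
                                           # image generalization (it will not)
```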

In this state, the network is commonly employed to recognize faces not seen in training (unfamiliar faces). Stage 2 training supports a surprising degree of generalization (e.g., pose, expression, illumination, and appearance) for images of unfamiliar faces. This general face learning gives the system special knowledge of faces and enables it to perform within-category face discrimination for unfamiliar faces ( O’Toole et al. 2018 ). With or without Stage 3 training, the network is now capable of converting images of faces into points in a high-dimensional space, which, as noted above, is the primary goal of training. In practice, however, Stages 3 and 4 can provide a critical bridge to modeling behavioral characteristics of the human face processing system.

4.2.3. Adapting to local statistics of people and visual environments: Stage 3.

The objective of Stage 3 training is to finalize the modification of the DCNN weights to better adapt to the application domain. The term application domain can refer to faces from a particular race or ethnicity or, as it is commonly used in industry, to the type of images to be processed (e.g., in-the-wild faces, passport photographs). This training is a crucial step in many applications because there will be no further transformation of the weights. Special care is needed in this training to avoid collapsing the representation into a form that is too specific. Training at this stage can improve performance for some faces and decrease it for others.

Whereas Stages 1 and 2 are used in the vast majority of published computational work, in Stage 3, researchers diverge. Although there is no standard implementation for this training, fine-tuning and learning a triplet loss embedding (van der Maaten & Weinberger 2012) are common methods. These methods are conceptually similar but differ in implementation. In both, (a) new layers are added to the network, (b) specific subsets of layers are frozen or unfrozen, and (c) optimization continues with an appropriate loss function on a new data set with the desired domain characteristics. Fine-tuning starts from an already-viable network state and updates some or all of the weights. It is typically implemented with smaller learning rates and can use smaller training sets than those needed for full training. Triplet loss is implemented by freezing all layers and adding a new, fully connected layer; minimization is then done with the triplet loss, again on a new (smaller) data set with the desired domain characteristics.
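The triplet-loss route can be sketched as follows: with the backbone frozen, a new fully connected embedding layer is trained so that an anchor descriptor moves closer to a positive example (same identity) than to a negative example (different identity). The dimensions, margin, and learning rate here are illustrative assumptions.

```python
# Sketch of Stage 3 via a triplet-loss embedding on frozen backbone features.
import torch
import torch.nn as nn

embed = nn.Linear(2048, 128)                   # new embedding layer
loss_fn = nn.TripletMarginLoss(margin=0.2)     # margin is illustrative
optimizer = torch.optim.SGD(embed.parameters(), lr=0.01)

# Backbone descriptors for (anchor, positive, negative) triplets drawn from
# a small data set with the desired domain characteristics.
anchor, positive, negative = (torch.randn(32, 2048) for _ in range(3))

loss = loss_fn(embed(anchor), embed(positive), embed(negative))
loss.backward()                                # only the new layer updates
optimizer.step()
```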

A natural question is why Stage 2 (general face training) is not considered fine-tuning. The answer, in practice, comes down to viability and volume. When the training for Stage 2 starts, the network is not in a viable state to perform face recognition. Therefore, it requires a voluminous, diverse data set to function. Stage 3 begins with a functional network and can be tuned effectively with a small targeted data set.

Together, Stages 1–3 give the network a face knowledge history that provides a tool for adapting to local face statistics (e.g., race) (O’Toole et al. 2018).

4.2.4. Learning individual people: Stage 4.

In psychological terms, learning individual familiar faces involves seeing multiple, diverse images of the individuals to whom the faces belong. As we see more images of a person, we become more familiar with their face and can recognize it from increasingly variable images ( Dowsett et al. 2016 , Murphy et al. 2015 , Ritchie & Burton 2017 ). In computational terms, this translates into the question of how a network can learn to recognize a random set of special (familiar) faces with greater accuracy and robustness than other nonspecial (unfamiliar) faces—assuming, of course, the availability of multiple, variable images of the special faces. This stage of learning is defined, in nearly all cases, outside of the DCNN, with no change to weights within the DCNN.

The problem is as follows. The network starts with multiple images of each familiar identity and can produce a representation for each image, but what then? There is no standard familiarization protocol, but several approaches exist. We categorize these approaches first (a minimal code sketch of two of them appears at the end of this section) and link them to theoretical accounts of face familiarity in Section 4.3.3.

The first approach is averaging identity codes, or 1-class learning. It is common in machine learning to use an average (or weighted average) of the DCNN-generated face image representations as an identity code (see also Crosswhite et al. 2018 , Su et al. 2015 ). Averaging creates a person-identity prototype ( Noyes et al. 2021 ) for each familiar face.

The second is individual face contrast, or 2-class learning. This technique directly learns individual identities by contrasting each with all other identities. There are two classes because the model learns what makes each identity (positive class) different from all other identities (negative class). The distinctiveness of each familiar face is thereby enhanced relative to all other known faces (e.g., Noyes et al. 2021).

The third is multiple face contrast, or K-class learning. This refers to the use of identification training for a random set of (familiar) faces with a simple network (often a one-layer network). The network learns to map DCNN-generated face representations of the available images onto identity nodes.

The fourth approach is fine-tuning individual face representations. Fine-tuning has also been used for learning familiar identities ( Blauch et al. 2020a ). It is an unusual method because it alters weights within the DCNN itself. This can improve performance for the familiarized faces but can limit the network’s ability to represent other faces.

These methods create a personal face learning history that supports more accurate and robust face processing for familiar people ( O’Toole et al. 2018 ).
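As promised above, here is a minimal sketch of two of these routes, both operating outside the DCNN on its descriptors: identity-prototype averaging (the first approach) and multiple face contrast with a one-layer network (the third). The descriptor dimensionality, image counts, and number of familiar identities are illustrative assumptions.

```python
# Sketch of Stage 4 familiarization outside the DCNN.
import torch
import torch.nn as nn
import torch.nn.functional as F

K = 20                                     # number of familiar people
descriptors = torch.randn(K, 50, 512)      # 50 image descriptors per person

# Approach 1 - averaging: one person-identity prototype per familiar face.
prototypes = F.normalize(descriptors.mean(dim=1), dim=1)

# Recognition by nearest prototype (cosine similarity).
probe = F.normalize(torch.randn(512), dim=0)
best_match = torch.argmax(prototypes @ probe)

# Approach 3 - multiple face contrast: a one-layer network maps descriptors
# onto K identity nodes, trained with cross-entropy.
contrast_net = nn.Linear(512, K)
labels = torch.arange(K).repeat_interleave(50)   # identity label per image
logits = contrast_net(descriptors.reshape(-1, 512))
loss = F.cross_entropy(logits, labels)
loss.backward()
```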

4.3. Mapping Learning Between Humans and Machines

Deep networks rely on multiple types of learning that can be useful in formulating and testing complex, nuanced hypotheses about human face learning. Manipulable variables include order of learning, training data, and network plasticity at different learning stages. We consider a sample of topics in human face processing that can be investigated by manipulating learning in deep networks. Because these investigations are just beginning, we provide an overview of the work in progress and discuss possible next steps in modeling.

4.3.1. Development of face processing.

Infants’ early experience with faces is critical for the development of face processing skills (Maurer et al. 2002). The timing of this experience has become increasingly clear with the availability of data sets gathered using head-mounted cameras on infants (1–15 months of age) (e.g., Jayaraman et al. 2015, Yoshida & Smith 2008). Seen from the infant’s perspective, it becomes clear that the development of sensorimotor abilities drives visual experience. Infants’ experience transitions from seeing only what is made available to them (often faces in the near range), to seeing the world from the perspective of a crawler (objects and environments), to seeing hands and the objects that they manipulate (Fausey et al. 2016, Jayaraman et al. 2015, Smith & Slone 2017, Sugden & Moulson 2017). Between 1 and 3 months of age, faces are frequent, temporally persistent, and viewed frontally at close range. This early experience with faces is limited to a few individuals. Faces become less frequent as the child’s first year progresses and attention shifts to the environment, to objects, and later to hands (Jayaraman & Smith 2019).

The prevalence of a few important faces in an infant’s visual world suggests that early face learning may have an outsized influence on structuring visual recognition systems. Infants’ visual experience of objects, faces, and environments can provide a curriculum for teaching machines (Smith et al. 2018). DCNNs can be used to test hypotheses about the emergence of competence on different face processing tasks. Some basic computational challenges, however, need to be addressed. Training with very large numbers of objects (or faces) is required for deep network learning to converge (see Section 4.2.1). Starting small and building competence on multiple domains (faces, objects, environments) might require basic changes to deep network training. Alternatively, the small number of special faces in an infant’s life might be considered familiar faces. Perception and memory of these faces may be better modeled using tools that operate outside the deep network on representations that develop within the network (Stage 4 learning; Section 4.2.4). In this case, the quality of the representation produced at different points in a network’s development of more general visual knowledge varies (Stages 1 and 2 of training; Sections 4.2.1 and 4.2.2). The learning of these special faces early in development might interact with the learning of objects and scenes at the categorical level (Rosch et al. 1976, Yovel et al. 2012). A promising approach would involve pausing training in Stages 1 and 2 to test face representation quality at various points along the way to convergence.

4.3.2. Race bias in the performance of humans and deep networks.

People recognize own-race faces more accurately than other-race faces. For humans, this other-race effect begins in infancy ( Kelly et al. 2005 , 2007 ) and is manifest in children ( Pezdek et al. 2003 ). Although it is possible to reverse these effects in childhood ( Sangrigoli et al. 2005 ), training adults to recognize other-race faces yields only modest gains (e.g., Cavazos et al. 2019 , Hayward et al. 2017 , Laurence et al. 2016 , Matthews & Mondloch 2018 , Tanaka & Pierce 2009 ). Concomitantly, evidence for the experience-based contact hypothesis is weak when it is evaluated in adulthood ( Levin 2000 ). Clearly, the timing of experience is critical in the other-race effect. Developmental learning, which results in perceptual narrowing during a critical childhood period, may provide a partial account of the other-race effect ( Kelly et al. 2007 , Sangrigoli et al. 2005 , Scott & Monesson 2010 ).

Perceptual narrowing: sculpting of neural and perceptual processing via experience during a critical period in child development.

Face recognition algorithms from the 1990s and present-day DCNNs both differ in accuracy for faces of different races (for a review, see Cavazos et al. 2020; for a comprehensive test of race bias in DCNNs, see Grother et al. 2019). Although race-imbalanced training data are often cited as a cause of these effects, it is unclear which training stage(s) contribute to the bias; it is likely that biased learning affects all stages. From the human perspective, for many people, experience favors own-race faces across the lifespan, potentially impacting performance through multiple learning mechanisms (developmental, unfamiliar, and familiar face learning). DCNN training may likewise use race-biased data at all stages. For humans, understanding the role of different types of learning in the other-race effect is challenging because experience with faces cannot be controlled. DCNNs can serve as a tool for studying critical periods and perceptual narrowing: It is possible to compare the face representations that emerge from training regimes that vary in the time course of exposure to faces of different races. The ability to manipulate training stage order, network plasticity, and training set diversity in deep networks offers an opportunity to test hypotheses about how bias emerges. The major challenge for DCNNs is the limited availability of face databases that represent the diversity of humans.

4.3.3. Familiar versus unfamiliar face recognition.

Face familiarity in a deep network can be modeled in more ways than we can count. The approaches presented in Section 4.2.4 are just a beginning. Researchers should focus first on the big questions. How do familiar and unfamiliar face representations differ—beyond simple accuracy and robustness? This has been much debated recently, and many questions remain ( Blauch et al. 2020a , b ; Young & Burton 2020 ; Yovel & Abudarham 2020 ). One approach is to ask where in the learning process representations for familiar and unfamiliar faces diverge. The methods outlined in Section 4.2.4 make some predictions.

In the individual and multiple face contrast methods, familiar and unfamiliar face representations are not differentiated within the deep network. Instead, familiar face representations generated by the DCNN are enhanced in another, simpler network populated with known faces. A familiar face’s representation is affected, therefore, by the other faces that we know well. Contrast techniques have preliminary empirical support. In the work of Noyes et al. (2021) , familiarization using individual-face contrast improved identification for both evasion and impersonation disguise. It also produced a pattern of accuracy similar to that seen for people familiar with the disguised individuals ( Noyes & Jenkins 2019 ). For humans who were unfamiliar with the disguised faces, the pattern of accuracy resembled that seen after general face training inside of the DCNN. There is also support for multiple-face contrast familiarization. Perceptual expertise findings that emphasize the selective effects of the exemplars experienced during highly skilled learning are consistent with this approach ( Collins & Behrmann 2020 ) (see Section 3.2 ).

Familiarization by averaging and fine-tuning both improve performance, but at a cost. For example, averaging the DCNN representations increased performance for evasion disguise by increasing tolerance for appearance variation (Noyes et al. 2021). It decreased performance, however, for impersonation disguise by allowing too much tolerance for appearance variation. Averaging methods highlight the need to balance the perception of identity across variable images against the ability to tell similar faces apart.

Familiarization via fine-tuning was explored by Blauch et al. (2020a) , who varied the number of layers tuned (all layers, fully connected layers, only the fully connected layer mapping the perceptual layer to identity nodes). Fine-tuning applied at lower layers alters the weights within the deep network to produce a perceptual representation potentially affected by familiar faces. Fine-tuning in the mapping layer is equivalent to multiclass face contrast learning ( Blauch et al. 2020b ). Blauch et al. (2020b) show that fine-tuning the perceptual representation, which they consider analogous to perceptual learning, is not necessary for producing a familiarity effect ( Blauch et al. 2020a ).

These approaches are not (necessarily) mutually exclusive and therefore can be combined to exploit useful features of each.

4.3.4. Objects, faces, both.

The organization of face-, body-, and object-selective areas in the ventral temporal cortex has been studied intensively (cf. Grill-Spector & Weiner 2014 ). Neuroimaging studies in childhood reveal the developmental time course of face selectivity and other high-level visual tasks (e.g., Natu et al. 2016 ; Nordt et al. 2019 , 2020 ). How these systems interact during development in the context of constantly changing input from the environment is an open question. DCNNs can be used to test functional hypotheses about the development of object and face learning (see also Grill-Spector et al. 2018 ).

In the case of machine learning, face recognition networks are more accurate when pretrained to categorize objects (Liu et al. 2015, Yi et al. 2014), and networks trained with only faces are more accurate for face recognition than networks trained with only objects (Abudarham & Yovel 2020, Blauch et al. 2020a). Human-like viewpoint invariance was found in a DCNN trained for face recognition but not in one trained for object recognition (Abudarham & Yovel 2020). In standard machine learning practice, networks are trained first with objects and then with faces. Networks can, however, learn object and face recognition simultaneously (Dobs et al. 2020), with minimal duplication of neural resources.
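One hedged sketch of simultaneous learning (not the Dobs et al. 2020 implementation) is a shared trunk with separate object and face heads; the class counts below are illustrative assumptions.

```python
# Sketch: one trunk, two task heads, trained jointly on objects and faces.
import torch
import torch.nn as nn
import torchvision.models as models

trunk = models.resnet50(weights=None)
trunk.fc = nn.Identity()                 # shared 2048-d representation

object_head = nn.Linear(2048, 1000)      # object categories (illustrative)
face_head = nn.Linear(2048, 5000)        # face identities (illustrative)

images = torch.randn(4, 3, 224, 224)
shared = trunk(images)
object_logits = object_head(shared)
face_logits = face_head(shared)
# Training would sum a cross-entropy loss from each head on its own labels.
```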

4.4. New Tools, New Questions, New Data, and a New Look at Old Data

Psychologists have long posited diverse and complex learning mechanisms for faces. Deep networks provide new tools that can be used to model human face learning with greater precision than was possible previously. This is useful because it encourages theoreticians to articulate hypotheses in ways specific enough to model. It may no longer be sufficient to explain a phenomenon in terms of generic learning or contact. Concepts such as perceptual narrowing should include ideas about where and how in the learning process this narrowing occurs. A major challenge ahead is the sheer number of knobs to be set in deep networks. Plasticity, for example, can be dialed up or down and applied to selected network layers; specific face diets can be administered across multiple learning stages (in sequence or simultaneously); the list goes on. In all of the topics discussed, and others not discussed, theoretical ideas should specify the manipulations thought to be most critical. We should follow the counsel of Box (1976) to worry selectively, focusing on what is importantly wrong rather than on minor flaws. New tools succeed when they facilitate the discovery of things that we did not know or had not hypothesized. Testing these hypotheses will require new data and may suggest a reevaluation of existing data.

5. THE PATH FORWARD

In this review, we highlight fundamental advances in thinking brought about by deep learning approaches. These networks solve the inverse optics problem for face identification by untangling image, appearance, and identity over layers of neural-like processing. This demonstrates that robust face identification can be achieved with a representation that includes specific information about the face image(s) actually experienced. These representations retain information about appearance, perceived traits, expressions, and identity.

Direct-fit models posit that deep networks operate by placing new observations into the context of past experience. These models depend on overparameterized networks that create a high-dimensional space from real-world training data. Face representations housed within this space project onto units, thereby confounding stimulus features that (may) separate in the high-dimensional space. This raises questions about the transparency and interpretability of information gained by examining the response properties of network units. Deep networks can be studied at both the micro- and macroscale simultaneously and can be used to formulate hypotheses about the underlying neural code for faces. A key to understanding face representations is to reconcile the responses of neurons with the structure of the code in the high-dimensional space. This is a challenging problem best approached by combining psychological, neural, and computational methods.

The process of training a deep network is complex and layered. It draws on learning mechanisms aimed at objects and faces, visual categories of faces (e.g., race), and special familiar faces. Psychological and neural theory considers the many ways in which people and brains learn faces from real-world visual experience. DCNNs offer the potential to implement and test sophisticated hypotheses about how humans learn faces across the lifespan.

We should not lose sight of the fact that a compelling reason to study deep networks is that they actually work: They perform nearly as well as humans on face recognition tasks that have stymied computational modelers for decades. This might qualify as a property of deep networks that is importantly right (Box 1976). There is a difference, of course, between working and working like humans. Determining whether a deep network can work like humans, or could be made to do so by manipulating other properties of the network (e.g., architectures, training data, learning rules), is work that is just beginning.

SUMMARY POINTS

  • Face representations generated by DCNNs trained for identification retain information about the face (e.g., identity, demographics, attributes, traits, expression) and the image (e.g., viewpoint).
  • Deep learning face networks generate a surprisingly structured face representation from unstructured training with in-the-wild face images.
  • Individual output units from deep networks are unlikely to signal the presence of interpretable features.
  • Fundamental structural aspects of high-level visual codes for faces in deep networks replicate over a wide variety of network architectures.
  • Diverse learning mechanisms in DCNNs, applied simultaneously or in sequence, can be used to model human face perception across the lifespan.

FUTURE ISSUES

  • Large-scale systematic manipulations of training data (race, ethnicity, image variability) are needed to give insight into the role of experience in structuring face representations.
  • Fundamental challenges remain in understanding how to combine deep networks for face, object, and scene recognition in ways analogous to the human visual system.
  • Deep networks model the ventral visual stream at a generic level, arguably up to the level of the IT cortex. Future work should examine how downstream systems, such as face patches, could be connected into this system.
  • In rethinking the goals of face processing, we argue in this review that some longstanding assumptions about visual representations should be reconsidered. Future work should consider novel experimental questions and employ methods that do not rely on these assumptions.

ACKNOWLEDGMENTS

The authors are supported by funding provided by National Eye Institute grant R01EY029692-03 to A.J.O. and C.D.C.

DISCLOSURE STATEMENT

C.D.C. is an equity holder in Mukh Technologies, which may potentially benefit from research results.

1 This is the case in networks trained with the Softmax objective function.

LITERATURE CITED

  • Abadi M, Barham P, Chen J, Chen Z, Davis A, et al. 2016. Tensorflow: a system for large-scale machine learning. In Proceedings of the 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pp. 265–83. Berkeley, CA: USENIX
  • Abudarham N, Shkiller L, Yovel G. 2019. Critical features for face recognition. Cognition 182:73–83
  • Abudarham N, Yovel G. 2020. Face recognition depends on specialized mechanisms tuned to view-invariant facial features: insights from deep neural networks optimized for face or object recognition. bioRxiv 2020.01.01.890277. 10.1101/2020.01.01.890277
  • Azevedo FA, Carvalho LR, Grinberg LT, Farfel JM, Ferretti RE, et al. 2009. Equal numbers of neuronal and nonneuronal cells make the human brain an isometrically scaled-up primate brain. J. Comp. Neurol. 513(5):532–41
  • Barlow HB. 1972. Single units and sensation: a neuron doctrine for perceptual psychology? Perception 1(4):371–94
  • Bashivan P, Kar K, DiCarlo JJ. 2019. Neural population control via deep image synthesis. Science 364(6439):eaav9436
  • Best-Rowden L, Jain AK. 2018. Learning face image quality from human assessments. IEEE Trans. Inform. Forensics Secur. 13(12):3064–77
  • Blanz V, Vetter T. 1999. A morphable model for the synthesis of 3D faces. In Proceedings of the 26th Annual Conference on Computer Graphics and Interactive Techniques, pp. 187–94. New York: ACM
  • Blauch NM, Behrmann M, Plaut DC. 2020a. Computational insights into human perceptual expertise for familiar and unfamiliar face recognition. Cognition 208:104341
  • Blauch NM, Behrmann M, Plaut DC. 2020b. Deep learning of shared perceptual representations for familiar and unfamiliar faces: reply to commentaries. Cognition 208:104484
  • Box GE. 1976. Science and statistics. J. Am. Stat. Assoc. 71(356):791–99
  • Box GEP. 1979. Robustness in the strategy of scientific model building. In Robustness in Statistics, ed. Launer RL, Wilkinson GN, pp. 201–36. Cambridge, MA: Academic Press
  • Bruce V, Young A. 1986. Understanding face recognition. Br. J. Psychol. 77(3):305–27
  • Burton AM, Bruce V, Hancock PJ. 1999. From pixels to people: a model of familiar face recognition. Cogn. Sci. 23(1):1–31
  • Cavazos JG, Noyes E, O’Toole AJ. 2019. Learning context and the other-race effect: strategies for improving face recognition. Vis. Res. 157:169–83
  • Cavazos JG, Phillips PJ, Castillo CD, O’Toole AJ. 2020. Accuracy comparison across face recognition algorithms: Where are we on measuring race bias? IEEE Trans. Biom. Behav. Identity Sci. 3(1):101–11
  • Chang L, Tsao DY. 2017. The code for facial identity in the primate brain. Cell 169(6):1013–28
  • Chen JC, Patel VM, Chellappa R. 2016. Unconstrained face verification using deep CNN features. In Proceedings of the 2016 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 1–9. Piscataway, NJ: IEEE
  • Cichy RM, Kaiser D. 2019. Deep neural networks as scientific models. Trends Cogn. Sci. 23(4):305–17
  • Collins E, Behrmann M. 2020. Exemplar learning reveals the representational origins of expert category perception. PNAS 117(20):11167–77
  • Colón YI, Castillo CD, O’Toole AJ. 2021. Facial expression is retained in deep networks trained for face identification. J. Vis. 21(4):4
  • Cootes TF, Taylor CJ, Cooper DH, Graham J. 1995. Active shape models—their training and application. Comput. Vis. Image Underst. 61(1):38–59
  • Crosswhite N, Byrne J, Stauffer C, Parkhi O, Cao Q, Zisserman A. 2018. Template adaptation for face verification and identification. Image Vis. Comput. 79:35–48
  • Deng J, Guo J, Xue N, Zafeiriou S. 2019. Arcface: additive angular margin loss for deep face recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4690–99. Piscataway, NJ: IEEE
  • Dhar P, Bansal A, Castillo CD, Gleason J, Phillips P, Chellappa R. 2020. How are attributes expressed in face DCNNs? In Proceedings of the 2020 15th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2020), pp. 61–68. Piscataway, NJ: IEEE
  • DiCarlo JJ, Cox DD. 2007. Untangling invariant object recognition. Trends Cogn. Sci. 11(8):333–41
  • Dobs K, Kell AJ, Martinez J, Cohen M, Kanwisher N. 2020. Using task-optimized neural networks to understand why brains have specialized processing for faces. J. Vis. 20(11):660
  • Dowsett A, Sandford A, Burton AM. 2016. Face learning with multiple images leads to fast acquisition of familiarity for specific individuals. Q. J. Exp. Psychol. 69(1):1–10
  • El Khiyari H, Wechsler H. 2016. Face verification subject to varying (age, ethnicity, and gender) demographics using deep learning. J. Biom. Biostat. 7:323
  • Fausey CM, Jayaraman S, Smith LB. 2016. From faces to hands: changing visual input in the first two years. Cognition 152:101–7
  • Freiwald WA, Tsao DY. 2010. Functional compartmentalization and viewpoint generalization within the macaque face-processing system. Science 330(6005):845–51
  • Freiwald WA, Tsao DY, Livingstone MS. 2009. A face feature space in the macaque temporal lobe. Nat. Neurosci. 12(9):1187–96
  • Fukushima K. 1988. Neocognitron: a hierarchical neural network capable of visual pattern recognition. Neural Netw. 1(2):119–30
  • Goodfellow I, Pouget-Abadie J, Mirza M, Xu B, Warde-Farley D, et al. 2014. Generative adversarial nets. In NIPS’14: Proceedings of the 27th International Conference on Neural Information Processing Systems, pp. 2672–80. New York: ACM
  • Goodman CS, Shatz CJ. 1993. Developmental mechanisms that generate precise patterns of neuronal connectivity. Cell 72:77–98
  • Grill-Spector K, Kushnir T, Edelman S, Avidan G, Itzchak Y, Malach R. 1999. Differential processing of objects under various viewing conditions in the human lateral occipital complex. Neuron 24(1):187–203
  • Grill-Spector K, Weiner KS. 2014. The functional architecture of the ventral temporal cortex and its role in categorization. Nat. Rev. Neurosci. 15(8):536–48
  • Grill-Spector K, Weiner KS, Gomez J, Stigliani A, Natu VS. 2018. The functional neuroanatomy of face perception: from brain measurements to deep neural networks. Interface Focus 8(4):20180013
  • Gross CG. 2002. Genealogy of the “grandmother cell”. Neuroscientist 8(5):512–18
  • Grother P, Ngan M, Hanaoka K. 2019. Face recognition vendor test (FRVT) part 3: demographic effects. Rep., Natl. Inst. Stand. Technol., US Dept. Commerce, Gaithersburg, MD
  • Hancock PJ, Bruce V, Burton AM. 2000. Recognition of unfamiliar faces. Trends Cogn. Sci. 4(9):330–37
  • Hasson U, Nastase SA, Goldstein A. 2020. Direct fit to nature: an evolutionary perspective on biological and artificial neural networks. Neuron 105(3):416–34
  • Hayward WG, Favelle SK, Oxner M, Chu MH, Lam SM. 2017. The other-race effect in face learning: using naturalistic images to investigate face ethnicity effects in a learning paradigm. Q. J. Exp. Psychol. 70(5):890–96
  • Hesse JK, Tsao DY. 2020. The macaque face patch system: a turtle’s underbelly for the brain. Nat. Rev. Neurosci. 21(12):695–716
  • Hill MQ, Parde CJ, Castillo CD, Colon YI, Ranjan R, et al. 2019. Deep convolutional neural networks in the face of caricature. Nat. Mach. Intell. 1(11):522–29
  • Hong H, Yamins DL, Majaj NJ, DiCarlo JJ. 2016. Explicit information for category-orthogonal object properties increases along the ventral stream. Nat. Neurosci. 19(4):613–22
  • Hornik K, Stinchcombe M, White H. 1989. Multilayer feedforward networks are universal approximators. Neural Netw. 2(5):359–66
  • Huang GB, Lee H, Learned-Miller E. 2012. Learning hierarchical representations for face verification with convolutional deep belief networks. In Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2518–25. Piscataway, NJ: IEEE
  • Huang GB, Mattar M, Berg T, Learned-Miller E. 2008. Labeled faces in the wild: a database for studying face recognition in unconstrained environments. Paper presented at the Workshop on Faces in “Real-Life” Images: Detection, Alignment, and Recognition, Marseille, France
  • Ilyas A, Santurkar S, Tsipras D, Engstrom L, Tran B, Madry A. 2019. Adversarial examples are not bugs, they are features. arXiv:1905.02175 [stat.ML]
  • Issa EB, DiCarlo JJ. 2012. Precedence of the eye region in neural processing of faces. J. Neurosci. 32(47):16666–82
  • Jacquet M, Champod C. 2020. Automated face recognition in forensic science: review and perspectives. Forensic Sci. Int. 307:110124
  • Jayaraman S, Fausey CM, Smith LB. 2015. The faces in infant-perspective scenes change over the first year of life. PLOS ONE 10(5):e0123780
  • Jayaraman S, Smith LB. 2019. Faces in early visual environments are persistent not just frequent. Vis. Res. 157:213–21
  • Jenkins R, White D, Van Montfort X, Burton AM. 2011. Variability in photos of the same face. Cognition 121(3):313–23
  • Kandel ER, Schwartz JH, Jessell TM, Siegelbaum S, Hudspeth AJ, Mack S, eds. 2000. Principles of Neural Science, Vol. 4. New York: McGraw-Hill
  • Kay KN, Weiner KS, Grill-Spector K. 2015. Attention reduces spatial uncertainty in human ventral temporal cortex. Curr. Biol. 25(5):595–600
  • Kelly DJ, Quinn PC, Slater AM, Lee K, Ge L, Pascalis O. 2007. The other-race effect develops during infancy: evidence of perceptual narrowing. Psychol. Sci. 18(12):1084–89
  • Kelly DJ, Quinn PC, Slater AM, Lee K, Gibson A, et al. 2005. Three-month-olds, but not newborns, prefer own-race faces. Dev. Sci. 8(6):F31–36
  • Kietzmann TC, Swisher JD, König P, Tong F. 2012. Prevalence of selectivity for mirror-symmetric views of faces in the ventral and dorsal visual pathways. J. Neurosci. 32(34):11763–72
  • Krishnapriya KS, Albiero V, Vangara K, King MC, Bowyer KW. 2020. Issues related to face recognition accuracy varying based on race and skin tone. IEEE Trans. Technol. Soc. 1(1):8–20
  • Krishnapriya K, Vangara K, King MC, Albiero V, Bowyer K. 2019. Characterizing the variability in face recognition accuracy relative to race. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, Vol. 1, pp. 2278–85. Piscataway, NJ: IEEE
  • Krizhevsky A, Sutskever I, Hinton GE. 2012. Imagenet classification with deep convolutional neural networks. In NIPS’12: Proceedings of the 25th International Conference on Neural Information Processing Systems, pp. 1097–105. New York: ACM
  • Kumar N, Berg AC, Belhumeur PN, Nayar SK. 2009. Attribute and simile classifiers for face verification. In Proceedings of the 2009 IEEE International Conference on Computer Vision, pp. 365–72. Piscataway, NJ: IEEE
  • Laurence S, Zhou X, Mondloch CJ. 2016. The flip side of the other-race coin: They all look different to me. Br. J. Psychol. 107(2):374–88
  • LeCun Y, Bengio Y, Hinton G. 2015. Deep learning. Nature 521(7553):436–44
  • Levin DT. 2000. Race as a visual feature: using visual search and perceptual discrimination tasks to understand face categories and the cross-race recognition deficit. J. Exp. Psychol. Gen. 129(4):559–74
  • Lewenberg Y, Bachrach Y, Shankar S, Criminisi A. 2016. Predicting personal traits from facial images using convolutional neural networks augmented with facial landmark information. arXiv:1605.09062 [cs.CV]
  • Li Y, Gao F, Ou Z, Sun J. 2018. Angular softmax loss for end-to-end speaker verification. In Proceedings of the 11th International Symposium on Chinese Spoken Language Processing (ISCSLP), pp. 190–94. Baixas, France: ISCA
  • Liu Z, Luo P, Wang X, Tang X. 2015. Deep learning face attributes in the wild. In Proceedings of the 2015 IEEE International Conference on Computer Vision, pp. 3730–38. Piscataway, NJ: IEEE
  • Lundqvist D, Flykt A, Ohman A. 1998. Karolinska directed emotional faces. Database of standardized facial images, Psychol. Sect., Dept. Clin. Neurosci., Karolinska Hosp., Solna, Swed. https://www.kdef.se/
  • Malpass RS, Kravitz J. 1969. Recognition for faces of own and other race. J. Personal. Soc. Psychol. 13(4):330–34
  • Matthews CM, Mondloch CJ. 2018. Improving identity matching of newly encountered faces: effects of multi-image training. J. Appl. Res. Mem. Cogn. 7(2):280–90
  • Maurer D, Le Grand R, Mondloch CJ. 2002. The many faces of configural processing. Trends Cogn. Sci. 6(6):255–60
  • Maze B, Adams J, Duncan JA, Kalka N, Miller T, et al. 2018. IARPA Janus Benchmark—C: face dataset and protocol. In Proceedings of the 2018 International Conference on Biometrics (ICB), pp. 158–65. Piscataway, NJ: IEEE
  • McCurrie M, Beletti F, Parzianello L, Westendorp A, Anthony S, Scheirer WJ. 2017. Predicting first impressions with deep learning. In Proceedings of the 2017 IEEE International Conference on Automatic Face & Gesture Recognition (FG 2017), pp. 518–25. Piscataway, NJ: IEEE
  • Murphy J, Ipser A, Gaigg SB, Cook R. 2015. Exemplar variance supports robust learning of facial identity. J. Exp. Psychol. Hum. Percept. Perform. 41(3):577–81
  • Natu VS, Barnett MA, Hartley J, Gomez J, Stigliani A, Grill-Spector K. 2016. Development of neural sensitivity to face identity correlates with perceptual discriminability. J. Neurosci. 36(42):10893–907
  • Natu VS, Jiang F, Narvekar A, Keshvari S, Blanz V, O’Toole AJ. 2010. Dissociable neural patterns of facial identity across changes in viewpoint. J. Cogn. Neurosci. 22(7):1570–82
  • Nordt M, Gomez J, Natu V, Jeska B, Barnett M, Grill-Spector K. 2019. Learning to read increases the informativeness of distributed ventral temporal responses. Cereb. Cortex 29(7):3124–39
  • Nordt M, Gomez J, Natu VS, Rezai AA, Finzi D, Grill-Spector K. 2020. Selectivity to limbs in ventral temporal cortex decreases during childhood as selectivity to faces and words increases. J. Vis. 20(11):152
  • Noyes E, Jenkins R. 2019. Deliberate disguise in face identification. J. Exp. Psychol. Appl. 25(2):280–90
  • Noyes E, Parde C, Colon Y, Hill M, Castillo C, et al. 2021. Seeing through disguise: getting to know you with a deep convolutional neural network. Cognition. In press
  • Noyes E, Phillips P, O’Toole A. 2017. What is a super-recogniser? In Face Processing: Systems, Disorders and Cultural Differences, ed. Bindemann M, pp. 173–201. Hauppauge, NY: Nova Sci. Publ.
  • Oosterhof NN, Todorov A. 2008. The functional basis of face evaluation. PNAS 105(32):11087–92
  • O’Toole AJ, Castillo CD, Parde CJ, Hill MQ, Chellappa R. 2018. Face space representations in deep convolutional neural networks. Trends Cogn. Sci. 22(9):794–809
  • O’Toole AJ, Phillips PJ, Jiang F, Ayyad J, Pénard N, Abdi H. 2007. Face recognition algorithms surpass humans matching faces over changes in illumination. IEEE Trans. Pattern Anal. Mach. Intell. 29(9):1642–46
  • Parde CJ, Castillo C, Hill MQ, Colon YI, Sankaranarayanan S, et al. 2017. Face and image representation in deep CNN features. In Proceedings of the 2017 IEEE International Conference on Automatic Face & Gesture Recognition (FG 2017), pp. 673–80. Piscataway, NJ: IEEE
  • Parde CJ, Colón YI, Hill MQ, Castillo CD, Dhar P, O’Toole AJ. 2021. Face recognition by humans and machines: closing the gap between single-unit and neural population codes—insights from deep learning in face recognition. J. Vis. In press
  • Parde CJ, Hu Y, Castillo C, Sankaranarayanan S, O’Toole AJ. 2019. Social trait information in deep convolutional neural networks trained for face identification. Cogn. Sci. 43(6):e12729
  • Parkhi OM, Vedaldi A, Zisserman A. 2015. Deep face recognition. Rep., Vis. Geom. Group, Dept. Eng. Sci., Univ. Oxford, UK
  • Paszke A, Gross S, Massa F, Lerer A, Bradbury J, et al. 2019. Pytorch: an imperative style, high-performance deep learning library. In NeurIPS 2019: Proceedings of the 32nd International Conference on Neural Information Processing Systems, pp. 8024–35. New York: ACM
  • Pezdek K, Blandon-Gitlin I, Moore C. 2003. Children’s face recognition memory: more evidence for the cross-race effect. J. Appl. Psychol. 88(4):760–63
  • Phillips PJ, Beveridge JR, Draper BA, Givens G, O’Toole AJ, et al. 2011. An introduction to the good, the bad, & the ugly face recognition challenge problem. In Proceedings of the 2011 IEEE International Conference on Automatic Face & Gesture Recognition (FG), pp. 346–53. Piscataway, NJ: IEEE
  • Phillips PJ, O’Toole AJ. 2014. Comparison of human and computer performance across face recognition experiments. Image Vis. Comput. 32(1):74–85
  • Phillips PJ, Yates AN, Hu Y, Hahn CA, Noyes E, et al. 2018. Face recognition accuracy of forensic examiners, superrecognizers, and face recognition algorithms. PNAS 115(24):6171–76
  • Poggio T, Banburski A, Liao Q. 2020. Theoretical issues in deep networks. PNAS 117(48):30039–45
  • Ponce CR, Xiao W, Schade PF, Hartmann TS, Kreiman G, Livingstone MS. 2019. Evolving images for visual neurons using a deep generative network reveals coding principles and neuronal preferences. Cell 177(4):999–1009
  • Ranjan R, Bansal A, Zheng J, Xu H, Gleason J, et al. 2019. A fast and accurate system for face detection, identification, and verification. IEEE Trans. Biom. Behav. Identity Sci. 1(2):82–96
  • Ranjan R, Castillo CD, Chellappa R. 2017. L2-constrained softmax loss for discriminative face verification. arXiv:1703.09507 [cs.CV]
  • Ranjan R, Sankaranarayanan S, Castillo CD, Chellappa R. 2017c. An all-in-one convolutional neural network for face analysis. In Proceedings of the 2017 IEEE International Conference on Automatic Face & Gesture Recognition (FG 2017), pp. 17–24. Piscataway, NJ: IEEE
  • Richards BA, Lillicrap TP, Beaudoin P, Bengio Y, Bogacz R, et al. 2019. A deep learning framework for neuroscience. Nat. Neurosci. 22(11):1761–70
  • Ritchie KL, Burton AM. 2017. Learning faces from variability. Q. J. Exp. Psychol. 70(5):897–905
  • Rosch E, Mervis CB, Gray WD, Johnson DM, Boyes-Braem P. 1976. Basic objects in natural categories. Cogn. Psychol. 8(3):382–439
  • Russakovsky O, Deng J, Su H, Krause J, Satheesh S, et al. 2015. ImageNet Large Scale Visual Recognition Challenge. Int. J. Comput. Vis. 115(3):211–52
  • Russell R, Duchaine B, Nakayama K. 2009. Super-recognizers: people with extraordinary face recognition ability. Psychon. Bull. Rev. 16(2):252–57
  • Sangrigoli S, Pallier C, Argenti AM, Ventureyra V, de Schonen S. 2005. Reversibility of the other-race effect in face recognition during childhood. Psychol. Sci. 16(6):440–44
  • Sankaranarayanan S, Alavi A, Castillo C, Chellappa R. 2016. Triplet probabilistic embedding for face verification and clustering. arXiv:1604.05417 [cs.CV]
  • Schrimpf M, Kubilius J, Hong H, Majaj NJ, Rajalingham R, et al. 2018. Brain-score: Which artificial neural network for object recognition is most brain-like? bioRxiv 407007. 10.1101/407007
  • Schroff F, Kalenichenko D, Philbin J. 2015. Facenet: a unified embedding for face recognition and clustering. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition, pp. 815–23. Piscataway, NJ: IEEE
  • Scott LS, Monesson A. 2010. Experience-dependent neural specialization during infancy. Neuropsychologia 48(6):1857–61
  • Sengupta S, Chen JC, Castillo C, Patel VM, Chellappa R, Jacobs DW. 2016. Frontal to profile face verification in the wild. In Proceedings of the 2016 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 1–9. Piscataway, NJ: IEEE
  • Sim T, Baker S, Bsat M. 2002. The CMU pose, illumination, and expression (PIE) database. In Proceedings of the Fifth IEEE International Conference on Automatic Face Gesture Recognition, pp. 53–58. Piscataway, NJ: IEEE
  • Simonyan K, Zisserman A. 2014. Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556 [cs.CV]
  • Smith LB, Jayaraman S, Clerkin E, Yu C. 2018. The developing infant creates a curriculum for statistical learning. Trends Cogn. Sci. 22(4):325–36
  • Smith LB, Slone LK. 2017. A developmental approach to machine learning? Front. Psychol. 8:2124
  • Song A, Li L, Atalla C, Cottrell G. 2017. Learning to see people like people: predicting social impressions of faces. Cogn. Sci. 2017:1096–101
  • Storrs KR, Kietzmann TC, Walther A, Mehrer J, Kriegeskorte N. 2020. Diverse deep neural networks all predict human IT well, after training and fitting. bioRxiv 2020.05.07.082743. 10.1101/2020.05.07.082743
  • Su H, Maji S, Kalogerakis E, Learned-Miller E. 2015. Multi-view convolutional neural networks for 3D shape recognition. In Proceedings of the 2015 IEEE International Conference on Computer Vision, pp. 945–53. Piscataway, NJ: IEEE
  • Sugden NA, Moulson MC. 2017. Hey baby, what’s “up”? One- and 3-month-olds experience faces primarily upright but non-upright faces offer the best views. Q. J. Exp. Psychol. 70(5):959–69
  • Taigman Y, Yang M, Ranzato M, Wolf L. 2014. Deepface: closing the gap to human-level performance in face verification. In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition, pp. 1701–8. Piscataway, NJ: IEEE
  • Tanaka JW, Pierce LJ. 2009. The neural plasticity of other-race face recognition. Cogn. Affect. Behav. Neurosci. 9(1):122–31
  • Terhörst P, Fährmann D, Damer N, Kirchbuchner F, Kuijper A. 2020. Beyond identity: What information is stored in biometric face templates? arXiv:2009.09918 [cs.CV]
  • Thorpe S, Fize D, Marlot C. 1996. Speed of processing in the human visual system. Nature 381(6582):520–22
  • Todorov A. 2017. Face Value: The Irresistible Influence of First Impressions. Princeton, NJ: Princeton Univ. Press
  • Todorov A, Mandisodza AN, Goren A, Hall CC. 2005. Inferences of competence from faces predict election outcomes. Science 308(5728):1623–26
  • Valentine T. 1991. A unified account of the effects of distinctiveness, inversion, and race in face recognition. Q. J. Exp. Psychol. A 43(2):161–204
  • van der Maaten L, Weinberger K. 2012. Stochastic triplet embedding. In Proceedings of the 2012 IEEE International Workshop on Machine Learning for Signal Processing, pp. 1–6. Piscataway, NJ: IEEE
  • Walker M, Vetter T. 2009. Portraits made to measure: manipulating social judgments about individuals with a statistical face model. J. Vis. 9(11):12
  • Wang F, Liu W, Liu H, Cheng J. 2018. Additive margin softmax for face verification. IEEE Signal Process. Lett. 25:926–30
  • Wang F, Xiang X, Cheng J, Yuille AL. 2017. Normface: L2 hypersphere embedding for face verification. In MM ’17: Proceedings of the 25th ACM International Conference on Multimedia, pp. 1041–49. New York: ACM
  • Xie C, Tan M, Gong B, Wang J, Yuille AL, Le QV. 2020. Adversarial examples improve image recognition. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 819–28. Piscataway, NJ: IEEE
  • Yamins DL, Hong H, Cadieu CF, Solomon EA, Seibert D, DiCarlo JJ. 2014. Performance-optimized hierarchical models predict neural responses in higher visual cortex. PNAS 111(23):8619–24
  • Yi D, Lei Z, Liao S, Li SZ. 2014. Learning face representation from scratch. arXiv:1411.7923 [cs.CV]
  • Yoshida H, Smith LB. 2008. What’s in view for toddlers? Using a head camera to study visual experience. Infancy 13(3):229–48
  • Young AW, Burton AM. 2020. Insights from computational models of face recognition: a reply to Blauch, Behrmann and Plaut. Cognition 208:104422
  • Yovel G, Abudarham N. 2020. From concepts to percepts in human and machine face recognition: a reply to Blauch, Behrmann & Plaut. Cognition 208:104424
  • Yovel G, Halsband K, Pelleg M, Farkash N, Gal B, Goshen-Gottstein Y. 2012. Can massive but passive exposure to faces contribute to face recognition abilities? J. Exp. Psychol. Hum. Percept. Perform. 38(2):285–89
  • Yovel G, O’Toole AJ. 2016. Recognizing people in motion. Trends Cogn. Sci. 20(5):383–95
  • Yuan L, Xiao W, Kreiman G, Tay FE, Feng J, Livingstone MS. 2020. Adversarial images for the primate brain. arXiv:2011.05623 [q-bio.NC]
  • Yue X, Cassidy BS, Devaney KJ, Holt DJ, Tootell RB. 2010. Lower-level stimulus features strongly influence responses in the fusiform face area. Cereb. Cortex 21(1):35–47

  • NEWS FEATURE
  • 18 November 2020

The ethical questions that haunt facial-recognition research

  • Richard Van Noorden


A collage of images from the MegaFace data set, which scraped online photos. Images are obscured to protect people’s privacy. Credit: Adam Harvey/megapixels.cc, based on the MegaFace data set by Ira Kemelmacher-Shlizerman et al., based on the Yahoo Flickr Creative Commons 100 Million data set, and licensed under Creative Commons Attribution (CC BY) licences.

In September 2019, four researchers wrote to the publisher Wiley to “respectfully ask” that it immediately retract a scientific paper. The study, published in 2018, had trained algorithms to distinguish faces of Uyghur people, a predominantly Muslim minority ethnic group in China, from those of Korean and Tibetan ethnicity [1].


Nature 587, 354–358 (2020)

doi: https://doi.org/10.1038/d41586-020-03187-3

1. Wang, C., Zhang, Q., Liu, W., Liu, Y. & Miao, L. Wiley Interdiscip. Rev. Data Min. Knowl. Discov. 9, e1278 (2019).
2. Stewart, R., Andriluka, M. & Ng, A. Y. in Proc. 2016 IEEE Conf. on Computer Vision and Pattern Recognition 2325–2333 (IEEE, 2016).
3. Ristani, E., Solera, F., Zou, R. S., Cucchiara, R. & Tomasi, C. Preprint at https://arxiv.org/abs/1609.01775 (2016).
4. Nech, A. & Kemelmacher-Shlizerman, I. in Proc. 2017 IEEE Conf. on Computer Vision and Pattern Recognition 3406–3415 (IEEE, 2017).
5. Guo, Y., Zhang, L., Hu, Y., He, X. & Gao, J. in Computer Vision — ECCV 2016 (eds Leibe, B., Matas, J., Sebe, N. & Welling, M.) https://doi.org/10.1007/978-3-319-46487-9_6 (Springer, 2016).
6. Jasserand, C. in Data Protection and Privacy: The Internet of Bodies (eds Leenes, R., van Brakel, R., Gutwirth, S. & de Hert, P.) Ch. 7 (Hart, 2018).
7. Moreau, Y. Nature 576, 36–38 (2019).
8. Zhang, D. et al. Int. J. Legal Med. https://doi.org/10.1007/s00414-019-02049-6 (2019).
9. Pan, X. et al. Int. J. Legal Med. 134, 2079 (2020).
10. Wu, X. & Zhang, X. Preprint at https://arxiv.org/abs/1611.04135 (2016).
11. Hashemi, M. & Hall, M. J. Big Data 7, 2 (2020).





ORIGINAL RESEARCH article

Research on Face Recognition and Privacy in China—Based on Social Cognition and Cultural Psychology

Tao Liu*

  • Department of Sociology, Hangzhou Dianzi University, Hangzhou, China

With the development of big data technology, the privacy concerns raised by face recognition have become a critical social issue in the era of information sharing. Based on perceived ease of use, perceived usefulness, social cognition, and cross-cultural aspects, this study analyses the privacy of face recognition and its influencing factors. The study collected 518 questionnaires through the Internet; SPSS 25.0 was used to analyze the questionnaire data and evaluate their reliability, with Cronbach’s alpha (α coefficient) as the reliability measure. Our findings demonstrate that when users perceive the risk of their private information being disclosed through face recognition, they have greater privacy concerns. However, most users will still choose to provide personal information in exchange for the services and applications they need. Trust in technology and platforms can reduce users’ inclination to guard against them. Users believe that face recognition platforms can create secure conditions for the use of face recognition technology, and thus exhibit a higher tendency to use such technology. Although perceived ease of use has no significant positive impact on the actual use of face recognition, due to external factors such as accuracy and technology maturity, perceived usefulness still has a significant positive impact on actual use. These results enrich the literature on the application behavior of face recognition and play an important role in helping individuals make better use of face recognition in ways that facilitate daily life without disclosing private personal information.

Introduction

Face recognition is a biometric recognition technology that uses pattern matching to recognize individual identities based on facial feature data. Compared to traditional non-biometric recognition and other physiological-feature recognition technologies, face recognition has specific technical advantages ( Jiang, 2019 ). Nowadays, relying on ubiquitous mobile camera devices, face recognition technology is widely used in fields including face-based attendance, face payment, smart campuses, access control, and security systems, and it has dramatically improved the intelligence of business systems in these fields. The human face is rich in features. In the acquaintance-based societies of the past, the face was the foundation of emotional communication and social relations with others.
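At its core, the pattern-matching step reduces a face image to a numeric feature vector (an embedding) and compares vectors. The sketch below illustrates the idea only; the 128-dimensional vectors, the threshold value, and the random arrays standing in for a real feature extractor are hypothetical placeholders, not any specific system's pipeline.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two face feature vectors (embeddings)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def same_person(probe: np.ndarray, gallery: np.ndarray,
                threshold: float = 0.6) -> bool:
    """Declare a match when similarity clears a threshold.

    In a real system the threshold is calibrated on validation data to
    trade false accepts against false rejects.
    """
    return cosine_similarity(probe, gallery) >= threshold

# Toy usage: random vectors stand in for embeddings from a face model.
rng = np.random.default_rng(0)
print(same_person(rng.normal(size=128), rng.normal(size=128)))
```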

Technology has been one of the most important factors changing the way of life and commercial activities of human society. With continuous innovation and the development of technology, human society is changing rapidly. Technological innovation has changed people’s lifestyles in spheres such as shopping, education, medical services, and business organization. “Technology is not only an essential tool for finding out new ways to join different actors in service innovation processes, but also as an element able to foster the emergence of new and ongoing innovations” ( Ciasullo et al., 2017 ). For example, in the healthcare service ecosystem, health care providers adapt to an innovative medical service ecosystem so that patients can obtain better medical services; medical service innovation has had a great impact on the continuous reconstruction of that ecosystem ( Ciasullo et al., 2017 ). Technology forces the market to change constantly, and the changing market leads business organizations to innovate. “The contemporary world is characterized by a fast changing environment. Business organizations are faced with the challenge of keeping pace with developments in the field of technology, markets, cultural and socio-economic structures” ( Kaur et al., 2019 ). In the era of big data and information, business organizations must “explore how cognitive computing technology can act as potential enabler of knowledge integration-based collaborations with global strategic partnerships as a special case” ( Kaur et al., 2019 ).

At present, innovations in network technology provide great convenience and advantages for organizations that work with such networks. “Small and medium-sized enterprises (SMEs) have been considered the most innovative oriented businesses in developed countries even in emerging markets acting as pioneer in the digital transformational word.” Meanwhile, exploring SMEs’ competitiveness is important for technology upgrading, knowledge spillover, and technology transfer ( Del Giudice et al., 2019 ).

Knowledge and technology transfer is a “pathway” for accelerating the growth and advancement of an economic system, and it can be explored from theory to practice. From the users’ perspective, technology transfer affects their sense of use and experience ( Elias et al., 2017 ). Big data analytics capabilities (BDAC) represent critical tools for business competitiveness in highly dynamic markets: BDAC has both direct and indirect positive effects on business model innovation (BMI), and it influences strategic company logics and objectives ( Ciampi et al., 2021 ). “In the world of Big Data, innovation, technology transfer, collaborative approaches, and the contribution of human resources have a direct impact on a company’s economic performance.” Big data companies should therefore make corresponding changes in management and strategy. Moreover, skilled human resources make a positive contribution to a company’s economic performance. “Information and knowledge are the foundation on which act for aligning company’s strategies to market expectations and needs” ( Caputo et al., 2020 ).

With the arrival of the era of artificial intelligence, intelligent social life has become a reality, and artificial intelligence has become a new engine for China’s economic and social development. According to the latest data released by the China Internet Network Information Center, the number of artificial intelligence enterprises in China ranks second in the world ( CNNIC, 2020 ). As a new technology, face recognition—a typical application of artificial intelligence—has risen with the construction of smart cities. According to the statistics presented in the Report on In-depth Market Research and Future Development Trend of China’s Face Recognition Industry (2018–2024) released by the Intelligence Research Group, the face recognition industry in China is estimated to reach 5.316 billion yuan by 2021 ( Biometrics Identity Standardization [BIS], 2020 ). As the gateway connecting humans and intelligence, face recognition has excellent development potential.

Given that the modern era emphasizes appearance, the face remains socially functional, but technology has given it new meaning and a new mission. The attributes and features of a facial image are enough to convey a person’s identity. When our face is tied to our personal information and even used as a password substitute, it is no longer the traditional concept of a face. Face recognition technology can extract personally identifiable information, such as age, gender, and race, from images. To some extent, in the Internet age, almost everyone’s personal information is exposed without any protection.

With the technical support of big data, user portraits based on facial recognition and a variety of personal data have increasingly become a form of identification for individuals ( Guo, 2020 ). From face-swapping apps, face recognition entry at Hangzhou Safari Park, and the application of face recognition in subway security checks to the formulation of the Personal Information Protection Law of the People’s Republic of China (PRC), a series of public controversies has brought face recognition to the forefront. Meanwhile, Internet privacy, long neglected, is increasingly taken seriously by the public.

The issues of face recognition and privacy have been studied extensively by experts and scholars in their respective fields, but there are few empirical studies on the combination of the use of face recognition and personal privacy security. At present, most scholars’ research on face recognition focuses on face recognition algorithms, recognition systems, legal supervision and security, users’ willingness to accept face payments, and the application of face recognition in the library. No quantitative research has been conducted on the relationship between the use of face recognition technology and people’s attitudes toward privacy issues. Therefore, based on the two main determinants of the technology acceptance model (TAM) and according to public attitudes toward privacy and the specific context of the use of face recognition in the current networked environment, variables such as privacy concerns, risk perception, and trust are introduced in this study to build the hypothesis model of the actual use of face recognition. The concept of privacy concerns is applied to the research on personal information security behavior of facial recognition users, which further expands the practical scope of the privacy theory and provides suggestions to promote the development of facial recognition applications.

This research makes two contributions. First, it demonstrates the impact of privacy concerns, perceived risk, trust, social cognition, and cross-cultural aspects on facial recognition. This result enriches face recognition literature, and a hypothesis model based on perceived ease of use and perceived usefulness—the two determinants of user behavior—is created. Second, this research confirms that the privacy paradox still exists. In the digital information age, most users will still choose to provide personal information in exchange for the services and applications they need. Trust, social cognition, and culture play a vital role in intelligent societies and virtual interactions. Meanwhile, when technology applications can provide users with diversified and user-friendly functions, their perceived usefulness is significantly improved.

The structure of the article is as follows. In section “Theoretical Basis and Research Hypothesis,” we examine the theoretical basis and research hypothesis. Section “Variable Measurement and Data Collection” describes variable measurement and data collection, including questionnaire design and data collection. Section “Data Analysis” presents the results of the data analysis. Section “Conclusion” discusses the key findings of the research along with the final remarks.

Theoretical Basis and Research Hypothesis

In the era of mobile data services based on big data, “the nature of economic exchange is more inclined to exchange personal information for personalized services. Privacy violations may occur in the acquisition, storage, use and transaction of personal information, thus giving rise to problems in information privacy” ( Chen and Cliquet, 2020 ). Moreover, in the Internet environment, information privacy security in the intelligent society is increasingly threatened. Since facial recognition is based on the acquisition of facial image information, and face information demands privacy, face information security becomes the public’s focus when choosing whether to use facial recognition technology. On the one hand, human faces are rich in features that provide powerful biometrics for identifying individuals; a third party can thus identify individuals through face positioning, so it is necessary to prevent the malicious collection and abuse of such information. On the other hand, through image storage and feature extraction, a variety of demographic and private information can be obtained, such as age, health status, and even family relationships, which can lead to unnecessary privacy invasion ( Zahid and Ajita, 2017 ). Therefore, in view of the uniqueness of the human face and of information privacy, the focus of this paper is whether the public’s actual use of face recognition is affected by their attitudes toward personal privacy and the perceived risk to personal data.

Privacy Concerns

Privacy concerns are widely used to explain the behavior intention of users ( Zhang and Li, 2018 ). In the Internet field, privacy concerns of users include people’s perceptions and concerns about improper access, illegal acquisition, illegal analysis, illegal monitoring, illegal transmission, illegal storage, and illegal use of private information ( Wang et al., 1998 ). Users do not have full control over the use of their personal information. Thus, users become concerned about privacy when it may be violated due to security loopholes or inappropriate use or when individuals perceive the risk of privacy infringement.

Personal privacy in the age of mobile data services involves both online and offline domains. The extensive use of applications based on personal biological information poses new challenges to personal privacy security. Specifically, with the progress of computer algorithms, the Internet of Things, and other technologies, the threshold for information collection becomes ever lower, and computerized information may be easily copied and shared, resulting in problems such as secondary data mining and inadequate privacy ( Qi and Li, 2018 ). In the existing research on privacy concerns, Cha found a negative correlation between users’ concerns about the information privacy of a technology-driven platform and how frequently they use the medium ( Cha, 2010 ). McKnight and colleagues, in research on Facebook, found that the greater the privacy concern about a medium, the less willing people are to continue using it for fear of personal information being abused ( McKnight et al., 2011 ). In the context of big data, the privacy concerns of face recognition users stem from the risk of facial image information being collected and used without personal knowledge or consent, or of personal biometrics being transmitted or leaked. In other words, the cautious choice of face recognition applications is influenced by the extent of individuals’ privacy concerns. Considering these notions, the following hypothesis is proposed:

Hypothesis 1: Privacy concerns have a negative impact on the actual use of face recognition.

Perceived Risk

Given the virtual and uncertain nature of networks, perceived risk is an individual’s perception of the risk of an information breach. The perceived risk of facial recognition may arise from the disclosure or improper use of face information. In an empirical study, Chen found that the degree of individuals’ concern for information security is affected by perceived network risk ( Chen, 2013 ). Norberg et al. (2007) showed that the negative effect of perceived disclosure is affected by perceived risk. In other words, the more users perceive that the disclosure of personal information will lead to the illegal breach of privacy and other adverse effects, the more they will be concerned about the security of their personal privacy. Not only is the degree of privacy concern positively affected by perceived risk; studies have also shown that perceived risk affects actual use behavior ( Zhang and Li, 2018 ). Hichang’s (2010) results show that the severity of privacy risks perceived by users is positively correlated with their self-protection behaviors: when people realize that their personal information is at risk, they take active preventive action. Therefore, regarding the intention to use facial recognition, this paper holds that the higher the risk users perceive, the more attention they will pay to breaches of personal privacy, thus affecting their actual use of facial recognition. In this vein, the following hypotheses are proposed:

Hypothesis 2: Perceived risk has a positive effect on privacy concerns.

Hypothesis 3: Perceived risk has a negative influence on the actual use of face recognition.

Trust Theory

Simmel (2002) pioneered the sociological study of trust, believing that trust is an essential, comprehensive social force. Putnam (2001) believed that trust is essential social capital that can improve social efficiency through actions that promote coordination and communication. In an intelligent social environment, social transactions cannot occur without trust; hence, trust has become an essential factor in the study of privacy issues. In the context of face recognition, trust is defined as users’ belief in the ability of face recognition technology and application platforms to protect their personal information. Joinson et al. (2010) found in their study that users’ perceived risk to personal privacy is affected by their degree of trust. Moreover, through research on the behavioral intention to use intelligent media, some scholars report that trust directly affects use intention and that there is a significant correlation between trust and users’ use intention. Therefore, the following hypotheses are proposed:

Hypothesis 4: Trust negatively affects the perceived risk of users with face recognition.

Hypothesis 5: Trust positively affects the actual use of face recognition.

Technology Acceptance Model

The TAM is widely used to explain users’ acceptance of new technologies and products, and it is the most influential and commonly used theory for describing individuals’ degree of acceptance of information systems ( Lee et al., 2003 ). The TAM has been used for research in many fields: education ( Scherer et al., 2019 ), hospitals and healthcare ( Nasir and Yurder, 2015 ; Fletcher-Brown et al., 2020 ; Hsieh and Lai, 2020 ; Papa et al., 2020 ), sports and fitness ( Lunney et al., 2016 ; Lee and Lee, 2018 ; Reyes-Mercado, 2018 ), fashion ( Turhan, 2013 ; Chuah et al., 2016 ), consumer behavior ( Wang and Sun, 2016 ; Yang et al., 2016 ; Kalantari and Rauschnabel, 2018 ), gender and knowledge sharing ( Nguyen and Malik, 2021 ), wearable devices ( Magni et al., 2021 ), human resource management ( Del Giudice et al., 2021 ), the Internet of Things ( Caputo et al., 2018 ), and the influence of technophobia and emotional intelligence on technology acceptance ( Khasawneh, 2018 ).

In this study, a hypothesis model is developed based on perceived ease of use and perceived usefulness, two determinants of user behavior.

Perceived usefulness refers to the extent to which users believe that using a specific system will improve their job performance. Perceived ease of use refers to how easy users think a particular system is to use, which also affects their perceived usefulness of the technology ( Davis, 1989 ). The easier face recognition is to use, the more useful it is considered to be. For the purposes of this study, face recognition aims to realize multiple functions, such as providing efficient and convenient services; the definition of perceived usefulness is therefore extended to the degree to which users think face recognition improves convenience and service. In this paper, the ease of using a face recognition application refers to users’ perceived ease of use of the technology. In an early empirical study of an e-mail system, Davis (1989) concluded that perceived ease of use has a positive impact on the use of applications. In a study of the adoption and use of information systems in the workplace, Venkatesh and Davis (2000) demonstrated that perceived usefulness has a positive impact on usage behavior. With the extensive application of the TAM to information systems, the face recognition technology studied in this paper also counts as intelligent media, and perceived usefulness is an important variable affecting its use. Thus, the following hypotheses are proposed:

Hypothesis 6: Perceived ease of use has a positive impact on perceived usefulness.

Hypothesis 7: Perceived ease of use has a positive influence on the actual use of face recognition.

Hypothesis 8: Perceived usefulness has a positive impact on the actual use of face recognition.

The research model of this paper is shown in Figure 1 .


Figure 1. Structural equation model.

Variable Measurement and Data Collection

Questionnaire Design

To ensure the scientific rigor and credibility of the measurement variables, this study adapted mature scales from previous studies, combined them with current users’ concerns about the use of face recognition, and developed a questionnaire. The questionnaire consists of two parts. The first part investigates the demographic characteristics of users, such as gender and age. The second part is measured on a five-point Likert scale, with the options “Strongly disagree,” “Disagree,” “Neither agree nor disagree,” “Agree,” and “Strongly agree.” The survey included seven latent variables and 21 measured variables; the latent variables included perceived ease of use, perceived usefulness, privacy concerns, risk perception, trust, and actual use. The contents of the scale are shown in Table 1 (a minimal coding sketch follows the table).


Table 1. Design of measurement items for variables studied.
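Before any of the statistics below can be computed, the Likert labels are coded as an ordinal 1–5 scale. A minimal pandas sketch, with hypothetical item names:

```python
import pandas as pd

LIKERT = {"Strongly disagree": 1, "Disagree": 2,
          "Neither agree nor disagree": 3, "Agree": 4, "Strongly agree": 5}

# Hypothetical raw answers for one perceived-usefulness item.
raw = pd.DataFrame({"PU1": ["Agree", "Strongly agree", "Disagree"]})
coded = raw.replace(LIKERT)   # each label becomes an integer score 1-5
print(coded)
```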

Data Collection

In this study, the questionnaire was designed on an online survey platform 1 and distributed as links through WeChat, QQ, and other channels. The survey was conducted from May 26 to June 10, 2020, and a total of 635 questionnaires were collected. The respondents were users of face recognition technology. After screening out incomplete questionnaires and those with identical answers to every item, 518 valid questionnaires remained. The specific statistics are shown in Table 2 .


Table 2. Statistical analysis of demographic characteristics ( N = 518).

From the reported statistics, it can be seen that the gender ratio of the sample is balanced. The respondents are mainly between 18 and 35 years old, so the sample is young overall, conforming to the age characteristics of the main user group of facial recognition. The respondents mostly have a high level of education, with a bachelor’s degree or above. In terms of urban distribution, 58.7% of respondents came from first-tier and new first-tier cities. The sample coverage is reasonable and thus representative. As for privacy, 86.1% of respondents believe that face information is private. Consequently, the sample data collected in this questionnaire are suitable for research on the privacy problems of face recognition users.

Data Analysis

Reliability and Validity Analysis

For this study, SPSS 25.0 was used to analyze the collected data and evaluate their reliability, with Cronbach’s alpha (α coefficient) as the reliability measure. It is generally believed that a scale is reliable when Cronbach’s α exceeds the critical value of 0.7. Based on the test results, the Cronbach’s α coefficients of privacy concerns, perceived risk, perceived ease of use, perceived usefulness, trust, and actual use lie between 0.876 and 0.907, all greater than 0.7. This indicates that the measurement of each latent variable shows excellent internal consistency and that the questionnaire is reliable as a whole.
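Cronbach’s α for a scale of k items is α = k/(k − 1) · (1 − Σσᵢ²/σ_total²), where σᵢ² are the item variances and σ_total² is the variance of the summed scale. A small NumPy sketch of the computation the authors obtain from SPSS:

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """items: (n_respondents, k_items) matrix of coded Likert scores."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)        # per-item variances
    total_var = items.sum(axis=1).var(ddof=1)    # variance of scale totals
    return k / (k - 1) * (1 - item_vars.sum() / total_var)

# Toy example with four respondents answering a three-item scale.
scores = np.array([[4, 5, 4], [2, 2, 3], [5, 5, 5], [3, 4, 3]])
print(round(cronbach_alpha(scores), 3))
```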

Structural validity refers to the correspondence between measurement dimensions and measurement items and is often used to analyze questionnaire items. According to the results of confirmatory factor analysis in AMOS 24.0, the fit index χ²/df = 2.722, which is less than 3, indicating a good fit. RMSEA = 0.058, which is less than 0.08, indicating that the model is acceptable. It is generally believed that NFI, IFI, and CFI values greater than 0.9 indicate a well-fitting model; here, NFI = 0.938, RFI = 0.925, IFI = 0.960, TLI = 0.951, and CFI = 0.959. Therefore, the fit indices of this model conform to the common standards, and the model fit is adequate.
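The same kind of confirmatory factor analysis can be sketched with the open-source semopy package in Python, assuming its lavaan-style `Model`/`calc_stats` interface; the item names and the CSV file are hypothetical placeholders for the authors’ 21 coded items:

```python
import pandas as pd
import semopy

# Measurement model only: each latent variable loads on its items.
desc = """
PrivacyConcern =~ PC1 + PC2 + PC3
PerceivedRisk  =~ PR1 + PR2 + PR3
Trust          =~ TR1 + TR2 + TR3
"""

data = pd.read_csv("survey_coded.csv")    # hypothetical coded responses
cfa = semopy.Model(desc)
cfa.fit(data)
print(semopy.calc_stats(cfa).T)           # chi2, RMSEA, CFI, NFI, TLI, ...
```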

Exploratory factor analysis is utilized to determine whether each measurement item converges to the corresponding factor, and the number of selected factors is determined by the number of factors whose eigenvalue exceeds 1. If the value of factor loading is greater than 0.6, it is generally considered that each latent variable corresponds to a representative subject ( Gerbing and Anderson, 1988 ; Gefen and Straub, 2005 ).
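The eigenvalue-greater-than-one rule used to choose the number of factors can be checked directly from the item correlation matrix; a NumPy sketch:

```python
import numpy as np

def n_factors_kaiser(items: np.ndarray) -> int:
    """Kaiser criterion: count eigenvalues of the item correlation matrix
    that exceed 1; items is (n_respondents, n_items)."""
    corr = np.corrcoef(items, rowvar=False)
    return int((np.linalg.eigvalsh(corr) > 1.0).sum())
```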

As shown in Table 3 , the values of factor loading of the latent variables, including privacy concerns, perceived risk, perceived ease of use, perceived usefulness, trust, and actual use, were all greater than 0.7, which shows that the corresponding topic of latent variables is highly representative.


Table 3. Factor load and variable combination reliability.

Combined reliability (CR) and average variance extracted (AVE) were used for the convergent validity analysis. Generally, the recommended threshold for CR is 0.8 or higher ( Werts et al., 1974 ; Nunnally and Bernstein, 1994 ), and AVE is recommended to be above 0.5 ( Fornell and Larcker, 1981 ). As shown in Table 3 , the AVE of each latent variable was greater than 0.5 and the CR greater than 0.8, indicating ideal convergent validity.
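With standardized loadings λᵢ, CR = (Σλᵢ)² / ((Σλᵢ)² + Σ(1 − λᵢ²)) and AVE = Σλᵢ²/k. A sketch with hypothetical loadings:

```python
import numpy as np

def composite_reliability(loadings: np.ndarray) -> float:
    lam_sum_sq = loadings.sum() ** 2
    error_var = (1 - loadings ** 2).sum()   # standardized error variances
    return lam_sum_sq / (lam_sum_sq + error_var)

def average_variance_extracted(loadings: np.ndarray) -> float:
    return float((loadings ** 2).mean())

lam = np.array([0.81, 0.78, 0.84])          # hypothetical loadings
print(composite_reliability(lam), average_variance_extracted(lam))
```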

According to the results in Table 4 , there was a significant correlation between actual use and privacy concerns, perceived risk, perceived ease of use, perceived usefulness, and trust ( p < 0.001). In addition, the absolute value of each correlation coefficient was less than 0.5 and less than the corresponding square root of the AVE. This indicates that the latent variables were correlated yet distinct, so the scale has an ideal level of discriminant validity.


Table 4. Correlation coefficient and AVE square root.

Correlation Analysis

Correlation analysis studies whether variables are related and uses the correlation coefficient to measure how closely they are related. The three common correlation coefficients are the Pearson, Spearman, and Kendall coefficients, of which the Pearson coefficient is most often used in questionnaire and scale studies ( Qi and Li, 2018 ). In this study, SPSS 25.0 and Pearson’s correlation analysis were used to test whether there are significant correlations between privacy concerns, perceived risk, perceived ease of use, perceived usefulness, trust, and actual use in the hypothesized model, in order to validate the research hypotheses.
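The Pearson coefficient is r = cov(x, y)/(σₓσᵧ); SciPy reports it together with a significance test. A sketch with simulated scale scores (the variable names are hypothetical):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
trust = rng.normal(size=518)                  # simulated scale means
actual_use = 0.6 * trust + rng.normal(size=518)

r, p = stats.pearsonr(trust, actual_use)
print(f"r = {r:.3f}, p = {p:.4f}")            # a strong positive correlation
```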

Table 5 shows the means and standard deviations of privacy concerns, perceived risk, perceived ease of use, perceived usefulness, trust, and actual use, as well as the Pearson correlation coefficients between the variables. From the means, users had a higher perceived risk and a lower degree of trust. The correlation coefficient matrix showed that perceived risk and privacy concerns are significantly and positively correlated, so H2 was initially verified; privacy concerns, perceived risk, and actual use were negatively correlated ( r = –0.158, p < 0.01), though weakly, preliminarily supporting H1 and H3. There was a positive correlation between perceived ease of use, perceived usefulness, and actual use ( p < 0.01). Among these, perceived ease of use had a weak correlation with actual use ( r = 0.292) and perceived usefulness a moderate one ( r = 0.494); thus, H6, H7, and H8 were preliminarily verified. There was a significantly strong correlation between trust and actual use ( p < 0.01, r = 0.608), so H5 was preliminarily verified. In addition, trust was negatively correlated with perceived risk, so H4 was preliminarily verified.


Table 5. Correlation coefficient matrix and mean and standard deviation of variables.

Path Analysis and Hypothesis Testing

The correlation analysis results showed that the variables are related, so these hypotheses were preliminarily supported. Nevertheless, correlation alone cannot adequately explain the systematic relationships between the variables. Thus, AMOS 24.0 and a structural equation model were further employed to explore these relationships, as shown in Figure 2 (a sketch of the corresponding model syntax follows the figure).


Figure 2. Path analysis diagram of the structural equation model.
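In semopy’s lavaan-style syntax, the full hypothesized model adds structural regressions among the latent variables to the measurement model. The specification below is a sketch of the hypothesized paths, not the authors’ AMOS file, and all item names and the CSV file are placeholders:

```python
import pandas as pd
import semopy

desc = """
# Measurement model (item names are placeholders)
PrivacyConcern      =~ PC1 + PC2 + PC3
PerceivedRisk       =~ PR1 + PR2 + PR3
Trust               =~ TR1 + TR2 + TR3
PerceivedEaseOfUse  =~ PEU1 + PEU2 + PEU3
PerceivedUsefulness =~ PU1 + PU2 + PU3
ActualUse           =~ AU1 + AU2 + AU3

# Structural paths corresponding to hypotheses H1-H8
PerceivedRisk ~ Trust
PrivacyConcern ~ PerceivedRisk
PerceivedUsefulness ~ PerceivedEaseOfUse
ActualUse ~ PrivacyConcern + PerceivedRisk + Trust + PerceivedEaseOfUse + PerceivedUsefulness
"""

data = pd.read_csv("survey_coded.csv")   # hypothetical coded responses
sem = semopy.Model(desc)
sem.fit(data)
print(sem.inspect())                     # path estimates and p-values
```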

As can be seen from Table 6 , the ratio of chi-square to the degree of freedom in the structural equation was less than 5, which is within the acceptable range. RFI, CFI, NFI, TLI, IFI, and GFI indexes were all significantly greater than 0.9, and the root mean square error of approximation (RMSEA) was less than 0.08. Thus, it shows that the structural equation model fits well.


Table 6. Fitting of the structural equation model ( N = 518).

According to Table 7 , the hypotheses H2, H4, H5, H6, and H8 were verified, which shows that trust and perceived usefulness both positively influence the actual use intentions of face recognition users and that perceived risk also has a significant positive impact on privacy concerns. This indicates that the higher the public’s awareness of privacy is, the more risks it will perceive and the higher the public’s concerns about privacy will be. However, H1 and H3 were not accepted. From the test results, it can be seen that privacy concerns and perceived risk had a negative influence on the actual use of face recognition, but the influence was not significant. In addition, H7 was not supported, indicating that perceived ease of use had no significant influence on the actual use of face recognition.


Table 7. Results of the hypothesis test.

Hypotheses H1, H3, and H7 were not supported for the following reasons:

1. H1 and H3 were not supported: perceived risk and privacy concerns had no significant adverse effect on the actual use of face recognition. This shows that the public chooses to use face recognition despite their concerns about, and perception of, privacy risk. Some scholars have called this contradictory phenomenon the privacy paradox ( Xue et al., 2016 ). In other words, although users worry that face recognition may lead to improper use or disclosure of personal information, they still choose to use it in the mobile network field. One important reason is that facial recognition, as an intelligent media technology, is becoming increasingly prevalent in all aspects of daily life. Especially in public services, reliance on digital platforms has improved effectiveness and efficiency via face scanning.

2. H7 was not supported: the positive influence of perceived ease of use on the actual use of face recognition was not significant. This conclusion is not consistent with previous research, although it does, to some extent, bear on the relationship between perceived ease of use and information system use. Because ease of use involves self-efficacy cognition, technology anxiety can make users perceive a system as difficult to operate and lower their evaluation of its ease of use, further affecting their use of face recognition technology ( Bhattacherjee, 2001 ). Affected by external factors such as lighting and image clarity, face recognition technology is not yet fully mature and its algorithms are not always accurate, which reduces the public’s perceived ease of use. This also suggests that, for face recognition technology, perceived usefulness has a more substantial impact on actual use and that users value the functional benefits of face recognition applications.

Robustness Test of the Model

In this paper, gender, age, educational background, and city of the respondents were introduced into the model as control variables to test the robustness of the hypothesis model. The test results are shown in the figure below.

It can be seen from Figure 3 that, despite the introduction of control variables such as gender, age, education background, and city, the relationships and significance levels of the model factors were consistent with the hypothesis test results above. Meanwhile, the influence of each control variable on the actual use of face recognition was not significant, indicating that the model passed the robustness test (a sketch of this check follows the figure).


Figure 3. Robustness test.
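In the sketch above, such a robustness check amounts to adding the observed demographics as extra predictors of actual use (the variable names are again hypothetical):

```python
import semopy

# Reuses `desc` and `data` from the structural-model sketch above and adds
# the demographic controls as observed exogenous predictors of actual use.
controls = "\nActualUse ~ gender + age + education + city_tier\n"
robust = semopy.Model(desc + controls)
robust.fit(data)
print(robust.inspect())   # control paths should remain non-significant
```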

Conclusion

In this study, taking users of face recognition as the research objects, the TAM was integrated, and variables such as privacy concerns, perceived risk, and trust were added to the model to analyze how they affect the actual use of face recognition and to explain the determinants of the public’s use of facial recognition. The results showed that the model fits well and that most of the hypotheses were supported.

Based on the results of the model analysis, this paper draws the following conclusions:

1. In the context of big data, the concept of information privacy has been continuously expanded. When users perceive the risk of their private information being disclosed through face recognition, they will have greater privacy concerns. However, although users’ privacy concerns are deep, the privacy paradox still exists. In the digital information age, most users will still choose to provide personal information in exchange for the services and applications they need.

2. Trust plays a vital role in intelligent societies and virtual interactions. In this paper, users’ trust in face recognition applications includes trust in the technology application platforms and trust in the face recognition technology itself. This study shows that trust in the technology and platforms reduces users’ inclination to guard against them. Users believe that face recognition platforms can provide secure conditions for the use of the technology and thus show a higher tendency to use it. Moreover, as users’ trust in face recognition technology improves, their perceived risk of privacy leakage drops significantly. In this regard, in the information age, users’ willingness to disclose personal information arises largely from their trust in face recognition technology and the related platforms.

3. In the context of face recognition as an emerging technique, the TAM still has excellent explanatory power. Although perceived ease of use has no significant positive impact on the actual use of face recognition due to other external factors, such as accuracy and technology maturity, perceived usefulness still has a significantly positive impact on the actual use of face recognition. To an extent, when technology applications can provide users with diversified and user-friendly functions, their perceived usefulness will be significantly improved.

4. The final consideration concerns government oversight of use and the technical ethics of enterprises. When developing face recognition, enterprises must pay attention to technical ethics as well as privacy, to safeguard personal privacy and protect against biometric information leakage. The government must also strengthen its management of face recognition technology at scale, to prevent enterprises and individuals from using the technology in ways that harm social security and personal privacy.

Limitations

There are some limitations to this study. First, the sample data in the model are mostly from a young group. In future research, survey data of other age groups can be explored to discuss whether the privacy concerns of users of different age groups will affect their use of facial recognition. Second, this study focuses on the influence of privacy concerns, perceived risk, perceived ease of use, perceived usefulness, and trust on the actual use of face recognition but has not assessed whether other factors, such as user experience and usage habits, affect the actual use of face recognition. In addition, this study only analyzes the direct impact of the research variables on the actual use but fails to account for the impact of the mediating variables or moderator variables.

Future Research Directions

Although this research provides some interesting insights, it has some significant limitations. First, future research should examine different age groups to study the acceptance of face recognition and attention to privacy across ages.

Second, privacy is one of the most critical ethical issues in the era of mobile data services. In the current age dominated by big data, privacy issues have become more prominent due to over-identification, technical flaws, and lagging legal construction. In this information era, the connection characteristics of the Internet pose a particularly unique information privacy threat, and many databases and records have led to the privacy boundary continually expanding. How do we balance technological enabling with privacy protection? What should users do about the privacy paradox? The different social cultures and psychology between China and the West cause people to use face recognition differently.

In terms of the impact of Western culture on face recognition, the culture emphasizes privacy and freedom, and politics and social culture affect the use of face recognition. Errors and discrimination in face recognition algorithms can cause great psychological harm and, coupled with the impact of social culture, lead to social contradictions. For example, MIT testing of the face recognition systems of Microsoft, Facebook, IBM, and other companies found that the error rate for women with darker skin was 35% higher than that for men with lighter skin. The algorithms were thus suspected of gender and racial discrimination. Algorithms are designed by people, and developers may embed their values in them, introducing human bias factors that can lead to social contradictions. Therefore, politics, society, and culture have affected the governance attitude of the West. In terms of social background, religious and ethnic contradictions in Western society have intensified, and ethnic minorities have long been discriminated against. The West is highly sensitive to prejudices caused by differences in religious beliefs, ethnic groups, and gender. Culturally and psychologically, the West attaches great importance to personal privacy and absolute freedom: Europeans regard privacy as dignity, and Americans regard privacy as freedom. These are some of the new problems we should focus on resolving now.

The core element of cognitive science is cognition, also known as information processing, and cognitive science and artificial intelligence are closely linked. The American philosopher J. R. Searle noted that, in the history of cognitive science, computers are key: without digital computers, there would be no cognitive science ( Baumgartner and Payr, 1995 ). This link is particularly important for research at the intersection of face recognition and cognitive science: whether people use face recognition is closely related to their cognition, consciousness, psychology, and culture. The global workspace theory of the psychologist Baars posits that the brain is a modular information-processing device composed of many neurons, with processing distributed across specialized modules that differ in their division of labor and function. At any given time, rapidly changing neuronal activity constructs a virtual space called the global workspace through competition and cooperation between modules, and conscious and unconscious states are generated through competition in this workspace. The generation of consciousness means that all specialized modules in the brain respond to new stimuli at the same time, analyzing and integrating the stimulus information in the global workspace through competition and cooperation until the best match is achieved in the information processing between modules ( Baars, 1988 ). Andrejevic and Volcic (2019) argue that exposing one’s face to the machine serves “efficiency” in this new world situation, creating contradictions with religious and cultural traditions. Face recognition largely depends on the exact meaning given to it by a wide range of actors, such as governments, businesses, and civil society organizations ( Norval and Prasopoulou, 2017 ).

Finally, there is the consideration of face recognition and privacy management. Governments and enterprises should strengthen the management and design of face recognition technology. The technology itself is neutral, and intelligent measures developing from online to offline, as in the case of facial recognition, are aimed at efficient, convenient, and humanized services. Thus, the public must be willing to disclose their personal information to fully experience the benefits of intelligent media. As scholars have declared, “It is the default transaction rule in the data age to give up part of privacy for the fast operation” ( Mao, 2019 ). Therefore, with face recognition technology at a crossroads, we cannot, on the one hand, give up the application of the technology because of privacy concerns; instead, we should rely on smart hardware systems to empower cities and life with innovative technologies.

On the other hand, we cannot abuse facial recognition technology simply because its prospects are bright. Data security is always a crucial factor. Therefore, we believe that face recognition technology must balance security, convenience, and privacy; that research on privacy issues in big data networks must be strengthened, with attention to the data flows behind them; that the technology should be constrained by other evolving technologies; and that the privacy literacy of the public should be cultivated.

Data Availability Statement

The original contributions presented in the study are included in the article/supplementary material; further inquiries can be directed to the corresponding author/s.

Ethics Statement

The studies involving human participants were reviewed and approved by the Secretariat of Academic Committee, Hangzhou Dianzi University. The participants provided their written informed consent to participate in this study.

Author Contributions

TL and BY: conceptualization, software, and formal analysis. TL, SD, and YG: methodology and validation. TL and SD: investigation, resources, and data curation. TL, YG, and BY: writing—original draft preparation and visualization. TL, BY, and SD: writing—review and editing. All authors have read and agreed to the published version of the manuscript.

Funding

We are grateful for the financial support from the Zhejiang Social Science Planning “Zhijiang Youth Project” Academic Research and Exchange Project: Social Science Research in the Era of AI (22ZJQN06YB), the Special Fund of Fundamental Research Funds for Universities Directly Under the Zhejiang Provincial Government (GK199900299012-207), and the Excellent Backbone Teacher Support Program of Hangzhou Dianzi University (YXGGJS).

Conflict of Interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Publisher’s Note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

Acknowledgments

We are highly appreciative of the invaluable comments and advice from the editor and the reviewers.

Footnotes

1. www.wjx.cn

References

Andrejevic, M., and Volcic, Z. (2019). “Smart” cameras and the operational enclosure. Telev. New Media 22, 1–17.


Baars, B. J. (1988). A Cognitive Theory of Consciousness. Cambridge: Cambridge University Press.

Baumgartner, P., and Payr, S. (1995). Speaking Minds: interviews with Twenty Eminent Cognitive Scientists. New Jersey: Princeton University Press. 204.

Bhattacherjee, A. (2001). Understanding information systems continuance: an expectation-confirmation model. Mis Q. 25, 351–370. doi: 10.2307/3250921


Biometrics Identity Standardization [BIS] (2020). 2020 Face Recognition Industry Research Report. Available online at: http://sc37.cesinet.com/view-0852f50939dd442daa42f566c950e336-fe654ac1ec464ae7b780f9fd78553c79.html [Accessed December 25, 2020]

Caputo, F., Mazzoleni, A., Pellicellic, A. C., and Muller, J. (2020). Over the mask of innovation management in the world of Big Data. J. Bus. Res. 119, 330–338. doi: 10.1016/j.jbusres.2019.03.040

Caputo, F., Scuotto, V., Carayannis, E., and Cillo, V. (2018). Intertwining the internet of things and consumers’ behaviour science: future promises for businesses. Technol. Forecast. Soc. Change 136, 277–284. doi: 10.1016/j.techfore.2018.03.019

Cha, J. (2010). Factors affecting the frequency and amount of social networking site use: motivations, perceptions, and privacy concerns. First Monday 15, 12–16. doi: 10.5210/fm.v15i12.2889

Chen, R. (2013). Living a private life in public social networks: an exploration of member self-disclosure. Decis. Supp. Syst. 55, 661–668. doi: 10.1016/j.dss.2012.12.003

Chen, X. Y., and Cliquet, G. (2020). The blocking effect of privacy concerns in the “Quantified Self” movement–a case study of the adoption behavior of smart bracelet users. Enterpr. Econ. 4:109.

Chuah, S. H. W., Rauschnabel, P. A., Krey, N., Nguyen, B., Ramayah, T., and Lade, S. (2016). Wearable technologies: the role of usefulness and visibility in smartwatch adoption. Comp. Hum. Behav. 65, 276–284. doi: 10.1016/j.chb.2016.07.047

Ciampi, F., Demi, S., Magrini, A., Marzi, G., and Papa, A. (2021). Exploring the impact of big data analytics capabilities on business model innovation: the mediating role of entrepreneurial orientation. J. Bus. Res. 123, 1–13. doi: 10.1016/j.jbusres.2020.09.023

Ciasullo, M. V., Cosimato, S., and Pellicano, M. (2017). Service Innovations in the Healthcare Service Ecosystem: a Case Study. Systems 5, 2–19.

CNNIC (2020). The 45th China Statistical Report on Internet Development. Available online at: http://www.gov.cn/xinwen/2020-04/28/content_5506903.htm . [Accessed April 28, 2020]

Davis, F. D. (1989). Perceived usefulness, perceived ease of use, and user acceptance of information technology. Mis. Q. 13, 319–340. doi: 10.2307/249008

Del Giudice, M., Scuotto, V., Garcia-Perez, A., and Messeni Petruzzelli, A. (2019). Shifting wealth II in Chinese economy. The effect of the horizontal technology spillover for SMEs for international growth. Technol. Forecast. Soc. Change 145, 307–316. doi: 10.1016/j.techfore.2018.03.013

Del Giudice, M., Scuotto, V., Orlando, B., and Mustilli, M. (2021). Toward the human-centered approach. A revised model of individual acceptance of AI. Hum. Resour. Manag. Rev. 100856. doi: 10.1016/j.hrmr.2021.100856

Elias, C., Francesco, C., and Del Giudice, M. (2017). “Technology transfer as driver of smart growth: a quadruple/quintuple innovation framework approach,” Proceedings of the 10th Annual Conference of the EuroMed Academy of Business (Cyprus: EuroMed Press) 313–333.

Fletcher-Brown, J., Carter, D., Pereira, V., and Chandwani, R. (2020). Mobile technology to give a resource-based knowledge management advantage to community health nurses in an emerging economies context. J. Knowledge Manag. 25, 525–544. doi: 10.1108/jkm-01-2020-0018

Fornell, C., and Larcker, D. F. (1981). Evaluating structural equation models with unobservable variables and measurement error. J. Mark. Res. 18, 39–50. doi: 10.2307/3151312

Gefen, D., and Straub, D. (2005). A practical guide to factorial validity using PLS-Graph: tutorial and annotated example. Commun. Assoc. Inform. Syst. 16, 91–109.

Gerbing, D. W., and Anderson, J. C. (1988). An updated paradigm for scale development incorporating unidimensionality and its assessment. J. Market. Res. 25, 186–192. doi: 10.1177/002224378802500207

Guo, R. (2020). Face recognition, equal protection and contract society. Ningbo Econ. 02:42.

He, J. P., and Huang, X. X. (2020). The smartphone use and eudaimonic well-being of urban elderly: based on intergenerational support and TAM. J. Int. Commun. 03, 49–73.

Hichang, C. (2010). Determinants of behavioral responses to online privacy: the effects of concern, risk beliefs, self-efficacy, and communication sources on self-protection strategies. J. Inform. Privacy Secur. 1, 3–27. doi: 10.1080/15536548.2010.10855879

Hsieh, P. J., and Lai, H. M. (2020). Exploring people’s intentions to use the health passbook in self-management: an extension of the technology acceptance and health behavior theoretical perspectives in health literacy. Technol. Forecast. Soc. Change 161:120328. doi: 10.1016/j.techfore.2020.120328

Jiang, J. (2019). Infringement risks and control strategies on the application of face recognition technology. Library Inform. 5:59.

Joinson, A. N., Reips, U. D., Buchanan, T., and Schofield, C. B. P. (2010). Privacy, trust, and self-disclosure online. Hum. Comp. Interact. 25, 1–24. doi: 10.1080/07370020903586662

Kalantari, M., and Rauschnabel, P. (2018). “Exploring the Early Adopters of Augmented Reality Smart Glasses: the Case of Microsoft Hololens” in Augmented Reality and Virtual Reality. Ed T. Jung and M. Tom Dieck (Germany: Springer). 229–245. doi: 10.1007/978-3-319-64027-3_16

Kaur, S., Gupta, S., Singh, S. K., and Perano, M. (2019). Organizational ambidexterity through global strategic partnerships: a cognitive computing perspective. Technol. Forecast. Soc. Change 145, 43–54. doi: 10.1016/j.techfore.2019.04.027

Khasawneh, O. Y. (2018). Technophobia without boarders: the influence of technophobia and emotional intelligence on technology acceptance and the moderating influence of organizational climate. Comp. Hum. Behav. 88, 210–218. doi: 10.1016/j.chb.2018.07.007

Lee, S. Y., and Lee, K. (2018). Factors that influence an individual’s intention to adopt a wearable healthcare device: the case of a wearable fitness tracker. Technol. Forecast. Soc. Change 129, 154–163. doi: 10.1016/j.techfore.2018.01.002

Lee, Y., Kozar, K. A., and Larsen, K. R. T. (2003). The technology acceptance model: past, present, and future. Commun. Assoc. Inform. Syst. 12, 752–780.

Liu, W. W. (2013). Research on the Influence of Privacy Concerns on Users’ Intention to Use Mobile Payment. Beijing: Beijing University of Posts and Telecommunications.

Lunney, A., Cunningham, N. R., and Eastin, M. S. (2016). Wearable fitness technology: a structural investigation into acceptance and perceived fitness outcomes. Comp. Hum. Behav. 65, 114–120. doi: 10.1016/j.chb.2016.08.007

Magni, D., Scuotto, V., Pezzi, A., and Del Giudice, M. (2021). Employees’ acceptance of wearable devices: Towards a predictive model. Technol. Forecast. Soc. Change 172:121022. doi: 10.1016/j.techfore.2021.121022

Mao, Y. N. (2019). The first case of face recognition: what is the complaint? Fangyuan Mag. 24, 14–17.

McKnight, D. H., Lankton, N., and Tripp, J. (2011). “Social Networking Information Disclosure and Continuance Intention: a Disconnect” in 2011 44th Hawaii International Conference on System Sciences (HICSS 2011). (United States: IEEE).

Nasir, S., and Yurder, Y. (2015). Consumers’ and physicians’ perceptions about high tech wearable health products. Proc. Soc. Behav. Sci. 195, 1261–1267.

Nguyen, T.-M., and Malik, A. (2021). Employee acceptance of online platforms for knowledge sharing: exploring differences in usage behavior. J. Knowledge Manag. Epub online ahead of print. doi: 10.1108/JKM-06-2021-0420

Norberg, P. A., Horne, D. R., and Horne, D. A. (2007). The Privacy Paradox: personal Information Disclosure Intentions versus Behaviors. J. Consum. Affairs 41, 100–126. doi: 10.1111/j.1745-6606.2006.00070.x

Norval, A., and Prasopoulou, E. (2017). Public faces? A critical exploration of the diffusion of face recognition technologies in online social networks. New Media Soc. 4, 637–654. doi: 10.1177/1461444816688896

Nunnally, J. C., and Bernstein, I. H. (1994). Psychometric Theory. New York: McGraw-Hill.

Papa, A., Mital, M., Pisano, P., and Del Giudice, M. (2020). E-health and wellbeing monitoring using smart healthcare devices: an empirical investigation. Technol. Forecast. Soc. Change 153:119226. doi: 10.1016/j.techfore.2018.02.018

Putnam, R. D. (2001). Making Democracy Work: civic Traditions in Modern Italy (trans. by Wang L & Lai H R). Nanchang: Jiangxi People’s Publishing House. 195.

Qi, K. P., and Li, Z. Z. (2018). A Study on Privacy Concerns of Chinese Public and Its Influencing Factors. Sci. Soc. 2, 36–58.

Reyes-Mercado, P. (2018). Adoption of fitness wearables: insights from Partial Least Squares and Qualitative Comparative Analysis. J. Syst. Inform. Technol. 20, 103–127. doi: 10.1108/jsit-04-2017-0025

Scherer, R., Siddiq, F., and Tondeur, J. (2019). The technology acceptance model (TAM): a meta-analytic structural equation modeling approach to explaining teachers’ adoption of digital technology in education. Comp. Educ. 128, 13–35. doi: 10.1016/j.compedu.2018.09.009

Simmel, G. (2002). Sociology: investigations on the Forms of Sociation (trans. by Lin R Y). Beijing: Huaxia Publishing House. 244–275.

Turhan, G. (2013). An assessment towards the acceptance of wearable technology to consumers in Turkey: the application to smart bra and t-shirt products. J. Textile Inst. 104, 375–395. doi: 10.1080/00405000.2012.736191

Venkatesh, V., and Davis, F. D. (2000). A theoretical extension of the technology acceptance model: four longitudinal field studies. Manag. Sci. 46, 186–204. doi: 10.1287/mnsc.46.2.186.11926

PubMed Abstract | CrossRef Full Text | Google Scholar

Wang, H., Lee, M. K. O., and Wang, C. (1998). Consumer privacy concerns about Internet marketing. Commun. ACM 41, 63–70. doi: 10.1145/272287.272299

Wang, Q., and Sun, X. (2016). Investigating gameplay intention of the elderly using an extended technology acceptance model (ETAM). Technol. Forecast. Soc. Change 107, 59–68. doi: 10.1016/j.techfore.2015.10.024

Werts, C. E., Linn, R. L., and Jöreskog, K. G. (1974). Intraclass reliability estimates: testing structural assumptions. Educ. Psychol. Measur. 34, 25–33. doi: 10.1177/001316447403400104

Xue, K., He, J., and Yu, M. Y. (2016). Research on Influencing Factors of Privacy Paradox in Social Media. Contempor. Commun. 1:5.

Yang, H., Yu, J., Zo, H., and Choi, M. (2016). User acceptance of wearable devices: an extended perspective of perceived value. Elemat. Inform. 33, 256–269.

Yu, J. (2018). Research on the Use Intention of VR Glasses Based on the Technology Acceptance Model. Shenzhen: Shenzhen University.

Zahid, A., and Ajita, R. (2017). A Face in any Form: new Challenges and Opportunities for Face Recognition Technology. IEEE Comp. 50, 80–90. doi: 10.1109/mc.2017.119

Zhang, Q. J., and Gong, H. S. (2018). An Empirical Study on Users Behavioral Intention of Face Identification Mobile Payment. Theor. Pract. Fin. Econom. 5, 109–115.

Zhang, X. J., and Li, Z. Z. (2018). Research on the Influence of Privacy Concern on Smartphone Users’ Behavior Intention in Information Security. Inform. Stud. Theor. Appl. 2, 77–78.


A survey of appearance-based approaches for human gait recognition: techniques, challenges, and future directions

Open access · Published: 15 May 2024

Pınar Güner Şahan, Suhap Şahin & Fidan Kaya Gülağız

Gait recognition has become an important biometric feature for human identification, alongside data such as the face, iris, and fingerprint. The goal of human gait recognition is to identify people based on images of their walking. Artificial intelligence technologies have revolutionized the field of gait recognition by enabling computers to automatically learn and extract intricate patterns. These techniques examine video recordings to determine key features in an individual's gait, and these features are used to identify the person. This paper examines the existing appearance-based gait recognition methods published in recent years. Its primary objective is to provide an informative survey of the state of the art in appearance-based gait recognition techniques, highlighting their applications, strengths, and limitations. Through our analysis, we aim to highlight the significant advances that have been made in this field, draw attention to the challenges that remain, and identify areas for prospective future research and technological advances. Furthermore, we comprehensively examine common datasets used in gait recognition research. By analyzing the latest developments in appearance-based gait recognition, our study aims to be a helpful resource for researchers, providing an extensive overview of current methods and guiding future efforts in this dynamic field.


1 Introduction

Gait recognition is a type of biometric technology that identifies people based on their distinct walking patterns [1]. It evaluates how a person walks by capturing and quantifying numerous gait variables, such as step width, stride length, and foot angle (the angle between the foot and the horizontal) during heel strike and toe-off (pre-swing). These metrics are used to derive a gait signature for each person that can be compared to a database of known signatures to help identify them [2]. The greatest advantage of gait as a biometric feature is that it can be used to identify people at a distance. Furthermore, unlike other biometric features, it does not require the user's cooperation [3]. These advantages make gait useful for video surveillance-based applications. Gait recognition has potential uses in security and surveillance, including the identification of people in crowded public places and the tracking of criminal suspects [4]. It could also have medical uses, such as detecting variations in gait patterns that might point to illnesses or injuries [5]. Despite these advantages, gait recognition performance can be negatively affected by certain factors related to human pose analysis. Human pose analysis in computer vision faces several challenges, including occlusions, changing lighting conditions, and low image quality.

A gait recognition system often includes the following steps [6]:

1. Data collection. To recognize an individual's gait, it is necessary to collect data about their gait patterns. Many techniques, including video recordings, pressure sensors, floor sensors, and motion capture systems, can be used to obtain this data.

2. Feature extraction. To identify an individual's gait, it is necessary to extract features that are unique to their walking pattern, such as stride length, walking speed, and foot angle.

3. Dimension reduction. In general, features extracted from gait data cannot be used for classification directly, because in the feature representation step the dimensionality of the features (the number of features) collected from raw data is higher than the number of samples in the training data. Consequently, a dimension reduction approach is usually applied prior to classification.

4. Classification. To identify the individual based on the gait features extracted in the previous step, classification is performed using a machine learning or deep learning algorithm.

Approaches to the gait recognition problem in computer vision are generally classified into two categories: model-based and appearance-based (model-free) [7]. Model-based gait recognition approaches utilize mathematical models to represent the walking motion of a person; the kinematics of the joint angles are modeled as people walk. Appearance-based gait recognition approaches extract features from the visual appearance of a person's walking pattern, such as body shape and limb movements. In this approach, silhouettes are analyzed from a gait sequence that embeds both appearance and movement information, ensuring that the analysis encompasses the entire body structure, including key joints, without isolating them [8].

Appearance-based methods do not require extra sensors or subject consent because they depend on visual data obtained from security cameras. This makes them useful for real-world applications. Although model-based methods have benefits, such as providing detailed motion information and explicitly modeling skeletal systems, they also have disadvantages, such as resource-intensive processing requirements and inaccurate key point estimation. Consequently, when compared to appearance-based approaches, they exhibit lower performance in recognition tasks [9, 10, 11]. Such reasons have led to appearance-based methods becoming widely researched and established in the field. They have a solid foundation in the existing literature, with many methods and datasets available. Hence, the purpose of this paper is to survey appearance-based gait recognition methods that rely mostly on deep learning. Although many surveys [6, 8, 12, 13, 14, 15] have been conducted on gait recognition, this is, to the best of our knowledge, the first survey based only on recent appearance-based gait recognition studies. By focusing entirely on appearance-based methods, the paper gives a full and extensive evaluation of the many approaches used in gait recognition. This allows a better understanding of the specific strategies that rely only on visual cues from gait patterns. Detailed information about the existing surveys and the number of references and citations from Web of Science is shown in Table 1.

The paper aims to provide an extensive overview of the appearance-based gait recognition methods. The paper summarizes the important methods and models used in this area, allowing readers to get a deep understanding of some of the most recent advances. The main contributions of this survey are as follows:

The survey provides a comprehensive and systematic examination of appearance-based gait recognition methods. It analyzes the current literature and provides a comprehensive assessment of the state-of-the-art in this field of gait recognition.

The survey evaluates the performance of gait recognition techniques. This evaluation provides useful insights for researchers in determining the usability of appearance-based gait recognition methods.

The survey provides a thorough examination of various publicly available datasets used in the literature.

The survey highlights challenges in gait recognition. It points researchers toward new and significant directions within this domain by suggesting prospective avenues for future study.

We employed a review methodology in line with these purposes. We first identified potential papers using search engines (e.g., Google Scholar [16]) and online archives (e.g., IEEE Xplore [17], ScienceDirect [18]). Our search string was a combination of keywords such as "gait recognition", "deep learning", "human identification", and "gait dataset". We included only results published after 2018 in order to focus on studies from recent years. We then excluded papers that use model-based gait recognition approaches, do not provide a unique solution, use private datasets for performance assessment, or do not evaluate their performance against the state of the art. Finally, we identified a series of papers that have applied deep learning to gait recognition.

The remainder of this survey is organized as follows. Section 2 introduces the conceptual framework of gait recognition. Gait datasets and evaluation criteria are presented in Sect. 3. Section 4 reviews and compares appearance-based gait recognition approaches published in recent years. Section 5 discusses some challenges and future trends in gait recognition. Section 6 concludes the paper.

2 Gait recognition

To help establish a general structure for understanding the gait recognition approaches discussed in the following sections, we present the conceptual framework of gait recognition (Fig. 1). It includes obtaining different types of input data, feature extraction and representation, dimension reduction, and classification. Deep pipelines for gait recognition require fewer steps than traditional pipelines, because the deep learning model can perform feature extraction and classification in a single step (Fig. 2). This can improve efficiency and reduce the likelihood of errors introduced by human-defined feature extraction and selection techniques. Deep pipelines, on the other hand, may need additional data and computational resources for training and evaluation, as well as deep learning skills. In this section, we first describe the data collection processes, which are independent of the methodologies. We then provide an overview of the general framework for gait recognition in both machine learning and deep learning, examining in detail the deep learning techniques employed by the methods reviewed in this article.

Figure 1: Conceptual framework of traditional gait recognition

Figure 2: Deep gait recognition pipeline

2.1 Data collection

The first stage in the gait recognition framework involves collecting data to identify an individual's gait patterns. Gait recognition can be performed using various types of input data, such as RGB images, silhouettes, GEIs (gait energy images), optical flow images, body skeletons, and human meshes acquired by various sensors. In addition, movement data and pressure data from wearable sensors can also be used for gait recognition. However, since the focus of this study is vision-based gait recognition, the gait datasets mentioned here do not include such data. Figure 3 contains examples of different input data types obtained from different gait datasets [19, 20, 21, 22].

Figure 3: Examples of input data types for gait recognition. a RGB image. b Silhouette. c GEI. d Optical flow. e 2D skeleton. f 3D skeleton. g 3D mesh

2.2 Machine learning techniques

2.2.1 Feature extraction and representation

This is the process of extracting the features from the data that are most useful for identifying an individual's gait pattern. Feature extraction requires the ability to describe the distinctive characteristics of individuals and to be robust to changing conditions. As already mentioned, there are two main approaches to gait recognition: (1) model-based and (2) appearance-based. The key distinction between the two approaches lies in how the features are extracted and the type of data used for recognition. Model-based gait recognition extracts features from a physical model of the human body that predicts joint angles and trajectories during walking. In appearance-based gait recognition, features are extracted by considering the entire movement pattern of the walking person's body. It handles occlusion better and contains more invariant features [14]. Feature representation for gait recognition involves transforming raw gait data into a set of features that can be utilized for classification. Appearance-based feature representation methods fall into two groups: statistical methods and spatiotemporal methods. Statistical features include shape (e.g., how high the leg is raised during the walking cycle), motion (e.g., speed), and texture (e.g., variations in clothing and carrying conditions). Spatiotemporal methods gather the motion characteristics and maintain both the spatial aspects (such as shape, distance, and direction) and the temporal aspects (such as duration and occurrence time) of gait video sequences [12]. Human movement is usually represented through both spatial and temporal information.

2.2.2 Dimensionality reduction

The major goal of dimensionality reduction is to reduce the dimensionality of the feature vector that represents the gait patterns. Typically, the feature vector is high-dimensional and comprises a huge number of variables. This makes gait recognition methods computationally costly and time-consuming to compute. Dimensionality reduction aims to address this issue by reducing the dimensionality of the feature vector, while preserving essential information. There are different techniques for dimensionality reduction, such as principal component analysis (PCA) and linear discriminant analysis (LDA). These techniques attempt to transform the high-dimensional feature vector into a lower-dimensional space that still captures the important information. The classification algorithm is then fed the resulting lower-dimensional feature vector.

PCA [23] transforms the feature vector into a set of orthogonal principal components, each of which is a linear combination of the original variables. Most of the information is contained in the first few principal components, which are kept, while the remaining components are discarded.

LDA [ 24 ] seeks to maximize the distance between the means of different classes, while minimizing the variance within each class. It aims to project the feature vector into a lower-dimensional space with the goal of maximizing the separation between the different classes.
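
As a concrete illustration, the following minimal Python sketch applies both techniques to synthetic feature vectors standing in for extracted gait features, using scikit-learn; the data shapes and component counts are illustrative assumptions, not taken from any surveyed method.

```python
# A minimal sketch of PCA and LDA dimensionality reduction for gait
# feature vectors, using synthetic data in place of real gait features.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
n_subjects, seqs_per_subject, n_features = 10, 20, 256
X = rng.normal(size=(n_subjects * seqs_per_subject, n_features))  # gait feature vectors
y = np.repeat(np.arange(n_subjects), seqs_per_subject)            # subject labels

# PCA: unsupervised projection onto orthogonal components ranked by variance.
pca = PCA(n_components=32)
X_pca = pca.fit_transform(X)

# LDA: supervised projection that maximizes between-class separation;
# the output dimensionality is bounded by n_classes - 1.
lda = LinearDiscriminantAnalysis(n_components=n_subjects - 1)
X_lda = lda.fit_transform(X, y)
print(X_pca.shape, X_lda.shape)  # (200, 32) (200, 9)
```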

2.2.3 Classification

In gait recognition, the classification step refers to the process of assigning a label or class to a gait sequence. This stage is critical because it allows the system to recognize and distinguish between different individuals based on their gait patterns. The features selected in the previous steps are used to create a feature vector representing the gait sequence. In this stage, the feature vector is input to a classification algorithm that assigns the gait sequence to a specific class or label. During training, the classifier learns to recognize the patterns in feature vectors associated with particular individuals, and each gait sequence is then assigned the appropriate label.

At this point, it is useful to mention the two modes of biometrics: identification and verification. Person identification involves recognizing an individual from a group of known persons, which can be challenging due to the need to distinguish between highly similar gait patterns. Person verification compares a gait pattern to a single individual's known patterns to confirm or deny their identity. It is less useful in applications where the identity of the individual is unknown or needs to be determined from a large number of possibilities.

In traditional machine learning approaches, the classification stage consists of applying an algorithm that can distinguish between the different classes (i.e., individuals) based on the features extracted from their gait, because the process of extracting features is separated from the classification step. The similarity between features is measured by a vector similarity metric such as Euclidean distance, cosine similarity, Manhattan distance, or dynamic time warping (DTW). Euclidean distance measures the straight-line distance between two points in a multi-dimensional space. Instead of measuring distance, cosine similarity measures the cosine of the angle between two vectors. Manhattan distance sums the absolute differences of their Cartesian coordinates. DTW finds an optimal path that transforms one signal into another [25, 26]. Siamese networks can also be used in gait recognition applications to learn how to differentiate between inputs, effectively learning a similarity metric. A Siamese network can shape the similarity metric to be small for pairs of gait sequences from the same individual and large for pairs from different people [27].
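
As a brief illustration, the sketch below computes the Euclidean, Manhattan, and cosine metrics mentioned above for two hypothetical gait feature vectors; the vectors themselves are invented for the example.

```python
# A minimal sketch of the vector similarity metrics described above,
# applied to two hypothetical gait feature vectors.
import numpy as np

a = np.array([0.2, 1.3, 0.7, 2.1])  # feature vector of a probe sequence
b = np.array([0.1, 1.1, 0.9, 2.0])  # feature vector of a gallery sequence

euclidean = np.linalg.norm(a - b)        # straight-line distance
manhattan = np.abs(a - b).sum()          # sum of absolute coordinate differences
cosine_sim = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))  # angle-based similarity
print(euclidean, manhattan, cosine_sim)
```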

Finally, a label is assigned to each image by a classifier. The algorithm used depends on the type of feature set and the specific requirements of the recognition task (e.g., the complexity of the data). Common algorithms used in this context are given below.

2.2.3.1 Support vector machine (SVM)

The support vector machine (SVM) is a popular supervised machine learning algorithm used for the classification of gait patterns [28]. The basic idea behind SVM in gait recognition is to find a hyperplane that separates the data points representing the gait patterns of different individuals. The hyperplane is selected so as to maximize the margin, which is the distance between the hyperplane and the closest data points from each class. After the SVM model has been trained, it can be used to classify new gait patterns using the extracted features. Depending on which side of the hyperplane a new data point falls, the SVM model predicts the class of the new gait pattern. At this point, it is important to note that SVM is fundamentally a binary classification algorithm, aiming to distinguish between two classes by finding the optimal hyperplane that separates them in the feature space. However, gait recognition often involves identifying individuals from a set of multiple classes, requiring a multiclass classification technique. The one-vs-the-rest strategy is a popular way of adapting SVM for multiclass classification. This involves creating multiple dedicated SVMs, each trained to distinguish between one of the classes and all other classes combined [29]. The authors of [30] cover the use of SVMs for automatic recognition of age-related gait changes. In [31], a gait recognition system based on SVMs and acceleration data is presented.
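
The sketch below illustrates the one-vs-the-rest strategy with scikit-learn's OneVsRestClassifier wrapped around an SVM; the synthetic features and class counts are assumptions made for the example, not data from the cited studies.

```python
# A minimal sketch of one-vs-the-rest SVM classification of gait
# feature vectors, with synthetic features standing in for real ones.
import numpy as np
from sklearn.svm import SVC
from sklearn.multiclass import OneVsRestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_subjects, seqs, dim = 5, 40, 64
# Shift each subject's features so the classes are separable.
X = rng.normal(size=(n_subjects * seqs, dim)) + np.repeat(np.arange(n_subjects), seqs)[:, None]
y = np.repeat(np.arange(n_subjects), seqs)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# One SVM per class, each trained against all remaining classes combined.
clf = OneVsRestClassifier(SVC(kernel="rbf")).fit(X_tr, y_tr)
print("accuracy:", clf.score(X_te, y_te))
```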

2.2.3.2 Hidden Markov model (HMM)

A hidden Markov model (HMM) can be used to represent the temporal properties of gait patterns in gait recognition [32]. The main concept is to describe the sequence of gait features as a number of states, each representing a different gait pattern. The transition probabilities between the states represent the likelihood of transitioning between gait patterns. Given the current state, the observation probabilities describe the likelihood of observing a specific gait feature. A training set of gait data is used to estimate the parameters of an HMM, which include the transition and observation probabilities. The authors of [33] describe a potential approach for identifying people by their gait that involves modeling the dynamic silhouettes of a human body with an HMM. The research in [34] suggests utilizing an HMM to assess gait phases in order to examine a patient's gait for appropriate rehabilitation treatment.
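
A common realization of this idea trains one HMM per subject and assigns a probe sequence to the subject whose model scores it highest. The minimal sketch below assumes the hmmlearn package and uses synthetic per-frame features; it illustrates the general approach, not the specific methods of [33] or [34].

```python
# A minimal sketch of HMM-based gait identification, assuming hmmlearn:
# one GaussianHMM per subject, trained on that subject's sequences; a
# probe is assigned to the model with the highest log-likelihood.
import numpy as np
from hmmlearn import hmm

rng = np.random.default_rng(0)

def make_sequences(offset, n_seqs=5, length=50, dim=4):
    # Synthetic per-frame gait features standing in for real ones.
    return [rng.normal(loc=offset, size=(length, dim)) for _ in range(n_seqs)]

subjects = {s: make_sequences(offset=s) for s in range(3)}
models = {}
for s, seqs in subjects.items():
    X = np.vstack(seqs)                       # stacked frames of all sequences
    lengths = [len(seq) for seq in seqs]      # per-sequence frame counts
    m = hmm.GaussianHMM(n_components=4, covariance_type="diag", n_iter=20)
    models[s] = m.fit(X, lengths)

probe = rng.normal(loc=1, size=(50, 4))       # probe sequence from subject 1
scores = {s: m.score(probe) for s, m in models.items()}
print("predicted subject:", max(scores, key=scores.get))
```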

2.3 Deep learning techniques

The key concept of gait recognition using deep learning is to automatically learn to identify individuals based on their unique gait patterns directly from the data. This ability makes such models robust to variations in the input data for the gait recognition task. The layered architecture of deep learning facilitates the incremental extraction of complex features from unprocessed data, eliminating the need to manually identify important features, a process that often demands specialized expertise. This is especially significant in the context of analyzing gait patterns, where the automated identification of distinguishing features is crucial [35]. Automatic feature extraction in deep learning can include extracting and learning spatial features from individual frames and temporal features across sequences of frames.

In the context of deep learning, dimensionality reduction is crucial for simplifying models, increasing their efficiency, and reducing overfitting. Some prominent dimensionality reduction techniques used in deep learning are described below.

Pooling is often applied to a set of values arranged in a grid-like structure, such as the feature maps produced by a convolutional neural network (CNN) in computer vision applications [36]. The pooling process divides the grid into non-overlapping or overlapping subregions and applies an aggregate function to the values within each subregion to produce a single output value, which summarizes the information contained in that subregion. Max pooling and average pooling are the two most commonly used pooling functions: max pooling takes the maximum value within each subregion, while average pooling takes the average value [36].
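
The following minimal sketch shows both pooling functions applied to a small feature map, using PyTorch as one possible framework; the grid values are arbitrary.

```python
# A minimal sketch of max and average pooling over a feature map.
import torch
import torch.nn as nn

fmap = torch.arange(16, dtype=torch.float32).reshape(1, 1, 4, 4)  # (batch, channels, H, W)

max_pool = nn.MaxPool2d(kernel_size=2)   # keeps the max of each 2x2 subregion
avg_pool = nn.AvgPool2d(kernel_size=2)   # keeps the mean of each 2x2 subregion

print(max_pool(fmap).shape)  # torch.Size([1, 1, 2, 2])
print(avg_pool(fmap))
```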

Autoencoders (AEs) are neural networks designed to learn efficient representations (encodings) of the input data, typically for the purpose of dimensionality reduction. An autoencoder is composed of an encoder that reduces the input dimensions and a decoder that reconstructs the input data from the reduced representation. The middle layer, also known as the code layer, has a lower dimensionality and acts as a reduced representation of the input data [ 37 ].
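
As an illustration, the sketch below defines a small fully connected autoencoder that compresses a flattened 64 × 44 silhouette into a 64-dimensional code; the layer sizes are illustrative assumptions, not drawn from any surveyed method.

```python
# A minimal sketch of an autoencoder compressing a flattened gait
# silhouette (64 x 44 = 2816 values) into a 64-dimensional code.
import torch
import torch.nn as nn

class GaitAutoencoder(nn.Module):
    def __init__(self, in_dim=64 * 44, code_dim=64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(in_dim, 512), nn.ReLU(),
            nn.Linear(512, code_dim),              # the low-dimensional code layer
        )
        self.decoder = nn.Sequential(
            nn.Linear(code_dim, 512), nn.ReLU(),
            nn.Linear(512, in_dim), nn.Sigmoid(),  # reconstruct pixels in [0, 1]
        )

    def forward(self, x):
        code = self.encoder(x)
        return self.decoder(code), code

model = GaitAutoencoder()
x = torch.rand(8, 64 * 44)                 # batch of flattened silhouettes
recon, code = model(x)
loss = nn.functional.mse_loss(recon, x)    # reconstruction objective
print(code.shape, loss.item())
```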

Variational autoencoders (VAEs) are generative models that learn a latent variable model for the input data. They are similar to autoencoders but are intended to produce a probabilistic representation of the input data. Compared to the input space, the latent space learned by VAEs is generally significantly lower dimensionality [ 38 ].

Deep learning models offer an end-to-end learning approach: the raw input is fed into the deep learning model, which then outputs the classification result directly. This streamlines the pipeline while improving the model's ability to learn complex patterns. In the classification stage, the deep learning model uses the learned features to classify the gait data into predetermined classes, with each class representing an individual. This can be done through an activation function (e.g., softmax) in the output layer. The model is trained using a labeled dataset, where each gait sequence is associated with a specific individual. Training involves adjusting the model's weights via backpropagation based on the difference between the predicted and actual labels, minimizing a loss function to improve classification accuracy over time.
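
The sketch below illustrates one such training step in PyTorch: a placeholder classifier produces per-identity scores, and a cross-entropy loss drives a backpropagation update. The model and all sizes are assumptions made for the example.

```python
# A minimal sketch of the end-to-end training step described above:
# softmax-based classification over subject identities, cross-entropy
# loss, and one backpropagation update on a placeholder MLP.
import torch
import torch.nn as nn

n_subjects = 10
model = nn.Sequential(nn.Linear(256, 128), nn.ReLU(), nn.Linear(128, n_subjects))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()          # applies log-softmax internally

features = torch.randn(32, 256)            # batch of gait feature vectors
labels = torch.randint(0, n_subjects, (32,))

logits = model(features)                   # one score per subject identity
loss = criterion(logits, labels)
optimizer.zero_grad()
loss.backward()                            # backpropagation
optimizer.step()                           # weight update minimizing the loss
print(loss.item())
```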

2.3.1 Convolutional neural networks (CNN)

The convolutional neural network (CNN) [39] is a type of neural network commonly used in gait recognition. A CNN consists of many layers of interconnected nodes, such as convolutional layers, pooling layers, and fully connected layers. The convolutional layers are responsible for detecting and extracting features from the input data. The pooling layers then downsample the feature maps created by the convolutional layers, reducing the dimensionality of the data while preserving the most critical information. Finally, the fully connected layers classify the output of the previous layers into different gait patterns or individuals. A CNN can be trained to recognize the unique gait patterns of individuals using a large dataset of labeled walking sequences. During training, the network learns to extract relevant features from the input data and to use them to make accurate predictions about the identity of the individual. Because CNN models are highly effective at learning spatial features, they are frequently trained on image data for gait recognition tasks. In these tasks, the CNN architecture allows the models to maintain the spatial or positional connections among the input data points. In addition, CNNs can be adapted to extract temporal features effectively by employing kernels that move in one direction across the temporal dimension of the data. This approach is typically realized through one-dimensional convolutional neural networks (1D-CNNs), where the convolution operation is applied along the time axis of the input data [40].

Most of the studies analyzed in this survey (see Table 3) exploit these properties of CNNs for gait recognition.
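
To make the layer roles concrete, the following sketch defines a small 2D CNN that maps a single gait silhouette to an identity embedding; the architecture is illustrative and not taken from any surveyed method.

```python
# A minimal sketch of a 2D CNN mapping a gait silhouette to an
# embedding: convolutions extract features, pooling downsamples,
# and a fully connected head produces the final representation.
import torch
import torch.nn as nn

class SilhouetteCNN(nn.Module):
    def __init__(self, embed_dim=128):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                       # 64x44 -> 32x22
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                       # 32x22 -> 16x11
        )
        self.head = nn.Linear(64 * 16 * 11, embed_dim)

    def forward(self, x):                          # x: (batch, 1, 64, 44)
        f = self.features(x)
        return self.head(f.flatten(1))             # frame-level embedding

net = SilhouetteCNN()
print(net(torch.rand(4, 1, 64, 44)).shape)         # torch.Size([4, 128])
```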

2.3.2 Recurrent neural networks (RNN)

Recurrent neural networks (RNNs) perform well at processing sequential data, making them an ideal tool for gait recognition tasks that require evaluating the temporal dynamics of human walking patterns. RNNs are designed to recognize patterns in data sequences by storing previous inputs in their internal state (hidden layers), which is updated when new data points are processed. An RNN layer typically comprises multiple neurons that exhibit recurrent behavior, enabling the layer to accept a sequence of inputs and, in turn, output a sequence [ 41 ]. Their ability to learn from the sequence and duration of movement patterns allows for a detailed classification of distinct gait patterns.

However, traditional RNNs often struggle with the vanishing gradient problem when learning long sequences, making it hard to capture very long-term dependencies [42]. Solutions such as long short-term memory (LSTM) [43] and gated recurrent units (GRUs) [44] have been developed to address this issue. LSTM is a form of RNN designed to capture long-term dependencies in sequence data by using a set of gates to control the flow of information [43]. GRUs are a simplified version of LSTMs that also aim to capture dependencies in sequential data but use a more compact design that merges the forget and input gates into a single update gate, reducing complexity.
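
The sketch below shows the basic pattern of LSTM-based temporal modeling: a sequence of per-frame gait feature vectors is summarized by the final hidden state and then classified by identity. The dimensions are illustrative assumptions.

```python
# A minimal sketch of LSTM-based temporal modeling of gait: per-frame
# feature vectors in, identity scores out.
import torch
import torch.nn as nn

class GaitLSTM(nn.Module):
    def __init__(self, in_dim=128, hidden=64, n_subjects=10):
        super().__init__()
        self.lstm = nn.LSTM(in_dim, hidden, batch_first=True)
        self.classifier = nn.Linear(hidden, n_subjects)

    def forward(self, x):                 # x: (batch, time, in_dim)
        _, (h_n, _) = self.lstm(x)        # h_n: final hidden state per layer
        return self.classifier(h_n[-1])   # classify from the last time step

net = GaitLSTM()
frames = torch.randn(4, 30, 128)          # 30 frames of per-frame features
print(net(frames).shape)                  # torch.Size([4, 10])
```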

2.3.3 Generative adversarial networks (GAN)

Generative adversarial networks (GANs) offer novel approaches to gait recognition, among other applications, and can be used to generate synthetic gait data, improve feature extraction, and enhance the robustness of gait recognition approaches under various conditions. A GAN consists of two neural networks, the generator and the discriminator, which are trained simultaneously through an adversarial process [45]. GANs are especially useful in cross-view gait recognition, where the goal is to recognize individuals from different viewing angles. GANs can be used to produce gait data from unobserved angles, allowing the training of flexible gait recognition models that perform well from multiple perspectives. Applying GANs to gait recognition brings various challenges, including training stability and convergence concerns, which can result in low-quality or unrealistic synthetic data [46].

2.3.4 3D Convolutional neural networks (3D CNN)

3D convolutional neural networks (3D CNNs) extend the capabilities of conventional CNNs by directly processing volumetric data, enabling them to capture both spatial and temporal information. This makes 3D CNNs ideal for video analysis applications such as gait recognition, which require an in-depth understanding of movement dynamics over time. 3D CNNs examine a sequence of frames as a single input, in contrast to 2D CNNs, which process individual frames and may require additional mechanisms to integrate temporal information. This allows them to extract features that capture both the shape and the movement of the subject [47], meaning that 3D CNNs can recognize distinct patterns in the way a person walks by considering several frames together. Despite these benefits, 3D CNNs face several challenges, including the high computational cost of processing 3D data and the requirement for large labeled datasets to adequately train the models [48].
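
The following minimal sketch applies a 3D convolution and 3D pooling to a stack of silhouette frames, showing how the kernel spans both space and time; the shapes are illustrative.

```python
# A minimal sketch of a 3D convolution over a stack of silhouette
# frames, so that the kernel spans both space and time.
import torch
import torch.nn as nn

clip = torch.rand(2, 1, 16, 64, 44)   # (batch, channels, frames, H, W)

conv3d = nn.Conv3d(in_channels=1, out_channels=8, kernel_size=(3, 3, 3), padding=1)
pool3d = nn.MaxPool3d(kernel_size=2)  # halves the temporal and spatial dimensions

out = pool3d(torch.relu(conv3d(clip)))
print(out.shape)                      # torch.Size([2, 8, 8, 32, 22])
```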

2.3.5 Hybrid models

Hybrid models in gait recognition combine the strengths of multiple neural network types to improve the accuracy and robustness of gait recognition systems. Compared to a single model employed on its own, these models are better suited to capturing the complex spatial and temporal features of the human gait. Combining CNNs with RNNs or LSTM networks is a popular strategy: CNNs are used to extract spatial features, such as the shape and posture of a walking person, from individual frames, while RNNs or LSTMs analyze temporal sequences by capturing the dynamics of gait over time [49]. This hybrid strategy integrates the CNN's ability to recognize spatial patterns with the RNN/LSTM's ability to understand temporal associations, resulting in more accurate gait recognition.
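
The sketch below outlines this hybrid pattern: a small CNN embeds each frame, and an LSTM models the sequence of embeddings before classification. The architecture is a generic illustration under assumed sizes, not a reimplementation of [49].

```python
# A minimal sketch of the hybrid CNN + LSTM strategy: a CNN embeds
# each frame, and an LSTM models the sequence of embeddings.
import torch
import torch.nn as nn

class HybridGaitNet(nn.Module):
    def __init__(self, embed_dim=64, hidden=64, n_subjects=10):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(4),
            nn.Flatten(), nn.Linear(16 * 16 * 11, embed_dim),
        )
        self.lstm = nn.LSTM(embed_dim, hidden, batch_first=True)
        self.classifier = nn.Linear(hidden, n_subjects)

    def forward(self, clips):                       # (batch, time, 1, 64, 44)
        b, t = clips.shape[:2]
        frames = clips.flatten(0, 1)                # fold time into the batch
        embeds = self.cnn(frames).view(b, t, -1)    # per-frame spatial features
        _, (h_n, _) = self.lstm(embeds)             # temporal dynamics
        return self.classifier(h_n[-1])

net = HybridGaitNet()
print(net(torch.rand(2, 12, 1, 64, 44)).shape)      # torch.Size([2, 10])
```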

3 Datasets and evaluation criteria

3.1 Datasets

Datasets are crucial for the gait recognition process because they are used to evaluate methods. Over the years, several gait recognition datasets have been developed to support research and development in this field. Some publicly available gait datasets commonly used for gait recognition are shown in Table 2, which provides an overview of their key features. These features comprise the number of subjects (classes), the number of sequences, the number of cameras, the image resolution, the frame rate, the number of training and testing subjects, the environmental conditions, the type of data, and variations in the appearance of the individual.

The CMU Motion of Body (MoBo) database contains high-quality video recordings from multiple angles of subjects walking on a treadmill. This collection, which includes different walking speeds and conditions for 25 subjects, provides a solid resource for analyzing and recognizing individual gait patterns [50]. The SOTON dataset [51] is a collection of gait videos acquired from a multi-camera system that captures people walking along a straight path. It includes videos of 115 subjects in both indoor and outdoor environments. The CASIA-A dataset [52] is another dataset for gait recognition research, containing data from 20 subjects. The USF HumanID dataset [1] includes gait videos of 122 subjects, with variations in footwear, briefcase carrying, and acquisition time. The videos were captured using two cameras. The CASIA-B dataset [53] is a large dataset containing gait cycles from 124 subjects, captured under various conditions such as normal walking (NM), different clothing (CL), and carrying a bag (BG). The CASIA-C dataset [54] includes gait videos of 153 subjects walking in a cross-view scenario. It also includes challenging variations, namely three different walking speeds (normal walking, NM; slow walking, SW; fast walking, FW) and carrying a bag (BW). The OU-ISIR Treadmill dataset [55] was collected at Osaka University in Japan. Its speed subset includes gait videos of 34 subjects walking on a treadmill at nine different speeds, and its clothes subset includes gait videos of 68 subjects in up to 32 clothing combinations. The OU-LP dataset [19] is a large-scale gait database that includes gait sequences of 4,007 subjects (in version 1), collected using four camera angles. It covers a large number of participants with a wide range of gait patterns, all captured in a controlled environment to minimize external variables such as lighting and background variations, and subjects are typically dressed uniformly to reduce the impact of clothing variations on gait recognition. The TUM GAID dataset [56] incorporates audio, image (video), and depth data, providing a comprehensive set of modalities for gait analysis. It consists of 305 subjects, and a subset of 32 subjects enables the study of clothing- and time-invariant gait recognition. The OU-LP Bag dataset [57] includes gait sequences of 62,528 subjects carrying an object while walking, with variations in the types of carried objects. The OU-LP Age dataset [58] includes gait sequences of 63,846 subjects of different ages. The OU-MVLP (Multi-View Large Population) dataset [20] is another large-scale gait database that includes gait sequences of 10,307 subjects captured from 14 views, ranging from 0° to 90° and from 180° to 270°. The CASIA-E dataset [59] includes silhouettes of 1,014 subjects with variations in walking style, carried objects, and clothing. The OU-MVLP Pose dataset [60] was created by taking the RGB images from OU-MVLP and extracting pose skeleton sequences from them. VersatileGait [61] is a large-scale synthetic gait dataset produced using a game engine. It includes nearly one million silhouette sequences of 11,000 subjects, each with fine-grained attributes, and is intended to address the shortcomings of existing real-world gait datasets, which frequently have small sample sizes and simple scenarios. The ReSGait dataset [62] consists of 172 subjects and 870 video clips collected over a period of 15 months; it includes annotations for gender, clothing and carrying conditions, and mobile phone use. The GREW dataset [21] is known as the first extensive dataset for gait recognition in the wild. It consists of gait sequences from 26,345 subjects collected from 882 cameras and includes information such as gender, age group, and carrying and clothing conditions. The OU-MVLP Mesh dataset [63] was built upon OU-MVLP and provides informative 3D human mesh models using parametric pose and shape features (i.e., SMPL). The Gait3D dataset [22] is a large-scale gait recognition dataset based on 3D representations. It contains 4,000 subjects captured by 39 cameras in the wild and includes variations such as different walking speeds and clothing conditions.

3.2 Evaluation criteria

To evaluate gait recognition methods on different databases, two types of evaluation protocols are frequently used: subject-dependent and subject-independent [13]. Subject-dependent protocols involve training and testing the gait recognition method on the same set of subjects. In this scenario, the solution is trained on a subset of the gait data and then tested on the remaining data for each subject. The goal of this approach is to determine how well the method recognizes the gait patterns of an individual under intraclass variations such as different walking speeds, clothing, and carrying conditions. Subject-independent protocols, on the other hand, involve training the gait recognition method on one set of subjects and testing it on a different set. This approach is intended to evaluate how well the method generalizes to new individuals who were not included in the training data. The test data are subdivided into gallery and probe sets, and the model learned on the separate training subjects is utilized to extract features from these subsets. A classifier then compares the probe and gallery data to find the most similar gait patterns and categorize them as belonging to the same identity.

The gait recognition methods studied in this paper use the cumulative match characteristic (CMC) as an evaluation criterion. The CMC curve is a performance evaluation metric commonly used in biometrics and computer vision, particularly in recognition tasks such as face recognition, fingerprint identification, and gait recognition, and it helps to assess the accuracy of identification systems. The CMC curve is essentially a rank-based metric and represents the probability that a query identity appears within the top K ranks of a sorted list of candidates generated by the system [64]. Despite being a widely used metric for measuring the precision of identification systems, the CMC curve has its limitations. Because it focuses only on ranking performance, it ignores the overall accuracy and confidence of matches, and it provides limited insight into system performance across different circumstances, which might overlook the complex nature of real-world applications [65]. However, some studies reviewed in this survey report only rank-1 recognition accuracy, which is the first point on a CMC curve. Consequently, in the subsequent sections, we will also use rank-1 accuracy as our primary evaluation criterion.
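
The following sketch computes the rank-1 and rank-5 points of a CMC curve from a probe-gallery distance matrix; the identities and features are randomly generated purely for illustration.

```python
# A minimal sketch of CMC evaluation: rank-k accuracy computed from
# pairwise distances between probe and gallery feature vectors.
import numpy as np

rng = np.random.default_rng(0)
n_probe, n_gallery, dim = 20, 50, 32
probe = rng.normal(size=(n_probe, dim))
gallery = rng.normal(size=(n_gallery, dim))
probe_ids = rng.integers(0, 10, n_probe)       # synthetic identity labels
gallery_ids = rng.integers(0, 10, n_gallery)

# Pairwise Euclidean distances between every probe and gallery sample.
dists = np.linalg.norm(probe[:, None, :] - gallery[None, :, :], axis=-1)

# Rank-1: the identity of each probe's nearest gallery sample must match.
nearest = dists.argmin(axis=1)
print("rank-1 accuracy:", (gallery_ids[nearest] == probe_ids).mean())

# Rank-k: the correct identity must appear among the k nearest samples.
order = dists.argsort(axis=1)
for k in (1, 5):
    hits = [(gallery_ids[order[i, :k]] == probe_ids[i]).any() for i in range(n_probe)]
    print(f"rank-{k} accuracy:", np.mean(hits))
```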

4 Appearance-based gait recognition approaches

Over the last few decades, numerous approaches to gait recognition have been developed. As mentioned, these approaches are divided into model-based and appearance-based categories. This section reviews the appearance-based gait recognition approaches published in recent years. Appearance-based techniques consider the complete human body structure or motion. This approach extracts gait features from human walking sequences, focusing on the silhouette shape and the dynamic information needed for pattern matching.

Numerous methodologies exist in the literature, and this section cannot cover them all; instead, it discusses the details of state-of-the-art techniques. Table 3 summarizes the reviewed appearance-based gait recognition approaches, arranged by date of publication. Section 4.1 contains a rigorous comparison of these approaches.

In [66], a new loss function for cross-view gait recognition called angle center loss (ACL) is proposed, together with a method for learning spatial-temporal features that combines a learned horizontal partition with an LSTM attention model. Gait silhouettes are divided into four horizontal parts, and each part is fed into a separate CNN. Attention weights for each part are used to average frame-level features. During training, the various weighted features are fed into different loss functions, while during testing, the weighted features for each part are concatenated to form a feature vector. For both verification and identification tasks, cosine similarities are computed between these feature vectors. For each local part, several independent CNNs are used to learn the local gait features, and a simplified spatial transformer network is used to localize the informative parts. An LSTM-based temporal attention model is used to capture the temporal features. The proposed method is evaluated using silhouettes on three gait recognition datasets (CASIA-B, OULP, and OUMVLP, with accuracies of 96.0%, 99.3%, and 89.0%, respectively).

Fan et al. [67] introduce a deep learning-based solution for gait recognition (GaitPart) that recognizes people based on their walking patterns. The method uses a temporal part-based architecture consisting of two separate components: a frame-level part feature extractor (FPFE) and a micro-motion capture module (MCM). The FPFE aims to improve fine-grained learning of part-level features, while the attention-based MCM derives local short-range spatiotemporal expressions. Experiments are performed on the CASIA-B and OUMVLP datasets, and the averaged rank-1 accuracies of the method are 96.2% and 88.7%, respectively.

The research in [68] proposes the Gait Lateral Network (GLN), a new network for learning discriminative and compact representations from the silhouettes of gait sequences. For accurate recognition, GLN takes advantage of the intrinsic feature pyramid in deep CNNs to extract discriminative features, and of lateral connections to integrate silhouette-level and set-level features. Furthermore, GLN has a compact block that considerably reduces the dimension of the gait representations while maintaining accuracy. The experiments are conducted on the CASIA-B and OUMVLP datasets, with accuracies of 96.8% and 89.1%, respectively.

In [11], a Set Residual Network (SRN) is presented for silhouette-based gait recognition. Its fundamental block, the Set Residual Block (SRBlock), builds the framework for feature learning from silhouettes. The SRBlock is divided into two parallel branches: the silhouette branch (learning features from each silhouette individually) and the set branch (learning features from all silhouettes collectively). The features retrieved from the two branches are concatenated using a residual connection and Leaky ReLU. The paper also presents a dual feature pyramid approach for learning more robust part representations for gait recognition using shallow-layer features. The proposed SRN is tested on the CASIA-B (97.1% accuracy) and OUMVLP (89.1% accuracy) datasets.

In [69], a novel approach is proposed that uses a 3D convolutional neural network (3D CNN) to extract the spatiotemporal features of a gait sequence while adopting a holistic approach using GEIs. This network is made up of two sets of convolutional layers, each succeeded by a pooling layer, followed by batch normalization and two fully connected layers. The proposed model is evaluated on the CASIA-B and OULP datasets, and optimization techniques are applied to enhance performance. The best reported accuracies are 98.3% for the CASIA-B dataset and 93.1% for the OULP dataset.

The research in [70] introduces a unique approach for gait recognition that uses 3D local convolutional neural networks (CNNs) as building blocks. This block enables the sequential extraction of local 3D volumes with adaptable spatial and temporal scales, locations, and lengths. Location, sampling, feature extraction, and fusion modules make up the network. Additionally, the paper presents a framework for interacting with and enhancing global and local 3D volume information in any layer of a 3D CNN. The proposed approach was evaluated on the CASIA-B (accuracies of 97.5% and 98.3% for resolutions of 64 × 44 and 128 × 88, respectively) and OUMVLP (accuracy of 90.9%) datasets.

The authors of [71] propose a method for gait recognition using a context-sensitive temporal feature learning (CSTL) network and a salient spatial feature learning (SSFL) module. The authors highlight that humans may distinguish between different gaits by focusing on temporal sequences at varying time scales. The CSTL network uses relation modeling to evaluate the importance of multi-scale features, enhancing the more significant scales and suppressing the less important ones. The SSFL module solves the misalignment problem induced by temporal operations by selecting discriminative spatial cues throughout the sequence. The suggested method thus combines adaptive temporal learning with salient spatial mining. The experiments are conducted on three datasets: CASIA-B (accuracies of 98.5% and 98.7% for resolutions of 64 × 44 and 128 × 88, respectively), OUMVLP (accuracy of 91.0%), and GREW (accuracy of 50.6%). Although CSTL achieves rank-1 scores of more than 90% on both the CASIA-B and OU-MVLP datasets, it achieves only a 50.6% success rate in recognizing sequences on the GREW dataset. GREW is an unconstrained benchmark for gait recognition that aims to simulate real-world conditions better than its predecessors, such as CASIA-B and OU-MVLP. The significant variation in performance is due to the GREW dataset's inherently challenging conditions. Unlike CASIA-B and OU-MVLP, which are captured in partially controlled environments with limited variations, GREW considers a wider range of factors, such as different views, significant differences in clothing, and the presence of objects held by participants [21]. These factors introduce a level of complexity and unpredictability that better captures real-world circumstances but also presents greater difficulties for gait recognition systems.

The authors of [72] describe a method for gait recognition named the gait quality aware network (GQAN). It directly evaluates the quality of each silhouette and each part, and it is made up of two blocks: the frame quality block (FQBlock) and the part quality block (PQBlock). FQBlock adjusts the features of every silhouette separately and combines the scores of all channels to generate a frame quality measure, while PQBlock calculates the weighted distance between the probe and gallery by estimating a score for each part. GQAN can be trained using only sequence-level identity annotations by means of a loss function called part quality loss (PQLoss). The CASIA-B and OUMVLP datasets are used to evaluate the proposed network, and the best reported accuracies are 98.5% and 89.7%, respectively.

GaitSlice, proposed in [10], is a unique gait recognition model that enhances recognition accuracy by refining the spatial and temporal details of each part of the human body. The model has slice extraction device (SED) and residual frame attention mechanism (RFAM) modules. The SED divides the body into parts and connects the features of neighboring body parts from head to toe, while for each body component the RFAM collects and emphasizes the significant frames of the sequences. The GaitSlice model combines RFAMs that run in parallel with interrelated slice features in order to allow flexible selection of the key frames of each body part. The model is tested on two gait recognition datasets: CASIA-B (96.2% accuracy) and OUMVLP (89.3% accuracy).

GaitSet, proposed in [73], treats gait as a set of gait silhouettes and uses a deep learning model to recognize gaits. The paper emphasizes that the sequence of poses during a walking period is not the most important information for differentiating individuals, since the pattern of the sequence is universal. The GaitSet model extracts frame-level information from each silhouette using a CNN, then combines these features into a single set-level feature using set pooling. Using horizontal pyramid mapping, the set-level feature is transformed into a more discriminative space. The experiments are conducted on the CASIA-B and OUMVLP datasets, with accuracies of 96.1% and 87.9%, respectively.

The research in [74] proposes a sequential lightweight deep learning framework for gait recognition. The researchers modify two pre-existing deep learning models (VGG-19 and MobileNet-V2) and train them using transfer learning. Feature engineering is then conducted on the VGG-19 and MobileNet-V2 features, and the resulting features are merged using discriminant correlation analysis (DCA). To select the optimal features, a modified moth-flame optimization algorithm is proposed. The chosen features are then classified using an extreme learning machine (ELM). The proposed method was evaluated on the CASIA-B (91.2% accuracy) and TUM-GAID (98.6% accuracy) datasets.

STAR (Spatio-Temporal Augmented Relation Network), introduced in [9], is a novel approach for gait recognition. A multi-branch diverse-region feature generator (MDFG) and a spatiotemporal augmented interactor (STAI) are the two modules that make up STAR. The MDFG can identify body features within separate, non-overlapping regions, while the STAI uses the connections of these regions within a frame and across frames to create intra- and inter-relation models. The introduced approach was evaluated on the CASIA-B and OUMVLP datasets, and the best reported accuracies are 97.3% and 89.7%, respectively.

In [75], GaitAMR is proposed as a method for extracting discriminative subject features for gait recognition. GaitAMR uses a holistic and partial temporal aggregation technique that collects global and local body movement parameters. It is composed of four primary parts: a baseline, spatial extraction, temporal extraction, and view assessment. The baseline part uses silhouette information to convert gait samples into features. A multi-scale feature extractor then processes the features to provide richer motion data. The remaining parts analyze the features further to extract relevant information, addressing the challenges of appearance occlusion and silhouette misalignment. After the features from the different domains have been combined, they are sent to the classification layer for recognition. The proposed method was evaluated on the CASIA-B (accuracies of 98.1% and 98.6% for resolutions of 64 × 44 and 128 × 88, respectively) and OUMVLP (accuracy of 88.3%) datasets.

4.1 Comparison of different approaches

Considering the methods reviewed in this survey, it has been observed that the CASIA-B and OUMVLP datasets are the preferred primary datasets for evaluating appearance-based gait recognition applications.

In this section, detailed explanations are provided of how each method differs from previous ones and how these differences have led to success compared to earlier methods. Since the CASIA-B dataset is commonly used across all examined studies, the papers are organized in ascending order of the rank-1 accuracy achieved on this dataset. The accuracy rates for the CASIA-B dataset in Table 3 correspond to the accuracy under normal walking conditions.

In the study conducted in [66], a loss function is proposed that enhances robustness, especially when different spatial-temporal features are used. Loss functions in deep learning have the advantage of learning discriminative features or metrics; prior to this study, gait recognition methods typically employed classical loss functions such as softmax. The loss function proposed in this paper has been shown to improve performance compared to previous work. Additionally, the study combines different parts of the silhouettes with certain weight values. It is stated that the features obtained in this way increase the accuracy of the model, but this process brings with it computational cost and feature dimension problems. Finally, the LSTM attention model used to extract temporal features is noted to be inefficient due to the length of the testing sequence and its low parallel computing capacity.

In [ 73 ], a new method named GaitSet is proposed to obtain spatial and temporal information, differing from existing methods that view walking as a template or sequence. The study demonstrates that using additional feature extraction methods alongside deep networks yields more successful results than those found in the literature.

GaitPart [ 67 ] performs individual gait recognition by considering both static appearance features and dynamic temporal information. Previous studies have been conducted without detailed acquisition of temporal features. GaitPart stands out with its detailed modeling of temporal features.

In [ 10 ], GaitSlice is proposed to refine gait recognition features in both spatial and temporal dimensions, based on the logic that the less information included in gait silhouettes, the more significant the role of key frames of body parts. The proposed model has particularly improved gait recognition accuracy under cross-view conditions and complex walking conditions.

In [ 74 ], the VGG-19 and MobileNet-V2 models were trained using deep transfer learning. Subsequently, a new moth-flame optimization algorithm was developed to select the best features. It has been stated that combining lightweight model features with the developed algorithms is time-consuming, but accuracy has been increased in this way. Additionally, it has been determined that the optimization algorithm reduces computation time and increases accuracy.

In GLN [ 68 ], features at the silhouette-level and set-level were extracted at different stages within the deep network backbone and were combined from top to bottom via lateral connections. This approach aggregated more visual details, thereby enhancing the accuracy of gait recognition. Additionally, the size of the gait representations was reduced using a compact block. The proposed method has outperformed previous studies in the literature in terms of both accuracy and size.

SRN [11] differs from previous studies mainly in its method of coordinating silhouette-level and set-level information for set-based feature learning from silhouettes. Additionally, SRN proposes a method that leverages shallow-layer features to better learn part representations. In particular, compared to GLN, which also uses silhouette-level and set-level information, it is stated that upsampling and lateral connections are unnecessary. SRN therefore suggests a method that incurs only a marginal memory cost and takes advantage of shallow-layer features to learn more robust part representations. The proposed approach is superior to its counterparts in terms of accuracy, especially under challenging conditions.

The study conducted in [9] introduces a new spatiotemporal augmented relation network (STAR). Through its constituent modules, it facilitates the generation of visual clues in various regions for fine-grained feature learning and adaptively locates non-overlapping regions that carry significant identity information. With these capabilities, it enables better extraction of distinct information among frames and improves accuracy compared to prior studies in the literature.

The method proposed in [ 70 ] extracts temporal features using its simple but effective three-dimensional CNN model. This method performs better than the other studies through this feature extraction technique.

In the study conducted in [ 72 ], unlike other methods, a module named FQBlock is proposed to measure the quality of each frame. FQBlock works on the number of feature channels, evaluating the features of each frame separately. Moreover, the attention values of each frame are based solely on its own features and do not change with permutation according to the silhouette pattern. FQBlock shares weights across different silhouettes, thus ensuring the comparability of attention values of frames in different sequences. These features have enabled the GQAN method to achieve more successful results than previous ones.

GaitAMR [75] surpasses other methods in both spatial and temporal feature representation because it explicitly accounts for potential silhouette errors, the influence of local body features on final recognition, spatial occlusion, and appearance variation. It also reaches better recognition performance within fewer training iterations.

In [69], spatial features such as body shape are captured together with the temporal characteristics of walking patterns, specifically to address the difficulties that gait recognition algorithms face in open environments. Gait energy images (GEIs) and a 3D CNN are employed for feature extraction and recognition, and the network's hyperparameters are tuned with Bayesian optimization. Thanks to the 3D model and the hyperparameter optimization, the method ranks among the successful studies in the literature.
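
The GEI itself is a standard construction: the pixel-wise mean of the aligned, size-normalized binary silhouettes over a gait cycle. The short sketch below makes this explicit; the array shapes are illustrative assumptions.

```python
import numpy as np

# Gait energy image (GEI): pixel-wise mean of aligned binary silhouettes
# over one gait cycle. Bright pixels are body regions that stay still;
# gray pixels capture the limbs' motion envelope.
def gait_energy_image(silhouettes: np.ndarray) -> np.ndarray:
    """silhouettes: (num_frames, height, width) binary array in {0, 1}."""
    return silhouettes.astype(np.float32).mean(axis=0)

cycle = np.random.randint(0, 2, size=(30, 64, 44))  # dummy gait cycle
gei = gait_energy_image(cycle)                      # (64, 44) grayscale template
```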

In [71], a temporal modeling network is proposed to combine multi-scale temporal features, together with a spatial feature learning module that repairs the feature corruption introduced by the temporal operations. Experiments on benchmark datasets demonstrate the model's superiority over current methods.

5 Challenges and future perspectives

Despite substantial improvement in recent years, several challenges remain in human gait recognition: variability in walking patterns, occluded views, environmental factors, the lack of gait datasets, ethical and privacy concerns, and learning-related challenges. The remainder of this section describes each of them in detail. By focusing on these challenges, researchers can improve the accuracy, reliability, and usefulness of gait recognition systems.

5.1 Variability of walking patterns

People walk in different ways, and the same person may exhibit many walking styles depending on circumstances such as walking speed, carried items, clothing, surface type, and aging. When people walk at different speeds or on different surfaces, they naturally adjust gait parameters such as stride length and step width to maintain balance. Carrying a bag may cause the upper body to lean forward, lengthening the stride; tight or restrictive clothing can limit hip and leg range of motion; and high-heeled shoes tilt the ankles forward, shortening the stride. Such adjustments change the way a person walks and the features of their gait pattern. Most previous studies [9, 10, 11, 66, 67, 68, 69, 70, 71, 72, 73, 74] achieve promising results even on datasets that include some of these conditions. Aging can also alter the walking pattern: changes in joint flexibility and mobility can reduce stride length. Some datasets [21] contain gait data from the same people at different times, but even there the longest interval is 15 months, so further research over much longer time frames is needed.

5.2 Occluded views

In real-world scenarios, obstacles such as bags or cars can occlude part of a person's body, making it difficult to capture enough information about the gait pattern to identify an individual. Researchers creating new datasets can use multiple cameras or sensors to collect data from different angles and viewpoints to mitigate occlusion. Another remedy is human body alignment, in which the system aligns the various parts of the body, including the head, torso, and limbs, so that gait recognition algorithms can better detect and track gait patterns even under partial occlusion. Several studies [69, 71, 75] report improved gait recognition under occlusion conditions, and GaitPart [67], which extracts gait features from different parts of the body, can partially compensate for occluded views.

5.3 Environmental factors

In real-world scenarios, many uncontrolled environmental factors, such as lighting conditions, shadows, and camera angles, can affect gait recognition accuracy. Researchers can conduct experiments in a range of real-world environments to investigate the impact of these factors; by collecting data under varying lighting conditions, for example, they can identify which factors degrade accuracy the most and develop algorithms that are more resilient to these variations. Lighting variability directly affects the consistency and reliability of the captured gait data: gait recognition is highly sensitive to changes in lighting, which alter the appearance of the subject's silhouette and overall visibility. Poor lighting can lead to incomplete or inaccurate silhouettes, making it difficult to extract reliable gait features [76]; fluctuating lighting introduces variability in important features, reducing the model's ability to recognize and classify gait patterns accurately; and strong lighting can create shadows that are misinterpreted as part of the gait, leading to incorrect feature extraction and analysis [77]. By combining robust feature selection, preprocessing techniques, depth sensing, and adaptive machine learning approaches, it is possible to mitigate the impact of lighting variability and enhance the performance of gait recognition systems in diverse environments. Most of the currently public gait datasets were collected under controlled conditions and are comparatively simple to recognize; the ReSGait [62] dataset is based on real scenarios, and the GREW [21] dataset is designed for real-world applications.
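
As a concrete illustration of such preprocessing, a simple chain against lighting variability might combine contrast-limited histogram equalization with morphological cleanup of the extracted silhouette. The sketch below is an assumption-laden example: Otsu thresholding stands in crudely for a full background-subtraction pipeline, and the frame is a dummy array.

```python
import cv2
import numpy as np

# Illustrative lighting-robust preprocessing: normalize local contrast,
# extract a rough silhouette, then remove speckle noise morphologically.
frame = np.random.randint(0, 256, (240, 320), dtype=np.uint8)  # dummy gray frame

clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
equalized = clahe.apply(frame)                      # contrast-limited equalization

_, silhouette = cv2.threshold(equalized, 0, 255,
                              cv2.THRESH_BINARY + cv2.THRESH_OTSU)
kernel = np.ones((3, 3), np.uint8)
silhouette = cv2.morphologyEx(silhouette, cv2.MORPH_OPEN, kernel)  # denoise
```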

5.4 Lack of gait datasets

Gait recognition systems rely on large amounts of data to identify individuals accurately, but obtaining and labeling such data is time-consuming and expensive. One way for researchers to access more data is to generate synthetic gait data from virtual 3D human models. Because synthetic data can be produced with precise control, it can also capture specific variations in gait patterns that are difficult to record in the real world. VersatileGait [61] is, to our knowledge, the only synthetic gait recognition dataset; it contains gait data for 11,000 subjects. Unlabeled data, which is easily accessible through videos on the Internet, can also help overcome the shortage of gait data. Since labeling such data one by one would be tedious and time-consuming, self-supervised learning can help here: it has shown potential for training on unlabeled data because it learns useful representations without human labeling [78].

5.5 Ethical and privacy concerns

Gait recognition is a form of biometric identification, and there are concerns about data privacy and misuse. People may be uneasy about having their gait patterns captured and retained, particularly if they are unfamiliar with the technology or with how their data will be used. Gait recognition could become a powerful tool for mass surveillance, since gait can be captured remotely without a person's knowledge or consent. Gait data could also be repurposed for uses not originally intended, or fall into the hands of unauthorized individuals who exploit it for illegal activities [79]. Addressing these concerns requires comprehensive regulatory frameworks: strict guidelines on data collection, usage, and storage that protect individuals' rights. Gait data should be stored securely and protected from unauthorized access through measures such as encryption and access control. Moreover, the development of gait recognition technologies must include ethical considerations from the outset, with ongoing assessments of their impact on society.

5.6 Learning challenges

The use of machine learning and deep learning techniques in gait recognition brings several crucial challenges that must be resolved to obtain reliable results. A few key issues are examined below.

Overfitting occurs when a model learns the training set too well, leading to poor generalization to new, unseen data [80]. In gait recognition this can mean that the model performs well on known subjects but misrecognizes the gait patterns of unknown ones. Overfitting can be addressed by regularization strategies, data augmentation, and, in deep learning models, dropout layers. Furthermore, using cross-validation to monitor performance on held-out data during training allows training to be terminated early, before overfitting sets in [81].
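
A generic sketch of two of these countermeasures, dropout inside the network plus early stopping on a validation split, is shown below. The model, the training and validation routines, and the patience value are placeholder assumptions rather than any specific system from the surveyed literature.

```python
import copy
import torch
import torch.nn as nn

# Anti-overfitting sketch: dropout plus early stopping on validation loss.
model = nn.Sequential(nn.Linear(256, 128), nn.ReLU(),
                      nn.Dropout(p=0.5),       # randomly zero activations
                      nn.Linear(128, 124))

def train_one_epoch(m):      # placeholder: one pass over the training data
    pass

def validation_loss(m):      # placeholder: returns the mean validation loss
    return float(torch.rand(()))

best, best_state, patience, bad = float("inf"), None, 5, 0
for epoch in range(100):
    train_one_epoch(model)
    loss = validation_loss(model)
    if loss < best:
        best, best_state, bad = loss, copy.deepcopy(model.state_dict()), 0
    else:
        bad += 1
        if bad >= patience:  # stop before the network memorizes the training set
            break
model.load_state_dict(best_state)  # restore the best-validating weights
```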

Deep learning models in particular are considered black boxes because of their complex architectures and the high dimensionality of their learned feature spaces [80]. This lack of interpretability is problematic in sensitive gait recognition applications, where the reasoning behind decisions must be understood. Model behavior can be partially understood by visualizing the parts of the input that most influence the model's decisions, using techniques such as layer-wise relevance propagation (LRP) [82].
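
As a lightweight stand-in for relevance methods, input-gradient saliency highlights the pixels that most influence a decision. The sketch below is only illustrative: LRP itself redistributes relevance layer by layer with dedicated propagation rules rather than using raw gradients, and the toy model is an assumption.

```python
import torch

# Input-gradient saliency: the magnitude of the target logit's gradient
# with respect to each input pixel marks influential regions.
def saliency_map(model, x, target_class):
    x = x.detach().clone().requires_grad_(True)
    model(x)[0, target_class].backward()  # gradient of the target logit
    return x.grad.abs().squeeze(0)        # large values = influential pixels

# toy usage with an assumed linear classifier over a 64x44 silhouette
net = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(64 * 44, 10))
heat = saliency_map(net, torch.randn(1, 64, 44), target_class=3)
```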

Machine learning models, and deep learning models especially, require large amounts of labeled data for training, and gathering and annotating many gait patterns takes considerable time and resources. Using synthetic or unlabeled data, as discussed in Section 5.4, is one possible remedy.

Two further issues that frequently arise in deep-learning-based gait recognition are catastrophic forgetting and low inter-class variance. Catastrophic forgetting happens when a neural network loses information about past tasks after training on a new task. Elastic weight consolidation (EWC) was proposed to address this: it lets the network learn new tasks while preserving the weights that are important for previous ones [83]. Low inter-class variance describes a situation in which distinct classes (i.e., the gait patterns of different individuals) have highly similar features, making it hard for the model to tell them apart and raising misclassification rates, since the model fails to find the unique features that distinguish one person's gait from another's. Feature aggregation is an effective way to address low inter-class variance in gait recognition and other tasks that must separate very similar classes [84]. To the same end, [25] introduces a generalized inter-class loss that tackles the problem at both the sample-level and the class-level feature distributions.
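
The EWC idea reduces to a quadratic penalty that anchors each parameter to its value after the previous task, in proportion to an estimate of how important that parameter was. A minimal sketch follows; `old_params` and `fisher` are assumed to be dictionaries of tensors saved after training on the earlier task, and the regularization strength is a placeholder.

```python
import torch

# Elastic weight consolidation (EWC) penalty [83], sketched: parameters
# are pulled quadratically toward their post-task values, weighted by a
# diagonal Fisher-information estimate of per-parameter importance.
def ewc_penalty(model, old_params, fisher, lam=100.0):
    loss = torch.zeros(())
    for name, p in model.named_parameters():
        loss = loss + (fisher[name] * (p - old_params[name]) ** 2).sum()
    return 0.5 * lam * loss

# total loss on the new task: new_task_loss + ewc_penalty(model, old_params, fisher)
```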

One of the most common challenges in machine learning is class imbalance, in which the instances of one class significantly outnumber those of one or more other classes. Imbalance can bias a model toward predicting the majority class better than the minority classes. In gait recognition tasks, each class typically represents an individual; if the dataset contains roughly equal numbers of walking examples for every individual, no significant imbalance arises, but when some individuals have many more examples than others, the model may recognize them better, which is problematic in sensitive applications. Strategies for handling class imbalance include oversampling (increasing the number of minority-class instances), undersampling (reducing the number of majority-class instances), and ensemble methods [85].
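
Oversampling, for instance, can be implemented with a weighted sampler that draws minority-class examples more often. In the sketch below the label tensor is a toy assumption standing in for per-sample subject IDs.

```python
import torch
from torch.utils.data import WeightedRandomSampler

# Rebalance an imbalanced dataset by sampling inversely to class frequency.
labels = torch.tensor([0, 0, 0, 0, 1, 1, 2])         # toy imbalanced labels
class_counts = torch.bincount(labels).float()
sample_weights = 1.0 / class_counts[labels]          # rare subjects weigh more
sampler = WeightedRandomSampler(sample_weights, num_samples=len(labels),
                                replacement=True)
# pass `sampler=sampler` to a DataLoader to rebalance each training epoch
```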

6 Conclusion

This survey provides an extensive examination of appearance-based methods for human gait recognition, covering the significant developments in the field and the many strategies used to recognize gait from visual information. Through careful examination, we have shown the effectiveness of appearance-based methods in recognizing individuals from their unique gait patterns. The paper also reviews the publicly available datasets commonly used for gait recognition and underlines the significance of dataset size, quality, and diversity for developing accurate and robust algorithms. Furthermore, challenges such as variability in walking patterns, occluded views, environmental factors, the lack of gait datasets, ethical and privacy concerns, and learning-related challenges are discussed, along with potential solutions proposed in recent research. In conclusion, while appearance-based human gait recognition shows considerable promise and has made significant progress, there is still a need for further exploration and improvement. Future research should focus on addressing the identified challenges, exploring the integration of different types of data, and improving the interpretability and generalizability of gait recognition models. Overall, appearance-based gait recognition algorithms hold great potential for applications in surveillance, health, and biometrics, and future progress will help improve security, personal identification, and health-tracking systems.

Data availability

Not applicable, as no datasets were generated during the current study.

Sarkar S, Phillips PJ, Liu Z et al (2005) The humanID gait challenge problem: data sets, performance, and analysis. IEEE Trans Pattern Anal Mach Intell 27:162–177. https://doi.org/10.1109/TPAMI.2005.39

Nixon MS, Carter JN, Cunado D et al (1999) Automatic gait recognition. In: Jain AK, Bolle R, Pankanti S (eds) Biometrics. Springer, Boston, pp 231–249

Wang L, Ning H, Tan T, Hu W (2004) Fusion of static and dynamic body biometrics for gait recognition. IEEE Trans Circuits Syst Video Technol 14:149–158. https://doi.org/10.1109/TCSVT.2003.821972

Wu Z, Huang Y, Wang L et al (2017) A comprehensive study on cross-view gait based human identification with deep CNNs. IEEE Trans Pattern Anal Mach Intell 39:209–226. https://doi.org/10.1109/TPAMI.2016.2545669

Chen J (2014) Gait correlation analysis based human identification. Sci World J 2014:1–8. https://doi.org/10.1155/2014/168275

Wan C, Wang L, Phoha VV (2019) A survey on gait recognition. ACM Comput Surv 51:1–35. https://doi.org/10.1145/3230633

Kale A, Sundaresan A, Rajagopalan AN et al (2004) Identification of humans using gait. IEEE Trans on Image Process 13:1163–1173. https://doi.org/10.1109/TIP.2004.832865

Kusakunniran W (2020) Review of gait recognition approaches and their challenges on view changes. IET Biom 9:238–250. https://doi.org/10.1049/iet-bmt.2020.0103

Huang X, Wang X, He B et al (2023) STAR: spatio-temporal augmented relation network for gait recognition. IEEE Trans Biom Behav Identity Sci 5:115–125. https://doi.org/10.1109/TBIOM.2022.3211843

Li H, Qiu Y, Zhao H et al (2022) GaitSlice: a gait recognition model based on spatio-temporal slice features. Pattern Recogn 124:108453. https://doi.org/10.1016/j.patcog.2021.108453

Hou S, Liu X, Cao C, Huang Y (2021) Set residual network for silhouette-based gait recognition. IEEE Trans Biom Behav Identity Sci 3:384–393. https://doi.org/10.1109/TBIOM.2021.3074963

Singh JP, Jain S, Arora S, Singh UP (2021) A survey of behavioral biometric gait recognition: current success and future perspectives. Arch Comput Methods Eng 28:107–148. https://doi.org/10.1007/s11831-019-09375-3

Sepas-Moghaddam A, Etemad A (2023) Deep gait recognition: a survey. IEEE Trans Pattern Anal Mach Intell 45:264–284. https://doi.org/10.1109/TPAMI.2022.3151865

Rani V, Kumar M (2023) Human gait recognition: a systematic review. Multimed Tools Appl. https://doi.org/10.1007/s11042-023-15079-5

Parashar A, Parashar A, Shabaz M et al (2024) Advancements in artificial intelligence for biometrics: a deep dive into model-based gait recognition techniques. Eng Appl Artif Intell 130:107712

Google Scholar. https://scholar.google.com/?hl=en&as_sdt=0,5 . Accessed 4 Jul 2023

IEEE Xplore. https://ieeexplore.ieee.org/Xplore/home.jsp . Accessed 4 Jul 2023

ScienceDirect. https://sciencedirect.global/ . Accessed 4 Jul 2023

Iwama H, Okumura M, Makihara Y, Yagi Y (2012) The OU-ISIR gait database comprising the large population dataset and performance evaluation of gait recognition. IEEE Trans Inform Forensic Secur 7:1511–1521. https://doi.org/10.1109/TIFS.2012.2204253

Takemura N, Makihara Y, Muramatsu D et al (2018) Multi-view large population gait dataset and its performance evaluation for cross-view gait recognition. IPSJ Trans Comput Vis Appl 10:4. https://doi.org/10.1186/s41074-018-0039-6

Zhu Z, Guo X, Yang T et al (2021) Gait recognition in the wild: a benchmark. In: IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, pp 14769–14779

Zheng J, Liu X, Liu W et al (2022) Gait recognition in the wild with dense 3D representations and a benchmark. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp 20228–20237

Pearson K (1901) LIII. On lines and planes of closest fit to systems of points in space. Lond Edinburgh Dublin Philosophical Mag J Sci 2:559–572. https://doi.org/10.1080/14786440109462720

Fisher RA (1936) The use of multiple measurements in taxonomic problems. Ann Eugen 7:179–188. https://doi.org/10.1111/j.1469-1809.1936.tb02137.x

Yu W, Yu H, Huang Y, Wang L (2022) Generalized inter-class loss for gait recognition. In: Proceedings of the 30th ACM International Conference on Multimedia, pp 141–150

Crouse MB, Chen K, Kung HT (2014) Gait recognition using encodings with flexible similarity metrics. In: 11th International Conference on Autonomic Computing (ICAC 14), pp 169–175

Zhang C, Liu W, Ma H, Fu H (2016) Siamese neural network based gait recognition for human identification. In: 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp 2832–2836

Cortes C, Vapnik V (1995) Support-vector networks. Mach Learn 20:273–297. https://doi.org/10.1007/BF00994018

Bishop CM, Nasrabadi NM (2006) Pattern recognition and machine learning, vol 4. Springer, New York, p 738

Begg RK, Palaniswami M, Owen B (2005) Support vector machines for automated gait classification. IEEE Trans Biomed Eng 52:828–838. https://doi.org/10.1109/TBME.2005.845241

Gou H, Yan L, Xiao J (2015) A gait recognition system based on SVM and accelerations. MATEC Web Conf 30:06001. https://doi.org/10.1051/matecconf/20153006001

Rabiner LR (1989) A tutorial on hidden Markov models and selected applications in speech recognition. Proc IEEE 77:257–286. https://doi.org/10.1109/5.18626

Suk H-I, Sin B-K (2006) HMM-based gait recognition with human profiles. In: Yeung D-Y, Kwok JT, Fred A et al (eds) Structural, syntactic, and statistical pattern recognition. Springer, Berlin, pp 596–603

Bae J, Tomizuka M (2010) Gait phase analysis based on a hidden markov model. IFAC Proc Vol 43:746–751. https://doi.org/10.3182/20100913-3-US-2015.00014

Dargan S, Kumar M, Ayyagari MR, Kumar G (2019) A survey of deep learning and its applications: a new paradigm to machine learning. Arch Comput Methods Eng 27:1–22

Zafar A, Aamir M, Mohd Nawi N et al (2022) A comparison of pooling methods for convolutional neural networks. Appl Sci 12:8643. https://doi.org/10.3390/app12178643

Hinton GE, Salakhutdinov RR (2006) Reducing the dimensionality of data with neural networks. Science 313(5786):504–507

Welling M, Kingma DP (2019) An introduction to variational autoencoders. Found Trends Mach Learn 12(4):307–392

Lecun Y, Bottou L, Bengio Y, Haffner P (1998) Gradient-based learning applied to document recognition. Proc IEEE 86:2278–2324. https://doi.org/10.1109/5.726791

Goodfellow I, Bengio Y, Courville A (2016) Deep learning. MIT press, Cambridge

Medsker LR, Jain L (2001) Recurrent neural networks. Design Appl 5(64–67):2

Bengio Y, Simard P, Frasconi P (1994) Learning long-term dependencies with gradient descent is difficult. IEEE Trans Neural Netw 5(2):157–166

Hochreiter S, Schmidhuber J (1997) Long short-term memory. Neural Comput 9(8):1735–1780

Cho K, Van Merriënboer B, Gulcehre C et al (2014) Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078v3

Goodfellow IJ, Pouget-Abadie J, Mirza M et al (2014) Generative adversarial networks. arXiv preprint arXiv:1406.2661

Kodali N, Abernethy J, Hays J, Kira Z (2017) On convergence and stability of GANs. arXiv preprint arXiv:1705.07215

Tran D, Bourdev L, Fergus R et al (2015) Learning spatiotemporal features with 3D convolutional networks. In: Proceedings of the IEEE International Conference on Computer Vision

Nonis F, Dagnes N, Marcolin F, Vezzetti E (2019) 3D approaches and challenges in facial expression recognition algorithms—a literature review. Appl Sci 9(18):3904. https://doi.org/10.3390/app9183904

Shi X, Chen Z, Wang H et al (2015) Convolutional LSTM network: a machine learning approach for precipitation nowcasting. In: Advances in Neural Information Processing Systems, vol 28

Gross R, Shi J (2001) The CMU motion of body (MoBo) database. Tech. Rep. CMU-RI-TR-01-18, Carnegie Mellon University, Pittsburgh, PA, USA

Shutler JD, Grant MG, Nixon MS, Carter JN (2004) On a large sequence-based human gait database. In: Lotfi A, Garibaldi JM (eds) Applications and science in soft computing. Springer, Berlin, pp 339–346

Wang L, Tan T, Ning H, Hu W (2003) Silhouette analysis-based gait recognition for human identification. IEEE Trans Pattern Anal Mach Intell 25:1505–1518. https://doi.org/10.1109/TPAMI.2003.1251144

Yu S, Tan D, Tan T (2006) A framework for evaluating the effect of view angle, clothing and carrying condition on gait recognition. In: 18th International Conference on Pattern Recognition (ICPR'06), Hong Kong, China, pp 441–444

Tan D, Huang K, Yu S, Tan T (2006) Efficient night gait recognition based on template matching. In: 18th International Conference on Pattern Recognition (ICPR'06), Hong Kong, China, pp 1000–1003

Makihara Y, Mannami H, Tsuji A et al (2012) The OU-ISIR gait database comprising the treadmill dataset. IPSJ Trans Comput Vis Appl 4:53–62. https://doi.org/10.2197/ipsjtcva.4.53

Hofmann M, Geiger J, Bachmann S et al (2014) The TUM gait from audio, image and depth (GAID) database: multimodal recognition of subjects and traits. J Vis Commun Image Represent 25(1):195–206

Uddin MdZ, Ngo TT, Makihara Y et al (2018) The OU-ISIR large population gait database with real-life carried object and its performance evaluation. IPSJ T Comput Vis Appl 10:5. https://doi.org/10.1186/s41074-018-0041-z

Xu C, Makihara Y, Ogi G et al (2017) The OU-ISIR gait database comprising the large population dataset with age and performance evaluation of age estimation. IPSJ T Comput Vis Appl 9:24. https://doi.org/10.1186/s41074-017-0035-2

Song C, Huang Y, Wang W, Wang L (2022) CASIA-E: a large comprehensive dataset for gait recognition. IEEE Trans Pattern Anal Mach Intell. https://doi.org/10.1109/TPAMI.2022.3183288

An W, Yu S, Makihara Y et al (2020) Performance evaluation of model-based gait on multi-view very large population database with pose sequences. IEEE Trans Biom Behav Identity Sci 2:421–430. https://doi.org/10.1109/TBIOM.2020.3008862

Dou H, Zhang W, Zhang P et al (2021) VersatileGait: a large-scale synthetic gait dataset with fine-grained attributes and complicated scenarios. arXiv preprint arXiv:2101.01394

Mu Z, Castro FM, Marin-Jimenez MJ et al (2021) ReSGait: the real-scene gait dataset. In: 2021 IEEE International Joint Conference on Biometrics (IJCB), Shenzhen, China, pp 1–8

Li X, Makihara Y, Xu C, Yagi Y (2022) Multi-view large population gait database with human meshes and its performance evaluation. IEEE Trans Biom Behav Identity Sci 4:234–248. https://doi.org/10.1109/TBIOM.2022.3174559

Phillips P, Grother R, Michaels D (2003) FRVT 2002: facial recognition vendor test. Technical report, DoD

Ye M, Shen J, Lin G (2021) Deep learning for person re-identification: a survey and outlook. IEEE Trans Pattern Anal Mach Intell 44(6):2872–2893

Zhang Y, Huang Y, Yu S, Wang L (2020) Cross-view gait recognition by discriminative feature learning. IEEE Trans on Image Process 29:1001–1015. https://doi.org/10.1109/TIP.2019.2926208

Fan C, Peng Y, Cao C et al (2020) GaitPart: temporal part-based model for gait recognition. In: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, pp 14213–14221

Hou S, Cao C, Liu X, Huang Y (2020) Gait lateral network: learning discriminative and compact representations for gait recognition. In: Vedaldi A, Bischof H, Brox T, Frahm J-M (eds) Computer vision— ECCV 2020. Springer International Publishing, Cham, pp 382–398

Gul S, Malik MI, Khan GM, Shafait F (2021) Multi-view gait recognition system using spatio-temporal features and deep learning. Expert Syst Appl 179:115057. https://doi.org/10.1016/j.eswa.2021.115057

Huang Z, Xue D, Shen X et al (2021) 3D local convolutional neural networks for gait recognition. In: 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, pp 14900–14909

Huang X, Zhu D, Wang X et al (2022) Context-sensitive temporal feature learning for gait recognition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp 12909–12918

Hou S, Liu X, Cao C, Huang Y (2022) Gait quality aware network: toward the interpretability of silhouette-based gait recognition. IEEE Trans Neural Netw Learn Syst. https://doi.org/10.1109/TNNLS.2022.3154723

Chao H, Wang K, He Y et al (2021) GaitSet: cross-view gait recognition through utilizing gait as a deep set. IEEE Trans Pattern Anal Mach Intell 44(7):3467–3478

Khan MA, Arshad H, Damaševičius R et al (2022) Human gait analysis: a sequential framework of lightweight deep learning and improved moth-flame optimization algorithm. Comput Intell Neurosci 2022:1–13. https://doi.org/10.1155/2022/8238375

Chen J, Wang Z, Zheng C et al (2023) GaitAMR: cross-view gait recognition via aggregated multi-feature representation. Inf Sci 636:118920. https://doi.org/10.1016/j.ins.2023.03.145

Lee TK, Belkhatir M, Sanei S (2014) A comprehensive review of past and present vision-based techniques for gait recognition. Multimed Tools Appl 72:2833–2869

Verlekar TT, Soares LD, Correia PL (2018) Gait recognition in the wild using shadow silhouettes. Image Vis Comput 76:1–13

Ohri K, Kumar M (2021) Review on self-supervised image recognition using deep neural networks. Knowl-Based Syst 224:107090. https://doi.org/10.1016/j.knosys.2021.107090

Boulgouris NV, Hatzinakos D, Plataniotis KN (2005) Gait recognition: a challenging signal processing technology for biometric identification. IEEE Signal Process Mag 22(6):78–90

Talaei Khoei T, Ould Slimane H, Kaabouch N (2023) Deep learning: systematic review, models, challenges, and research directions. Neural Comput Appl 35:23103–23124. https://doi.org/10.1007/s00521-023-08957-4

Jabbar H, Khan RZ (2015) Methods to avoid over-fitting and under-fitting in supervised machine learning (comparative study). Comput Sci Commun Instrum Devices 70(10.3850):978–981

Montavon G, Samek W, Muller K (2018) Methods for interpreting and understanding deep neural networks. Dig Signal Process 73:1–15

Kirkpatrick J, Pascanu R, Rabinowitz N et al (2017) Overcoming catastrophic forgetting in neural networks. Proc Natl Acad Sci 114(13):3521–3526

Zhang Z, Luo C, Wu H et al (2022) From individual to whole: reducing intra-class variance by feature aggregation. Int J Comput Vis 130(3):800–819

Al Musalhi N, Çelebi E (2023) Age estimation in human gait extraction using a combination of multi-energy image with invariant moment. Preprints 2023060186. https://doi.org/10.20944/preprints202306.0186.v1

Funding

Open access funding provided by the Scientific and Technological Research Council of Türkiye (TÜBİTAK).

Author information

Authors and affiliations

Department of Computer Engineering, Kocaeli University, Kocaeli, Turkey

Pınar Güner Şahan, Suhap Şahin & Fidan Kaya Gülağız

Contributions

Study conception, design, and supervision: PGŞ, SŞ, FKG. Figure and table preparation: PGŞ. Materials: PGŞ, SŞ, FKG. Data collection and/or processing: PGŞ. Literature review, manuscript preparation, and critical review: PGŞ, SŞ, FKG.

Corresponding author

Correspondence to Pınar Güner Şahan.

Ethics declarations

Conflict of interest

The authors declare that they have no competing interests.

Ethical approval

Not applicable, as this article is a survey.

Additional information

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .

About this article

Güner Şahan, P., Şahin, S. & Kaya Gülağız, F. A survey of appearance-based approaches for human gait recognition: techniques, challenges, and future directions. J Supercomput (2024). https://doi.org/10.1007/s11227-024-06172-z

Accepted: 27 April 2024

Published: 15 May 2024

DOI: https://doi.org/10.1007/s11227-024-06172-z

Keywords

  • Gait recognition
  • Gait datasets
  • Artificial intelligence
  • Neural networks
