Overview
Face swapping aims to transfer the identity of a source face onto a target while preserving target-specific attributes such as pose, expression, illumination, and background. Existing evaluations are often fragmented across datasets, protocols, and implementation details, which makes fair comparison between methods difficult.
This project page accompanies our paper and provides additional materials beyond the manuscript, including dataset examples, benchmark protocols, quantitative summaries, and more qualitative comparisons across representative face swapping methods.
Survey Taxonomy
We organize face swapping methods into five major paradigms according to their design principles and representation choices.
Evolutionary path of high-fidelity face swapping methods.
CASIA FaceSwapping Dataset
CASIA FaceSwapping is designed specifically for controlled, fine-grained evaluation of face swapping. It has a balanced demographic distribution and explicit attribute variations, enabling systematic analysis of identity preservation, attribute consistency, visual fidelity, and robustness.
| Property | Value |
|---|---|
| Subjects | 1,291 identities |
| Videos | 2,582 videos |
| Raw resolution | 2160 × 3840 pixels |
| Demographics | Asian, African, and Caucasian |
| Variations | Normal, pose, expression, and illumination |
| Aligned images | Uniformly sampled and aligned face images for benchmark evaluation |
Dataset examples showing ethnicity, pose, illumination, and expression variations.
Evaluation Protocols
We establish three standardized protocols to isolate different factors and provide interpretable evaluation. These protocols are designed to evaluate standard performance, demographic generalization, and robustness to dynamic attribute variations.
Protocol 1: Normal
Same ethnicity · normal recordings
Measures baseline face swapping performance under relatively controlled conditions.
Protocol 2: Cross-ethnicity
Different ethnicities · normal recordings
Tests whether a method can generalize across demographic groups without identity leakage or appearance bias.
Protocol 3: Cross-attribute
Same ethnicity · pose / expression / illumination shifts
Evaluates robustness when the target contains challenging attribute variations.
Protocol Summary
| Protocol | Ethnicity | Attribute Setting | Pair Count | Evaluation Focus |
|---|---|---|---|---|
| Normal | Same | Normal | 4,500 | Identity transfer and target attribute preservation under standard conditions |
| Cross-ethnicity | Different | Normal | 1,200 | Demographic generalization and potential ethnicity-related bias |
| Cross-attribute | Same | Different attributes | 4,300 | Robustness to pose, expression, and illumination variations |
Protocol visualization showing the three evaluation settings.
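As a rough illustration of how the three settings partition source/target pairs, the rules in the summary table can be sketched as a small classifier. This is not the benchmark's released code: the `Clip` record, its field names, and the assumption that the source is always a normal recording are hypothetical choices made for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Clip:
    """Hypothetical record for one recording; field names are assumptions."""
    identity: str
    ethnicity: str   # e.g. "Asian", "African", "Caucasian"
    variation: str   # "normal", "pose", "expression", or "illumination"


def assign_protocol(source: Clip, target: Clip) -> Optional[str]:
    """Return the protocol a (source, target) pair falls under, or None."""
    if source.identity == target.identity:
        return None  # a swap needs two different identities
    if source.variation != "normal":
        return None  # assumption: sources are always normal recordings
    same_ethnicity = source.ethnicity == target.ethnicity
    if target.variation == "normal":
        return "Normal" if same_ethnicity else "Cross-ethnicity"
    # Target has a pose / expression / illumination shift.
    return "Cross-attribute" if same_ethnicity else None


src = Clip("id_0001", "Asian", "normal")
print(assign_protocol(src, Clip("id_0002", "Asian", "normal")))      # Normal
print(assign_protocol(src, Clip("id_0003", "African", "normal")))    # Cross-ethnicity
print(assign_protocol(src, Clip("id_0002", "Asian", "pose")))        # Cross-attribute
```

Pairs that mix a different ethnicity with an attribute shift return `None` here, because none of the three protocols in the table covers that combination.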
Evaluation Metrics
We evaluate face swapping methods from complementary perspectives, including identity preservation, target attribute consistency, image realism, and temporal stability.
Identity
ID Retrieval and ID Similarity measure whether the generated face preserves the source identity.
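A minimal sketch of how these two identity metrics are commonly computed, assuming each face has already been mapped to an embedding vector by some face recognition model. The choice of cosine similarity, top-1 retrieval, and the function names below are assumptions for illustration; the exact definitions used in the benchmark may differ.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def id_similarity(swapped, sources):
    """Mean cosine similarity between each swapped face and its own source."""
    sims = [cosine_similarity(s, src) for s, src in zip(swapped, sources)]
    return sum(sims) / len(sims)


def id_retrieval(swapped, source_indices, gallery):
    """Fraction of swapped faces whose nearest gallery entry is the true source."""
    hits = 0
    for emb, true_idx in zip(swapped, source_indices):
        best = max(range(len(gallery)),
                   key=lambda i: cosine_similarity(emb, gallery[i]))
        if best == true_idx:
            hits += 1
    return hits / len(swapped)


# Toy 2-D embeddings: two gallery identities and two swapped results.
gallery = [[1.0, 0.0], [0.0, 1.0]]
swapped = [[0.9, 0.1], [0.8, 0.2]]   # the second result drifts toward identity 0
print(id_retrieval(swapped, [0, 1], gallery))   # 0.5
```

Retrieval is a rank-based check against all gallery identities, while similarity is a per-pair score, so the two can disagree when a swapped face is close to its source but even closer to another identity.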
Attributes
Pose Error and Expression Error evaluate whether the generated result preserves the target pose and expression.
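One common way to compute such errors is the mean Euclidean distance between attribute vectors estimated on the target and on the swapped result. The sketch below assumes that form; the vector contents (for example yaw, pitch, and roll for pose, or expression coefficients) and the estimator that produces them are not specified here and are treated as given inputs.

```python
import math


def mean_l2_error(target_vectors, swapped_vectors):
    """Mean Euclidean distance between paired attribute vectors."""
    if not target_vectors or len(target_vectors) != len(swapped_vectors):
        raise ValueError("need two non-empty lists of equal length")
    distances = [math.dist(t, s) for t, s in zip(target_vectors, swapped_vectors)]
    return sum(distances) / len(distances)


# Hypothetical (yaw, pitch, roll) estimates for two target/swapped pairs.
target_pose = [[0.0, 0.0, 0.0], [10.0, 5.0, 1.0]]
swapped_pose = [[3.0, 4.0, 0.0], [10.0, 5.0, 1.0]]
print(mean_l2_error(target_pose, swapped_pose))   # 2.5
```

The same function applies to expression vectors; only the estimator that produces the inputs changes.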
Realism & Stability
FID measures visual realism, while temporal consistency metrics evaluate video-level stability.
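For reference, FID is conventionally defined as the Fréchet distance between two Gaussians fitted to deep features of real and generated images. In the standard formulation, with $(\mu_r, \Sigma_r)$ and $(\mu_g, \Sigma_g)$ denoting the feature mean and covariance of the real and generated sets:

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\left( \Sigma_r \Sigma_g \right)^{1/2} \right)
```

Lower values indicate that the generated feature distribution is closer to the real one. Which feature extractor and reference image set the benchmark uses is not stated on this page, so the symbols above follow the generic definition only.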
Radar chart summarizing identity preservation, pose/expression consistency, and FID under the three protocols.
Benchmark Results
Quantitative evaluation of 14 face swapping methods across three protocols. Identity preservation is measured by ID retrieval and ID similarity, while pose error, expression error, and FID reflect attribute preservation and generation quality.
| Method | Protocol | ID Retrieval ↑ | ID Similarity ↑ | Pose Error ↓ | Expr. Error ↓ | FID ↓ |
|---|---|---|---|---|---|---|
| HifiFace | Normal | 93.37% | 0.62 | 3.59 | 3.12 | 20.40 |
| HifiFace | Cross-ethnicity | 93.23% | 0.60 | 3.66 | 3.29 | 21.73 |
| HifiFace | Cross-attribute | 83.57% | 0.57 | 4.12 | 3.14 | 9.93 |
| FSGAN | Normal | 65.08% | 0.50 | 3.34 | 2.35 | 56.23 |
| FSGAN | Cross-ethnicity | 57.82% | 0.44 | 3.44 | 2.50 | 58.06 |
| FSGAN | Cross-attribute | 43.74% | 0.40 | 4.08 | 2.37 | 40.83 |
| Faceshifter | Normal | 66.41% | 0.44 | 5.26 | 3.64 | 169.11 |
| Faceshifter | Cross-ethnicity | 66.36% | 0.43 | 5.32 | 3.78 | 172.31 |
| Faceshifter | Cross-attribute | 56.09% | 0.40 | 6.38 | 3.67 | 151.61 |
| BlendFace | Normal | 73.35% | 0.48 | 3.28 | 3.08 | 93.20 |
| BlendFace | Cross-ethnicity | 70.60% | 0.45 | 3.37 | 3.19 | 94.51 |
| BlendFace | Cross-attribute | 64.83% | 0.44 | 3.93 | 3.07 | 78.67 |
| FaceDancer | Normal | 72.81% | 0.49 | 3.42 | 3.15 | 19.14 |
| FaceDancer | Cross-ethnicity | 78.74% | 0.50 | 3.72 | 3.56 | 22.33 |
| FaceDancer | Cross-attribute | 62.49% | 0.46 | 3.95 | 3.14 | 6.32 |
| SimSwap | Normal | 90.00% | 0.61 | 2.14 | 2.43 | 21.75 |
| SimSwap | Cross-ethnicity | 90.74% | 0.58 | 2.21 | 2.63 | 24.01 |
| SimSwap | Cross-attribute | 81.50% | 0.55 | 2.43 | 2.42 | 7.86 |
| CSCS | Normal | 88.75% | 0.63 | 3.81 | 3.41 | 33.28 |
| CSCS | Cross-ethnicity | 96.92% | 0.65 | 4.11 | 3.72 | 36.17 |
| CSCS | Cross-attribute | 87.54% | 0.60 | 4.47 | 3.44 | 21.23 |
| InsightFace | Normal | 96.92% | 0.73 | 2.84 | 2.64 | 30.50 |
| InsightFace | Cross-ethnicity | 97.19% | 0.71 | 2.97 | 2.87 | 32.32 |
| InsightFace | Cross-attribute | 95.14% | 0.67 | 3.22 | 2.62 | 15.86 |
| MegaFS | Normal | 73.70% | 0.50 | 5.09 | 2.96 | 23.69 |
| MegaFS | Cross-ethnicity | 72.72% | 0.49 | 5.09 | 3.15 | 25.93 |
| MegaFS | Cross-attribute | 55.82% | 0.44 | 5.98 | 3.02 | 18.24 |
| FSLSD | Normal | 15.52% | 0.25 | 5.63 | 3.44 | 28.64 |
| FSLSD | Cross-ethnicity | 13.95% | 0.23 | 5.62 | 3.58 | 30.47 |
| FSLSD | Cross-attribute | 11.95% | 0.23 | 7.24 | 3.54 | 23.24 |
| RAFSwap | Normal | 87.77% | 0.54 | 3.69 | 3.28 | 45.61 |
| RAFSwap | Cross-ethnicity | 86.00% | 0.51 | 3.74 | 3.46 | 47.47 |
| RAFSwap | Cross-attribute | 72.40% | 0.48 | 4.80 | 3.31 | 31.37 |
| RGISwap | Normal | 80.84% | 0.53 | 4.00 | 3.41 | 18.77 |
| RGISwap | Cross-ethnicity | 80.92% | 0.52 | 4.03 | 3.58 | 21.28 |
| RGISwap | Cross-attribute | 62.96% | 0.46 | 4.90 | 3.54 | 13.94 |
| DiffSwap | Normal | 15.64% | 0.32 | 3.67 | 2.88 | 96.55 |
| DiffSwap | Cross-ethnicity | 13.70% | 0.27 | 3.74 | 2.97 | 97.52 |
| DiffSwap | Cross-attribute | 14.38% | 0.30 | 4.14 | 2.89 | 87.11 |
| FaceAdapter | Normal | 95.49% | 0.66 | 4.38 | 2.95 | 23.83 |
| FaceAdapter | Cross-ethnicity | 94.74% | 0.66 | 4.83 | 3.22 | 26.51 |
| FaceAdapter | Cross-attribute | 88.81% | 0.61 | 5.05 | 2.96 | 14.71 |
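One way to read the table is to look at how much each method's ID Retrieval drops when moving from the Normal protocol to a harder one. The sketch below does this for three of the 14 methods, with the numbers copied from the table above; the helper name `drop_from_normal` is hypothetical, and this is an illustrative reading aid rather than an analysis from the paper.

```python
# ID Retrieval (%) for three of the 14 methods, copied from the results table.
ID_RETRIEVAL = {
    "HifiFace":    {"Normal": 93.37, "Cross-ethnicity": 93.23, "Cross-attribute": 83.57},
    "SimSwap":     {"Normal": 90.00, "Cross-ethnicity": 90.74, "Cross-attribute": 81.50},
    "InsightFace": {"Normal": 96.92, "Cross-ethnicity": 97.19, "Cross-attribute": 95.14},
}


def drop_from_normal(scores, protocol):
    """Percentage-point drop in each method's score relative to Normal."""
    return {method: round(by_protocol["Normal"] - by_protocol[protocol], 2)
            for method, by_protocol in scores.items()}


drops = drop_from_normal(ID_RETRIEVAL, "Cross-attribute")
for method, drop in sorted(drops.items(), key=lambda item: item[1]):
    print(f"{method}: {drop:.2f} points")
# InsightFace: 1.78 points
# SimSwap: 8.50 points
# HifiFace: 9.80 points
```

A negative value would mean the method scored higher on the harder protocol than on Normal, as SimSwap and InsightFace do for Cross-ethnicity.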
More Qualitative Results
We provide additional qualitative comparisons across the three protocols. Each row contains the source, target, and swapped outputs from representative methods, making it easier to inspect identity preservation, expression consistency, illumination adaptation, boundary artifacts, and failure modes.
Normal Protocol
Standard setting with same-ethnicity pairs under normal conditions. Most methods produce plausible results, but differences remain in identity fidelity and local artifacts.
Cross-ethnicity Protocol
Cross-demographic setting highlighting identity leakage, skin-tone inconsistency, and shading discontinuities.
Cross-attribute Protocol
Attribute-shift setting under challenging pose, expression, or illumination variations, where local warping and boundary artifacts are more likely.
Citation
The BibTeX entry will be updated upon publication.
@article{li2026highfidelityfaceswapping,
title = {Towards High-Fidelity Face Swapping: A Comprehensive Survey and New Benchmark},
author = {Li, Qi and Wang, Weining and Du, Shuangjun and Peng, Bo and Dong, Jing and Wang, Kun and Sun, Zhenan and Yang, Ming-Hsuan},
journal = {Pending},
year = {2026}
}