Towards High-Fidelity Face Swapping

Overview

Face swapping aims to transfer the identity of a source face onto a target while preserving target-specific attributes such as pose, expression, illumination, and background. Existing evaluations are fragmented across datasets, protocols, and implementation details, which makes fair comparison between methods difficult.

This project page accompanies our paper and provides additional materials beyond the manuscript, including dataset examples, benchmark protocols, quantitative summaries, and more qualitative comparisons across representative face swapping methods.

1,291 subjects

2,582 videos

2.83M frames

4K raw video resolution

Survey Taxonomy

We organize face swapping methods into five major paradigms according to their design principles and representation choices.

Evolutionary path of high-fidelity face swapping methods.

CASIA FaceSwapping Dataset

CASIA FaceSwapping is designed specifically for controlled and fine-grained face swapping evaluation. It contains balanced demographic distributions and explicit attribute variations, enabling systematic analysis of identity preservation, attribute consistency, visual fidelity, and robustness.

Subjects	1,291 identities
Videos	2,582 videos
Raw resolution	2160 × 3840
Demographics	Asian, African, and Caucasian
Variations	Normal, pose, expression, and illumination
Aligned images	Uniformly sampled and aligned face images for benchmark evaluation

Dataset examples showing ethnicity, pose, illumination, and expression variations.

Evaluation Protocols

We establish three standardized protocols to isolate different factors and provide interpretable evaluation. These protocols are designed to evaluate standard performance, demographic generalization, and robustness to dynamic attribute variations.

Protocol 1: Normal

Same ethnicity · normal recordings

Measures baseline face swapping performance under relatively controlled conditions.

4,500 pairs · baseline · same ethnicity

Protocol 2: Cross-ethnicity

Different ethnicities · normal recordings

Tests whether a method can generalize across demographic groups without identity leakage or appearance bias.

1,200 pairs · fairness · generalization

Protocol 3: Cross-attribute

Same ethnicity · pose / expression / illumination shifts

Evaluates robustness when the target contains challenging attribute variations.

4,300 pairs · robustness · attribute shifts

Protocol Summary

Protocol	Ethnicity	Attribute Setting	Pair Count	Evaluation Focus
Normal	Same	Normal	4,500	Identity transfer and target attribute preservation under standard conditions
Cross-ethnicity	Different	Normal	1,200	Demographic generalization and potential ethnicity-related bias
Cross-attribute	Same	Different attributes	4,300	Robustness to pose, expression, and illumination variations
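
The rules in the table can be read as a simple partition of candidate pairs. The sketch below is only an illustration of that reading: the `Clip` record, its field names, and the condition strings are hypothetical, and it assumes the protocol is decided by the ethnicity labels and the target's recording condition alone. The released benchmark may encode pairs differently.

```python
from dataclasses import dataclass
from typing import Optional

# Pair counts as listed in the protocol summary table.
PAIR_COUNTS = {"Normal": 4500, "Cross-ethnicity": 1200, "Cross-attribute": 4300}


@dataclass(frozen=True)
class Clip:
    """Hypothetical record for one recording; all field names are assumptions."""
    subject_id: str
    ethnicity: str   # e.g. "Asian", "African", "Caucasian"
    condition: str   # "normal", "pose", "expression", or "illumination"


def assign_protocol(source: Clip, target: Clip) -> Optional[str]:
    """Map a (source, target) pair to a protocol name, or None if no row applies."""
    same_ethnicity = source.ethnicity == target.ethnicity
    if target.condition == "normal":
        return "Normal" if same_ethnicity else "Cross-ethnicity"
    if same_ethnicity:
        return "Cross-attribute"
    # Different ethnicity combined with an attribute shift is not a row in the table.
    return None
```

Summing `PAIR_COUNTS` gives 10,000 pairs across the three protocols.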

Protocol visualization showing the three evaluation settings.

Evaluation Metrics

We evaluate face swapping methods from complementary perspectives, including identity preservation, target attribute consistency, image realism, and temporal stability.

Identity

ID Retrieval and ID Similarity measure whether the generated face preserves the source identity.
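
This page does not spell out how the two identity metrics are computed. A common formulation, assumed here, takes ID Similarity as the cosine similarity between face-recognition embeddings of the swapped result and the source, and ID Retrieval as the fraction of swapped results whose nearest gallery embedding belongs to the source identity. The function names and the embedding inputs below are hypothetical; the choice of face-recognition model is left open.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def id_retrieval_accuracy(
    swapped: List[Tuple[str, Vector]],
    gallery: Dict[str, Vector],
) -> float:
    """Top-1 retrieval accuracy (assumed definition).

    `swapped` holds (source_id, embedding of the swapped face) pairs;
    `gallery` maps each identity to one reference embedding.
    """
    hits = 0
    for source_id, embedding in swapped:
        # Pick the gallery identity most similar to the swapped face.
        best_id = max(
            gallery,
            key=lambda gid: cosine_similarity(embedding, gallery[gid]),
        )
        hits += best_id == source_id
    return hits / len(swapped)
```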

Attributes

Pose Error and Expression Error evaluate whether the generated result preserves the target pose and expression.
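
Attribute errors of this kind are often reported as a mean L2 distance between attribute vectors estimated from the swapped frame and from the target frame, for example head-pose angles or expression coefficients. The sketch below assumes that formulation; the estimators that produce the vectors, and the function names, are not taken from this page.

```python
import math
from typing import Sequence


def l2_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two attribute vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("attribute vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean_attribute_error(
    swapped_attrs: Sequence[Sequence[float]],
    target_attrs: Sequence[Sequence[float]],
) -> float:
    """Average L2 distance over paired (swapped, target) attribute vectors."""
    distances = [l2_distance(s, t) for s, t in zip(swapped_attrs, target_attrs)]
    return sum(distances) / len(distances)
```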

Realism & Stability

FID measures visual realism, while temporal consistency metrics evaluate video-level stability.
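
For reference, the standard definition of FID compares Gaussian statistics of deep image features from a reference set $r$ and the generated set $g$. Which reference images and feature extractor are used for this benchmark is not stated on this page, so the symbols below are generic.

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)
```

Here $\mu$ and $\Sigma$ are the mean and covariance of the feature vectors in each set; lower values indicate that the two feature distributions are closer.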

Radar chart summarizing identity preservation, pose/expression consistency, and FID under the three protocols.

Benchmark Results

Quantitative evaluation of 14 face swapping methods across three protocols. Identity preservation is measured by ID retrieval and ID similarity, while pose error, expression error, and FID reflect attribute preservation and generation quality.

Method	Protocol	ID Retrieval ↑	ID Similarity ↑	Pose Error ↓	Expr. Error ↓	FID ↓
HifiFace	Normal	93.37%	0.62	3.59	3.12	20.40
	Cross-ethnicity	93.23%	0.60	3.66	3.29	21.73
	Cross-attribute	83.57%	0.57	4.12	3.14	9.93
FSGAN	Normal	65.08%	0.50	3.34	2.35	56.23
	Cross-ethnicity	57.82%	0.44	3.44	2.50	58.06
	Cross-attribute	43.74%	0.40	4.08	2.37	40.83
FaceShifter	Normal	66.41%	0.44	5.26	3.64	169.11
	Cross-ethnicity	66.36%	0.43	5.32	3.78	172.31
	Cross-attribute	56.09%	0.40	6.38	3.67	151.61
BlendFace	Normal	73.35%	0.48	3.28	3.08	93.20
	Cross-ethnicity	70.60%	0.45	3.37	3.19	94.51
	Cross-attribute	64.83%	0.44	3.93	3.07	78.67
FaceDancer	Normal	72.81%	0.49	3.42	3.15	19.14
	Cross-ethnicity	78.74%	0.50	3.72	3.56	22.33
	Cross-attribute	62.49%	0.46	3.95	3.14	6.32
SimSwap	Normal	90.00%	0.61	2.14	2.43	21.75
	Cross-ethnicity	90.74%	0.58	2.21	2.63	24.01
	Cross-attribute	81.50%	0.55	2.43	2.42	7.86
CSCS	Normal	88.75%	0.63	3.81	3.41	33.28
	Cross-ethnicity	96.92%	0.65	4.11	3.72	36.17
	Cross-attribute	87.54%	0.60	4.47	3.44	21.23
InsightFace	Normal	96.92%	0.73	2.84	2.64	30.50
	Cross-ethnicity	97.19%	0.71	2.97	2.87	32.32
	Cross-attribute	95.14%	0.67	3.22	2.62	15.86
MegaFS	Normal	73.70%	0.50	5.09	2.96	23.69
	Cross-ethnicity	72.72%	0.49	5.09	3.15	25.93
	Cross-attribute	55.82%	0.44	5.98	3.02	18.24
FSLSD	Normal	15.52%	0.25	5.63	3.44	28.64
	Cross-ethnicity	13.95%	0.23	5.62	3.58	30.47
	Cross-attribute	11.95%	0.23	7.24	3.54	23.24
RAFSwap	Normal	87.77%	0.54	3.69	3.28	45.61
	Cross-ethnicity	86.00%	0.51	3.74	3.46	47.47
	Cross-attribute	72.40%	0.48	4.80	3.31	31.37
RGISwap	Normal	80.84%	0.53	4.00	3.41	18.77
	Cross-ethnicity	80.92%	0.52	4.03	3.58	21.28
	Cross-attribute	62.96%	0.46	4.90	3.54	13.94
DiffSwap	Normal	15.64%	0.32	3.67	2.88	96.55
	Cross-ethnicity	13.70%	0.27	3.74	2.97	97.52
	Cross-attribute	14.38%	0.30	4.14	2.89	87.11
FaceAdapter	Normal	95.49%	0.66	4.38	2.95	23.83
	Cross-ethnicity	94.74%	0.66	4.83	3.22	26.51
	Cross-attribute	88.81%	0.61	5.05	2.96	14.71
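
Rows with an empty Method cell continue the method named in the row above them. As a small sketch of how the table can be read programmatically, the code below parses an excerpt of the tab-separated rows, carries the method name forward, and computes the percentage-point drop in ID retrieval from the Normal to the Cross-attribute protocol. The function and key names are hypothetical; the numbers are copied from the table.

```python
from typing import Dict, List

# Excerpt of the table; continuation rows start with an empty Method cell.
ROWS: List[str] = [
    "HifiFace\tNormal\t93.37%\t0.62\t3.59\t3.12\t20.40",
    "\tCross-ethnicity\t93.23%\t0.60\t3.66\t3.29\t21.73",
    "\tCross-attribute\t83.57%\t0.57\t4.12\t3.14\t9.93",
    "InsightFace\tNormal\t96.92%\t0.73\t2.84\t2.64\t30.50",
    "\tCross-ethnicity\t97.19%\t0.71\t2.97\t2.87\t32.32",
    "\tCross-attribute\t95.14%\t0.67\t3.22\t2.62\t15.86",
]

Table = Dict[str, Dict[str, Dict[str, float]]]


def parse_rows(rows: List[str]) -> Table:
    """Return {method: {protocol: {metric: value}}} from tab-separated rows."""
    table: Table = {}
    method = ""
    for line in rows:
        cells = line.split("\t")
        if cells[0]:  # a non-empty first cell starts a new method
            method = cells[0]
        table.setdefault(method, {})[cells[1]] = {
            "id_retrieval": float(cells[2].rstrip("%")),
            "id_similarity": float(cells[3]),
            "pose_error": float(cells[4]),
            "expr_error": float(cells[5]),
            "fid": float(cells[6]),
        }
    return table


def retrieval_drop(table: Table, method: str) -> float:
    """Percentage-point drop in ID retrieval, Normal minus Cross-attribute."""
    scores = table[method]
    drop = scores["Normal"]["id_retrieval"] - scores["Cross-attribute"]["id_retrieval"]
    return round(drop, 2)


table = parse_rows(ROWS)
for name in table:
    print(name, retrieval_drop(table, name))
```

For this excerpt the drop is 9.80 points for HifiFace and 1.78 points for InsightFace.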

More Qualitative Results

We provide additional qualitative comparisons across the three protocols. Each row contains the source, target, and swapped outputs from representative methods, making it easier to inspect identity preservation, expression consistency, illumination adaptation, boundary artifacts, and failure modes.

Normal Protocol

Standard setting with same-ethnicity pairs under normal conditions. Most methods produce plausible results, but differences remain in identity fidelity and local artifacts.

Cross-ethnicity Protocol

Cross-demographic setting highlighting identity leakage, skin-tone inconsistency, and shading discontinuities.

Cross-attribute Protocol

Attribute-shift setting under challenging pose, expression, or illumination variations, where local warping and boundary artifacts are more likely.

Citation

BibTeX entry will be updated upon publication.

@article{li2026highfidelityfaceswapping,
  title   = {Towards High-Fidelity Face Swapping: A Comprehensive Survey and New Benchmark},
  author  = {Li, Qi and Wang, Weining and Du, Shuangjun and Peng, Bo and Dong, Jing and Wang, Kun and Sun, Zhenan and Yang, Ming-Hsuan},
  journal = {Pending},
  year    = {2026}
}