Deep Network Representations as Reliable Indicators of Synthetic Content in Audiovisual and Clinical Contexts

Karol JĘDRASIAK and Julia BIJOCH

WSB University, Poland

Abstract

This study introduces an interpretable framework for detecting synthetic audiovisual content using deep neural representations, applied to the DeepFake RealWorld (DFRW) dataset (46 371 clips; 77% with audio). Visual, acoustic, and cross-modal embeddings from ResNet, Vision Transformer, SlowFast, Wav2Vec2, and ECAPA-TDNN were evaluated with frequency-based metrics (Δp ≥ 0.15, PR ≥ 1.5). The strongest indicators were facial embedding variance (Δp = 0.29, PR = 3.4), Mahalanobis distance (Δp = 0.25), and audiovisual coherence (Δp = 0.23), all of which remained stable within 15% under compression and re-capture. In teledentistry and telemedicine, such explainable-AI markers enhance authenticity verification of digital evidence and strengthen medico-legal reliability.
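As a minimal sketch of how the reported statistics could be computed: assuming Δp denotes the difference in a marker's prevalence between synthetic and real clips, PR their prevalence ratio, and the Mahalanobis distance is taken from an embedding to a reference (real-class) distribution. The abstract does not spell out these definitions or the underlying frequencies, so the formulas, function names, and example numbers below are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def frequency_metrics(p_synthetic: float, p_real: float) -> tuple[float, float]:
    """Assumed definitions: Δp = |p_synthetic - p_real|, PR = p_synthetic / p_real,
    where each p is the fraction of clips in which the marker fires."""
    delta_p = abs(p_synthetic - p_real)
    pr = p_synthetic / p_real
    return delta_p, pr

def mahalanobis_distance(x: np.ndarray, mean: np.ndarray, cov: np.ndarray) -> float:
    """Distance of one embedding x from a reference distribution (mean, cov),
    e.g. an embedding cloud estimated from authentic clips."""
    diff = x - mean
    return float(np.sqrt(diff @ np.linalg.inv(cov) @ diff))

# Illustrative prevalences (hypothetical, not from the paper's data):
dp, pr = frequency_metrics(p_synthetic=0.41, p_real=0.12)
print(f"Δp = {dp:.2f}, PR = {pr:.2f}")

# With an identity covariance, Mahalanobis reduces to Euclidean distance:
d = mahalanobis_distance(np.array([3.0, 4.0]), np.zeros(2), np.eye(2))
print(f"Mahalanobis distance = {d:.1f}")
```

Under a screening rule like the one quoted (Δp ≥ 0.15 and PR ≥ 1.5), a marker would be kept only if both inequalities hold for its measured prevalences.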

Keywords: Deepfake detection; deep neural representations; multimodal coherence; audiovisual forensics; explainable AI; telemedicine; teledentistry; data integrity