
AI-generated voice cloning has moved from proof-of-concept to production. In the last quarter of 2024, roughly one in three US consumers reported encountering some form of synthetic-voice fraud, and a significant share suffered financial losses.
What began as isolated incidents has matured into an industrialized workflow, with breached data, low-cost text-to-speech, and automated bot-dialing consistently defeating legacy checks.
Generative AI tools can also replicate a person’s speech pattern, cadence, and accent from just a few seconds of recorded audio. The barrier to entry is low, the models are widely available, and the contact center remains a channel where voice is the only viable way to combine real security with minimal friction.
President, Chief Product Officer (CPO), and a member of the Board of Directors for Daon.
Despite predictions that automation would make call centers obsolete, the data show otherwise. Phone-based service remains a preferred channel for many high-value or high-risk transactions, and according to Gartner, only around 10% of agent interactions are expected to be fully automated by 2026.
This persistence makes contact centers attractive to attackers – they combine a high concentration of sensitive interactions with legacy verification processes such as knowledge-based authentication (KBA) and basic voice matching.
Even contact centers that do use voice matching often employ less sophisticated versions that are susceptible to modern fraud techniques.
Fraudsters can now compile personal dossiers from breached data and open-source information, feed them into AI voice generators, and launch coordinated campaigns that overwhelm legacy defenses.
For organizations still relying on static KBA or a single voiceprint check with no fraud detection, the attack surface has effectively multiplied overnight.
Weak links in legacy verification
Most contact centers still depend on first-generation verification tools that were never designed to withstand high-frequency, AI-powered attacks.
Knowledge-based authentication remains common because it’s inexpensive and familiar, but the information it relies on, such as dates of birth, addresses, or security questions, is readily available through breached data sets or social media.
Once an attacker has the data, passing a KBA check requires little more than persistence. Generative AI compounds the problem by automating both reconnaissance and execution, enabling large-scale attempts that test every weak link in the chain.
Combining mass data collection with a voice bot eliminates one of the most basic security checks call center agents rely on: “Does this sound like a 32-year-old woman from New York?” A voice bot can sound like anyone it needs to; a human fraudster can’t.
Where voice biometrics are deployed as single-factor template matching without liveness or synthetic-speech analysis, automatic speaker verification (ASV) engines can be spoofed by high-quality text-to-speech (TTS) or injected audio.
These systems analyze pitch, tone, and rhythm to verify a speaker, but alone they offer limited resistance to synthetic speech.
AI models can now reproduce the acoustic characteristics of a target’s voice closely enough to trigger a match, especially when the system lacks real-time analysis for liveness or replay fingerprints such as abnormal jitter/packet-loss patterns, codec hops that don’t match the endpoint, missing near-field room response, and telltale device graphs (virtual audio drivers).
Some attacks also bypass the microphone entirely through injection, feeding a recorded or generated sample directly into the communication channel (e.g., TTS audio injected at the SIP/RTP layer, softphone virtual-audio devices, or middleware that substitutes the live stream).
Without controls that pair real-time PAD (Presentation Attack Detection: micro-prosody, phase, and aperiodicity checks) with network integrity signals (ANI spoofing checks, SIP header sanity, RTP timing) and endpoint attestation to block virtual-device and softphone-driver paths, even well-trained biometric engines can be deceived.
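The pairing of controls described above can be sketched as a simple gate that runs before any voiceprint comparison. This is a minimal illustration, not a vendor implementation: the `CallSignals` fields, the `pad_threshold` value, and both function names are assumptions made up for the example, and the upstream detectors that would produce these signals are out of scope.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class CallSignals:
    """Hypothetical per-call signals; field names are illustrative only."""
    pad_score: float            # 0.0 (likely synthetic) .. 1.0 (likely live)
    ani_matches_carrier: bool   # caller ID is consistent with carrier data
    sip_headers_sane: bool      # no malformed or contradictory SIP headers
    rtp_timing_regular: bool    # packet timing is plausible for the endpoint
    endpoint_attested: bool     # audio path is not a virtual device or driver


def failed_controls(signals: CallSignals, pad_threshold: float = 0.8) -> list[str]:
    """Return the names of the control layers that did not pass."""
    failures = []
    if signals.pad_score < pad_threshold:  # threshold is an assumed value
        failures.append("pad")
    if not (signals.ani_matches_carrier
            and signals.sip_headers_sane
            and signals.rtp_timing_regular):
        failures.append("network_integrity")
    if not signals.endpoint_attested:
        failures.append("endpoint_attestation")
    return failures


def may_attempt_biometric_match(signals: CallSignals) -> bool:
    """Allow the voiceprint comparison only when every layer passes."""
    return not failed_controls(signals)
```

In this sketch, a call with a convincing voice but irregular RTP timing and an unattested endpoint never reaches the biometric engine, which is the point of pairing the layers rather than trusting the match alone.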
The result is a widening gap between the sophistication of fraud tools and the static nature of many existing verification processes.
Recent headlines, including Sam Altman’s warning of an impending “AI fraud crisis,” have fueled doubts about whether voice biometrics can still be trusted in the age of generative AI. Much of that skepticism, however, reflects outdated assumptions.
Modern voice biometric systems no longer rely solely on static voiceprints; they analyze liveness, acoustic integrity, and contextual signals in parallel to distinguish a human caller from a synthesized one.
When deployed as part of a layered and adaptive framework, voice remains one of the most powerful anchors of digital identity, capable of combining convenience with real-time fraud intelligence that passwords or PINs simply can’t deliver.
Layered and adaptive authentication models
Effective defense in the contact center requires multiple, interdependent layers that verify not just who is speaking but how and from where the interaction occurs. Multi-layered fraud detection applies continuously across every call, correlating signals from voice analysis, device intelligence, network attributes, and behavioral patterns.
For instance, synthetic voice detection can flag anomalies in frequency or modulation that indicate machine generation before any biometric match is attempted. At the same time, device or network analytics can expose inconsistencies in caller origin, routing, or latency, each a potential indicator of tampering or injection.
Modern PAD inspects micro-prosody (phoneme-to-phoneme timing, jitter/shimmer stability, aperiodicity), spectral cues (formant continuity, spectral tilt, harmonics-to-noise ratio), and coarticulation realism across syllables.
It also looks for TTS/replay artifacts: over-smoothed F0 contours, breath/noise misplacement, phase discontinuities, and room/loopback mismatches that betray speaker-through-mic vs. line-level injection.
Cross-checks include codec-hop consistency (PSTN 8 kHz <-> VoIP 16 kHz), ASR-prosody coherence (does the timing stress match the transcript?), and anti-replay indicators (near-field vs. far-field response). These independent layers overlap, reducing blind spots that any single control might miss.
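To make the jitter/shimmer cue mentioned above concrete, the sketch below computes local jitter and local shimmer using their common definitions: the mean absolute difference between consecutive glottal periods (or peak amplitudes), divided by the mean period (or amplitude). The `looks_over_smoothed` heuristic and its thresholds are assumptions for illustration only; a production PAD engine would calibrate against labeled data and combine many more cues. Extracting the per-cycle periods and amplitudes from audio is assumed to happen upstream.

```python
from statistics import mean


def local_perturbation(values):
    """Mean absolute difference of consecutive values, relative to their mean."""
    if len(values) < 2:
        raise ValueError("need at least two cycles")
    diffs = [abs(b - a) for a, b in zip(values, values[1:])]
    return mean(diffs) / mean(values)


def looks_over_smoothed(periods_s, amplitudes,
                        min_jitter=0.001, min_shimmer=0.005):
    """Flag voiced segments whose cycle-to-cycle variation is implausibly low.

    periods_s:  glottal cycle durations in seconds (from an upstream pitch tracker)
    amplitudes: peak amplitude of each cycle
    The two thresholds are hypothetical placeholders, not calibrated values.
    """
    jitter = local_perturbation(periods_s)
    shimmer = local_perturbation(amplitudes)
    return jitter < min_jitter and shimmer < min_shimmer
```

A perfectly regular sequence of periods and amplitudes yields zero jitter and shimmer and is flagged, while a sequence with ordinary cycle-to-cycle variation is not. On its own this is a weak signal, which is why the article pairs it with spectral, codec, and replay checks.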
Step-up authentication operates alongside these defenses but follows a different principle. It activates when a specific action or signal raises the risk threshold, prompting an escalation to a stronger verification factor.
A low-risk inquiry may clear on voice and device signals alone, while a high-value transfer could trigger an app-based biometric prompt or out-of-band confirmation.
Properly implemented, this ensures that friction is proportional to risk: low-value transactions experience minimal interruption, while suspicious activity or high-value transactions are met with additional scrutiny.
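The risk-proportional escalation described above might look like the following policy function. It is a sketch under stated assumptions: the `Factor` names, the `risk_score` scale (0.0 to 1.0), and every threshold are hypothetical and would in practice come from an organization's own risk model.

```python
from enum import IntEnum


class Factor(IntEnum):
    """Verification strength, ordered from least to most friction."""
    PASSIVE_VOICE_AND_DEVICE = 1
    APP_BIOMETRIC_PROMPT = 2
    OUT_OF_BAND_CONFIRMATION = 3


def required_factor(risk_score: float, amount: float,
                    high_value: float = 10_000.0,
                    elevated_risk: float = 0.5,
                    severe_risk: float = 0.8) -> Factor:
    """Pick the weakest factor that is acceptable for this interaction.

    risk_score: fused fraud score in [0.0, 1.0] from the monitoring layers
    amount:     transaction value; 0 for a simple inquiry
    All thresholds are assumed values for illustration.
    """
    if risk_score >= severe_risk:
        return Factor.OUT_OF_BAND_CONFIRMATION
    if risk_score >= elevated_risk or amount >= high_value:
        return Factor.APP_BIOMETRIC_PROMPT
    return Factor.PASSIVE_VOICE_AND_DEVICE
```

With these example thresholds, a low-risk balance inquiry clears on passive signals, a large transfer triggers an app-based prompt even when the call looks clean, and a severely suspicious call is escalated to out-of-band confirmation regardless of value.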
Together, continuous multi-layered monitoring and intelligent step-up workflows create a dynamic model of trust that is capable of adapting to threats without undermining the customer experience.
Preparing for tomorrow’s sustained threats
Synthetic voice fraud will not disappear. It will simply evolve. As voice generation models improve, their acoustic signatures become harder to distinguish from legitimate speech, narrowing the margin for error in detection.
Contact centers should therefore treat voice as a valuable but partial signal – an anchor within a broader identity framework that integrates biometric, behavioral, and contextual intelligence.
The risk cannot be eliminated entirely, but it can be contained through layered defenses that adjust to real-time conditions.
Creating and maintaining this balance requires both technical investment and operational discipline. Security teams need to test detection layers against new attack methods, refine escalation thresholds, and ensure that identity data flows securely between systems without creating new exposure points.
The most resilient environments are those where authentication, fraud detection, and customer experience teams operate cohesively, supported by a shared risk model and unified policy framework.
As the threat landscape continues to change, this adaptive, continuous approach will determine which organizations can protect customer trust while preserving the accessibility and responsiveness that voice-based service still provides.
This article was produced as part of TechRadarPro’s Expert Insights channel where we feature the best and brightest minds in the technology industry today. The views expressed here are those of the author and are not necessarily those of TechRadarPro or Future plc. If you are interested in contributing find out more here: https://www.techradar.com/news/submit-your-story-to-techradar-pro




