The Enhanced Voice Services Codec - Improving audio for the Internet of Talk
OK, I’ll be the first to admit it -- I have a somewhat unhealthy fascination with codecs. While I’ve always been one for a thrilling tale of analog-to-digital quantization, my head pretty much exploded when I first read about the mechanics behind G.729, aka CELP: Code-excited linear prediction. I mean, seriously, how cool is that name? Even now, when I read it, my mind is awash with the prospect of enthusiastic bits of data foretelling their future (albeit conservative) path. I was always so happy for them. And, of course, that wasn’t the end of CELP’s story. These bits quickly became math whizzes (Algebraic/ACELP), which then became quicker (Low Delay/LD-CELP) but then formed some sort of strange growth (conjugate-structure/CS-ACELP), which I admit was a little sad. What I learned about CELP, circa 1995, was from technical “books” -- typically heavy objects costing in the region of one million dollars, in today’s money.1
With this knowledge, it should come as no surprise, therefore, that digging into the details of a new codec is a fun time for me. Indeed, it’s not the first time since I’ve been blogging here at Metaswitch, having espoused the merits of SILK (of Microsoft Skype and Metaswitch Accession fame) in previous posts.2 While this one was originated by the stodgy 3GPP and therefore is assigned an appropriately boring name, the specifications and rationale behind the Enhanced Voice Services (EVS) codec are nothing of the sort. For one thing, like its ‘adaptive multi-rate’ (AMR) predecessor, EVS employs ACELP at its core, meaning that those books were money well spent, if only I could remember anything I’d read.
A Short History of EVS
About two and a half years ago, SA4 (the technical specifications working group within the 3GPP focused on codecs) ratified EVS as the next-generation conversational codec for Release 12 and on. “Why on earth would they do that?” I hear you cry. Why not simply employ Opus, the open source successor of SILK and the preferred codec of WebRTC, introduced by Google and ratified by the IETF in RFC 6716? The answer lies in the patent courts, with vendors in the mobile space claiming intellectual property infringements, much like they did with a precursor called VP8. Those cases were eventually tossed (or “dismissed,” as I believe it’s officially referred to) and in the meantime EVS advanced through the 3GPP. That’s not to say EVS shouldn’t have been. Yes, it is, ultimately, another licensed codec, but it was designed specifically for mobile infrastructures and therefore tuned to the needs of mobile operators and emerging mobile use cases. Put another way, it has some redeeming qualities for the Internet of Talk.
A Quick Codec Sidebar
We categorize codecs by their ability to support four grades of voice and audio transmission. Narrowband is the classic POTS quality we know and love: that thin and tinny noise that grates on our nerves after listening to it for 20 minutes. Or is that only when I’m talking? Wideband (or HD voice) is what we are increasingly coming to expect, courtesy of OTT communications applications. Both of these are categories for voice transmission and not appropriate for audio. Super wideband encompasses the sort of frequency range that makes music listenable but has not been particularly exploited outside the fixed line business communications market. Fullband, however, is an audiophile’s dream, which is why we see streaming music codecs like AAC and MP3 in that category.
The grades of voice and audio transmission with examples of supporting codecs
Naturally, there are differences between the codecs listed, even if they are in the same category. G.729, which started my voice compression love affair, operates at between 6.4 and 11.8 kbps with a MOS at 8 kbps of around 3.92 and a 10ms compression delay, but scores high on the complexity scale, which means more processing power. AMR-NB is low complexity and can operate at lower bandwidths (4.75 to 12.2 kbps) while achieving the same MOS but with double the compression delay. Of course, most codecs come kitted out with discontinuous transmission (DTX), voice activity detection (VAD) and comfort noise generation (CNG), which together reduce bandwidth utilization by suspending audio transmission when no one is talking and having the far end play comfort noise instead. The efficiency and accuracy of these features have improved over time.3
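To make the VAD/DTX idea a little more concrete, here is a minimal sketch in Python. It is not how G.729, AMR or any other real codec detects speech (their detectors are far more sophisticated); it simply labels each frame by its energy. The 20ms frame length, the 8 kHz sampling rate, the -40 dB threshold and every function name are assumptions made up for illustration.

```python
import math

FRAME_MS = 20          # assumed frame length in milliseconds
SAMPLE_RATE = 8000     # narrowband sampling rate in Hz (assumption)
THRESHOLD_DB = -40.0   # hypothetical energy threshold, dB relative to full scale


def frame_energy_db(frame):
    """Mean energy of a frame of samples in [-1.0, 1.0], in dB re full scale."""
    if not frame:
        return float("-inf")
    mean_sq = sum(s * s for s in frame) / len(frame)
    return 10.0 * math.log10(mean_sq) if mean_sq > 0 else float("-inf")


def dtx_decisions(samples, threshold_db=THRESHOLD_DB):
    """Label each whole frame 'SPEECH' (send coded audio) or 'SILENCE'
    (stop sending; the receiver plays comfort noise instead)."""
    frame_len = SAMPLE_RATE * FRAME_MS // 1000  # 160 samples per frame
    labels = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        active = frame_energy_db(frame) > threshold_db
        labels.append("SPEECH" if active else "SILENCE")
    return labels


if __name__ == "__main__":
    # One frame of a 440 Hz tone followed by one frame of digital silence.
    tone = [0.5 * math.sin(2 * math.pi * 440 * n / SAMPLE_RATE) for n in range(160)]
    silence = [0.0] * 160
    print(dtx_decisions(tone + silence))
```

In a real deployment the "SILENCE" frames would typically be replaced by an occasional, very small comfort-noise update rather than nothing at all, which is where the bandwidth saving comes from.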
The need for EVS
With its dependency on the broad availability of voice over LTE (VoLTE), the roll-out of wideband HD voice is only now beginning at any discernible scale. Why, then, are we already talking about fullband? The fact is that the delays in deploying mobile packet voice4 have put cellular operators on the back foot and they must now accelerate the deployment of enhanced audio features in order to keep their voice services relevant. Sound quality and reliability are key -- and indeed critical if voice is to remain relevant to today’s (and tomorrow’s) generation of mobile users. It’s not only voice quality, though, that is important. Ubiquitous HD through the broad implementation of AMR-WB would take care of that. To the near-future generation of subscribers, raised on the broadcast lifestyle, the fullband experience will be essential. Simply put, the success of the Internet of Talk will hinge not only on the ubiquity of high-quality speech but also on the clarity of streaming music, which might be pushed between callers, along with the flawless transmission of the ambient background noises that add to the context of a call. Or “Hey! Listen to this band!”
All right, but we have perfectly good fullband codecs right now, in the form of AAC and MP3, etc. Why do we need a new one? The problem is that these codecs are continually running at full-tilt. They are perfect for playing Panic! At The Disco but a little overkill for even the greatest oratory elocution. EVS and Opus are a little more ingenious in their execution in that they actually comprise two encoding and decoding techniques, one oriented toward real-time speech and another targeted at streaming audio. In the case of Opus, talk (up to wideband / HD) is transmitted using SILK, while EVS employs a wideband CELP extension, of G.729.1 and G.718 fame. For music and environmental (background) noise, both leverage modified discrete cosine transform (MDCT), a technique already employed by pretty much every other streaming music codec. According to the EVS datasheet, “content-dependent on-the-fly switching between speech and audio compression” occurs in as little as 32ms -- a first, according to the developer’s marketing department.5 As open source, Opus doesn’t have a team of highly experienced and invaluable marketeers like me to dispute that point, which obviously means it’s true and EVS wins. Yay! While I tweet that alternative fact, take a look at this high-level architecture diagram.
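For the curious, the MDCT mentioned above is compact enough to write out. The sketch below is the plain textbook definition, computed directly in O(N²) time. It says nothing about how EVS or Opus actually implement their transforms (production codecs add windowing, overlap between blocks and fast algorithms), and the function name is simply a label chosen here.

```python
import math


def mdct(block):
    """Textbook MDCT: maps a block of 2N samples to N coefficients.

    X[k] = sum_{n=0}^{2N-1} x[n] * cos(pi/N * (n + 1/2 + N/2) * (k + 1/2))
    """
    two_n = len(block)
    if two_n == 0 or two_n % 2:
        raise ValueError("block length must be a positive even number")
    n_coeffs = two_n // 2
    return [
        sum(
            block[n] * math.cos(math.pi / n_coeffs * (n + 0.5 + n_coeffs / 2) * (k + 0.5))
            for n in range(two_n)
        )
        for k in range(n_coeffs)
    ]


if __name__ == "__main__":
    # Eight input samples produce four coefficients.
    print(len(mdct([0.0, 0.1, 0.2, 0.3, 0.3, 0.2, 0.1, 0.0])))
```

The point to notice is that twice as many samples go in as coefficients come out. Consecutive blocks overlap by half, which is what lets the transform stay critically sampled while avoiding audible block edges.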
A high-level architecture diagram of the EVS codec
As you will note from the diagram, the switch also determines whether to engage the comfort noise generator. A genuine feather in the EVS cap, however, is the addition of an AMR-WB interoperable (IO) mode, which, as the name suggests, provides native backwards compatibility with any endpoint or network function that does not support the EVS codec. Detailed traffic analysis in the pre-processing stage results in a granular classification of the signal type, allowing the switch to target the correct core EVS codec to employ. A switch between modes and the core EVS codec can be performed at each 20ms frame boundary.
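A quick bit of arithmetic shows what a 20ms frame means on the wire. The helper below (its name is made up for this post) converts an encoding rate into the number of coded bits carried by one frame. The rates plugged in are ones quoted elsewhere in this article, the 20ms default is the frame boundary mentioned above, and packet headers and any payload padding are deliberately ignored.

```python
def bits_per_frame(bitrate_bps, frame_ms=20):
    """Coded bits produced per frame at a given encoding rate.

    bitrate_bps: encoding rate in bits per second (integer)
    frame_ms:    frame duration in milliseconds (20ms assumed by default)
    """
    return bitrate_bps * frame_ms // 1000


if __name__ == "__main__":
    for rate in (4750, 12200, 32000):
        bits = bits_per_frame(rate)
        # Round up to whole bytes, since payloads are carried in octets.
        octets = (bits + 7) // 8
        print(f"{rate / 1000:g} kbps -> {bits} bits ({octets} bytes) per 20ms frame")
```

Because a mode or core switch can happen at every frame boundary, each of these small payloads is effectively an independent decision point for the encoder.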
FEC ‘n FEC
Although the Internet has seen fit to remove at least one irreplaceable image in that historic post, I went into quite a bit of detail regarding FEC in the aforementioned article. While I’m not one of those people to recycle my words and revisit that topic, the tech community has seen fit to recycle the acronym. Talk packet loss in today’s codec circles, and you’d better be ready to clarify whether you are talking frame erasure concealment FEC or forward error correction FEC. In all fairness, many references make the distinction by referring to the former using the alternative label of packet loss concealment (PLC), which is much nicer to us all. Some flip-flop between both. Thanks, ETSI/3GPP. Personally, I would favor “error loss lessening,” which would enable me to refer to the combination as FEC ’n ELL. Genius.
Forward error correction is applied at the encoder at varying degrees of resiliency, depending on link quality, to provide physical protection against packet loss by appending redundant copies of pertinent packets to other packets. Any lost or late arriving packets can be reliably rebuilt at the decoder, albeit at a lower sampling rate. If FEC fails, packet loss concealment techniques, a function of the decoder, mitigate the erasure by either repeating the previous packet (waveform substitution) or by guessing what audio waveform the packet contained based on what was received prior (model-based method) and having the post-processor play that instead.
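The interplay between the two can be sketched in a few lines of Python. This is a toy model, not the EVS scheme: each packet simply carries a full copy of the previous frame (a real codec would piggyback a lower-rate version), concealment is plain repetition of the last good frame, and all the names are invented for illustration.

```python
def packetize(frames):
    """Encoder side: attach a redundant copy of the previous frame to each packet."""
    packets = []
    previous = None
    for seq, frame in enumerate(frames):
        packets.append({"seq": seq, "primary": frame, "redundant": previous})
        previous = frame
    return packets


def receive(packets, total, lost):
    """Decoder side: rebuild `total` frames given a set of lost sequence numbers.

    Returns (frame, how) pairs, where how is 'received', 'fec' (rebuilt from
    the next packet's redundant copy) or 'plc' (concealed by repetition).
    """
    arrived = {p["seq"]: p for p in packets if p["seq"] not in lost}
    output = []
    last_good = None
    for seq in range(total):
        following = arrived.get(seq + 1)
        if seq in arrived:
            frame, how = arrived[seq]["primary"], "received"
        elif following is not None and following["redundant"] is not None:
            frame, how = following["redundant"], "fec"
        else:
            frame, how = last_good, "plc"  # waveform substitution
        output.append((frame, how))
        if frame is not None:
            last_good = frame
    return output


if __name__ == "__main__":
    frames = ["f0", "f1", "f2", "f3", "f4"]
    for result in receive(packetize(frames), len(frames), lost={1, 3, 4}):
        print(result)
```

A single lost packet is rebuilt from its neighbor, while a burst of losses exhausts the redundancy and falls back to concealment. Note that the decoder has to wait for the following packet before it can use FEC, so the protection costs a little delay as well as bandwidth.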
The Mighty MOS
In all my decades working with (and on) the latest advancements in voice transmission technologies, nothing pleases me more than the fact that, ultimately, the true test of our progress is measured by a few people sitting in a room and listening. That’s it. Me and my very own wax-laden lobes could personally determine the fate of a fancy new algorithm, regardless of whether I’d just been to an AC/DC performance the previous night. Luckily it’s not me -- it’s probably you. Or some other random person. But at least it’s not me. Like any new codec, EVS went through the mean opinion score (MOS) wringer and the people had some pretty nice things to say. I’ve chosen two of the many charts out there, which carve up the MOS results every which way, from a source that pitted EVS against Opus.6
Clean speech and mixed audio MOS for various codecs
What we note here is not the fullband results at 32kbps but what was subjectively heard in the wideband and fullband ranges at the sort of encoding rates we might favor or expect to be able to maintain across a broad mobile infrastructure. EVS excels, delivering mean opinion scores either much higher at identical encoding rates or on-par but at much lower rates. Shake it any way you want, EVS would appear to be tuned for the Internet of Talk and that makes it good, in my book -- whatever the price.
Join the movement: www.metaswitch.com/iotalk
1. A clichéd but true story. I used a book on ISDN to shore up a broken desk caster for about 10 years, with the irony being the book cost more than the desk. Which, I guess, might explain the defective caster, now I think about it.
4. Lack of ubiquitous 4G coverage, reliability/complexity of CSFB, handset availability, etc., etc.
6. Yes. The same one (5) I vaguely accused of nepotism, previously. All good. Moving on.