Sonos has just unveiled its first AI-powered sound processing feature, in the form of new AI Speech Enhancement options for the Sonos Arc Ultra – and to learn how it came together I visited Sonos’ UK audio development center.
First, a recap on the feature. There are now four levels of dialogue boosting to choose from, and it works in a totally different way to Sonos’ previous dialogue enhancement options, by separating the speech from the rest of the soundtrack and carefully adjusting it while better maintaining the dynamic range and immersive Dolby Atmos effects that make the Arc Ultra one of the best soundbars. I’ve heard it in action, and it really keeps the punch of the bass and the detail in effects while still enhancing speech.
Interestingly, the feature was developed in conjunction with the Royal National Institute for Deaf People (RNID), the UK’s leading charity for people with hearing loss, including a year of refining the feature by working directly with people who have different kinds and levels of hearing loss.
But it’s not an ‘accessibility’ feature hidden away in a menu – this is the new standard Speech Enhancement tool, available from the Now Playing screen in the Sonos app, and now with four options instead of the two that the Arc Ultra has currently. It’s just that the higher-tier options are more suited to those with hearing loss than those who just want a bit of extra dialogue clarity – that’s what the lower options are for.
To dig into the background of developing the AI side of the features, as well as the work with the RNID, I visited Sonos’ UK product development facilities, and spoke to Matt Benatan, Principal Audio Researcher at Sonos and the AI lead on the project; Harry Jones, Sound Experience Engineer at Sonos; Lauren Ward, Lead RNID Researcher; and Alastair Moore, RNID Researcher.
Bringing AI into the mix
Your first question might be why this option is only on the Sonos Arc Ultra. Matt Benatan explained that “with Arc Ultra, we have some more CPU capability that we can make use of” – it being the latest in Sonos’ line-up means that it’s the only one capable of supporting the AI algorithm, apparently.
“The underlying technology here is something called source separation,” explains Benatan. “What you want to do is to extract the signal of interest from some more complex signal. Traditionally, this is applied in telecommunications applications, so this is where a lot of the development work around these sorts of technologies comes from.
“But a lot of the traditional methods are quite limited because what you’re trying to remove there are what we term sort of more static types of noise – things like the sound of an air conditioner or the sound of traffic or a crowd.”
Followers of audio tech will recognize that kind of tech as being similar to what’s in the best noise cancelling headphones – but here we don’t want to fully remove sound, we want to enhance it, so it’s a different proposition.
“When we’re dealing with multimedia content in film and television, what we’ve got are these intentionally crafted sonic experiences that contain a multitude of elements that are designed to engage. You’re not supposed to ignore the explosions and the special effects and the music,” says Benatan.
“But at the same time, you can’t engage with that content unless you can hear the dialogue. And so the idea here was to use some of these new neural network-based methods to go a bit further than the digital signal processing that we’ve been able to do before … to apply these dynamic ‘masks’.
“It can adapt in a way that traditional techniques cannot adapt. And this means that [from] frame to frame of incoming audio, we can understand where that speech content is, and we can pull that out.”
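To make that masking idea a little more concrete, here’s a minimal, purely illustrative sketch of how a predicted time-frequency mask can be used to pull speech out of a mix and boost it – the predict_mask callable stands in for the trained neural network, and none of this reflects Sonos’ actual implementation.

```python
# Minimal sketch of mask-based source separation (not Sonos' implementation).
# A hypothetical model predicts, per time-frequency bin, how much of the
# mixture is speech; multiplying the spectrogram by that mask isolates an
# estimated speech signal, which can then be boosted and recombined.
import numpy as np
from scipy.signal import stft, istft

def enhance_speech(mixture, sr, predict_mask, speech_gain_db=6.0):
    """Boost estimated speech within a mixed soundtrack.

    predict_mask: callable taking a magnitude spectrogram and returning
    values in [0, 1] per bin (a stand-in for the trained neural network).
    """
    f, t, spec = stft(mixture, fs=sr, nperseg=1024)
    mask = predict_mask(np.abs(spec))           # dynamic, frame-by-frame mask
    speech = spec * mask                        # estimated dialogue component
    background = spec * (1.0 - mask)            # everything else, left intact
    gain = 10 ** (speech_gain_db / 20.0)
    _, enhanced = istft(speech * gain + background, fs=sr, nperseg=1024)
    return enhanced
```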
Training montage
Neural networks and so-called ‘AI’ sound processing systems need to be trained on sound files to learn what they should recognize (or not), and Sonos’ AI processing was trained on 20,000 hours of realistic audio files – though not on real movies.
This avoided the lingering question of the copyright implications of training an AI model on real works without permission for every sample (currently the subject of much debate, such as from the ‘Make it fair’ campaign in the UK).
“What’s really important when we’re dealing with these sorts of problems is the variety of data that the models are exposed to,” Benatan says. “In order to do that, we worked with an award-winning sound designer who helped us to design the material that we use to train the models … to make sure that we’re exposing the model to the information that it needs to see to provide great experiences in the future as a whole.”
As is common in AI model training, Benatan says Sonos used data augmentation for its training, meaning that the samples provided by the sound designer were used in different formats.
For example, one sound file might be used in training in a stereo format, plus a 5.1 surround format, plus a Dolby Atmos 3D audio format – providing more information for the neural network as to how to pick out speech across different types of movie audio formats.
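As a rough illustration of what that kind of augmentation can look like – with a channel layout and downmix coefficients that are textbook assumptions rather than anything Sonos has described – the same designed scene might be rendered into several formats like this:

```python
# Illustrative format augmentation: one designed scene rendered into several
# channel layouts, so the model sees speech in each presentation. The 5.1
# ordering and downmix coefficients here are standard assumptions, not Sonos'.
import numpy as np

def downmix_to_stereo(mix_5_1):
    """Fold a (6, n) 5.1 clip ordered [L, R, C, LFE, Ls, Rs] down to stereo."""
    L, R, C, LFE, Ls, Rs = mix_5_1
    left = L + 0.707 * C + 0.707 * Ls
    right = R + 0.707 * C + 0.707 * Rs
    return np.stack([left, right])

def make_variants(stems):
    """Return format variants of one training scene built from (6, n) stems."""
    mix_5_1 = sum(stems.values())               # e.g. dialogue + music + foley
    stereo = downmix_to_stereo(mix_5_1)
    return {
        "5.1": mix_5_1,
        "stereo": stereo,
        "mono": stereo.mean(axis=0, keepdims=True),
    }
```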
This raised the question: would it have been easier if Sonos could simply have trained on a huge range of copyright-protected movies with impunity?
Benatan says that “purely from a data acquisition standpoint, that would have been easier, but we would have lost so much value along the way doing that.”
Benatan says that working with sound designers gave the team a greater understanding of how mixing works and why things are done in certain ways, meaning they learned more about what the AI model needed to do than if they’d just tried training it outright – they learned what gaps relying on open data alone would have left them with.
“We were talking to sound designers about the nature of different scenes and the kinds of compositions that we can expect to encounter. And they were able to provide things for us such as being able to separate, say, the foley and the sound effects,” says Benatan. “Getting some insight into that underlying mixing process and that creative process to create these sonic scenes was really helpful to understand the challenges that we were seeing with the [testing model trained on open data].”
Expanding the audience
Even with this data and promise of an improving AI model, there was still the question of how to make the most of it. Benatan explained that the decision to partner with the RNID came from the personal lives of people involved in the project.
Benatan says, “My manager, James Nesfield [Director of Emerging Technologies at Sonos], and I were chatting about the difficulties that family members had with dialogue. So he was talking about the fact that his mum was really struggling to understand dialogue in film and TV. And my father-in-law had recently [started wearing] hearing aids.
“And if you know anybody who’s got hearing aids, you know that it’s a bit of a bumpy road to get them tuned correctly, to get familiar with them. And to begin with, it’s not particularly fun to watch content with them. A lot of people like to take their hearing aids out when they watch content. It’s a more natural presentation,” Benatan continues.
“We were like, this [model] can do more than just enhance speech in the way that we’d approached it previously. Like, this could really be something that can help people in the hearing health community. And it was at that point, that we decided to engage with the RNID,” Benatan says.
Lauren Ward adds, “This is the first time that we’ve been embedded so early in the process – often people come to us with products that are already basically ready to go, or are already out in the world. And that’s great, but there’s a limited amount that you can do when something is basically ready to ship.”
Benatan and Ward explained that bringing the RNID in early provided a new avenue for feedback that proved crucial for Sonos, including from people who know a ton about both audio and hearing health but don’t necessarily work in the industry.
Ward says that “people who also have knowledge of audio and have hearing loss tend to be massive nerds about their hearing loss. They tend to really want to understand what’s going on. So they’re actually an awesome resource in situations like this, because they can articulate their experiences really well, and then in later stages of the project we went after a broader group [of people with hearing loss].”
The audio “nerds” apparently proved very helpful in describing their experiences, but bringing in a wider group of people to test the mode meant that not everyone was able to communicate what they were hearing so precisely – yet Ward, Benatan and Alastair Moore said that these conversations could be clarifying in their own right, for the RNID as well as for Sonos.
Ward says: “One of the things that we were exploring in that first test was what engineering people would call perceptibility of artefacts. Can you hear what’s going wrong? Does this voice sound unnatural? Things like that. We’re thinking about ways that we can phrase that, that’s closer to people’s everyday language.
“But even as someone who works with people with hearing loss all the time… in one of the sessions I was running, I had a gentleman and I was trying to find the right language to use with him, and I’d gone, ‘Oh, does the voice still sound natural, even at the higher levels?’ And he’s like, ‘I’ve been deaf since birth. What is natural?'”
Loudness recruitment
Another element the project aimed to account for is that how you change the sound matters – you can’t just add sharpness and clarity and call that a win, because that can end up causing problems for other people.
“There’s a phenomenon called loudness recruitment,” Benatan explains. “This is a rather vicious phenomenon whereby not only do quiet sounds become harder or impossible to hear, but louder sounds become less comfortable, they become painful.
“So you’re compressing the range in which somebody is actually able to comfortably listen. And this was really important in understanding how we would design [the new feature] – you know, what role compression plays in the delivery of dialogue when incorporating the feature.
“Speech enhancement isn’t just about those with hearing loss. It’s about making sure that everybody can engage with the content they’re watching on their terms, right? … It wasn’t just about dialogue. It was that they want to be able to enjoy the content like everybody else does. We latched onto that,” Benatan adds.
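As a rough illustration of the role compression can play here – a textbook sketch, not the DSP Sonos actually ships – squeezing the dynamic range keeps quiet words audible without letting peaks push into the painful end of a narrowed comfortable range:

```python
# Rough sketch of the compression idea (illustrative, not Sonos' shipped DSP):
# when loudness recruitment narrows someone's comfortable listening range,
# reducing the dynamic range keeps quiet words audible without letting
# peaks become painful.
import numpy as np

def compress(signal, threshold_db=-30.0, ratio=3.0):
    """Simple static compressor: attenuate anything above the threshold."""
    eps = 1e-12
    level_db = 20 * np.log10(np.abs(signal) + eps)
    over_db = np.maximum(level_db - threshold_db, 0.0)
    gain_db = -over_db * (1.0 - 1.0 / ratio)    # reduce only the loud parts
    return signal * 10 ** (gain_db / 20.0)
```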
Personalization vs ease of use
One thing I noted is that we’re entering the era of personalized audio tuned to our hearing, from the likes of the Denon PERL Pro customizing its sound to your hearing, up to the AirPods Pro 2 now functioning as FDA-approved hearing aids with adjustments made specifically to your level of hearing loss.
Sonos’ approach isn’t exactly one-size-fits-all, but you could say it’s four sizes fit all: Low, Medium, High and Max. I asked whether there was any concern over it perhaps not covering a wide enough set of needs through those options.
Ward acknowledged that “It’s always a balance between overwhelming options and the ability to personalize,” but noted that having four options actually came out of the work with the RNID.
“When Matt first presented this structure, the setting only had three levels, and it’s grown out to four because what we found from our first listening test in particular was that it really did need to push further than anyone thought initially. And I think what’s important to differentiate with something like speech enhancement that sits in entertainment products – versus anything that’s trying to mimic hearing aids – is that it’s not trying to compensate for every difference in hearing. It’s about options, and someone who wants it on High or Max for one piece of content or on one particular day might not want it on another day,” she explains.
“Sometimes they’d say, ‘I just want the immersive experience, I don’t want any speech enhancement or anything to change. Yes, I might lose some of the speech but it’s more about immersion.’ Whereas for other pieces of content on other days, it’s ‘No, I really want to hear the dialogue.’ And that can be the same person’s hearing loss. And then you add multiple different people [watching in the same room] and you’ve just got so many different possible permutations.”
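To picture how those four tiers might translate into processing, here’s a purely hypothetical mapping – the parameter names and values are my own illustrative assumptions, not Sonos’ actual tuning:

```python
# Purely illustrative mapping (the real tuning is Sonos' own and not public):
# each user-facing level could correspond to progressively stronger speech
# gain and heavier compression of the non-speech content.
SPEECH_ENHANCEMENT_LEVELS = {
    "Low":    {"speech_gain_db": 2.0, "background_compression_ratio": 1.0},
    "Medium": {"speech_gain_db": 4.0, "background_compression_ratio": 1.5},
    "High":   {"speech_gain_db": 6.0, "background_compression_ratio": 2.0},
    "Max":    {"speech_gain_db": 9.0, "background_compression_ratio": 3.0},
}
```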
A key element is also the simplicity of actually using the feature.
Ward says, “There are scores of parameters involved in the speech enhancement process, and we’ve seen visualizations where they’re all changing all at once. But if you present all of that to a user, you can get really lost … You’re not going to have that on the Now Playing screen, whereas we’re able to include this feature right there, which is just one extra step. I think that means it’s going to get used loads.”
She adds, “Something that we really passionately believe is that a crucial part of accessibility is usability.”
Speaking of accessibility, I asked if the study with the RNID found that people actually changed their viewing habits as a result of having improved sound, and Ward said they found that it didn’t just mean that people changed how they watch, but also what they felt able to watch.
“The one that jumps out immediately was during one of the pieces of content in our second listening session. It was something that had a sci-fi war scene, so there are bombs going off, there’s dialogue, it’s quite chaotic. And [one member of the study] expressed that generally, that is something where he would just look at the type of content and avoid it on its face, because it’s going to be too loud, it’s going to be too overwhelming,” Ward explains.
“Then coming to listen to it with the speech enhancement on, it was like, ‘Actually, I could go back and watch that because I can get that balance where I’m still in the content, but it’s not too overwhelming.’ And I think that gives us a glimpse into how some people are avoiding some things – or just choosing not to watch some things – not because they don’t feel they’d enjoy the content itself, but because it doesn’t feel accessible. It might feel too hard or too unpleasant.”
Alastair Moore punctuates this conversation with a point about how many people may be subtly affected by this kind of thing. “I think that around 50% of people over age 50 have some level of hearing loss, so it’s not a small number.”
Dynamic in more ways than one
The end result is a system tuned with all of this in mind, but the final interesting touch is that the system knows it doesn’t actually need to work all the time, even when it’s turned on.
Harry Jones explains: “We wanted to understand: when do we need to act? Because it’s such a huge thing to affect the sound experience as well. We don’t want to touch the stuff that doesn’t need to be lifted, we don’t want to pull out a random crowd voice and favor that when it’s a really exciting sequence with cars. Also, on the other end of the scale, if it’s clean, we also don’t need to go crazy with the processing.”
“Something that we learned in the RNID discussions was that it’s not that they just want to bring dialogue up. It’s that they want to enjoy the process as well, the whole soundtrack as an entire thing.”
The solution was to analyze what’s happening in the scene before deciding whether it needs to be changed. Each frame of sound analyzed is around 5.8 milliseconds, and if the trend in the sound changes (ie, has the dialogue started to become mixed with other loud noises) then the system reacts.
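As a rough sketch of that “only act when needed” logic – using the frame length mentioned above, but with a decision rule and threshold that are purely my own illustrative assumptions – the system might step through short frames and check whether dialogue is at risk of being masked:

```python
# Hedged sketch of the "only act when needed" idea: step through ~5.8 ms
# frames, compare the separated speech level against the rest of the mix,
# and only flag frames where dialogue risks being masked. The margin and
# decision rule are illustrative assumptions, not Sonos' actual logic.
import numpy as np

def needs_enhancement(speech, background, sr=48000, frame_ms=5.8, margin_db=6.0):
    """Return a per-frame boolean: is dialogue at risk of being masked?"""
    hop = max(1, int(sr * frame_ms / 1000))     # ~278 samples at 48 kHz
    decisions = []
    for start in range(0, len(speech) - hop, hop):
        s = np.sqrt(np.mean(speech[start:start + hop] ** 2) + 1e-12)
        b = np.sqrt(np.mean(background[start:start + hop] ** 2) + 1e-12)
        ratio_db = 20 * np.log10(s / b)
        decisions.append(ratio_db < margin_db)  # speech not clearly above mix
    return np.array(decisions)
```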
Sonos identified 15 reasons why speech in a movie or TV show might be unclear, ranging from what’s happening in the mix (mistakes when mastering, artistic intention), to issues with the room (echoes), to outside sound (street noise) and everything in between. The system can’t help with all of them, but mapping them out apparently proved instructive.
Then they broke down some different types of sound mixes in scenes, ranging from those with people talking and no other sound, to scenes with music and effects only.
“The real question was: when does it need enhancing, and when does it just need, you know, slightly cleaning up? The dialogue spectrum is anywhere between no dialogue at all, and as we move up [the other end of] that scale, we’ve got no sound and clean dialogue. Muffled dialogue over car noises needs the most help. Talking over music, less but some help,” Jones explains.
“The luxury of having speech extracted was that we knew when it was happening,” he adds.
If you have a Sonos Arc Ultra, you should be able to try the new mode out for yourself pretty much right away. For a lot of people, it won’t be needed, especially because the Sonos Arc Ultra is pretty dialogue-forward on its own (a quality I appreciate about it).
But equally, I think there’s a chance that a lot of people might like to use the ‘Low’ setting who wouldn’t have wanted to use speech enhancement tools in the past – and I’ll be very interested to see if the High and Max settings help people as much as Sonos and the RNID hope they will.