In April 2020, during the earliest days of the Covid-19 pandemic, Microsoft Teams announced that the capability to use artificial intelligence (AI) and machine learning (ML) to filter out typing, barking and other noises from its video calls was “coming soon.”
Back then, the platform had already grown from 44 million users in March 2020 to 75 million a month later, as pandemic-related lockdowns left millions of Americans suddenly adapting to remote work and the use of video conferencing tools exploded. Just as workers struggling with background noise on video calls had become part of the cultural zeitgeist, Microsoft Teams debuted AI-powered noise suppression and video quality tools in late 2020 and early 2021.
Now, Microsoft Teams continues to improve its AI and ML capabilities to help its more than 270 million monthly users deal with some of the biggest video conferencing headaches, from annoying echoes to difficulties talking at the same time.
New AI and ML-powered capabilities
Today, the company announced a new set of AI and ML-powered capabilities built into Teams’ underlying architecture. These include echo cancellation, adjusting audio in poor acoustic areas, and allowing users to speak and hear at the same time without interruptions. These build on AI-powered features recently released, including expanding background noise suppression. In addition, for the first time Microsoft Teams announced recent video quality improvements, including adjustments for low light and optimizations based on the type of content being shared.
“We are trying to make sure you can have your call or meeting wherever you are, even if you’re in ‘messy’ environments,” Robert Aichner, principal PM manager, Intelligent Conversation and Communications Cloud (IC3) at Microsoft, told VentureBeat.
Aichner, who has a Ph.D. in audio signal processing, has worked at Microsoft for the past decade and spent the past three years leading the AI team at Microsoft Teams, which works to turn research from academia into shipped product features.
Microsoft Teams uses AI to tackle tough challenges
Microsoft Teams has always offered noise suppression, Aichner said. But traditional methods have only been able to tackle stationary noises – noises that don’t change over time – such as computer fans or air conditioners. Other noises, such as dogs barking, or echoes from webcams, microphones or desktop speakers, are tougher noisy nuts to crack. So, too, is dealing with large or uncarpeted rooms that make users sound like they are in a cave.
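To illustrate why stationary noise is the easier case, here is a minimal Python sketch of an energy-based suppressor. It is not Microsoft's implementation: the function names, the per-frame energy values and the "quietest quarter of frames" rule are all assumptions chosen for illustration. Because a fan's level never changes, the quietest frames give a good estimate of it, and that estimate can be subtracted everywhere. A bark is loud and brief, so it never shows up in the estimate and passes through almost untouched.

```python
def estimate_noise_floor(frame_energies):
    """Assume the noise is stationary: its level is constant, so the
    quietest quarter of the frames approximates it."""
    ordered = sorted(frame_energies)
    quietest = ordered[: max(1, len(ordered) // 4)]
    return sum(quietest) / len(quietest)


def suppression_gains(frame_energies, noise_floor):
    """Per-frame gain (energy - noise) / energy, clamped to [0, 1]."""
    gains = []
    for energy in frame_energies:
        if energy <= 0:
            gains.append(0.0)
        else:
            gains.append(max(0.0, (energy - noise_floor) / energy))
    return gains


# Hypothetical frame energies: a steady fan (1.0), speech (10.0), one bark (8.0).
energies = [1.0, 1.0, 10.0, 1.0, 8.0, 1.0, 10.0, 1.0]
noise_floor = estimate_noise_floor(energies)
gains = suppression_gains(energies, noise_floor)
```

In this toy example the fan-only frames get a gain of 0.0 and the speech frames keep a gain of 0.9, but the bark frame keeps 0.875, which is the kind of gap the learned models are meant to close.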
“We have always worked to remove noise – it’s always been a very tough problem in traditional signal processing,” he said. But with machine learning, it is now easier for AI models to learn and improve.
For example, during calls and meetings, when a participant has their microphone too close to their speaker, it’s common for sound to loop between input and output devices, causing an unwanted echo effect. Now, Microsoft Teams uses AI to recognize the difference between sound from a speaker and the user’s voice. This eliminates the echo without suppressing speech or inhibiting the ability of multiple parties to speak at the same time. To accomplish this, Microsoft used 30,000 hours of recorded speech from male and female talkers in 74 different languages, as well as simulated sound for room acoustics, said Aichner.
In addition, in certain environments, room acoustics can cause sound to bounce, or reverberate, causing the user’s voice to sound shallow, as if they’re in a cave. For the first time, Microsoft Teams uses a machine-learning model to convert captured audio signals to sound as if users are speaking into a close-range microphone.
Microsoft Teams’ AI uses supervised learning
“We basically took a lot of clean speech, which is recorded as if I have a close talking microphone, and then we let the model learn to adapt to that and remove everything else,” he said, pointing out that this is supervised learning – where there is a target signal and the model tries to optimize for that.
Video quality problems, such as poor lighting, are handled in a similar way, he explained: “You have supervised learning about what good lighting looks like, as well as the poor lighting, and then you need some kind of rating of the quality of the good lighting versus the one you are trying to improve.”
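The supervised setup described above, a degraded input paired with a clean target, can be sketched in a few lines of Python. This is a deliberately tiny stand-in, not the Teams model: the "model" is a single learned gain `w`, the signals are synthetic, and the names `clean`, `noisy`, `mse` and `train` are hypothetical. What it shares with the real thing is the training loop: compare the model's output with the target and adjust the model to shrink the error.

```python
import math
import random

random.seed(0)

# Hypothetical training pair: a clean "close-talk" target and a degraded
# input that is quieter and has noise added.
clean = [math.sin(0.1 * n) for n in range(200)]
noisy = [0.5 * c + random.gauss(0.0, 0.05) for c in clean]


def mse(w, inputs, targets):
    """Mean squared error between the model output w * x and the target."""
    return sum((w * x - t) ** 2 for x, t in zip(inputs, targets)) / len(inputs)


def train(inputs, targets, steps=200, lr=0.5):
    """Gradient descent on the squared error, starting from w = 1.0."""
    w = 1.0
    for _ in range(steps):
        grad = sum(2 * (w * x - t) * x for x, t in zip(inputs, targets)) / len(inputs)
        w -= lr * grad
    return w


loss_before = mse(1.0, noisy, clean)
w = train(noisy, clean)
loss_after = mse(w, noisy, clean)
```

After training, `w` settles near 2, undoing the attenuation in the degraded input, and `loss_after` is well below `loss_before`. A production model replaces the single gain with a neural network and the sine wave with many hours of recorded speech, but the target-driven objective is the same idea.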
In situations where not enough bandwidth is available for the highest quality video, the encoder must make a trade-off between better picture quality and a smoother frame rate. To make it easier for the end user, Teams uses ML to understand the characteristics of the content the user is sharing to ensure participants experience the highest video quality in constrained bandwidth scenarios.
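The trade-off can be made concrete with a rough Python sketch. Everything here is an assumption for illustration: the candidate resolutions and frame rates, the bits-per-pixel cost model, the content labels and the function names are hypothetical, and in Teams the content type would come from an ML classifier rather than a hand-supplied string. The idea is only that static content such as slides benefits from sharpness, while motion benefits from a higher frame rate.

```python
# Hypothetical (height, frames-per-second) options the encoder may choose from.
CANDIDATES = [(1080, 30), (1080, 15), (1080, 5), (720, 30), (720, 15), (360, 30), (360, 15)]
BITS_PER_PIXEL = 0.05  # assumed average cost after compression


def estimated_kbps(height, fps):
    """Rough bitrate estimate for a 16:9 frame of the given height."""
    width = height * 16 // 9
    return width * height * fps * BITS_PER_PIXEL / 1000


def choose_encoding(content_type, budget_kbps):
    """Pick the best affordable option for the kind of content being shared."""
    affordable = [c for c in CANDIDATES if estimated_kbps(*c) <= budget_kbps]
    if not affordable:
        return min(CANDIDATES, key=lambda c: estimated_kbps(*c))
    if content_type == "slides":
        # Static text: prefer resolution first, then frame rate.
        return max(affordable, key=lambda c: (c[0], c[1]))
    # Motion: prefer frame rate first, then resolution.
    return max(affordable, key=lambda c: (c[1], c[0]))


slides_choice = choose_encoding("slides", 600)
motion_choice = choose_encoding("motion", 600)
```

With the same 600 kbps budget, the sketch picks 1080p at 5 frames per second for slides and 360p at 30 frames per second for motion.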
Microsoft Teams engages researchers, joint product efforts
Much of what Microsoft Teams has accomplished in using AI and ML to improve sound and video quality stems from its efforts, beginning in early 2020, to engage with the research community.
Aichner’s team launched an international competition as part of the Interspeech 2020 and ICASSP 2021 conferences, offering a “deep learning noise suppression challenge” designed to “foster innovation in the field of noise suppression to achieve superior perceptual speech quality.” Microsoft Teams open-sourced training and test datasets for researchers to train their noise suppression models.
These days, Microsoft Teams researchers also work jointly with the product team to influence future offerings.
“We have joint teams where we take these models and are actually integrating them,” he said. “I think that’s really key, to connect those two teams so that they get their vision from the product team and know what they should focus on – the product teams also are more aware of where the holes are, where it doesn’t work.”