Like Henry Higgins, the phonetics professor in George Bernard Shaw’s play Pygmalion, Marius Kotescu and Georgi Tinchev were recently demonstrating how their student was overcoming her pronunciation problems.
The two data scientists, who work for Amazon in Europe, were tutoring Alexa, the company’s digital assistant. Their mission: to use artificial intelligence and recordings of native speakers to help Alexa master English as it is spoken with an Irish accent.
During the demonstration, Alexa talked about a memorable night out. “The party last night was great craic,” Alexa said lightheartedly, using the Irish word for fun. “I bought ice cream on the way home and was very happy.”
Mr. Tinchev shook his head. Alexa had dropped the “r” in “party,” leaving the word sounding flat. Too British, he concluded.
The technologists are part of a team at Amazon working on a challenging area of data science known as speech disentanglement. It is a tough problem that has gained new relevance amid the wave of AI development: researchers say that cracking such puzzles of voice and technology could make AI-powered devices, bots and speech synthesizers more conversational, letting them do more and have a far greater effect across regions and accents.
Knowing vocabulary and syntax is not enough to master speech disentanglement. A speaker’s pitch, timbre and accent often give words subtle meaning and emotional weight. Linguists call this quality of language “prosody,” and it is something machines have had a hard time mastering.
Thanks to advances in artificial intelligence, computer chips and other hardware, researchers have made progress on speech disentanglement, transforming computer-generated speech into something far more pleasing to the ear.
Such efforts may eventually converge with the explosion of “generative AI,” the technology that lets chatbots generate their own responses, researchers say. Chatbots like ChatGPT and Bard may one day operate entirely by voice, taking spoken commands and responding aloud. At the same time, voice assistants such as Alexa and Apple’s Siri would become more conversational, potentially rekindling consumer interest in a technology category that seemed to have stagnated, analysts say.
Enabling voice assistants like Alexa, Siri and Google Assistant to speak multiple languages has been an expensive and time-consuming process. Tech companies have hired voice actors to record hundreds of hours of speech to help create the assistants’ synthetic voices. Advanced AI systems known as “text-to-speech models,” which convert text into natural-sounding synthesized speech, are just beginning to streamline this process.
The technology “has made it possible to create human, synthesized voices from text input in a variety of languages, accents and dialects,” said Marion Laboure, a senior strategist at Deutsche Bank Research.
Amazon is under pressure to catch up with rivals such as Microsoft and Google in the AI race. In April, Amazon’s chief executive, Andy Jassy, told Wall Street analysts that the company planned to make Alexa “even more proactive and conversational” with the help of sophisticated generative AI. And Rohit Prasad, Amazon’s chief scientist for Alexa, told CNBC in May that he envisioned the voice assistant as a voice-enabled “personal AI out of the box.”
Irish Alexa made its commercial debut in November after nine months of training to understand and speak an Irish accent.
“Accent is different from language,” Mr. Prasad said in an interview. AI technology must learn to separate an accent from other parts of speech, such as tone and frequency, before it can replicate the quirks of local dialects. An “a,” for example, might be flatter, and a “t” pronounced more forcefully.
These systems need to figure out such patterns “so they can synthesize a whole new accent,” he said. “That’s difficult.”
Harder still was developing technology that could learn a new accent almost entirely from existing speech models with different-sounding accents. That is what Mr. Kotescu’s team attempted in building Irish Alexa. It relied heavily on an existing speech model of mostly British English accents (with a much smaller range of American, Canadian and Australian accents) to teach the assistant to speak Irish English.
The team worked through various linguistic challenges of Irish English. Irish people, for example, tend to drop the “h” in “th,” pronouncing the letters as a hard “t” or “d,” so that “bath” can sound like “bat” or even “bad.” Irish English is also rhotic, meaning the “r” is pronounced prominently: the “r” in “party” sounds far more distinct than it would from a Londoner’s mouth. Alexa had to learn and master all of these speech features.
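The pronunciation patterns described above can be pictured as simple rewrites over phoneme sequences. The sketch below is an illustrative toy only: the phoneme symbols (loosely ARPAbet-like), the “~r” marker for a droppable “r,” and the rules themselves are simplifications invented for this example, not Amazon’s actual representation.

```python
# Toy sketch: rewriting a non-rhotic (British-style) phoneme sequence
# into an Irish-accented one. Symbols and rules are illustrative
# simplifications, not Amazon's actual model.
#
# Hypothetical accent rules:
#   - "th" sounds (TH / DH) harden to "t" / "d"  -> "bath" ~ "bat"
#   - rhoticity: pronounce the "r" a non-rhotic accent would drop

def to_irish_accent(phonemes):
    """Apply toy Irish-English rules to a list of phoneme strings."""
    out = []
    for p in phonemes:
        if p == "TH":            # voiceless "th" -> hard "t"
            out.append("T")
        elif p == "DH":          # voiced "th" -> hard "d"
            out.append("D")
        elif p.endswith("~r"):   # marker: vowel plus a droppable "r"
            out.append(p[:-2])   # keep the vowel
            out.append("R")      # rhotic: pronounce the "r"
        else:
            out.append(p)
    return out

# "bath" with a flat "a": B AH TH -> B AH T
print(to_irish_accent(["B", "AH", "TH"]))         # ['B', 'AH', 'T']
# "party", where "AA~r" marks the vowel + droppable "r"
print(to_irish_accent(["P", "AA~r", "T", "IY"]))  # ['P', 'AA', 'R', 'T', 'IY']
```

A real text-to-speech system learns such mappings statistically from recordings rather than from hand-written rules, which is precisely why the training described in this article is hard.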
Mr. Kotescu, who is Romanian and was the lead researcher on the Irish Alexa team, said Irish English was “difficult.”
The speech models that power Alexa’s language abilities have grown more sophisticated in recent years. In 2020, Amazon researchers taught Alexa to speak fluent Spanish from an English-language speech model.
Mr. Kotescu and his team saw accents as the next frontier of Alexa’s abilities. They designed Irish Alexa to rely more on AI than on actors to build its speech model. As a result, Irish Alexa was trained on a relatively small corpus: roughly 24 hours of recordings by voice actors reciting 2,000 utterances in Irish-accented English.
At first, when Amazon researchers fed the Irish recordings into the model, some strange things happened.
Letters and syllables occasionally dropped out of responses. “S” sounds sometimes stuck together. A word or two, occasionally crucial ones, came out as garbled, incomprehensible mumbles. In at least one case, Alexa’s feminine voice dropped an octave and sounded more masculine. Worse, that masculine voice was distinctly British, goofy enough to raise eyebrows in some Irish households.
“They are big black boxes,” Mr. Tinchev, a Bulgarian who is Amazon’s lead scientist on the project, said of the speech models. “You have to do a lot of experimentation to tune them.”
That is what the technologists did to correct Alexa’s “party” gaffe. They deconstructed the speech word by word and phoneme by phoneme (a phoneme being the smallest audible fragment of a word) to pinpoint exactly where Alexa was slipping up and to tweak it. Then they fed more recorded speech data into Irish Alexa’s speech model to correct the mispronunciation.
The result: the “r” in “party” returned. But then the “p” disappeared.
So the data scientists repeated the process. They eventually zeroed in on the phonemes around the missing “p” and tweaked the model further, so that the “p” sound returned and the “r” did not vanish again. Alexa could finally speak like a Dubliner.
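The debugging loop described above, comparing the phonemes a model should produce against what it actually produced and then patching, can be sketched with a standard sequence alignment. Everything below is a hypothetical simplification of that process: the function name and phoneme lists are invented for illustration, and the alignment simply uses Python’s standard-library difflib.

```python
# Toy sketch of the phoneme-level debugging loop: align the phonemes a
# model should produce against what it actually produced, and report
# which ones went missing. Illustrative only; real pipelines work on
# audio and learned representations, not symbol lists.
import difflib

def missing_phonemes(expected, actual):
    """Return phonemes present in `expected` but not matched in `actual`."""
    matcher = difflib.SequenceMatcher(a=expected, b=actual)
    missing = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op in ("delete", "replace"):  # expected material that was lost
            missing.extend(expected[i1:i2])
    return missing

# "party" as it should sound in a rhotic Irish accent...
expected = ["P", "AA", "R", "T", "IY"]
# ...versus a first pass that dropped the "r" (too British)...
print(missing_phonemes(expected, ["P", "AA", "T", "IY"]))  # ['R']
# ...and a later pass where fixing the "r" lost the "p".
print(missing_phonemes(expected, ["AA", "R", "T", "IY"]))  # ['P']
```

Each run pinpoints the lost sound, mirroring how the team iterated: fix the “r,” discover the missing “p,” and tune again until both survive.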
Two Irish linguists, Elaine Vaughan, who teaches at the University of Limerick, and Kate Tallon, a Ph.D. student who works in the Phonetics and Speech Laboratory at Trinity College Dublin, have since given Irish Alexa’s accent high marks. The way Irish Alexa emphasizes the “r” and softens the “t” stands out, they said, and Amazon got the accent right overall.
“It sounds real to me,” Ms. Tallon said.
Amazon’s researchers said they were pleased with the largely positive feedback. Their speech model disentangled the Irish accent so quickly that they are hopeful of recreating the feat with other accents.
“We plan to extend our methodology to accents of languages other than English,” they wrote in a January research paper about the Irish Alexa project.