Overcoming Automatic Speech Recognition Challenges: The Next Frontier



Advancements, Opportunities, and Impacts of Automatic Speech Recognition Technology in Various Domains

Photo by Andrew DesLauriers on Unsplash

TL;DR:

This post focuses on the advancements in Automatic Speech Recognition (ASR) technology and its impact on various domains. ASR has become prevalent in multiple industries, with improved accuracy driven by scaling model size and constructing larger labeled and unlabeled training datasets.

Looking ahead, ASR technology is expected to continue improving with the scaling of the acoustic model size and the enhancement of the internal language model. Additionally, self-supervised and multi-task training techniques will enable low-resource languages to benefit from ASR technology, while multilingual training will boost performance even further, allowing for basic usage such as voice commands in many low-resource languages.

ASR will also play a significant role in Generative AI, as interaction with avatars will be via an audio/text interface. With the emergence of textless NLP, some end-tasks, such as speech-2-speech translation, may be solved without using any explicit ASR model. Multimodal models that can be prompted using text, audio, or both will be released and generate text or synthesize audio as an output.

Furthermore, open-ended dialogue systems with voice-based human-machine interfaces will improve robustness to transcription errors and differences between written and spoken forms. This will provide robustness to challenging accents and children’s speech, enabling ASR technology to become an essential tool for many applications.

An end-to-end speech enhancement-ASR-diarization system is set to be released, enabling the personalization of ASR models and improving performance on overlapped speech and challenging acoustic scenarios. This is a significant step towards solving ASR technology’s challenges in real-world scenarios.

Lastly, a wave of speech APIs is expected. Still, there are opportunities for small startups to outperform big tech companies in domains with tighter legal or regulatory restrictions on technology use and data acquisition, and in populations with low technology adoption rates.

2022 In A Review

Automatic Speech Recognition (ASR) technology is gaining momentum across various industries such as education, podcasts, social media, telemedicine, call centers, and more. A great example is the growing prevalence of voice-based human-machine interface (HMI) in consumer products, such as smart cars, smart homes, smart assistive technology [1], smartphones, and even artificial intelligence (AI) assistants in hotels [2]. In order to meet the increasing demand for fast and accurate responses, low-latency ASR models have been deployed for tasks like keyword spotting [3], endpointing [4], and transcription [5]. Speaker-attributed ASR models [6–7] are also gaining attention as they enable product personalization, providing greater value to end-users.

Prevalence of Data. Streaming audio and video platforms such as social media and YouTube have led to the easy acquisition of unlabeled audio data [8]. New self-supervised techniques have been introduced to utilize this audio without needing ground truth [9–10]. These techniques improve the performance of ASR systems in the target domain, even without fine-tuning on labeled data for that domain [11]. Another approach gaining attention due to its ability to utilize this unlabeled data is self-training using pseudo-labeling [12–13]. The main concept is to automatically transcribe unlabeled audio data using an existing ASR system and then use the generated transcription as ground truth for training a different ASR system in a supervised fashion. OpenAI took a different approach, assuming they could find human-generated transcripts at scale online. They generated a high-quality, large-scale (680K hours) training dataset by crawling publicly available audio data with human-generated subtitles. Using this dataset, they trained an ASR model (a.k.a. Whisper) in a fully supervised manner, achieving state-of-the-art (SoTA) results on several benchmarks in zero-shot settings [14].
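For readers who want to try the zero-shot setting described above, here is a minimal sketch using the open-source whisper package released with [14]; the checkpoint size and the audio file name are placeholders, and larger checkpoints trade speed for accuracy.

```python
# Minimal zero-shot transcription with the open-source whisper package (pip install openai-whisper).
# "base" and "meeting.wav" are placeholders; larger checkpoints are slower but more accurate.
import whisper

model = whisper.load_model("base")         # downloads the checkpoint on first use
result = model.transcribe("meeting.wav")   # no fine-tuning on the target domain
print(result["text"])                      # the full transcript as one string
for seg in result["segments"]:             # per-segment timestamps in seconds
    print(f'[{seg["start"]:.1f}-{seg["end"]:.1f}] {seg["text"]}')
```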

Losses. Despite end-to-end (E2E) losses dominating SoTA ASR models [15–17], new losses are still being published. A new technique called the hybrid autoregressive transducer (HAT) [18] has been introduced, making it possible to measure the quality of the internal language model (ILM) by separating the blank and label posteriors. Later work [19] used this factorization to effectively adapt the ILM using only textual data, which improved the overall performance of ASR systems, particularly the transcription of named entities, slang terms, and nouns, which are major pain points for ASR systems. New metrics have also been developed to better align with human perception and overcome the semantic blind spots of word error rate (WER) [20].
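Since WER is the metric these works measure against (and critique), here is a minimal reference implementation: word-level Levenshtein distance divided by the reference length. It deliberately skips text normalization, which strongly affects reported numbers in practice.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between the first i reference words and first j hypothesis words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

print(word_error_rate("turn the lights on", "turn lights off"))  # 0.5
```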

Architecture Choice. Regarding the acoustic model’s architectural choices, the Conformer [21] remained preferred for streaming models, while the Transformer [22] is the default architecture for non-streaming models. For the latter, encoder-only (wav2vec2-based [23–24]) and encoder-decoder (Whisper [14]) multilingual models were introduced and improved over the SoTA results across several benchmarks in zero-shot settings. These models outperform their streaming counterparts thanks to model size, training data size, and their larger context.

Multilingual AI Developments from Tech Giants. Google has announced its “1,000 Languages Initiative” to build an AI model that supports the 1,000 most spoken languages [25], while Meta AI has announced its long-term effort to build language and machine translation (MT) tools that include most of the world’s languages [26].

Spoken Language Breakthrough. Multi-modal (speech/text) and multi-task pre-trained seq-2-seq (encoder-decoder) models such as SpeechT5 [27] were released, showing great success on a wide variety of spoken language processing tasks, including ASR, speech synthesis, speech translation, voice conversion, speech enhancement, and speaker identification.

These advancements in ASR technology are expected to drive further innovation and impact a wide range of industries in the years to come.

A Look Ahead

Despite its challenges, the field of Automatic Speech Recognition (ASR) is expected to make significant advancements in various domains, ranging from acoustic and semantic modeling to conversational and generative AI, and even speaker-attributed ASR. This section provides detailed insights into these areas and shares my predictions for the future of ASR technology.

Photo by Nik on Unsplash

General Improvements:

Improvements to ASR systems are expected on both the acoustic and the semantic side.

On the acoustic model side, larger model and training data sizes are anticipated to enhance the overall performance of ASR systems, similar to the progress observed in LLMs. Although scaling Transformer encoders such as wav2vec2 or Conformer poses a challenge, a breakthrough is expected that enables their scaling, or we will see a shift towards encoder-decoder architectures as in Whisper. However, encoder-decoder architectures have drawbacks that need to be addressed, such as hallucinations. Optimizations such as faster-whisper [28] and NVIDIA’s wav2vec2 implementation [29] will reduce training and inference time, lowering the barrier to deploying large ASR models.
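As an illustration of the inference-side optimizations mentioned above, a minimal faster-whisper [28] sketch is shown below; the checkpoint name, device, and audio file are placeholders.

```python
# Lower-latency Whisper inference via the CTranslate2-based faster-whisper package [28].
# pip install faster-whisper; "large-v2", "cuda", and "podcast.mp3" are placeholders.
from faster_whisper import WhisperModel

model = WhisperModel("large-v2", device="cuda", compute_type="float16")
segments, info = model.transcribe("podcast.mp3", beam_size=5)
print(f"Detected language: {info.language} (p={info.language_probability:.2f})")
for seg in segments:                       # a generator: decoding happens lazily
    print(f"[{seg.start:.1f}s -> {seg.end:.1f}s] {seg.text}")
```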

On the semantic side, researchers will focus on improving ASR models by incorporating larger acoustic or textual contexts. Injecting large-scale unpaired text into the ILM during E2E training, as in JEIT [30], will also be explored. These efforts will help to overcome key challenges such as accurately transcribing named entities, slang terms, and nouns.

Although Whisper and Google’s universal speech model (USM) [31] have improved ASR performance over several benchmarks, some benchmarks remain unsolved, with the word error rate (WER) still around 20% [32]. Using speech foundation models, adding more diverse training data, and applying multi-task learning will significantly improve performance in such scenarios, opening up new business opportunities. Moreover, new metrics and benchmarks are expected to emerge to better align with new end-tasks and domains, such as non-lexical conversational sounds [33] in the medical domain and filler-word detection and classification [34] in media-editing and educational domains. Task-specific fine-tuned models may be developed for this purpose. Finally, with the growth of multi-modality, more models, training datasets, and new benchmarks for several tasks are also expected to be released [35–36].

As progress continues, a wave of speech APIs is expected, similar to natural language processing (NLP). Google’s USM, OpenAI’s Whisper, and Assembly’s Conformer-1 [37] are some of the early examples.

Although it sounds silly, forced alignment is still challenging for many companies. An open-source implementation may help many achieve accurate alignment between audio segments and their corresponding transcripts.
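To make the problem concrete, below is a minimal Viterbi forced-alignment sketch: given per-frame log-probabilities from any acoustic model and the tokenized transcript, it returns the frame at which each token starts. It is an illustration only; it ignores CTC blanks and the frame-to-time conversion that a production aligner must handle.

```python
import numpy as np

def forced_align(log_probs: np.ndarray, tokens: list[int]) -> list[int]:
    """Return the frame index at which each transcript token starts.

    log_probs: (T, V) per-frame log-probabilities from an acoustic model.
    tokens:    token ids of the known transcript (monotonic alignment assumed).
    """
    T, _ = log_probs.shape
    N = len(tokens)
    assert 0 < N <= T, "need at least one frame per token"
    trellis = np.full((T, N), -np.inf)
    trellis[0, 0] = log_probs[0, tokens[0]]
    for t in range(1, T):
        for n in range(N):
            stay = trellis[t - 1, n]                               # keep emitting token n
            advance = trellis[t - 1, n - 1] if n > 0 else -np.inf  # switch to token n
            trellis[t, n] = max(stay, advance) + log_probs[t, tokens[n]]
    starts, n = [0] * N, N - 1
    for t in range(T - 1, 0, -1):          # backtrack the best path
        if n == 0:
            break
        if trellis[t - 1, n - 1] >= trellis[t - 1, n]:
            starts[n] = t                  # token n began at frame t
            n -= 1
    return starts
```

Frame indices would then be mapped to timestamps using the acoustic model’s frame stride (e.g., roughly 20 ms per frame for many wav2vec2-style encoders).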

Low-Resource Languages:

Advancements in self-supervised learning, multi-task learning, and multilingual models are expected to significantly improve performance on low-resource and unwritten languages. These methods will achieve acceptable performance by utilizing pre-trained models and fine-tuning on a relatively small number of labeled samples [24]. Another promising approach is dual learning [38], a paradigm for semi-supervised machine learning that seeks to leverage unsupervised data by solving two opposite tasks (text-to-speech (TTS) and ASR in our case) at once. In this method, each model produces pseudo-labels for unlabeled examples, which are used to train the other model, as sketched below.
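A conceptual sketch of that loop follows; asr, tts, and train_supervised are hypothetical stand-ins rather than the API of any particular library, and the real recipe in [38] adds filtering and scheduling details omitted here.

```python
def dual_learning(asr, tts, unlabeled_audio, unlabeled_text, train_supervised, rounds=3):
    """Hypothetical dual-learning loop: each model pseudo-labels data for the other."""
    for _ in range(rounds):
        # ASR transcribes raw audio; the (pseudo-text, audio) pairs supervise TTS.
        pseudo_text = [asr.transcribe(a) for a in unlabeled_audio]
        tts = train_supervised(tts, inputs=pseudo_text, targets=unlabeled_audio)
        # TTS synthesizes raw text; the (pseudo-audio, text) pairs supervise ASR.
        pseudo_audio = [tts.synthesize(t) for t in unlabeled_text]
        asr = train_supervised(asr, inputs=pseudo_audio, targets=unlabeled_text)
    return asr, tts
```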

Additionally, improving the ILM using unpaired text can enhance model robustness, which will be especially advantageous for closed-set challenges such as voice commands. The performance will be acceptable but not flawless in some applications, such as captioning YouTube videos, while in others, such as generating verbatim transcripts in court, it may take more time for models to meet the threshold. We anticipate that companies will gather data based on these models while manually correcting transcripts in 2023, and that we will see significant improvements in low-resource languages after fine-tuning on proprietary data in 2024.

Generative AI:

The use of avatars is expected to revolutionize human interaction with digital assets. In the short term, ASR will serve as one of the foundations of Generative AI, as these avatars will communicate through a textual/auditory interface.

But in the future, changes could occur as attention shifts towards new research directions. For example, an emerging technology that is likely to be adopted is textless NLP, which represents a new language-modeling approach to audio generation [39]. This approach uses learnable discrete audio units [40] and auto-regressively generates the next discrete audio unit one unit at a time, similar to text generation. These discrete units can later be decoded back to the audio domain. Thus far, this technology has been able to generate syntactically and semantically plausible speech continuations while also maintaining speaker identity and prosody for unseen speakers, as can be seen in GSLM/AudioLM [39, 41]. The potential of this technology is enormous, as one can skip the ASR component (and its errors) in many tasks. For example, traditional speech-2-speech (S2S) translation methods work as follows: they transcribe the utterance in the source language, then translate the text to the target language using a machine translation model, and finally generate the audio in the target language using a TTS engine. Using textless-NLP technology, S2S translation can be done with a single encoder-decoder architecture that works directly on discrete audio units, without using any explicit ASR model [42]. We predict that future textless NLP models will solve many other tasks without going through explicit transcription, such as question-answering systems. However, the main drawback of this method is error backtracking and debugging, as things get less intuitive when working in the discrete-unit space rather than on a transcription.
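Schematically, the generation loop looks like the sketch below; encoder, unit_lm, and vocoder are hypothetical stand-ins for a quantizer producing discrete units [40], an autoregressive language model over unit ids, and a unit-to-waveform decoder.

```python
def continue_speech(audio, encoder, unit_lm, vocoder, n_new_units=200):
    """Hypothetical textless-NLP continuation: treat audio as a sequence of discrete units."""
    units = encoder.quantize(audio)          # waveform -> list of discrete unit ids
    for _ in range(n_new_units):             # next-unit prediction, as in text LMs
        units.append(unit_lm.sample(units))  # condition on everything generated so far
    return vocoder.decode(units)             # unit ids -> waveform (speech continuation)
```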

T5 [43] and T0 [44] showed great success in NLP by utilizing multi-task training and demonstrating zero-shot task generalization. In 2021, SpeechT5 [27] was published, showing great success on various spoken language processing tasks. Earlier this year, VALL-E [45] and VALL-E X [46] were released. They showed impressive in-context learning capabilities for TTS models by using textless NLP technology, enabling cloning of a speaker’s voice from only a few seconds of audio, without requiring any fine-tuning, even in cross-lingual settings.

By joining the concepts taken from SpeechT5 and VALL-E, we can expect the release of T0-like models that can be prompted using either text, audio, or both, and generate text or synthesize audio as an output, depending on the task. A new era of models will begin, as in-context learning will enable generalization in zero-shot settings to new tasks. This will allow semantic search over audio and transcribing a target speaker using speaker-attributed ASR or a free-text description, e.g., “what did the young kid who coughed say?”. Furthermore, it will enable us to classify or synthesize audio using audio or a textual description and to solve NLP tasks directly from audio using explicit/implicit ASR.

Conversational AI:

Conversational AI has been adopted mainly through task-oriented dialogue systems, namely AI personal assistants (PAs) such as Amazon’s Alexa and Apple’s Siri. These PAs have become popular due to their ability to provide quick access to features and information through voice commands. As big tech companies dominate this technology, new regulations on AI assistants will force them to offer third-party options for voice assistants, opening up competition [47]. As this happens, we can expect interoperability between personal assistants, meaning they will start communicating with each other. This will be great, as one will be able to use any device to connect to any conversational agent anywhere in the world [48]. From the ASR perspective, this will pose new challenges, as the contextualization will be much broader, and assistants will have to be robust to different accents and possibly support multilingualism.

Over the past few years, a great technological leap has happened in text-based open-ended dialogue systems, e.g., BlenderBot and LaMDA [49–50]. Initially, these dialogue systems were text-based, meaning they were fed text and trained to output text, all in the written-form domain. As ASR performance improved, open-ended dialogue systems were augmented with voice-based HMI, which resulted in misalignment between modalities due to differences between the spoken and written forms. One of the main challenges is to bridge this gap by overcoming new types of errors introduced by audio-related processing, e.g., differences between spoken and written forms, such as disfluencies and entity resolution, and transcription errors, such as pronunciation errors [51–52].

Possible solutions can be derived from improved transcription quality and robust NLP models that can effectively handle transcription and pronunciation errors. A reliable acoustic-model confidence score [53] will be a key component in these systems, enabling them to point out speaker errors or to serve as another input to the NLP model or decoding logic. Furthermore, we expect that ASR models will predict non-verbal cues such as sarcasm, enabling agents to understand the conversation more deeply and provide better responses.

These improvements will make it possible to push dialogue systems with an auditory HMI further, supporting challenging accents and children’s speech, as in Loora [54] and Speak [55].

Pushing the limits even further, we expect the release of an E2E multi-task learning framework for spoken language tasks using joint modeling of the speech and NLP problems, as in MTL-SLT [56]. These models will be trained in an E2E fashion that reduces the cumulative error between sequential modules and will address tasks such as spoken language understanding, spoken summarization, and spoken question answering, taking speech as input and emitting various outputs such as transcriptions, intents, named entities, summaries, and answers to text queries.

Personalization will play a huge factor for AI assistants and open-ended dialogue systems, leading us to the next point: speaker-attributed ASR.

Speaker Attributed ASR:

There is still a challenge in transcribing distant conversations involving multiple microphones and parties in home environments. Even state-of-the-art (SoTA) systems can only achieve around 35% WER [57].

Early examples of joint ASR and diarization were released in 2019 [58]. This year, we can expect the release of an end-to-end speech enhancement-ASR-diarization system, which will improve performance on overlapped speech and enable better performance in challenging acoustic scenarios such as reverberant rooms, far-field settings, and low signal-to-noise ratios (SNR). The improvement will be achieved through joint task optimization, improved pre-training methods (such as WavLM [10]), architectural changes [59], data augmentation, and training on in-domain data during pre-training and fine-tuning [11]. Moreover, we can expect the deployment of speaker-attributed ASR systems for personalized speech recognition. This will further improve the transcription accuracy of the target speaker’s voice and bias the transcript towards user-defined words, such as contact names, proper nouns, and other named entities, which are crucial for smart assistants [60]. Additionally, low-latency models will continue to be a significant area of focus to enhance edge devices’ overall experience and response time [61–62].
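Until such joint systems are broadly available, a common stopgap is to combine a diarizer and an ASR model after the fact; the sketch below assigns each ASR segment to the diarization speaker it overlaps most, using simple (start, end) tuples as hypothetical inputs rather than any specific toolkit’s output format.

```python
def attribute_speakers(asr_segments, diarization_turns):
    """Label each ASR segment with the diarization speaker it overlaps most.

    asr_segments:      list of (start, end, text) from any ASR system.
    diarization_turns: list of (start, end, speaker) from any diarization system.
    A post-hoc heuristic, not the joint E2E modeling discussed above.
    """
    attributed = []
    for a_start, a_end, text in asr_segments:
        best_speaker, best_overlap = "unknown", 0.0
        for d_start, d_end, speaker in diarization_turns:
            overlap = min(a_end, d_end) - max(a_start, d_start)
            if overlap > best_overlap:
                best_speaker, best_overlap = speaker, overlap
        attributed.append((a_start, a_end, best_speaker, text))
    return attributed
```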

The Role of Startups Compared to Big Tech Companies in The ASR Landscape

Although big tech companies are expected to continue dominating the market with their APIs, small startups can still outperform them in specific domains. These include areas that are underrepresented in big tech’s training data due to regulations, such as the medical domain and children’s speech, and populations that have not yet adopted the technology, such as immigrants with challenging accents or individuals learning English worldwide. In markets where there isn’t enough demand for big tech companies to invest, such as languages that are not widely spoken, small startups may find opportunities to succeed and generate profit.

To create a win-win situation, big tech companies can provide APIs that offer full access to the output of their acoustic models while allowing others to write the decoding logic (WFST/beam-search) instead of merely adding customizable vocabulary or using current model adaptation features [63–64]. This approach will enable small startups to excel in their domains by incorporating priming or multiple language models during inference on top of the given acoustic model, rather than having to train the acoustic models themselves, which can be costly in terms of human capital and domain knowledge. In turn, big tech companies will benefit from broader adoption of their paid models.
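As a rough sketch of what such decoding-level access could look like, the function below scores a beam-search hypothesis by shallow fusion of the provider’s acoustic log-probability with one or more domain language models; the weights and the probability floor are illustrative choices, not any vendor’s API.

```python
import math

def fused_score(acoustic_logp: float, lm_probs: list[float], weights: list[float]) -> float:
    """Shallow-fusion-style hypothesis scoring used inside a beam search.

    acoustic_logp: log P(tokens | audio) from the provider's acoustic model.
    lm_probs:      P(tokens) under one or more domain language models.
    weights:       interpolation weight per language model (tuned on a dev set).
    """
    lm_logp = sum(w * math.log(max(p, 1e-12)) for w, p in zip(weights, lm_probs))
    return acoustic_logp + lm_logp
```

Hypotheses on the beam would be re-ranked by this fused score, letting, say, a medical-vocabulary LM bias the final transcript without retraining the acoustic model.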

How Does ASR Fit Into The Broader Machine Learning Landscape?

On one hand, ASR is on par in importance with computer vision (CV) and NLP when the transcript itself is the end task. This is the current situation in low-resource languages and in domains where the transcript is the main business, e.g., court proceedings, medical records, movie subtitles, etc.

On the other hand, ASR is no longer the bottleneck in domains where it has passed a certain usability threshold. In these cases, NLP is the bottleneck, which means that pushing ASR performance toward perfection is not essential for extracting insights for the end task. For example, meeting summarization or action-item extraction can often be achieved with current ASR quality.

Closing Remarks

The advancements in ASR technology have brought us closer to achieving seamless communication between humans and machines, for example in Conversational AI and Generative AI. With the continued development of speech enhancement-ASR-diarization systems and the emergence of textless NLP, we are poised to witness exciting breakthroughs in this field. As we look forward to the future, we can’t help but anticipate the endless possibilities that ASR technology will unlock.

Thank you for taking the time to read this post! Your thoughts and feedback on these projections are highly valued and appreciated. Please feel free to share your comments and ideas.

References:

[1] https://www.orcam.com/en/home/

[2] https://voicebot.ai/2022/12/01/hey-disney-custom-alexa-assistant-rolls-out-at-disney-world/

[3] Jose, Christin, et al. “Latency Control for Keyword Spotting.” ArXiv, 2022, https://doi.org/10.21437/Interspeech.2022-10608.

[4] Bijwadia, Shaan, et al. “Unified End-to-End Speech Recognition and Endpointing for Fast and Efficient Speech Systems.” ArXiv, 2022, https://doi.org/10.1109/SLT54892.2023.10022338.

[5] Yoon, Ji, et al. “HuBERT-EE: Early Exiting HuBERT for Efficient Speech Recognition.” ArXiv, 2022, https://doi.org/10.48550/arXiv.2204.06328.

[6] Kanda, Naoyuki, et al. “Transcribe-to-Diarize: Neural Speaker Diarization for Unlimited Number of Speakers Using End-to-End Speaker-Attributed ASR.” ArXiv, 2021, https://doi.org/10.48550/arXiv.2110.03151.

[7] Kanda, Naoyuki, et al. “Streaming Speaker-Attributed ASR with Token-Level Speaker Embeddings.” ArXiv, 2022, https://doi.org/10.48550/arXiv.2203.16685.

[8] https://www.fiercevideo.com/video/video-will-account-for-82-all-internet-traffic-by-2022-cisco-says

[9] Chiu, Chung, et al. “Self-Supervised Learning with Random-Projection Quantizer for Speech Recognition.” ArXiv, 2022, https://doi.org/10.48550/arXiv.2202.01855.

[10] Chen, Sanyuan, et al. “WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing.” ArXiv, 2021, https://doi.org/10.1109/JSTSP.2022.3188113.

[11] Hsu, Wei, et al. “Robust Wav2vec 2.0: Analyzing Domain Shift in Self-Supervised Pre-Training.” ArXiv, 2021, https://doi.org/10.48550/arXiv.2104.01027.

[12] Lugosch, Loren, et al. “Pseudo-Labeling for Massively Multilingual Speech Recognition.” ArXiv, 2021, https://doi.org/10.48550/arXiv.2111.00161.

[13] Berrebbi, Dan, et al. “Continuous Pseudo-Labeling from the Start.” ArXiv, 2022, https://doi.org/10.48550/arXiv.2210.08711.

[14] Radford, Alec, et al. “Robust Speech Recognition via Large-Scale Weak Supervision.” ArXiv, 2022, https://doi.org/10.48550/arXiv.2212.04356.

[15] Graves, Alex, et al. “Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks.” ICML, 2006, https://www.cs.toronto.edu/~graves/icml_2006.pdf

[16] Graves, Alex. “Sequence Transduction with Recurrent Neural Networks.” ArXiv, 2012, https://doi.org/10.48550/arXiv.1211.3711.

[17] Chan, William, et al. “Listen, Attend and Spell.” ArXiv, 2015, https://doi.org/10.48550/arXiv.1508.01211.

[18] Variani, Ehsan, et al. “Hybrid Autoregressive Transducer (Hat).” ArXiv, 2020, https://doi.org/10.48550/arXiv.2003.07705.

[19] Meng, Zhong, et al. “Modular Hybrid Autoregressive Transducer.” ArXiv, 2022, https://doi.org/10.48550/arXiv.2210.17049.

[20] Kim, Suyoun, et al. “Evaluating User Perception of Speech Recognition System Quality with Semantic Distance Metric.” ArXiv, 2021, https://doi.org/10.48550/arXiv.2110.05376.

[21] Gulati, Anmol, et al. “Conformer: Convolution-Augmented Transformer for Speech Recognition.” ArXiv, 2020, https://doi.org/10.48550/arXiv.2005.08100.

[22] Vaswani, Ashish, et al. “Attention Is All You Need.” ArXiv, 2017, https://doi.org/10.48550/arXiv.1706.03762.

[23] Baevski, Alexei, et al. “Wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations.” ArXiv, 2020, https://doi.org/10.48550/arXiv.2006.11477.

[24] Babu, Arun, et al. “XLS-R: Self-Supervised Cross-Lingual Speech Representation Learning at Scale.” ArXiv, 2021, https://doi.org/10.48550/arXiv.2111.09296.

[25] https://blog.google/technology/ai/ways-ai-is-scaling-helpful/

[26] https://ai.facebook.com/blog/teaching-ai-to-translate-100s-of-spoken-and-written-languages-in-real-time/

[27] Ao, Junyi, et al. “SpeechT5: Unified-Modal Encoder-Decoder Pre-Training for Spoken Language Processing.” ArXiv, 2021, https://doi.org/10.48550/arXiv.2110.07205.

[28] https://github.com/guillaumekln/faster-whisper

[29] https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/SpeechRecognition/wav2vec2

[30] Meng, Zhong, et al. “JEIT: Joint End-to-End Model and Internal Language Model Training for Speech Recognition.” ArXiv, 2023, https://doi.org/10.48550/arXiv.2302.08583.

[31] Zhang, Yu, et al. “Google USM: Scaling Automatic Speech Recognition Beyond 100 Languages.” ArXiv, 2023, https://doi.org/10.48550/arXiv.2303.01037.

[32] Kendall, T. and Farrington, C. “The corpus of regional african american language”. Version 2021.07. Eugene, OR: The Online Resources for African American Language Project. http://oraal.uoregon.edu/coraal, 2021

[33] Brian, D Tran, et al. ‘“Mm-hm,” “Uh-uh”: are non-lexical conversational sounds deal breakers for the ambient clinical documentation technology?,’ Journal of the American Medical Informatics Association, 2023, https://doi.org/10.1093/jamia/ocad001

[34] Zhu, Ge, et al. “Filler Word Detection and Classification: A Dataset and Benchmark.” ArXiv, 2022, https://doi.org/10.48550/arXiv.2203.15135.

[35] Anwar, Mohamed, et al. “MuAViC: A Multilingual Audio-Visual Corpus for Robust Speech Recognition and Robust Speech-to-Text Translation.” ArXiv, 2023, https://doi.org/10.48550/arXiv.2303.00628.

[36] Jaegle, Andrew, et al. “Perceiver IO: A General Architecture for Structured Inputs & Outputs.” ArXiv, 2021, https://doi.org/10.48550/arXiv.2107.14795.

[37] https://www.assemblyai.com/blog/conformer-1/

[38] Peyser, Cal, et al. “Dual Learning for Large Vocabulary On-Device ASR.” ArXiv, 2023, https://doi.org/10.48550/arXiv.2301.04327.

[39] Lakhotia, Kushal, et al. “Generative Spoken Language Modeling from Raw Audio.” ArXiv, 2021, https://doi.org/10.48550/arXiv.2102.01192.

[40] Zeghidour, Neil, et al. “SoundStream: An End-to-End Neural Audio Codec.” ArXiv, 2021, https://doi.org/10.48550/arXiv.2107.03312.

[41] Borsos, Zalán, et al. “AudioLM: a Language Modeling Approach to Audio Generation.” ArXiv, 2022, https://doi.org/10.48550/arXiv.2209.03143.

[42] https://about.fb.com/news/2022/10/hokkien-ai-speech-translation/

[43] Raffel, Colin, et al. “Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer.” ArXiv, 2019, https://doi.org/10.48550/arXiv.1910.10683.

[44] Sanh, Victor, et al. “Multitask Prompted Training Enables Zero-Shot Task Generalization.” ArXiv, 2021, https://doi.org/10.48550/arXiv.2110.08207.

[45] Wang, Chengyi, et al. “Neural Codec Language Models Are Zero-Shot Text to Speech Synthesizers.” ArXiv, 2023, https://doi.org/10.48550/arXiv.2301.02111.

[46] Zhang, Ziqiang, et al. “Speak Foreign Languages with Your Own Voice: Cross-Lingual Neural Codec Language Modeling.” ArXiv, 2023, https://doi.org/10.48550/arXiv.2303.03926.

[47] https://voicebot.ai/2022/07/05/eu-passes-new-regulations-for-voice-ai-and-digital-technology/

[48] https://www.speechtechmag.com/Articles/ReadArticle.aspx?ArticleID=154094

[49] Thoppilan, Romal, et al. “LaMDA: Language Models for Dialog Applications.” ArXiv, 2022, https://doi.org/10.48550/arXiv.2201.08239.

[50] Shuster, Kurt, et al. “BlenderBot 3: a Deployed Conversational Agent that Continually Learns to Responsibly Engage.” ArXiv, 2022, https://doi.org/10.48550/arXiv.2208.03188.

[51] Xiaozhou, Zhou, et al. “Phonetic Embedding for ASR Robustness in Entity Resolution.” Proc. Interspeech 2022, 3268–3272, doi: 10.21437/Interspeech.2022-10956

[52] Chen, Angelica, et al. “Teaching BERT to Wait: Balancing Accuracy and Latency for Streaming Disfluency Detection.” ArXiv, 2022, https://doi.org/10.48550/arXiv.2205.00620.

[53] Li, Qiujia, et al. “Improving Confidence Estimation on Out-of-Domain Data for End-to-End Speech Recognition.” ArXiv, 2021, https://doi.org/10.48550/arXiv.2110.03327.

[54] https://loora.ai/

[55] https://techcrunch.com/2022/11/17/speak-lands-investment-from-openai-to-expand-its-language-learning-platform/

[56] Zhiqi, Huang, et al. “MTL-SLT: Multi-Task Learning for Spoken Language Tasks.” NLP4ConvAI, 2022, https://aclanthology.org/2022.nlp4convai-1.11

[57] Watanabe, Shinji, et al. “CHiME-6 Challenge: Tackling Multispeaker Speech Recognition for Unsegmented Recordings.” ArXiv, 2020, https://doi.org/10.48550/arXiv.2004.09249.

[58] Shafey, Laurent, et al. “Joint Speech Recognition and Speaker Diarization via Sequence Transduction.” ArXiv, 2019, https://doi.org/10.48550/arXiv.1907.05337.

[59] Kim, Juntae, and Lee, Jeehye. “Generalizing RNN-Transducer to Out-Domain Audio via Sparse Self-Attention Layers.” ArXiv, 2021, https://doi.org/10.48550/arXiv.2108.10752.

[60] Sathyendra, Kanthashree, et al. “Contextual Adapters for Personalized Speech Recognition in Neural Transducers.” ArXiv, 2022, https://doi.org/10.48550/arXiv.2205.13660.

[61] Tian, Jinchuan, et al. “Bayes Risk CTC: Controllable CTC Alignment in Sequence-to-Sequence Tasks.” ArXiv, 2022, https://doi.org/10.48550/arXiv.2210.07499.

[62] Tian, Zhengkun, et al. “Peak-First CTC: Reducing the Peak Latency of CTC Models by Applying Peak-First Regularization.” ArXiv, 2022, https://doi.org/10.48550/arXiv.2211.03284.

[63] https://docs.rev.ai/api/custom-vocabulary/

[64] https://cloud.google.com/speech-to-text/docs/adaptation-model


Overcoming Automatic Speech Recognition Challenges: The Next Frontier was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.
