● LINE among the top contributors in Japan with the largest number of accepted papers (14, including those written or co-written by NAVER)
● New non-autoregressive automatic speech recognition method and advancement of phrase break prediction for voice synthesis win praise
TOKYO – August 30, 2021 – LINE Corporation ("LINE") announces that six of its research papers have been accepted by INTERSPEECH 2021, the world's largest conference on speech processing.
Hosted by the International Speech Communication Association (ISCA), INTERSPEECH is now in its 22nd edition; this year, approximately 1,000 research papers were accepted out of around 2,000 submissions. The authors will present their work at INTERSPEECH 2021, held virtually from today through September 3.
LINE’s proposed new method for non-autoregressive automatic speech recognition and research on phrase break prediction for voice synthesis among accepted papers
LINE’s research papers have been accepted in a number of fields. In speech recognition, LINE proposed a new way to improve the accuracy of non-autoregressive automatic speech recognition (ASR)*1, a methodology that is gaining attention for enabling faster speech recognition [1]. The paper focuses on the approach that combines CTC*2 with a Transformer*3 and proposes a way to address one of its current limitations: the conditional independence assumption between output tokens (i.e., individual characters or words).
Unlike the conventional approach of recognizing speech only from the output of the last layer, LINE's method first makes a prediction at an intermediate layer and adds it back to the features passed to the subsequent layers, so that the final prediction is conditioned on it. In other words, the model relaxes the conditional independence assumption of CTC-based ASR by conditioning the final prediction on intermediate predictions: the intermediate predictions carry "token-domain information" into the subsequent Transformer layers, so the resulting features take relationships between tokens into account. While LINE's proposal is a simple extension of an existing method, its recognition accuracy is superior to that of other modern methods.
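A minimal PyTorch sketch of this idea (illustrative only, not LINE's actual implementation) is shown below: an intermediate Transformer layer's CTC prediction is projected back into the feature space and added to the features passed to the remaining layers, so that the final prediction is conditioned on it. The layer counts, dimensions, and module names here are assumptions made for the example.

```python
import torch.nn as nn


class SelfConditionedCTCEncoder(nn.Module):
    """Transformer encoder that feeds an intermediate CTC prediction back into
    the feature stream so that later layers can condition on it (sketch)."""

    def __init__(self, d_model=256, nhead=4, num_layers=12, vocab_size=500,
                 conditioning_layers=(5,)):
        super().__init__()
        self.layers = nn.ModuleList(
            [nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
             for _ in range(num_layers)])
        self.ctc_head = nn.Linear(d_model, vocab_size)   # shared CTC projection
        self.reproject = nn.Linear(vocab_size, d_model)  # token probabilities -> features
        self.conditioning_layers = set(conditioning_layers)

    def forward(self, x):  # x: (batch, frames, d_model) acoustic features
        intermediate_logits = []
        for i, layer in enumerate(self.layers):
            x = layer(x)
            if i in self.conditioning_layers:
                logits = self.ctc_head(x)                # intermediate prediction
                intermediate_logits.append(logits)
                # Add token-domain information back to the features so that the
                # remaining layers see the intermediate token estimates.
                x = x + self.reproject(logits.softmax(dim=-1))
        return self.ctc_head(x), intermediate_logits     # final + intermediate logits
```

In a setup like this, both the final and the intermediate predictions would typically be trained with CTC losses.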
In the field of speech synthesis, LINE's accepted paper addresses phrase break prediction, which improves the quality of synthesized speech by inserting pauses at appropriate positions [3]. The proposed method leverages the expressiveness of BERT (Bidirectional Encoder Representations from Transformers), a model increasingly used for a wide variety of natural language processing tasks, and combines it with a conventional LSTM (Long Short-Term Memory) network to improve both phrase break prediction accuracy and the quality of synthesized speech.
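As a rough illustration of such a combination (not the authors' exact architecture), contextual token embeddings from a pretrained BERT encoder can be fed into a bidirectional LSTM, with a per-token classifier deciding whether a phrase break follows each token. The encoder interface, dimensions, and label set below are assumptions.

```python
import torch.nn as nn


class PhraseBreakPredictor(nn.Module):
    """BERT contextual embeddings -> BiLSTM -> per-token break/no-break logits."""

    def __init__(self, bert_encoder, bert_dim=768, hidden=256, num_labels=2):
        super().__init__()
        # Any pretrained BERT-style encoder exposing .last_hidden_state
        # (e.g. a Hugging Face AutoModel) is assumed here.
        self.bert = bert_encoder
        self.lstm = nn.LSTM(bert_dim, hidden, batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden, num_labels)

    def forward(self, input_ids, attention_mask):
        # (batch, seq_len, bert_dim) contextual token embeddings
        hidden_states = self.bert(input_ids=input_ids,
                                  attention_mask=attention_mask).last_hidden_state
        lstm_out, _ = self.lstm(hidden_states)            # (batch, seq_len, 2 * hidden)
        return self.classifier(lstm_out)                  # per-token phrase-break logits
```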
Another accepted paper comes from a joint research project between LINE and NAVER on a multi-band harmonic-plus-noise Parallel WaveGAN (PWG)*4 model that aims to improve the quality of PWG [4]. Inspired by classical signal processing, LINE and NAVER add a mechanism to PWG that weights and mixes the periodic and aperiodic components of speech for each frequency band, enabling higher-quality speech synthesis.
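In heavily simplified form, that mixing step can be sketched as follows: a periodic (harmonic) component and an aperiodic (noise) component are split into frequency bands, each band is mixed with a weight, and the bands are summed back into a waveform. In the actual model the components and weights are produced within the Parallel WaveGAN generator; the band-splitting filters, weights, and toy signals below are placeholder assumptions.

```python
import numpy as np
from scipy.signal import butter, sosfilt


def split_into_bands(x, sr, band_edges):
    """Split signal x into adjacent frequency bands given edge frequencies in Hz."""
    bands = []
    for lo, hi in zip(band_edges[:-1], band_edges[1:]):
        sos = butter(4, [lo, hi], btype="bandpass", fs=sr, output="sos")
        bands.append(sosfilt(sos, x))
    return np.stack(bands)                        # (num_bands, num_samples)


def mix_harmonic_plus_noise(harmonic, noise, weights, sr, band_edges):
    """Weight and mix periodic and aperiodic components per frequency band."""
    h_bands = split_into_bands(harmonic, sr, band_edges)
    n_bands = split_into_bands(noise, sr, band_edges)
    w = np.asarray(weights)[:, None]              # (num_bands, 1), values in [0, 1]
    mixed = w * h_bands + (1.0 - w) * n_bands     # larger weight -> more periodic energy
    return mixed.sum(axis=0)                      # recombine bands into one waveform


# Toy example: a 100 Hz harmonic source plus white noise, mixed over four bands.
sr = 24000
t = np.arange(sr) / sr
harmonic = sum(0.3 * np.sin(2 * np.pi * 100 * k * t) for k in range(1, 4))
noise = 0.1 * np.random.randn(sr)
waveform = mix_harmonic_plus_noise(
    harmonic, noise, weights=[0.9, 0.7, 0.4, 0.1],
    sr=sr, band_edges=[50, 1000, 3000, 6000, 11000])
```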
*1 Non-autoregressive automatic speech recognition: A method of recognizing speech at each point in time without depending on previously generated text.
*2 CTC (Connectionist Temporal Classification): A training method used when a neural network's input sequence is longer than its output sequence, as when mapping audio frames to text.
*3 Transformer: A type of neural network model composed solely of attention-based layers. It contains none of the recurrence or convolution typically considered important in conventional models.
*4 Parallel WaveGAN (PWG): A non-autoregressive, fast waveform generation model based on generative adversarial networks (GAN). A GAN consists of two modules, a generator and a discriminator, that are trained against each other: the generator learns to produce outputs that fool the discriminator, while the discriminator learns to distinguish generated outputs from real ones.
Accepted papers
1. Jumon Nozaki, Tatsuya Komatsu, "Relaxing the Conditional Independence Assumption of CTC-based ASR by Conditioning on Intermediate Predictions"
2. Tatsuya Komatsu, Shinji Watanabe, Koichi Miyazaki, Tomoki Hayashi, "Acoustic Event Detection with Classifier Chains"
3. Kosuke Futamata, Byeongseon Park, Ryuichi Yamamoto, Kentaro Tachibana, "Phrase break prediction with bidirectional encoder representations in Japanese text-to-speech synthesis"
4. Min-Jae Hwang, Ryuichi Yamamoto, Eunwoo Song, Jae-Min Kim, "High-fidelity Parallel WaveGAN with Multi-band Harmonic-plus-Noise Model"
5. Yu Nakagome, Masahito Togami, Tetsuji Ogawa, Tetsunori Kobayashi, "Efficient and Stable Adversarial Learning Using Unpaired Data for Unsupervised Multichannel Speech Separation"
6. Masahito Togami, Robin Scheibler, "Sound Source Localization with Majorization Minimization"
Fundamental AI research
At LINE, AI is positioned as one of the company's strategic businesses. While collaborating with NAVER to create new AI services and features and to conduct basic research into the underlying technologies, LINE aims to accelerate both R&D on AI technology and the growth of its AI-driven businesses. To shorten the path from research through development to production, the teams in charge of data platform development, data analysis, machine learning, AI technology development, and basic research also work together beyond their own businesses and domains.
When it comes to basic research, LINE has placed machine learning at the center while focusing on research areas such as speech, language, and image processing. Recognition of LINE's research work includes the following accomplishments:
- February 2021: ICASSP, the world's largest international conference in the field of speech, acoustics and signal processing, accepts seven of LINE's research papers, putting LINE among the top contributors in Japan with the greatest number of accepted papers*5
- July 2021: ICCV 2021, one of the largest international conferences on computer vision, accepts two of LINE's papers*6
*5 See press release: https://linecorp.com/en/pr/news/en/2021/3640
*6 See press release: https://linecorp.com/ja/pr/news/ja/2021/3843 (Japanese only)
About the LINE CLOVA brand showcasing LINE's AI technology
LINE's AI tech brand, LINE CLOVA, uses diverse AI technologies and services to resolve the hidden complications of daily life and business and to elevate the quality of social functions and living, with the aim of helping create a more convenient and enriching world. Currently, LINE CLOVA offers CLOVA Speech (speech recognition), CLOVA Voice (speech synthesis), and solutions that combine these speech technologies.
LINE AiCall is one example: by combining CLOVA Speech and CLOVA Voice with a dialogue control system, it lets AI respond naturally to user requests and guide users to their goals, and it is increasingly being adopted by governments, restaurants, and call centers. Another is CLOVA Note, an application announced last year that transcribes meeting conversations with high accuracy and records and manages them as minutes. This accuracy comes from a speech recognition model specialized in analyzing many hours of recorded audio.
LINE CLOVA will continue striving to both enhance the quality of its existing offerings and create new features/services by proactively advancing basic research on AI tech.
Going forward, LINE will continue to actively work on developing businesses and boosting service value to further expand its growth and vast potential as a communication infrastructure.
■ About LINE
Based in Japan, LINE is dedicated to the mission of “Closing the Distance,” bringing together information, services and people. The LINE messaging app launched in June 2011 and since then has grown into a diverse, global ecosystem that includes AI technology, fintech and more. LINE joined the Z Holdings Group, one of the largest internet service groups in Japan, following the completion of a business integration in March 2021.