Multimodal human-computer interaction brings "virtual humans" to life

2022-01-24

"Hello, little cloth! What's delicious nearby?" As soon as the user's voice fell, a small window appeared on the mobile phone, and the ranking of nearby hotels was clear at a glance. The "Xiaobu" in the dialogue is the smart assistant of oppo mobile phone. Some time ago, it became the first mobile phone smart assistant based on "virtual human" multimodal interaction in the industry. At the end of last year, the "virtual human" market heated up rapidly. In addition to oppo, Jingdong, Baidu, Alibaba and other science and technology enterprises have launched their own super realistic digital people. Station B also specially sets up partitions for virtual anchors. "Virtual people" have entered people's lives. One of the important reasons why "virtual human" is popular is people's deeper need for human-computer interaction. From simple text to speech, and then to the integration of computer vision and other technologies, human nature tends to integrate the interactive process of vision, hearing and other senses. The multimodal human-computer interaction technology behind the "virtual human" can just meet people's demand for gradually increasing the dimension of external information, make the "virtual human" look and sound like a person, and have more human temperature. Technical support behind "virtual human" Human computer interaction has gone through several stages, such as keyboard interaction, touch interaction, voice interaction and so on. Nowadays, because users put forward higher requirements for the convenience, naturalness and accuracy of human-computer interaction, multimodal human-computer interaction, which is more intelligent and can understand users' intention, has begun to become an important trend in the development of human-computer interaction. In an interview, Wan Yulong, chief architect of oppo Xiaobu assistant, told China Electronics News that when the deep learning algorithm gradually tends to be industrialized in various technical directions, intelligent interaction becomes more and more important. After that, sensors, vision technology, voice technology and natural language processing technology have been upgraded iteratively, and the integration of various technologies has formed a multimodal human-computer interaction mode. Through the understanding and generation of words, speech and vision, combined with action recognition and driving, environmental perception and other ways, multimodal human-computer interaction can fully simulate the interaction between people. Wan Yulong, for example, service robots in complex environments such as subways, banks and shopping malls combine technologies such as sensors, face recognition and voice interaction to help people complete tasks such as information query, ticket purchase and business navigation. At this stage, the most popular representative in the field of multimodal human-computer interaction is "virtual human". Wan Yulong told reporters that thanks to the fire of the concept of the meta universe, the "small incision" of the "virtual human" in the meta universe world has also attracted extensive attention in the industry. In the third quarter of 2021, oppo launched the first "virtual human" version of smart assistant Xiaobu, adding another fire to the "virtual human" market. 
At this stage, the most prominent representative of multimodal human-computer interaction is the "virtual human". Wan Yulong told reporters that, riding on the popularity of the metaverse concept, the "virtual human", as a small entry point into the metaverse, has attracted wide attention across the industry. In the third quarter of 2021, OPPO launched the first "virtual human" version of its smart assistant Xiaobu, adding more fuel to the "virtual human" market.

Publicly available information shows that the Xiaobu "virtual human" draws on multimodal fusion algorithms spanning vision, speech and natural language processing, and uses a range of foundational technologies to deliver content services, real-time interaction and emotional interaction with users across multiple scenario ecosystems. As one of the important achievements in multimodal human-computer interaction, the "virtual human" relies on front-end acoustic processing, voice wake-up, speech recognition, dialogue understanding and management, speech synthesis, computer vision and computer graphics.

Wan Yulong explained that voice interaction starts from dialogue understanding, generates the corresponding reply text and content services through dialogue management, and produces the spoken audio with text-to-speech (TTS) technology. On this basis, multimodal interaction for a virtual human must further extract the expressive information contained in the reply text and generate the corresponding facial expressions, mouth shapes and actions through analysis of the text and speech. "Beyond the mouth shape, to present the expressions of the eyes and face, as well as the gestures we make when we speak or are happy, we need 3D character design and modeling, real-time prediction of the driving parameters for each part of the character's body according to what is being expressed, and then a rendering engine to drive the character model," he said. For example, when someone says "big," the mouth opens wide; when someone says the letter "O," the lips form a circle. To make the smart assistant even smarter, the interaction process also draws on a wide range of technical fields such as knowledge graphs and content recommendation.
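The mouth-shape driving Wan Yulong describes can be sketched in a few lines: map each phoneme to a viseme (a visual mouth pose) and interpolate blendshape weights frame by frame so a rendering engine can animate the lips smoothly. The phoneme labels, viseme table and parameter names below are illustrative assumptions, not OPPO's pipeline or any particular engine's API.

```python
from typing import Dict, List

# Coarse phoneme-to-viseme table: an open vowel (as in "big") opens the jaw
# wide, while "OW" (the letter "O") rounds the lips into a circle.
PHONEME_TO_VISEME: Dict[str, Dict[str, float]] = {
    "AA":  {"jaw_open": 0.9, "lip_round": 0.1},
    "OW":  {"jaw_open": 0.4, "lip_round": 0.9},
    "M":   {"jaw_open": 0.0, "lip_round": 0.3},
    "SIL": {"jaw_open": 0.0, "lip_round": 0.0},
}


def drive_mouth(phonemes: List[str], frames_per_phoneme: int = 3) -> List[Dict[str, float]]:
    """Expand a phoneme sequence into per-frame blendshape weights,
    linearly interpolating between neighbouring visemes so the mouth
    moves smoothly instead of snapping from pose to pose."""
    frames: List[Dict[str, float]] = []
    padded = phonemes + ["SIL"]  # relax back to a closed mouth at the end
    for cur, nxt in zip(padded, padded[1:]):
        a, b = PHONEME_TO_VISEME[cur], PHONEME_TO_VISEME[nxt]
        for i in range(frames_per_phoneme):
            t = i / frames_per_phoneme
            frames.append({k: (1 - t) * a[k] + t * b[k] for k in a})
    return frames


if __name__ == "__main__":
    # e.g. the open vowel in "big" followed by the letter "O"
    for frame in drive_mouth(["AA", "OW"]):
        print(frame)
```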
AI learning also requires massive data accumulation

At present, the virtual human faces key technical difficulties in three areas, Wan Yulong told China Electronics News.

First, in image generation, users will increasingly expect the "virtual human" they build to be highly realistic, with fine-grained features such as hair and clothing texture rendered faithfully. Only when the "virtual human" stands in front of the user like a living person will the user feel the distance between them has narrowed. "But achieving this involves many technologies, it is very difficult to handle, and production costs remain high," Wan Yulong admitted.

Second, in image driving, the movements of the "virtual human" need to be smooth and natural rather than stiff and robotic. When people communicate, their hands, eyes, expressions and whole body move in response to what they are expressing and how they feel. For a "virtual human" to do the same requires much stronger machine learning and deep learning capability. Only after accumulating large amounts of data on real people's facial and body expression will the AI gradually approach a real person, and that is a very long process.

Third, image interaction is especially important, because interactivity is the biggest selling point of the "virtual human". If it cannot give users a natural, comfortable interactive experience, users will quickly lose interest. But improving interaction is not simple. For example, when answering a question, a person usually draws on background knowledge and the context of the sentence to give an appropriate answer quickly. An intelligent virtual assistant must build and enrich its knowledge base by learning from large volumes of human conversation data. Acquiring this data is not easy: the amount needed for AI learning is enormous and must be constantly updated. After acquiring the data, the AI must also control and screen its quality, which is difficult to check item by item. If the AI lacks the ability to discriminate, content it has already learned is hard to correct afterwards, and inappropriate statements may harm users. Moreover, if a user asks the AI a factual question, it may pull an answer from Zhihu or another website, which raises intellectual property issues. Nor can the knowledge the AI learns be guaranteed to be professionally reliable: a sick user should not ask the intelligent virtual assistant what medicine to take, because the professionalism of the answer cannot be guaranteed, and acting on a wrong answer could endanger their health. In short, the "virtual human" still needs more technical accumulation before it can communicate with users naturally and without barriers.

Expanding into areas with more application value

Although technical difficulties remain, the underlying technology has been advancing in recent years. Wan Yulong told China Electronics News that the modeling process and pipeline have become simpler, whether for voice interaction technologies such as speech recognition, dialogue understanding and speech synthesis, or for multimodal driving-parameter prediction technologies such as lip driving and expression driving. "At the model level of machine learning, algorithm iteration has brought model training and tuning to an ever lower threshold," he said.

Growing computing power will also bring the image of the "virtual human" closer to a real person. The computing power of phones and other devices keeps increasing, as does that of cloud servers, enabling AI engineers to generate more complex and realistic characters. In 2021, a video of a "virtual human" speech by NVIDIA CEO Jensen Huang went viral worldwide, and NVIDIA's Omniverse platform came further into public view. Omniverse is a real-time 3D design collaboration and virtual-world simulation platform that aims to become a foundation for connecting virtual worlds by integrating graphics, AI, simulation and scalable computing into one platform. Wan Yulong noted that NVIDIA's powerful GPU computing has enabled it to build more lifelike characters, which shows that computing power has indeed reached a higher level and makes the rendering of hyper-realistic characters more feasible.
With conversational AI technology continuing to improve on one side and the ability to build virtual character images growing stronger on the other, the whole dialogue experience has become more intelligent. Cognitive capabilities such as dialogue understanding and knowledge graphs have reached a higher level, making the "virtual human" increasingly ready for commercialization. Some say the car is the next-generation mobile terminal and is expected to become a mobile carrier for human-computer and emotional interaction. So could the "virtual human" appear in the smart cockpit? In Wan Yulong's view, both phones and cars can be regarded as carriers of intelligent interaction. The "virtual human" OPPO has launched currently focuses on improving the interactive experience of smart devices such as phones, TVs and wearables. Once smart cockpits and similar devices reach a certain scale, he said, the smart assistant will have the opportunity to interact frequently with users on those devices, and application value is bound to emerge in those scenarios. Wherever there is application value, the reach of the "virtual human" can be expected to extend. (Xinhua News Agency)

Editor: Li Ling    Responsible editor: Chen Jie

Source: Science and Technology Daily
