Latest result from JD.com and the University of Science and Technology of China: letting AI speak like a real person and gesture vividly

2022-03-28

When humans speak, they naturally produce body movements that reinforce what they are saying. Now, researchers from the University of Science and Technology of China and JD.com have given AI the same ability: feed it speech audio of any kind and it can generate the corresponding gestures. For the same audio, it can also generate many different poses.

A "two-stream" architecture

Because everyone's habits differ, there is no fixed correspondence between speech and body movements, which makes speech-to-gesture generation a rather difficult task.

△ Representative Italian-style speech gestures

Most existing methods are tied to particular styles and map speech to body movements in a deterministic way, and the results are not particularly good. Inspired by research in linguistics, the authors of this paper decompose co-speech motion into two complementary parts, pose modes and rhythmic dynamics, and propose a novel speech-to-gesture model named FreeMo.

FreeMo adopts a "two-stream" architecture. One branch generates the primary posture, and the other branch "plays the rhythm", applying small rhythmic motions on top of the primary posture so that the final gesture is richer and more natural.

As mentioned above, a speaker's posture is mostly habitual and carries no conventional semantics. The authors therefore place no special constraints on the generated poses; instead, they introduce conditional sampling in a latent space to learn diverse poses.

To ease processing, the input audio is divided into very short segments, from which MFCC speech features and the speech transcript are extracted. The primary pose is generated by keyword matching on the transcript, while the MFCC features drive the generation of rhythmic motion. The rhythm-motion generator is built from convolutional networks; the overall pipeline is shown in the paper's figure (see the sketches after the results below).

The red boxes mark each motion sequence's offset from its average pose. By swapping the offsets of two sequences, the model can control the "rhythm" without affecting the primary posture.

More diverse, more natural, more synchronized

FreeMo's training and test videos include the Speech2Gesture dataset, which consists largely of TV-host footage. Because those videos suffer heavy environmental interference (such as audience cheering) and the hosts' movements can be limited, the authors also collected TED talks and YouTube videos for training and testing.

The SOTA models used for comparison include:
Audio to Body Dynamics (Audio2Body), based on an RNN
Speech2Gesture (S2G), based on a convolutional network
Speech Drives Templates (Tmpt), equipped with a set of pose templates
Mix-StAGE, which can generate a set of styles for each speaker
Trimodal Context (TriCon), also RNN-based, whose inputs include audio, text, and speaker identity

Three metrics are used: (1) synchronization between speech and motion; (2) diversity of the motion; (3) quality relative to the speaker's real motion.

In the end, FreeMo surpassed the five SOTA models on all three metrics and achieved the best results.

△ For the synchronization score, lower is better

△ Scores for diversity and quality

P.S. Because the five SOTA models essentially learn deterministic mappings, they have no diversity at all.
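For a concrete feel for the feature step described above, here is a minimal sketch of splitting speech audio into short segments and extracting MFCCs from each, using librosa; the segment length, sample rate, and number of coefficients are illustrative assumptions, not values from the paper.

```python
import librosa
import numpy as np

def segment_mfcc(path, seg_seconds=0.2, n_mfcc=13, sr=16000):
    """Split a speech recording into short segments and return one MFCC
    feature vector per segment (averaged over the frames in that segment)."""
    y, sr = librosa.load(path, sr=sr)            # mono waveform
    seg_len = int(seg_seconds * sr)              # samples per segment (assumed length)
    feats = []
    for start in range(0, len(y) - seg_len + 1, seg_len):
        seg = y[start:start + seg_len]
        mfcc = librosa.feature.mfcc(y=seg, sr=sr, n_mfcc=n_mfcc)
        feats.append(mfcc.mean(axis=1))          # average over time frames
    return np.stack(feats)                       # shape: (num_segments, n_mfcc)
```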
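And a minimal sketch of the offset-swapping idea described earlier, assuming pose sequences stored as (frames, joints, 2) arrays; this only illustrates the mechanism as the article describes it, not the authors' implementation.

```python
import numpy as np

def swap_rhythm(seq_a, seq_b):
    """Swap the per-frame offsets (the 'rhythm') of two pose sequences
    while keeping each sequence's average pose (the primary posture).

    seq_a, seq_b: float arrays of shape (frames, joints, 2) -- an assumed layout.
    """
    mean_a = seq_a.mean(axis=0, keepdims=True)   # average pose of sequence A
    mean_b = seq_b.mean(axis=0, keepdims=True)   # average pose of sequence B
    offset_a = seq_a - mean_a                    # A's deviation from its mean pose
    offset_b = seq_b - mean_b                    # B's deviation from its mean pose
    # Recombine: A's posture with B's rhythm, and B's posture with A's rhythm.
    return mean_a + offset_b, mean_b + offset_a
```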
Some more intuitive quality comparisons: the top-left corner shows the real speaker's motion. FreeMo performs best (Audio2Body also does well).

About the authors

The first author, Xu Jing, is from the University of Science and Technology of China. The corresponding author is Mei Tao, deputy head of the AI Research Institute under JD's AI Platform and Research Division, Vice President of Technology at JD Group, and an IEEE Fellow. The other three authors are Zhang Wei and Bai Yalong, researchers at JD AI, and Professor Sun Qibin of the University of Science and Technology of China.


Source: QbitAI
