From 'nothing to something' to 'something to excellent': China's domestic video generation models are entering a favorable stage

2024-08-08

Half a year after the debut of Sora, OpenAI's text-to-video model, its Chinese "challengers" are lining up to compete for a "ticket" to the next killer AI application. Over the past month, four domestic video generation models have been launched and opened to the public. Unlike Sora, which has only released sample clips and remains closed to users, China's video generation models go live the moment they are announced, ready to be used right away.

On the application side, a group of content creators have for the first time tasted the joy of "typing out a video" and "making videos without hiring a crew"; on the technical side, another batch of video generation models is still incubating. Although domestic tools cannot yet generate minute-long clips in a single pass, nor achieve the seamless, silky quality of real footage, video generation has solved the problem of "existence" and is steadily evolving toward "excellence".

The text-to-video track has recently filled with strong Chinese players, stirring the imagination. In late July, the Chinese AI unicorn Zhipu AI launched the video generation model "Qingying", Aishi Technology released the video generation product PixVerse V2, and Shengshu Technology launched the video generation model Vidu. Meanwhile, "Keling AI", released by Kuaishou in June, has already accumulated millions of users.

"Sora is still a laboratory prototype, while domestic video generation tools have been launched in quick succession and opened to consumers. That is exciting," said Yuan Li, assistant professor and doctoral supervisor at the School of Information Engineering, Peking University Shenzhen Graduate School.

What can Sora's Chinese challengers actually do? In the early morning, a giant panda sits by a lake playing guitar, a rabbit reads a newspaper in a restaurant, a kangaroo and a golden monkey eat breakfast nearby, and then everyone gathers at the Zootopia stadium to watch the annual cycling race... This animated short, generated by Keling AI, runs only 62 seconds, yet it demonstrates an ability to understand and reproduce the physical laws of the real world (such as reflection and gravity), along with a measure of imagination and storytelling.

Now that the "Olympic season" has arrived, many of the short films going viral on social media, stringing together different scenes and camera moves, were likewise created with domestic video generation models.

"Video generation, in short, uses generative AI to convert multimodal inputs such as text and images into video signals," said Wan Pengfei, head of the Visual Generation and Interaction Center at Kuaishou. "Unlike obtaining video through camera shooting or graphics rendering, the essence of video generation is to sample and compute pixels from a target distribution. This achieves greater creative freedom at lower cost."

On Vidu's generation page, reporters experienced this "one-click" freedom for themselves. Upload a photo as the "starting frame" or as a "reference character", type a description of the scene you want into the dialog box, click "generate", and a lifelike short video is produced automatically. From opening the page to downloading the result takes less than a minute.

A technical lead shared a "generation secret" with reporters: "Try the prompt formula of 'camera language + scene setting + detail description', and you can usually get the video you want in fewer than five attempts." For example, enter "realistic style, close-up, a tiger lying on the ground, its body rising and falling slightly" into the dialog box. A minute later, a video appears on screen: on grass swept by a gentle breeze, the tiger's body rises and falls with its breathing, and its fur and whiskers stir in the wind, convincing enough to pass for real footage.
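As an illustration only, here is a minimal sketch of that "camera language + scene setting + detail description" formula in code. The article names no API, so the generate_video() call below is hypothetical; only the prompt-building step reflects the formula itself.

```python
# A minimal sketch of the "camera language + scene setting + detail description"
# prompt formula. The generate_video() call is hypothetical -- substitute the
# SDK of whichever video generation service you actually use.

def build_prompt(camera: str, scene: str, details: str) -> str:
    """Join the three parts of the formula into a single prompt string."""
    return ", ".join([camera, scene, details])

prompt = build_prompt(
    camera="realistic style, close-up",      # camera language
    scene="a tiger lying on the ground",     # scene setting
    details="body rising and falling slightly, fur stirring in the wind",  # detail description
)

print(prompt)
# generate_video(prompt=prompt, duration_seconds=5)  # hypothetical API call
```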
The rapid iteration of video generation technology rests on accurate evaluation of the content it produces. How should the performance of different video generation models be compared? "First, controllability, that is, how closely the generated content matches the input text; second, stability and consistency; third, rationality, that is, whether the content obeys physical laws; fourth, style, aesthetics, and creativity; and finally, real-time generation," summarized Xu Dong, a professor of computer science at the University of Hong Kong and a foreign member of the European Academy of Sciences.
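As a rough sketch, not from the article, those five dimensions could be organized into a simple scoring rubric; the weights below are illustrative placeholders, not values Xu Dong gives.

```python
from dataclasses import dataclass

# Illustrative rubric for the five evaluation dimensions listed above.
# The weights are placeholder assumptions, not values from the article.

@dataclass
class VideoGenScore:
    controllability: float  # match between generated content and the input text
    consistency: float      # stability and temporal consistency
    rationality: float      # conformity to physical laws
    aesthetics: float       # style, aesthetics, creativity
    latency: float          # closeness to real-time generation

    def overall(self, weights=(0.25, 0.20, 0.20, 0.20, 0.15)) -> float:
        dims = (self.controllability, self.consistency, self.rationality,
                self.aesthetics, self.latency)
        return sum(w * d for w, d in zip(weights, dims))

print(VideoGenScore(0.8, 0.7, 0.6, 0.9, 0.5).overall())  # 0.715
```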
Costs are falling fast, and overseas netizens have shown their approval of China's self-developed video generation models through their actions: many accounts on X (formerly Twitter) now attach videos generated with Keling AI and Qingying to their posts.

"Frankly, the technology is not yet mature. The ceiling for video generation models is high, and there is plenty of room for improvement. But we see real pain points in film, animation, advertising, and gaming: long production cycles and high costs, which the technology can strive to solve," Tang Jiayu, co-founder and CEO of Shengshu Technology, told reporters. "For the technology to become a must-have, it needs to cut costs while improving usability and controllability."

As a technology that directly lowers the threshold for creation and production, video generation models have brought film and animation practitioners the spring of "small-team animation" and "low-cost content creation".

"Making AIGC (AI-generated content) animated shorts is a fascinating experience. We start with an idea, draw it into storyboards, use AI to generate images, and then use Vidu to turn the images into video," said Chen Liufang, winner of Best Film in the AIGC short film unit at the Beijing International Film Festival and AI director at Ainimate Lab. Video generation, he believes, will mean science fiction, fantasy, and animation are no longer "money-burning games" that only big companies dare to play.

Chen Liufang said that after adopting Vidu, the reductions in production cycle and cost have been significant. Take the animated short "All the Way South": the creative team consisted of just three people, a director, a storyboard artist, and an AIGC application specialist. The traditional pipeline would require about 20 people across roles such as directing, storyboarding, art, modeling, materials, lighting, and rendering, on a cycle of roughly one month.

"All told, costs came down by more than 90 percent," Chen Liufang said, though he conceded that the refinement of today's video generation is not yet sufficient, roughly one third the quality of traditional animation.

Even so, lower costs and higher efficiency have made traditional professionals in film, animation, and gaming feel the chill of impending technological disruption. "An era of 'everyone a designer' and 'everyone a director' will arrive, just as the era of 'everyone with a microphone' did," said Zhang Peng, CEO of Zhipu AI.

For the animation industry, this is both a challenge and an opportunity. "A martial arts master can be formidable with the simplest weapon and the most ordinary moves, because the core is strong internal skill. For animation, 'moves' are the new technology, while 'internal skill' is creativity, audiovisual expression, and aesthetic judgment in quality control," said Ai Shengying, professor and head of the Animation Department at the School of Animation and Digital Arts, Communication University of China.

Technology has undoubtedly delivered more cost-effective tools, but it has also underscored the decisive role of creativity. "When production's share of investment in film, animation, and games falls sharply, the competition turns even more on creativity," Chen Liufang said.

After large language models, which "refined" the first killer applications, knocked open the door of generative AI, video, as an extension of the image modality, has pushed AIGC technology toward a new climax and brought its applications closer to the public.

Globally, video generation currently follows two main technical routes. The first is the diffusion-model route, which splits into two branches: diffusion models built on convolutional neural networks, such as Meta's Emu Video and Tencent's VideoCrafter, and diffusion models built on the Transformer architecture, such as Shengshu Technology's Vidu, OpenAI's Sora, and Kuaishou's Keling AI. The second is the autoregressive route, exemplified by Google's VideoPoet and Phenaki.

"The mainstream choice in China is the diffusion model based on the Transformer architecture, which lets models follow the 'scaling law' and exhibit the scalability already seen in language processing, computer vision, and image generation," Xu Dong said. This choice also means greater demands: more computing power, larger and higher-quality data, and more complex algorithms.

"The first priority is algorithms. Video adds a temporal dimension to images, so algorithmic complexity grows exponentially," Xu Dong said; with data and computing power held constant, model performance hinges on algorithmic capability, which in turn depends on the caliber of algorithm talent.

"Next, the scarcest resource is data. Video generation relies heavily on data, and video data is far harder to accumulate than text. Improving data quality involves not only video resolution, style, storyboarding, composition, and continuity, but also cleaning, filtering, and processing," said Zhang Peng.

The video generation model is, even more than its predecessors, a "GPU-devouring behemoth". Judging from Sora's practice, continually scaling up a model's data volume and parameter count remains the core of AIGC's evolution to this day. CITIC Securities estimates that a 60-frame video (roughly 6 to 8 seconds) requires about 60,000 patches; with 20 denoising steps, that is equivalent to generating 1.2 million tokens. And since diffusion models typically need to generate multiple times in practical use, the actual computational workload far exceeds 1.2 million tokens.
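The arithmetic behind that estimate is easy to verify. The patch and step counts below are the CITIC Securities figures quoted above; the retry factor is an illustrative assumption.

```python
# Back-of-the-envelope check of the CITIC Securities estimate quoted above.
patches_per_video = 60_000  # ~60 frames, roughly 6-8 seconds of video
denoising_steps = 20

tokens = patches_per_video * denoising_steps
print(f"{tokens:,} tokens per generation")  # 1,200,000

# Diffusion models are often run several times per usable clip; a retry
# factor of 4 is an illustrative assumption, not a figure from the article.
retries = 4
print(f"{tokens * retries:,} tokens including retries")  # 4,800,000
```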
Meanwhile, large-model parameter counts are growing roughly tenfold per year. For technology companies and research institutions alike, how to keep training high-performance models remains a huge challenge. At the same time, expectations for consumer-facing "killer applications" are running high. "From idea generation to the production of images, music, and video, AI holds great promise. In the future, making a video may be as simple and convenient as making a slide deck is today," said Wang Zhongyuan, president of the Beijing Academy of Artificial Intelligence. (Xinhua News Agency)

Editor: Xiong Dafei    Managing editor: Li Xiang

Source: Xinhuanet
