Multimodal large models drive AI towards the era of "synesthesia"
2023-07-20
Just as the "five senses" of humans are interconnected and inseparable, the boundaries between the visual, linguistic, audio, and other modalities of artificial intelligence (AI) are also gradually merging. With the rapid development of artificial intelligence's perception, interaction, and generation capabilities, multimodal large models are driving artificial intelligence into the era of "synesthesia". The reporter learned from the Shanghai Artificial Intelligence Laboratory yesterday that the scholar multimodal large model released by the laboratory leads the global performance in more than 80 multimodal and visual evaluation tasks, surpassing similar models developed by Google, Microsoft, OpenAI, etc. The scholar multimodal large model contains 20 billion parameters, trained from 8 billion massive multimodal samples, supports the recognition and understanding of 3.5 million semantic tags, covers common categories and concepts in the open world, and has three core capabilities: open world understanding, cross modal generation, and multimodal interaction. When ChatGPT emerged, experts predicted that it would change the "interface" of human-computer interaction. At present, multimodal understanding, generation, and interaction capabilities are becoming an important direction for the new round of evolution of large models, and a low threshold era where everyone can use voice to "command" AI may be within reach. From predefined tasks to open tasks, unlocking real-world understanding. In the rapidly growing demands of various application scenarios, traditional computer vision is no longer able to handle the countless specific tasks and scene requirements in the real world. There is an urgent need for an advanced visual system with universal scene perception and complex problem processing capabilities. The scholar multimodal large model integrates the three major modeling capabilities of vision, language, and multitasking, namely the universal visual large model, the ultra large language pre training model (LLM) for text understanding, and the compatible decoding modeling large model for multitasking, making it closer to human perception and cognitive abilities. In artificial intelligence research, "open world" refers to the real world defined by non preset, non academic, or closed sets. In traditional research, AI can only complete predefined tasks defined by academic or closed sets, and the scope of these tasks differs greatly from the real open world. For example, the ImageNet-1K academic collection contains 1000 types of objects, including approximately two types of flowers, 48 types of birds, and 21 types of fish; In the real world, the types of flowers, birds, and fish are approximately 450000, 10000, and 20000, respectively. In the open world, the multimodal models of scholars are constantly learning to acquire perceptual and cognitive abilities that are closer to humans. In terms of semantic openness, it can recognize and understand over 3.5 million semantics in the open world, covering common object categories, object actions, and optical characters in daily life, completing the transformation from solving predefined tasks to executing open tasks, providing strong support for future research on multimodal general artificial intelligence (AGI) models. At present, the development of AI technology is facing a large number of challenges in cross modal tasks. 
In autonomous driving scenarios, the model must accurately help vehicles judge traffic-light status, road signs, and other information, providing effective input for vehicle decision-making and planning. Drawing and writing are classic forms of cross-modal transformation. After appreciating Zhang Da
Editor: XiaoWanNing  Responsible editor: YingLing
Source: Wenhui Daily