Abstract: BriVL, the two-tower model of WuDao-WenLan, is proposed. Trained with self-supervised tasks on 650 million web-crawled image-text pairs, it is currently the largest Chinese general-purpose image-text pre-training model. Experiments show that it achieves state-of-the-art performance on multiple international public datasets. MLMM, the multi-lingual multimodal single-tower pre-training model of WuDao-WenLan, is also proposed; experimental results show that it likewise achieves state-of-the-art performance on multiple international public datasets and can learn cross-lingual, cross-modal common sense. Experiments are further designed to discuss what very-large multimodal pre-training models bring to text encoding, text-to-image generation, and image-text retrieval, as well as WenLan's deployed applications and interdisciplinary results.
Keywords: multimodal pre-training; multi-lingual pre-training; two-tower model; single-tower model
WuDao-WenLan: What Do Very-Large Multimodal Pre-Training Models Bring?
LU Zhiwu1, JIN Qin2, SONG Ruihua1, WEN Jirong1,2
(1.Gaoling School of Artificial Intelligence, Renmin University of China, Beijing 100872, China; 2.School of Information, Renmin University of China, Beijing 100872, China)
Abstract: A multimodal pre-training two-tower model called WuDao-WenLan BriVL is proposed, which is trained through self-supervised learning over 650 million image-text pairs crawled from the Web; it is currently the largest open-sourced Chinese general-purpose image-text pre-training model. Extensive experiments show that BriVL achieves new state-of-the-art results on multiple benchmark datasets. Moreover, a multi-lingual multimodal single-tower pre-training model called WuDao-WenLan MLMM is also proposed. Extensive experiments show that MLMM achieves superior performance on multiple multi-lingual benchmark datasets and can learn cross-lingual, cross-modal common sense. In addition, experiments are conducted to discuss what very-large multimodal pre-training models bring to text encoding, text-to-image generation, and image-text retrieval, as well as how WenLan can be applied in multiple fields.
Keywords: multimodal pre-training; multi-lingual pre-training; two-tower model; single-tower model
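The two-tower design named above pairs an image encoder with a text encoder and trains them so that matched image-text pairs score higher than mismatched ones, which makes retrieval a matrix of similarity scores. The following is a minimal sketch of that scoring and a CLIP-style symmetric InfoNCE objective (an assumption for illustration: the cosine-similarity scoring, the fixed temperature of 0.07, and the toy random "embeddings" are stand-ins, not BriVL's actual encoders or training recipe):

```python
import numpy as np

def l2_normalize(x, axis=-1):
    """Project embeddings onto the unit sphere so dot products are cosines."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def two_tower_scores(image_emb, text_emb, temperature=0.07):
    """Score matrix for two-tower retrieval: entry (i, j) is the scaled
    cosine similarity between image i and text j."""
    img = l2_normalize(image_emb)
    txt = l2_normalize(text_emb)
    return img @ txt.T / temperature

def info_nce_loss(scores):
    """Symmetric InfoNCE loss over a batch: the matched pairs sit on the
    diagonal, and each row/column is treated as a softmax classification."""
    def xent(logits):
        logits = logits - logits.max(axis=1, keepdims=True)  # stability
        logp = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(logp))
    return 0.5 * (xent(scores) + xent(scores.T))

rng = np.random.default_rng(0)
# Toy batch: 4 image embeddings and 4 nearly aligned text embeddings (dim 8),
# simulating encoders that have already learned the pairing.
img = rng.normal(size=(4, 8))
txt = img + 0.01 * rng.normal(size=(4, 8))

scores = two_tower_scores(img, txt)
print(scores.argmax(axis=1))  # retrieval: each image's best-scoring text
```

Because the two towers only meet at this dot product, image and text embeddings can be pre-computed and indexed independently, which is what makes the two-tower design attractive for large-scale image-text retrieval; a single-tower model such as MLMM instead feeds both modalities through one joint encoder.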