Prismer is my internship project with the NVIDIA Machine Learning Research Team. As with many of my past research projects, we initially set out with an overly ambitious goal: designing a multi-modal generative framework >.< The idea was to create a system that could perform "any-directional" multi-modal generation tasks, such as image-to-text, text-to-image, depth-to-image, image-to-depth, and even (depth+segmentation)-to-image, where the model for each task would be initialised from a separately pre-trained domain expert. However, we quickly realised that the complexity of this research was too much for an internship project: we would have to adapt the design of each model to make them compatible with one another, and also design a training scheme that could fine-tune all of these models simultaneously.

As a result, we shifted our focus to a more specific task: multi-modal reasoning. This still allowed us to utilise a diverse pool of pre-trained models to provide useful domain knowledge, but without the need to heavily modify the network architecture of each model. The concept of "model ensembling" is very appealing, and I was personally particularly inspired by the success of Socratic Models, which demonstrated that a wide range of multi-modal tasks can be achieved in a zero-shot manner by simply connecting pre-trained models together, using "language as the universal control interface". This idea eventually led to the development of the Prismer project as it is.

Scaling up the model has become the easiest and the standard practice to improve the state of the art. This was first shown in the GPT model series for general-purpose language understanding, and has recently been applied to other domains, such as vision and robotics. While I have to admit that "scaling up" is the major and essential factor behind the great success in NLP, accompanied by the surprising emergent abilities that only appear in very large models, it is extremely difficult, and probably also not affordable, to simply train a multi-task, multi-modal model as a singular, unified architecture. First, multi-task learning needs to avoid negative transfer, a well-known problem that has been the focus of my research for some time. Second, the multi-modal training data are extremely difficult to collect and most of the time do not exist. And with the aim of including more and more tasks and modalities in a unified network, it is simply not feasible to re-train a fixed, large-scale model from scratch every time a new task is added.

Building on the concept of "model ensembling", I hope that Prismer will become one of the first works to explore effective multi-modal learning strategies to tackle the aforementioned challenges. Prismer was designed to follow this direction and explore how to better utilise pre-trained models when building a new model, particularly one with multi-task and multi-modal capabilities. A similar concept has also been shown to be effective (from a slightly different perspective) in the widely popular ControlNet, which provides conditional multi-modal control for the Stable Diffusion model while requiring only some lightweight fine-tuning.
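To make the "model ensembling" idea above a little more concrete, here is a minimal, purely illustrative Python sketch (this is not Prismer's actual code; all names and the toy "experts" are my own stand-ins): frozen pre-trained experts each contribute one modality, their outputs are concatenated, and only a small fusion module on top would ever be trained.

```python
# Illustrative sketch of "model ensembling" with frozen experts.
# The expert functions below are toy stand-ins, NOT real pre-trained
# models; only LightweightFusion would hold trainable parameters.

import numpy as np

rng = np.random.default_rng(0)

def depth_expert(image):
    """Stand-in for a frozen pre-trained depth estimator."""
    return image.mean(axis=-1, keepdims=True)       # fake depth map

def seg_expert(image):
    """Stand-in for a frozen pre-trained segmentation model."""
    return (image > 0.5).astype(np.float32)         # fake segmentation

class LightweightFusion:
    """The only trainable part: a single linear map over expert outputs."""
    def __init__(self, in_dim, out_dim):
        self.w = rng.normal(scale=0.1, size=(in_dim, out_dim))

    def __call__(self, features):
        return features @ self.w

image = rng.random((8, 8, 3)).astype(np.float32)

# Expert weights stay frozen; we only concatenate their predictions
# along the channel axis and learn a small head on top.
experts = np.concatenate([depth_expert(image), seg_expert(image)], axis=-1)
fusion = LightweightFusion(in_dim=experts.shape[-1], out_dim=16)
fused = fusion(experts.reshape(-1, experts.shape[-1]))  # shape (64, 16)
```

The point of the sketch is the division of labour: new modalities can be added by plugging in another frozen expert, and only the small fusion head needs re-training, rather than the whole network from scratch.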