This Article is written as a summary by Marktechpost Staff based on the paper 'VALHALLA: Visual Hallucination for Machine Translation'. All Credit For This Research Goes To The Researchers of This Project. Check out the paper, github, project and post.
Machine translation is a branch of computational linguistics that uses software to convert text or speech between languages.
In its simplest form, machine translation substitutes words in one language with words in another. This alone rarely yields a correct translation, because translation requires recognizing whole sentences and finding their closest counterparts in the target language. Many words have multiple meanings, and not every term in one language has a direct equivalent in another.
Many researchers have worked to solve this challenge using statistical and neural corpus techniques, which have led to better translations, management of linguistic typology, translation of idioms, and isolation of anomalies.
But these systems rely heavily on textual data and have no explicit connection to the real world. As a result, researchers are now investigating how multimodal MT systems can incorporate a wealth of external, non-textual data into the modeling process.
Typically, these methods require source sentences to be linked to corresponding images during training and testing. This specifically limits their usefulness in situations where images are not available during inference. This inspired researchers at MIT-IBM Watson AI Lab, MIT CSAIL, and UC San Diego to work on multimodal machine translation, which uses visual modality to improve machine translation systems.
In their recent work, the researchers first explore whether a system that only has access to images at training time can generalize to such situations. “Visual hallucination, or the ability to design visual scenes, can be used to improve machine translation systems,” they claim. Further, they state that if a translation system had access to images during training, it could be taught to hallucinate an abstract image or visual representation of the source sentence to ground the translation process. This abstract visual representation could then stand in for a real image when performing multimodal translation at test time.
The researchers present a simple yet effective VisuAL HALLucinAtion (VALHALLA) framework for machine translation that uses images only during training to build a stronger model. During training, the model learns to augment the textual representation of the source sentence with a latent visual representation similar to the one an MMT system would extract from a real image. Concretely, they use a discrete visual codebook (trained using a VQGAN-style VAE) and train an autoregressive hallucination transformer to predict visual tokens from the input source words for multimodal translation.
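The discrete-codebook idea can be illustrated with a toy nearest-neighbor quantizer. This is a simplified stand-in for the paper's VQGAN-based encoder, not its actual implementation; the codebook entries and feature vectors below are made-up values for illustration.

```python
import math

def quantize(feature, codebook):
    """Map a continuous feature vector to the index of its nearest
    codebook entry (Euclidean distance), VQ-style discretization."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return min(range(len(codebook)), key=lambda i: dist(feature, codebook[i]))

# Toy codebook with 3 entries; real visual codebooks hold thousands of codes
# of much higher dimension.
codebook = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]

# Each image patch feature becomes one discrete visual token.
features = [(0.1, -0.2), (0.9, 1.1), (4.8, 5.2)]
tokens = [quantize(f, codebook) for f in features]
# `tokens` is the kind of discrete visual sequence the hallucination
# transformer learns to predict directly from the source sentence.
```

Because the visual representation is a short sequence of discrete indices rather than continuous feature maps, it can be predicted token by token with an ordinary autoregressive transformer.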
A visual hallucination transformer maps the source sentence into a discrete image representation; an MMT transformer then maps the source sentence, combined with this discrete image representation, into the target sentence. Hallucination, translation, and consistency losses are used to train both transformers end to end.
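The interplay of the three losses can be sketched as follows. This is a purely illustrative toy: the stub functions stand in for the two transformers, the mismatch-count loss stands in for cross-entropy, and all token values are invented.

```python
def hallucinate(src_tokens):
    # Stand-in for the autoregressive hallucination transformer:
    # derives fake "visual tokens" deterministically from the source.
    return [t % 4 for t in src_tokens]

def translate(src_tokens, visual_tokens):
    # Stand-in for the MMT transformer, which consumes the source
    # sentence together with a (real or hallucinated) visual sequence.
    return [t + 1 for t in src_tokens]

def mismatch_loss(pred, gold):
    # Placeholder loss: fraction of positions that disagree.
    return sum(p != g for p, g in zip(pred, gold)) / len(gold)

src = [3, 7, 2, 5]
gold_target = [4, 8, 3, 6]
gold_visual = [3, 2, 2, 1]   # tokens from a real image (training only)

v_hat = hallucinate(src)

# Translation loss: translate using the hallucinated visual tokens.
translation_loss = mismatch_loss(translate(src, v_hat), gold_target)
# Hallucination loss: predicted visual tokens should match the real ones.
hallucination_loss = mismatch_loss(v_hat, gold_visual)
# Consistency loss: translating with hallucinated vs. real visual tokens
# should give the same output.
consistency_loss = mismatch_loss(
    translate(src, v_hat), translate(src, gold_visual))

total_loss = translation_loss + hallucination_loss + consistency_loss
```

At test time only the first two stubs are needed: the model hallucinates visual tokens from the source and translates with them, so no real image is required.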
According to the researchers, this is the first time that an autoregressive image transformer has been used in conjunction with a translation transformer to successfully hallucinate discrete visual representations.
Their results show that discrete visual representations perform better than the continuous visual embeddings currently used in MMT approaches. They demonstrated the superiority of VALHALLA over strong translation baselines on three standard MT datasets spanning a wide variety of language pairs and training data sizes.
The results reveal that VALHALLA outperforms the most relevant state-of-the-art MMT techniques that use continuous image representations, with gains averaging 23% BLEU over the text-only translation baseline. In under-resourced translation settings, the advantage over the text-only baseline is as large as +3.1 BLEU, supporting the idea that visual hallucination may have significant practical relevance in these contexts. Additional analysis backs this up, indicating that in limited textual contexts, VALHALLA models do indeed use visual hallucination to improve translations.