Jan 12, 2022

Raffaella Bernardi, University of Trento

Two Challenges Behind Visual Dialogues: Grounding Negation and Asking an Informative Question

Visual dialogues are an intriguing challenge for the Computer Vision and Computational Linguistics communities. They involve both understanding multimodal inputs and generating visually grounded questions. We take the GuessWhat?! game as a test-bed since it has a simple dialogue structure: asymmetric exchanges of Yes/No questions and answers. We ask to what extent state-of-the-art (SOTA) models take the answers into account and, in particular, whether they handle positively and negatively answered questions equally well. Moreover, the task is goal-oriented: the questioner has to guess the target object in an image. As such, it is well suited to studying dialogue strategies. SOTA systems are shown to generate questions that, although grammatically correct, often lack an effective strategy and sound unnatural to humans. Inspired by the cognitive literature on information search and cross-situational word learning, we propose Confirm-it, a model based on a beam search re-ranking algorithm that encourages an effective goal-oriented strategy by asking questions that confirm the model's conjecture about the referent. We show that the dialogues generated by Confirm-it are more natural and effective than those produced by beam search decoding without re-ranking.

The talk is based on the following publications:

Alberto Testoni, Claudio Greco and Raffaella Bernardi, "Artificial Intelligence models do not ground negation, humans do. GuessWhat?! dialogues as a case study", Frontiers in Big Data, doi: 10.3389/fdata.2021.736709.

Alberto Testoni and Raffaella Bernardi, "Looking for Confirmations: An Effective and Human-Like Visual Dialogue Strategy", in Proceedings of EMNLP 2021 (short paper).
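
The abstract describes Confirm-it only at a high level. As a rough illustration of the re-ranking idea, here is a minimal Python sketch, not the authors' implementation: the Candidate class, the confirmation_score stub, and the interpolation weight alpha are hypothetical placeholders; in the actual model, the confirmation signal would come from the guesser's belief about the candidate referents.

```python
# Minimal sketch of beam-search re-ranking by a confirmation score.
# Hypothetical code, not the Confirm-it implementation.

from dataclasses import dataclass

@dataclass
class Candidate:
    question: str
    log_prob: float  # beam-search score from the question generator

def confirmation_score(question: str, conjecture: str) -> float:
    # Stand-in scorer: reward questions that probe the conjectured referent.
    # A real model would use the guesser's belief state instead.
    return 1.0 if conjecture in question else 0.0

def rerank(beam: list[Candidate], conjecture: str, alpha: float = 0.5) -> list[Candidate]:
    # Interpolate the generator's beam score with the confirmation score
    # and return the candidates sorted from best to worst.
    return sorted(
        beam,
        key=lambda c: alpha * c.log_prob
        + (1 - alpha) * confirmation_score(c.question, conjecture),
        reverse=True,
    )

beam = [
    Candidate("is it a person?", -0.9),
    Candidate("is it the dog on the left?", -1.4),
]
print(rerank(beam, conjecture="dog")[0].question)  # "is it the dog on the left?"
```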

Jan 5, 2022

Ayşegül Dündar, Bilkent University

Controllable Image Synthesis

With GAN-based models achieving realistic image synthesis for a wide range of objects, there has been increasing interest in deploying them for gaming, robotics, architectural design, and AR/VR applications. However, such applications also require full controllability over the synthesis. To enable controllability, image synthesis has been conditioned on various inputs, such as semantic maps, keypoints, and edges. Even with these methods, control and manipulation over the generated images remain limited. In a new line of research, methods have been proposed to learn 3D attributes from images for precise control over the rendering. In this talk, I will cover a range of image synthesis works, starting with conditional image synthesis and continuing with learning 3D attributes from single-view images for the aim of image synthesis.
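
As a rough illustration of what "conditioning" synthesis on a semantic map means, here is a minimal PyTorch sketch, not a model from the talk: the conditioning signal is simply concatenated with the generator's input so the output follows the given layout. All layer sizes and channel counts are illustrative assumptions.

```python
# Hypothetical sketch of a generator conditioned on a semantic map.
# Not a model from the talk; shapes are illustrative.

import torch
import torch.nn as nn

class ConditionalGenerator(nn.Module):
    def __init__(self, noise_ch: int = 16, label_ch: int = 8, out_ch: int = 3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(noise_ch + label_ch, 64, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(64, out_ch, kernel_size=3, padding=1),
            nn.Tanh(),  # images in [-1, 1]
        )

    def forward(self, noise: torch.Tensor, semantic_map: torch.Tensor) -> torch.Tensor:
        # Channel-wise concatenation ties each output pixel to its semantic class.
        return self.net(torch.cat([noise, semantic_map], dim=1))

g = ConditionalGenerator()
noise = torch.randn(1, 16, 64, 64)
semantic_map = torch.randn(1, 8, 64, 64)  # one-hot class maps in practice
img = g(noise, semantic_map)              # -> (1, 3, 64, 64)
```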