Grand Challenges

The ACM MM 2023 Grand Challenge Chairs, Tiago Falk (Tiago.Falk@inrs.ca), Ivan Tashev (ivantash@microsoft.com), and Anderson Avila (Anderson.Avila@inrs.ca), invite you to participate in and solve any of the following 10 challenges:

  • MultiMediate: Multi-modal Behaviour Analysis for Artificial Mediation
  • Deep Video Understanding (DVU) 2023
  • MER 2023: Multi-label Learning, Semi-Supervised Learning, and Modality Robustness
  • Conversational Head Generation Challenge
  • REACT 2023 Challenge: Multiple Appropriate Facial Reaction Generation in Dyadic Interaction
  • Visual Text Question Answer (VTQA) Challenge
  • Facial Micro-Expression Grand Challenge
  • SMP Challenge (Social Media Prediction Challenge) 
  • Invisible Video Watermark 
  • ACM Multimedia Computational Paralinguistics Challenge (ComParE)

//————————

MultiMediate: Multi-modal Behaviour Analysis for Artificial Mediation

Organizers:

Philipp Müller, Tobias Baur, Dominik Schiller, Michael Dietz, Alexander Heimerl, Elisabeth André, Dominike Thomas, Andreas Bulling, Michal Balazia, François Brémond

Challenge summary:

Artificial mediators are a promising approach to support conversations, but at present their abilities are limited by insufficient progress in behaviour sensing and analysis. The MultiMediate challenge is designed to work towards the vision of effective artificial mediators by facilitating and measuring progress on key social behaviour sensing and analysis tasks. This year, the challenge focuses on the recognition of bodily behaviours as well as on engagement estimation. In addition, we continue to accept submissions to previous years’ tasks, including backchannel detection, agreement estimation from backchannels, eye contact detection, and next speaker prediction.

Website:

https://multimediate-challenge.org/

//————————

Deep Video Understanding (DVU) 2023

Organizers:

Keith Curtis, George Awad, Afzal Godil, Ian Soboroff

Challenge summary:

The DVU Challenge seeks new techniques and approaches that allow an automatic system to understand and comprehend a full movie in terms of entities (characters, locations, concepts, relationships, interactions, sentiments, etc.). The challenge provides development data in the form of whole movies annotated at the movie and scene levels, while systems are tested on new, unseen movies licensed from a professional movie distribution platform (KinoLorberEdu). Movie-level and scene-level queries include sentiment classification, matching text summaries with the correct scenes, finding the next/previous interactions between specific characters, finding unique scenes, fill-in-the-graph questions about how entities are related, and general open-domain multiple-choice questions. Systems may choose to participate in movie-level queries, scene-level queries, or both. Finally, we intend to measure the robustness of multimodal systems by allowing participants to submit results against a variant of the testing dataset into which visual/audio noise and real-world perturbations and corruptions have been introduced.

Website:

https://sites.google.com/view/dvuchallenge2023/home

Call for participation:

https://sites.google.com/view/dvuchallenge2023/home/call-for-participation

//————————

MER 2023: Multi-label Learning, Semi-Supervised Learning, and Modality Robustness

Organizers:

Zheng Lian, Jianhua Tao, Erik Cambria, Björn W. Schuller, and Guoying Zhao

Challenge summary:

Multimodal emotion recognition has become an important research topic due to its wide applications in human-computer interaction. Over the last few decades, the technology has made remarkable progress with the development of deep learning. However, existing technologies struggle to meet the demands of practical applications, mainly because: 1) due to the high annotation cost, collecting large amounts of labeled samples is challenging; and 2) many factors can lead to modality perturbation in real-world environments, such as background noise and blurry videos caused by network limitations. To improve robustness, we are launching the Multimodal Emotion Recognition challenge (MER 2023) to motivate researchers worldwide to build innovative technologies that can further accelerate and foster research. We aim to provide a common platform and a benchmark test set to systematically evaluate the robustness of emotion recognition systems, thus promoting applications of this technology in practice.

Website:

MER 2023: http://merchallenge.cn/

//————————

Conversational Head Generation Challenge

Organizers:

Yalong Bai, Mohan Zhou, Wei Zhang, Ting Yao, Abdulmotaleb El Saddik, Xiaodong He, Tao Mei

Challenge summary:

Conversational head generation aims to synthesize head dynamics during a conversation, covering both the talking and listening roles. This task is critical for applications such as telepresence, digital humans, virtual agents, and social robots. Current talking-head generation covers only a one-way information flow and is still far from the full sense of “communication”. In contrast, this year’s challenge is based on the extended ViCo dataset, which includes more YouTube video clips featuring real human conversations. The challenge comprises two tracks: talking-head video generation (speaker) and responsive listening-head video generation (listener).

Website:

https://vico.solutions/challenge/2023

//————————

REACT 2023 Challenge: Multiple Appropriate Facial Reaction Generation in Dyadic Interaction

Organizers:

Dr Micol Spitale, Dr Siyang Song, Cristina Palmero, Prof Sergio Escalera, Prof Michel Valstar, Dr Tobias Baur, Dr Fabien Ringeval, Prof Elisabeth André, Prof Hatice Gunes

Challenge summary:

Human behavioural responses are stimulated by their environment (or context), and people inductively process the stimulus and modify their interactions to produce an appropriate response. When facing the same stimulus, different facial reactions can be triggered not only across different subjects but also within the same subject under different contexts. The Multimodal Multiple Appropriate Facial Reaction Generation Challenge (REACT 2023) is a satellite event of ACM MM 2023 (Ottawa, Canada, October 2023) that aims to compare multimedia processing and machine learning methods for automatic human facial reaction generation under different dyadic interaction scenarios. The goal of the Challenge is to provide the first benchmark test set for this multimodal information processing task and to bring together the audio, visual, and audio-visual affective computing communities to compare the relative merits of approaches to automatic appropriate facial reaction generation under well-defined conditions.

Website:

https://sites.google.com/cam.ac.uk/react2023/home

//————————

Visual Text Question Answer (VTQA) Challenge

Organizers:

Xiangqian Wu, Kang Chen, Tianli Zhao

Challenge summary:

The ideal form of Visual Question Answering requires understanding, grounding, and reasoning in the joint space of vision and language, and serves as a proxy for the AI task of scene understanding. However, most existing VQA benchmarks are limited to picking the answer from a pre-defined set of options and pay little attention to text. We present a new challenge with a dataset that contains 27,317 questions based on 10,238 image-text pairs. Specifically, the task requires the model to align multimedia representations of the same entity, perform multi-hop reasoning between image and text, and finally answer the question in natural language. The aim of this challenge is to develop and benchmark models that are capable of multimedia entity alignment, multi-step reasoning, and open-ended answer generation.

Website:

https://visual-text-QA.github.io/

//————————

Facial Micro-Expression Grand Challenge

Organizers:

Adrian K. Davison, Jingting Li, Moi Hoon Yap, John See, Xiaobai Li, Wen-Huang Cheng, Xiaopeng Hong, Su-Jing Wang

Challenge summary:

Facial micro-expressions (MEs) are involuntary facial movements that occur spontaneously when a person experiences an emotion but attempts to suppress or repress the facial expression, typically in high-stakes environments. Unfortunately, the small-sample problem severely limits the automation of ME analysis. Furthermore, due to the weak and transient nature of MEs, it is difficult for models to distinguish them from other types of facial actions. Therefore, spotting MEs in long videos is a challenging task, and current performance cannot meet practical application requirements. To address these issues, this challenge focuses on the ME and macro-expression spotting task. This year, in order to evaluate algorithms’ performance more fairly, we will build an unseen cross-cultural long-video test set based on CAS(ME)2, SAMM Long Videos, SMIC-E-long, CAS(ME)3, and 4DME. All participating algorithms are required to run on this test set and submit their results to a leaderboard that includes a baseline result.

Website:

https://megc2023.github.io/

//————————

SMP Challenge (Social Media Prediction Challenge) 

Organizers:

Bo Wu, Wen-Huang Cheng, Bei Liu, Peiye Liu, Jia Wang, Zhaoyang Zeng, Jiebo Luo

Challenge summary:

The SMP Challenge is an annual research challenge that seeks novel methods for forecasting problems and for improving people’s social lives and business scenarios using large-scale social multimedia data. Social media popularity prediction helps us understand online word-of-mouth, efficiently discover interesting news, emerging topics, and amazing products, and recommend online content from oceans of information. The SMP Challenge formulates the Social Media Popularity Prediction task, defines multiple aspects of evaluation, and provides the SMPD benchmark with about half a million posts and rich annotations.

Website:

https://smp-challenge.com

//————————

Invisible Video Watermark 

Organizers:

Jin Chen, Yi Yu, Shien Song, XinYing Wang, Jie Yang, Yifei Xue, Yizhen Lao 

Challenge summary:

An invisible video watermark is a type of digital signature embedded into a video in a way that is not visible to the naked eye. Invisible watermarks are widely used to protect the ownership and authenticity of video content. While deep learning has significantly enhanced video processing capabilities, how to construct capable computer vision approaches for watermarking has received little attention. Furthermore, it is worth discussing whether deep learning approaches outperform traditional approaches for watermarking. The IVW Challenge provides various types of video data, and participants need to develop a robust framework based on this dataset so that the invisible watermark can be extracted under different kinds of attacks.

Website:

https://challenge.ai.mgtv.com/contest/detail/18?locale=en

//————————

ACM Multimedia Computational Paralinguistics Challenge (ComParE)

Organizers:

Björn Schuller, Anton Batliner, Shahin Amiriparian, Alexander Barnhill, Alan S. Cowen, Claude Montacié

Challenge summary:

The Computational Paralinguistics ChallengE (ComParE) series is an open challenge in the field of computational paralinguistics, dealing with states and traits of individuals as manifested in their speech and related signal properties. The Challenge has taken place annually since 2009. Every year, we introduce new tasks, as there remains a multiplicity of highly relevant but not yet covered paralinguistic phenomena; at the same time, new baseline methods and challenge types are introduced. This year features two new tasks: the Emotion Intensity Sub-Challenge and the Requests Sub-Challenge.

Website:

http://www.compare.openaudio.eu/