MMAct Challenge in conjunction with ActivityNet @ CVPR2021


Important Dates


  • 2021.04.05. Challenge open! Train and Val set released here.

  • 2021.05.01. Evaluation script, submission format and test data (password is the same as for the train/val request) released!

  • 2021.05.07. Evaluation server opens on CodaLab: task1 link and task2 link.

  • 2021.06.12. Evaluation server closes. (extended)

  • 2021.06.14. Deadline for submitting the report.

  • 2021.06.19. Workshop day at CVPR 2021. Check the program and Results.

Challenge Overview


The MMAct Challenge 2021 will be hosted at the CVPR'21 International Challenge on Activity Recognition (ActivityNet) Workshop. This challenge asks participants to propose cross-modal video action recognition/localization approaches that address the shortcomings of vision-only approaches using the MMAct Dataset. The goal of this task is to leverage sensor-based data (e.g., from body-worn sensors) as privileged information, together with vision-based modalities, in ways that overcome the limitations imposed by the modality discrepancy between the train phase (sensor + video) and the test phase (video only, including keypoints). The modalities used for this challenge are: Acceleration, Orientation, Gyroscope, RGB video, and Keypoints. This challenge promotes an alternative point of view on how to address vision challenges through cross-modal methods, in the hope of expanding research on video action understanding to further leverage the sensors commonly embedded in daily-use smart devices (e.g., smartphones).

Winners will be announced at the ActivityNet Workshop at CVPR 2021, along with prizes sponsored by Hitachi, Ltd. (subject to change).

MMAct Challenge Dataset Features


A large-scale multi-modal dataset for action understanding

5 modalities

RGB, Keypoints, Acceleration, Gyroscope, Orientation.

1600+ Videos

Untrimmed videos at 1920x1080@30FPS

32k Clips

Clip lengths range from 3 to 8 seconds.

35 Classes

Daily, Abnormal, Desk work actions

4 Scenes

Free space, Occlusion, Station Entrance, Desk work.

4 Views

4 surveillance camera views

20 Subjects

10 female, 10 male

Randomness

Collected using a semi-naturalistic collection protocol.

TASK1: Cross-Modal Action Recognition


In this task, participants use trimmed videos from MMAct along with paired sensor data. Participants may train with trimmed sensor data and trimmed video, but test on trimmed video only for action recognition. Two sets of results must be submitted, one for each of the two splits below. External datasets for pre-training are allowed as long as their use is clearly stated in the report. Submissions that use only the vision-based modality in the train phase are also welcome! To encourage cross-modal methods, a special award will be selected from among the participants who adopt a cross-modal approach (a minimal, illustrative training sketch follows the split descriptions below).

MMAct trimmed cross-view dataset:
  • clips from multiple views for training and validation, clips from unseen views for testing.

  • 35 action classes from 20 subjects with four scenes.

  • MMAct untrimmed dataset and MMAct trimmed cross-scene dataset are NOT permitted in this task.

MMAct trimmed cross-scene dataset:
  • clips from 3 scenes (Free space, Desk, Entrance) for training and validation, clips from 3 scenes (Occlusion, Desk, Entrance) for testing.

  • 35 action classes from 20 subjects with four camera views.

  • MMAct untrimmed dataset and MMAct trimmed cross-view dataset are NOT permitted in this task.
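
The train-with-sensors, test-with-video-only protocol above is a learning-with-privileged-information setting. As one illustrative (and unofficial) way to exploit it, the sketch below distills a sensor-trained teacher into a video-only student, so that only the student is needed at test time. All module names, feature dimensions, and hyper-parameters here are assumptions for the example, not part of the challenge.

    # Minimal sketch (not an official baseline): cross-modal knowledge distillation.
    # A sensor "teacher" is assumed pre-trained on acc/gyro/orientation features; a
    # video "student" learns to match its soft predictions plus the ground truth.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    NUM_CLASSES = 35  # MMAct trimmed action classes

    class SensorTeacher(nn.Module):
        """Toy classifier over pooled sensor features (dimensions are illustrative)."""
        def __init__(self, in_dim=9, hidden=128):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                                     nn.Linear(hidden, NUM_CLASSES))
        def forward(self, x):  # x: (batch, in_dim)
            return self.net(x)

    class VideoStudent(nn.Module):
        """Toy classifier over pooled clip features (e.g., from a 3D CNN backbone)."""
        def __init__(self, in_dim=2048, hidden=512):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                                     nn.Linear(hidden, NUM_CLASSES))
        def forward(self, x):  # x: (batch, in_dim)
            return self.net(x)

    def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
        """Cross-entropy on labels + KL divergence to the teacher's soft predictions."""
        ce = F.cross_entropy(student_logits, labels)
        kd = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                      F.softmax(teacher_logits / T, dim=1),
                      reduction="batchmean") * (T * T)
        return alpha * ce + (1 - alpha) * kd

    # Dummy training step on random tensors, just to show the data flow.
    teacher, student = SensorTeacher(), VideoStudent()
    sensors, clips = torch.randn(8, 9), torch.randn(8, 2048)
    labels = torch.randint(0, NUM_CLASSES, (8,))
    with torch.no_grad():                 # teacher is frozen during distillation
        t_logits = teacher(sensors)
    loss = distillation_loss(student(clips), t_logits, labels)
    loss.backward()                       # at test time: student(clips) only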


TASK2: Cross-Modal Action Temporal Localization


In this task, participants use untrimmed paired sensor data and video for training, then test on untrimmed videos only for temporal action localization; the output is the recognized action class together with its start and end time in the untrimmed video. External datasets for pre-training are allowed as long as their use is clearly stated in the report. Submissions that use only the vision-based modality in the train phase are also welcome! To encourage cross-modal methods, a special award will be selected from among the participants who adopt a cross-modal approach. Note that keypoints are NOT provided in this task.

MMAct untrimmed cross-session dataset:
  • ~9k untrimmed videos for training and validation, and the remaining untrimmed videos for testing.

  • 35 action classes from 20 subjects with four camera views and four scenes.

  • MMAct trimmed cross-view dataset and cross-scene dataset are NOT permitted in this task.

Results


TASK1: Cross-Modal Trimmed Action Recognition

Rank | Organization            | x-scene AP | x-view AP | mAP    | Report
1    | DeepBlue Technology     | 0.9716     | 0.9449    | 0.9583 | PDF
2    | OPPO Research Institute | 0.9468     | 0.9108    | 0.9288 | PDF
3    | University of Koblenz   | 0.8064     | 0.6748    | 0.7406 | PDF

TASK2: Cross-Modal Untrimmed Action Detection

Rank | Organization            | AP     | Report
1    | DeepBlue Technology     | 0.4457 | PDF
2    | OPPO Research Institute | 0.4068 | PDF

Download


Dataset

Please follow this link to download the challenge dataset.

Preparation

You can use the dev-kit from this page when preparing your entry.

Evaluation Metrics


The evaluation script can be found here: evaluation script

Task 1. Cross-Modal Trimmed Action Recognition: The evaluation will be done on both the MMAct trimmed cross-view dataset and the MMAct trimmed cross-scene dataset. We will use mean Average Precision (mAP) as our metric, and the winner of this challenge will be selected based on the average of this metric across the two datasets.
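
The released evaluation script is the authoritative definition of the metric; the sketch below only illustrates the standard mean-of-per-class-AP computation, assuming each clip has a single ground-truth label and a full vector of per-class confidence scores.

    # Minimal, illustrative mAP computation for Task 1 (the official evaluation
    # script takes precedence over this sketch).
    import numpy as np
    from sklearn.metrics import average_precision_score

    def mean_average_precision(y_true, y_score):
        """y_true: (n_clips,) int labels; y_score: (n_clips, n_classes) confidences."""
        n_classes = y_score.shape[1]
        aps = []
        for c in range(n_classes):
            gt = (y_true == c).astype(int)
            if gt.sum() == 0:      # skip classes absent from the test split
                continue
            aps.append(average_precision_score(gt, y_score[:, c]))
        return float(np.mean(aps))

    # Toy usage with random scores for the 35 MMAct classes.
    rng = np.random.default_rng(0)
    labels = rng.integers(0, 35, size=200)
    scores = rng.random((200, 35))
    print(mean_average_precision(labels, scores))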

Task 2. Cross-Modal Untrimmed Action Temporal Localization: We will use the Interpolated Average Precision (AP) as our evaluation metric, which is also used by ActivityNet. The winner will be selected based on this metric evaluated on MMAct untrimmed cross-session dataset.
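
Again, the released evaluation script defines the official metric; as an illustration only, the sketch below computes a single-class interpolated AP by greedily matching predicted segments to ground-truth segments by temporal IoU. The 0.5 tIoU threshold is an assumption for the example, not the challenge's setting, and in practice the computation would be run per class and averaged.

    # Minimal, illustrative interpolated AP with temporal-IoU matching for one class.
    import numpy as np

    def t_iou(pred, gt):
        """Temporal IoU between two (start, end) segments."""
        inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
        union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
        return inter / union if union > 0 else 0.0

    def interpolated_ap(predictions, ground_truth, tiou_thr=0.5):
        """predictions: [(video_id, start, end, score)]; ground_truth: [(video_id, start, end)]."""
        predictions = sorted(predictions, key=lambda p: p[3], reverse=True)
        matched = set()
        tp = np.zeros(len(predictions))
        for i, (vid, s, e, _) in enumerate(predictions):
            best_iou, best_j = 0.0, None
            for j, (gvid, gs, ge) in enumerate(ground_truth):
                if gvid != vid or j in matched:
                    continue
                iou = t_iou((s, e), (gs, ge))
                if iou > best_iou:
                    best_iou, best_j = iou, j
            if best_j is not None and best_iou >= tiou_thr:
                tp[i] = 1.0            # true positive; each GT segment matches once
                matched.add(best_j)
        cum_tp = np.cumsum(tp)
        precision = cum_tp / (np.arange(len(predictions)) + 1)
        recall = cum_tp / max(len(ground_truth), 1)
        # Make precision monotone non-increasing, then integrate over recall.
        for k in range(len(precision) - 2, -1, -1):
            precision[k] = max(precision[k], precision[k + 1])
        return float(np.sum(np.diff(np.concatenate(([0.0], recall))) * precision))

    # Toy usage: one ground-truth segment, one good and one spurious prediction.
    gts = [("v1", 2.0, 6.0)]
    preds = [("v1", 2.1, 5.8, 0.9), ("v1", 10.0, 12.0, 0.4)]
    print(interpolated_ap(preds, gts))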


Submission


For the Task1 submission format (cross-view and cross-scene are the same) and the Task2 submission format, please refer to: Test set Submission format for Leaderboard

Teams


Organizer

Quan Kong

Hitachi, Ltd.

Katsuyuki Nakamura

Hitachi, Ltd.

Hirokatsu Kataoka

AIST

Shin'ichi Satoh

NII

Takuya Maekawa

Osaka Univ.



Committee Member

Joseph Korpela

Hitachi, Ltd.

Kensho Hara

AIST

Yoshiki Ito

Hitachi, Ltd.

Saptarshi Sinha

Hitachi, Ltd.

Reference


Please cite the following paper if you use the dataset.

Publications

Bibtex

        @InProceedings{Kong_2019_ICCV,
          author = {Kong, Quan and Wu, Ziming and Deng, Ziwei and Klinkigt, Martin and Tong, Bin and Murakami, Tomokazu},
          title = {MMAct: A Large-Scale Dataset for Cross Modal Human Action Understanding},
          booktitle = {The IEEE International Conference on Computer Vision (ICCV)},
          month = {October},
          year = {2019}
        }