5 modalities
RGB, Keypoints, Acceleration, Gyroscope, Orientation.
-2021.04.05. Challenge open! Train and Val sets released here.
-2021.05.01. Evaluation script, submission format, and test data (password is the same as for the train/val request) released!
-2021.05.07. Evaluation servers open on CodaLab: task1 link and task2 link.
-2021.06.12. Evaluation server closes. (extended)
-2021.06.14. Deadline for submitting the report.
-2021.06.19. Workshop day at CVPR 2021. Check the program and Results.
The MMAct Challenge 2021 will be hosted in the CVPR'21 International Challenge on Activity Recognition (ActivityNet) Workshop.
This challenge asks participants to propose cross-modal video action recognition/localization approaches that address the shortcomings of vision-only approaches, using the MMAct Dataset.
The goal of this task is to leverage the sensor-based modalities (e.g., body-worn sensor data) as privileged information, together with the vision-based modalities, in ways that can overcome the limitations imposed by the modality discrepancy between
the train (sensor + video) and test (video only, including keypoints) phases. The modalities used for this challenge are: Acceleration, Orientation, Gyroscope, RGB video, and Keypoints.
This challenge promotes an alternative point of view on how to address vision challenges
through cross-modal methods, in the hope of expanding research on video action understanding to further leverage the sensors commonly embedded in
daily-use smart devices (e.g., smartphones).
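To make the train/test asymmetry concrete, here is a minimal sketch (in PyTorch, not an official baseline) of one way to use the sensor streams as privileged information: the video branch is trained with an added feature-distillation term toward a sensor encoder, and only the video branch is used at test time. The module names, feature dimensions, and the weight alpha are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_CLASSES = 35  # 35 action classes in MMAct

class VideoBranch(nn.Module):
    """Stand-in for a video backbone; any clip-level encoder could be plugged in."""
    def __init__(self, in_dim, feat_dim=256):
        super().__init__()
        self.encoder = nn.Sequential(nn.Flatten(), nn.Linear(in_dim, feat_dim), nn.ReLU())
        self.classifier = nn.Linear(feat_dim, NUM_CLASSES)

    def forward(self, clips):
        feats = self.encoder(clips)
        return feats, self.classifier(feats)

class SensorBranch(nn.Module):
    """Stand-in for an encoder over acceleration/gyroscope/orientation streams."""
    def __init__(self, in_dim, feat_dim=256):
        super().__init__()
        self.encoder = nn.Sequential(nn.Flatten(), nn.Linear(in_dim, feat_dim), nn.ReLU())

    def forward(self, signals):
        return self.encoder(signals)

def train_step(video_model, sensor_model, optimizer, clips, signals, labels, alpha=0.5):
    """One step: cross-entropy on the video prediction plus a feature-distillation
    term that pulls video features toward the (privileged) sensor features."""
    video_feats, logits = video_model(clips)
    with torch.no_grad():  # the sensor branch acts as a fixed teacher here
        sensor_feats = sensor_model(signals)
    loss = F.cross_entropy(logits, labels) + alpha * F.mse_loss(video_feats, sensor_feats)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

@torch.no_grad()
def predict(video_model, clips):
    """Test phase: only the video modality is available."""
    _, logits = video_model(clips)
    return logits.softmax(dim=-1)
```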
Winners will be announced at the ActivityNet Workshop at CVPR 2021, along with prizes sponsored by Hitachi, Ltd. (subject to change).
A large-scale multi-modal dataset for action understanding
RGB, Keypoints, Acceleration, Gyroscope, Orientation.
Untrimmed videos at 1920x1080@30FPS
Average clip length ranges from 3 to 8 seconds.
Daily, Abnormal, Desk work actions
Free space, Occlusion, Station Entrance, Desk work.
4 surveillance camera views
20 subjects: 10 female, 10 male
Collected using a semi-naturalistic collection protocol.
In this task, participants will use trimmed videos from MMAct along with paired sensor data. Participants may train with trimmed sensor data and trimmed video, but must test on trimmed video only for action recognition. Two results must be submitted, one for each of the two splits described below (a minimal video-only inference sketch follows the split descriptions). External datasets for pre-training are allowed as long as their use is clearly stated in the report. Submissions that use only the vision-based modality in the train phase are also welcome! To encourage cross-modal submissions, a special award will be selected from among the participants who take a cross-modal approach.
・Cross-view split: clips from multiple views for training and validation, clips from unseen views for testing.
・35 action classes from 20 subjects with four scenes.
・MMAct untrimmed dataset and MMAct trimmed cross-scene dataset are NOT permitted in this task.
・Cross-scene split: clips from 3 scenes (Free space, Desk, Entrance) for training and validation, clips from 3 scenes (Occlusion, Desk, Entrance) for testing.
・35 action classes from 20 subjects with four camera views.
・MMAct untrimmed dataset and MMAct trimmed cross-view dataset are NOT permitted in this task.
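As referenced above, the following is a hedged sketch of the video-only test phase for the two splits, reusing the VideoBranch from the earlier sketch. The loader, clip-id field, and output file names/layout are placeholders; the official submission format is the one linked below.

```python
import csv
import torch

@torch.no_grad()
def run_split(video_model, test_loader, out_csv):
    """Video-only inference over one test split (cross-view or cross-scene).
    Writes one row per clip: clip id followed by the predicted class index.
    NOTE: the clip-id field and output layout are illustrative placeholders,
    not the official submission format."""
    video_model.eval()
    with open(out_csv, "w", newline="") as f:
        writer = csv.writer(f)
        for clip_ids, clips in test_loader:  # no sensor data at test time
            scores = video_model(clips)[1].softmax(dim=-1)
            preds = scores.argmax(dim=-1)
            for cid, p in zip(clip_ids, preds.tolist()):
                writer.writerow([cid, p])

# Two result files, one per required split (hypothetical names):
# run_split(video_model, cross_view_test_loader, "cross_view_results.csv")
# run_split(video_model, cross_scene_test_loader, "cross_scene_results.csv")
```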
In this task, participants will use paired untrimmed sensor data and video for training, then test on untrimmed videos only for temporal action localization, the output being the recognized action class and its start and end time in the untrimmed video (a simple segment-generation sketch follows the list below). External datasets for pre-training are allowed as long as their use is clearly stated in the report. Submissions that use only the vision-based modality in the train phase are also welcome! To encourage cross-modal submissions, a special award will be selected from among the participants who take a cross-modal approach. Note that Keypoints are NOT provided in this task.
・~9k untrimmed videos for training and validation; the remaining untrimmed videos for testing.
・35 action classes from 20 subjects with four camera views and four scenes.
・MMAct trimmed cross-view dataset and cross-scene dataset are NOT permitted in this task.
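As referenced above, the sketch below shows one simple, assumed (not official) way to turn per-snippet class scores for an untrimmed video into the required (action class, start time, end time) output: threshold each class's score track and merge contiguous above-threshold snippets. The snippet stride and threshold are illustrative parameters.

```python
import numpy as np

def scores_to_segments(scores, snippet_stride_sec, threshold=0.5):
    """Convert a (num_snippets, num_classes) score array for one untrimmed video
    into a list of (class_id, start_sec, end_sec, score) tuples by thresholding
    each class track and merging contiguous above-threshold snippets."""
    num_snippets, num_classes = scores.shape
    segments = []
    for c in range(num_classes):
        active = scores[:, c] >= threshold
        start = None
        for t in range(num_snippets + 1):
            on = t < num_snippets and active[t]
            if on and start is None:
                start = t                       # segment opens
            elif not on and start is not None:  # segment closes
                seg_scores = scores[start:t, c]
                segments.append((c,
                                 start * snippet_stride_sec,
                                 t * snippet_stride_sec,
                                 float(seg_scores.mean())))
                start = None
    return segments

# Example with dummy scores: 20 snippets at a 0.5 s stride, 35 classes.
dummy = np.random.rand(20, 35)
print(scores_to_segments(dummy, snippet_stride_sec=0.5)[:3])
```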
The evaluation script can be found here: evaluation script
Task 1. Cross-Modal Trimmed Action Recognition: Evaluation will be carried out on both the MMAct trimmed cross-view dataset and the MMAct trimmed cross-scene dataset. We will use mean Average Precision (mAP) as the metric, and the winner of this task will be selected based on the average of this metric across the two datasets.
Task 2. Cross-Modal Untrimmed Action Temporal Localization: We will use Interpolated Average Precision (AP) as the evaluation metric, which is also used by ActivityNet. The winner will be selected based on this metric evaluated on the MMAct untrimmed cross-session dataset.
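For reference, here is a minimal sketch of a clip-level mAP computation in the spirit of the Task 1 metric; the linked evaluation script remains the authoritative implementation. It assumes integer ground-truth labels and a per-clip score matrix.

```python
import numpy as np
from sklearn.metrics import average_precision_score

def mean_average_precision(y_true, y_score):
    """y_true: (num_clips,) integer labels; y_score: (num_clips, num_classes) scores.
    Computes AP per class (one-vs-rest) and averages over classes."""
    num_classes = y_score.shape[1]
    aps = []
    for c in range(num_classes):
        targets = (y_true == c).astype(int)
        if targets.sum() == 0:  # skip classes absent from this split
            continue
        aps.append(average_precision_score(targets, y_score[:, c]))
    return float(np.mean(aps))

# Task 1 winner metric: the average of mAP over the cross-view and cross-scene test sets, e.g.
# final_score = 0.5 * (mean_average_precision(y_cv, s_cv) + mean_average_precision(y_cs, s_cs))
```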
For the Task 1 submission format (cross-view and cross-scene are the same) and the Task 2 submission format, please refer here: Test set Submission format for Leaderboard
Please cite the following paper if you use the dataset.
@InProceedings{Kong_2019_ICCV,
  author    = {Kong, Quan and Wu, Ziming and Deng, Ziwei and Klinkigt, Martin and Tong, Bin and Murakami, Tomokazu},
  title     = {MMAct: A Large-Scale Dataset for Cross Modal Human Action Understanding},
  booktitle = {The IEEE International Conference on Computer Vision (ICCV)},
  month     = {October},
  year      = {2019}
}