NTU-X: an extended version of the popular NTU dataset
This repository contains details and pretrained models for the newly introduced NTU-X dataset, an extended version of the popular NTU dataset. For additional details and experimental results on NTU-X, see the paper "NTU-X: An Enhanced Large-scale Dataset for Improving Pose-based Recognition of Subtle Human Actions".
The original NTU dataset contains human action skeletons captured using the Kinect. Each skeleton has 25 joints. However, current top-performing models seem to be bottlenecked on classes that involve finer, finger-level movements, such as reading, writing, and eating a meal.
NTU-X therefore introduces a more detailed 118-joint skeleton for the action sequences of the NTU dataset. Along with the 25 body joints, each skeleton contains 42 finger joints and 51 face joints.
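As a rough illustration of the joint breakdown above, the sketch below splits a single 118-joint frame into body, finger, and face groups. The counts (25, 42, 51) come from the description above; the ordering of the groups along the joint axis and the names `JOINT_GROUPS` and `split_frame` are assumptions made for this example, not the dataset's documented layout.

```python
# Joint counts as stated in the text; the group ordering is an assumption.
BODY_JOINTS = 25
FINGER_JOINTS = 42
FACE_JOINTS = 51
TOTAL_JOINTS = BODY_JOINTS + FINGER_JOINTS + FACE_JOINTS  # 118

# Hypothetical index ranges into the joint axis of one skeleton frame.
JOINT_GROUPS = {
    "body": slice(0, BODY_JOINTS),
    "fingers": slice(BODY_JOINTS, BODY_JOINTS + FINGER_JOINTS),
    "face": slice(BODY_JOINTS + FINGER_JOINTS, TOTAL_JOINTS),
}


def split_frame(frame):
    """Split one frame (a sequence of 118 (x, y, z) joints) into the three groups."""
    if len(frame) != TOTAL_JOINTS:
        raise ValueError(f"expected {TOTAL_JOINTS} joints, got {len(frame)}")
    return {name: frame[idx] for name, idx in JOINT_GROUPS.items()}


# Usage with a dummy frame of zeros.
dummy = [(0.0, 0.0, 0.0)] * TOTAL_JOINTS
parts = split_frame(dummy)
print({name: len(joints) for name, joints in parts.items()})
```

If the actual joint ordering differs, only the slices in `JOINT_GROUPS` would need to change.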
Model | NTU60 | NTU60-X | NTU120 | NTU120-X |
---|---|---|---|---|
DSTA-Net | 91.50 | 93.56 | 86.60 | 87.80 |
CTR-GCN | 92.40 | 93.90 | 88.90 | 88.36 |
4s-ShiftGCN | 90.70 | 91.78 | 85.90 | 86.18 |
MsG3d | 91.50 | 91.76 | 86.90 | 87.10 |
PA-ResGCN | 90.90 | 91.64 | 87.30 | 86.42 |
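To make the comparison in the table easier to read, the following sketch computes the per-model difference between each NTU-X variant and its original counterpart. The numbers are copied from the table; they are assumed here to be accuracies in percent, and the names `RESULTS` and `gains` are introduced only for this example.

```python
# Values copied from the table above (assumed to be accuracy in percent).
RESULTS = {
    "DSTA-Net":    {"NTU60": 91.50, "NTU60-X": 93.56, "NTU120": 86.60, "NTU120-X": 87.80},
    "CTR-GCN":     {"NTU60": 92.40, "NTU60-X": 93.90, "NTU120": 88.90, "NTU120-X": 88.36},
    "4s-ShiftGCN": {"NTU60": 90.70, "NTU60-X": 91.78, "NTU120": 85.90, "NTU120-X": 86.18},
    "MsG3d":       {"NTU60": 91.50, "NTU60-X": 91.76, "NTU120": 86.90, "NTU120-X": 87.10},
    "PA-ResGCN":   {"NTU60": 90.90, "NTU60-X": 91.64, "NTU120": 87.30, "NTU120-X": 86.42},
}


def gains(results):
    """Return the X-variant minus original difference for each model and benchmark."""
    out = {}
    for model, acc in results.items():
        out[model] = {
            "NTU60": round(acc["NTU60-X"] - acc["NTU60"], 2),
            "NTU120": round(acc["NTU120-X"] - acc["NTU120"], 2),
        }
    return out


for model, delta in gains(RESULTS).items():
    print(f"{model:12s} NTU60: {delta['NTU60']:+.2f}  NTU120: {delta['NTU120']:+.2f}")
```

On the numbers in the table, all five models score higher on NTU60-X than on NTU60, while on NTU120-X the CTR-GCN and PA-ResGCN entries are slightly lower than their NTU120 counterparts.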
[Figure: side-by-side comparison of NTU and NTU-X (ours) for the Write, Read, and Eat Meal classes]
Dataset | Body | Fingers | Face | # Joints | # Sequences | # Classes |
---|---|---|---|---|---|---|
MSR-Action 3d | :heavy_check_mark: | | | 20 | 567 | 20 |
Northwestern-UCLA | :heavy_check_mark: | | | 24 | 1475 | 10 |
NTU-RGB+D | :heavy_check_mark: | | | 25 | 56880 | 60 |
NTU-RGB+D 120 | :heavy_check_mark: | | | 25 | 114035 | 120 |
NTU60-X (Ours) | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | 118 | 56148 | 60 |
NTU120-X (Ours) | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | 118 | 113821 | 120 |
ROSE Lab, the creators of the NTU RGB+D dataset, have refused to give us permission to release the dataset. NTU-X addresses a fundamental raw-data-level shortcoming of existing Kinect-based datasets, one that no novel deep model can undo. We had hoped to release the dataset for the benefit of the community once we received consent. Unfortunately, for reasons best known to them, they decided to keep the dataset out of the community's reach. Apologies to those who were awaiting the release. We hope you understand.
A few experiments were performed to benchmark this new dataset using the top-performing models on the original NTU RGB+D dataset. Details about these models can be found at Models.
NTU-X contains the same classes as the NTU RGB+D dataset. The action labels are listed below:
A1 drink water. | A2 eat meal/snack. | A3 brushing teeth. |
A4 brushing hair. | A5 drop. | A6 pickup. |
A7 throw. | A8 sitting down. | A9 standing up (from sitting position). |
A10 clapping. | A11 reading. | A12 writing. |
A13 tear up paper. | A14 wear jacket. | A15 take off jacket. |
A16 wear a shoe. | A17 take off a shoe. | A18 wear on glasses. |
A19 take off glasses. | A20 put on a hat/cap. | A21 take off a hat/cap. |
A22 cheer up. | A23 hand waving. | A24 kicking something. |
A25 reach into pocket. | A26 hopping (one foot jumping). | A27 jump up. |
A28 make a phone call/answer phone. | A29 playing with phone/tablet. | A30 typing on a keyboard. |
A31 pointing to something with finger. | A32 taking a selfie. | A33 check time (from watch). |
A34 rub two hands together. | A35 nod head/bow. | A36 shake head. |
A37 wipe face. | A38 salute. | A39 put the palms together. |
A40 cross hands in front (say stop). | A41 sneeze/cough. | A42 staggering. |
A43 falling. | A44 touch head (headache). | A45 touch chest (stomachache/heart pain). |
A46 touch back (backache). | A47 touch neck (neckache). | A48 nausea or vomiting condition. |
A49 use a fan (with hand or paper)/feeling warm. | A50 punching/slapping other person. | A51 kicking other person. |
A52 pushing other person. | A53 pat on back of other person. | A54 point finger at the other person. |
A55 hugging other person. | A56 giving something to other person. | A57 touch other person's pocket. |
A58 handshaking. | A59 walking towards each other. | A60 walking apart from each other. |
A61 put on headphone. | A62 take off headphone. | A63 shoot at the basket. |
A64 bounce ball. | A65 tennis bat swing. | A66 juggling table tennis balls. |
A67 hush (quiet). | A68 flick hair. | A69 thumb up. |
A70 thumb down. | A71 make ok sign. | A72 make victory sign. |
A73 staple book. | A74 counting money. | A75 cutting nails. |
A76 cutting paper (using scissors). | A77 snapping fingers. | A78 open bottle. |
A79 sniff (smell). | A80 squat down. | A81 toss a coin. |
A82 fold paper. | A83 ball up paper. | A84 play magic cube. |
A85 apply cream on face. | A86 apply cream on hand back. | A87 put on bag. |
A88 take off bag. | A89 put something into a bag. | A90 take something out of a bag. |
A91 open a box. | A92 move heavy objects. | A93 shake fist. |
A94 throw up cap/hat. | A95 hands up (both hands). | A96 cross arms. |
A97 arm circles. | A98 arm swings. | A99 running on the spot. |
A100 butt kicks (kick backward). | A101 cross toe touch. | A102 side kick. |
A103 yawn. | A104 stretch oneself. | A105 blow nose. |
A106 hit other person with something. | A107 wield knife towards other person. | A108 knock over other person (hit with body). |
A109 grab other person’s stuff. | A110 shoot at other person with a gun. | A111 step on foot. |
A112 high-five. | A113 cheers and drink. | A114 carry something with other person. |
A115 take a photo of other person. | A116 follow other person. | A117 whisper in other person’s ear. |
A118 exchange things with other person. | A119 support somebody with hand. | A120 finger-guessing game (playing rock-paper-scissors). |
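For readers working with the label IDs above, here is a minimal sketch of recovering the action ID from a sample name. It relies on the NTU RGB+D naming pattern `SsssCcccPpppRrrrAaaa` (setup, camera, performer, replication, action); whether NTU-X files keep the same names is an assumption, and `ACTION_LABELS`, `parse_action_id`, and `subset_of` are hypothetical helpers written for this example.

```python
import re

# NTU RGB+D sample names look like S001C002P003R002A012:
# setup, camera, performer, replication, action (three digits each).
NAME_RE = re.compile(r"S(\d{3})C(\d{3})P(\d{3})R(\d{3})A(\d{3})")

# A few labels from the table above; extend with the full list as needed.
ACTION_LABELS = {
    2: "eat meal/snack",
    11: "reading",
    12: "writing",
    69: "thumb up",
}


def parse_action_id(sample_name):
    """Extract the integer action ID (1-120) from a sample name."""
    match = NAME_RE.search(sample_name)
    if match is None:
        raise ValueError(f"not an NTU-style sample name: {sample_name!r}")
    return int(match.group(5))


def subset_of(action_id):
    """A1-A60 form the 60-class set; A61-A120 appear only in the 120-class set."""
    if 1 <= action_id <= 60:
        return "NTU60"
    if 61 <= action_id <= 120:
        return "NTU120 only"
    raise ValueError(f"action ID out of range: {action_id}")


action = parse_action_id("S001C002P003R002A012.skeleton")
print(action, ACTION_LABELS.get(action, "unknown"), subset_of(action))
```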
1. How is the NTU-X dataset created?
It is created by estimating 3D SMPL-X pose outputs from the RGB frames of the NTU RGB videos. We use both SMPL-X and ExPose to perform these estimations.
2. How is the pose extractor (SMPL-X/ExPose) decided for each class?
We use a semi-automatic approach to estimate the 3D pose for the videos of each class. Keeping the intra-view and intra-subject variance of the NTU dataset in mind, we sample random videos covering each view per class of NTU and estimate the SMPL-X and ExPose outputs. The estimated skeleton is then back-projected onto its corresponding RGB frame, and the accuracy of the alignment is used to select between SMPL-X and ExPose.
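The selection step described above could be sketched roughly as follows. The text does not specify how alignment accuracy is measured, so this example assumes a mean 2D pixel distance between each extractor's back-projected keypoints and a set of reference keypoints (for instance from manual inspection or a 2D detector). The functions `mean_alignment_error` and `choose_extractor` and all the sample coordinates are hypothetical, not the procedure actually used.

```python
import math


def mean_alignment_error(projected, reference):
    """Mean Euclidean distance (in pixels) between matched 2D keypoints."""
    if len(projected) != len(reference) or not projected:
        raise ValueError("keypoint lists must be non-empty and of equal length")
    total = sum(math.dist(p, r) for p, r in zip(projected, reference))
    return total / len(projected)


def choose_extractor(candidates, reference):
    """Return (best_name, errors) where best_name has the lowest alignment error.

    candidates maps an extractor name to its back-projected 2D keypoints.
    """
    errors = {
        name: mean_alignment_error(points, reference)
        for name, points in candidates.items()
    }
    best = min(errors, key=errors.get)
    return best, errors


# Made-up keypoints for one frame, purely for illustration.
reference = [(100.0, 100.0), (200.0, 200.0)]
candidates = {
    "SMPL-X": [(103.0, 104.0), (200.0, 200.0)],
    "ExPose": [(110.0, 100.0), (200.0, 210.0)],
}
best, errors = choose_extractor(candidates, reference)
print(best, errors)
```

In practice such a score would be averaged over the sampled videos of a class before picking one extractor for that class.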
3. Which class IDs use ExPose as the pose extractor, and which use SMPL-X?
Empirically, we observe that ExPose and SMPL-X perform equally well for single-person actions, but SMPL-X, though slower, provides better pose estimates for multi-person action class sequences. The classes selected for SMPL-X and ExPose are as follows: