Supervised activity classification: Since videos are concatenations of image frames, the field of activity recognition and classification is also relevant. Furthermore, we added a classification token at the end of the final utterance; it is ignored by the language modeling loss but used as input for the classification loss. Given that assessment durations usually range in terms of...
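
The classification-token setup mentioned above can be illustrated with a minimal sketch. It is not the authors' implementation: the names `IGNORE_INDEX`, `lm_loss`, `combined_loss`, and `cls_weight` are hypothetical, and the logits are assumed to come from a model whose classification head reads the hidden state at the appended token. The sentinel value -100 is borrowed from the common convention for masked labels.

```python
import math

# Hypothetical sentinel: positions labelled with it are skipped by the LM loss.
IGNORE_INDEX = -100


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def cross_entropy(logits, target):
    """Negative log-likelihood of the target index under softmax(logits)."""
    return -log_softmax(logits)[target]


def lm_loss(token_logits, lm_labels):
    """Mean cross-entropy over positions whose label is not IGNORE_INDEX."""
    losses = [
        cross_entropy(logits, label)
        for logits, label in zip(token_logits, lm_labels)
        if label != IGNORE_INDEX
    ]
    return sum(losses) / len(losses) if losses else 0.0


def combined_loss(token_logits, lm_labels, cls_logits, cls_label, cls_weight=1.0):
    """LM loss plus a weighted classification loss.

    cls_logits are assumed to be produced from the hidden state at the
    classification token, whose own LM label is IGNORE_INDEX.
    """
    return lm_loss(token_logits, lm_labels) + cls_weight * cross_entropy(
        cls_logits, cls_label
    )


# Two ordinary tokens followed by the appended classification token.
token_logits = [[0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0], [5.0, -1.0, 2.0, 0.0]]
lm_labels = [1, 2, IGNORE_INDEX]  # the last position is excluded from the LM loss
cls_logits = [0.0, 0.0]           # two hypothetical activity classes
total = combined_loss(token_logits, lm_labels, cls_logits, cls_label=1)
```

Masking the label rather than dropping the token keeps the classification token in the input sequence, so the model can still attend to it, while the language modeling objective is never asked to predict anything at that position.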