Mixed Martial Arts Automatic Scoring: An Action Recognition Use Case


Introduction

Computer-based Artificial Intelligence algorithms keep increasing in accuracy while decreasing the time it takes to produce results, and the areas where AI is applied keep growing, from autonomous driving to medical devices. Specific subtypes of AI, such as Deep Learning, mimic how the human brain processes data for object detection, speech recognition, language translation, and decision making. Since the classical milestone of 1996, when Deep Blue won a chess game against world champion Garry Kasparov, many key AI innovations have occurred. RidgeRun knows the value of staying up to date with the latest AI markets and the research efforts happening globally. The Mixed Martial Arts Automatic Scoring project is a research project that tests several state-of-the-art papers and implementations in order to gather AI knowledge and approaches, so that customer requirements can be addressed and their problems solved in the best and most innovative way in the shortest possible time.

Roadmap

The following image shows the roadmap for the project. Currently, a labeling tool is being developed to match our tagging needs and, at the same time, to provide a general approach that can fit any customer's need to label videos or images for their AI projects. Once the tool is finished, a proof-of-concept demonstration will be built that automatically scores a Mixed Martial Arts match.

Action Recognition

The task of action recognition or action detection involves analyzing videos and determining what action or motion is being performed. The subjects of these videos are predominantly humans performing some action, but this requirement can be relaxed to generalize over other subjects such as animals or robots. Applications range from human-computer interaction to automated video editing proposals. Spatio-temporal action recognition adds action localization: the task is not only to determine what action is being performed, but also when and where it is performed in the video. For more details see the paper: Spatio-temporal Action Recognition: A Survey.
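As a concrete illustration of the "what, when and where" of spatio-temporal action recognition, a single detection can be represented as a small record. The field and class names below are only illustrative, they are not taken from any specific library:

  from dataclasses import dataclass

  @dataclass
  class ActionDetection:
      """One spatio-temporal action detection: what, when, and where."""
      label: str          # what: the recognized action, e.g. "punch"
      start_time: float   # when: segment start, in seconds
      end_time: float     # when: segment end, in seconds
      bbox: tuple         # where: (x1, y1, x2, y2) of the actor in the frame
      score: float        # model confidence in [0, 1]

  # Example: a punch detected between 12.0 s and 13.5 s of the video
  detection = ActionDetection("punch", 12.0, 13.5, (0.31, 0.22, 0.58, 0.90), 0.87)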

Train and Test Approach: SlowFast

The SlowFast model involves (i) a Slow pathway, operating at a low frame rate, that captures spatial semantics, and (ii) a Fast pathway, operating at a high frame rate, that captures motion at fine temporal resolution. The Fast pathway can be made very lightweight by reducing its channel capacity, yet it still learns useful temporal information for video recognition. The models achieve strong performance for both action classification and detection in video, and the authors pinpoint large improvements as contributions of the SlowFast concept. For more details see the paper: SlowFast Networks for Video Recognition.
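A minimal sketch of the two-pathway sampling idea: the Fast pathway keeps a dense set of frames, while the Slow pathway keeps only a sparse subset of the same clip. The stride tau and speed ratio alpha below are illustrative defaults, not the exact settings of any released SlowFast model:

  import numpy as np

  def slowfast_sample(frames, tau=16, alpha=8):
      """Split a decoded clip into Slow and Fast pathway inputs.

      frames: array of shape (num_frames, H, W, C)
      tau:    temporal stride of the Slow pathway (one frame kept every tau frames)
      alpha:  speed ratio; the Fast pathway samples alpha times more frames
      """
      fast = frames[:: tau // alpha]   # high frame rate, fine temporal resolution
      slow = frames[::tau]             # low frame rate, spatial semantics
      return slow, fast

  # Toy example: a 64-frame clip of 224x224 RGB frames
  clip = np.zeros((64, 224, 224, 3), dtype=np.uint8)
  slow, fast = slowfast_sample(clip)
  print(slow.shape[0], fast.shape[0])  # 4 frames for Slow, 32 for Fast

In the paper the Fast pathway is additionally made lightweight by shrinking its channel width, so the extra frames do not translate into a proportional compute cost.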

Labeling Approach: AVA

The process of data labeling in machine learning is as important as the model and the training. The common phrase "garbage in, garbage out" illustrates why. If you have millions of training samples with good tag information that accurately describes what the data contains, it is much easier for the network to learn what you are trying to teach it. Therefore, the first step when developing a deep learning project is to discuss the quantity and quality of the available data. If data is available but not labeled/tagged, considerable time must be invested to make sure the labeling is done with high quality.

The labeling approach is based on Atomic Visual Actions (AVA). The AVA dataset densely annotates 80 atomic visual actions in 430 15-minute video clips, where actions are localized in space and time, resulting in 1.58M action labels with multiple labels per person occurring frequently. The key characteristics are:

  1. The definition of atomic visual actions, rather than composite actions
  2. Precise Spatio-temporal annotations with possibly multiple annotations for each person
  3. Exhaustive annotation of these atomic actions over 15-minute video clips
  4. People temporally linked across consecutive segments
  5. Using movies to gather a varied set of action representations

Label time frame

To label the actions performed by a person, a key choice is the annotation vocabulary, which in turn is determined by the temporal granularity at which actions are classified. In AVA, short segments (±1.5 seconds centered on a keyframe) are used to provide temporal context for labeling the actions in the middle frame.
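For reference, the ±1.5-second window translates to a frame range once the video frame rate is known; the 30 fps value below is only an assumption for illustration:

  def keyframe_window(keyframe_index, fps=30.0, half_span_s=1.5):
      """Return the (first, last) frame indices of the +/- 1.5 s segment
      centered on a keyframe, clamped at the start of the video."""
      span = int(round(half_span_s * fps))
      return max(0, keyframe_index - span), keyframe_index + span

  # At 30 fps, the segment around frame 900 covers frames 855..945 (~3 s of video)
  print(keyframe_window(900))  # (855, 945)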

Person bounding box annotation

Each person and his or her actions are localized with a bounding box. When multiple subjects are present in a keyframe, each subject is shown to the annotator separately for action annotation, so their action labels can differ.

Person link annotation

Bounding boxes are linked over short periods of time to obtain ground-truth person tracks. The pairwise similarity between bounding boxes in adjacent keyframes is calculated using a person embedding.
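A minimal sketch of this linking step, assuming a person-embedding vector is already available for each bounding box: compute pairwise cosine similarity between boxes in adjacent keyframes, then solve a one-to-one assignment. Hungarian matching is used here as one reasonable choice; the exact matching procedure in the AVA paper may differ:

  import numpy as np
  from scipy.optimize import linear_sum_assignment

  def link_people(emb_prev, emb_next):
      """Match person boxes across two adjacent keyframes.

      emb_prev: (N, D) embeddings for boxes in the earlier keyframe
      emb_next: (M, D) embeddings for boxes in the later keyframe
      Returns a list of (i, j) index pairs, one per linked person.
      """
      # Cosine similarity matrix between every pair of boxes
      a = emb_prev / np.linalg.norm(emb_prev, axis=1, keepdims=True)
      b = emb_next / np.linalg.norm(emb_next, axis=1, keepdims=True)
      sim = a @ b.T
      # Hungarian assignment maximizes total similarity (minimizes its negative)
      rows, cols = linear_sum_assignment(-sim)
      return list(zip(rows, cols))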

Action annotation

The action labels are generated by annotators using the interface shown in the next image. It allows entering up to 7 action labels:

  • 1 pose action (required)
  • 3 person-object interactions (optional)
  • 3 person-person interactions (optional).

If none of the listed actions is descriptive enough, annotators can flag a check box called "other action". In addition, they can flag segments containing blocked or inappropriate content, or incorrect bounding boxes. On average, annotators take 22 seconds to annotate a given video segment at the proposal stage, and 19.7 seconds at the verify stage.
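A small sanity check of the "1 + 3 + 3" rule is sketched below. It assumes the label_type values used in the AVA label list (PERSON_MOVEMENT for poses, OBJECT_MANIPULATION and PERSON_INTERACTION for the two interaction groups); treat this as an illustration of the annotation constraint, not part of any official AVA tooling:

  from collections import Counter

  # Per-segment limits implied by the annotation interface:
  # exactly 1 pose, up to 3 person-object and 3 person-person interactions.
  LIMITS = {"PERSON_MOVEMENT": 1, "OBJECT_MANIPULATION": 3, "PERSON_INTERACTION": 3}

  def valid_annotation(label_types):
      """label_types: list of label_type strings chosen for one person in one segment."""
      counts = Counter(label_types)
      if counts.get("PERSON_MOVEMENT", 0) != 1:   # the pose action is required
          return False
      return all(counts.get(t, 0) <= limit for t, limit in LIMITS.items())

  print(valid_annotation(["PERSON_MOVEMENT", "PERSON_INTERACTION", "PERSON_INTERACTION"]))  # True
  print(valid_annotation(["PERSON_INTERACTION"]))  # False: missing the required pose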

The Interface example picture for the AVA project can be found in the paper: AVA: A Video Dataset of Spatio-temporally Localized Atomic Visual Actions

Label reference example

The action label list is defined as in the following example:

  • label { name: "watch (a person)" action_id: 80 label_type: PERSON_INTERACTION }
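These entries follow the protobuf text format. A lightweight way to load them into Python, without depending on the protobuf package, is a small regex parser; the file name ava_action_list.pbtxt below is an assumption:

  import re

  def load_label_map(path="ava_action_list.pbtxt"):
      """Parse AVA-style label entries into {action_id: (name, label_type)}."""
      with open(path) as f:
          text = f.read()
      pattern = re.compile(
          r'label\s*\{\s*name:\s*"([^"]+)"\s*action_id:\s*(\d+)\s*label_type:\s*(\w+)\s*\}'
      )
      return {int(aid): (name, ltype) for name, aid, ltype in pattern.findall(text)}

  # Example result: {80: ("watch (a person)", "PERSON_INTERACTION"), ...}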

AVA Results

The following figure breaks down performance by category and by the number of training examples. While more data generally yields better performance, the outliers reveal that not all categories are equally complex. Categories correlated with scenes and objects (such as swimming) or categories with low diversity (such as fall down) obtain high performance despite having fewer training examples. In contrast, categories with a lot of data, such as touching and smoking, obtain relatively low performance, possibly because they exhibit large visual variations.

The previous bar graph was generated with results from the paper: AVA: A Video Dataset of Spatio-temporally Localized Atomic Visual Actions


Label Tool Design Criteria

With RidgeRun's custom labeling tool we aim to provide the following advantages:

  • Features that can be developed internally and that may not be available in other labeling tools.
  • A timeframe feature that allows different ranges. Common labeling tools do not include this feature and usually work frame by frame or on still images.
  • A general labeling system that can adapt to problem-domain or problem-specific details, for example, several tags for a single frame.
  • Support for the required output format (such as AVA). Open-source tools usually do not include this support.

The current development progress includes a simple system to tag a keyframe, whose parameters, such as frame overlap (in seconds) and keyframe length (number of frames grouped), can be changed as needed; labels provided by the user and loaded at startup, which allows a very flexible tagging scheme; and an export system to the AVA format, as sketched below.
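As a sketch of what exporting a labeled keyframe to the AVA format could look like: AVA ground-truth files are CSV rows containing the video id, the keyframe timestamp in seconds, the normalized box corners (x1, y1, x2, y2), the action id, and a person id. The function name and sample values below are illustrative, not the tool's actual API:

  import csv

  def export_ava_rows(path, annotations):
      """Write keyframe annotations as AVA-style CSV rows.

      annotations: iterable of dicts with keys
        video_id, timestamp (s), x1, y1, x2, y2 (normalized to [0, 1]),
        action_id, person_id
      """
      with open(path, "w", newline="") as f:
          writer = csv.writer(f)
          for a in annotations:
              writer.writerow([
                  a["video_id"], a["timestamp"],
                  f'{a["x1"]:.3f}', f'{a["y1"]:.3f}', f'{a["x2"]:.3f}', f'{a["y2"]:.3f}',
                  a["action_id"], a["person_id"],
              ])

  # One row per (person, action) pair; a person with two actions produces two rows.
  export_ava_rows("mma_ava_train.csv", [
      {"video_id": "fight_001", "timestamp": 902, "x1": 0.31, "y1": 0.22,
       "x2": 0.58, "y2": 0.90, "action_id": 80, "person_id": 1},
  ])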



Contact Us

Visit our Main Website for the RidgeRun Products and Online Store. RidgeRun Engineering information is available on the RidgeRun Professional Services, RidgeRun Subscription Model and Client Engagement Process wiki pages. Please email support@ridgerun.com for technical questions and contactus@ridgerun.com for other queries. Contact details for sponsoring the RidgeRun GStreamer projects are available on the Sponsor Projects page.