
Assembly Line Activity Recognition

<seo title="jetson orin | nvidia jetson orin | RidgeRun" titlemode="replace" keywords="GStreamer, NVIDIA, Jetson, TX1, TX2, Jetson AGX Xavier, Xavier, AI, Deep Learning, Machine Learning, Jetson TX1, Jetson TX2, Jetson Xavier, NVIDIA Jetson Xavier, NVIDIA Jetson Orin, Jetson Orin, Orin, NVIDIA Orin, NVIDIA Jetson AGX Orin, Jetson AGX Orin, Assembly Line, Assembly Line Activity, activity recognition, machine learning activity recognition" description="This RidgeRun wiki is about the project involves a complete machine learning activity recognition process."></seo>
<seo title="Assembly Line Activity Recognition | Activity Recognition | RidgeRun" titlemode="replace" metakeywords="GStreamer, NVIDIA, Jetson, TX1, TX2, Jetson AGX Xavier, Xavier, AI, Deep Learning, Machine Learning, Jetson TX1, Jetson TX2, Jetson Xavier, NVIDIA Jetson Xavier, NVIDIA Jetson Orin, Jetson Orin, Orin, NVIDIA Orin, NVIDIA Jetson AGX Orin, Jetson AGX Orin, Assembly Line, Assembly Line Activity, activity recognition, machine learning activity recognition" metadescription="This RidgeRun wiki is about the project involves a complete machine learning activity recognition process."></seo>


=== Dataset characteristics ===


Each dataset sample is associated with a group of frames called a window; each window is 30 frames long (1 second). Each sample has a label with 2 additional fields besides the actual label: video-id and timestamp, which identify the part of the video it corresponds to.


The labels are stored in a comma-separated file with the following structure:
<pre>
video-id, timestamp, class
</pre>
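For illustration, a few hypothetical rows (the video ids and timestamp format below are made up; only the class names are taken from this project) could look like:
<pre>
video_001, 00:02:15, Install spacer
video_001, 00:02:16, Install washer
video_002, 00:10:42, Part Removal
</pre>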
===Exploratory Data Analysis (EDA)===


Data is a key aspect of a successful machine learning application: a dataset must have the right characteristics in order to train a neural network that produces good results. To determine those characteristics in the assembly dataset, an Exploratory Data Analysis (EDA) was performed. The EDA allowed us to better understand the dataset composition, its strengths and weaknesses, and to foresee possible issues or biases during the training phase. The following plot shows the dataset class distribution.


[[File:Assembly_datset_distribution.png|600px|center|thumb|Assembly dataset distribution]]
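A distribution plot like the one above can be obtained directly from the label file. The following is only a minimal sketch of that step, not the project's actual EDA code, and it assumes the comma-separated label file is named <code>labels.csv</code>:
<pre>
# Minimal EDA sketch: plot the class distribution from the labels file.
# "labels.csv" is a hypothetical file name following the structure described above.
import pandas as pd
import matplotlib.pyplot as plt

labels = pd.read_csv("labels.csv", names=["video_id", "timestamp", "class"])

distribution = labels["class"].value_counts(normalize=True) * 100  # percentage per class
print(distribution)

distribution.plot(kind="bar", title="Assembly dataset class distribution")
plt.ylabel("Percentage of samples")
plt.tight_layout()
plt.show()
</pre>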


From this EDA it was concluded that the ''Part Removal'' label was too underrepresented: only 0.35% of the labels fell under this category, which makes it very complicated to train a network for this case. Based on this and other experimental results, it was decided that this class would not be used for the final network training.


The EDA also included a visualization process for the class samples, which helped determine any dependency between classes. As a result, it was noted that the ''Install spacer'' and ''Install washer'' labels happened together in most of the video samples, which means that there were no unique samples for training; this resulted in the combination of these classes into a new one called ''Install spacer and washer''.
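Merging the two dependent labels can be done with a simple remapping over the label file; the snippet below is only a sketch of the idea, reusing the hypothetical <code>labels.csv</code> name:
<pre>
# Sketch: merge the dependent classes into a single combined label.
import pandas as pd

labels = pd.read_csv("labels.csv", names=["video_id", "timestamp", "class"])
labels["class"] = labels["class"].replace({
    "Install spacer": "Install spacer and washer",
    "Install washer": "Install spacer and washer",
})
labels = labels.drop_duplicates()  # windows that carried both labels collapse to one row
labels.to_csv("labels_merged.csv", index=False, header=False)
</pre>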
=== Network implementation ===


The project uses a SlowFast architecture, a 3D convolutional neural network with two main data paths: a slow path and a fast path. These paths specialize in spatial data and temporal data, respectively, providing complete spatiotemporal data processing.
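As an illustration of the two-path input, the snippet below shows the typical way a clip is split into the slow and fast pathway tensors for a PyTorchVideo SlowFast model; this is a generic sketch, and the subsampling factor <code>alpha=4</code> is the common default, not a value confirmed for this project:
<pre>
import torch

def pack_pathways(frames, alpha=4):
    """Split a clip of shape (C, T, H, W) into SlowFast pathway inputs.

    The fast pathway keeps all T frames; the slow pathway temporally
    subsamples the clip by a factor of alpha.
    """
    fast = frames
    slow_indices = torch.linspace(0, frames.shape[1] - 1, frames.shape[1] // alpha).long()
    slow = torch.index_select(frames, 1, slow_indices)
    return [slow, fast]

# Example: a 30-frame window (1 second at 30 fps) of 224x224 RGB frames.
clip = torch.rand(3, 30, 224, 224)
inputs = pack_pathways(clip)  # [slow: (3, 7, 224, 224), fast: (3, 30, 224, 224)]
</pre>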


The network used is a PyTorch implementation from the PyTorchVideo library. The following figure shows a basic representation of the layers that constitute this network.
[[File:Slowfast_assembly_network.png|500px|center|thumb|SlowFast network basic representation]]
<br>
This project was built around this PyTorchVideo SlowFast implementation; the software allows us to experiment with this network for our specific use case, modify parameters, and test multiple training and tuning techniques in order to achieve the best possible results. For experiment tracking and reproducibility, the project was built using DVC as an MLOps tool, with a 5-stage process as seen in the following dependency diagram.


[[File:Assembly_line_MLOps_pipeline.png|350px|center|thumb|MLOps pipeline stages]]
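As a rough illustration of how such a pipeline is declared in DVC, a <code>dvc.yaml</code> along the following lines could be used; the stage names, scripts, and paths are placeholders and not the project's actual files:
<pre>
# Hypothetical dvc.yaml sketch of a 5-stage pipeline; all names and paths are placeholders.
stages:
  prepare:
    cmd: python src/prepare.py
    deps: [src/prepare.py, data/raw]
    outs: [data/prepared]
  split:
    cmd: python src/split.py
    deps: [src/split.py, data/prepared]
    outs: [data/splits]
  train:
    cmd: python src/train.py
    deps: [src/train.py, data/splits]
    params: [train]
    outs: [models/slowfast.pt]
  evaluate:
    cmd: python src/evaluate.py
    deps: [src/evaluate.py, models/slowfast.pt, data/splits]
    metrics:
      - metrics.json:
          cache: false
  export:
    cmd: python src/export.py
    deps: [src/export.py, models/slowfast.pt]
    outs: [models/slowfast.onnx]
</pre>
With a file like this, <code>dvc repro</code> re-runs only the stages whose dependencies changed, and <code>dvc exp run</code> can be used to track individual experiments.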
'''Transfer learning'''


For the experimentation process, a baseline training was executed to have a point of reference and a starting point. After the baseline was executed, the first decision to make was whether to use transfer learning or keep training from scratch. Using transfer learning yielded the biggest improvement in network performance; we used a SlowFast model from torch hub that was trained on the [https://www.deepmind.com/open-source/kinetics Kinetics] 400 dataset. This achieved roughly twice the performance in one-fourth of the time compared to training from scratch, improving both the training times and the network performance. The following plots show the difference between the transfer learning and baseline confusion matrices.
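The snippet below is a minimal sketch of this kind of transfer learning with the PyTorchVideo torch hub model; the head replacement and freezing details are assumptions for illustration and may differ from the project's actual training code:
<pre>
import torch
import torch.nn as nn

# Load a SlowFast R50 model pre-trained on Kinetics-400 from the PyTorchVideo hub.
model = torch.hub.load("facebookresearch/pytorchvideo", "slowfast_r50", pretrained=True)

# Replace the final projection layer so the classifier outputs the assembly classes
# instead of the 400 Kinetics classes (num_classes is a placeholder value).
num_classes = 6
head = model.blocks[-1]
head.proj = nn.Linear(head.proj.in_features, num_classes)

# Optionally freeze the backbone and fine-tune only the new classification head.
head_prefix = f"blocks.{len(model.blocks) - 1}"
for name, param in model.named_parameters():
    param.requires_grad = name.startswith(head_prefix)
</pre>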


<gallery widths=350px heights=350px mode=packed>
'''Classes subsampling and replacement'''


The next problem that needed solving was the data imbalance present in the dataset. As seen in the original data distribution plot, the dataset was not balanced and some classes were underrepresented. To tackle this problem, the first technique tested was dataset subsampling, where not all available samples were used; only selected samples from each class were kept in order to maintain a balanced distribution. This was not optimal since a lot of useful data was being left out. After that, data replication was introduced, where samples were selected with replacement; this was also not ideal since samples from the underrepresented classes were repeated many times in the dataset. Finally, different loss functions were tested, particularly weighted cross entropy and focal loss, both of which account for the data distribution when calculating the loss. This yielded the best results and led to the use of focal loss for all experiments going forward. The following plots show the original baseline dataset distribution and the final distribution used for most of the experiments; the latter also reflects the removal of underrepresented labels.
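The focal loss itself is straightforward to express on top of the standard cross entropy; the following is a generic multi-class sketch, and the hyperparameter values are illustrative rather than the ones used in the experiments:
<pre>
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0, class_weights=None):
    """Multi-class focal loss: down-weights easy, well-classified samples so that
    hard samples (often from underrepresented classes) dominate the gradient."""
    ce = F.cross_entropy(logits, targets, weight=class_weights, reduction="none")
    pt = torch.exp(-ce)                       # estimated probability of the true class
    return ((1.0 - pt) ** gamma * ce).mean()

# Example usage with random data: a batch of 8 samples and 6 classes (placeholder count).
logits = torch.randn(8, 6)
targets = torch.randint(0, 6, (8,))
loss = focal_loss(logits, targets, gamma=2.0)
</pre>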


<gallery widths=350px heights=250px mode=packed>
Once the data imbalance problem was tackled, the next problem was overfitting. This can be seen in the image below, extracted from the balanced training using transfer learning, where the validation loss plot crosses the training loss plot, indicating that the validation loss does not decrease at the same rate as the training loss and therefore the network is overfitting to the training dataset.


To solve this, the first experiment was training with more data, specifically with the complete dataset; this, however, did not reduce the overfitting. The next experiment was to remove the underrepresented classes such as Part Removal, which improved the network, but the overfitting remained. Finally, the training approach was changed to use cross-validation, which solved the overfitting issue and was kept for the final training. The following image shows the result of cross-validation training and how the two plots no longer cross.
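A sketch of how the window-level labels could be split into folds for such a cross-validation run is shown below; the file name and fold count are placeholders, and the article does not state exactly how the folds were built:
<pre>
import pandas as pd
from sklearn.model_selection import StratifiedKFold

# Window-level labels as described in the dataset section (hypothetical file name).
labels = pd.read_csv("labels.csv", names=["video_id", "timestamp", "class"])

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for fold, (train_idx, val_idx) in enumerate(skf.split(labels, labels["class"])):
    train_windows = labels.iloc[train_idx]
    val_windows = labels.iloc[val_idx]
    # ... build the video datasets for this fold and train/validate the SlowFast model
    print(f"fold {fold}: {len(train_windows)} train / {len(val_windows)} val windows")
</pre>
Splitting by video-id (for example with <code>GroupKFold</code>) instead of by individual windows may be preferable to avoid near-identical frames leaking between folds; the source does not specify which approach was used.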


<gallery widths=350px heights=350px mode=packed>
== Results ==


The network was tested on a separate part of the dataset that had not been presented to the network. The best-performing model achieved an accuracy of 0.91832, a precision of 0.91333, a recall of 0.91832, and an f1-score of 0.91521. These metrics indicate a very high performance of the network for this particular task; the following table presents the complete training, validation, and testing results.


{| class="wikitable" style="margin-left: auto; margin-right: auto; border: none;"
{| class="wikitable" style="margin-left: auto; margin-right: auto; border: none;"
Line 322: Line 322:
|}
|}
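The metrics above can be computed from the test-set predictions with scikit-learn; the snippet below is a generic sketch, and the use of weighted averaging is an assumption since the article does not state which averaging was used:
<pre>
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# y_true and y_pred would be the ground-truth and predicted class ids for the test set.
y_true = [0, 1, 2, 2, 1, 0]   # placeholder values
y_pred = [0, 1, 2, 1, 1, 0]   # placeholder values

accuracy = accuracy_score(y_true, y_pred)
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="weighted", zero_division=0
)
print(accuracy, precision, recall, f1)
</pre>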


In addition, the following plots show the training and validation loss for the best-performing network, as well as the test confusion matrix; this matrix shows predominance along its diagonal, indicating a match between the network's predictions and the samples' ground truth.
<br>
<br>
== What is next? ==


Having a trained model for the specific assembly process makes it possible to get a real-time prediction of every action that happens during production. These predictions can be used for online analysis of the production process on an embedded system, and to automatically flag the recorded videos when a specific set of events happens, such as a part being removed, an incorrect assembly sequence, or a completed part. All this data can be automatically logged into the system and tied to the recorded video for later analysis. Remote configuration can be enabled on the edge device to allow the user to control the detection settings, such as which events to log, the recording settings, and the behavior on specific events, among others. The following figure shows a simplified diagram of the entire solution for a production line:


[[File:action_recognition_use_case.png|600px|thumb|center|Typical solution for an action recognition system on a production line]]
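As a rough idea of the online-analysis layer, the sketch below flags deviations from an expected assembly sequence using the per-window predictions; the class names ''Install spacer and washer'' and ''Part Removal'' come from this project, while the expected order and flagging policy are hypothetical:
<pre>
# Hypothetical event-flagging sketch for the edge device: compare the stream of
# per-window predictions against an expected assembly order and report deviations.
EXPECTED_ORDER = ["Install spacer and washer", "Install cover", "Final inspection"]  # placeholder steps

def flag_events(predicted_actions, expected=EXPECTED_ORDER):
    """Return (window index, message) tuples for events worth logging."""
    flags = []
    step = 0
    for i, action in enumerate(predicted_actions):
        if action == "Part Removal":
            flags.append((i, "part removed"))
        elif step < len(expected) and action == expected[step]:
            step += 1                                  # expected step completed
        elif action in expected:
            flags.append((i, f"out-of-order action: {action}"))
    if step == len(expected):
        flags.append((len(predicted_actions), "assembly completed"))
    return flags
</pre>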