This article is a translation of the Japanese version.
The Japanese version is here*1.
Hi, this is chang.
Previously, I tried to make a 6-channel U-Net learn multiple kinds of defects. At that time, I wrote that "one 1-channel network per kind seemed to be better"*2. Today, I tested this in detail.
Contrary to my expectation, the conclusion is that 6 channels is better.
0. What's a channel?
The channels of a neural network are the layers stacked perpendicular to the image. For simplicity, I explain this in image-processing terms. Channels are often used to represent the RGB of color images; I used them that way for GANs previously*3*4.
To make a U-Net learn multiple kinds of defects, the channels are used to output the inference for each kind. If there are 6 kinds of defects, the network needs 6 output channels and outputs 6 kinds of inference.
Strong operations like convolution and max pooling do not act in the channel direction, but the network is still connected in that direction. Although the network has to detect different kinds of defects, the channels depend on each other. That means the connections among the channels have to be weakened during learning, so that in the end the network works independently for each kind of defect. That is why I predicted that multi-kind detection using channels would require a very long training time.
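As a minimal sketch (with random values standing in for the actual network output), the 6-channel output can be pictured as an (H, W, 6) array in which each channel is thresholded independently to give one defect mask per class:

```python
import numpy as np

# Hypothetical 6-channel U-Net output for one image: shape (H, W, 6),
# i.e. one sigmoid probability map per defect class.
H, W, N_CLASSES = 4, 4, 6
rng = np.random.default_rng(0)
output = rng.random((H, W, N_CLASSES))

# Each channel is thresholded independently: one boolean mask per class.
masks = output > 0.5                        # shape (H, W, 6)
per_class_pixels = masks.sum(axis=(0, 1))   # defect pixel count per class

print(masks.shape)             # (4, 4, 6)
print(per_class_pixels.shape)  # (6,)
```

This is exactly the structure in which the channels must end up independent of each other, even though the convolutions inside the network mix them.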
1. Developing apps.
Imagine that you build a deep learning model and implement the network in an app. In many cases, training is executed on a Linux computer with a large GPU, while the app for users runs on iOS, Android, or Windows. So you have to transfer the trained network to the target environment and write inference code there. Today, we assume a Windows application written in C#.
For the convenience of the application, we consider the following 4 points.
(1) Training time
Training time has nothing to do with the convenience of the application, but if it is too long, you cannot release the application. So you have to design the network architecture with a realistic training time in mind.
(2) Size of trained network
A trained network is a binary file output by the library (TensorFlow and Keras in my case). The application loads this file. If the network is too large, too much memory or GPU is consumed. In my view, using a GPU in the application is not realistic, so today we use the CPU for inference and discuss memory consumption.
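As a small, self-contained sketch of checking the file size before shipping (the file name and dummy payload are assumptions; in practice the file comes from model.save() on the training machine):

```python
import os

# "unet_6ch_dummy.h5" is a hypothetical file name; a dummy payload
# stands in for the real network exported by Keras/TensorFlow.
model_path = "unet_6ch_dummy.h5"
with open(model_path, "wb") as f:
    f.write(b"\0" * 400 * 1024)  # dummy 400 KB payload

size_kb = os.path.getsize(model_path) / 1024
print(f"{model_path}: {size_kb:.3f} KB")  # 400.000 KB for this dummy
os.remove(model_path)
```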
(3) Inference time
You will feel frustrated if the application takes a long time to show the result, so we want to reduce the inference time as much as possible.
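A minimal sketch of how the inference time can be measured (fake_predict is a hypothetical stand-in for the real call, e.g. model.predict in Keras or the C# inference code in the viewer):

```python
import time

def fake_predict(image):
    time.sleep(0.01)  # placeholder for the actual network computation
    return image

start = time.perf_counter()
fake_predict(None)
elapsed_ms = (time.perf_counter() - start) * 1000.0
print(f"inference: {elapsed_ms:.1f} ms")
```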
(4) Accuracy of detection
The most important point: the accuracy of defect detection.
2. Dataset
We use DAGM images with defects*5.
DAGM
From the top left: defect images of Class_1, Class_2, Class_3, Class_4, Class_5, and Class_6.
There are 150 defect images for each class.
3. Program
For the deep learning part, there were no big changes from the previous development. The new development was the viewer written in C#.
Viewer for inference result
Drag & drop an image onto the PictureBox and push the button; the inference result is then shown.
The processing time is shown in the ListBox.
4. Result
(1) Training time
The table below compares the training times under the same conditions: batch size 8 and 200 epochs.
Class | Training time [s] |
---|---|
1 | 794 |
2 | 798 |
3 | 800 |
4 | 801 |
5 | 789 |
6 | 797 |
all | 5488 |
Training of each 1-channel network took about 800 sec. per class. Training of the 6-channel network took about 5500 sec. (91.6 min.). The 6-channel time is greater than six times the 1-channel time, but the difference is small (about 12 min.).
The training progress also showed no big difference, although I had predicted that the accuracy of the 1-channel networks would rise in a shorter time. As shown below, the accuracy on the test data stabilized at about 50 epochs for both the 1-channel and 6-channel networks.
Transition of accuracy for test data
(2) Size of trained network
The table below shows the sizes of the trained networks. There was no big difference between 1 channel and 6 channels. The 6-channel network is a tiny bit larger, but I think you can ignore the difference.
Class | Size of trained network [KB] |
---|---|
1 | 404.469 |
2 | 404.469 |
3 | 404.469 |
4 | 404.469 |
5 | 404.469 |
6 | 404.469 |
all | 404.506 |
Although I could not measure the precise memory consumption, Windows Task Manager showed that both the 1-channel and 6-channel networks consumed about 600 MB. It is very interesting that increasing the channels did not increase the memory consumption. 6 channels has a big advantage here, because loading six 1-channel networks would consume 3.6 GB.
(3) Inference time
Although there was no statistical comparison, inference took about 6 sec. with every network. 6 channels took a little longer (about 10 ms), but the difference was very small.
(4) The accuracy of detection
The figure below shows the inference results for a defect image of Class_1, inferred using the networks trained with 1 channel. As you can see, the networks for Class_5 and Class_6 misjudged.
Inference result of 1 channel U-Net
From the top left: the results of the trained networks for Class_1, Class_2, Class_3, Class_4, Class_5, and Class_6.
On the other hand, the 6-channel network correctly judged that only Class_1 had a defect.
Inference result of 6 channel U-Net
Considering the accuracy over all the classes, 6 channels was superior.
5. Consideration
Channel | Training time | Size | Inference time | Accuracy |
---|---|---|---|---|
1 | △ | ☓ | △ | ☓ |
6 | △ | ○ | △ | ○ |
The results were far from my expectations. I had predicted that 1 channel would be superior in accuracy, but I overlooked that the 1-channel networks easily misjudge images of other classes. One option for 1 channel is to train on the images of the other classes as non-defect... but I think it would not work well, because it requires complex processing like weighting*6.
The only superior point of 1 channel, I think, is flexibility. If you want to add another kind of defect after training and the number of channels exactly matches the current variation, you have to construct the network from scratch. If you have spare channels, I think it may be possible to perform additional training on the additional defect data, but this requires investigation. In my view, based on the current experiment, multi-channel is better if the system spec and the number of defect kinds are fixed. A single channel is good for basic investigation before the actual development.
6. Afterword
It was fun, with a new discovery. I wonder why the inference time showed no difference between single and multiple channels. Sorry for my poor knowledge; I will investigate in the future.
Today's source code is here*7.
*1:https://changlikesdesktop.hatenablog.com/entry/2020/07/19/132644
*2:https://changlikesdesktop.hatenablog.com/entry/2019/08/26/064135
*3:https://changlikesdesktop.hatenablog.com/entry/2020/06/26/150550
*4:https://changlikesdesktop.hatenablog.com/entry/2020/06/22/201412
*5:https://conferences.mpi-inf.mpg.de/dagm/2007/prizes.html
*6:https://changlikesdesktop.hatenablog.com/entry/2019/05/15/052922