Rough Sets Theory with Deep Learning for Tracking in Natural Interaction with Deaf
Authors: Mohammad Ebrahimi 1, Hossein Ebrahimpour-Komeleh 2
1 - Electrical and Computer Engineering Kashan University
2 - Faculty of Electrical and Computer Engineering Kashan University
Keywords: Natural Interaction with Deaf, Machine Vision, Persian Deaf News Hand Tracking, Sign Language, Rough Sets Theory, Deep Learning.
Abstract:
Sign languages commonly serve as an alternative or complementary mode of human communication. Tracking is one of the most fundamental problems in computer vision and is used in a long list of applications, such as sign language recognition. Despite great advances in recent years, tracking remains challenging due to many factors, including occlusion and scale variation. Because the head or the left hand can be mistakenly detected instead of the right hand when regions overlap, and because the hand area is uncertain across the frames of deaf news videos, we propose two methods: first, tracking using a particle filter, and second, tracking using rough set theory on granular information together with a deep neural network. We combine rough sets with a deep neural network and apply the combination to hand and head tracking in the Deaf News video signal. We used rough set theory to increase the accuracy of skin segmentation in the video signal. Using the deep neural network, we extracted the inherent relationships among the frame pixels and generalized the resulting features to tracking. The proposed system is tested on 33 Deaf News videos with 100 different words and 1927 word-level video files, and recall, MOTA and MOTP values are obtained.
Journal of Information Systems and Telecommunication, http://jist.acecr.org, ISSN 2322-1437 / EISSN 2345-2773
Received: 03 May 2021 / Revised: 01 Feb 2022 / Accepted: 07 Feb 2022
1- Introduction
Recognizing hand states and gestures is very important for natural interaction with a computer, because of its widespread applications in virtual reality, sign language recognition and computer games. Fast and robust hand gesture recognition remains an open problem [1].
Tracking the hand in a video makes it simpler to segment it from the image frames. The purpose of tracking methods is to detect and follow one or more objects through a sequence of images; tracking can be thought of as a kind of object detection in a set of similar images. Many tracking methods are applied to video, where a large number of images has to be processed. Various probabilistic inference models have been applied to multi-object tracking, such as the Kalman filter, the extended Kalman filter and the particle filter. For a linear system with Gaussian-distributed object states, the Kalman filter is proved to be the optimal estimator. For the nonlinear case, the extended Kalman filter is one solution: it approximates the nonlinear system by a Taylor expansion. Monte Carlo sampling-based models have also become popular in tracking, especially since the introduction of the particle filter. Typically, the Maximum A Posteriori (MAP) strategy is adopted to derive the state with the maximum probability [2,3].
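As a rough illustration of the linear-Gaussian case mentioned above, the following sketch runs a textbook Kalman filter on a one-dimensional constant-velocity target. The model matrices, noise levels and the simulated measurements are assumptions chosen for the example; they are not taken from the paper.

```python
import numpy as np

def kalman_step(x, P, z, F, H, Q, R):
    """One predict/update cycle of a linear Kalman filter."""
    # Predict the state and its covariance with the motion model F.
    x_pred = F @ x
    P_pred = F @ P @ F.T + Q
    # Update with the measurement z through the observation model H.
    innovation = z - H @ x_pred
    S = H @ P_pred @ H.T + R
    K = P_pred @ H.T @ np.linalg.inv(S)
    x_new = x_pred + K @ innovation
    P_new = (np.eye(len(x)) - K @ H) @ P_pred
    return x_new, P_new

# Assumed constant-velocity model: state = [position, velocity].
F = np.array([[1.0, 1.0], [0.0, 1.0]])
H = np.array([[1.0, 0.0]])          # only the position is observed
Q = 0.01 * np.eye(2)                # assumed process noise
R = np.array([[1.0]])               # assumed measurement noise

x = np.array([0.0, 0.0])
P = 10.0 * np.eye(2)
rng = np.random.default_rng(0)
for t in range(1, 51):
    # Simulated noisy position of a target moving 2 units per frame.
    z = np.array([2.0 * t + rng.normal(0.0, 1.0)])
    x, P = kalman_step(x, P, z, F, H, Q, R)

print("estimated position and velocity:", x)
```

The extended Kalman filter follows the same two steps but replaces F and H with Jacobians of the nonlinear models, which is the Taylor-expansion approximation referred to in the text.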
About 466 million deaf people live in the world, approximately 5.3% of the world population, and their natural language is sign language. They are limited in reading and writing the official language, which affects their education, work, and use of computers and the Internet. Sign language recognition, if used in interaction with the computer and in the translation of texts to hand gestures, can support them well [2].
Deep learning is a kind of hierarchical learning. In layered hierarchical learning, nonlinear features are extracted layer by layer, and the output layer is usually sized according to how many classes are needed [4]. The output layer is a classifier that combines all the features to make predictions. The deeper the layer hierarchy, the more nonlinear features are extracted, which is why deep learning uses many layers. Such complex features sometimes cannot be obtained directly from the input image.
A convolutional neural network (CNN) is a popular deep learning architecture that automatically learns useful feature representations directly from image data. CNNs, or ConvNets, are essential tools for deep learning and are especially useful for image classification, object detection and recognition tasks. CNNs are implemented as a series of interconnected layers.
A semantic segmentation network classifies every pixel in an image, resulting in an image that is segmented by class. Semantic segmentation networks such as DeepLab make extensive use of dilated convolutions, also known as atrous convolutions, because they can increase the receptive field of a layer without increasing the number of parameters or computations.
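The effect of dilation on the receptive field can be seen in a small one-dimensional sketch. The function below is an illustrative NumPy implementation written for this example, not the layer used in DeepLab or in the paper: with the same three kernel weights, a dilation of 2 makes each output sample cover five input samples instead of three.

```python
import numpy as np

def dilated_conv1d(signal, kernel, dilation=1):
    """'Valid' 1-D correlation with gaps of `dilation` between kernel taps."""
    k = len(kernel)
    span = (k - 1) * dilation + 1            # input samples covered by one output
    out_len = len(signal) - span + 1
    out = np.zeros(out_len)
    for i in range(out_len):
        taps = signal[i : i + span : dilation]   # k samples, `dilation` apart
        out[i] = np.dot(taps, kernel)
    return out, span

signal = np.arange(10, dtype=float)
kernel = np.array([1.0, 1.0, 1.0])           # three weights in both cases

dense, span1 = dilated_conv1d(signal, kernel, dilation=1)
atrous, span2 = dilated_conv1d(signal, kernel, dilation=2)
print("receptive field without dilation:", span1)
print("receptive field with dilation 2:", span2)
```

The number of multiplications per output sample is the same in both calls, since only the spacing of the taps changes.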
Although many years have passed since target tracking was first formulated, it is still an active research field with many applications in universities and scientific circles worldwide. The problem is particularly important when tracking targets that move with quick maneuvers, because the dynamics of the target motion are complex and nonlinear. Because the targets we are interested in tracking are highly maneuvering, a variety of intelligent methods have been developed to track them as well as possible.
The rough sets approach to approximating sets has led to beneficial aspects of granular computing and is part of computational intelligence. The basic idea of rough sets for granular information concerns how well subsets can be used to approximate the objects of interest [5]. Rough set theory is also convenient for identifying irrelevant and redundant features in a dataset [6]. Here the computational intelligence of rough sets is used. To overcome the lack of information in a particular application, the causes of that lack are first identified, and the necessary relationships are then used to compensate for it. In practice, subsets of the classes are characterized by rough sets, and the boundary and negative members obtained from the definition of rough sets are then guided to their proper position by suitably defined functions.
Tracking is an important task for which machine learning is widely used. Li et al. used deep learning for sign language recognition [7]. The literature review of Wadhawan and Kumar indicated that most research on sign language recognition has been performed on static, isolated and single-handed signs captured by camera [3].
To communicate with the deaf, it is necessary to recognize the signs they express. Facial gesture, trajectory and hand gesture are the three basic features for recognizing a sign expressed by a deaf person. Hand and head tracking is used to find the trajectory and to segment the hands and head from the background of the video frames. The problem is therefore to track the hands and head accurately in videos of signs expressed by the deaf.
Sign languages differ between countries. In this work, a Persian dataset of sign language videos has been collected, which is available at Kashan University. The proposed system is tested on 33 Deaf News videos with 100 different words and 1927 word-level video files, and recall, MOTA and MOTP values are obtained. The novelty of this paper is the use of rough set theory with a deep neural network for sign language tracking. This is the first work on this topic. The paper first explains tracking using a particle filter and then tracking using rough sets and deep learning. The first proposed method, based on a particle filter, has high accuracy but is very time consuming. The second proposed method responds much faster but is less accurate; to increase its accuracy, we used rough set theory.
2- Proposed Algorithm
Sign language recognition has many applications, such as transcription, video reconstruction and sign language services for the deaf. In this regard, we have tried to create a sign language recognition system for Persian, so that hearing people and the deaf can easily interact with each other. Sign language recognition uses a variety of sub-systems, each with its own characteristics and procedures, and the relationship between the components of the system is an important issue that cannot be ignored. The purpose of this research is to design and train a deep learning network for Persian sign language recognition.
The first part of the system is multi-object tracking. We developed a new multi-tracking method that uses rough set theory to track objects in a video signal automatically. The objects in this system are the two hands and the face. The geometric feature describing where an object is at different times, in other words its trajectory, can help in selecting the region appropriately, improving segmentation and identifying the results in this application [1,2,8].
The rough sets approach to approximating sets has led to benefits from granular computing and is part of computational intelligence. The basic idea of rough sets for granular information concerns how well subsets can be used for approximation when detecting and segmenting the objects of interest. In this system, the computational intelligence of rough sets is used. To overcome the lack of information in a particular application, the causes of that lack are first identified, and relationships are then used to compensate for it. In practice, rough sets determine subsets of the categories, and the boundary and negative members obtained from the definition of rough sets are then directed to their proper place by a suitably defined function [9].
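The positive, boundary and negative members mentioned above follow from the standard rough set definitions: objects that are indiscernible on the chosen attributes form classes, the lower approximation collects classes lying entirely inside the concept, and the upper approximation collects classes that overlap it. A minimal sketch, using made-up pixel attributes (`skin_tone`, `moving`) that are purely hypothetical and not the attributes of the paper:

```python
from collections import defaultdict

def rough_regions(objects, attributes, target):
    """Split objects into positive, boundary and negative regions w.r.t. `target`.

    objects:    dict mapping object id -> dict of attribute values
    attributes: attribute names that define indiscernibility
    target:     set of object ids forming the concept to approximate
    """
    # Group objects that cannot be told apart on the chosen attributes.
    classes = defaultdict(set)
    for obj_id, values in objects.items():
        key = tuple(values[a] for a in attributes)
        classes[key].add(obj_id)

    lower, upper = set(), set()
    for members in classes.values():
        if members <= target:       # class lies entirely inside the concept
            lower |= members
        if members & target:        # class overlaps the concept
            upper |= members

    positive = lower
    boundary = upper - lower
    negative = set(objects) - upper
    return positive, boundary, negative

# Hypothetical pixels described by two coarse attributes.
pixels = {
    "p1": {"skin_tone": 1, "moving": 1},
    "p2": {"skin_tone": 1, "moving": 1},
    "p3": {"skin_tone": 1, "moving": 0},
    "p4": {"skin_tone": 1, "moving": 0},
    "p5": {"skin_tone": 0, "moving": 0},
}
hand = {"p1", "p2", "p3"}           # pixels labelled as hand in this toy example

positive, boundary, negative = rough_regions(pixels, ["skin_tone", "moving"], hand)
print(sorted(positive), sorted(boundary), sorted(negative))
```

Here `p3` and `p4` share the same attribute values but only `p3` is labelled as hand, so both fall in the boundary region; these are the members that the text proposes to redirect with an additional function.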
(1)
Here the folding function of equation (1) is used. In equation (1), the two arguments are the location coordinates of a pixel in image I. When the two hands overlap, or a hand overlaps the face, weak boundaries are created. The tracker then fails, meaning that it jumps from the right hand to the left hand or to the face. The function g makes the very weak boundary of overlapping regions visible: it converts the intermediate gray values of the boundary area to completely white values. The tracker then does not cross the boundary and continues to track correctly within its own area.
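The exact form of g is the one given by equation (1). Purely as an illustration of the described behaviour (intermediate gray values become white), the sketch below assumes a simple two-threshold rule; the function name `harden_boundary` and the thresholds `low` and `high` are hypothetical and are not taken from the paper.

```python
import numpy as np

def harden_boundary(gray, low=50, high=200):
    """Map intermediate gray levels (low < v < high) to white (255).

    `low` and `high` are illustrative thresholds, not values from the paper.
    """
    gray = np.asarray(gray, dtype=np.uint8)
    out = gray.copy()
    out[(gray > low) & (gray < high)] = 255   # weak boundary pixels become white
    return out

# One image row crossing a weak boundary between two regions.
row = np.array([0, 30, 90, 140, 180, 255], dtype=np.uint8)
hardened = harden_boundary(row)
print(hardened)
```

With a hardened boundary of this kind, a region-based tracker sees a clear separating edge where the hands or the hand and face overlap.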
2-1- Proposed Method 1: Multi-Tracking using Particle Filter
In simple terms, filtering refers to the process of locating and following targets over the course of a video. Filtering is very important in tracking because the targets move with quick maneuvers, so the dynamics of the target motion are complex and nonlinear. Recently, particle filtering has emerged as a tracking approach and as an alternative to mean shift. It is a stochastic approach that models nonlinear motion with non-Gaussian noise.
Filter-based tracking approaches generally have two stages: prediction and update. In the prediction stage, the model predicts the location of the hand in the next frame using a motion model. When the next frame arrives, the actual location is observed and the motion model is updated using the observation model. In the particle filter method this is done pixel by pixel, which raises the computational complexity.
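The prediction and update stages can be sketched with a generic bootstrap particle filter that tracks a 2-D position. This is a textbook formulation under assumed settings (random-walk motion model, Gaussian observation likelihood, multinomial resampling, simulated observations); it is not the pixel-level implementation used in the paper.

```python
import numpy as np

def particle_filter_step(particles, weights, observation, rng,
                         motion_std=2.0, obs_std=3.0):
    """One predict/update/resample cycle of a bootstrap particle filter."""
    n = len(particles)
    # Prediction: move every particle with an assumed random-walk motion model.
    particles = particles + rng.normal(0.0, motion_std, size=particles.shape)
    # Update: reweight by a Gaussian likelihood of the observed position.
    sq_dist = np.sum((particles - observation) ** 2, axis=1)
    weights = weights * np.exp(-0.5 * sq_dist / obs_std ** 2)
    weights = weights + 1e-300               # guard against all-zero weights
    weights = weights / weights.sum()
    # State estimate: weighted mean of the particles.
    estimate = weights @ particles
    # Resampling: draw particles in proportion to their weights.
    idx = rng.choice(n, size=n, p=weights)
    return particles[idx], np.full(n, 1.0 / n), estimate

rng = np.random.default_rng(0)
n = 500
particles = rng.uniform(0.0, 100.0, size=(n, 2))   # spread over a 100x100 frame
weights = np.full(n, 1.0 / n)

true_pos = np.array([20.0, 30.0])                  # simulated hand position
for _ in range(30):
    true_pos = true_pos + np.array([1.0, 0.5])
    observation = true_pos + rng.normal(0.0, 3.0, size=2)
    particles, weights, estimate = particle_filter_step(
        particles, weights, observation, rng)

print("true:", true_pos, "estimated:", estimate)
```

The cost grows with the number of particles and with the cost of evaluating the likelihood for each one, which is consistent with the remark above about computational complexity.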
For each position in the frame at each time step, a local score is calculated. The global score of a position is the total score of the best path so far that ends at that position. For each position in the image, the best predecessor is searched for among a set of possible previous positions. This best predecessor is then stored in a table of back pointers, which is used for the trace back.
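The local score, global score and back-pointer table described above form a standard dynamic programming recursion. The sketch below simplifies positions to a single axis and assumes that a path may move at most `max_jump` positions between frames; both simplifications and the example scores are assumptions made for illustration.

```python
import numpy as np

def dp_track(local_scores, max_jump=1):
    """Best path through a (T, N) array of local scores via dynamic programming.

    local_scores[t, p] is the score of position p at time t. A path may move
    at most `max_jump` positions between consecutive frames.
    """
    T, N = local_scores.shape
    global_scores = np.full((T, N), -np.inf)
    back = np.zeros((T, N), dtype=int)        # table of back pointers
    global_scores[0] = local_scores[0]
    for t in range(1, T):
        for p in range(N):
            lo, hi = max(0, p - max_jump), min(N, p + max_jump + 1)
            # Best predecessor among the allowed previous positions.
            best_prev = lo + int(np.argmax(global_scores[t - 1, lo:hi]))
            back[t, p] = best_prev
            global_scores[t, p] = global_scores[t - 1, best_prev] + local_scores[t, p]
    # Trace back from the best final position.
    path = [int(np.argmax(global_scores[-1]))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1], float(global_scores[-1].max())

scores = np.array([
    [5, 1, 0, 0],
    [1, 4, 0, 0],
    [0, 1, 6, 0],
    [0, 0, 1, 7],
], dtype=float)
path, total = dp_track(scores)
print(path, total)
```

For real frames the positions would be 2-D pixel coordinates and the set of predecessors a small neighbourhood, but the recursion and the trace back are unchanged.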
Principal Component Analysis (PCA), performed by the Karhunen-Loève (KL) transform, produces features that are mutually uncorrelated. The solution obtained by the KL transform is optimal when dimensionality reduction is the goal and one wishes to minimize the mean square approximation error.
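The decorrelation property can be checked numerically. The sketch below computes PCA through an eigendecomposition of the sample covariance matrix on synthetic data (the data and the function name `pca_kl` are assumptions for this example) and then inspects the covariance of the projected features.

```python
import numpy as np

def pca_kl(X, n_components):
    """Project rows of X onto the top eigenvectors of the sample covariance."""
    mean = X.mean(axis=0)
    Xc = X - mean                                  # center the data
    cov = np.cov(Xc, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)         # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_components]
    basis = eigvecs[:, order]                      # leading principal directions
    return Xc @ basis, basis, mean

rng = np.random.default_rng(0)
# Synthetic correlated features: 200 samples, 5 dimensions.
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))

Z, basis, mean = pca_kl(X, n_components=3)
feature_cov = np.cov(Z, rowvar=False)
print(np.round(feature_cov, 6))
```

The printed covariance matrix is diagonal up to rounding, which is the "mutually uncorrelated" property, and the diagonal entries are the three largest eigenvalues in decreasing order.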
Mean face difference images (MFDI) are difference images between the mean face and the tracked face patch computed over a sentence or word segment.
The motion energy feature is used for silence detection in the presented system. Additionally, the use of motion energy as a feature for sign language recognition is investigated.
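The text does not give a formula for motion energy. A common choice, assumed here, is the sum of squared differences between consecutive grayscale frames, with silence declared when the energy stays below a threshold; the threshold value in the example is hypothetical.

```python
import numpy as np

def motion_energy(frames):
    """Sum of squared differences between consecutive grayscale frames."""
    frames = np.asarray(frames, dtype=np.float64)
    diffs = np.diff(frames, axis=0)            # frame[t+1] - frame[t]
    return (diffs ** 2).sum(axis=(1, 2))       # one energy value per frame pair

def is_silent(energy, threshold):
    """True where the motion energy is below the (assumed) threshold."""
    return energy < threshold

# Toy clips of 4x4 frames: one without motion, one that flickers.
still = np.zeros((3, 4, 4))
moving = np.stack([np.zeros((4, 4)), np.ones((4, 4)), np.zeros((4, 4))])

energy_still = motion_energy(still)
energy_moving = motion_energy(moving)
print(energy_still, is_silent(energy_still, threshold=1.0))
print(energy_moving, is_silent(energy_moving, threshold=1.0))
```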
Hand position is normalized with respect to the shoulder and the vertical body axis. The Gabor wavelet transform is one of the most effective texture feature extraction techniques and has resulted in many successful practical applications.
PCA, MFDI, motion energy, hand position, hand texture, speed, and the RGB to YCbCr and grayscale conversions are the features used for sign language recognition.
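For the color conversions in this list, the sketch below uses the full-range ITU-R BT.601 coefficients that are also used by JPEG. Which variant of the conversion the authors used is not stated, so this choice is an assumption; the luma channel Y serves as the grayscale image.

```python
import numpy as np

def rgb_to_ycbcr(rgb):
    """Convert 8-bit RGB values to full-range YCbCr (BT.601 / JPEG convention)."""
    rgb = np.asarray(rgb, dtype=np.float64)
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cb = 128.0 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 128.0 + 0.5 * r - 0.418688 * g - 0.081312 * b
    return np.stack([y, cb, cr], axis=-1)

def rgb_to_gray(rgb):
    """Grayscale image taken as the luma (Y) channel."""
    return rgb_to_ycbcr(rgb)[..., 0]

white = rgb_to_ycbcr([255, 255, 255])
black = rgb_to_ycbcr([0, 0, 0])
gray_of_red = rgb_to_gray([255, 0, 0])
print(white, black, gray_of_red)
```

Separating brightness (Y) from chrominance (Cb, Cr) is a common reason for using this color space in skin segmentation, since skin tones tend to cluster in the chrominance plane.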
Vision-based communication is more complex and more expressive than speech-based communication. Direct communication between deaf people and others is very difficult, so there have been attempts to build a sign language interpreting system; a diagram of it is shown in Fig. 1. The first proposed method, based on a particle filter, has high accuracy but is very time consuming. The second proposed method responds much faster but is less accurate; to increase its accuracy, we used rough set theory.
2-2- Proposed Method 2: Multi-Tracking using Rough Sets
2-2-1- Rough Set
Using fuzzy theory, and in particular rough set theory, to handle the uncertainties in the tracking problem, we attempt to obtain the best track. All points of the video frame are entered in the first column of a database table as examples. Each property of a point is stored in a separate column, and the value of each attribute is recorded for each point. Because the data come from a camera mounted in a fixed place, the table is recomputed for each frame as the changes in the positive, negative and boundary sets are added. Each time, the matrix of the hand region is multiplied by the general relationship, and the intermediate and primary matrices are obtained. Defining and using proper conversion functions improves the tracking method. If the system works online, the processing must keep pace with the video. This is called active learning [10,11].
Fig. 1: The outline of Proposed method 2: multi-tracking using Rough sets
Fig. 1 shows the outline of proposed method 2, multi-tracking using Rough sets.
The most important features of rough set theory are:
· Finding relationships that are not discovered by statistical methods.
· Ability to use quantitative and qualitative information.
· Finding a minimum set of data that is useful for categorization (such as minimizing dimensions and number of data).
· Assessing the importance of data.
· Generating decision rules from data [3].
Defining and using convenient conversion functions improves the tracking method. The composite decision table is given in Table 1.
Table 1: The composite decision table
U | a1 | a2 | a3 | a4 | a5 | a6 | a7 | D
 | 0 | 0 | 0 | 1 | {0,1,..,255} | {0,1,..,255} | {0,1,..,255} | No
 | 0 | 0 | 0 | 1 | {0,1,..,255} | {0,1,..,255} | {0,1,..,255} | No
… | … | … | … | … | … | … | … | …
 | 0 | 0 | 0 | 1 | {0,1,..,255} | {0,1,..,255} | {0,1,..,255} | No
Net1 and Rough
TP: 18823 | FP: 447
FN: 35 | TN: 3819
For the 'Face' region, both methods, "Particle Filter" and "Net1 and Rough", achieve Recall = 1, Accuracy = 1 and Precision = 1.
Both methods obtain good results, but "Net1 and Rough" produces its answer in a shorter time.
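Recall, precision and accuracy follow directly from confusion-matrix counts using their standard definitions. As a worked example, the sketch below applies them to the "Net1 and Rough" counts reported above (TP 18823, FP 447, FN 35, TN 3819); the function name is chosen for this example only.

```python
def classification_metrics(tp, fp, fn, tn):
    """Recall, precision and accuracy from confusion-matrix counts."""
    recall = tp / (tp + fn)                        # share of true regions found
    precision = tp / (tp + fp)                     # share of detections that are correct
    accuracy = (tp + tn) / (tp + fp + fn + tn)     # share of all decisions that are correct
    return recall, precision, accuracy

recall, precision, accuracy = classification_metrics(tp=18823, fp=447, fn=35, tn=3819)
print(f"recall={recall:.3f} precision={precision:.3f} accuracy={accuracy:.3f}")
```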
Table 3: left hand: Rough and Net1: Recall=0.952, Accuracy=0.977, Precision=0.938
Net1 and Rough
TP: 946 | FP: 62
FN: 473 | TN: 21653
In Table 4, the deep learning with rough sets tracking system is compared with other methods.
3-1- Result of Proposed Method 1: Multi-Tracking using Particle Filter
The results of the particle filter are shown in Fig. 7.
Fig. 7: Some results of Multi-tracking using particle filter on video signal DeafNews dataset
3-2- Result of Proposed Method 2: Semantic Segmentation Network using Deep Learning
The network has 56 layers and 66 connections; its input is an image and its output is a semantic segmentation. This 56-layer deep learning network segments the hand and face areas from the background. A second deep learning network tracks the right hand, face and left hand, whose areas were obtained in the preceding stage. The accuracy on the dataset is 97.35%. The dataset used by the system consists of deaf news videos. The results of deep learning tracking are shown in Fig. 9, Fig. 10 and Table 4. Thus, two separate deep learning networks are used for tracking in this paper.
Table 4: Recall, MOTP and MOTA on the Persian Deaf News dataset
Dataset | Recall | MOTP | MOTA | Method
Persian Deaf News | 0.948 | 0.935 | 0.971 | Particle filter (proposed method 1)
Persian Deaf News | 0.974 | 0.971 | 0.980 | Deep learning with rough sets (proposed method 2)
Table 5: MOTP and MOTA on the MOT16 dataset
MOTP | MOTA | Method
0.772 | 0.548 | DeepMOT-Tracker [12]
0.765 | 0.476 | Deep learning with rough sets (proposed method)
To test the proposed algorithm further, another network was trained on the MOT16 dataset. The results are shown in Table 5, where they are compared with the best result of [12].
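For reference, MOTA and MOTP are usually computed with the CLEAR MOT definitions: MOTA is one minus the ratio of all errors (misses, false positives and identity switches) to the number of ground-truth objects, and MOTP on MOT16 is the average overlap between matched predictions and ground truth. The sketch below uses made-up counts for illustration only; they are not results from the paper.

```python
def mota(false_negatives, false_positives, id_switches, num_gt):
    """Multiple Object Tracking Accuracy (CLEAR MOT definition)."""
    return 1.0 - (false_negatives + false_positives + id_switches) / num_gt

def motp(match_overlaps):
    """Multiple Object Tracking Precision as the mean overlap of matched pairs."""
    return sum(match_overlaps) / len(match_overlaps)

# Hypothetical counts summed over all frames of a sequence.
example_mota = mota(false_negatives=30, false_positives=15, id_switches=5, num_gt=1000)
# Hypothetical bounding-box overlaps (IoU) of three matched pairs.
example_motp = motp([0.9, 0.8, 0.7])
print(f"MOTA={example_mota:.3f} MOTP={example_motp:.3f}")
```

MOTA penalizes detection and identity errors, while MOTP only measures how well matched boxes align, so a tracker can score well on one and poorly on the other.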
4- Conclusion
Tracking is one of the most fundamental problems in computer vision and is used in a long list of applications, such as sign language recognition. The novelty of this paper is the use of rough set theory with a deep neural network for sign language tracking. This is the first work on this topic. The paper first explained tracking using a particle filter and then tracking using rough sets and deep learning. The first proposed method, based on a particle filter, has high accuracy but is very time consuming. The second proposed method responds much faster but is less accurate; to increase its accuracy, we used rough set theory. The proposed system was tested on 33 Deaf News videos with 100 different words and 1927 word-level video files, and recall, MOTA and MOTP values were obtained. For comparison, the method was also applied, with a new mask, to the MOT16 dataset.
We focused our efforts on optimizing tracking with a semantic deep network and rough set theory; in future work, we want to use the proposed methods for sign language recognition.
References
[1] H. Tang, H. Liu, W. Xiao and N. Sebe, "Fast and robust dynamic hand gesture recognition via key frames extraction and feature fusion", Neurocomputing, Vol. 331, 2019, pp. 424-433.
[2] L. Kraljević, M. Russo, M. Pauković and M. Šarić, "A Dynamic Gesture Recognition Interface for Smart Home Control based on Croatian Sign Language", Appl. Sci., Vol. 10, 2020, 2300.
[3] A. Wadhawan and P. Kumar, "Sign Language Recognition Systems: A Decade Systematic Literature Review", Arch Computat Methods Eng, 2019, https://doi.org/10.1007/s11831-019-09384-2.
[4] P. Kim, "MATLAB Deep Learning: With Machine Learning, Neural Networks and Artificial Intelligence", Apress, 2017.
[5] A. E. Hassanien, A. Abraham, J. F. Peters, G. Schaefer and C. Henry, "Rough sets and near sets in medical imaging: a review", IEEE Transactions on Information Technology in Biomedicine, Vol. 13(6), 2009, pp. 955-968.
[6] V. Sattari-Naeini and A. Moaref, "Fuzzy-rough Information Gain Ratio Approach to Filter-wrapper Feature Selection", International Journal of Engineering, Vol. 30(9), 2017, pp. 1326-1333.
[7] D. Li, C. R. Opazo, X. Yu and H. Li, "Word-level Deep Sign Language Recognition from Video: A New Large-scale Dataset and Methods Comparison", Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2020, pp. 1459-1469.
[8] A. H. Mazinan and J. Hassanian, "A Hybrid Object Tracking for Hand Gesture Approach based on MS-MD and its Application", Journal of Information Systems and Telecommunication (JIST), Vol. 3, No. 4, 2015.
[9] S. Ildarabadi, M. Ebrahimi and H. R. Pourreza, "Improvement Tracking Dynamic Programming using Replication Function for Continuous Sign Language Recognition", International Journal of Engineering Trends and Technology (IJETT), Vol. 7(3), 2014, pp. 97-101.
[10] P. Chiranjeevi and S. Sengupta, "Rough-Set-Theoretic Fuzzy Cues-Based Object Tracking Under Improved Particle Filter Framework", IEEE Transactions on Fuzzy Systems, Vol. 24, No. 3, 2016.
[11] J. B. Zhang, T. R. Li and H. M. Chen, "Composite rough sets for dynamic data mining", Information Sciences, Vol. 257, 2014, pp. 81-100.
[12] Y. Xu, A. Ošep, Y. Ban, R. Horaud, L. Leal-Taixé and X. Alameda-Pineda, "How to train your deep multi-object tracker", Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2020, pp. 6787-6796.
* Hossein Ebrahimpour-Komeleh
ebrahimpour@kashanu.ac.ir