PhD (Electrical and Computer Engineering) thesis by Qing Chen, 2008
Introduction
This research aims to detect free-air hand gestures from video input. An ordinary web camera captures the video, and the implementation has two parts:
- Recognizing the hand from the video input
- Identifying and classifying selected hand gestures
One of the main strengths of this research is that it proposes a real-time gesture recognition system built on a divide-and-conquer strategy. The researcher combines statistical and syntactic analysis for gesture recognition. The 3D position of the hand is recovered according to the camera's perspective projection. For high-level hand gesture recognition, a stochastic context-free grammar (SCFG) is used to analyze the syntactic structure of the hand gestures, with the terminal strings converted from the postures detected by the low level of the architecture.
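To make the perspective projection idea concrete, here is a minimal sketch of how a 3D hand position could be estimated from a 2D detection box with a single camera. It uses the generic pinhole camera model, not necessarily the thesis's exact method. The names Camera and hand_position_3d, the default hand width of 0.09 m, and the assumption that the focal length is known in pixels are all mine.

```python
from dataclasses import dataclass


@dataclass
class Camera:
    focal_px: float  # focal length in pixels (assumed known from calibration)
    cx: float        # principal point, x coordinate in pixels
    cy: float        # principal point, y coordinate in pixels


def hand_position_3d(cam, center_u, center_v, box_width_px, hand_width_m=0.09):
    """Estimate (x, y, z) in metres from a detection box, pinhole model.

    An object of real width W at depth Z appears w = f * W / Z pixels wide,
    so Z = f * W / w. The assumed real hand width is a hypothetical constant.
    """
    z = cam.focal_px * hand_width_m / box_width_px
    # Back-project the box centre to metric x and y at that depth.
    x = (center_u - cam.cx) * z / cam.focal_px
    y = (center_v - cam.cy) * z / cam.focal_px
    return x, y, z


cam = Camera(focal_px=600.0, cx=320.0, cy=240.0)
x, y, z = hand_position_3d(cam, center_u=420.0, center_v=240.0, box_width_px=90.0)
```

The depth estimate depends directly on the assumed hand width, so a sketch like this would only be as user-independent as that constant allows.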
Contribution
To achieve natural and immersive human-computer interaction, the human hand could be used as an interface device. Hand gestures are a powerful human-to-human communication channel, which forms a major part of information transfer in our everyday life.
Early research on vision-based hand tracking and gesture recognition usually needed the help of markers or colored gloves. Current state-of-the-art vision-based techniques focus more on tracking the bare hand and identifying hand gestures without the help of any markers or gloves, and this research has the same focus.
Since many current approaches are still limited by a lack of speed, accuracy, robustness, and real-time support, the contribution of this thesis is to build a real-time 3D hand tracking and gesture recognition system for human-computer interaction (HCI).
In the first chapter, the researcher lists the following as the contributions of this research.
- A two-level system architecture is implemented, which combines the advantages of statistical and syntactic pattern recognition approaches effectively, and achieves real-time, accurate and robust hand tracking and gesture recognition with one camera as the input device.
- A parallel cascade structure for the architecture's low level is implemented using the AdaBoost learning algorithm and a set of Haar-like features. This structure can correctly extract a set of hand postures and track the hand motion in 3D in real time.
- The hand gestures are analyzed based on an SCFG, which defines the composite properties based on the constituent hand postures. The probability assigned to each production rule of the SCFG can be used to control the "wanted" and "unwanted" gestures. Smaller probabilities can be assigned to the "unwanted" gestures and greater ones to the "wanted" gestures, so that the resulting SCFG generates the "wanted" gestures with higher probabilities.
- For hand motion analysis, given the uncertainty of hand trajectories, ambiguous inputs can be resolved by looking for the SCFG that has the higher probability of generating the input string. The motion patterns can be controlled by adjusting the probabilities associated with the production rules, so that the resulting SCFG generates the standard motion patterns with higher probabilities.
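The idea of picking the SCFG with the higher probability of generating the input string can be sketched as follows. This is my own toy illustration, not the thesis's grammar: the gesture names "grasp" and "release", the postures "palm" and "fist", and all probabilities are made up, and the grammars are assumed to be in Chomsky normal form so that the standard inside (probabilistic CYK) algorithm applies.

```python
from collections import defaultdict


def inside_probability(tokens, lexical, binary, start="S"):
    """Probability that an SCFG in Chomsky normal form generates `tokens`.

    lexical: {(lhs, terminal): prob} for rules lhs -> terminal
    binary:  {(lhs, left, right): prob} for rules lhs -> left right
    """
    n = len(tokens)
    if n == 0:
        return 0.0
    # chart[i][j][A] = probability that A derives tokens[i..j] (inclusive)
    chart = [[defaultdict(float) for _ in range(n)] for _ in range(n)]
    for i, tok in enumerate(tokens):
        for (lhs, term), p in lexical.items():
            if term == tok:
                chart[i][i][lhs] += p
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span - 1
            for k in range(i, j):  # split point between left and right child
                for (lhs, left, right), p in binary.items():
                    pl = chart[i][k].get(left, 0.0)
                    pr = chart[k + 1][j].get(right, 0.0)
                    if pl and pr:
                        chart[i][j][lhs] += p * pl * pr
    return chart[0][n - 1].get(start, 0.0)


# Hypothetical gesture grammars. The small probabilities absorb noisy detections.
GRAMMARS = {
    "grasp": {
        "lexical": {("A", "palm"): 0.9, ("A", "fist"): 0.1,
                    ("B", "fist"): 0.9, ("B", "palm"): 0.1},
        "binary": {("S", "A", "B"): 1.0},
    },
    "release": {
        "lexical": {("A", "fist"): 0.9, ("A", "palm"): 0.1,
                    ("B", "palm"): 0.9, ("B", "fist"): 0.1},
        "binary": {("S", "A", "B"): 1.0},
    },
}


def classify(tokens, grammars=GRAMMARS):
    """Return the best-scoring gesture name and the score of every grammar."""
    scores = {name: inside_probability(tokens, g["lexical"], g["binary"])
              for name, g in grammars.items()}
    return max(scores, key=scores.get), scores


best, scores = classify(["palm", "fist"])
```

Raising or lowering a rule's probability in one of these tables is the knob that makes a gesture more "wanted" or "unwanted" in this sketch.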
How does it relate to my work?
The researcher suggests some performance requirements that a 3D hand recognition system needs to meet. They are:
- Real-time performance
- Accuracy
- Robustness
- Scalability
- User-independence
I also need to consider the above requirements, since my research needs to recognize the hand as well. (The recent trend in 3D hand recognition in a free-hand environment is to use the Leap Motion sensor, so I have to find out whether that sensor covers those requirements.)
The second chapter, the literature review, starts by describing the skeleton structure and the joints of the human hand.
As shown in the figure above, the human hand has many degrees of freedom (DOF), which makes hand gesture recognition a very challenging problem. When talking about hand gesture recognition, there are two concepts that we should know.
- Hand Posture: a hand posture is a static hand pose and its current location without any movements involved.
- Hand Gesture: a hand gesture is a sequence of hand postures connected by continuous hand or finger movements over a short period of time.
Lenman et al. suggested that the design space for gestural commands can be characterized along three dimensions. The researcher used that design space for his research as follows,
- The intuition aspect: the selected gestures should be intuitive and comfortable for the user to learn and remember. The gestures should be straightforward, so that the user needs the least effort to learn them. Users should be able to use their natural hand configurations and should not be required to learn specific or complex hand configurations, which easily cause fatigue and make the user uncomfortable.
- The articulatory aspect: the selected gestures should be easy to recognize and should not cause confusion for the user. Gestures involving complicated hand poses and finger movements should be avoided because they are difficult to articulate and repeat.
- The technology aspect: in order to be viable, the selected gestures must take into account the properties of the employed algorithms and techniques. The required data and information should be extractable from the selected gesture commands without excessive computation cost for the employed approach.
The above approach is important when I select suitable hand gestures for my research. After that, the researcher reviews and summarizes some vision-based hand tracking and gesture recognition systems proposed by other researchers (Table 2.1 of the thesis). Those approaches can be categorized as appearance-based and 3D hand model based approaches. Some appearance-based algorithms use statistical methods, while others use syntactic methods. The researcher chose an appearance-based approach with a hybrid method (both statistical and syntactic). He lists a set of popular features and algorithms used to detect human hands and recognize gestures in the appearance-based approach:
- Colors and Shapes
- Hand Features
- Optical Flow
- Mean Shift
- SIFT Features
- Stereo Image
- Viola-Jones Algorithm
I think that the gesture recognition part is less important, since real-time hand recognition sensors like the Leap Motion are available now. In the summary of the second chapter, the researcher compares appearance-based vs. 3D hand model based approaches and statistical vs. syntactic approaches. He says that it is easier for appearance-based approaches to achieve real-time performance due to their comparatively simpler 2D image features. Some of the drawbacks and limitations of the 3D hand model based approaches can be listed as follows:
- Because the 3D hand model is a complex articulated deformable object with many degrees of freedom, a very large image database is required to cover all the characteristic hand images under different views.
- Lack of the capability to deal with singularities that arise from ambiguous views.
- Most current 3D hand model based approaches focus on real-time tracking for global hand motions and local finger motions with restricted lighting and background conditions.
- Scalability problem, where a 3D hand model with specific kinematic parameters cannot deal with a wide variety of hand sizes from different people.
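For the syntactic approach, the primitives used to describe a pattern should meet the following requirements:
- The number of primitive types should be small.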
- The primitives selected must be able to form an appropriate object representation.
- Primitives should be easily segmentable from the image.
- Primitives should be easily recognizable using some statistical pattern recognition method.
- Primitives should correspond with significant natural elements of the object structure being described.
G = [Vt, Vn, P, S]
In this model,
- Vt is the set of terminals
- Vn is the set of non-terminals
- P is a finite set of production rules
- S is the start symbol
- Picture Description Language (PDL ) proposed by Shaw
- The grammar defined by Hand et al.
- Tree like approach suggested by Jones et al.
- Etc…
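The four-part grammar definition G = [Vt, Vn, P, S] given earlier can be written down as a small data structure. This is only a sketch to fix the idea for myself: the class name Grammar, the method problems, and the toy "GRASP" rule are hypothetical and do not come from the thesis.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Grammar:
    terminals: frozenset     # Vt
    nonterminals: frozenset  # Vn
    productions: tuple       # P, as (lhs, rhs) pairs where rhs is a tuple of symbols
    start: str               # S

    def problems(self):
        """Return a list of consistency problems (empty if well formed)."""
        found = []
        if self.terminals & self.nonterminals:
            found.append("Vt and Vn overlap")
        if self.start not in self.nonterminals:
            found.append("S is not in Vn")
        symbols = self.terminals | self.nonterminals
        for lhs, rhs in self.productions:
            if lhs not in self.nonterminals:
                found.append(f"left-hand side {lhs!r} is not in Vn")
            for sym in rhs:
                if sym not in symbols:
                    found.append(f"unknown symbol {sym!r} in rule for {lhs!r}")
        return found


# Toy grammar: one gesture made of two postures (made-up example).
toy = Grammar(
    terminals=frozenset({"palm", "fist"}),
    nonterminals=frozenset({"GRASP"}),
    productions=(("GRASP", ("palm", "fist")),),
    start="GRASP",
)
```

A stochastic grammar would additionally attach a probability to each production.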
The third chapter of the thesis describes the two-level architecture of the suggested design. In the first part of the chapter, he explains how the postures and gestures are selected. He uses a taxonomy proposed by Quek to understand hand gestures of different classes. The following figure explains it.
- Unintentional movements: hand motions that are not intended to communicate information.
- Manipulative gestures: gestures used to act on objects in an environment (such as picking up a box).
- Communicative gestures: gestures intended to communicate information.
Since the architecture doesn't interpret all hand gestures, a subset of hand postures needs to be selected, and the selected hand gestures are then built from those postures. The following table shows the selected hand postures and the hand gestures implemented using them.
Since this approach consists of two levels, the low level recognizes and tracks the user's hand postures, and the high level is then responsible for gesture recognition and motion analysis. I think that in my research, what I need to consider more is how to recognize gestures from the large set of hand postures continuously given by the Leap Motion sensor.
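One way the hand-off between the two levels could look is sketched below: the low level emits one posture label per frame, and the stream is collapsed into a short terminal string for the high level. This is my own assumption about the interface, not the thesis's method. The function name postures_to_terminals and the min_run threshold are hypothetical.

```python
from itertools import groupby


def postures_to_terminals(frames, min_run=3):
    """Collapse per-frame posture labels into a terminal string.

    Runs shorter than `min_run` frames are treated as detection noise and
    dropped; identical neighbours that remain are then merged.
    """
    stable = [label for label, run in groupby(frames)
              if len(list(run)) >= min_run]
    return [label for label, _ in groupby(stable)]


# A one-frame "fist" glitch in the middle of a "palm" run is ignored.
frames = ["palm"] * 5 + ["fist"] * 1 + ["palm"] * 4 + ["fist"] * 6
terminals = postures_to_terminals(frames)
```

The resulting list is what a grammar-based recognizer would receive as its input string. With a sensor that reports postures at a high frame rate, the threshold would need tuning.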
Chapter four of the thesis is about extracting and tracking the 3D hand posture from the video input in real time. The researcher uses Haar-like features and some other algorithms for this. It is not very important for my work, since the Leap Motion can detect it.
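For my own reference, a Haar-like feature is just a difference of pixel sums over adjacent rectangles, and an integral image makes each rectangle sum cost four lookups. The sketch below shows the generic technique in pure Python and is not the thesis's implementation. The function names are my own.

```python
def integral_image(img):
    """Build a summed-area table with one extra row and column of zeros.

    ii[y][x] holds the sum of img over rows < y and columns < x.
    """
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii


def rect_sum(ii, x, y, w, h):
    """Sum of the w-by-h rectangle whose top-left pixel is (x, y)."""
    return ii[y + h][x + w] - ii[y][x + w] - ii[y + h][x] + ii[y][x]


def two_rect_horizontal(ii, x, y, w, h):
    """Two-rectangle edge feature: left half minus right half (w must be even)."""
    half = w // 2
    return rect_sum(ii, x, y, half, h) - rect_sum(ii, x + half, y, half, h)


# Tiny image with a bright left half and a dark right half.
img = [[1, 1, 0, 0],
       [1, 1, 0, 0]]
ii = integral_image(img)
edge = two_rect_horizontal(ii, 0, 0, 4, 2)
```

In a boosted classifier, each weak learner would threshold one such feature value. That training step is omitted here.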
Chapter five describes high-level hand gesture recognition using a context-free grammar. The approach also uses a probability measure for better accuracy. Here an SCFG is used. SCFGs extend CFGs in the same way that hidden Markov models (HMMs) extend regular grammars, but SCFGs have more flexibility than HMMs.
Chapter six describes how the proposed system was evaluated after it was implemented inside a virtual 3D gaming environment. The game has the user drive a car to a destination using free-hand commands, with some traffic signs used as well.
Advantages and Disadvantages
Advantages
- Real-Time recognition
- Better accuracy
- Low hardware requirements (only a webcam is needed besides the PC)
- Very good robustness against different lighting conditions and a certain degree of robustness against image rotations.
Disadvantages
- Limited set of hand postures and gestures.
- To achieve robustness against cluttered backgrounds, background subtraction and noise removal need to be applied.
Suggested future work
- More diversified hand samples from different people can be used in the training process so that the classifiers will be more user-independent.
- Context-awareness for the gesture recognition system: the same gesture performed within different contexts and environments can have different semantic meanings. For example, with the background extracted from the video, if a computer is detected, we can say that a pointing gesture means turning on the computer in an office. However, if a stove is detected in the background, we can be fairly sure that the user is in a kitchen and the pointing gesture probably means turning on the stove.
- Track and recognize multiple objects such as human faces, eye gaze and hand gestures at the same time.





