Latest News:

July 16, 2010:
The AHRS PC board has now gone through its specification and design phase. We hope to be testing this new unit soon. The Extended Kalman Filter (EKF) has been written and just needs to be tested on hardware.

Sept 18, 2009:
The hardware PCB prototype is awaiting fabrication. We hope to have the board assembled and tested by the middle of next month. BatchPCB has a long turnaround time.

Aug 13, 2009:
Development of the Base Station Software is coming together quickly. Creating the most intuitive, and therefore easy to use, client software is our main priority, and it has also enabled us to create some innovative new control methods.

Aerial Image Recognition and Tracking

Patrick Glass - Johnson Chou - Aaron Sequeira

Electrical Engineering - The University of British Columbia


With the recent developments in object detection technologies, the next step is to apply these technologies to specific applications. As object detection becomes more accurate and faster in terms of processing time, it becomes possible to automate systems that are currently operated by humans. One important application area is unmanned aerial vehicles (UAVs), which could be used to track objects from an aerial perspective; this could prove useful for applications such as overhead news feeds. This paper investigates how such an algorithm could be implemented and examines its implications, such as processing time and accuracy.

 

Introduction

This report presents an investigation of object tracking and recognition for applications in mobile robotics. The objective of this project is to design a system that allows an unmanned aerial vehicle (UAV) to follow a ground-based vehicle from an aerial perspective. Many methods exist for tracking objects with a stationary camera or scene; however, many of them fail when the camera viewpoint, object of interest, or background changes constantly. With the increased demand for military reconnaissance and for sensor data collection in hazardous environments, UAV systems are being used in a growing number of military and research operations [1]. A reliable, human-independent method for tracking a vehicle can reduce or eliminate the need for a remote pilot to follow the vehicle manually, which is a laborious and unnecessary task. The remote operator could then devote more attention to analyzing the video feeds. Our project uses a scale- and rotation-invariant feature matching algorithm to locate and track a vehicle by recognizing it in subsequent frames of the video feed. To reduce development and implementation time, we implemented our design in MATLAB and tested it on a prerecorded aerial video feed of a vehicle. This report covers the general project design and the algorithms used, the methodology and procedures followed to develop the vehicle tracking system, and the results and conclusions of our experiments.

 

Methodology

This section of the report is divided into four main subsections: the overall design architecture, feature descriptor extraction, feature matching, and the object recognition stage.

 

Design Architecture

To meet the project requirement of tracking a moving object with a changing camera viewpoint, we determined, by process of elimination, that many of the existing motion tracking systems would not be acceptable. Because the algorithm had to be versatile enough to track a variety of objects on its own, no object-specific tuning was allowed. The architecture and algorithm components were chosen so that the system could track and recognize various objects in various scenes with varying light and shadows.


In the first stage of our algorithm, the object to track is selected. This image serves as the base against which the next received frame is matched. The base image is reduced to its feature vectors using the scale invariant feature transform (SIFT), as shown in Fig. 1 [2]. When the next image frame is received, it is also reduced to its feature vectors. A feature matching algorithm then compares the two images to determine which feature vectors occur in both, which yields accurate coordinates for the matching points. These points are passed to the Position Estimation block, which calculates the relative frame movement and the new estimated position of the vehicle or object in the new frame. This allows a crosshair targeting the tracked object to be overlaid on the video feed. The center coordinate could then be fed to the UAV guidance system for autonomous tracking; UAV integration, however, is not covered in this report.

Fig. 1. Overall Architecture Block Diagram
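The report does not spell out how the Position Estimation block combines the matched coordinates. One simple possibility, assumed here purely for illustration, is to shift the previous center by the average displacement of the matched points. The function name `estimate_center` and the input layout are hypothetical, and the sketch is in Python rather than the MATLAB used in the project.

```python
from statistics import mean


def estimate_center(prev_center, matches):
    """Shift the previous object center by the mean displacement of matches.

    matches: list of ((x_prev, y_prev), (x_new, y_new)) coordinate pairs.
    This averaging rule is an assumption, not the report's stated method.
    """
    if not matches:
        return prev_center  # nothing matched: keep the last known position
    dx = mean(new[0] - old[0] for old, new in matches)
    dy = mean(new[1] - old[1] for old, new in matches)
    return (prev_center[0] + dx, prev_center[1] + dy)
```

A plain average is sensitive to outliers: a single bad match pulls the estimated center away from the object.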


Some variations on the above algorithm were introduced, primarily to increase performance so that an object could be tracked at 10 frames per second. The main modification was a region of interest (ROI) filter, which restricts the SIFT operation to a subset of the comparison image. Because the cropped image is smaller, the SIFT operation completes faster, as can be seen in Table 1.

Table 1. Region of Interest SIFT Speedup of 14 Frames

SIFT Before ROI      SIFT After 1/3 ROI Crop
1.397 seconds        0.454 seconds

307.7% Speedup
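A minimal sketch of how such an ROI crop window could be computed, together with the speedup arithmetic from Table 1. The function `roi_bounds`, its parameters, and the choice to crop to one third of each dimension are assumptions; the report does not state whether "1/3" refers to each dimension or to the image area.

```python
def roi_bounds(center, frame_size, divisor=3):
    """Return (left, top, right, bottom) of a crop centered on `center`.

    The crop spans 1/divisor of the frame in each dimension (an assumed
    reading of the "1/3 ROI crop") and is shifted to stay inside the frame.
    """
    width, height = frame_size
    roi_w, roi_h = width // divisor, height // divisor
    left = min(max(int(center[0]) - roi_w // 2, 0), width - roi_w)
    top = min(max(int(center[1]) - roi_h // 2, 0), height - roi_h)
    return left, top, left + roi_w, top + roi_h


# Ratio of the two timings reported in Table 1.
speedup = 1.397 / 0.454
```

The ratio of the two timings is about 3.08, which is where the 307.7% figure comes from: the cropped SIFT pass takes roughly one third of the original time.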

 

Feature Descriptor Extraction

The most important part of our design is the choice of feature extraction method. Feature extraction is the process of creating "feature vectors" that represent the original input stream in a smaller, less redundant format. Our application required features that are scale and rotation invariant, which eliminates the need for advanced post-processing of the returned features. The scale invariant feature transform (SIFT) was selected because of its high match percentage under large changes in scale and rotation, as can be seen in Fig. 2 [3]. The base image is an image of a standard stop sign. This reference image is then compared to a scene containing different signs, all with the same basic shape and style as a stop sign. SIFT is able to correctly match the reference points in both images.

Fig. 2. Stop Sign Feature Descriptor Matching Example


SIFT is superior to Harris Corner Detection in our setting because it tolerates extreme rotation and scale differences between the reference and comparison images, as shown in Fig. 3 below [3]. The slower SIFT computation time is therefore well worth it when orientation and scale cannot be controlled, as is the case with aerial video footage.

Fig. 3. Stop Sign Partial Occlusion Descriptor Matching Example

 

Feature Matching

In order to recognize the object of interest, a feature matching algorithm is performed on the descriptors returned from SIFT. Fig. 4 displays two sample descriptors in histogram form, with 128 elements each, to be compared.


Fig. 4. (a) SIFT Descriptor A. (b) SIFT Descriptor B


There are many types of comparison algorithms, one of which is based on error calculations. An error calculation uses the difference between two descriptors to measure how closely they match. Among the error calculations tested, the Euclidean norm of the difference of the two descriptors (the square root of the sum of squared differences) provided the most accurate comparison. For two descriptors $a$ and $b$ with 128 elements each, the error is calculated as

$$E = \sqrt{\sum_{i=1}^{128} (a_i - b_i)^2}$$

The lower the error value, the better the match. To decrease the computation time, the square root in the error calculation was deemed unnecessary and was omitted. Because the square root is monotonic, the ranking of the matches is not affected. The new error calculation is then simply the sum of the squared differences:

$$E^2 = \sum_{i=1}^{128} (a_i - b_i)^2$$

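The two error measures can be sketched in a few lines of Python (the project itself used MATLAB; the function names here are illustrative). Descriptors are assumed to be equal-length sequences of numbers, 128 elements each for SIFT.

```python
import math


def squared_error(desc_a, desc_b):
    """Sum of squared element-wise differences between two descriptors."""
    if len(desc_a) != len(desc_b):
        raise ValueError("descriptors must have the same length")
    return sum((a - b) ** 2 for a, b in zip(desc_a, desc_b))


def euclidean_error(desc_a, desc_b):
    """Euclidean norm of the difference (square root of the squared error)."""
    return math.sqrt(squared_error(desc_a, desc_b))
```

Since the square root is a monotonically increasing function, sorting candidates by `squared_error` gives the same order as sorting by `euclidean_error`, which is why dropping it saves time without changing which candidate is closest.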
Features in two frames can now be matched by comparing the error between a descriptor from one frame and every descriptor in the other frame. Only the two lowest error values are used to determine the uniqueness of the match. If the lowest error value is smaller than the second lowest error value by a preset threshold, then a unique match has been found. A threshold of 2.2 was found to work best for our application. This comparison is repeated for all the descriptors. Once all the comparisons have been made, an array of the matches and their corresponding scores is available for further processing.
Depending on the quality of the images, error values ranged from 0 to 50000, with 0 being a perfect match. To further increase the accuracy of the matches, a maximum accepted error was implemented. Any match with a score above the maximum is rejected. Fig. 5 displays the matching of the same car from two different scenes.
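The uniqueness test and the maximum accepted error can be sketched as follows. Two points are assumptions rather than statements from the report: the threshold of 2.2 is interpreted as a ratio (the second lowest error must be at least 2.2 times the lowest), and the maximum accepted error is left as a caller-supplied parameter because its value is not given. The function names are illustrative.

```python
def squared_error(desc_a, desc_b):
    """Sum of squared element-wise differences between two descriptors."""
    return sum((a - b) ** 2 for a, b in zip(desc_a, desc_b))


def match_descriptors(descs_a, descs_b, max_error, ratio=2.2):
    """Return (index_a, index_b, error) for each unique, low-error match.

    ratio: assumed meaning of the report's threshold; the best error times
    `ratio` must not exceed the second-best error.
    max_error: matches with a larger error are rejected (value not given
    in the report, so the caller must choose it).
    """
    matches = []
    for i, desc in enumerate(descs_a):
        errors = sorted(
            (squared_error(desc, other), j) for j, other in enumerate(descs_b)
        )
        if len(errors) < 2:
            continue  # uniqueness cannot be judged with a single candidate
        (best, j), (second, _) = errors[0], errors[1]
        if best <= max_error and best * ratio <= second:
            matches.append((i, j, best))
    return matches
```

Only the two smallest errors matter for each descriptor, so a full sort is more work than strictly needed; it is used here to keep the sketch short.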

Fig. 5. Feature Matching on a Lamborghini Reventon

 

Object Recognition

Generally, at least three correct matches are required to confidently recognize an object [5]. To ensure that three matches were always present, an intelligent threshold algorithm was implemented. The algorithm adjusts the threshold according to the number of matches returned. In the event that feature matching was unable to find at least three matches even with a low threshold, as happened in some video feeds, frames were skipped until the minimum of three matches was found.
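The report does not describe how the threshold is adjusted, so the following is only a sketch of one way such a scheme could work: try a sequence of progressively lower thresholds and stop at the first one that yields at least three matches, skipping the frame if none does. The schedule `(2.2, 1.9, 1.6, 1.3)` and the callback interface `match_fn` are hypothetical.

```python
def match_with_adaptive_threshold(match_fn, thresholds=(2.2, 1.9, 1.6, 1.3),
                                  min_matches=3):
    """Relax the uniqueness threshold until enough matches are found.

    match_fn(threshold) must return the list of matches for that threshold.
    thresholds: hypothetical schedule, from strictest to most permissive.
    Returns (matches, threshold), or (None, None) to signal a skipped frame.
    """
    for threshold in thresholds:
        matches = match_fn(threshold)
        if len(matches) >= min_matches:
            return matches, threshold
    return None, None
```

Starting from the strictest threshold keeps the most distinctive matches whenever they are sufficient and only admits weaker ones when necessary.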
With at least three unique matches found, an object of interest has been identified and the transformation of the object can be calculated [5]. The transformation can be found by solving the following equation for all the matches:

$$\begin{bmatrix} u \\ v \end{bmatrix} = \begin{bmatrix} m_1 & m_2 \\ m_3 & m_4 \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix} + \begin{bmatrix} t_x \\ t_y \end{bmatrix}$$

where $[x \; y]^T$ is a point on one frame and $[u \; v]^T$ is the matched point on another frame, transformed by the matrix with entries $m_1, \dots, m_4$ and translated by $[t_x \; t_y]^T$ [5]. If the initial object boundary is known, a box can be drawn around the object of interest by applying the same transformation and translation to that boundary. Fig. 6 displays object recognition with the same transformation performed on the boundary box.

Fig. 6. (a) Stop sign with initial boundary known. (b) Matched stop sign
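A minimal least-squares sketch of recovering the transformation from the matches, assuming the affine model u = m1*x + m2*y + tx, v = m3*x + m4*y + ty. With exactly three non-collinear matches the solution is exact; with more, the fit minimizes the squared error. The function names are illustrative and the solver uses only the Python standard library.

```python
def solve3(a, b):
    """Solve the 3x3 linear system a @ x = b by Gauss-Jordan elimination."""
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("matched points are collinear")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(3):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [mr - factor * mc for mr, mc in zip(m[r], m[col])]
    return [m[i][3] / m[i][i] for i in range(3)]


def fit_affine(matches):
    """Fit the affine model to a list of ((x, y), (u, v)) matches.

    Returns ((m1, m2, m3, m4), (tx, ty)) via the normal equations.
    """
    if len(matches) < 3:
        raise ValueError("at least three matches are required")
    rows = [(x, y, 1.0) for (x, y), _ in matches]
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    atu = [sum(r[i] * uv[0] for r, (_, uv) in zip(rows, matches)) for i in range(3)]
    atv = [sum(r[i] * uv[1] for r, (_, uv) in zip(rows, matches)) for i in range(3)]
    m1, m2, tx = solve3(ata, atu)
    m3, m4, ty = solve3(ata, atv)
    return (m1, m2, m3, m4), (tx, ty)


def apply_affine(params, point):
    """Map a point (for example a boundary-box corner) into the new frame."""
    (m1, m2, m3, m4), (tx, ty) = params
    x, y = point
    return (m1 * x + m2 * y + tx, m3 * x + m4 * y + ty)
```

Applying `apply_affine` to each corner of the initial boundary gives the box in the new frame. Because every match contributes equally to the fit, a single outlier can distort the recovered transformation.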

 

Experimental Results and Discussion

Throughout the development of this object detection algorithm, we tried to continually improve our results. Initially, we worked on generating matches between two still images, rather than within a video feed, where the key features in each image were scaled and rotated. We discovered that by adjusting the threshold, a value that determines how many matches to accept, we could obtain a more accurate set of matches between the two images. Once we were familiar with the SIFT algorithm and how to use it on images, we moved on to applying the algorithm to video feeds.
When we first applied the completed algorithm to video feeds, we noticed that it worked in some cases and not in others. Our algorithm aims to lock onto the point of interest, and outliers make this much more difficult. If all of the matches are correct, the tracking works perfectly and the correct image is tracked until an outlier is chosen as a match. An example of a good match can be seen below.

Fig. 7. An example of a good match


When an outlier does become a match, our algorithm can become unstable, because the point of focus generally shifts farther and farther from where it should be with each consecutive match, beginning with the first match that contains the outlier. Fig. 8 shows what happens when an outlier creates a bad match. The crosshairs do not behave normally because the outlier prevents the algorithm from determining the proper corners for the box.

Fig. 8. An example of a bad match

Another reason our algorithm occasionally fails is that there are not enough proper matches. This may be caused by the desired tracking point changing significantly between frames. For example, if the image has been scaled significantly, or its orientation has changed greatly, the algorithm cannot be guaranteed to work. Such circumstances occur when, for example, the desired image has been rotated in the video feed or a shadow is cast over the image.

Table 2. Comparison of SIFT vs. Harris Corner Detection [3]

                    SIFT      Harris
Detected            38%       16%
Correct/detected    28%       23%
Correct             10.6%     3.4%

The above table shows how the SIFT algorithm compares against Harris corner detection. In the table, the “Detected” row is the percentage of points for which a characteristic scale is detected, the “Correct/detected” row is the percentage of detected points for which the correct scale is detected, and the “Correct” row is the ratio of correct matches to total matches.


In terms of frame rate, our algorithm has improved significantly since the start of development. Initially, the algorithm took nearly 4 seconds to process each frame. This is far too slow for any sort of real-time application. As we continued to work on this performance issue, we were able to increase the frame rate to 10 frames per second, or 0.1 seconds per frame. This is a substantial improvement over the original case and makes the algorithm usable for real-time processing.
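As a rough check of the size of this improvement, taking the initial processing time as about 4 seconds per frame (the text says "nearly 4") and the final rate as 10 frames per second:

$$\frac{1}{4\ \text{s/frame}} = 0.25\ \text{frames/s}, \qquad \frac{10\ \text{frames/s}}{0.25\ \text{frames/s}} = 40$$

so the final version runs roughly 40 times faster than the initial one.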

 

Conclusions and Future Work

We were successful in achieving the basic functionality of a feature matching and tracking algorithm, but we would have liked to develop a more accurate and consistent one. Had more time been available, a more sophisticated algorithm for filtering out outliers could have been developed, which would significantly increase the accuracy of the tracking and make it more consistently reliable. To further enhance the effectiveness of the algorithm, a Kalman filter could be used to better estimate the future position of the object in the video feed. This would significantly speed up the SIFT stage, because only the locations where the object could be in the next frame would need to be processed.
Lastly, the work done on this algorithm could be extended so that multiple objects could be tracked simultaneously. This would be ideal if there were several points of interest in a given video feed.


In conclusion, this project was a success, but there is still room for development. Such an algorithm would be highly useful in fields such as robotics and automated vehicle control.

 

References

[1]   “Unmanned Aerial Vehicles,” Wikipedia. [Online]. Available: http://en.wikipedia.org/wiki/Unmanned_aerial_vehicle. [Accessed: Jul. 12, 2009].
[2]   “Feature Extraction,” Wikipedia. [Online]. Available: http://en.wikipedia.org/wiki/Feature_extraction. [Accessed: Jul. 15, 2009].
[3]   K. Mikolajczyk and C. Schmid, “Indexing based on scale invariant interest points,” Proceedings of the 8th IEEE International Conference on Computer Vision, vol. 1, pp. 525-531, 2001.
[4]   “Difference of Gaussians,” Wikipedia. [Online]. Available: http://en.wikipedia.org/wiki/Difference_of_Gaussians. [Accessed: Jun. 6, 2009].
[5]   D. Lowe, “Object recognition from local scale-invariant features,” Proceedings of the 7th IEEE International Conference on Computer Vision, vol. 2, pp. 1150-1157, 1999.