Posted by Hemprasad Y. Badgujar on January 19, 2015
Object tracking using a Kalman filter (MATLAB)
In a previous article, I showed how face detection can be performed in MATLAB using OpenCV. In this article, I will combine that face detector with a Kalman filter to build a simple face tracker that can track a face in a video.
If you are unfamiliar with Kalman filters, I suggest you first read up on how alpha-beta filters work. They are a simplified version of the Kalman filter that is much easier to understand, but they still apply many of the core ideas of the Kalman filter.
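To give a taste of the idea, a one-dimensional alpha-beta filter fits in a few lines. The sketch below is in Python rather than MATLAB, and the gain values and measurements are made up purely for illustration:

```python
# Hypothetical 1-D alpha-beta filter sketch: a fixed gain alpha corrects the
# position estimate, and a fixed gain beta corrects the velocity estimate.

def alpha_beta_filter(measurements, alpha=0.85, beta=0.005, dt=1.0):
    """Return smoothed positions for a sequence of noisy 1-D measurements."""
    x, v = measurements[0], 0.0   # initial state: first measurement, zero velocity
    estimates = []
    for z in measurements:
        # predict step: assume constant velocity
        x_pred = x + v * dt
        # update step: correct the prediction by a fixed fraction of the residual
        residual = z - x_pred
        x = x_pred + alpha * residual
        v = v + (beta / dt) * residual
        estimates.append(x)
    return estimates

smoothed = alpha_beta_filter([0.0, 1.2, 1.9, 3.1, 4.0, 5.2])
```

The Kalman filter replaces the fixed gains alpha and beta with a gain that is recomputed at every step from the estimated uncertainties.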
Face tracking without a Kalman filter
The OpenCV-based face detector can be applied to every frame to detect the location of the face. Because it may detect multiple faces, we need a method to match a face detected in one frame to a face detected in the next frame; this is a combinatorial problem known as data association. The simplest method is the nearest-neighbour approach, and some other methods can be found in this survey paper on object tracking. However, to greatly simplify the problem, the tracker I have implemented is a single-face tracker and it assumes there is always a face in the frame. This means that every face that is detected can be assumed to be the same person's face. If more than one face is detected, only the first one is used. If no faces are detected, a detection error is assumed. The MATLAB code below will detect the face location in a sequence of images and output the bounding box coordinates to a CSV file.
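For reference, the nearest-neighbour approach mentioned above can be sketched in a few lines (in Python rather than MATLAB, with made-up box coordinates; the tracker below sidesteps this step entirely by assuming a single face):

```python
# Hypothetical nearest-neighbour data association sketch: match a face from
# the previous frame to the closest detection in the current frame.
import math

def centre(box):
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)

def nearest_neighbour(prev_box, candidates):
    """Return the candidate bounding box whose centre is closest to prev_box's."""
    px, py = centre(prev_box)
    return min(candidates,
               key=lambda b: math.hypot(centre(b)[0] - px, centre(b)[1] - py))

prev = (100, 100, 140, 140)                          # box tracked last frame
detections = [(300, 80, 340, 120), (104, 98, 144, 138)]  # this frame's detections
match = nearest_neighbour(prev, detections)          # picks the nearby box
```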
function detect_faces(imgDir, opencvPath, includePath, outputFilename)

% Load the required libraries
if libisloaded('highgui2410'), unloadlibrary highgui2410, end
if libisloaded('cv2410'), unloadlibrary cv2410, end
if libisloaded('cxcore2410'), unloadlibrary cxcore2410, end
loadlibrary(...
    fullfile(opencvPath, 'bin\cxcore2410.dll'), @proto_cxcore);
loadlibrary(...
    fullfile(opencvPath, 'bin\cv2410.dll'), ...
    fullfile(opencvPath, 'cv\include\cv.h'), ...
    'alias', 'cv2410', 'includepath', includePath);
loadlibrary(...
    fullfile(opencvPath, 'bin\highgui2410.dll'), ...
    fullfile(opencvPath, 'otherlibs\highgui\highgui.h'), ...
    'alias', 'highgui2410', 'includepath', includePath);

% Load the cascade
classifierFilename = 'C:/Program Files/OpenCV/data/haarcascades/haarcascade_frontalface_alt.xml';
cvCascade = calllib('cv2410', 'cvLoadHaarClassifierCascade', classifierFilename, ...
    libstruct('CvSize', struct('width', int16(2410), 'height', int16(2410))));

% Create memory storage
cvStorage = calllib('cxcore2410', 'cvCreateMemStorage', 0);

% Get the list of images
imageFiles = dir(imgDir);
detections = struct;

h = waitbar(0, 'Performing face detection...'); % progress bar

% Open the output CSV file
fid = fopen(outputFilename, 'w');
fprintf(fid, 'filename,x1,y1,x2,y2\n');

for i = 1:numel(imageFiles)
    if imageFiles(i).isdir; continue; end
    imageFile = fullfile(imgDir, imageFiles(i).name);

    % Load the input image
    cvImage = calllib('highgui2410', ...
        'cvLoadImage', imageFile, int16(1));
    if ~cvImage.Value.nSize
        error('Image could not be loaded');
    end

    % Perform face detection
    cvSeq = calllib('cv2410', 'cvHaarDetectObjects', cvImage, cvCascade, cvStorage, ...
        1.1, 2, 0, libstruct('CvSize', struct('width', int16(40), 'height', int16(40))));

    % Save the detected bounding box, if any (and if there's multiple
    % detections, just use the first one)
    detections(i).filename = imageFile;
    if cvSeq.Value.total >= 1
        cvRect = calllib('cxcore2410', ...
            'cvGetSeqElem', cvSeq, int16(1));
        fprintf(fid, '%s,%d,%d,%d,%d\n', imageFile, ...
            cvRect.Value.x, cvRect.Value.y, ...
            cvRect.Value.x + cvRect.Value.width, ...
            cvRect.Value.y + cvRect.Value.height);
    else
        fprintf(fid, '%s,%d,%d,%d,%d\n', imageFile, 0, 0, 0, 0);
    end

    % Release the image
    calllib('cxcore2410', 'cvReleaseImage', cvImage);

    waitbar(i / numel(imageFiles), h);
end

% Release resources
fclose(fid);
close(h);
calllib('cxcore2410', 'cvReleaseMemStorage', cvStorage);
calllib('cv2410', 'cvReleaseHaarClassifierCascade', cvCascade);

end
We can then run our face detector and generate an output file, faces.csv, like this:
imgDir = 'images';
opencvPath = 'C:\Program Files\OpenCV';
includePath = fullfile(opencvPath, 'cxcore\include');
detect_faces(imgDir, opencvPath, includePath, 'faces.csv');
In the video below, I have run this script on the FGnet Talking Face database (which is free to download) and displayed the bounding boxes overlaid on the image sequence. You can download a copy of the faces.csv file that was used to generate the video from here.
The bounding box roughly follows the face, but its trajectory is quite noisy and the video contains numerous frames where the bounding box disappears because the face was not detected. The Kalman filter can be used to smooth this trajectory and estimate the location of the bounding box when the face detector fails.
Kalman filtering: The gritty details
The Kalman filter is a recursive two-stage filter. At each iteration, it performs a predict step and an update step.
The predict step predicts the current location of the moving object based on previous observations. For instance, if an object is moving with constant acceleration, we can predict its current location, x̂_{k|k-1}, based on its previous location, x̂_{k-1|k-1}, using the equations of motion.

The update step takes the measurement of the object's current location (if available), z_k, and combines this with the predicted current location, x̂_{k|k-1}, to obtain an a posteriori estimated current location of the object, x̂_{k|k}.
The equations that govern the Kalman filter are given below (taken from the Wikipedia article):

- Predict stage:
  - Predicted (a priori) state: x̂_{k|k-1} = F x̂_{k-1|k-1} + B u_k
  - Predicted (a priori) estimate covariance: P_{k|k-1} = F P_{k-1|k-1} F' + Q
- Update stage:
  - Innovation or measurement residual: ỹ_k = z_k - H x̂_{k|k-1}
  - Innovation (or residual) covariance: S_k = H P_{k|k-1} H' + R
  - Optimal Kalman gain: K_k = P_{k|k-1} H' S_k^{-1}
  - Updated (a posteriori) state estimate: x̂_{k|k} = x̂_{k|k-1} + K_k ỹ_k
  - Updated (a posteriori) estimate covariance: P_{k|k} = (I - K_k H) P_{k|k-1}
They can be difficult to understand at first, so let's take a look at what each of these variables is used for:

- x̂_{k|k} is the current state vector, as estimated by the Kalman filter, at time k.
- z_k is the measurement vector taken at time k.
- P_{k|k} measures the estimated accuracy of x̂_{k|k} at time k.
- F describes how the system moves (ideally) from one state to the next, i.e. how one state vector is projected to the next, assuming no noise (e.g. no acceleration).
- H defines the mapping from the state vector, x̂_k, to the measurement vector, z_k.
- Q and R define the Gaussian process and measurement noise, respectively, and characterise the variance of the system.
- B and u_k are control-input parameters and are only used in systems that have an input; these can be ignored in the case of an object tracker.
Note that in a simple system, the current state x̂_k and the measurement z_k will contain the same set of state variables (x̂_k will just be a filtered version of z_k) and H will be an identity matrix, but many real-world systems will include latent variables that are not directly measured. For example, if we are tracking the location of a car, we may be able to directly measure its location from a GPS device and its velocity from the speedometer, but not its acceleration.
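To make the mapping concrete, here is that car example in code (Python, with made-up numbers): H simply selects the measured variables and drops the latent one.

```python
# Illustrative example: a car state with a latent acceleration component.
# H selects only the directly measured variables (position from GPS,
# velocity from the speedometer); acceleration is estimated, never measured.
import numpy as np

state = np.array([120.0, 15.0, 0.8])   # (position, velocity, acceleration)
H = np.array([[1.0, 0.0, 0.0],         # measure position
              [0.0, 1.0, 0.0]])        # measure velocity
z = H @ state                          # what the sensors would report
```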
In the predict stage, the state of the system and its error covariance are transitioned using the defined transition matrix F, and this can be implemented in MATLAB as:
function [x,P] = kalman_predict(x,P,F,Q)
    x = F*x;         % predicted state
    P = F*P*F' + Q;  % predicted estimate covariance
end
In the update stage, we first calculate the difference between our predicted and measured states, ỹ_k. We then calculate the Kalman gain matrix, K_k, which is used to weight between our predicted and measured states; it is adjusted based on the ratio of the prediction error P_{k|k-1} to the measurement noise R. Finally, the state vector and its error covariance are updated with the measured state. This can be implemented in MATLAB as:
function [x,P] = kalman_update(x,P,z,H,R)
    y = z - H*x;                   % measurement error/innovation
    S = H*P*H' + R;                % measurement error/innovation covariance
    K = P*H'*inv(S);               % optimal Kalman gain
    x = x + K*y;                   % updated state estimate
    P = (eye(size(x,1)) - K*H)*P;  % updated estimate covariance
end
Both stages only update two variables: x, the state variable, and P, the prediction error covariance variable.
The two stages of the filter correspond to the state-space model typically used to model linear dynamical systems. The first stage solves the process equation:

x_k = F x_{k-1} + B u_k + w_k

The process noise w_k is additive Gaussian white noise (AWGN) with zero mean and covariance defined by:

w_k ~ N(0, Q)

The second one is the measurement equation:

z_k = H x_k + v_k

The measurement noise v_k is also AWGN with zero mean and covariance defined by:

v_k ~ N(0, R)
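To see the two functions above working together, here is a NumPy transcription of them, applied to a synthetic one-dimensional constant-velocity track (the state holds position and velocity, only position is measured, and all noise levels and the random seed are made up for illustration):

```python
# NumPy transcription of kalman_predict/kalman_update, run on a synthetic
# 1-D track that moves 1 pixel per frame with noisy position measurements.
import numpy as np

def kalman_predict(x, P, F, Q):
    x = F @ x                 # predicted state
    P = F @ P @ F.T + Q       # predicted estimate covariance
    return x, P

def kalman_update(x, P, z, H, R):
    y = z - H @ x                         # innovation
    S = H @ P @ H.T + R                   # innovation covariance
    K = P @ H.T @ np.linalg.inv(S)        # optimal Kalman gain
    x = x + K @ y                         # updated state estimate
    P = (np.eye(len(x)) - K @ H) @ P      # updated estimate covariance
    return x, P

F = np.array([[1.0, 1.0], [0.0, 1.0]])   # constant-velocity transition
H = np.array([[1.0, 0.0]])               # we only measure position
Q = np.eye(2) * 1e-4                     # small process noise
R = np.array([[4.0]])                    # measurement variance
x = np.zeros(2)                          # initial state (position, velocity)
P = np.eye(2) * 1e4                      # large initial uncertainty

rng = np.random.default_rng(0)
truth = np.arange(50, dtype=float)             # true positions
meas = truth + rng.normal(0.0, 2.0, 50)        # noisy measurements

for z in meas:
    x, P = kalman_predict(x, P, F, Q)
    x, P = kalman_update(x, P, np.array([z]), H, R)
```

After 50 frames, the filtered state should sit close to the true position (49) and the estimated velocity close to the true 1 pixel/frame, despite the measurement noise.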
Defining the system
In order to implement a Kalman filter, we have to define several variables that model the system. We have to choose the variables contained by x_t and z_t, and also choose suitable values for F, H, Q and R, as well as an initial value for P.
We will define our measurement vector as:

z_t = ( x1, y1, x2, y2 )'

where (x1, y1) and (x2, y2) are the upper-left and lower-right corners of the bounding box around the detected face, respectively. This is simply the output from the Viola and Jones face detector.
A logical choice for our state vector is:

x_t = ( x1, y1, x2, y2, Δx, Δy )'

where Δx and Δy are the first-order derivatives (velocities) of the x and y coordinates. Other state vectors are also possible; for example, some papers introduce a "scale" variable, which assumes that the bounding box maintains a fixed aspect ratio.
The transition matrix F defines the equations used to transition from one state vector to the next vector (without taking into account any measurements, z_t). It is plugged into the process equation:

x_t = F x_{t-1} + w_t

Let's look at some basic equations describing motion at a constant velocity:

x_t = x_{t-1} + Δt · Δx_{t-1}
Δx_t = Δx_{t-1}

We could express this system, taking Δt = 1 (one frame), using the following recurrence:

x_t = x_{t-1} + Δx_{t-1}
Δx_t = Δx_{t-1}

These same equations can also be used to model the y variables and their derivatives. Referring back to the process equation, we can thus model this system as:

[ x1 ]       [ 1 0 0 0 1 0 ] [ x1 ]
[ y1 ]       [ 0 1 0 0 0 1 ] [ y1 ]
[ x2 ]   =   [ 0 0 1 0 1 0 ] [ x2 ]     + w_t
[ y2 ]       [ 0 0 0 1 0 1 ] [ y2 ]
[ Δx ]       [ 0 0 0 0 1 0 ] [ Δx ]
[ Δy ]_t     [ 0 0 0 0 0 1 ] [ Δy ]_{t-1}
The process noise matrix Q measures the variability of the input signal away from the "ideal" transitions defined in the transition matrix. Larger values in this matrix mean that the input signal has greater variance and the filter needs to be more adaptable. Smaller values result in a smoother output, but the filter is not as adaptable to large changes. This can be a little difficult to define, and may require some fine-tuning. Based on our definition of the transition matrix above, our process noise matrix is defined as:

Q = [ 1/4  0    0    0    1/2  0
      0    1/4  0    0    0    1/2
      0    0    1/4  0    1/2  0
      0    0    0    1/4  0    1/2
      1/2  0    1/2  0    1    0
      0    1/2  0    1/2  0    1   ] · σ²

where Δt = 1 and σ² = 10² = 100.
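As a sanity check on the structure of Q, each pairing of a coordinate with its derivative above is the standard piecewise-constant-acceleration block G G' σ². A quick NumPy check, with the assumed values Δt = 1 and σ² = 100:

```python
# Sanity check (assumed values): each position/velocity pair in Q uses the
# block G * G' * sigma^2, where G maps an acceleration impulse to (pos, vel).
import numpy as np

dt = 1.0
sigma2 = 100.0                         # process noise variance (1e2 in the script)
G = np.array([[dt**2 / 2.0], [dt]])
Q_block = G @ G.T * sigma2

# With dt = 1 this is [[1/4, 1/2], [1/2, 1]] * 100, matching the entries that
# pair each coordinate with its derivative in the 6x6 Q above.
expected = np.array([[0.25, 0.5], [0.5, 1.0]]) * 100.0
```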
The measurement matrix H maps between our measurement vector z_t and state vector x_t. It is plugged into the measurement equation:

z_t = H x_t + v_t

The variables x1, y1, x2 and y2 are mapped directly from x_t to z_t, whereas the derivative variables Δx and Δy are latent (hidden) variables and so are not directly measured and are not included in the mapping. This gives us the measurement matrix:

H = [ 1 0 0 0 0 0
      0 1 0 0 0 0
      0 0 1 0 0 0
      0 0 0 1 0 0 ]
The matrix R defines the error of the measuring device. For a physical instrument such as a speedometer or voltmeter, the measurement accuracy may be defined by the manufacturer. In the case of a face detector, we can determine the accuracy empirically. For instance, we may find that our Viola and Jones face detector detects faces to within 10 pixels of the actual face location 95% of the time. If we assume this error is Gaussian-distributed (which is a requirement of the Kalman filter), this gives us a standard deviation of 6.5 pixels for each of the coordinates, so the measurement noise vector is then given by:

v = ( 6.5, 6.5, 6.5, 6.5 )'

The errors are independent, so our covariance matrix is given by:

R = 6.5² · I = diag( 42.25, 42.25, 42.25, 42.25 )
Decreasing the values in R means we are optimistically assuming our measurements are more accurate, so the filter performs less smoothing and the predicted signal will follow the observed signal more closely. Conversely, increasing R means we have less confidence in the accuracy of the measurements, so more smoothing is performed.
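The effect of R on the gain is easiest to see in a one-dimensional example (made-up numbers): growing R shrinks the Kalman gain, so each measurement nudges the estimate less and the output is smoother.

```python
# Scalar illustration (made-up numbers): for a 1-D filter the Kalman gain is
# K = P / (P + R). Large R pushes K towards 0 (heavy smoothing), small R
# pushes K towards 1 (follow the measurements).
P = 10.0                      # current prediction error variance
K_small_R = P / (P + 1.0)     # small R: gain near 1, trust the measurements
K_large_R = P / (P + 100.0)   # large R: gain near 0, trust the model
```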
The estimate covariance matrix P is a measure of the estimated accuracy of x̂_t at time t. It is adjusted over time by the filter, so we only need to supply a reasonable initial value. If we know for certain the exact state variable at start-up, then we can initialise P to a matrix of all zeros. Otherwise, it should be initialised as a diagonal matrix with a large value along the diagonal:

P = ε · I

where ε = 10⁴. The filter will then prefer the information from the first few measurements over the information already in the model.
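Likewise, a one-dimensional example (using our R value and the assumed ε = 10⁴) shows why a large initial P lets the first measurements dominate:

```python
# Scalar illustration (made-up 1-D case): with a large initial P, the first
# Kalman gain K = P / (P + R) is close to 1, so the first measurement almost
# entirely replaces the initial guess; with P = 0, measurements are ignored.
P0, R = 1e4, 42.25            # assumed initial P and our measurement variance
K_first = P0 / (P0 + R)       # close to 1: trust the first measurement
K_zero = 0.0 / (0.0 + R)      # 0: the filter never leaves its initial state
```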
Implementing the face tracker
The following script implements the system we have defined above. It loads the face detection results from the CSV file, performs the Kalman filtering, and displays the estimated bounding boxes.
% read in the detected face locations
fid = fopen('faces.csv');
fgetl(fid); % ignore the header
detections = textscan(fid, '%[^,] %d %d %d %d', 'delimiter', ',');
fclose(fid);

% define the filter (state: x1, y1, x2, y2, dx, dy)
x = [ 0; 0; 0; 0; 0; 0 ];
F = [ 1 0 0 0 1 0 ; ...
      0 1 0 0 0 1 ; ...
      0 0 1 0 1 0 ; ...
      0 0 0 1 0 1 ; ...
      0 0 0 0 1 0 ; ...
      0 0 0 0 0 1 ];
Q = [ 1/4  0    0    0    1/2  0   ; ...
      0    1/4  0    0    0    1/2 ; ...
      0    0    1/4  0    1/2  0   ; ...
      0    0    0    1/4  0    1/2 ; ...
      1/2  0    1/2  0    1    0   ; ...
      0    1/2  0    1/2  0    1   ] * 1e2;
H = [ 1 0 0 0 0 0 ; ...
      0 1 0 0 0 0 ; ...
      0 0 1 0 0 0 ; ...
      0 0 0 1 0 0 ];
R = eye(4) * 42.25;
P = eye(6) * 1e4;

nsamps = numel(detections{1});
for n = 1:nsamps

    % read the next detected face location (CSV columns: x1, y1, x2, y2)
    meas_x1 = detections{2}(n);
    meas_y1 = detections{3}(n);
    meas_x2 = detections{4}(n);
    meas_y2 = detections{5}(n);
    z = double([meas_x1; meas_y1; meas_x2; meas_y2]);

    % step 1: predict
    [x,P] = kalman_predict(x,P,F,Q);

    % step 2: update (if measurement exists)
    if all(z > 0)
        [x,P] = kalman_update(x,P,z,H,R);
    end

    % draw a bounding box around the estimated face location
    img = imread(detections{1}{n});
    imshow(img);
    est_z = H*x;
    est_x1 = est_z(1);
    est_y1 = est_z(2);
    est_x2 = est_z(3);
    est_y2 = est_z(4);
    if all(est_z > 0) && est_x2 > est_x1 && est_y2 > est_y1
        rectangle('Position', [est_x1 est_y1 est_x2-est_x1 est_y2-est_y1], ...
            'EdgeColor', 'g', 'LineWidth', 3);
    end
    drawnow;

end
The results of running this script are shown in the following video:
Clearly we can see that this video has a much smoother and more accurate bounding box around the face than the unfiltered version shown previously, and the video no longer has frames with missing detections.
Closing remarks
In the future, I aim to write an article on the extended Kalman filter (EKF) and unscented Kalman filter (UKF) (and the similar particle filter). These are both nonlinear versions of the Kalman filter. Although face trackers are usually implemented using the linear Kalman filter, the nonlinear versions have some other interesting applications in image and signal processing.