Posted by Hemprasad Y. Badgujar on January 19, 2015


Object tracking using a Kalman filter (MATLAB)

The Kalman filter is useful for tracking many different types of moving objects. It was originally developed by Rudolf Kálmán, and one of its first famous applications was at NASA, estimating the trajectories of spacecraft. At its heart, the Kalman filter is a method of combining noisy (and possibly missing) measurements with predictions of the state of an object to achieve an estimate of its true current state. Kalman filters can be applied to many different types of linear dynamical systems, and the “state” here can refer to any measurable quantity, such as an object’s location, velocity, temperature, voltage, or a combination of these.

In a previous article, I showed how face detection can be performed in MATLAB using OpenCV. In this article, I will combine this face detector with a Kalman filter to build a simple face tracker that can track a face in a video.

If you are unfamiliar with Kalman filters, I suggest you first read up on how alpha-beta filters work. The alpha-beta filter is a simplified version of the Kalman filter that is much easier to understand, but it still applies many of the Kalman filter’s core ideas.
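To give a flavour of the idea, here is a minimal sketch of a single alpha-beta filter step in MATLAB (illustrative only: alpha and beta are hand-picked smoothing gains, not computed by the filter):

function [x, dx] = alpha_beta_step(x, dx, z, dt, alpha, beta)
    % predict the new position from the current velocity estimate
    x = x + dt*dx;
    % residual between the measurement and the prediction
    r = z - x;
    % correct the position and velocity by fixed fractions of the residual
    x = x + alpha*r;
    dx = dx + (beta/dt)*r;
end

The Kalman filter effectively replaces these fixed gains with a gain matrix that is recomputed at every step from the modelled process and measurement noise.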

Face tracking without a Kalman filter

The OpenCV-based face detector can be applied to every frame to detect the location of the face. Because it may detect multiple faces, we need a method of relating a detected face in one frame to a face in the next frame; this is a combinatorial problem known as data association. The simplest method is the nearest-neighbour approach (a short sketch follows this paragraph), and other methods can be found in this survey paper on object tracking. However, to greatly simplify the problem, the tracker I have implemented is a single face tracker and it assumes there is always a face in the frame. This means that every face that is detected can be assumed to be the same person’s face. If more than one face is detected, only the first face is used. If no faces are detected, a detection error is assumed. The MATLAB code that follows the sketch will detect the face location in a sequence of images and output the bounding box coordinates to a CSV file.
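As an aside, here is a minimal sketch of nearest-neighbour association between the detections of two consecutive frames (the boxes here are hypothetical, purely for illustration):

% hypothetical detections in two consecutive frames, one [x1 y1 x2 y2] row per box
prevBoxes = [100 50 180 150; 300 60 360 140];
currBoxes = [104 52 183 152; 298 61 359 142];
% centroids of the previous and current boxes
prevC = [(prevBoxes(:,1)+prevBoxes(:,3))/2, (prevBoxes(:,2)+prevBoxes(:,4))/2];
currC = [(currBoxes(:,1)+currBoxes(:,3))/2, (currBoxes(:,2)+currBoxes(:,4))/2];
match = zeros(size(prevC,1), 1);
for i = 1:size(prevC,1)
    % squared distance from previous box i to every current detection
    d = sum(bsxfun(@minus, currC, prevC(i,:)).^2, 2);
    [~, match(i)] = min(d); % index of the nearest current detection
end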

function detect_faces(imgDir, opencvPath, includePath, outputFilename)
    % Load the required libraries
    if libisloaded('highgui2410'), unloadlibrary highgui2410, end
    if libisloaded('cv2410'), unloadlibrary cv2410, end
    if libisloaded('cxcore2410'), unloadlibrary cxcore2410, end
    loadlibrary(...
        fullfile(opencvPath, 'bin\cxcore2410.dll'), @proto_cxcore);
    loadlibrary(...
        fullfile(opencvPath, 'bin\cv2410.dll'), ...
        fullfile(opencvPath, 'cv\include\cv.h'), ...
            'alias', 'cv2410', 'includepath', includePath);
    loadlibrary(...
        fullfile(opencvPath, 'bin\highgui2410.dll'), ...
        fullfile(opencvPath, 'otherlibs\highgui\highgui.h'), ...
            'alias', 'highgui2410', 'includepath', includePath);
    % Load the cascade
    classifierFilename = 'C:/Program Files/OpenCV/data/haarcascades/haarcascade_frontalface_alt.xml';
    % note: the size argument is ignored when loading a cascade from an XML
    % file; 20x20 is the native window size of this cascade
    cvCascade = calllib('cv2410', 'cvLoadHaarClassifierCascade', ...
        classifierFilename, ...
        libstruct('CvSize', struct('width', int16(20), 'height', int16(20))));
    % Create memory storage
    cvStorage = calllib('cxcore2410', 'cvCreateMemStorage', 0);
    % Get the list of images
    imageFiles = dir(imgDir);
    detections = struct;
    h = waitbar(0, 'Performing face detection...'); % progress bar
    % Open the output CSV file
    fid = fopen(outputFilename, 'w');
    fprintf(fid, 'filename,x1,y1,x2,y2\n');
    for i = 1:numel(imageFiles)
        if imageFiles(i).isdir; continue; end
        imageFile = fullfile(imgDir, imageFiles(i).name);
        % Load the input image
        cvImage = calllib('highgui2410', ...
            'cvLoadImage', imageFile, int16(1));
        if ~cvImage.Value.nSize
            error('Image could not be loaded');
        end
        % Perform face detection
        cvSeq = calllib('cv2410', 'cvHaarDetectObjects', ...
            cvImage, cvCascade, cvStorage, 1.1, 2, 0, ...
            libstruct('CvSize', struct('width', int16(40), 'height', int16(40))));
        % Save the detected bounding box, if any (if there are multiple
        % detections, just use the first one)
        detections(i).filename = imageFile;
        if cvSeq.Value.total >= 1
            cvRect = calllib('cxcore2410', ...
                'cvGetSeqElem', cvSeq, int16(0)); % the sequence is zero-indexed
            fprintf(fid, '%s,%d,%d,%d,%d\n', imageFile, ...
                cvRect.Value.x, cvRect.Value.y, ...
                cvRect.Value.x + cvRect.Value.width, ...
                cvRect.Value.y + cvRect.Value.height);
        else
            % no detection: write a zero box so every frame still has a row
            fprintf(fid, '%s,%d,%d,%d,%d\n', imageFile, 0, 0, 0, 0);
        end
        % Release image
        calllib('cxcore2410', 'cvReleaseImage', cvImage);
        waitbar(i / numel(imageFiles), h);
    end
    % Release resources
    fclose(fid);
    close(h);
    calllib('cxcore2410', 'cvReleaseMemStorage', cvStorage);
    calllib('cv2410', 'cvReleaseHaarClassifierCascade', cvCascade);
end

We can then run our face detector and generate an output file, faces.csv, like this:

imgDir = 'images';
opencvPath = 'C:\Program Files\OpenCV';
includePath = fullfile(opencvPath, 'cxcore\include');
detect_faces(imgDir, opencvPath, includePath, 'faces.csv');

In the video below, I have run this script on the FGnet Talking Face database (which is free to download) and displayed the bounding boxes overlaid on the image sequence. You can download a copy of the faces.csv file that was used to generate the video from here.

The bounding box roughly follows the face, but its trajectory is quite noisy and the video contains numerous frames where the bounding box disappears because the face was not detected. The Kalman filter can be used to smooth this trajectory and estimate the location of the bounding box when the face detector fails.

Kalman filtering: The gritty details

The Kalman filter is a recursive two-stage filter. At each iteration, it performs a predict step and an update step.

The predict step predicts the current location of the moving object based on previous observations. For instance, if an object is moving with constant acceleration, we can predict its current location, \hat{\textbf{x}}_{t}, based on its previous location, \hat{\textbf{x}}_{t-1}, using the equations of motion.

The update step takes the measurement of the object’s current location (if available), \textbf{z}_{t}, and combines this with the predicted current location, \hat{\textbf{x}}_{t}, to obtain an a posteriori estimated current location of the object, \textbf{x}_{t}.

The equations that govern the Kalman filter are given below (taken from the Wikipedia article):

  1. Predict stage:
    1. Predicted (a priori) state: \hat{\textbf{x}}_{t|t-1} = \textbf{F}_{t}\hat{\textbf{x}}_{t-1|t-1} + \textbf{B}_{t} \textbf{u}_{t}
    2. Predicted (a priori) estimate covariance: \textbf{P}_{t|t-1} = \textbf{F}_{t} \textbf{P}_{t-1|t-1} \textbf{F}_{t}^{T}+ \textbf{Q}_{t}
  2. Update stage:
    1. Innovation or measurement residual: \tilde{\textbf{y}}_t = \textbf{z}_t - \textbf{H}_t\hat{\textbf{x}}_{t|t-1}
    2. Innovation (or residual) covariance: \textbf{S}_t = \textbf{H}_t \textbf{P}_{t|t-1} \textbf{H}_t^T + \textbf{R}_t
    3. Optimal Kalman gain: \textbf{K}_t = \textbf{P}_{t|t-1} \textbf{H}_t^T \textbf{S}_t^{-1}
    4. Updated (a posteriori) state estimate: \hat{\textbf{x}}_{t|t} = \hat{\textbf{x}}_{t|t-1} + \textbf{K}_t\tilde{\textbf{y}}_t
    5. Updated (a posteriori) estimate covariance: \textbf{P}_{t|t} = (I - \textbf{K}_t \textbf{H}_t) \textbf{P}_{t|t-1}

These equations can be difficult to understand at first, so let’s take a look at what each of these variables is used for:

  • {\mathbf{x}_{t}} is the current state vector, as estimated by the Kalman filter, at time {t}.
  • {\mathbf{z}_{t}} is the measurement vector taken at time {t}.
  • {\mathbf{P}_{t}} measures the estimated accuracy of {\mathbf{x}_{t}} at time {t}.
  • {\mathbf{F}} describes how the system moves (ideally) from one state to the next, i.e. how one state vector is projected to the next, assuming no noise (e.g. no acceleration).
  • {\mathbf{H}} defines the mapping from the state vector, {\mathbf{x}_{t}}, to the measurement vector, {\mathbf{z}_{t}}.
  • {\mathbf{Q}} and {\mathbf{R}} define the Gaussian process and measurement noise, respectively, and characterise the variance of the system.
  • {\mathbf{B}} and {\mathbf{u}} are control-input parameters that are only used in systems that have an input; they can be ignored in the case of an object tracker.

Note that in a simple system, the current state {\mathbf{x}_{t}} and the measurement {\mathbf{z}_{t}} will contain the same set of state variables ({\mathbf{x}_{t}} will just be a filtered version of {\mathbf{z}_{t}}) and {\mathbf{H}} will be an identity matrix, but many real-world systems will include latent variables that are not directly measured. For example, if we are tracking the location of a car, we may be able to directly measure its location from a GPS device and its velocity from the speedometer, but not its acceleration.
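To make the car example concrete, if the state were \mathbf{x}_{t}=\left[\begin{array}{ccc} p_{t} & v_{t} & a_{t}\end{array}\right]^{T} (position, velocity, acceleration) but only the position and velocity were measured, the measurement matrix would be:

\displaystyle \mathbf{H}=\left[\begin{array}{ccc} 1 & 0 & 0\\ 0 & 1 & 0\end{array}\right]

which simply drops the latent acceleration from the measurement vector.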

In the predict stage, the state of the system and its error covariance are transitioned using the defined transition matrix {\mathbf{F}}. This can be implemented in MATLAB as:

function [x,P] = kalman_predict(x,P,F,Q)
    x = F*x; %predicted state
    P = F*P*F' + Q; %predicted estimate covariance
end

In the update stage, we first calculate the difference between our predicted and measured states. We then calculate the Kalman gain matrix, {\mathbf{K}}, which weights between our predicted and measured states, and which is adjusted based on the ratio of the prediction error covariance {\mathbf{P}_{t}} to the innovation covariance {\mathbf{S}_{t}} (which includes the measurement noise).

Finally, the state vector and its error covariance are updated using the measurement. This can be implemented in MATLAB as:

function [x,P] = kalman_update(x,P,z,H,R)
    y = z - H*x; % measurement error/innovation
    S = H*P*H' + R; % measurement error/innovation covariance
    K = (P*H')/S; % optimal Kalman gain (right division avoids an explicit inverse)
    x = x + K*y; % updated state estimate
    P = (eye(size(x,1)) - K*H)*P; % updated estimate covariance
end

Both stages update only two variables: {\mathbf{x}_{t}}, the state estimate, and {\mathbf{P}_{t}}, its error covariance.
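To make the two stages concrete, here is a toy one-dimensional run using the two functions above (the model and noise values are arbitrary, purely for illustration):

% estimate a constant scalar state observed with noise
F = 1; Q = 1e-4; H = 1; R = 0.25; % illustrative model and noise values
x = 0; P = 1e3;                   % vague initial guess with large uncertainty
z = [1.1 0.9 1.2 1.0];            % noisy measurements of a true value of 1
for t = 1:numel(z)
    [x,P] = kalman_predict(x,P,F,Q);
    [x,P] = kalman_update(x,P,z(t),H,R);
end
disp([x P]) % x settles near 1 while P shrinks from its initial value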

The two stages of the filter correspond to the state-space model typically used to model linear dynamical systems. The first stage solves the process equation:

\displaystyle \mathbf{x}_{t+1}=\mathbf{F}\mathbf{x}_{t}+\mathbf{w}_{t}

The process noise {\mathbf{w}_{t}} is additive white Gaussian noise (AWGN) with zero mean and covariance defined by:

\displaystyle E\left[\mathbf{w}_{t}\mathbf{w}_{t}^{T}\right]=\mathbf{Q}

The second one is the measurement equation:

\displaystyle \mathbf{z}_{t}=\mathbf{H}\mathbf{x}_{t}+\mathbf{v}_{t}

The measurement noise {\mathbf{v}_{t}} is also AWGN with zero mean and covariance defined by:

\displaystyle E\left[\mathbf{v}_{t}\mathbf{v}_{t}^{T}\right]=\mathbf{R}
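Together, these two equations let us simulate the system. A minimal sketch of one simulated step for a toy constant-velocity model (the matrices and covariances here are assumed purely for illustration, and must be positive definite for chol):

% toy constant-velocity system: state = [position; velocity]
F = [1 1; 0 1]; H = [1 0];
Qs = diag([1e-4 1e-4]); Rs = 0.25;     % assumed noise covariances
x = [0; 1];                            % start at position 0 with unit velocity
x = F*x + chol(Qs,'lower')*randn(2,1); % process equation, w ~ N(0, Qs)
z = H*x + sqrt(Rs)*randn;              % measurement equation, v ~ N(0, Rs)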

Defining the system

In order to implement a Kalman filter, we have to define several variables that model the system. We have to choose the variables contained by {\mathbf{x}_{t}} and {\mathbf{z}_{t}}, and also choose suitable values for {\mathbf{F}}, {\mathbf{H}}, {\mathbf{Q}} and {\mathbf{R}}, as well as an initial value for {\mathbf{P}_{t}}.

We will define our measurement vector as:

\displaystyle \mathbf{z}_{t}=\left[\begin{array}{cccc} x_{1,t} & y_{1,t} & x_{2,t} & y_{2,t}\end{array}\right]^{T}

where \left(x_{1,t},\, y_{1,t}\right) and \left(x_{2,t},\, y_{2,t}\right) are the upper-left and lower-right corners of the bounding box around the detected face, respectively. This is simply the output from the Viola and Jones face detector.

A logical choice for our state vector is:

\displaystyle \mathbf{x}_{t}=\left[\begin{array}{cccccc} x_{1,t} & y_{1,t} & x_{2,t} & y_{2,t} & dx_{t} & dy_{t}\end{array}\right]^{T}

where {dx_{t}} and {dy_{t}} are the first-order derivatives (i.e. the velocities) of the box coordinates. Other state vectors are also possible; for example, some papers introduce a “scale” variable, which assumes that the bounding box maintains a fixed aspect ratio.

The transition matrix {\mathbf{F}} defines the equations used to transition from one state vector{\mathbf{x}_{t}} to the next vector {\mathbf{x}_{t+1}} (without taking into account any measurements, {\mathbf{z}_{t}}). It is plugged in to the process equation:

\displaystyle \mathbf{x}_{t+1}=\mathbf{F}\mathbf{x}_{t}+\mathbf{w}_{t}

Let’s look at some basic equations describing motion:

\displaystyle \begin{aligned}x & =x_{0}+dx_{0}\cdot\Delta T+\frac{1}{2}d^{2}x\cdot\Delta T^{2}\\ dx & =dx_{0}+d^{2}x\cdot\Delta T\end{aligned}

We could express this system using the following recurrence:

\displaystyle \begin{aligned}x_{t+1} & =x_{t}+dx_{t}\cdot\Delta T+\frac{1}{2}d^{2}x_{t}\cdot\Delta T^{2}\\ dx_{t+1} & =dx_{t}+d^{2}x_{t}\cdot\Delta T\end{aligned}

These same equations can also be used to model the {y_{t}} variables and their derivatives. Referring back to the process equation, we can thus model this system as:

\displaystyle \left[\begin{array}{c} x_{1,t+1}\\ y_{1,t+1}\\ x_{2,t+1}\\ y_{2,t+1}\\ dx_{t+1}\\ dy_{t+1}\end{array}\right]=\left[\begin{array}{cccccc} 1 & 0 & 0 & 0 & 1 & 0\\ 0 & 1 & 0 & 0 & 0 & 1\\ 0 & 0 & 1 & 0 & 1 & 0\\ 0 & 0 & 0 & 1 & 0 & 1\\ 0 & 0 & 0 & 0 & 1 & 0\\ 0 & 0 & 0 & 0 & 0 & 1\end{array}\right]\left[\begin{array}{c} x_{1,t}\\ y_{1,t}\\ x_{2,t}\\ y_{2,t}\\ dx_{t}\\ dy_{t}\end{array}\right]+\left[\begin{array}{c} d^{2}x_{t}/2\\ d^{2}y_{t}/2\\ d^{2}x_{t}/2\\ d^{2}y_{t}/2\\ d^{2}x_{t}\\ d^{2}y_{t}\end{array}\right]\times\Delta T

The process noise matrix {\mathbf{Q}} measures the variability of the input signal away from the “ideal” transitions defined in the transition matrix. Larger values in this matrix mean that the input signal has greater variance and the filter needs to be more adaptable. Smaller values result in a smoother output, but the filter is not as adaptable to large changes. This can be a little difficult to define, and may require some fine-tuning. Based on our definition of the process noise {\mathbf{w}_{t}} above, our process noise matrix is defined as:

\displaystyle \begin{aligned}\mathbf{Q} & =\left[\begin{array}{cccccc} \Delta T^{4}/4 & 0 & 0 & 0 & \Delta T^{3}/2 & 0\\ 0 & \Delta T^{4}/4 & 0 & 0 & 0 & \Delta T^{3}/2\\ 0 & 0 & \Delta T^{4}/4 & 0 & \Delta T^{3}/2 & 0\\ 0 & 0 & 0 & \Delta T^{4}/4 & 0 & \Delta T^{3}/2\\ \Delta T^{3}/2 & 0 & \Delta T^{3}/2 & 0 & \Delta T^{2} & 0\\ 0 & \Delta T^{3}/2 & 0 & \Delta T^{3}/2 & 0 & \Delta T^{2}\end{array}\right]\times a^{2}\\ & =\left[\begin{array}{cccccc} 1/4 & 0 & 0 & 0 & 1/2 & 0\\ 0 & 1/4 & 0 & 0 & 0 & 1/2\\ 0 & 0 & 1/4 & 0 & 1/2 & 0\\ 0 & 0 & 0 & 1/4 & 0 & 1/2\\ 1/2 & 0 & 1/2 & 0 & 1 & 0\\ 0 & 1/2 & 0 & 1/2 & 0 & 1\end{array}\right]\times10^{-2}\end{aligned}

where {\Delta T=1} and {a=d^{2}x_{t}=d^{2}y_{t}=0.1}.

The measurement matrix {\mathbf{H}} maps between our measurement vector {\mathbf{z}_{t}} and state vector {\mathbf{x}_{t}}. It is plugged in to the measurement equation:

\displaystyle \mathbf{z}_{t}=\mathbf{H}\mathbf{x}_{t}+\mathbf{v}_{t}

The four bounding-box coordinates are mapped directly from the state vector {\mathbf{x}_{t}} to the measurement vector {\mathbf{z}_{t}}, whereas the derivative variables are latent (hidden) variables: they are not directly measured, so they are not included in the mapping. This gives us the measurement matrix:

\displaystyle \mathbf{H}=\left[\begin{array}{cccccc} 1 & 0 & 0 & 0 & 0 & 0\\ 0 & 1 & 0 & 0 & 0 & 0\\ 0 & 0 & 1 & 0 & 0 & 0\\ 0 & 0 & 0 & 1 & 0 & 0\end{array}\right]

The matrix {\mathbf{R}} defines the error of the measuring device. For a physical instrument such as a speedometer or voltmeter, the measurement accuracy may be defined by the manufacturer. In the case of a face detector, we can determine the accuracy empirically. For instance, we may find that our Viola and Jones face detector detects faces to within 10 pixels of the actual face location 95% of the time. If we assume this error is Gaussian-distributed (the Kalman filter models all noise as Gaussian), we might model it with a standard deviation of 6.5 pixels for each of the coordinates, giving the vector of per-coordinate standard deviations:

\displaystyle \boldsymbol{\sigma}=\left[\begin{array}{cccc} 6.5 & 6.5 & 6.5 & 6.5\end{array}\right]^{T}

The errors are independent, so our covariance matrix is given by:

\displaystyle \mathbf{R}=\left[\begin{array}{cccc} 6.5^{2} & 0 & 0 & 0\\ 0 & 6.5^{2} & 0 & 0\\ 0 & 0 & 6.5^{2} & 0\\ 0 & 0 & 0 & 6.5^{2}\end{array}\right]=\left[\begin{array}{cccc} 1 & 0 & 0 & 0\\ 0 & 1 & 0 & 0\\ 0 & 0 & 1 & 0\\ 0 & 0 & 0 & 1\end{array}\right] \times 42.25

Decreasing the values in {\mathbf{R}} means we are optimistically assuming our measurements are more accurate, so the filter performs less smoothing and the predicted signal will follow the observed signal more closely. Conversely, increasing {\mathbf{R}} means we have less confidence in the accuracy of the measurements, so more smoothing is performed.
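If ground-truth annotations are available, the measurement noise can be estimated directly rather than guessed. A sketch using synthetic data (the boxes here are hypothetical, purely for illustration):

% hypothetical ground-truth boxes and noisy detections, [x1 y1 x2 y2] rows
gtBoxes  = repmat([100 50 180 150], 500, 1);
detBoxes = gtBoxes + 6.5*randn(500, 4); % synthetic detector errors
errs = detBoxes - gtBoxes;              % per-coordinate detection errors
R = diag(var(errs));                    % independent errors give a diagonal R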

The estimate covariance matrix {\mathbf{P}} is a measure of the estimated accuracy of {\mathbf{x}_{t}} at time {t}. It is adjusted over time by the filter, so we only need to supply a reasonable initial value. If we know for certain the exact state variable at start-up, then we can initialise {\mathbf{P}} to a matrix of all zeros. Otherwise, it should be initialised as a diagonal matrix with a large value along the diagonal:

\displaystyle \mathbf{P}=\left[\begin{array}{cccccc} 1 & 0 & 0 & 0 & 0 & 0\\ 0 & 1 & 0 & 0 & 0 & 0\\ 0 & 0 & 1 & 0 & 0 & 0\\ 0 & 0 & 0 & 1 & 0 & 0\\ 0 & 0 & 0 & 0 & 1 & 0\\ 0 & 0 & 0 & 0 & 0 & 1\end{array}\right]\times\epsilon

where {\epsilon\gg0}. The filter will then prefer the information from the first few measurements over the information already in the model.
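As a quick sanity check, with the H and R defined above and a large initial P, a single update almost exactly adopts the first measurement (z0 here is a hypothetical first detection):

H = [eye(4), zeros(4,2)];
R = eye(4) * 42.25;
x = zeros(6,1); P = eye(6) * 1e4;
z0 = double([100; 50; 180; 150]); % hypothetical first bounding box
[x,P] = kalman_update(x,P,z0,H,R);
disp(x(1:4)') % very close to z0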

Implementing the face tracker

The following script implements the system we have defined above. It loads the face detection results from the CSV file, runs the Kalman filter over them, and displays the filtered bounding boxes.

% read in the detected face locations
fid = fopen('faces.csv'); % the detections file written by detect_faces above
fgetl(fid); %ignore the header
detections = textscan(fid, '%[^,] %d %d %d %d', 'delimiter', ',');
fclose(fid);
% define the filter
x = [ 0; 0; 0; 0; 0; 0 ];
F = [ 1 0 0 0 1 0 ; ...
      0 1 0 0 0 1 ; ...
      0 0 1 0 1 0 ; ...
      0 0 0 1 0 1 ; ...
      0 0 0 0 1 0 ; ...
      0 0 0 0 0 1 ];
Q = [ 1/4  0   0   0  1/2  0  ; ...
       0  1/4  0   0   0  1/2 ; ...
       0   0  1/4  0  1/2  0  ; ...
       0   0   0  1/4  0  1/2 ; ...
      1/2  0  1/2  0   1   0  ; ...
       0  1/2  0  1/2  0   1  ] * 1e-2;
H = [ 1 0 0 0 0 0 ; ...
      0 1 0 0 0 0 ; ...
      0 0 1 0 0 0 ; ...
      0 0 0 1 0 0 ];
R = eye(4) * 42.25;
P = eye(6) * 1e4;
nsamps = numel(detections{1});
for n = 1:nsamps
    % read the next detected face location
    meas_x1 = detections{2}(n);
    meas_y1 = detections{3}(n);
    meas_x2 = detections{4}(n);
    meas_y2 = detections{5}(n);
    % order the measurement as [x1; y1; x2; y2] to match z and H defined above
    z = double([meas_x1; meas_y1; meas_x2; meas_y2]);
    % step 1: predict
    [x,P] = kalman_predict(x,P,F,Q);
    % step 2: update (if measurement exists)
    if all(z > 0)
        [x,P] = kalman_update(x,P,z,H,R);
    end
    % draw a bounding box around the detected face
    img = imread(detections{1}{n});
    imshow(img);
    est_z = H*x;
    est_x1 = est_z(1);
    est_y1 = est_z(2);
    est_x2 = est_z(3);
    est_y2 = est_z(4);
    if all(est_z > 0) && est_x2 > est_x1 && est_y2 > est_y1
        rectangle('Position', [est_x1 est_y1 est_x2-est_x1 est_y2-est_y1], 'EdgeColor', 'g', 'LineWidth', 3);
    end
    drawnow;
end

The results of running this script are shown in the following video:
The bounding box in this video follows the face much more smoothly and accurately than in the unfiltered version shown previously, and there are no longer any frames with missing detections.

Closing remarks

In the future, I aim to write an article on the extended Kalman filter (EKF) and the unscented Kalman filter (UKF), as well as the related particle filter. The EKF and UKF are non-linear extensions of the Kalman filter. Although face trackers are usually implemented using the linear Kalman filter, the non-linear versions have some other interesting applications in image and signal processing.
