
Practical Usage of the Representer Theorem

Let's assume that you are given a set of input and output data, and your goal is to find an underlying function '$g(\cdot)$' that maps the input space to the output space. Formally, you are given

$$ D = \{ (x_1, y_1), \ \cdots, \ (x_n, y_n) \}, \qquad x_i \in X, \ \ y_i \in Y $$

and you try to find 

$$ g: X \to Y.$$

Those who are familiar with the machine learning literature will probably recognize that this is what most machine learning problems deal with. In particular, if the output space '$Y$' is the real line '$\mathbb{R}$', we call such a problem a regression problem, and this is what this article will mainly deal with.


For a better understanding, let's think about a simple regression problem where the training data are given as follows:

The seven black circles shown in the above figure are our training data, where the X and Y axes indicate the input and output spaces, respectively. Simply speaking, our goal is to find a curve interpolating the seven points. Perhaps the first method that comes to mind is a linear model. It is a very simple, very efficient method. As there is no free lunch, however, the result of linear regression is likely to be unsatisfactory. The blue line in the figure below is the result of a linear regression, and clearly, we might want something better than this.

One natural extension would be to use higher-order polynomial models, such as second-, third-, or fourth-order polynomials, which are shown with different colors (check the legend).


However, these results are not that satisfactory either. In this case, the representer theorem can give you a reasonable solution.


Let me first formalize the problem we wish to solve. The function '$g(\cdot)$' we wish to find is the solution of the following regularized least-squares problem:

$$ \hat{g} = \arg \min_{g} \sum_{i=1}^{n}( y_i - g(x_i) )^2 + \gamma R(g) $$

where '$R(g)$' is a regularizer on the function '$g(\cdot)$', such as a Hilbert space norm '$\| g \|_H$', and '$\gamma > 0$' controls the strength of the regularization.

Now, from the representer theorem, we know that the minimizer '$g(x)$' can be expressed as follows:

$$ g(x) = \sum_{j=1}^{n} \alpha_j k(x, x_j) $$

where '$x_j$' is the '$j$'th training input, '$n$' is the number of training data, '$k(x, x')$' is a kernel function, and '$\alpha_j$' are the parameters we wish to optimize. In other words, using the representer theorem, finding an arbitrary function '$g(\cdot)$' reduces to finding a parameter vector '$\alpha$' whose dimension equals the number of training data.
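Concretely, once '$\alpha$' has been found, evaluating '$g$' at arbitrary query points is just a matrix-vector product. A minimal sketch (here `Xq` and `my_kernel` are hypothetical placeholders; the actual kernel used in this article, `kernel_se`, is listed in the code below):

% Evaluate g(x) = sum_j alpha_j * k(x, x_j) at m query points Xq,
% given n training inputs X and the weight vector alpha (n x 1).
% my_kernel is a placeholder for any kernel function, e.g., kernel_se below.
Kq = my_kernel(Xq, X);   % m x n matrix with Kq(i,j) = k(Xq(i), X(j))
gq = Kq * alpha;         % m x 1 vector of predicted function values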


Now, we can replace '$g(x)$' with '$\sum_{j=1}^n \alpha_j k(x, x_j)$' and rewrite the optimization with respect to the parameter '$\alpha$':

$$ \hat{\alpha} = \arg \min_{\alpha} \sum_{i=1}^n ( y_i - \sum_{j=1}^n \alpha_j k(x_i, x_j ) )^2 + \gamma R(g). $$

If we use the squared Hilbert norm as the regularizer, '$R(g) = \| g \|_H^2 $', this optimization can be converted to the following convex quadratic program, where '$K$' is the '$n \times n$' kernel (Gram) matrix with '$K_{ij} = k(x_i, x_j)$' and '$Y = (y_1, \ \cdots, \ y_n)^T$':

$$ \hat{\alpha}  = \arg \min_{\alpha} \| K\alpha - Y \|_2^2 + \gamma \alpha^T K \alpha$$
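Here, the regularizer term '$\gamma \alpha^T K \alpha$' follows from the reproducing property of the kernel, '$\langle k(\cdot, x_i), k(\cdot, x_j) \rangle_H = k(x_i, x_j)$':

$$ \| g \|_H^2 = \Big\langle \sum_{i=1}^{n} \alpha_i k(\cdot, x_i), \ \sum_{j=1}^{n} \alpha_j k(\cdot, x_j) \Big\rangle_H = \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j k(x_i, x_j) = \alpha^T K \alpha. $$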

In fact, this formulation can be solved analytically in closed form:

$$ \hat{\alpha} = (K + \gamma I )^{-1}Y. $$
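As a quick sanity check, setting the gradient of the objective with respect to '$\alpha$' to zero gives

$$ 2K(K\alpha - Y) + 2\gamma K \alpha = 2K \big( (K + \gamma I)\alpha - Y \big) = 0, $$

which, assuming '$K$' is invertible, reduces to '$(K + \gamma I)\alpha = Y$' and hence to the closed-form solution above.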

I would like to emphasize that this analytic solution is equivalent to that of kernel ridge regression and to the posterior mean of Gaussian process regression (GPR). In other words, kernel ridge regression and GPR can be explained within the framework of the representer theorem. The red curve in the figure below is the regression result using the representer theorem.
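To make the connection explicit: the predictive mean of GPR with kernel '$k$' and noise variance '$\sigma_n^2$' at a test input '$x_*$' is

$$ \mu(x_*) = k(x_*, X)(K + \sigma_n^2 I)^{-1} Y, $$

which is exactly '$g(x_*) = \sum_{j=1}^{n} \hat{\alpha}_j k(x_*, x_j)$' with '$\gamma$' playing the role of the noise variance '$\sigma_n^2$'.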


An even more impressive part is that it can be implemented in five lines of code (excluding the kernel function):

K = kernel_se(X, X, hyp);            % n x n kernel (Gram) matrix
gamma = 1E-4;                        % regularization weight
alphahat = (K + gamma*eye(n)) \ Y;   % closed-form solution for alpha
ktest = kernel_se(Xtest, X, hyp);    % kernel between test and training inputs
Ytest_rep = ktest*alphahat;          % predictions: g(x) = sum_j alpha_j k(x, x_j)
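For reference, the kernel function `kernel_se` used above (and listed in full below) implements, for the one-dimensional inputs used here, a squared exponential kernel of the form

$$ k(x, x') = \sigma_f^2 \exp\big( -A (x - x')^2 \big), $$

where '$\sigma_f^2$' corresponds to `hyp.g2`, '$A$' to `hyp.A` (an inverse squared length-scale), and `hyp.w2` adds a small jitter to the diagonal when the square kernel matrix '$K(X, X)$' is computed.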


Full MATLAB Code

%%
clc
clear
close all

 

%% 1. Raw data
% Training data
X = [1 4 6 9 11 15 18]';
Y = [3 4 2.5 2 3 1.5 3.5]';
n = size(X, 1);
Xtest = linspace(0, 20, 100)';

% Plot training data
fig = figure(1); set(fig, 'Position', [200 200 700 400]);
plot(X, Y, 'ko', 'MarkerSize', 14, 'LineWidth', 2);
grid on; axis([0 20 0 5]);
hleg = legend('Training data');
set(hleg, 'FontSize', 15);
xlabel('Input space X', 'FontSize', 15);
ylabel('Output space Y', 'FontSize', 15);
set(fig, 'PaperPositionMode', 'auto');
print(fig, '-dpng', 'fig1.png');

 

%% 2. Linear regression
p = polyfit(X, Y, 1);
Ytest = polyval(p, Xtest);

% Plot
fig = figure(2); hold on; set(fig, 'Position', [200 200 700 400]);
plot(X, Y, 'ko', 'MarkerSize', 14, 'LineWidth', 2);
plot(Xtest, Ytest, 'b-', 'LineWidth', 2);
grid on; axis([0 20 0 5]);
hleg = legend('Training data', 'Linear regression');
set(hleg, 'FontSize', 15);
xlabel('Input space X', 'FontSize', 15);
ylabel('Output space Y', 'FontSize', 15);
set(fig, 'PaperPositionMode', 'auto');
print(fig, '-dpng', 'fig2.png');

 

%% 3. Polynomial regression (2nd, 3rd, 4th)
p2 = polyfit(X, Y, 2);
Ytest_p2 = polyval(p2, Xtest);
p3 = polyfit(X, Y, 3);
Ytest_p3 = polyval(p3, Xtest);
p4 = polyfit(X, Y, 4);
Ytest_p4 = polyval(p4, Xtest);

% Plot
fig = figure(3); hold on; set(fig, 'Position', [200 200 700 400]);
plot(X, Y, 'ko', 'MarkerSize', 14, 'LineWidth', 2);
plot(Xtest, Ytest, 'b-', 'LineWidth', 2);
plot(Xtest, Ytest_p2, 'c-', 'LineWidth', 2);
plot(Xtest, Ytest_p3, 'g-', 'LineWidth', 2);
plot(Xtest, Ytest_p4, 'm-', 'LineWidth', 2);
grid on; axis([0 20 0 5]);
hleg = legend('Training data', 'Linear regression' ...
    , '2nd-order fit', '3rd-order fit', '4th-order fit');
set(hleg, 'FontSize', 15);
xlabel('Input space X', 'FontSize', 15);
ylabel('Output space Y', 'FontSize', 15);
set(fig, 'PaperPositionMode', 'auto');
print(fig, '-dpng', 'fig3.png');

 

%% 4. Representer theorem
% Hyperparameters of the squared exponential kernel
hyp = struct('g2', 1, 'A', 1/(5)^2, 'w2', 1E-8, 'mzflag', 0);
K = kernel_se(X, X, hyp);               % n x n kernel (Gram) matrix
gamma = 1E-4;                           % regularization weight
alphahat = (K + gamma*eye(n)) \ Y;      % closed-form solution for alpha
ktest = kernel_se(Xtest, X, hyp);       % kernel between test and training inputs
Ytest_rep = ktest*alphahat;             % predictions on the test inputs

% Plot
fig = figure(4); clf; hold on; set(fig, 'Position', [200 200 700 400]);
plot(X, Y, 'ko', 'MarkerSize', 14, 'LineWidth', 2);
plot(Xtest, Ytest, 'b-', 'LineWidth', 2);
plot(Xtest, Ytest_p2, 'c-', 'LineWidth', 2);
plot(Xtest, Ytest_p3, 'g-', 'LineWidth', 2);
plot(Xtest, Ytest_p4, 'm-', 'LineWidth', 2);
plot(Xtest, Ytest_rep, 'r-', 'LineWidth', 2);
grid on; axis([0 20 0 5]);
hleg = legend('Training data', 'Linear regression' ...
    , '2nd-order fit', '3rd-order fit', '4th-order fit' ...
    , 'Representer theorem');
set(hleg, 'FontSize', 15);
xlabel('Input space X', 'FontSize', 15);
ylabel('Output space Y', 'FontSize', 15);
set(fig, 'PaperPositionMode', 'auto');
print(fig, '-dpng', 'fig4.png');


function K = kernel_se(X1, X2, hyp)
%
% Squared exponential kernel function
%   hyp.g2 : kernel magnitude (signal variance)
%   hyp.A  : inverse squared length-scale
%   hyp.w2 : small jitter added to the diagonal of a square kernel matrix
%
% sungjoon.choi@cpslab.snu.ac.kr
%
nr_X1 = size(X1, 1);
nr_X2 = size(X2, 1);

K = hyp.g2*exp(-sqdist(X1', X2', hyp.A));
if nr_X1 == nr_X2 && nr_X1 > 1
    % Add a small jitter to the diagonal for numerical stability
    K = K + hyp.w2*eye(nr_X1, nr_X1);
end

 

function Y = col_sum(X)
% Sum along columns (helper used by sqdist)
Y = sum(X, 1);

 

function m = sqdist(p, q, A)
% SQDIST        Squared Euclidean or Mahalanobis distance.
% SQDIST(p,q)   returns m(i,j) = (p(:,i) - q(:,j))'*(p(:,i) - q(:,j)).
% SQDIST(p,q,A) returns m(i,j) = (p(:,i) - q(:,j))'*A*(p(:,i) - q(:,j)).

% Written by Tom Minka

[~, pn] = size(p);
[~, qn] = size(q);

if pn == 0 || qn == 0
    m = zeros(pn,qn);
    return
end

if nargin == 2
    pmag = col_sum(p .* p);
    qmag = col_sum(q .* q);
    m = repmat(qmag, pn, 1) + repmat(pmag', 1, qn) - 2*p'*q;
    %m = ones(pn,1)*qmag + pmag'*ones(1,qn) - 2*p'*q;
else
    Ap = A*p;
    Aq = A*q;
    pmag = col_sum(p .* Ap);
    qmag = col_sum(q .* Aq);
    m = repmat(qmag, pn, 1) + repmat(pmag', 1, qn) - 2*p'*Aq;
end



