Enginius/Machine Learning

# Practical usage of Representer theorem

Let's assume that you are given a set of input and output data, and your goal is to find an underlying function $g(\cdot)$ that maps the input space to the output space. Formally, you are given

$$D = \{ (x_1, y_1), \ \ldots, \ (x_n, y_n) \}, \quad x_i \in X, \ y_i \in Y$$

and you try to find

$$g: X \to Y.$$

Those familiar with the machine learning literature will recognize that this is what most machine learning problems deal with. In particular, when the output space $Y$ is the real line $\mathbb{R}$, we call it a regression problem, and that is the case this article will mainly deal with.

For better understanding, let's think about a simple regression problem where the training data are given as follows:

The seven black circles shown in the figure above are our training data, where the X and Y axes indicate the input and output spaces, respectively. Simply speaking, our goal is to find a curve interpolating the seven points. Perhaps the first method that comes to mind is a linear model: it is very simple and very efficient. As there is no free lunch, however, the result of linear regression is likely to be unsatisfactory. The blue line in the figure below is the result of linear regression, and clearly, we might want something better than this.

One natural extension is to use higher-order polynomial models, such as second-, third-, or fourth-order polynomials, which are shown in different colors (check the legend).

However, these results are not that satisfactory either. In this case, the representer theorem can give you a reasonable solution.

Let me first formalize the problem we wish to solve. The estimate $\hat{g}$ of the function $g(\cdot)$ is defined as the minimizer of a regularized least-squares objective:

$$\hat{g} = \arg \min_{g} \sum_{i=1}^{n}( y_i - g(x_i) )^2 + \gamma R(g)$$

where $R(g)$ is a regularizer on the function $g(\cdot)$, such as a Hilbert norm $\| g \|_H$.

Now, from the representer theorem, we know that the minimizing $g(x)$ can be expressed as follows:

$$g(x) = \sum_{j=1}^{n} \alpha_j k(x, x_j)$$

where $x_j$ is the $j$th training input, $n$ is the number of training data, $k(x, x')$ is a kernel function, and the $\alpha_j$ are the parameters we wish to optimize. In other words, using the representer theorem, finding an arbitrary function $g(\cdot)$ reduces to finding a parameter vector $\alpha$ whose size equals the number of training data.
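For concreteness, the kernel used in the MATLAB code at the end of this post is a squared exponential kernel; in the notation of the `kernel_se` function, with signal variance $g_2$ and inverse squared length-scale $A$, it is

$$k(x, x') = g_2 \exp\left( -A (x - x')^2 \right),$$

with a small jitter $w_2$ added to the diagonal of the resulting kernel matrix for numerical stability.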

Now, we can replace $g(x)$ with $\sum_{j=1}^n \alpha_j k(x, x_j)$ and rewrite the optimization with respect to the parameter $\alpha$:

$$\hat{\alpha} = \arg \min_{\alpha} \sum_{i=1}^n ( y_i - \sum_{j=1}^n \alpha_j k(x_i, x_j ) )^2 + \gamma R(g).$$

If we use the squared Hilbert norm as the regularizer, $R(g) = \| g \|_H^2$, this optimization can be converted to the following convex quadratic program:

$$\hat{\alpha} = \arg \min_{\alpha} \| K\alpha - Y \|_2^2 + \gamma \alpha^T K \alpha.$$
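To see where this matrix form comes from, let $K \in \mathbb{R}^{n \times n}$ with $K_{ij} = k(x_i, x_j)$ be the kernel (Gram) matrix and $Y = (y_1, \ldots, y_n)^T$. The data-fit term then stacks into $\| K\alpha - Y \|_2^2$, and the reproducing property of the kernel gives

$$\| g \|_H^2 = \left\langle \sum_{i=1}^n \alpha_i k(\cdot, x_i), \ \sum_{j=1}^n \alpha_j k(\cdot, x_j) \right\rangle_H = \sum_{i=1}^n \sum_{j=1}^n \alpha_i \alpha_j k(x_i, x_j) = \alpha^T K \alpha.$$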

In fact, this formulation has a closed-form solution:

$$\hat{\alpha} = (K + \gamma I )^{-1}Y.$$
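This follows by setting the gradient of the objective with respect to $\alpha$ to zero, using the symmetry of $K$:

$$\nabla_{\alpha} \left( \| K\alpha - Y \|_2^2 + \gamma \alpha^T K \alpha \right) = 2K(K\alpha - Y) + 2\gamma K\alpha = 2K\bigl( (K + \gamma I)\alpha - Y \bigr) = 0,$$

so whenever $K$ is invertible (the small diagonal jitter $w_2$ in the code below helps ensure this), $(K + \gamma I)\hat{\alpha} = Y$.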

I would like to emphasize that this analytic solution is equivalent to that of kernel ridge regression or Gaussian process regression. In other words, kernel ridge regression and GPR can be explained within the framework of the representer theorem. The red curve in the below figure is the regression result using the representer theorem.

The more impressive part is that it can be implemented in five lines of code (excluding the kernel function):

```matlab
K = kernel_se(X, X, hyp);
gamma = 1E-4;
alphahat = (K + gamma*eye(n)) \ Y;
ktest = kernel_se(Xtest, X, hyp);
Ytest_rep = ktest*alphahat;
```

## MATLAB code

```matlab
%%
clc
clear
close all

%% 1. Raw data
% Training data
X = [1 4 6 9 11 15 18]';
Y = [3 4 2.5 2 3 1.5 3.5]';
n = size(X, 1);
Xtest = linspace(0, 20, 100)';

% Plot training data
fig = figure(1); set(fig, 'Position', [200 200 700 400]);
plot(X, Y, 'ko', 'MarkerSize', 14, 'LineWidth', 2);
grid on; axis([0 20 0 5]);
hleg = legend('Training data');
set(hleg, 'FontSize', 15);
xlabel('Input space X', 'FontSize', 15);
ylabel('Output space Y', 'FontSize', 15);
set(fig, 'PaperPositionMode', 'auto');
print(fig, '-dpng', 'fig1.png');

%% 2. Linear regression
p = polyfit(X, Y, 1);
Ytest = polyval(p, Xtest);

% Plot
fig = figure(2); hold on; set(fig, 'Position', [200 200 700 400]);
plot(X, Y, 'ko', 'MarkerSize', 14, 'LineWidth', 2);
plot(Xtest, Ytest, 'b-', 'LineWidth', 2);
grid on; axis([0 20 0 5]);
hleg = legend('Training data', 'Linear regression');
set(hleg, 'FontSize', 15);
xlabel('Input space X', 'FontSize', 15);
ylabel('Output space Y', 'FontSize', 15);
set(fig, 'PaperPositionMode', 'auto');
print(fig, '-dpng', 'fig2.png');

%% 3. Polynomial regression (2nd, 3rd, 4th)
p2 = polyfit(X, Y, 2);
Ytest_p2 = polyval(p2, Xtest);
p3 = polyfit(X, Y, 3);
Ytest_p3 = polyval(p3, Xtest);
p4 = polyfit(X, Y, 4);
Ytest_p4 = polyval(p4, Xtest);

% Plot
fig = figure(3); hold on; set(fig, 'Position', [200 200 700 400]);
plot(X, Y, 'ko', 'MarkerSize', 14, 'LineWidth', 2);
plot(Xtest, Ytest, 'b-', 'LineWidth', 2);
plot(Xtest, Ytest_p2, 'c-', 'LineWidth', 2);
plot(Xtest, Ytest_p3, 'g-', 'LineWidth', 2);
plot(Xtest, Ytest_p4, 'm-', 'LineWidth', 2);
grid on; axis([0 20 0 5]);
hleg = legend('Training data', 'Linear regression' ...
    , '2nd-order fit', '3rd-order fit', '4th-order fit');
set(hleg, 'FontSize', 15);
xlabel('Input space X', 'FontSize', 15);
ylabel('Output space Y', 'FontSize', 15);
set(fig, 'PaperPositionMode', 'auto');
print(fig, '-dpng', 'fig3.png');

%% 4. Representer theorem
% Squared exponential kernel hyperparameters
hyp = struct('g2', 1, 'A', 1/(5)^2, 'w2', 1E-8, 'mzflag', 0);
K = kernel_se(X, X, hyp);
gamma = 1E-4;
alphahat = (K + gamma*eye(n)) \ Y;
ktest = kernel_se(Xtest, X, hyp);
Ytest_rep = ktest*alphahat;

% Plot
fig = figure(4); clf; hold on; set(fig, 'Position', [200 200 700 400]);
plot(X, Y, 'ko', 'MarkerSize', 14, 'LineWidth', 2);
plot(Xtest, Ytest, 'b-', 'LineWidth', 2);
plot(Xtest, Ytest_p2, 'c-', 'LineWidth', 2);
plot(Xtest, Ytest_p3, 'g-', 'LineWidth', 2);
plot(Xtest, Ytest_p4, 'm-', 'LineWidth', 2);
plot(Xtest, Ytest_rep, 'r-', 'LineWidth', 2);
grid on; axis([0 20 0 5]);
hleg = legend('Training data', 'Linear regression' ...
    , '2nd-order fit', '3rd-order fit', '4th-order fit' ...
    , 'Representer theorem');
set(hleg, 'FontSize', 15);
xlabel('Input space X', 'FontSize', 15);
ylabel('Output space Y', 'FontSize', 15);
set(fig, 'PaperPositionMode', 'auto');
print(fig, '-dpng', 'fig4.png');
```

The kernel function and its helpers:

```matlab
function K = kernel_se(X1, X2, hyp)
%
% Squared exponential kernel function
%
% sungjoon.choi@cpslab.snu.ac.kr
%
nr_X1 = size(X1, 1);
nr_X2 = size(X2, 1);
K = hyp.g2*exp(-sqdist(X1', X2', hyp.A));
if nr_X1 == nr_X2 && nr_X1 > 1
    K = K + hyp.w2*eye(nr_X1, nr_X1);
end

function Y = col_sum(X)
Y = sum(X, 1);

function m = sqdist(p, q, A)
% SQDIST      Squared Euclidean or Mahalanobis distance.
% SQDIST(p,q)   returns m(i,j) = (p(:,i) - q(:,j))'*(p(:,i) - q(:,j)).
% SQDIST(p,q,A) returns m(i,j) = (p(:,i) - q(:,j))'*A*(p(:,i) - q(:,j)).
% Written by Tom Minka
[~, pn] = size(p);
[~, qn] = size(q);
if pn == 0 || qn == 0
    m = zeros(pn, qn);
    return
end
if nargin == 2
    pmag = col_sum(p .* p);
    qmag = col_sum(q .* q);
    m = repmat(qmag, pn, 1) + repmat(pmag', 1, qn) - 2*p'*q;
    %m = ones(pn,1)*qmag + pmag'*ones(1,qn) - 2*p'*q;
else
    Ap = A*p;
    Aq = A*q;
    pmag = col_sum(p .* Ap);
    qmag = col_sum(q .* Aq);
    m = repmat(qmag, pn, 1) + repmat(pmag', 1, qn) - 2*p'*Aq;
end
```

#### Comments

• 2015.11.08 02:07

The equations felt familiar, and it turns out they coincide with GPR, haha.

• 2015.11.08 03:13

Yes, haha. I learned about this while studying GPR. Are you in PIL, by any chance?

• 2015.11.08 09:52

Yes, hello. I am Sangdoo Yun, a Ph.D. student. I couldn't sleep last night, and while browsing Facebook I followed the equations here; it was fun.

• 2015.11.08 09:58

Nice to meet you, haha. So you run a blog too. :)

• 2015.11.08 10:40

Yes, your blog seems like a model example, so I will try my best to benchmark(?) it, haha.