
Bayesian Optimization with Expected Improvement

Implementation of the following paper (link):

Snoek, Jasper, Hugo Larochelle, and Ryan P. Adams. "Practical Bayesian optimization of machine learning algorithms." Advances in neural information processing systems. 2012.


"Bayesian optimization typically works by assuming the unknown function was sampled from a Gaussian process and maintains a posterior distribution for this function as observations are made or, in our case, as the results of running learning algorithm experiments with different hyperparameters are observed. To pick the hyperparameters of the next experiment, one can optimize the expected improvement (EI) over the current best result or the Gaussian process upper confidence bound (UCB)." 



In this post, I use the expected improvement (EI) criterion with an ARD kernel function, which is identical to the setting of the paper. The ARD Matern 5/2 kernel looks like:

$$K_{M52}(x, x') = \theta_0 \left( 1 + \sqrt{5 r^2(x, x')} + \frac{5}{3} r^2(x, x') \right) \exp\left( -\sqrt{5 r^2(x, x')} \right)$$

where $r^2(x, x') = \sum_{d=1}^{D} (x_d - x'_d)^2 / \theta_d^2$ is a scaled squared distance measure with per-dimension length scales $\theta_d$.

In MATLAB, this kernel function can easily be implemented as

function K = kernel_ardmatern(X1, X2, bo)

nr_X1 = size(X1, 1);
nr_X2 = size(X2, 1);

% First, normalize inputs
norm_factors = zeros(1, bo.d);
for i = 1:bo.d
    norm_factors(i) = (bo.data{i}.range(end) - bo.data{i}.range(1))*.5;
end
norm_X1 = X1 ./ repmat(norm_factors, nr_X1, 1);
norm_X2 = X2 ./ repmat(norm_factors, nr_X2, 1);

% Compute squared distances
rsq = sqdist(norm_X1, norm_X2);

% Compute kernels
K = (1 + sqrt(5*rsq) + 5/3*rsq).*exp(-sqrt(5*rsq));
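
The sqdist helper used above is not listed in the post; I assume it returns the matrix of pairwise squared Euclidean distances between the rows of its two inputs. A minimal sketch under that assumption (the function body below is my own, not from the original code):

function D = sqdist(A, B)

% Pairwise squared Euclidean distances between rows of A (n1 x d) and rows of B (n2 x d),
% using the expansion ||a - b||^2 = ||a||^2 + ||b||^2 - 2*a*b'.
D = bsxfun(@plus, sum(A.^2, 2), sum(B.^2, 2)') - 2*(A*B');
D = max(D, 0);  % clip tiny negative values caused by round-off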


The expected improvement (EI) acquisition function from the paper,

$$a_{EI}(x) = \sigma(x) \left( \gamma(x) \Phi(\gamma(x)) + \phi(\gamma(x)) \right), \quad \gamma(x) = \frac{f_{\min} - \mu(x)}{\sigma(x)},$$

where $\mu(x)$ and $\sigma^2(x)$ are the GP posterior mean and variance and $\Phi$, $\phi$ are the standard normal CDF and PDF, can be implemented as

function val = acquisition_ei(x, bo)

n = bo.n;
X = bo.input(1:bo.n, :);
Y = bo.output(1:bo.n, :);

xstar   = x;
KX      = kernel_ardmatern(X, X, bo);
Kxstar  = kernel_ardmatern(xstar, xstar, bo);
KXxstar = kernel_ardmatern(X, xstar, bo);

% GP posterior variance and mean at xstar (1E-4 acts as a small noise / jitter term)
gpvar   = Kxstar - KXxstar'/(KX + 1E-4*eye(n, n))*KXxstar;
gpvar   = abs(gpvar);
gpmean  = KXxstar'/(KX + 1E-4*eye(n, n))*Y;

% Expected improvement over the current best (minimum) observation
fmin    = min(Y);
gamma   = (fmin - gpmean)/sqrt(gpvar);
Phi     = cdf('Normal', gamma, 0, 1);
val     = sqrt(gpvar)*(gamma*Phi + pdf('Normal', gamma, 0, 1));


For the underlying function, I will use the following function, which was used in the 'doubly stochastic gradients' paper:

$$f(x) = \cos(\frac{1}{2}\pi \| x \|_2) \exp(-0.1\pi \| x \|_2) $$
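
In MATLAB this objective is a one-liner; the handle name target_fcn below is just my own label for illustration:

target_fcn = @(x) cos(0.5*pi*norm(x)) * exp(-0.1*pi*norm(x));  % f(x) for a row-vector input x

target_fcn(zeros(1, 2))  % equals 1 at the origin (2-D case)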



One great thing about Bayesian optimization is that, as long as we can evaluate the cost function, we do not have to compute gradients or worry about the type of the input space, whether it is continuous or discrete.
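
To show how these pieces fit together, here is a minimal sketch of the whole optimization loop for a 2-D continuous box, reusing the bo struct fields assumed by the functions above (bo.d, bo.n, bo.input, bo.output, bo.data{i}.range). The box bounds, the random candidate search, and the initialization are my own simplifications for illustration, not necessarily what the full code below does; also note that acquisition_ei as written improves on the current minimum, so this loop minimizes the objective.

% Objective from above
target_fcn = @(x) cos(0.5*pi*norm(x)) * exp(-0.1*pi*norm(x));

% Search space: 2-D box [-5, 5]^2 (assumed for illustration)
bo.d = 2;
for i = 1:bo.d
    bo.data{i}.range = [-5, 5];
end

% Initialize with a few random evaluations
n_init = 3; n_iter = 30; n_cand = 1000;
bo.input  = zeros(n_init + n_iter, bo.d);
bo.output = zeros(n_init + n_iter, 1);
for i = 1:n_init
    x = -5 + 10*rand(1, bo.d);
    bo.input(i, :)  = x;
    bo.output(i, :) = target_fcn(x);
end
bo.n = n_init;

% Main loop: evaluate EI on random candidates and move to the best one
for iter = 1:n_iter
    cands = -5 + 10*rand(n_cand, bo.d);
    ei = zeros(n_cand, 1);
    for j = 1:n_cand
        ei(j) = acquisition_ei(cands(j, :), bo);
    end
    [~, idx] = max(ei);
    x_next = cands(idx, :);
    bo.input(bo.n+1, :)  = x_next;   % evaluate the objective and append
    bo.output(bo.n+1, :) = target_fcn(x_next);
    bo.n = bo.n + 1;
end

[best_y, best_i] = min(bo.output(1:bo.n));  % current best (minimum) found
best_x = bo.input(best_i, :);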


Code in MATLAB

: Comments are in Korean, but I guess that won't bother much.

: To be more precise, we should marginalize out the hyperparameters of the kernel function with Monte Carlo integration. Here, however, I simply fix each hyperparameter to half of its range (maximum minus minimum).

: I did not do any tuning, and, more importantly, this implementation may be wrong. Please let me know if you find anything wrong in this code.


Videos on YouTube (it works quite well)