Implementation of the following paper:
Snoek, Jasper, Hugo Larochelle, and Ryan P. Adams. "Practical Bayesian optimization of machine learning algorithms." Advances in neural information processing systems. 2012.
"Bayesian optimization typically works by assuming the unknown function was sampled from a Gaussian process and maintains a posterior distribution for this function as observations are made or, in our case, as the results of running learning algorithm experiments with different hyperparameters are observed. To pick the hyperparameters of the next experiment, one can optimize the expected improvement (EI) over the current best result or the Gaussian process upper confidence bound (UCB)."
In this post, I use the expected improvement (EI) criterion with an ARD Matern 5/2 kernel, which is the same setting as in the paper. The ARD Matern 5/2 kernel is

$$K_{M52}(x, x') = \left(1 + \sqrt{5 r^2(x, x')} + \frac{5}{3} r^2(x, x')\right) \exp\left(-\sqrt{5 r^2(x, x')}\right)$$

where $r^2(x, x') = \sum_{d=1}^{D} (x_d - x'_d)^2 / \theta_d^2$ is a squared distance scaled by per-dimension length scales $\theta_d$.
In Matlab, this kernel function can be easily implemented as
function K = kernel_ardmatern(X1, X2, bo)
nr_X1 = size(X1, 1);
nr_X2 = size(X2, 1);
% First, normalize inputs (ARD length scales fixed to half of each input range)
norm_factors = zeros(1, bo.d);
for i = 1:bo.d
norm_factors(i) = (bo.data{i}.range(end) - bo.data{i}.range(1))*.5;
end
norm_X1 = X1 ./ repmat(norm_factors, nr_X1, 1);
norm_X2 = X2 ./ repmat(norm_factors, nr_X2, 1);
% Compute squared distances
rsq = sqdist(norm_X1, norm_X2);
% Compute the ARD Matern 5/2 kernel matrix
K = (1 + sqrt(5*rsq) + 5/3*rsq).*exp(-sqrt(5*rsq));
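The helper sqdist above is not a built-in MATLAB function. Assuming it returns the matrix of pairwise squared Euclidean distances between the rows of its two inputs, a minimal sketch would be:

function D = sqdist(X1, X2)
% Pairwise squared Euclidean distances between rows of X1 (n1 x d) and X2 (n2 x d)
D = bsxfun(@plus, sum(X1.^2, 2), sum(X2.^2, 2)') - 2*(X1*X2');
D = max(D, 0); % clip tiny negative values caused by round-off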
The expected improvement (EI) acquisition function can be implemented as
function val = acquisition_ei(x, bo)
n = bo.n;
X = bo.input(1:bo.n, :);
Y = bo.output(1:bo.n, :);
xstar = x;
% GP posterior mean and variance at the candidate point
% (a small jitter 1E-4 is added to the diagonal for numerical stability)
KX = kernel_ardmatern(X, X, bo);
Kxstar = kernel_ardmatern(xstar, xstar, bo);
KXxstar = kernel_ardmatern(X, xstar, bo);
gpvar = Kxstar - KXxstar'/(KX + 1E-4*eye(n, n))*KXxstar;
gpvar = abs(gpvar); % guard against small negative variances from round-off
gpmean = KXxstar'/(KX + 1E-4*eye(n, n))*Y;
% Expected improvement over the current best observation (minimization)
fmin = min(Y);
gamma = (fmin - gpmean)/sqrt(gpvar);
Phi = cdf('Normal', gamma, 0, 1);
val = sqrt(gpvar)*(gamma*Phi + pdf('Normal', gamma, 0, 1));
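For reference, the last few lines implement the closed-form expected improvement for minimization, as in Snoek et al.:

$$a_{EI}(x) = \sigma(x)\left(\gamma(x)\,\Phi(\gamma(x)) + \phi(\gamma(x))\right), \qquad \gamma(x) = \frac{f_{\min} - \mu(x)}{\sigma(x)}$$

where $\mu(x)$ and $\sigma^2(x)$ are the GP predictive mean and variance (gpmean and gpvar above), and $\Phi$ and $\phi$ are the standard normal CDF and PDF.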
For the underlying objective, I will use the following function, which was used in the 'doubly stochastic gradient' paper.
$$f(x) = \cos(\frac{1}{2}\pi \| x \|_2) \exp(-0.1\pi \| x \|_2) $$
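A minimal MATLAB sketch of this test objective (the name objective_cosexp is my own choice, not from the original code):

function val = objective_cosexp(x)
% f(x) = cos(pi*||x||_2/2) * exp(-0.1*pi*||x||_2), as defined above
r = norm(x(:), 2);
val = cos(0.5*pi*r)*exp(-0.1*pi*r);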
One great thing about Bayesian optimization is that, once we can evaluate the cost function, we do not need gradients, and we do not need to worry about whether the input space is continuous or discrete.
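To make this concrete, a rough sketch of the outer loop could look like the following; the candidate set candidates (one row per candidate point) and the objective objective_cosexp are names I introduce here for illustration, while the bo fields match the structs used above.

% Hypothetical driver: pick the next point by maximizing EI over a finite
% candidate set; no gradients of the objective are ever required.
for iter = 1:20
    ei_vals = zeros(size(candidates, 1), 1);
    for c = 1:size(candidates, 1)
        ei_vals(c) = acquisition_ei(candidates(c, :), bo);
    end
    [~, idx] = max(ei_vals);
    x_next = candidates(idx, :);
    % Evaluate the black-box objective and append the new observation
    bo.n = bo.n + 1;
    bo.input(bo.n, :) = x_next;
    bo.output(bo.n, :) = objective_cosexp(x_next);
end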
Code in MATLAB
: Comments are in Korean, but I hope that won't be much of a problem.
: To be precise, we should marginalize out the kernel hyperparameters with Monte Carlo integration. However, I simply fix each length scale to half of its input range (maximum minus minimum).
: I did not do any tuning, and, more importantly, this implementation may be wrong. Please let me know if you find anything wrong in the code.
Videos on YouTube (it works quite well)