Dirichlet Process (DP)

Dirichlet Process (DP)는 매우 복잡스러운 수학의 결과가 우리의 직관과 맞을 수 있다는 것을 보여주는 좋은 예라 할 수 있겠다.

DP는 그 이름에서 알 수 있듯이 Dirichlet distribution을 따르는 random process이다. Random process라 함은 시간에 따라 특정 분포를 따르는 Random variable들의 sequence라 생각할 수 있다. 나 개인적으로는 특정한 Probability allocation function을 갖는 Sample space에서 임의로 random variable들을 뽑는다는 개념을 좋아한다. 물론 RP에 시간에 따른 함수를 추가할 수 있기 때문에 정확한 정의는 아니다.

각설하고 DP는 unsupervised clustering에 많이 사용된다. 특히나 우리가 cluster하고 싶은 cluster의 수를 모를 때! 사용될 수 있는 장점이 있다. 유도 과정은 매우 복잡하니 생략을 하겠다. (사실 나도 따라갈 수만 있고 유도는 못하겠다..)

DP를 이해하기 위해서는 Dirichlet distribution에 대해서 알아야 한다. Dirichlet distribution은 simplex에 대한 분포를 갖는다. n차원의 simplex의 정의는 모두 양수이고 sum이 1인 n차원의 벡터를 의미한다. 이러한 벡터는 multinomial distribution의 parameter로 쓰일 수 있다. (각 n개의 공이 뽑힐 확률로 이해될 수 있다.) 그래서 Dirichlet distribution을 따라다니는 수식어가 바로 distribution over distribution이다. 거창하게 들리지만 사실 multinomial distribution의 파라미터로 쓰일 수 있다는 말이다.

이러한 DP가 중요해지는 이유는 multinomial distribution이 대부분의 mixture model의 기본이기 때문이다. 다시 말해서 우리가 임의로 설정한 basis distribution의 weight를 Dirichlet random variable로 놓을 수 있다는 말이다. 하지만 이 경우에는 Dirichlet distribution의 dimension이 고정되어 있어서 cluster의 수를 모르는 경우에 사용되기가 힘들다.

그래서 아래 논문에서는 Dirichlet process란 것을 제안했고, 이는 내부적으로 Dirichlet distribution의 dimension을 바꿀 수 있게 되어있다. 이는 Chinese restaurant process라고도 불린다. 원탁 테이블을 갖는 중국집에 사람이 계속 들어올 때 사람이 cluster하고자 하는 개별 data이고, 테이블을 cluster하고자 하는 묶음이 된다.

각설하고 식을 써보면 다음과 같다.

이게 정말 잘 동작하는지 시뮬레이션을 통해서 결과를 확인해보자. Likelihood는 각 cluster의 center를 mean으로 하는 multivariate Gaussian 분포를 따른다고 가정하였다. 간단하게 하기 위해서 분산은 모두 상수로 하였다.

실험 결과는 다음과 같다. (왼쪽이 임의로 만든 데이터, 오른쪽이 DP-GMM으로 구한 cluster)

위의 동영상이 좀 깨지긴 하지만 처음에 cluster의 수를 제대로 몰라도 결국 비슷하게 찾아가는 것을 알 수 있다.

DP prior 부분에 $\alpha$ 가 들어가는데 이 변수가 중요하다. 결론부터 말하자면 이 값이 클 수록 새로운 cluster가 많이 생긴다. 작으면 작을 수록 새로운 cluster가 잘 생기지 않는다. (오히려 잘 준다.)

실험에 사용된 매트랩 코드

1. demo_dp_gmm

%% DP를 이용해서 GMM 문제를 풀자.

clc;

clear all;

close all;

%% 실험의 파라미터

make_new_data = 1; % 새로운 GMM으로 실험을 할지 여부

save_new_image = 1; % 현재 GMM을 저장할지 여부

save_new_video = 1; % 결과를 video로 저장할지 여부

new_video_index = 2; % video 인덱스

%% 이미지를 새로 저장할 땐 기존 이미지를 삭제한다.

if save_new_image

delete('pics/*.png');

end

%% 1. 실험을 위한 데이터를 만든다.

if make_new_data

nr_gaussian_mixture = 10; % Gaussian mixture의 수

nr_data_each_mixture = 30; % 하나의 mixture의 포인트 수

sigma2_gmm_x = .4;

sigma2_gmm_y = .4;

xmin = 0; xmax = 10; ymin = 0; ymax = 10;

stage_boundary = [0 10 0 10];

gmm_data.nr_mix = nr_gaussian_mixture;

gmm_data.gmm_all = [];

for gmi = 1:gmm_data.nr_mix

% 중심을 설정하고,

gmm_data.gmm{gmi}.center ...

= [xmin+(xmax-xmin)*rand(), ymin+(ymax-ymin)*rand()];

% 몇 개의 point를 모을지 정하고

gmm_data.gmm{gmi}.nr_pnt = nr_data_each_mixture;

% 중심을 주변으로 포인트를 더 모은다.

curr_center = gmm_data.gmm{gmi}.center;

gmm_data.gmm{gmi}.point = repmat(curr_center, gmm_data.gmm{gmi}.nr_pnt, 1) ...

+ randn(gmm_data.gmm{gmi}.nr_pnt, 2)*sqrt(diag([sigma2_gmm_x sigma2_gmm_y]));

% 데이터를 다 모은다.

gmm_data.gmm_all = [gmm_data.gmm_all ; gmm_data.gmm{gmi}.point];

% (추가) plot을 위한 색을 정한다.

gmm_data.gmm{gmi}.color = rand(1, 3);

end

save('mats/gmm_data.mat', 'gmm_data');

else

load 'mats/gmm_data.mat';

end

%% 2. 실험 데이터를 확인해본다.

fig_ref = figure(); hold on;

h_list = []; % legend를 위한 plot의 argument

legend_list = [];

for gmi = 1:gmm_data.nr_mix

% 1. 현 mixture에 해당하는 점들을 그리고

cur_points = gmm_data.gmm{gmi}.point;

h(gmi) = plot(cur_points(:, 1), cur_points(:, 2), 'o' ...

, 'Color', gmm_data.gmm{gmi}.color );

% 2. 중심을 그린다. (중앙을 채워서)

center_point = gmm_data.gmm{gmi}.center;

plot(center_point(1), center_point(2), 'o' ...

, 'MarkerFaceColor', gmm_data.gmm{gmi}.color );

% 3. Legend를 위한 변수 update

h_list = [h_list h(gmi)];

legend_list{gmi} = [num2str(gmi) 'th mixture'];

end

legend(h_list, legend_list, -1); axis equal;

title('Reference GMM data');hold off;

set(fig_ref,'PaperPositionMode','auto')

print (fig_ref , '-dpng', ['pics/fig_ref.png']) ;

%% 3. Dirichlet Process를 해보자.

% DP의 Likelihood 계산에 사용될 var

dp.sigma2_x = .5;

% DP의 concentration parameter

dp.alpha = 10;

% DP의 데이터의 수

dp.data = gmm_data.gmm_all;

% DP의 class의 수

dp.nr_class = 1;

% DP로 GMM을 해보자.

% 1. 각 포인트에 대해서 임의로 class 설정을 한다.

dp.nr_data = size(dp.data, 1);

dp.class = randi([1 dp.nr_class] ...

, size(gmm_data.gmm_all, 1), 1); % 각 포인트에 랜덤으로 class를 할당

for ci = 1:dp.nr_class

% 1. 현재 class에 해당하는 index를 찾고

class_idx = find(dp.class == ci);

% 2. 찾은 index에 해당하는 point들의 평균으로 center를 설정한다.

center_pos = mean(dp.data(class_idx, :));

dp.center{ci} = center_pos;

end

% 2. Iteration을 돌면서 변경을 시킨다.

color_list = rand(1E3, 3); % plot에 사용될 색 list. (넉넉히 잡아놓자.)

dp.maxiter = 2E2;

for iter = 1:dp.maxiter

%% 현재 결과를 plot해본다.

fig1 = figure(2); clf; hold on;

h_list = []; legend_list = []; % legend를 위한 plot의 argument

legend_index = 1;

for ci = 1:dp.nr_class

class_idx = find(dp.class == ci);

if sum(class_idx) ~= 0

% 1. 현재 class에 해당하는 점들을 그리자.

h(ci) = plot(dp.data(class_idx, 1), dp.data(class_idx, 2), 'o' ...

, 'Color', color_list(ci, :) );

% 2. 중심을 그리자.

center_point = dp.center{ci};

plot(center_point(1), center_point(2), 'o' ...

, 'MarkerFaceColor', color_list(ci, :) );

text(center_point(1), center_point(2), [' [' num2str(ci) ']' ]);

% 3. Legend를 위한 변수 update

h_list = [h_list h(ci)];

legend_list{legend_index} = [num2str(legend_index) 'th mixture'];

legend_index = legend_index + 1;

end

legend(h_list, legend_list, -1); axis equal;

title(['[' num2str(iter) '] iteration DP result ' num2str(legend_index-1) ' classes']);

hold off; drawnow;

% 5 번에 한번씩 사진을 저장을 한다.

% if rem(iter, 5) == 1 && save_new_image

if save_new_image

set(fig1,'PaperPositionMode','auto')

print (fig1 , '-dpng', ['pics/fig_', num2str(iter), '.png']) ;

end

pause(.001);

%% 각 포인트에 대해서 어느 class에 할당될지 정한다. (혹은 새로운 class를 만들자.)

for i = 1:dp.nr_data

% 1. 현재 포인트가 새로운 class에 들어갈지, 기존 cluster에 들어갈지 정하자

cur_point = dp.data(i, :);

p = zeros(dp.nr_class+1, 1);

% 1. 먼저 기존 class에 들어갈 확률을 계산한다.

lik_list = zeros(dp.nr_class, 1);

for j = 1:dp.nr_class

% i-번째 point가 속한 class를 구한다.

i_class = dp.class(i);

% j-번째 class가 총 몇개가 할당되었는지 구한다.

n_j = sum(find(dp.class==j));

% i-번째 point가 속한 class가 j이면 하나 뺀다.

if i_class == j, n_j = n_j - 1; end;

% p(j)를 계산한다. - j번째 class의 center를 평균으로 한다.

lik = mvnpdf(cur_point, dp.center{j}, diag([dp.sigma2_x dp.sigma2_x]));

p(j) = lik * n_j / (dp.nr_data-1+dp.alpha);

lik_list(j) = lik;

end

% 2. 새로운 class가 생길 확률을 계산한다.

lik_new = max(lik_list);

p(dp.nr_class+1) = lik_new * dp.alpha / (dp.nr_data-1+dp.alpha);

% 3. p를 normalize 한 후에 하나를 뽑는다.

p = p / sum(p);

z = find(mnrnd(1, p)==1); % 뽑힌 놈의 index를 구했다.

if z <= dp.nr_class

% 기존의 class에 할당

dp.class(i) = z;

else

% 새로운 class를 만든다.

% fprintf('making new class: %d \n', z);

dp.class(i) = z;

dp.nr_class = dp.nr_class + 1;

end

% 바뀐 포인트에 대한 처리를 한다.

for ci = 1:dp.nr_class

% 1. 현재 class에 해당하는 index를 찾고

class_idx = find(dp.class == ci);

% 2. 찾은 index에 해당하는 point들의 평균으로 center를 설정한다. (비어있지 않으면)

if sum(class_idx) ~= 0

center_pos = [mean(dp.data(class_idx, 1)) mean(dp.data(class_idx, 2))];

dp.center{ci} = center_pos;

end

end % for i = 1:dp.nr_data

end % for iter = 1:dp.maxiter

%% 동영상을 만들자.

if save_new_video && save_new_image

demo_make_video_dp_gmm;

end

2. demo_make_video_dp_gmm

+ Hyper-parameter alpha를 구하는 부분 추가

위의 논문에 보면 alpha에 대한 conditional distribution이 나온다. 이식을 Gibbs sampling을 통해서 풀 수 있다.

Conditional distribution은 다음과 같다.

저작자표시 비영리 동일조건

'Enginius > Machine Learning' 카테고리의 다른 글

What is the future of Big data (0)	2013.07.22
Principal component analysis (PCA) on data (1)	2013.07.11
Latent Dirichlet Allocation (0)	2013.06.22
Convex Optimization of Graph Laplacian (0)	2013.05.31
Decentralized Orthogonal Iteration (DOI) + Spectral Clustering (0)	2013.05.31

Mad for Simplicity

Dirichlet Process (DP)

'Enginius > Machine Learning' 카테고리의 다른 글

티스토리툴바

Dirichlet Process (DP)

'Enginius > Machine Learning' 카테고리의 다른 글

'Enginius/Machine Learning' Related Articles

티스토리툴바