Converting a Large CSV into a Sparse Matrix Using Dask
In this article, I explain how to convert a large CSV file into a sparse matrix (or matrices) using Dask for training machine learning models. The following code creates COO matrices that can be used to train a machine learning model.
import dask.dataframe as dd
from scipy import sparse
id_column = 'column name for ID'  # e.g. PassengerId in the case of the Titanic dataset
target_column = 'column name for target value'  # e.g. Survived in the case of the Titanic dataset
ddf = dd.read_csv('/path/to/csv')
features = ddf.drop([id_column, target_column], axis=1)
categorical_columns = features.columns[features.dtypes == object].tolist()  # or you can explicitly specify the categorical columns
dtypes = features.categorize(columns=categorical_columns).dtypes  # these dtypes will be reused for transforming the test data
X = dd.get_dummies(features.astype(dtypes), sparse=True)  # this sparse option doesn't seem to work...
# X can be a huge dataframe, so we split it into parts using to_delayed()
X = X.repartition(npartitions=n_partitions)  # if necessary; n_partitions is a number you choose
coos = []
for x in X.to_delayed():
    x = x.compute()  # each partition should be small enough to fit in memory
    coo = x.to_sparse(fill_value=0).to_coo()
    coos.append(coo)
X = sparse.vstack(coos)
y = ddf[target_column].compute().values  # the target values fit in memory since they are just one column
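Note that DataFrame.to_sparse, used above, was deprecated and removed in pandas 1.0. If you are on a newer pandas, the loop body can be rewritten with the sparse accessor. This is a minimal sketch assuming a recent pandas version; it is not part of the original recipe:

import pandas as pd

coos = []
for x in X.to_delayed():
    x = x.compute()  # each partition is a plain pandas DataFrame
    x = x.astype(pd.SparseDtype('float', fill_value=0))  # every column must be sparse for the accessor
    coos.append(x.sparse.to_coo())  # DataFrame.sparse.to_coo() replaces to_sparse().to_coo()
X = sparse.vstack(coos)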
You can train a machine learning model using this X and y. In the following code, we train an XGBClassifier.
from xgboost import XGBClassifier
model = XGBClassifier()
model.fit(X, y)
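XGBoost is happiest with CSR input; if your XGBoost version complains about the COO matrix, converting first is cheap. A hedged one-line tweak, not from the original:

model.fit(X.tocsr(), y)  # CSR conversion is a no-copy-cheap safeguard for sparse input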
After fitting the model, we need the features for the test data, which can be created as follows.
ddf = dd.read_csv('/path/to/test_csv')
features = ddf.drop([id_column], axis=1)  # the test data doesn't have target_column
X = dd.get_dummies(features.astype(dtypes), sparse=True)  # again, this sparse option doesn't seem to work. Why does it exist?
X = X.repartition(npartitions=n_partitions)  # if necessary
coos = []
for x in X.to_delayed():
    x = x.compute()  # each partition should be small enough to fit in memory
    coo = x.to_sparse(fill_value=0).to_coo()
    coos.append(coo)
X = sparse.vstack(coos)  # this works if the stacked test matrix fits in memory
proba = model.predict_proba(X)
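If you want to persist the predictions, for example as a Titanic-style submission, you can pair the positive-class probabilities with the ID column we dropped earlier. A sketch; the output file name and the prediction column name are my own placeholders:

import pandas as pd

submission = pd.DataFrame({
    id_column: ddf[id_column].compute().values,  # a single column, so it fits in memory
    'prediction': proba[:, 1],  # probability of the positive class
})
submission.to_csv('submission.csv', index=False)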
Although this approach is memory-friendly, it cannot handle data that exceeds your machine's RAM even in sparse format. I'm currently trying to find a way to deal with such situations. Distributed computation using Dask would probably help, but I haven't found a good approach so far. Please let me know if you know a good way to handle such data.
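One direction that may help without a cluster is incremental (out-of-core) learning: instead of stacking all the chunks, feed them one by one to a model that supports partial_fit. Below is a rough sketch with scikit-learn's SGDClassifier; note that it swaps XGBoost for a linear model, reuses the training-time Dask dataframe of dummies (X before vstack) and the in-memory y, and assumes the row order of the partitions matches the order of y, which holds for a straight read_csv:

from sklearn.linear_model import SGDClassifier
import numpy as np

classes = np.unique(y)  # partial_fit must see the full set of classes up front
model = SGDClassifier(loss='log_loss')  # use loss='log' on older scikit-learn versions
start = 0
for x_delayed in X.to_delayed():
    csr = x_delayed.compute().to_sparse(fill_value=0).to_coo().tocsr()
    end = start + csr.shape[0]
    model.partial_fit(csr, y[start:end], classes=classes)  # train on one chunk at a time
    start = end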