Converting Large CSV into Sparse Matrix using Dask

In this article, I explain how to convert a large CSV file into a sparse matrix (or matrices) using Dask for training machine learning models. The following code creates a COO matrix (or matrices) that can be used for training a machine learning model.

import dask.dataframe as dd
from scipy import sparse

id_column = 'column name for ID' # e.g. PassengerId in the case of the Titanic dataset
target_column = 'column name for target value' # e.g. Survived in the case of the Titanic dataset

ddf = dd.read_csv('/path/to/csv')
features = ddf.drop([id_column, target_column], axis=1)

categorical_columns = features.columns[features.dtypes == object].tolist() # or you can explicitly specify categorical columns
dtypes = features.categorize(columns=categorical_columns).dtypes # these dtypes will be reused for transforming the test data

X = dd.get_dummies(features.astype(dtypes), sparse=True) # this sparse option doesn't work...

# X can be a huge dataframe, so we split it into parts using delayed
X = X.repartition(npartitions=n_partitions) # if necessary; n_partitions is a value you choose
coos = []
for x in X.to_delayed():
    x = x.compute() # each partition should be small enough to fit in memory
    coo = x.to_sparse(fill_value=0).to_coo() # note: to_sparse was removed in pandas 1.0
    coos.append(coo)
X = sparse.vstack(coos)
y = ddf[target_column].compute().values # the target is a single column, so it should fit in memory
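
As an aside on why dtypes is saved above: get_dummies only produces consistent columns across train and test if both share the same categorical dtypes. Here is a minimal pandas-only sketch (the color column is a hypothetical example, not from the dataset above):

import pandas as pd

train = pd.DataFrame({'color': ['red', 'blue', 'red']})
test = pd.DataFrame({'color': ['blue', 'blue', 'green']})

# Without a shared dtype, train and test get different dummy columns.
print(pd.get_dummies(train).columns.tolist()) # ['color_blue', 'color_red']
print(pd.get_dummies(test).columns.tolist())  # ['color_blue', 'color_green']

# Casting the test data with the dtypes learned from train aligns the columns;
# unseen categories (here 'green') simply become all-zero rows.
dtypes = train.astype('category').dtypes
print(pd.get_dummies(test.astype(dtypes)).columns.tolist()) # ['color_blue', 'color_red']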

You can train a machine learning model using this X and y. In the following code, we train an XGBClassifier.

from xgboost import XGBClassifier

model = XGBClassifier()
model.fit(X, y)
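
One caveat: sparse.vstack here returns a COO matrix, and depending on your XGBoost version you may need to convert it to CSR before fitting. I haven't verified this across versions, but the conversion is cheap:

X = X.tocsr() # CSR is the sparse format XGBoost handles most reliably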

After fitting the model, we need features for the test data, which can be created as follows.

ddf = dd.read_csv('/path/to/test_csv')
features = ddf.drop([id_column], axis=1) # test data doesn't have target_column

X = dd.get_dummies(features.astype(dtypes), sparse=True) # Again, this sparse option doesn't work. Why does this option exist?

X = X.repartition(npartitions=n_partitions) # if necessary
coos = []
for x in X.to_delayed():
    x = x.compute() # each partition should be small enough to fit in memory
    coo = x.to_sparse(fill_value=0).to_coo()
    coos.append(coo)

# if your test data is small enough
X = sparse.vstack(coos)
proba = model.predict_proba(X)
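
If even the stacked test matrix is too large, one workaround (my sketch, not a standard recipe) is to predict per chunk and concatenate only the predicted probabilities, which are dense but far smaller than the feature matrix:

import numpy as np

probas = []
for coo in coos:
    probas.append(model.predict_proba(coo.tocsr())) # one chunk in memory at a time
proba = np.vstack(probas)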

Although this approach is memory-friendly, it cannot handle data that exceeds your machine's RAM even in sparse format. I'm currently trying to find a way to deal with such situations. Distributed computation using Dask would probably help, but I haven't found a good way so far. Please let me know if you know a good way to handle such data.
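
One direction I have been considering is out-of-core learning: instead of stacking everything, feed the model one partition at a time. The sketch below applies to the training features (the dask dataframe X from the training section, before it was overwritten with the stacked matrix) and swaps XGBoost for scikit-learn's SGDClassifier, which supports incremental fitting via partial_fit. It assumes y still fits in memory, as above; note that loss='log' is called 'log_loss' in newer scikit-learn.

import numpy as np
from sklearn.linear_model import SGDClassifier

clf = SGDClassifier(loss='log') # logistic regression, fitted incrementally
classes = np.unique(y) # partial_fit must see the full set of classes up front

offset = 0
for x in X.to_delayed():
    x = x.compute() # only this partition is held in memory
    csr = x.to_sparse(fill_value=0).to_coo().tocsr()
    clf.partial_fit(csr, y[offset:offset + csr.shape[0]], classes=classes)
    offset += csr.shape[0]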
