Converting a Large CSV into a Sparse Matrix Using Dask
In this article, I explain how to convert a large CSV file into a sparse matrix (or matrices) using Dask for training machine learning models. The following code creates COO matrices that can be used to train a machine learning model.
import dask.dataframe as dd
from scipy import sparse
id_column = 'column name for ID'  # e.g. PassengerId in the case of the Titanic dataset
target_column = 'column name for target value'  # e.g. Survived in the case of the Titanic dataset
ddf = dd.read_csv('/path/to/csv')
features = ddf.drop([id_column, target_column], axis=1)
categorical_columns = features.columns[features.dtypes == object].tolist()  # or you can explicitly specify the categorical columns
dtypes = features.categorize(columns=categorical_columns).dtypes  # these dtypes will be reused for transforming the test data
X = dd.get_dummies(features.astype(dtypes), sparse=True)  # this sparse option doesn't seem to work...
# X can be a huge dataframe, so we split it into parts using to_delayed()
X = X.repartition(npartitions=n_partitions)  # if necessary; n_partitions is a number you choose
coos = []
for x in X.to_delayed():
    x = x.compute()  # each partition should be small enough to fit in memory
    coo = x.to_sparse(fill_value=0).to_coo()
    coos.append(coo)
X = sparse.vstack(coos)
y = ddf[target_column].compute().values  # the target values fit in memory since they are just one column
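Note that DataFrame.to_sparse, used above, was deprecated and removed in pandas 1.0. If you are on a newer pandas, the loop body can be rewritten with the sparse accessor. This is a minimal sketch assuming a recent pandas version; it is not part of the original recipe:

import pandas as pd

coos = []
for x in X.to_delayed():
    x = x.compute()  # each partition is a plain pandas DataFrame
    x = x.astype(pd.SparseDtype('float', fill_value=0))  # every column must be sparse for the accessor
    coos.append(x.sparse.to_coo())  # DataFrame.sparse.to_coo() replaces to_sparse().to_coo()
X = sparse.vstack(coos)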
You can train a machine learning model using this X and y. In the following code, we train an XGBClassifier.
from xgboost import XGBClassifier
model = XGBClassifier()
model.fit(X, y)
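XGBoost is happiest with CSR input; if your XGBoost version complains about the COO matrix, converting first is cheap. A hedged one-line tweak, not from the original:

model.fit(X.tocsr(), y)  # CSR conversion is a no-copy-cheap safeguard for sparse input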
After fitting the model, we need the features for the test data, which can be created as follows.
ddf = dd.read_csv('/path/to/test_csv')
features = ddf.drop([id_column], axis=1)  # the test data doesn't have target_column
X = dd.get_dummies(features.astype(dtypes), sparse=True)  # again, this sparse option doesn't seem to work. Why does it exist?
X = X.repartition(npartitions=n_partitions)  # if necessary
coos = []
for x in X.to_delayed():
    x = x.compute()  # each partition should be small enough to fit in memory
    coo = x.to_sparse(fill_value=0).to_coo()
    coos.append(coo)
X = sparse.vstack(coos)  # this works if the stacked test matrix fits in memory
proba = model.predict_proba(X)
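If you want to persist the predictions, for example as a Titanic-style submission, you can pair the positive-class probabilities with the ID column we dropped earlier. A sketch; the output file name and the prediction column name are my own placeholders:

import pandas as pd

submission = pd.DataFrame({
    id_column: ddf[id_column].compute().values,  # a single column, so it fits in memory
    'prediction': proba[:, 1],  # probability of the positive class
})
submission.to_csv('submission.csv', index=False)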
Although this approach is memory-friendly, it cannot handle data that exceeds your machine's RAM even in sparse format. I'm currently trying to find a way to deal with such situations. Distributed computation using Dask would probably help, but I haven't found a good approach so far. Please let me know if you know a good way to handle such data.
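One direction that may help without a cluster is incremental (out-of-core) learning: instead of stacking all the chunks, feed them one by one to a model that supports partial_fit. Below is a rough sketch with scikit-learn's SGDClassifier; note that it swaps XGBoost for a linear model, reuses the training-time Dask dataframe of dummies (X before vstack) and the in-memory y, and assumes the row order of the partitions matches the order of y, which holds for a straight read_csv:

from sklearn.linear_model import SGDClassifier
import numpy as np

classes = np.unique(y)  # partial_fit must see the full set of classes up front
model = SGDClassifier(loss='log_loss')  # use loss='log' on older scikit-learn versions
start = 0
for x_delayed in X.to_delayed():
    csr = x_delayed.compute().to_sparse(fill_value=0).to_coo().tocsr()
    end = start + csr.shape[0]
    model.partial_fit(csr, y[start:end], classes=classes)  # train on one chunk at a time
    start = end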