
 
  • Alibek Jakupov

SMOTE Memory Error: Tips and tricks

Updated: May 13, 2019



If you are reading this article, you have probably already hit this issue while running an oversampling experiment. Here I share how I solved it. These techniques worked in my experiment, but that does not guarantee they will work for everyone. Still, I hope someone finds them useful.


Let's start with a brief introduction.


SMOTE stands for Synthetic Minority Oversampling Technique, a method for oversampling imbalanced data. In other words, if the distribution of instances across your classes is very uneven (for instance, 10k elements for class A against 1k elements for class B), you may want to generate synthetic instances for the smaller class. SMOTE does not simply duplicate existing instances: it creates new samples by interpolating between a minority instance and its nearest minority-class neighbors in feature space.
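To make the idea concrete, here is a minimal sketch of the interpolation step behind a single synthetic sample. The function name `synthesize` and the two example points are made up for illustration; a real implementation such as the one in imblearn also takes care of the nearest-neighbor search and of how many samples to generate.

```python
import numpy as np

rng = np.random.default_rng(0)


def synthesize(x, neighbor, rng):
    """Return one synthetic point on the segment between a minority
    sample `x` and one of its minority-class neighbors."""
    gap = rng.uniform(0.0, 1.0)  # random position along the segment
    return x + gap * (neighbor - x)


# Two hypothetical minority-class samples with two features each.
x = np.array([1.0, 2.0])
neighbor = np.array([3.0, 6.0])

new_point = synthesize(x, neighbor, rng)
print(new_point)
```

The new point always lies somewhere between the two original samples, which is why the result is a plausible new instance rather than a copy.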


In Python there is an excellent package called imblearn that implements SMOTE along with several other oversampling algorithms.

After oversampling, the majority class no longer takes over the other classes during training, so all classes are represented by the decision function.


The code is quite simple: initialize the SMOTE class, then run the oversampling.

>>> from collections import Counter
>>> from sklearn.datasets import make_classification
>>> from imblearn.over_sampling import SMOTE
>>> X, y = make_classification(n_classes=2, class_sep=2,
... weights=[0.1, 0.9], n_informative=3, n_redundant=1, flip_y=0,
... n_features=20, n_clusters_per_class=1, n_samples=1000, random_state=10)
>>> print('Original dataset shape %s' % Counter(y))
Original dataset shape Counter({1: 900, 0: 100})
>>> sm = SMOTE(random_state=42)
>>> X_res, y_res = sm.fit_resample(X, y)
>>> print('Resampled dataset shape %s' % Counter(y_res))
Resampled dataset shape Counter({0: 900, 1: 900})

However, when I ran my experiment, I ran into the following error:

Memory Error

Naturally, I started searching for this error and found a report of a similar problem.


Although it was not exactly the same error, I found that mine was caused by the dimensionality (5,000 features) and the large number of classes (157 in total).


So here are the steps I followed:


First, initialize SMOTE:


from imblearn.over_sampling import SMOTE

# initialize the oversampler
smote = SMOTE(random_state=42)

Then create a function for binary oversampling:



And here's the tip. For each class in the list, I split the dataset into the majority class plus that one class. I then ran a binary sampler on the pair, filtered out the majority class, and appended the newly generated dataframe to the base frame.



Hope you find this useful.