Accelerating Random Forests Up to 8x Using cuML on Kaggle with Example

Atilay Cem Samiloglu
Jun 12, 2022


Random Forest on GPUs

Random forests are a popular machine learning technique for classification and regression problems. By building multiple independent decision trees, they reduce the problems of overfitting seen with individual trees.
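To make the overfitting argument concrete, here is a minimal sketch of why a vote among many trees beats a single tree. It is an idealized model, not how a real forest is evaluated: it assumes every tree is correct with the same probability `p` and that the trees err independently, which real trees only approximate. The function name `majority_vote_accuracy` is my own.

```python
from math import comb


def majority_vote_accuracy(p: float, n_trees: int) -> float:
    """Probability that a strict majority of n_trees independent voters is correct.

    Assumes each tree is right with the same probability p and that
    errors are independent (an idealization of a real forest).
    """
    if n_trees % 2 == 0:
        raise ValueError("use an odd number of trees to avoid ties")
    k_min = n_trees // 2 + 1  # smallest number of correct votes that wins
    return sum(
        comb(n_trees, k) * p**k * (1 - p) ** (n_trees - k)
        for k in range(k_min, n_trees + 1)
    )


# Accuracy of the vote as the ensemble grows, for trees that are right 70% of the time
for n in (1, 11, 101):
    print(n, round(majority_vote_accuracy(0.7, n), 4))
```

Under these assumptions the vote gets more reliable as trees are added, which is the intuition behind averaging many decorrelated trees.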

In this post, I review the basic random forest algorithm, show how its training can be parallelized on NVIDIA GPUs, and finally present benchmark numbers demonstrating the performance. For more information about the random forest algorithm, see An Implementation and Explanation of the Random Forest in Python (Towards Data Science) or Lesson 1: Introduction to Random Forests.

Machine Learning Workloads Better Suited to CPUs

  • Recommendation systems whose training and inference need large amounts of memory for embedding layers.
  • Machine learning algorithms that do not require parallel computing, e.g. support vector machines and some time-series models.
  • Algorithms that process data sequentially, for example recurrent neural networks.
  • Algorithms that involve intensive branching.
  • Most data science algorithms deployed on cloud or Backend-as-a-Service (BaaS) architectures.

cuML (GPU) Random Forest vs CPU Random Forest on Kaggle


Kaggle provides free access to NVIDIA K80 GPUs in kernels. This benchmark shows that enabling a GPU for your kernel can speed up model training by up to 8x.

How to switch ON the GPU in Kaggle Kernel?

  1. Open any notebook. On the right-hand side click on the drop down beside “Accelerator”
  2. Select “GPU”
  3. Wait for the kernel to start up

You’re good to go now
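Once the kernel has restarted, it is worth confirming that a GPU is really attached before running anything expensive. The sketch below is one way to do that from Python, assuming the NVIDIA driver tools (`nvidia-smi`) are on the PATH when a GPU is enabled; the helper name `gpu_visible` is my own.

```python
import shutil
import subprocess


def gpu_visible() -> bool:
    """Return True if `nvidia-smi` exists and lists at least one GPU."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        return False  # driver tools not found: most likely no GPU attached
    try:
        result = subprocess.run(
            [exe, "-L"], capture_output=True, text=True, timeout=30
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    # `nvidia-smi -L` prints one line per device, each starting with "GPU"
    return result.returncode == 0 and "GPU" in result.stdout


print("GPU available:", gpu_visible())
```

If this prints `False`, the accelerator setting probably did not take effect and the kernel is still running on CPU only.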

Kaggle limits each user’s GPU use to 33 hours per week.

RAPIDS to Implement ML Model on GPU

RAPIDS is an open-source Python framework that executes data science code on GPUs instead of CPUs. This results in huge performance gains for data science work, similar to those seen for training deep learning models. RAPIDS has interfaces for DataFrames, ML, graph analysis, and more. RAPIDS uses Dask to handle parallelizing to machines with multiple GPUs, as well as a cluster of machines each with one or more GPUs.
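Because cuML follows the scikit-learn style of `fit()` and `predict()`, a notebook can decide at runtime which random forest implementation to use. The sketch below only checks which package is installed; it does not verify that a GPU is actually usable, and the helper name `pick_rf_backend` is hypothetical.

```python
from importlib.util import find_spec


def pick_rf_backend() -> str:
    """Name the random forest module that can be imported in this environment."""
    if find_spec("cuml") is not None:
        return "cuml.ensemble"     # RAPIDS cuML, runs on the GPU
    if find_spec("sklearn") is not None:
        return "sklearn.ensemble"  # scikit-learn, runs on the CPU
    return "none"


backend = pick_rf_backend()
print("random forest backend:", backend)
# Both modules expose a RandomForestClassifier, so the class could then be loaded with
#   importlib.import_module(backend).RandomForestClassifier
```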


In this application, a random forest classifier was applied to a data set with approximately 1.3 million rows and 38 feature variables.

The model was first run on the CPU. Then, to compare training speed, the same model configuration and data were run on the GPU.

The data set used in this study contains customer information related to credit card fraud.

You can access the Kaggle kernels for this notebook from these links.



Data Set:


Data Shape:



(1295071, 39)

Hyperparameters for the Random Forest Classifier:

# n_estimators = 100, random_state = 42, max_depth = 12,
# max_features = 7, min_samples_split = 10

# CV = 10

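Train-Test Split:

from sklearn.model_selection import train_test_split

# Separate the target column from the features, then hold out 30% for testing
y = new_df['is_fraud']
X = new_df.drop(['is_fraud'], axis = 1)
X_train, X_test, Y_train, Y_test = train_test_split(X, y, test_size=0.30, random_state=17)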

Train-Test Shape:

X_train.shape, X_test.shape, Y_train.shape, Y_test.shape


((906549, 38), (388522, 38), (906549,), (388522,))
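The row counts above follow directly from the 30% test fraction. As a quick check, the sketch below reproduces them with plain arithmetic, assuming the test share is rounded up and the remainder goes to training, which is how I understand scikit-learn handles a fractional `test_size`. The helper `split_sizes` is my own.

```python
import math


def split_sizes(n_samples: int, test_size: float) -> tuple:
    """Row counts (train, test) for a fractional test_size."""
    n_test = math.ceil(test_size * n_samples)  # test share is rounded up
    n_train = n_samples - n_test               # everything else is used for training
    return n_train, n_test


n_train, n_test = split_sizes(1295071, 0.30)
print(n_train, n_test)
```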

1. Running the model on CPU


from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV

rf_model = RandomForestClassifier(random_state=42)
param_grid = {'n_estimators': [100], 'max_depth': [12], 'max_features': [7], 'min_samples_split': [10]}

# 10-fold cross-validated search over the grid (a single candidate here)
rf_cv_model = GridSearchCV(rf_model, param_grid, cv=10).fit(X_train, Y_train)

# Fit a model with the best parameters and predict on the held-out test set
rf_tuned = RandomForestClassifier(**rf_cv_model.best_params_).fit(X_train, Y_train)
y_pred = rf_tuned.predict(X_test)


CPU times: user 8min, sys: 12.6 s, total: 8min 12s
Wall time: 8min 12s
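To read this timing, it helps to count how many forests the CPU cell trains in total. The sketch below does that count from the same grid and fold settings, assuming `GridSearchCV` keeps its default `refit=True`; the variable names are my own.

```python
from itertools import product

param_grid = {
    "n_estimators": [100],
    "max_depth": [12],
    "max_features": [7],
    "min_samples_split": [10],
}
cv = 10

n_candidates = len(list(product(*param_grid.values())))  # parameter combinations
cv_fits = n_candidates * cv  # one fit per candidate per fold
refit = 1                    # GridSearchCV refits the best candidate on all training rows
final_fit = 1                # the separate rf_tuned fit after the search
total_fits = cv_fits + refit + final_fit
print(total_fits)
```

Each of those fits builds 100 trees, so the wall time covers the whole search plus the final model, not a single training run.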

2. Running the Model on GPU

  • If we were going to run the model on CPU, we would already be good to go.
  • However, to run the cuML model on GPU, the data type must be set to float32.

Converting to float32:

import numpy as np

# cuML expects float32 features, so cast every float64 column
for col in X_train.select_dtypes('float64').columns:
    X_train[col] = X_train[col].astype(np.float32)
    X_test[col] = X_test[col].astype(np.float32)

Y_train = Y_train.astype(np.float32)

# Cast the int64 columns as well
for col in X_train.select_dtypes('int64').columns:
    X_train[col] = X_train[col].astype(np.float32)
    X_test[col] = X_test[col].astype(np.float32)
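One thing to keep in mind with this cast: float32 has a 24-bit significand, so integers above 2**24 (16,777,216) can no longer all be represented exactly. The sketch below shows the effect with the standard library only. Whether any int64 column in this data set holds values that large is something I have not checked, so treat it as a general caution; the helper `to_float32` is my own.

```python
import struct


def to_float32(value: float) -> float:
    """Round-trip a Python number through a 4-byte IEEE 754 float."""
    return struct.unpack("f", struct.pack("f", float(value)))[0]


print(to_float32(16_777_216))  # 2**24 is still exact
print(to_float32(16_777_217))  # 2**24 + 1 gets rounded to a neighbouring value
print(struct.calcsize("f"), struct.calcsize("d"))  # bytes per float32 vs float64 value
```

For typical numeric features this rounding is harmless, but large identifier-like columns are better dropped or encoded before the cast.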


import numpy as np
from cuml.ensemble import RandomForestClassifier as cuRF
from sklearn.metrics import classification_report
from sklearn.model_selection import KFold

rf_model = cuRF(n_estimators=100, n_streams=1, random_state=42, max_depth=12, max_features=7, min_samples_split=10)

def prediction(X_train, Y_train, model, X_test, seed=42):

    print('seed:', seed)

    kfold = KFold(n_splits=10, shuffle=True, random_state=seed)

    y_pred = np.zeros(len(X_test))
    train_oof = np.zeros(len(X_train))

    for train_idx, val_idx in kfold.split(X=X_train, y=Y_train):
        xtrain = X_train.iloc[train_idx]
        ytrain = Y_train.iloc[train_idx]
        xval = X_train.iloc[val_idx]
        yval = Y_train.iloc[val_idx]

        # fit model for current fold
        model.fit(xtrain, ytrain)
        # create predictions on the test set, averaged over the folds
        y_pred += model.predict(X_test) / kfold.n_splits

        val_pred = model.predict(xval)
        # getting out-of-fold predictions on training set
        train_oof[val_idx] = val_pred
        # print the classification report for this fold
        print(classification_report(yval, val_pred))

    return y_pred, train_oof

pred, train_oof = prediction(X_train, Y_train, rf_model, X_test)


seed: 42
CPU times: user 1min 29s, sys: 2.73 s, total: 1min 32s
Wall time: 1min 32s
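The GPU timing covers ten fits, one per fold. To see how much data each fit uses, the sketch below works out the fold sizes for the 906,549 training rows, assuming the rows are divided as evenly as possible with the first folds taking the leftover rows (my understanding of how `KFold` sizes its folds). The helper `kfold_val_sizes` is my own.

```python
def kfold_val_sizes(n_samples: int, n_splits: int) -> list:
    """Validation-fold sizes when n_samples is split as evenly as possible."""
    base, extra = divmod(n_samples, n_splits)
    # The first `extra` folds receive one additional row
    return [base + 1 if i < extra else base for i in range(n_splits)]


sizes = kfold_val_sizes(906549, 10)
print("validation rows per fold:", sizes)
print("training rows per fold:", [906549 - s for s in sizes])
```

So every fold trains on roughly 90% of the training set and validates on the remaining 10%.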


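We trained a random forest model on a credit card fraud data set of roughly 1.3 million rows, first with scikit-learn on the CPU and then with RAPIDS cuML on the GPU.

That’s 8 min 12 s on the CPU vs. 1 min 32 s with RAPIDS on the GPU, about 5.3x faster in this run!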
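For anyone who wants to redo the comparison with their own runs, here is a small sketch that turns the wall times printed by the two cells (8min 12s and 1min 32s) into a speedup factor. The parser only handles the minutes and seconds format shown in this post, and the name `wall_seconds` is my own.

```python
import re


def wall_seconds(text: str) -> int:
    """Parse a duration such as '8min 12s' into seconds."""
    match = re.fullmatch(r"(?:(\d+)min)?\s*(?:(\d+)s)?", text.strip())
    if match is None or not any(match.groups()):
        raise ValueError(f"unrecognised duration: {text!r}")
    minutes, seconds = (int(g) if g else 0 for g in match.groups())
    return 60 * minutes + seconds


cpu = wall_seconds("8min 12s")  # wall time reported by the CPU cell
gpu = wall_seconds("1min 32s")  # wall time reported by the GPU cell
print(f"speedup: {cpu / gpu:.2f}x")
```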

GPUs for the win!