I am trying a regression challenge from Kaggle, the ASHRAE Great Energy Predictor: https://www.kaggle.com/c/ashrae-energy-prediction
I have cleaned and preprocessed the data and am now attempting to apply the XGBoost algorithm to it. A sample of the data is shown below (the target variable I am predicting is meter_reading).
                                          0                    1                    2                    3                    4
site_id                                   0                    0                    0                    0                    0
building_id                               7                   31                   55                   96                  103
primary_use                       Education            Education               Office  Lodging/residential            Education
square_feet                          121074                61904                16726               200933                21657
meter                          chilledwater         chilledwater         chilledwater         chilledwater         chilledwater
timestamp               2016-02-29 09:00:00  2016-02-29 09:00:00  2016-02-29 09:00:00  2016-02-29 09:00:00  2016-02-29 09:00:00
meter_reading                       1857.26              1097.47              337.683              1266.31              337.683
meter_reading_roll_avg              2219.77              1719.04              510.663              2245.43               349.27
outlier_ratio                      0.836691             0.638421             0.661264             0.563951             0.966825
air_temperature                        12.8                 12.8                 12.8                 12.8                 12.8
dew_temperature                         8.9                  8.9                  8.9                  8.9                  8.9
sea_level_pressure                   1021.9               1021.9               1021.9               1021.9               1021.9
wind_speed                                0                    0                    0                    0                    0
hour                                      9                    9                    9                    9                    9
weekday                                   0                    0                    0                    0                    0
month                                     2                    2                    2                    2                    2
wind_compass                          North                North                North                North                North
HDD                                     5.2                  5.2                  5.2                  5.2                  5.2
CDD                                       0                    0                    0                    0                    0
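For reference, here is roughly what my training loop looks like (reconstructed around the snippet visible in the traceback below; the XGBRegressor hyperparameters here are placeholders, and the categorical columns are assumed to have already been numerically encoded):

import xgboost as xgb
from sklearn.model_selection import KFold

# Placeholder hyperparameters -- my real settings may differ
model = xgb.XGBRegressor(n_estimators=100, max_depth=6)

# K-fold cross-validation over the full dataset
kf = KFold(n_splits=5)
for train_index, test_index in kf.split(X):
    print("Test Index: ", test_index)
    X_train, X_test = X.values[train_index], X.values[test_index]
    y_train, y_test = y.values[train_index], y.values[test_index]
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    predictions = [round(value) for value in y_pred]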
When I run it with around 10,000 samples, the algorithm works and I get a result. When I run it with 400k+ samples, I get the following error:
---------------------------------------------------------------------------
OSError Traceback (most recent call last)
<ipython-input-11-ff1b424a4002> in <module>
6 print("Test Index: ", test_index)
7 X_train, X_test, y_train, y_test = X.values[train_index], X.values[test_index], y.values[train_index], y.values[test_index]
----> 8 model.fit(X_train,y_train)
9 y_pred=model.predict(X_test)
10 predictions = [round(value) for value in y_pred]
~\AppData\Local\Continuum\anaconda3\lib\site-packages\xgboost\sklearn.py in fit(self, X, y, sample_weight, base_margin, eval_set, eval_metric, early_stopping_rounds, verbose, xgb_model, sample_weight_eval_set, callbacks)
822 evals_result=evals_result, obj=obj, feval=feval,
823 verbose_eval=verbose, xgb_model=xgb_model,
--> 824 callbacks=callbacks)
825
826 self.objective = xgb_options["objective"]
~\AppData\Local\Continuum\anaconda3\lib\site-packages\xgboost\training.py in train(params, dtrain, num_boost_round, evals, obj, feval, maximize, early_stopping_rounds, evals_result, verbose_eval, xgb_model, callbacks)
210 evals=evals,
211 obj=obj, feval=feval,
--> 212 xgb_model=xgb_model, callbacks=callbacks)
213
214
~\AppData\Local\Continuum\anaconda3\lib\site-packages\xgboost\training.py in _train_internal(params, dtrain, num_boost_round, evals, obj, feval, xgb_model, callbacks)
73 # Skip the first update if it is a recovery step.
74 if version % 2 == 0:
---> 75 bst.update(dtrain, i, obj)
76 bst.save_rabit_checkpoint()
77 version += 1
~\AppData\Local\Continuum\anaconda3\lib\site-packages\xgboost\core.py in update(self, dtrain, iteration, fobj)
1367 _check_call(_LIB.XGBoosterUpdateOneIter(self.handle,
1368 ctypes.c_int(iteration),
-> 1369 dtrain.handle))
1370 else:
1371 pred = self.predict(dtrain, output_margin=True, training=True)
OSError: [WinError -529697949] Windows Error 0xe06d7363
I think this is because I don't have enough computing power. Here are my specs:

[screenshot of machine specs]
Is there a quick and convenient way to determine when I don't have enough computing power for a dataset/algorithm?
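The only quick check I can think of is comparing a rough estimate of the feature matrix size against available RAM, along the lines of the sketch below (psutil-based; the float32 assumption and the idea that this approximates XGBoost's actual memory needs are my guesses):

import numpy as np
import psutil

# Rough size of a dense float32 copy of the feature matrix
# (a lower bound -- XGBoost's actual peak usage will be higher)
n_rows, n_cols = X.shape
approx_gb = n_rows * n_cols * np.dtype(np.float32).itemsize / 1e9

available_gb = psutil.virtual_memory().available / 1e9
print(f"Feature matrix: ~{approx_gb:.2f} GB")
print(f"Available RAM:  {available_gb:.2f} GB")

But I don't know how reliable that comparison is, or whether there is a more standard approach.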