I was working on a multivariate regression analysis. There were over 80 explanatory variables, so I used the AIC (Akaike information criterion) with a step function to reduce them. BTW, I could not find a ready-made AIC step function in Python, so you have to write one yourself.

Someone on qiita.com wrote his own step function. I copied and pasted it into my ipynb.

    import numpy as np


    def step_aic(model, exog, endog, **kwargs):
        """
        Select the best exogenous variables by AIC.

        Both exog and endog values can be either str or list.
        (Endog list is for the Binomial family.)

        Note: This adopts only "forward" selection.

        Args:
            model: model from statsmodels.formula.api
            exog (str or list): exogenous variables
            endog (str or list): endogenous variables
            kwargs: extra keyword arguments for model (e.g., data, family)

        Returns:
            model: a model that seems to have the smallest AIC
        """

        # convert exog, endog to list format
        exog = np.r_[[exog]].flatten()
        endog = np.r_[[endog]].flatten()
        remaining = set(exog)
        selected = []  # contains adopted candidates

        # calculate the AIC for the constant-only model
        formula_head = ' + '.join(endog) + ' ~ '
        formula = formula_head + '1'
        aic = model(formula=formula, **kwargs).fit().aic
        print('AIC: {}, formula: {}'.format(round(aic, 3), formula))

        current_score, best_new_score = np.ones(2) * aic

        # adopt all elements, or end the loop once adding any remaining
        # element no longer improves the AIC
        while remaining and current_score == best_new_score:
            scores_with_candidates = []
            for candidate in remaining:

                # calculate the AIC when adding the remaining elements one by one
                formula_tail = ' + '.join(selected + [candidate])
                formula = formula_head + formula_tail
                aic = model(formula=formula, **kwargs).fit().aic
                print('AIC: {}, formula: {}'.format(round(aic, 3), formula))

                scores_with_candidates.append((aic, candidate))

            # adopt the element that improved the AIC most as the best candidate
            scores_with_candidates.sort()
            scores_with_candidates.reverse()
            best_new_score, best_candidate = scores_with_candidates.pop()

            # if adding the candidate reduces the AIC, add it to the selected ones
            if best_new_score < current_score:
                remaining.remove(best_candidate)
                selected.append(best_candidate)
                current_score = best_new_score

        # fall back to the constant-only model if nothing was selected
        formula = formula_head + (' + '.join(selected) or '1')
        print('The best formula: {}'.format(formula))
        return model(formula, **kwargs).fit()

Here is the problem. The line "print('AIC: {}, formula: {}'.format(round(aic, 3), formula))" runs once for every candidate model that gets fitted, so it yields a huge amount of text output in my notebook, which made my file as big as 80 MB. Have you ever heard of an 80 MB ipynb? Jupyter Notebook could not handle it and froze. To solve this problem, you have to trim your ipynb. But how? Your local Jupyter cannot open it. I once tried to delete the unnecessary parts manually by opening the ipynb in my editor (as a JSON file), but that took a huge effort.
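For next time, the flood of text can probably be kept out of the notebook in the first place. Here is a minimal sketch using contextlib.redirect_stdout from the standard library. Note that noisy_fit is a hypothetical stand-in for step_aic (so the example runs without statsmodels), and the 5000 lines are made up for illustration.

```python
import contextlib
import io


def noisy_fit(n_candidates):
    """Hypothetical stand-in for step_aic: prints one line per candidate."""
    for i in range(n_candidates):
        print('AIC: {}, formula: y ~ x{}'.format(1000 - i, i))
    return 'fitted-model'


# Capture everything printed inside the block in memory instead of
# letting it land in the notebook's output cell.
buffer = io.StringIO()
with contextlib.redirect_stdout(buffer):
    result = noisy_fit(5000)

log_lines = buffer.getvalue().splitlines()

# Show only the last few lines; the full log is still in log_lines
# and could be written to a text file if needed.
print('\n'.join(log_lines[-3:]))
```

In a notebook, putting the IPython cell magic %%capture at the top of the cell should have a similar effect. Either way, the saved ipynb only contains what you explicitly print afterwards.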

My idea was to use a Google Colab notebook. Colab can open and handle a file this big.

You can use Google Colab notebooks for trimming outputs. I opened the 83 MB ipynb file in Colab, and Colab could handle it. From the Colab GUI, you can select the output cells you want to delete, then download the file back to your local directory and reopen it. In the end I trimmed the original file down to 2 MB this way.
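If you would rather not upload the file anywhere, the same trimming can probably be scripted, since an ipynb is just JSON with a "cells" list where each code cell carries an "outputs" list. The sketch below is my own assumption of how that could look, not something I used at the time: the function name trim_large_outputs, the max_chars threshold, and the file names in the usage comment are all hypothetical.

```python
import json


def trim_large_outputs(notebook, max_chars=10_000):
    """Empty the outputs of code cells whose serialized output is too big.

    `notebook` is a parsed ipynb (a dict). Returns how many cells were trimmed.
    """
    trimmed = 0
    for cell in notebook.get('cells', []):
        if cell.get('cell_type') != 'code':
            continue  # markdown and raw cells have no outputs
        size = len(json.dumps(cell.get('outputs', [])))
        if size > max_chars:
            cell['outputs'] = []
            trimmed += 1
    return trimmed


# Tiny in-memory notebook for demonstration: one huge output, one small one.
demo = {
    'nbformat': 4, 'nbformat_minor': 5, 'metadata': {},
    'cells': [
        {'cell_type': 'code', 'execution_count': 1, 'metadata': {},
         'source': ['run_selection()'],
         'outputs': [{'output_type': 'stream', 'name': 'stdout',
                      'text': ['AIC: 1.0, formula: y ~ 1\n'] * 1000}]},
        {'cell_type': 'code', 'execution_count': 2, 'metadata': {},
         'source': ['print("done")'],
         'outputs': [{'output_type': 'stream', 'name': 'stdout',
                      'text': ['done\n']}]},
        {'cell_type': 'markdown', 'metadata': {}, 'source': ['notes']},
    ],
}
n_trimmed = trim_large_outputs(demo)
print('trimmed cells:', n_trimmed)

# Usage on a real file (hypothetical file names):
# with open('big.ipynb', encoding='utf-8') as f:
#     nb = json.load(f)
# trim_large_outputs(nb)
# with open('trimmed.ipynb', 'w', encoding='utf-8') as f:
#     json.dump(nb, f, indent=1, ensure_ascii=False)
```

Writing to a new file instead of overwriting the original keeps a backup in case the threshold removes an output you still wanted.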