堀内亮佑's Blog

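Clerk @ hospital organization A in Niigata Prefecture → Healthcare information technologist @ hospital organization B in Niigata Prefecture → Machine learning engineer @ foreign-affiliated CRO → Data scientist @ Japanese pharmaceutical company → Solutions architect @ foreign-affiliated software vendor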

How can you trim an ipynb file that is too big for Jupyter to open?

I was working on a multiple regression analysis. There were over 80 explanatory variables, so I used stepwise selection by AIC (Akaike information criterion) to reduce them. BTW, as far as I know there is no ready-made AIC step function in Python, so you have to write it yourself.
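For reference, the AIC is commonly defined as below, where k is the number of estimated parameters and \hat{L} is the maximized likelihood of the fitted model. The symbols are the usual textbook ones, not something taken from the code in this post.

```latex
\mathrm{AIC} = 2k - 2\ln\hat{L}
```

A lower AIC is better, so adding a variable only pays off when the gain in likelihood outweighs the penalty of 2 per extra parameter.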

qiita.com: the author of this article wrote his own step function. I copied and pasted it into my ipynb.

import numpy as np


def step_aic(model, exog, endog, **kwargs):
    """
    Select the best exogenous variables by AIC.
    Both exog and endog values can be either str or list.
    (Endog list is for the Binomial family.)

    Note: This uses only "forward" selection.

    Args:
        model: model from statsmodels.formula.api
        exog (str or list): exogenous variables
        endog (str or list): endogenous variables
        kwargs: extra keyword arguments for model (e.g., data, family)

    Returns:
        model: a fitted model that seems to have the smallest AIC
    """

    # convert exog, endog to list format
    exog = np.r_[[exog]].flatten()
    endog = np.r_[[endog]].flatten()
    remaining = set(exog)
    selected = []  # contains adopted candidates

    # calculate the AIC of the intercept-only model
    formula_head = ' + '.join(endog) + ' ~ '
    formula = formula_head + '1'
    aic = model(formula=formula, **kwargs).fit().aic
    print('AIC: {}, formula: {}'.format(round(aic, 3), formula))

    current_score, best_new_score = np.ones(2) * aic

    # keep adding elements until none remain or no candidate improves the AIC
    while remaining and current_score == best_new_score:
        scores_with_candidates = []
        for candidate in remaining:

            # calculate the AIC when adding each remaining element one by one
            formula_tail = ' + '.join(selected + [candidate])
            formula = formula_head + formula_tail
            aic = model(formula=formula, **kwargs).fit().aic
            print('AIC: {}, formula: {}'.format(round(aic, 3), formula))

            scores_with_candidates.append((aic, candidate))

        # take the candidate with the smallest AIC as the best candidate
        best_new_score, best_candidate = min(scores_with_candidates)

        # if adding the candidate reduces the AIC, move it to the selected list
        if best_new_score < current_score:
            remaining.remove(best_candidate)
            selected.append(best_candidate)
            current_score = best_new_score

    # fall back to the intercept-only formula if no variable was selected
    formula = formula_head + (' + '.join(selected) or '1')
    print('The best formula: {}'.format(formula))
    return model(formula=formula, **kwargs).fit()

Here is the problem. The line "print('AIC: {}, formula: {}'.format(round(aic, 3), formula))" prints one line for every model it fits, and with over 80 explanatory variables that produced a huge amount of text in my notebook, which bloated the file to about 80 MB. Have you ever heard of an 80 MB ipynb? Jupyter Notebook could not handle it and froze. To solve this problem, you have to trim the ipynb. But how? Your local Jupyter cannot open it. I once tried deleting the unnecessary parts by hand, opening the ipynb in my editor (as a JSON file), but that took a huge amount of effort.
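That JSON editing can also be scripted rather than done by hand. The sketch below is not the approach used in this post. It assumes the usual nbformat 4 layout (a top-level "cells" list whose code cells carry an "outputs" list), and the names trim_outputs, trim_file and the max_chars threshold are made up for illustration.

```python
import json


def trim_outputs(notebook, max_chars=10_000):
    """Clear the outputs of code cells whose serialized outputs exceed max_chars.

    `notebook` is the dict returned by json.load() on an .ipynb file.
    Returns the number of cells that were cleared.
    max_chars is an arbitrary, illustrative threshold.
    """
    cleared = 0
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "code":
            continue  # markdown and raw cells have no outputs
        size = len(json.dumps(cell.get("outputs", [])))
        if size > max_chars:
            cell["outputs"] = []
            cleared += 1
    return cleared


def trim_file(src, dst, max_chars=10_000):
    """Read the notebook at src, trim oversized outputs, and write it to dst."""
    with open(src, encoding="utf-8") as f:
        notebook = json.load(f)
    cleared = trim_outputs(notebook, max_chars)
    with open(dst, "w", encoding="utf-8") as f:
        json.dump(notebook, f, ensure_ascii=False, indent=1)
    return cleared
```

Something like trim_file("big.ipynb", "trimmed.ipynb") (hypothetical file names) writes a copy, so the original stays untouched in case the result is not what you wanted.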


My idea was to use a Google Colab notebook instead. Colab can open and handle a file this big.

You can use Google Colab notebooks to trim outputs. I opened the 83 MB ipynb file in Colab, and Colab handled it. In the Colab GUI, you can clear the output of any cell you choose, then download the file back to your local directory and reopen it. In the end I trimmed the original file down to 2 MB this way.
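To keep a notebook from growing like this in the first place, one option is to capture what a chatty function prints instead of letting it land in the cell output. This is a minimal sketch using contextlib.redirect_stdout from the standard library. run_quietly and noisy_square are hypothetical names, and noisy_square only stands in for a function like step_aic.

```python
import contextlib
import io


def run_quietly(func, *args, **kwargs):
    """Call func while capturing everything it prints to stdout.

    Returns (result, captured_text) so the log is still available if needed.
    """
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        result = func(*args, **kwargs)
    return result, buffer.getvalue()


def noisy_square(x):
    # stand-in for a function that prints a lot, such as step_aic
    print('working on {}'.format(x))
    return x * x


value, log = run_quietly(noisy_square, 3)
print(value)     # the return value is unaffected
print(len(log))  # the printed text was captured instead of displayed
```

With the function from this post, the call would look something like run_quietly(step_aic, smf.ols, exog, endog, data=df), where exog, endog and df are whatever you would normally pass in. You can then print only the last few lines of the captured log if you want to see them.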