In [1]: from statsmodels.datasets.longley import load
In [2]: import statsmodels.api as sm
In [3]: import numpy as np
In [4]: data = load()
In [5]: data.exog = sm.add_constant(data.exog, prepend=False)
In [6]: ols_model = sm.OLS(data.endog, data.exog)
In [7]: ols_results = ols_model.fit()
The Longley dataset is well known to have high multicollinearity. One way to quantify this is the condition number of the design matrix: normalize the independent variables to unit length and take the square root of the ratio of the largest to the smallest eigenvalue of X'X (Greene 4.9). The constant (the last column of exog) is left as a column of ones, so the loop below covers only the df_model non-constant columns.
In [8]: norm_x = np.ones_like(data.exog)
In [9]: for i in range(int(ols_model.df_model)):
   ...:     norm_x[:,i] = data.exog[:,i]/np.linalg.norm(data.exog[:,i])
   ...:
In [10]: norm_xtx = np.dot(norm_x.T,norm_x)
In [11]: eigs = np.linalg.eigvals(norm_xtx)
In [12]: collin = np.sqrt(eigs.max()/eigs.min())
In [13]: print(collin)
56240.8691004
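As a cross-check (a sketch, not part of the recorded session, reusing the norm_x built above): np.linalg.cond returns the ratio of a matrix's largest to smallest singular value, which is the same quantity as the eigenvalue ratio computed by hand.

# 2-norm condition number of the unit-length design matrix; this equals
# np.sqrt(eigs.max()/eigs.min()) from the eigenvalue calculation above.
print(np.linalg.cond(norm_x))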
Clearly there is a serious problem with multicollinearity; a rule of thumb is that a condition number above 20 requires attention.
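As an aside (a sketch, not part of the recorded session), the fitted results can flag the problem directly: the OLS summary reports a "Cond. No." entry and appends a warning note when it is large. That value is computed from the design matrix as given, without the unit-length scaling, so it need not match the number above.

# The summary output includes "Cond. No." and a multicollinearity warning
# when the condition number is large.
print(ols_results.summary())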
A consequence of severe multicollinearity is that the parameter estimates are very sensitive to small changes in the data. For instance, consider the Longley dataset with the last observation dropped:
In [14]: ols_results2 = sm.OLS(data.endog[:-1], data.exog[:-1,:]).fit()
All of the coefficients change considerably. Expressed as percentage changes relative to the original estimates:
In [15]: print(("Percentage change %4.2f%%\n"*7) % tuple(ols_results.params/ols_results2.params*100 - 100))
Percentage change -173.43%
Percentage change 31.04%
Percentage change 3.48%
Percentage change 7.83%
Percentage change -199.54%
Percentage change 15.39%
Percentage change 15.40%
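The same sensitivity check can be run for every observation, not just the last one. A minimal sketch (not part of the recorded session) that refits the model with each observation deleted in turn and expresses the coefficients as percentage changes from the full-sample estimates:

# Leave-one-out refits: row i holds the coefficient estimates obtained after
# deleting observation i from the sample.
endog = np.asarray(data.endog, dtype=float)
exog = np.asarray(data.exog, dtype=float)
params_dropped = np.array([
    sm.OLS(np.delete(endog, i), np.delete(exog, i, axis=0)).fit().params
    for i in range(len(endog))
])

# Percentage change of each coefficient relative to the full-sample fit.
pct_change = params_dropped / np.asarray(ols_results.params) * 100 - 100
print(pct_change)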