Getting started¶
This very simple case-study is designed to get you up-and-running quickly with
statsmodels
. Starting from raw data, we will show the steps needed to
estimate a statistical model and to draw a diagnostic plot. We will only use
functions provided by statsmodels
or its pandas
and patsy
dependencies.
Loading modules and functions¶
After installing statsmodels and its dependencies, we load a few modules and functions:
In [1]: from __future__ import print_function
In [2]: import statsmodels.api as sm
ImportErrorTraceback (most recent call last)
<ipython-input-2-085740203b77> in <module>()
----> 1 import statsmodels.api as sm
/builddir/build/BUILD/statsmodels-0.9.0/statsmodels/api.py in <module>()
5 from . import regression
6 from .regression.linear_model import OLS, GLS, WLS, GLSAR
----> 7 from .regression.recursive_ls import RecursiveLS
8 from .regression.quantile_regression import QuantReg
9 from .regression.mixed_linear_model import MixedLM
/builddir/build/BUILD/statsmodels-0.9.0/statsmodels/regression/recursive_ls.py in <module>()
14 from statsmodels.regression.linear_model import OLS
15 from statsmodels.tools.data import _is_using_pandas
---> 16 from statsmodels.tsa.statespace.mlemodel import (
17 MLEModel, MLEResults, MLEResultsWrapper)
18 from statsmodels.tools.tools import Bunch
/builddir/build/BUILD/statsmodels-0.9.0/statsmodels/tsa/statespace/mlemodel.py in <module>()
16 from scipy.stats import norm
17
---> 18 from .simulation_smoother import SimulationSmoother
19 from .kalman_smoother import SmootherResults
20 from .kalman_filter import (INVERT_UNIVARIATE, SOLVE_LU)
/builddir/build/BUILD/statsmodels-0.9.0/statsmodels/tsa/statespace/simulation_smoother.py in <module>()
8
9 import numpy as np
---> 10 from .kalman_smoother import KalmanSmoother
11 from . import tools
12
/builddir/build/BUILD/statsmodels-0.9.0/statsmodels/tsa/statespace/kalman_smoother.py in <module>()
9 import numpy as np
10
---> 11 from statsmodels.tsa.statespace.representation import OptionWrapper
12 from statsmodels.tsa.statespace.kalman_filter import (KalmanFilter,
13 FilterResults)
/builddir/build/BUILD/statsmodels-0.9.0/statsmodels/tsa/statespace/representation.py in <module>()
8
9 import numpy as np
---> 10 from .tools import (
11 find_best_blas_type, validate_matrix_shape, validate_vector_shape
12 )
/builddir/build/BUILD/statsmodels-0.9.0/statsmodels/tsa/statespace/tools.py in <module>()
205 'z': _statespace.zcopy_index_vector
206 })
--> 207 set_mode(compatibility=None)
208
209
/builddir/build/BUILD/statsmodels-0.9.0/statsmodels/tsa/statespace/tools.py in set_mode(compatibility)
57 if not compatibility:
58 from scipy.linalg import cython_blas
---> 59 from . import (_representation, _kalman_filter, _kalman_smoother,
60 _simulation_smoother, _tools)
61 compatibility_mode = False
ImportError: cannot import name _representation
In [3]: import pandas
In [4]: from patsy import dmatrices
pandas builds on numpy
arrays to provide
rich data structures and data analysis tools. The pandas.DataFrame
function
provides labelled arrays of (potentially heterogenous) data, similar to the
R
“data.frame”. The pandas.read_csv
function can be used to convert a
comma-separated values file to a DataFrame
object.
patsy is a Python library for describing
statistical models and building Design Matrices using R
-like formulas.
Data¶
We download the Guerry dataset, a
collection of historical data used in support of Andre-Michel Guerry’s 1833
Essay on the Moral Statistics of France. The data set is hosted online in
comma-separated values format (CSV) by the Rdatasets repository.
We could download the file locally and then load it using read_csv
, but
pandas
takes care of all of this automatically for us:
In [5]: df = sm.datasets.get_rdataset("Guerry", "HistData").data
NameErrorTraceback (most recent call last)
<ipython-input-5-63f901ae7e17> in <module>()
----> 1 df = sm.datasets.get_rdataset("Guerry", "HistData").data
NameError: name 'sm' is not defined
The Input/Output doc page shows how to import from various other formats.
We select the variables of interest and look at the bottom 5 rows:
In [6]: vars = ['Department', 'Lottery', 'Literacy', 'Wealth', 'Region']
In [7]: df = df[vars]
NameErrorTraceback (most recent call last)
<ipython-input-7-817b52d314c7> in <module>()
----> 1 df = df[vars]
NameError: name 'df' is not defined
In [8]: df[-5:]
NameErrorTraceback (most recent call last)
<ipython-input-8-6b38bebfca7f> in <module>()
----> 1 df[-5:]
NameError: name 'df' is not defined
Notice that there is one missing observation in the Region column. We
eliminate it using a DataFrame
method provided by pandas
:
In [9]: df = df.dropna()
NameErrorTraceback (most recent call last)
<ipython-input-9-1ac174be4d67> in <module>()
----> 1 df = df.dropna()
NameError: name 'df' is not defined
In [10]: df[-5:]
NameErrorTraceback (most recent call last)
<ipython-input-10-6b38bebfca7f> in <module>()
----> 1 df[-5:]
NameError: name 'df' is not defined
Substantive motivation and model¶
We want to know whether literacy rates in the 86 French departments are associated with per capita wagers on the Royal Lottery in the 1820s. We need to control for the level of wealth in each department, and we also want to include a series of dummy variables on the right-hand side of our regression equation to control for unobserved heterogeneity due to regional effects. The model is estimated using ordinary least squares regression (OLS).
Design matrices (endog & exog)¶
To fit most of the models covered by statsmodels
, you will need to create
two design matrices. The first is a matrix of endogenous variable(s) (i.e.
dependent, response, regressand, etc.). The second is a matrix of exogenous
variable(s) (i.e. independent, predictor, regressor, etc.). The OLS coefficient
estimates are calculated as usual:
where \(y\) is an \(N \times 1\) column of data on lottery wagers per capita (Lottery). \(X\) is \(N \times 7\) with an intercept, the Literacy and Wealth variables, and 4 region binary variables.
The patsy
module provides a convenient function to prepare design matrices
using R
-like formulas. You can find more information here.
We use patsy
’s dmatrices
function to create design matrices:
In [11]: y, X = dmatrices('Lottery ~ Literacy + Wealth + Region', data=df, return_type='dataframe')
NameErrorTraceback (most recent call last)
<ipython-input-11-e357b4316ca3> in <module>()
----> 1 y, X = dmatrices('Lottery ~ Literacy + Wealth + Region', data=df, return_type='dataframe')
NameError: name 'df' is not defined
The resulting matrices/data frames look like this:
In [12]: y[:3]
NameErrorTraceback (most recent call last)
<ipython-input-12-941693b55cb6> in <module>()
----> 1 y[:3]
NameError: name 'y' is not defined
In [13]: X[:3]
NameErrorTraceback (most recent call last)
<ipython-input-13-153642677bb3> in <module>()
----> 1 X[:3]
NameError: name 'X' is not defined
Notice that dmatrices
has
- split the categorical Region variable into a set of indicator variables.
- added a constant to the exogenous regressors matrix.
- returned
pandas
DataFrames instead of simple numpy arrays. This is useful because DataFrames allowstatsmodels
to carry-over meta-data (e.g. variable names) when reporting results.
The above behavior can of course be altered. See the patsy doc pages.
Model fit and summary¶
Fitting a model in statsmodels
typically involves 3 easy steps:
- Use the model class to describe the model
- Fit the model using a class method
- Inspect the results using a summary method
For OLS, this is achieved by:
In [14]: mod = sm.OLS(y, X) # Describe model
NameErrorTraceback (most recent call last)
<ipython-input-14-5408baa0a82d> in <module>()
----> 1 mod = sm.OLS(y, X) # Describe model
NameError: name 'sm' is not defined
In [15]: res = mod.fit() # Fit model
NameErrorTraceback (most recent call last)
<ipython-input-15-7e8bcaf10af6> in <module>()
----> 1 res = mod.fit() # Fit model
NameError: name 'mod' is not defined
In [16]: print(res.summary()) # Summarize model
NameErrorTraceback (most recent call last)
<ipython-input-16-963a1fa5b90f> in <module>()
----> 1 print(res.summary()) # Summarize model
NameError: name 'res' is not defined
The res
object has many useful attributes. For example, we can extract
parameter estimates and r-squared by typing:
In [17]: res.params
NameErrorTraceback (most recent call last)
<ipython-input-17-a6ed515e2925> in <module>()
----> 1 res.params
NameError: name 'res' is not defined
In [18]: res.rsquared
NameErrorTraceback (most recent call last)
<ipython-input-18-6c0195efc793> in <module>()
----> 1 res.rsquared
NameError: name 'res' is not defined
Type dir(res)
for a full list of attributes.
For more information and examples, see the Regression doc page
Diagnostics and specification tests¶
statsmodels
allows you to conduct a range of useful regression diagnostics
and specification tests. For instance,
apply the Rainbow test for linearity (the null hypothesis is that the
relationship is properly modelled as linear):
In [19]: sm.stats.linear_rainbow(res)
NameErrorTraceback (most recent call last)
<ipython-input-19-052802ccfec2> in <module>()
----> 1 sm.stats.linear_rainbow(res)
NameError: name 'sm' is not defined
Admittedly, the output produced above is not very verbose, but we know from
reading the docstring
(also, print(sm.stats.linear_rainbow.__doc__)
) that the
first number is an F-statistic and that the second is the p-value.
statsmodels
also provides graphics functions. For example, we can draw a
plot of partial regression for a set of regressors by:
In [20]: sm.graphics.plot_partregress('Lottery', 'Wealth', ['Region', 'Literacy'],
....: data=df, obs_labels=False)
....:
NameErrorTraceback (most recent call last)
<ipython-input-20-ebbe148b620d> in <module>()
----> 1 sm.graphics.plot_partregress('Lottery', 'Wealth', ['Region', 'Literacy'],
2 data=df, obs_labels=False)
NameError: name 'sm' is not defined

More¶
Congratulations! You’re ready to move on to other topics in the Table of Contents