The Datasets Package

statsmodels provides data sets (i.e. data and meta-data) for use in examples, tutorials, model testing, etc.

Using Datasets from Stata

webuse(data[, baseurl, as_df]) Download and return an example dataset from Stata.

Using Datasets from R

The Rdatasets project gives access to the datasets available in R’s core datasets package and many other common R packages. All of these datasets are available to statsmodels through the get_rdataset function. The data itself is accessible through the data attribute of the returned object. For example:

In [1]: import statsmodels.api as sm

In [2]: duncan_prestige = sm.datasets.get_rdataset("Duncan", "car")

In [3]: print(duncan_prestige.__doc__)

In [4]: duncan_prestige.data.head(5)

R Datasets Function Reference

get_rdataset(dataname[, package, cache]) Download and return an R dataset.
get_data_home([data_home]) Return the path of the statsmodels data dir.
clear_data_home([data_home]) Delete all the content of the data home cache.
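
The two cache helpers listed above can be tried without touching the network. The following is a minimal sketch, assuming that get_data_home creates the directory it is given when it does not exist yet and that clear_data_home removes its contents. The scratch path is hypothetical and only used so that your real cache is left alone.

```python
import os
import tempfile

from statsmodels.datasets import get_data_home, clear_data_home

# Hypothetical scratch location, used instead of the default data home.
scratch = os.path.join(tempfile.mkdtemp(), "sm_cache")

# Ask statsmodels for the data home, pointing it at the scratch path.
home = get_data_home(scratch)
created = os.path.isdir(home)
print(home, created)

# Delete everything stored under that data home.
clear_data_home(scratch)
emptied = (not os.path.exists(scratch)) or (not os.listdir(scratch))
print(emptied)
```

Passing an explicit data_home is a design choice for this example only; calling either function without arguments acts on the default data directory.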

Available Datasets

Usage

Load a dataset:

In [5]: import statsmodels.api as sm

In [6]: data = sm.datasets.longley.load()

The Dataset object follows the bunch pattern explained in the original datasets proposal. The full dataset is available in the data attribute.

In [7]: data.data

Most datasets hold convenient representations of the data in the attributes endog and exog:

In [8]: data.endog[:5]

In [9]: data.exog[:5,:]

Univariate datasets, however, do not have an exog attribute.

Variable names can be obtained by typing:

In [10]: data.endog_name

In [11]: data.exog_name

If the dataset does not have a clear interpretation of what should be an endog and exog, then you can always access the data or raw_data attributes. This is the case for the macrodata dataset, which is a collection of US macroeconomic data rather than a dataset with a specific example in mind. The data attribute contains a record array of the full dataset and the raw_data attribute contains an ndarray with the names of the columns given by the names attribute.

In [12]: type(data.data)

In [13]: type(data.raw_data)

In [14]: data.names

Loading data as pandas objects

For many users it may be preferable to get the datasets as a pandas DataFrame or Series object. Each of the dataset modules is equipped with a load_pandas function which returns a Dataset instance with the data readily available as pandas objects:

In [15]: data = sm.datasets.longley.load_pandas()

In [16]: data.exog

In [17]: data.endog

The full DataFrame is available in the data attribute of the Dataset object:

In [18]: data.data

With pandas integration in the estimation classes, the metadata (such as variable names) will be attached to model results.

Extra Information

If you want to know more about the dataset itself, you can access the following module attributes, again using the Longley dataset as an example:

>>> dir(sm.datasets.longley)[:6]
['COPYRIGHT', 'DESCRLONG', 'DESCRSHORT', 'NOTE', 'SOURCE', 'TITLE']

Additional information

  • The idea for a datasets package was originally proposed by David Cournapeau, with later updates by Skipper Seabold.
  • To add datasets, see the notes on adding a dataset.