
Multiple Regression Analysis by Python statsmodels


Overview

This article explains how to perform multiple regression analysis using statsmodels in Python. The focus is not on the analysis itself but on understanding how to use the library.

Environments

  • Python 3.8.6
  • statsmodels 0.13.2

Preparation of the dataset

This time we will use the Red Wine Quality dataset. It is published on Kaggle under the Open Database License. Download it and save the CSV file in any directory.

Red Wine Quality | Kaggle

Installing statsmodels

Install the statsmodels library with the pip command.

pip install statsmodels

Loading dataset

Next, load the dataset you saved into a pandas DataFrame.

import pandas as pd

# Load the CSV file saved earlier and check the first rows
df = pd.read_csv('winequality-red.csv')
df.head()

In this dataset, “quality” is the response variable and the other columns are explanatory variables.
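
As a quick check (a minimal sketch; it simply assumes “quality” is the last column, as in the Kaggle CSV), we can list which columns will play each role:

# Sketch: confirm the explanatory / response split ('quality' is assumed to be the last column)
print(df.columns[:-1].tolist())  # explanatory variables
print(df.columns[-1])            # 'quality', the response variable
print(len(df))                   # number of rows, 1599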

Multiple Regression Analysis

Now we move on to the multiple regression analysis itself. statsmodels is so convenient that we do not need to write much code.

import statsmodels.api as sm

# Explanatory columns (all columns except 'quality')
X = df[df.columns[:-1]]

# Adding a constant term for the intercept
X = sm.add_constant(X)

# Response variable
y = df['quality']

# Building and fitting the model
model = sm.OLS(y, X)
result = model.fit()

# Checking the results
result.summary()
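
As a small follow-up (not part of the original walkthrough, just standard statsmodels attributes), the individual estimates can also be read programmatically instead of from the summary table:

# Read parts of the fitted model directly
print(result.params)   # estimated coefficients, including the constant term
print(result.pvalues)  # p-values of the individual coefficients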

The meaning of the results

Understanding what each value in the summary means is important, so let's check them one by one.

Dep. Variable

This is the response variable we selected; here it is quality.

Model

The type of model is shown. OLS stands for Ordinary Least Squares, which is a very common method.
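
As a rough illustration (a sketch with numpy, not part of the original article), ordinary least squares chooses the coefficients that minimize the sum of squared residuals; solving that problem directly should reproduce the statsmodels estimates:

import numpy as np

# Sketch: solve the least-squares problem directly and compare with statsmodels
beta, *_ = np.linalg.lstsq(X.to_numpy(dtype=float), y.to_numpy(dtype=float), rcond=None)
print(beta)                      # should match result.params up to floating-point error
print(result.params.to_numpy())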

Method

Method is closely related to Model; here it is Least Squares.

Date, Time

The date and time at which the OLS results were produced, that is, when result.summary() was executed.

No. Observations

The number of rows in the dataset. Here it equals len(df), which is 1599.

Df Residuals

This is the degrees of freedom of the residuals. It can be calculated as No. Observations minus Df Model minus 1: 1599 - 11 - 1 = 1587.

Df Model

This is the degrees of freedom of the model. It equals the number of explanatory variables, 11.
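
These values are also exposed as attributes of the results object, so the relationship above can be checked in code (a small usage sketch):

# Check: No. Observations - Df Model - 1 == Df Residuals
print(result.nobs)       # 1599.0
print(result.df_model)   # 11.0
print(result.df_resid)   # 1587.0
print(result.nobs - result.df_model - 1 == result.df_resid)  # True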

Covariance Type

The covariance type is nonrobust by default.

R-squared

This is the coefficient of determination, which indicates how much of the variance in the response variable is explained by the model.

Adj. R-squared

This is the adjusted coefficient of determination. It can be used to compare models that have different sets of explanatory variables.
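
Both values are available as attributes of the results object (a small usage sketch):

# R-squared and adjusted R-squared as attributes
print(result.rsquared)      # coefficient of determination
print(result.rsquared_adj)  # adjusted for the number of explanatory variables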

F-statistic

The F-statistic can be used to test whether the explanatory variables are jointly meaningful.

Prob (F-statistic)

This is the p-value of the F-statistic. It is a sufficiently small value in this example.
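
These two quantities can also be read directly from the results object (a small usage sketch):

# F-statistic and its p-value
print(result.fvalue)    # F-statistic
print(result.f_pvalue)  # Prob (F-statistic)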

Log-Likelihood

The log-likelihood evaluates how well the model fits the observed data.

AIC

This is Akaike’s Information Criterion. Smaller is better.

BIC

This is the Bayesian Information Criterion. Smaller is better.
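
These fit measures are attributes of the results object as well, which is handy when comparing models (a small usage sketch):

# Log-likelihood and information criteria
print(result.llf)  # log-likelihood
print(result.aic)  # Akaike's Information Criterion (smaller is better)
print(result.bic)  # Bayesian Information Criterion (smaller is better)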

Summary of this article

statsmodels produces multiple regression results quite easily. However, when we work on real tasks, it is also a good idea to use other libraries to visualize the data graphically.



