# Multiple Regression Analysis by Python statsmodels

## Overview

In this article, there is a explanation of Multiple Regression Analysis by using *statsmodels* in python. We focus not analyzing but understanding how to use this library.

## Environments

- Python 3.8.6
- statsmodels 0.13.2

## Preparetion of datase

This time we will use Red Wine Quality dataset. This dataset is published as OpenDatabase License in kaggle site. We download it and save the csv file in any directory.

## Installing statsmodels

By using pip command, we will install the *statsmodels* library.

`pip install statsmodels`

## Loading dataset

Next, we will load the dataset you save in any directory by pandas dataframe.

```
import pandas as pd
df = pd.read_csv('winequality-red.csv')
df.head()
```

In this dataset, we can see “*quality*” is a responsive variable and others are explanatory variables.

## Multiple Regression Analysis

Now, we move on to Muitiple Regression Analysis for this dataset. However *statsmodels* is so powerful that we don’t need to write lots of code.

```
import statsmodels.api as sm
#Explanatory columns
X = df[df.columns[:-1]]
#Adding constant term
X = sm.add_constant(X)
#Responsive variable
y = df['quality']
#Generating a model
model = sm.OLS(y,X)
result = model.fit()
#Checking results
result.summary()
```

## The meaning of results

The meaning of the results is important for us so lets check them.

**Dep. Variable**

This is the responsive variable we selected, quality in here.

**Model**

The type of modes is showed. OLS means Ordinary Least Sqauares. This method is quite common.

**Method **

Method is almost same as Model, Least Squares.

**Date, Time**

The date and time we got the OLS results. It means when we executed result.summary().

**No. Observations**

The number of rows in dataet. Here it is same as a result of len(df), 1599.

**Df Residuals**

This is Degree of freedom of Residuals. it can be caluculated as No.Observations minus Df Model minus 1. 1599 – 11 -1 = 1578

**Df Model**

This is Degree of freedom of Model. It is number of explanatory variables, 11.

**Covariance Type**

The covariance Type is nonrobust as default.

**R-squared**

This is squared of Coefficient of Determination.

**Adj. R-sqaured**

This is squared of Adjusted Coefficient of Determination. Adj. R-sqaured can be used for comparison between two models having different explanatory variables.

**F-statistic**

F-static can be used to test whether there are meaningul explanatory variables.

**Prob (F-statistic)**

This is probability of F-statistic. It is enough small value in this example.

**Log-Liklihood**

Log-Liklihood can evaluate likelihod of data samples.

**AIC**

This is *Akaike’s Information Criterion*. Smaller is better.

**BIC**

This is *Bayesian information criterion*. Smaller is better.

## Summary of this article

*statisticmodels* can make results of multiple regression quite easily. However when we work on any real tasks, it is also good idea to use any other libralies to show data graphially.