This article explains how to run a multiple regression analysis with statsmodels in Python. The focus is not on analyzing the data but on understanding how to use the library.
- Python 3.8.6
- statsmodels 0.13.2
Preparation of the dataset
This time we will use the Red Wine Quality dataset, which is published on Kaggle under the Open Database License. Download it and save the CSV file in any directory.
We install the statsmodels library with the pip command.
pip install statsmodels
Next, we load the saved CSV file into a pandas DataFrame.
    import pandas as pd

    df = pd.read_csv('winequality-red.csv')
    df.head()
In this dataset, “quality” is the response variable and the other columns are explanatory variables.
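As a small illustration of splitting the response from the explanatory variables, the sketch below selects the response column by name instead of by position. The toy frame uses only a few of the dataset's column names, and its numeric values are made up for illustration; they are not taken from the real file.

```python
import pandas as pd

# Toy frame with a subset of the dataset's column names.
# The numeric values are made up for illustration only.
toy = pd.DataFrame(
    {
        "alcohol": [9.4, 9.8, 10.5, 11.2],
        "pH": [3.51, 3.20, 3.26, 3.16],
        "sulphates": [0.56, 0.68, 0.65, 0.58],
        "quality": [5, 5, 6, 7],
    }
)

# Select the response by name and treat every other column as explanatory.
y_toy = toy["quality"]
X_toy = toy.drop(columns="quality")

print(list(X_toy.columns))
print(y_toy.name)
```

Selecting by name keeps working even if the column order of the file changes, whereas a positional slice assumes that the response is the last column.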
Multiple Regression Analysis
Now we move on to multiple regression analysis for this dataset. statsmodels is powerful enough that we don’t need to write much code.
    import statsmodels.api as sm

    # Explanatory columns (every column except the last one, "quality")
    X = df[df.columns[:-1]]
    # Add the constant (intercept) term
    X = sm.add_constant(X)
    # Response variable
    y = df['quality']
    # Build the model and fit it
    model = sm.OLS(y, X)
    result = model.fit()
    # Check the results (print so the table also appears outside a notebook)
    print(result.summary())
The meaning of results
The meaning of the results is important, so let's check each item in turn.
Dep. Variable: This is the response variable we selected, quality here.
Model: The type of model is shown. OLS stands for Ordinary Least Squares, a very common estimation method.
Method: The fitting method, which is closely related to Model. Here it is Least Squares.
Date and Time: The date and time when the OLS results were produced, that is, when we executed result.summary().
No. Observations: The number of rows in the dataset. Here it equals len(df), 1599.
Df Residuals: The degrees of freedom of the residuals. It is calculated as No. Observations minus Df Model minus 1: 1599 - 11 - 1 = 1587.
Df Model: The degrees of freedom of the model. It is the number of explanatory variables, 11.
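Covariance Type: The covariance type is nonrobust by default, which means the standard errors are computed under the assumption that the error variance is constant.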
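R-squared: This is the coefficient of determination, the share of the variance of the response variable that the model explains.
Adj. R-squared: This is the coefficient of determination adjusted for the number of explanatory variables. It can be used to compare two models that have different explanatory variables.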
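F-statistic: The F-statistic is used to test whether at least one explanatory variable has a nonzero coefficient, that is, whether the model explains more than an intercept-only model.
Prob (F-statistic): This is the p-value of the F-statistic. It is a sufficiently small value in this example.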
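Log-Likelihood: This is the logarithm of the likelihood of the observed data under the fitted model. A higher value means a better fit.
AIC: This is Akaike’s Information Criterion. When comparing models fitted to the same data, smaller is better.
BIC: This is the Bayesian Information Criterion. When comparing models fitted to the same data, smaller is better.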
Summary of this article
statsmodels produces the results of a multiple regression quite easily. However, when we work on real tasks, it is also a good idea to use other libraries to show the data graphically.