Data Science, Regression Models

Introduction to Regression Models

By Tirthankar Ghosh · 28 June 2022 · 4 min read


Introduction

Data science has deep roots in statistics and has since evolved into a computer science field that overlaps with what we commonly refer to as data mining. We'll be using Python's scikit-learn library together with NumPy to perform our tasks.


Regression

Regression is the task of fitting a function f(x) to a given set of data so that predictions can be made for other values with similar behavior. In this article I'll be evaluating a few regression models and their accuracy over a set of randomly generated data. We'll be talking about the following regression methods:


Note: The choice of method should be driven by analysing the distribution of the data, which will be discussed in a later article. For the experiment we'll be using a log plot, y = log(x) + noise, and a sine plot, y = sin(x), and evaluating the results visually.


Linear regression

Polynomial regression

Regression trees

Polynomial regression

This is the task of deriving a polynomial equation to interpret the data set; in the degree-1 case it takes the familiar form y = mx + c.
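As a minimal sketch of the degree-1 case (using NumPy's polyfit rather than the scikit-learn pipeline shown later), fitting a line to noise-free points recovers the slope m and intercept c directly:

```python
import numpy as np

# Noise-free data lying exactly on the line y = 2x + 1
x = np.arange(10, dtype=float)
y = 2 * x + 1

# Fit a degree-1 polynomial; coefficients are returned
# highest degree first, so this unpacks as slope, intercept
m, c = np.polyfit(x, y, deg=1)
print(m, c)  # slope ≈ 2, intercept ≈ 1
```

With noisy data the same call returns the least-squares estimates of m and c instead of the exact values.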

Linear regression (polynomial of degree 1)

Obviously log(x) is not linear, so linear regression is not quite a fix, as is clear from the figure shown. But we can see that over the range 20–80 the predictions are quite okay. This is because the noise has made the distribution fairly linear over that range.

Polynomial regression (higher degree)

The figure demonstrates curve fitting using polynomials of degree 2, 3 and 4. We can see that most of the points are covered quite well even by the degree-2 polynomial.


Advantages of polynomial regression

  • Fast. Polynomial fitting can be done quite fast for a reasonably large data set (fewer than 10,000 entries) and is just a matter of a few matrix multiplications.

  • Can extrapolate predictions. This means that predictions can be made outside of the training set, because we have an equation that accepts any real value and produces a real output. (We'll see how this holds up for regression trees as we move on.)
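A quick sketch of the extrapolation point: a polynomial pipeline trained on x in [1, 100] still evaluates cleanly at x = 150, well outside the training range. (Whether the extrapolated value is any good depends on the data; the point here is only that the closed-form equation produces a finite answer at all.)

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(1, 100, 100))
y = np.log(x) + rng.normal(0, 0.1, 100)

model = make_pipeline(PolynomialFeatures(2), Ridge())
model.fit(x.reshape(-1, 1), y)

# Predict outside the training range: the fitted polynomial
# happily evaluates at any real input
pred = model.predict(np.array([[150.0]]))
print(pred.shape)  # (1,)
```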

Decision tree regression


A decision tree is a tree whose nodes are conditions, with the true and false outcomes of each condition on edges leading to further conditional nodes. Most of the time these are used for classification purposes (as shown in the diagram), where the data exists in categorical form. Yet regression trees can be made by having value estimates in place of truths and ranges in place of categorical conditionals. But the outputs will be in discrete form (we'll see how).
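A small sketch of the "discrete outputs" point: a fitted DecisionTreeRegressor can only ever return one of the values stored at its leaves, so the number of distinct predictions is bounded by the number of leaves, no matter how many inputs you query.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(1, 100, 200))
y = np.log(x) + rng.normal(0, 0.1, 200)

tree = DecisionTreeRegressor(max_depth=3)
tree.fit(x.reshape(-1, 1), y)

preds = tree.predict(x.reshape(-1, 1))
# A depth-3 binary tree has at most 2**3 = 8 leaves,
# hence at most 8 distinct output values for 200 inputs
print(len(np.unique(preds)))
```

This is why the tree's regression plot looks like a staircase rather than a smooth curve.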

The diagram demonstrates the accuracy of the decision tree regression plot for log(x) with some noise added. We can see that the deeper the tree, the higher the accuracy. It also seems to provide a better estimate than linear regression, though it is quite questionable in comparison with the polynomial fit. But it is indeed noise tolerant: as you can see, it has covered most of the points despite the noise we added, whereas the polynomial gracefully covered just the points in its way.


Aha! What have we got here? Recall "Can extrapolate predictions" — this is where it matters: if the points happen to fall outside of the training set, a tree utterly fails to predict. In this scenario I built the model for a sine curve over the range (0, 2π) and predicted over the range (0, 4π). The predictions simply hold the value at the last training point from there on, without doing any good.


Decision tree advantages

  • Noise tolerance. For a larger data set, noise is cancelled out by adjusting the conditions and the eventual conditional probabilities going down the tree.

  • More accuracy on categorical data. When the data are categorical in nature this seems to be the better approach; what can linear regression do there anyway?

  • Fast predictions on big data. Prediction is just a matter of traversing the tree, which is usually faster than matrix operations at large scale.

Looking for disadvantages? If you notice any, you are probably using the wrong tool for the job.

Coding examples for the above illustrations

Polynomial regression

import warnings
warnings.filterwarnings(action="ignore", module="scipy", message="^internal gelsd")
import numpy as np
import matplotlib.pyplot as plt
# Regression tools
from sklearn import linear_model
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import Ridge

# Data set for model generation
x = np.random.uniform(1, 100, 100)
x = np.sort(x)
y = np.log(x) + np.random.normal(0, .1, 100)

# linear model
lr = linear_model.LinearRegression()
lr.fit(x.reshape(-1,1), y)
linear_predictions = lr.predict(x.reshape(-1,1))

# polynomial model
plt.figure()
for degree in range(2,5):
    model = make_pipeline(PolynomialFeatures(degree), Ridge())
    model.fit(x.reshape(-1,1), y)
    polynomial_predictions = model.predict(x.reshape(-1,1))
    plt.plot(x, polynomial_predictions, label="degree %d" % degree)
plt.scatter(x, y, label="data points")
plt.legend(loc='lower right')

plt.figure()
plt.scatter(x, y, label="data points")
plt.plot(x, linear_predictions, 'r', label="linear")
plt.legend(loc='lower right')
plt.show() 
Decision tree regression
import numpy as np
import matplotlib.pyplot as plt
# Regression tools
from sklearn.tree import DecisionTreeRegressor

# Data set for model generation
x = np.random.uniform(1, 100, 100)
x = np.sort(x)
y = np.log(x) + np.random.normal(0, .1, 100)

# Consider few models with different depths
model_1 = DecisionTreeRegressor(max_depth=2)
model_2 = DecisionTreeRegressor(max_depth=5)

model_1.fit(x.reshape(-1,1), y)
model_2.fit(x.reshape(-1,1), y)

# Get the predicted Y values
y_1 = model_1.predict(x.reshape(-1,1))
y_2 = model_2.predict(x.reshape(-1,1))

plt.figure()
plt.scatter(x, y, label="data points", s=10)
plt.plot(x, y_1,  label="depth 2 plot", color='r')
plt.plot(x, y_2, label="depth 5 plot", color='g')
plt.legend(loc='lower right')
plt.show()

# Generating inputs for a sine curve (np.arange output is already sorted)
x_sin = np.arange(0, 2 * np.pi, 0.1)
y_sin = np.sin(x_sin)  # + np.random.normal(0, .1, len(x_sin))

# Fitting sine curve with depth 5 tree
model_3 = DecisionTreeRegressor(max_depth=5)
model_3.fit(x_sin.reshape(-1,1), y_sin)
x_sin_2 = np.arange(0, 4 * np.pi, 0.1)
y_sin_2 = np.sin(x_sin_2)

y_3 = model_3.predict(x_sin_2.reshape(-1,1))
plt.figure()
plt.plot(x_sin_2, y_sin_2, label="data points")
plt.plot(x_sin_2, y_3, label="depth 5 plot", color='g')
plt.legend(loc='lower right')
plt.show()
