learn-python/contrib/machine-learning/polynomial-regression.md

# Polynomial Regression

Polynomial Regression is a form of regression analysis in which the relationship between the independent variable $x$ and the dependent variable $y$ is modeled as an $nth$ degree polynomial. This guide provides an overview of polynomial regression, including its fundamental concepts, assumptions, and how to implement it using Python.

## Introduction

Polynomial Regression is used when the data shows a non-linear relationship between the independent variable $x$ and the dependent variable $y$ is modeled as an $nth$ degree polynomial. It extends the simple linear regression model by considering polynomial terms of the independent variable, allowing for a more flexible fit to the data.

## Concepts

### Polynomial Equation

The polynomial regression model is based on the following polynomial equation:

$$
\[ y = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3 + \cdots + \beta_n x^n + \epsilon \]
$$

Where:
- $y$ is the dependent variable.
- $x$ is the independent variable.
- $\beta_0, \beta_1, \ldots, \beta_n$ are the coefficients of the polynomial.
- $\epsilon$ is the error term.

### Degree of Polynomial

The degree of the polynomial (n) determines the flexibility of the model. A higher degree allows the model to fit more complex, non-linear relationships, but it also increases the risk of overfitting.

### Overfitting and Underfitting

- **Overfitting**: When the model fits the noise in the training data too closely, resulting in poor generalization to new data.
- **Underfitting**: When the model is too simple to capture the underlying pattern in the data.

## Assumptions

1. **Independence**: Observations are independent of each other.
2. **Homoscedasticity**: The variance of the residuals (errors) is constant across all levels of the independent variable.
3. **Normality**: The residuals of the model are normally distributed.
4. **No Multicollinearity**: The predictor variables are not highly correlated with each other.

## Implementation

### Using Scikit-learn

Scikit-learn is a popular machine learning library in Python that provides tools for polynomial regression.

### Code Example

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Load dataset
data = pd.read_csv('path/to/your/dataset.csv')

# Define features and target variable
X = data[['feature']]
y = data['target']

# Transform features to polynomial features
poly = PolynomialFeatures(degree=3)
X_poly = poly.fit_transform(X)

# Initialize and train polynomial regression model
model = LinearRegression()
model.fit(X_poly, y)

# Make predictions
y_pred = model.predict(X_poly)

# Evaluate the model
mse = mean_squared_error(y, y_pred)
r2 = r2_score(y, y_pred)
print("Mean Squared Error:", mse)
print("R^2 Score:", r2)

# Visualize the results
plt.scatter(X, y, color='blue')
plt.plot(X, y_pred, color='red')
plt.xlabel('Feature')
plt.ylabel('Target')
plt.title('Polynomial Regression')
plt.show()
```

## Evaluation Metrics

- **Mean Squared Error (MSE)**: The average of the squared differences between actual and predicted values.
- **R-squared (R²) Score**: A statistical measure that represents the proportion of the variance for the dependent variable that is explained by the independent variables in the model.

## Conclusion

Polynomial Regression is a powerful tool for modeling non-linear relationships between variables. It is important to choose the degree of the polynomial carefully to balance between underfitting and overfitting. Understanding and properly evaluating the model using appropriate metrics ensures its effectiveness.

## References

- [Scikit-learn Documentation](https://scikit-learn.org/stable/modules/linear_model.html#polynomial-regression)
- [Wikipedia: Polynomial Regression](https://en.wikipedia.org/wiki/Polynomial_reg)