diff --git a/contrib/machine-learning/Principal Component Analysis .md b/contrib/machine-learning/Principal Component Analysis .md
new file mode 100644
index 0000000..aed6773
--- /dev/null
+++ b/contrib/machine-learning/Principal Component Analysis .md
@@ -0,0 +1,142 @@
+### Principal Component Analysis (PCA)
+
+#### Introduction
+Principal Component Analysis (PCA) is a statistical technique used in machine learning and data analysis for dimensionality reduction. It transforms a dataset of possibly correlated variables into a set of linearly uncorrelated variables called principal components. This simplifies high-dimensional data while retaining as much of its variance as possible.
+
+#### How PCA Works
+PCA involves several steps, each transforming the dataset in a specific way to achieve dimensionality reduction:
+
+1. **Standardize the Data**:
+   - Standardize the dataset so that each feature has a mean of zero and a variance of one. This step is crucial because PCA is sensitive to the scale of the variables.
+
+   ```python
+   from sklearn.preprocessing import StandardScaler
+
+   scaler = StandardScaler()
+   X_scaled = scaler.fit_transform(X)
+   ```
+
+2. **Covariance Matrix Computation**:
+   - Compute the covariance matrix, which captures how each pair of features in the dataset varies together around the mean.
+
+   ```python
+   import numpy as np
+
+   covariance_matrix = np.cov(X_scaled.T)
+   ```
+
+3. **Eigenvalues and Eigenvectors Calculation**:
+   - Calculate the eigenvalues and eigenvectors of the covariance matrix. The eigenvectors define the directions of the new feature space, while the eigenvalues give the amount of variance captured along each direction.
+
+   ```python
+   # eigh is preferable to eig here: the covariance matrix is symmetric,
+   # so eigh is numerically stabler and always returns real eigenvalues.
+   eigenvalues, eigenvectors = np.linalg.eigh(covariance_matrix)
+   ```
+
+4. **Sort Eigenvalues and Eigenvectors**:
+   - Sort the eigenvalues and their corresponding eigenvectors in descending order. The eigenvectors corresponding to the largest eigenvalues are the principal components.
+
+   ```python
+   idx = np.argsort(eigenvalues)[::-1]
+   eigenvectors = eigenvectors[:, idx]
+   eigenvalues = eigenvalues[idx]
+   ```
+
+5. **Principal Components Selection**:
+   - Select the top \( k \) eigenvectors to form the matrix that projects the data onto the new feature subspace. One common way to choose \( k \) is shown in the sketch after the applications list below.
+
+   ```python
+   k = 2  # for example, selecting the top 2 components
+   principal_components = eigenvectors[:, :k]
+   ```
+
+6. **Transform the Data**:
+   - Project the standardized dataset onto the new feature subspace.
+
+   ```python
+   X_pca = np.dot(X_scaled, principal_components)
+   ```
+
+#### Applications of PCA
+PCA is widely used in various fields to simplify data analysis and visualization:
+
+- **Image Compression**: Reducing the dimensionality of image data so that images can be stored with less memory.
+- **Noise Reduction**: Filtering noise out of data by keeping only the most important components (see the reconstruction sketch below).
+- **Data Visualization**: Projecting high-dimensional data to 2D or 3D for easier visualization.
+- **Feature Extraction**: Identifying the most significant features in a dataset for use in other machine learning models.
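+
+#### Choosing the Number of Components
+The sorted eigenvalues from step 4 also answer a practical question: how many components to keep. Below is a minimal sketch of one common heuristic, keeping the smallest \( k \) whose components jointly explain a chosen fraction of the total variance. The helper name `choose_k` and the 95% default threshold are illustrative choices, not part of any library API.
+
+```python
+import numpy as np
+
+def choose_k(eigenvalues, threshold=0.95):
+    """Return the smallest k whose top-k components explain >= threshold.
+
+    Assumes `eigenvalues` is the descending-sorted array from step 4.
+    """
+    explained = eigenvalues / eigenvalues.sum()
+    cumulative = np.cumsum(explained)
+    # np.argmax returns the first index where the condition holds (0-based),
+    # so add 1 to turn that index into a component count.
+    return int(np.argmax(cumulative >= threshold)) + 1
+```
+
+For instance, with eigenvalues `[4.0, 1.5, 0.3, 0.2]` the cumulative explained-variance fractions are roughly 0.67, 0.92, 0.97, and 1.0, so `choose_k` returns 3 at the default threshold.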
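+
+#### Reconstructing Data from the Components
+Image compression and noise reduction both rely on mapping the reduced data back to the original feature space: projecting onto the top \( k \) components and back discards the low-variance directions. A minimal sketch, assuming `X_pca`, `principal_components`, `scaler`, and `np` from the steps above are still in scope:
+
+```python
+# Map the k-dimensional representation back to the standardized space.
+# The eigenvectors are orthonormal, so the transpose inverts the projection.
+X_reconstructed_scaled = np.dot(X_pca, principal_components.T)
+
+# Undo the standardization to return to the original units.
+X_reconstructed = scaler.inverse_transform(X_reconstructed_scaled)
+```
+
+The reconstruction is lossy whenever \( k \) is smaller than the number of features; what is discarded is exactly the variance along the dropped directions, which is why the same operation underlies both compression and denoising.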
+
+#### Example of PCA in Python
+Here’s an end-to-end example using the `scikit-learn` library:
+
+```python
+import numpy as np
+from sklearn.decomposition import PCA
+from sklearn.preprocessing import StandardScaler
+
+# Sample data: the four points lie on the line y = x + 1,
+# so a single component captures all of the variance.
+X = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])
+
+# Standardize the data
+scaler = StandardScaler()
+X_scaled = scaler.fit_transform(X)
+
+# Apply PCA, keeping one component
+pca = PCA(n_components=1)
+X_pca = pca.fit_transform(X_scaled)
+
+print("Original Data:\n", X)
+print("Transformed Data:\n", X_pca)
+# Fraction of the total variance captured by each kept component
+print("Explained variance ratio:", pca.explained_variance_ratio_)
+```
+
+#### Conclusion
+Principal Component Analysis (PCA) is a powerful tool for reducing the dimensionality of datasets while preserving as much variance as possible. It is especially useful in exploratory data analysis and as a preprocessing step for other machine learning algorithms.
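+
+#### Sanity Check: Manual Steps vs. scikit-learn
+As a closing check, the manual pipeline and the library can be compared directly. One caveat: the sign of each principal component is arbitrary (an eigenvector and its negation span the same direction), so the comparison should ignore signs. A minimal sketch, assuming `X_scaled`, `X_pca`, and `k` from the step-by-step section are still in scope and the eigenvalues are distinct:
+
+```python
+import numpy as np
+from sklearn.decomposition import PCA
+
+X_sklearn = PCA(n_components=k).fit_transform(X_scaled)
+# Columns may differ by a sign flip, so compare absolute values.
+assert np.allclose(np.abs(X_pca), np.abs(X_sklearn))
+```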