### Principal Component Analysis (PCA)

#### Introduction

Principal Component Analysis (PCA) is a statistical technique used in machine learning and data analysis for dimensionality reduction. It transforms a dataset of possibly correlated variables into a set of linearly uncorrelated variables called principal components, simplifying high-dimensional data while retaining as much of its variability as possible.

#### How PCA Works

PCA involves several steps, each transforming the dataset in a specific way to achieve dimensionality reduction:

1. **Standardize the Data**:
   - Ensure the dataset is standardized so that each feature has a mean of zero and a variance of one. This step is crucial because PCA is sensitive to the scale of the variables.

```python
from sklearn.preprocessing import StandardScaler

# X is the original feature matrix of shape (n_samples, n_features)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
```
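
As a quick check (assuming `X_scaled` from above), every column should now have a mean of roughly zero and a standard deviation of roughly one:

```python
print(X_scaled.mean(axis=0))  # approximately 0 for each feature
print(X_scaled.std(axis=0))   # approximately 1 for each feature
```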

2. **Covariance Matrix Computation**:
   - Compute the covariance matrix to understand how the variables in the dataset vary around their means with respect to each other.

```python
import numpy as np

# np.cov expects variables in rows, so transpose the (n_samples, n_features) matrix
covariance_matrix = np.cov(X_scaled.T)
```

3. **Eigenvalues and Eigenvectors Calculation**:
   - Calculate the eigenvalues and eigenvectors of the covariance matrix. The eigenvectors determine the directions of the new feature space, while the eigenvalues measure how much variance lies along each of those directions.

```python
# np.linalg.eigh is the more robust choice for a symmetric matrix such as a
# covariance matrix, but np.linalg.eig works here as well
eigenvalues, eigenvectors = np.linalg.eig(covariance_matrix)
```
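
Each eigenpair satisfies the defining relation: multiplying the covariance matrix by an eigenvector scales that eigenvector by its eigenvalue. A quick sanity check, assuming the variables from the previous steps:

```python
# Each column v of `eigenvectors` satisfies covariance_matrix @ v == eigenvalue * v
print(np.allclose(covariance_matrix @ eigenvectors, eigenvectors * eigenvalues))  # True
```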

4. **Sort Eigenvalues and Eigenvectors**:
   - Sort the eigenvalues and their corresponding eigenvectors in descending order. The eigenvectors corresponding to the largest eigenvalues are the principal components.

```python
# Indices that order the eigenvalues from largest to smallest
idx = np.argsort(eigenvalues)[::-1]
eigenvectors = eigenvectors[:, idx]
eigenvalues = eigenvalues[idx]
```

5. **Principal Components Selection**:
   - Select the top \( k \) eigenvectors to form a matrix that will transform the data to a new feature subspace.

```python
k = 2  # for example, selecting the top 2 components
principal_components = eigenvectors[:, :k]
```
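
In practice, \( k \) is often chosen from the data rather than fixed in advance, by keeping enough components to explain a target share of the variance. A minimal sketch, continuing from the sorted eigenvalues above and using a common (but arbitrary) 95% threshold:

```python
# Fraction of the total variance captured by each component
explained_variance_ratio = eigenvalues / eigenvalues.sum()
cumulative_variance = np.cumsum(explained_variance_ratio)

# Smallest number of components that explains at least 95% of the variance
k = int(np.searchsorted(cumulative_variance, 0.95)) + 1
```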

6. **Transform the Data**:
   - Project the standardized dataset onto the new feature subspace.

```python
X_pca = np.dot(X_scaled, principal_components)
```
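
A final sanity check ties the steps together (assuming the variables above are still in scope): the transformed columns are uncorrelated, and the variance along each principal component equals its eigenvalue:

```python
# Off-diagonal entries are close to zero: the components are uncorrelated
print(np.cov(X_pca.T))

# Sample variance along each component matches the corresponding eigenvalue
print(np.allclose(np.var(X_pca, axis=0, ddof=1), eigenvalues[:k]))  # True
```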

#### Applications of PCA

PCA is widely used in various fields to simplify data analysis and visualization:

- **Image Compression**: Reducing the dimensionality of image data so that images can be stored with less memory.
- **Noise Reduction**: Filtering noise out of data by keeping only the most important components.
- **Data Visualization**: Projecting high-dimensional data to 2D or 3D for easier visualization, as in the sketch after this list.
- **Feature Extraction**: Identifying the most significant features in a dataset for use in other machine learning models.
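
As an illustration of the visualization use case, here is a minimal sketch that projects scikit-learn's built-in 4-dimensional Iris dataset down to two components (matplotlib and the Iris data are assumptions of this example, not part of the steps above):

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Standardize the 4-dimensional Iris data, then project it to 2D
iris = load_iris()
X_scaled = StandardScaler().fit_transform(iris.data)
X_2d = PCA(n_components=2).fit_transform(X_scaled)

# Color each point by its species to see how well the classes separate
plt.scatter(X_2d[:, 0], X_2d[:, 1], c=iris.target)
plt.xlabel("First principal component")
plt.ylabel("Second principal component")
plt.show()
```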

#### Example of PCA in Python

Here’s an example demonstrating PCA using the `scikit-learn` library:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Sample data
X = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])

# Standardize the data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Apply PCA, keeping a single component
pca = PCA(n_components=1)
X_pca = pca.fit_transform(X_scaled)

print("Original Data:\n", X)
print("Transformed Data:\n", X_pca)
```
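
To see how much of the original variance the retained component captures, the fitted `PCA` object exposes the ratio directly:

```python
# Fraction of the total variance explained by each retained component
print("Explained variance ratio:", pca.explained_variance_ratio_)
```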

#### Conclusion

Principal Component Analysis (PCA) is a powerful tool for reducing the dimensionality of a dataset while preserving as much of its variance as possible. It is especially useful in exploratory data analysis and as a preprocessing step for other machine learning algorithms.