Author

Zeju Li

Principal Component Analysis (PCA)

Adapted from https://mml-book.github.io/

We will implement the PCA algorithm using the projection perspective. We will first implement PCA, then apply it to the MNIST digit dataset.

Install required packages

!pip install numpy
!pip install timeit
!pip install matplotlib
!pip install ipywidgets
!pip install scikit-learn
Requirement already satisfied: numpy in /opt/anaconda3/lib/python3.12/site-packages (1.26.4)
ERROR: Could not find a version that satisfies the requirement timeit (from versions: none)
ERROR: No matching distribution found for timeit
Requirement already satisfied: matplotlib in /opt/anaconda3/lib/python3.12/site-packages (3.9.2)
Requirement already satisfied: contourpy>=1.0.1 in /opt/anaconda3/lib/python3.12/site-packages (from matplotlib) (1.2.0)
Requirement already satisfied: cycler>=0.10 in /opt/anaconda3/lib/python3.12/site-packages (from matplotlib) (0.11.0)
Requirement already satisfied: fonttools>=4.22.0 in /opt/anaconda3/lib/python3.12/site-packages (from matplotlib) (4.51.0)
Requirement already satisfied: kiwisolver>=1.3.1 in /opt/anaconda3/lib/python3.12/site-packages (from matplotlib) (1.4.4)
Requirement already satisfied: numpy>=1.23 in /opt/anaconda3/lib/python3.12/site-packages (from matplotlib) (1.26.4)
Requirement already satisfied: packaging>=20.0 in /opt/anaconda3/lib/python3.12/site-packages (from matplotlib) (24.1)
Requirement already satisfied: pillow>=8 in /opt/anaconda3/lib/python3.12/site-packages (from matplotlib) (10.4.0)
Requirement already satisfied: pyparsing>=2.3.1 in /opt/anaconda3/lib/python3.12/site-packages (from matplotlib) (3.1.2)
Requirement already satisfied: python-dateutil>=2.7 in /opt/anaconda3/lib/python3.12/site-packages (from matplotlib) (2.9.0.post0)
Requirement already satisfied: six>=1.5 in /opt/anaconda3/lib/python3.12/site-packages (from python-dateutil>=2.7->matplotlib) (1.16.0)
Requirement already satisfied: ipywidgets in /opt/anaconda3/lib/python3.12/site-packages (7.8.1)
Requirement already satisfied: comm>=0.1.3 in /opt/anaconda3/lib/python3.12/site-packages (from ipywidgets) (0.2.1)
Requirement already satisfied: ipython-genutils~=0.2.0 in /opt/anaconda3/lib/python3.12/site-packages (from ipywidgets) (0.2.0)
Requirement already satisfied: traitlets>=4.3.1 in /opt/anaconda3/lib/python3.12/site-packages (from ipywidgets) (5.14.3)
Requirement already satisfied: widgetsnbextension~=3.6.6 in /opt/anaconda3/lib/python3.12/site-packages (from ipywidgets) (3.6.6)
Requirement already satisfied: ipython>=4.0.0 in /opt/anaconda3/lib/python3.12/site-packages (from ipywidgets) (8.27.0)
Requirement already satisfied: jupyterlab-widgets<3,>=1.0.0 in /opt/anaconda3/lib/python3.12/site-packages (from ipywidgets) (1.0.0)
Requirement already satisfied: decorator in /opt/anaconda3/lib/python3.12/site-packages (from ipython>=4.0.0->ipywidgets) (5.1.1)
Requirement already satisfied: jedi>=0.16 in /opt/anaconda3/lib/python3.12/site-packages (from ipython>=4.0.0->ipywidgets) (0.19.1)
Requirement already satisfied: matplotlib-inline in /opt/anaconda3/lib/python3.12/site-packages (from ipython>=4.0.0->ipywidgets) (0.1.6)
Requirement already satisfied: prompt-toolkit<3.1.0,>=3.0.41 in /opt/anaconda3/lib/python3.12/site-packages (from ipython>=4.0.0->ipywidgets) (3.0.43)
Requirement already satisfied: pygments>=2.4.0 in /opt/anaconda3/lib/python3.12/site-packages (from ipython>=4.0.0->ipywidgets) (2.15.1)
Requirement already satisfied: stack-data in /opt/anaconda3/lib/python3.12/site-packages (from ipython>=4.0.0->ipywidgets) (0.2.0)
Requirement already satisfied: pexpect>4.3 in /opt/anaconda3/lib/python3.12/site-packages (from ipython>=4.0.0->ipywidgets) (4.8.0)
Requirement already satisfied: notebook>=4.4.1 in /opt/anaconda3/lib/python3.12/site-packages (from widgetsnbextension~=3.6.6->ipywidgets) (7.2.2)
Requirement already satisfied: parso<0.9.0,>=0.8.3 in /opt/anaconda3/lib/python3.12/site-packages (from jedi>=0.16->ipython>=4.0.0->ipywidgets) (0.8.3)
Requirement already satisfied: jupyter-server<3,>=2.4.0 in /opt/anaconda3/lib/python3.12/site-packages (from notebook>=4.4.1->widgetsnbextension~=3.6.6->ipywidgets) (2.14.1)
Requirement already satisfied: jupyterlab-server<3,>=2.27.1 in /opt/anaconda3/lib/python3.12/site-packages (from notebook>=4.4.1->widgetsnbextension~=3.6.6->ipywidgets) (2.27.3)
Requirement already satisfied: jupyterlab<4.3,>=4.2.0 in /opt/anaconda3/lib/python3.12/site-packages (from notebook>=4.4.1->widgetsnbextension~=3.6.6->ipywidgets) (4.2.5)
Requirement already satisfied: notebook-shim<0.3,>=0.2 in /opt/anaconda3/lib/python3.12/site-packages (from notebook>=4.4.1->widgetsnbextension~=3.6.6->ipywidgets) (0.2.3)
Requirement already satisfied: tornado>=6.2.0 in /opt/anaconda3/lib/python3.12/site-packages (from notebook>=4.4.1->widgetsnbextension~=3.6.6->ipywidgets) (6.4.1)
Requirement already satisfied: ptyprocess>=0.5 in /opt/anaconda3/lib/python3.12/site-packages (from pexpect>4.3->ipython>=4.0.0->ipywidgets) (0.7.0)
Requirement already satisfied: wcwidth in /opt/anaconda3/lib/python3.12/site-packages (from prompt-toolkit<3.1.0,>=3.0.41->ipython>=4.0.0->ipywidgets) (0.2.5)
Requirement already satisfied: executing in /opt/anaconda3/lib/python3.12/site-packages (from stack-data->ipython>=4.0.0->ipywidgets) (0.8.3)
Requirement already satisfied: asttokens in /opt/anaconda3/lib/python3.12/site-packages (from stack-data->ipython>=4.0.0->ipywidgets) (2.0.5)
Requirement already satisfied: pure-eval in /opt/anaconda3/lib/python3.12/site-packages (from stack-data->ipython>=4.0.0->ipywidgets) (0.2.2)
Requirement already satisfied: anyio>=3.1.0 in /opt/anaconda3/lib/python3.12/site-packages (from jupyter-server<3,>=2.4.0->notebook>=4.4.1->widgetsnbextension~=3.6.6->ipywidgets) (4.2.0)
Requirement already satisfied: argon2-cffi>=21.1 in /opt/anaconda3/lib/python3.12/site-packages (from jupyter-server<3,>=2.4.0->notebook>=4.4.1->widgetsnbextension~=3.6.6->ipywidgets) (21.3.0)
Requirement already satisfied: jinja2>=3.0.3 in /opt/anaconda3/lib/python3.12/site-packages (from jupyter-server<3,>=2.4.0->notebook>=4.4.1->widgetsnbextension~=3.6.6->ipywidgets) (3.1.4)
Requirement already satisfied: jupyter-client>=7.4.4 in /opt/anaconda3/lib/python3.12/site-packages (from jupyter-server<3,>=2.4.0->notebook>=4.4.1->widgetsnbextension~=3.6.6->ipywidgets) (8.6.0)
Requirement already satisfied: jupyter-core!=5.0.*,>=4.12 in /opt/anaconda3/lib/python3.12/site-packages (from jupyter-server<3,>=2.4.0->notebook>=4.4.1->widgetsnbextension~=3.6.6->ipywidgets) (5.7.2)
Requirement already satisfied: jupyter-events>=0.9.0 in /opt/anaconda3/lib/python3.12/site-packages (from jupyter-server<3,>=2.4.0->notebook>=4.4.1->widgetsnbextension~=3.6.6->ipywidgets) (0.10.0)
Requirement already satisfied: jupyter-server-terminals>=0.4.4 in /opt/anaconda3/lib/python3.12/site-packages (from jupyter-server<3,>=2.4.0->notebook>=4.4.1->widgetsnbextension~=3.6.6->ipywidgets) (0.4.4)
Requirement already satisfied: nbconvert>=6.4.4 in /opt/anaconda3/lib/python3.12/site-packages (from jupyter-server<3,>=2.4.0->notebook>=4.4.1->widgetsnbextension~=3.6.6->ipywidgets) (7.16.4)
Requirement already satisfied: nbformat>=5.3.0 in /opt/anaconda3/lib/python3.12/site-packages (from jupyter-server<3,>=2.4.0->notebook>=4.4.1->widgetsnbextension~=3.6.6->ipywidgets) (5.10.4)
Requirement already satisfied: overrides>=5.0 in /opt/anaconda3/lib/python3.12/site-packages (from jupyter-server<3,>=2.4.0->notebook>=4.4.1->widgetsnbextension~=3.6.6->ipywidgets) (7.4.0)
Requirement already satisfied: packaging>=22.0 in /opt/anaconda3/lib/python3.12/site-packages (from jupyter-server<3,>=2.4.0->notebook>=4.4.1->widgetsnbextension~=3.6.6->ipywidgets) (24.1)
Requirement already satisfied: prometheus-client>=0.9 in /opt/anaconda3/lib/python3.12/site-packages (from jupyter-server<3,>=2.4.0->notebook>=4.4.1->widgetsnbextension~=3.6.6->ipywidgets) (0.14.1)
Requirement already satisfied: pyzmq>=24 in /opt/anaconda3/lib/python3.12/site-packages (from jupyter-server<3,>=2.4.0->notebook>=4.4.1->widgetsnbextension~=3.6.6->ipywidgets) (25.1.2)
Requirement already satisfied: send2trash>=1.8.2 in /opt/anaconda3/lib/python3.12/site-packages (from jupyter-server<3,>=2.4.0->notebook>=4.4.1->widgetsnbextension~=3.6.6->ipywidgets) (1.8.2)
Requirement already satisfied: terminado>=0.8.3 in /opt/anaconda3/lib/python3.12/site-packages (from jupyter-server<3,>=2.4.0->notebook>=4.4.1->widgetsnbextension~=3.6.6->ipywidgets) (0.17.1)
Requirement already satisfied: websocket-client>=1.7 in /opt/anaconda3/lib/python3.12/site-packages (from jupyter-server<3,>=2.4.0->notebook>=4.4.1->widgetsnbextension~=3.6.6->ipywidgets) (1.8.0)
Requirement already satisfied: async-lru>=1.0.0 in /opt/anaconda3/lib/python3.12/site-packages (from jupyterlab<4.3,>=4.2.0->notebook>=4.4.1->widgetsnbextension~=3.6.6->ipywidgets) (2.0.4)
Requirement already satisfied: httpx>=0.25.0 in /opt/anaconda3/lib/python3.12/site-packages (from jupyterlab<4.3,>=4.2.0->notebook>=4.4.1->widgetsnbextension~=3.6.6->ipywidgets) (0.27.0)
Requirement already satisfied: ipykernel>=6.5.0 in /opt/anaconda3/lib/python3.12/site-packages (from jupyterlab<4.3,>=4.2.0->notebook>=4.4.1->widgetsnbextension~=3.6.6->ipywidgets) (6.28.0)
Requirement already satisfied: jupyter-lsp>=2.0.0 in /opt/anaconda3/lib/python3.12/site-packages (from jupyterlab<4.3,>=4.2.0->notebook>=4.4.1->widgetsnbextension~=3.6.6->ipywidgets) (2.2.0)
Requirement already satisfied: setuptools>=40.1.0 in /opt/anaconda3/lib/python3.12/site-packages (from jupyterlab<4.3,>=4.2.0->notebook>=4.4.1->widgetsnbextension~=3.6.6->ipywidgets) (75.1.0)
Requirement already satisfied: babel>=2.10 in /opt/anaconda3/lib/python3.12/site-packages (from jupyterlab-server<3,>=2.27.1->notebook>=4.4.1->widgetsnbextension~=3.6.6->ipywidgets) (2.11.0)
Requirement already satisfied: json5>=0.9.0 in /opt/anaconda3/lib/python3.12/site-packages (from jupyterlab-server<3,>=2.27.1->notebook>=4.4.1->widgetsnbextension~=3.6.6->ipywidgets) (0.9.6)
Requirement already satisfied: jsonschema>=4.18.0 in /opt/anaconda3/lib/python3.12/site-packages (from jupyterlab-server<3,>=2.27.1->notebook>=4.4.1->widgetsnbextension~=3.6.6->ipywidgets) (4.23.0)
Requirement already satisfied: requests>=2.31 in /opt/anaconda3/lib/python3.12/site-packages (from jupyterlab-server<3,>=2.27.1->notebook>=4.4.1->widgetsnbextension~=3.6.6->ipywidgets) (2.32.3)
Requirement already satisfied: six in /opt/anaconda3/lib/python3.12/site-packages (from asttokens->stack-data->ipython>=4.0.0->ipywidgets) (1.16.0)
Requirement already satisfied: idna>=2.8 in /opt/anaconda3/lib/python3.12/site-packages (from anyio>=3.1.0->jupyter-server<3,>=2.4.0->notebook>=4.4.1->widgetsnbextension~=3.6.6->ipywidgets) (3.7)
Requirement already satisfied: sniffio>=1.1 in /opt/anaconda3/lib/python3.12/site-packages (from anyio>=3.1.0->jupyter-server<3,>=2.4.0->notebook>=4.4.1->widgetsnbextension~=3.6.6->ipywidgets) (1.3.0)
Requirement already satisfied: argon2-cffi-bindings in /opt/anaconda3/lib/python3.12/site-packages (from argon2-cffi>=21.1->jupyter-server<3,>=2.4.0->notebook>=4.4.1->widgetsnbextension~=3.6.6->ipywidgets) (21.2.0)
Requirement already satisfied: pytz>=2015.7 in /opt/anaconda3/lib/python3.12/site-packages (from babel>=2.10->jupyterlab-server<3,>=2.27.1->notebook>=4.4.1->widgetsnbextension~=3.6.6->ipywidgets) (2024.1)
Requirement already satisfied: certifi in /opt/anaconda3/lib/python3.12/site-packages (from httpx>=0.25.0->jupyterlab<4.3,>=4.2.0->notebook>=4.4.1->widgetsnbextension~=3.6.6->ipywidgets) (2024.8.30)
Requirement already satisfied: httpcore==1.* in /opt/anaconda3/lib/python3.12/site-packages (from httpx>=0.25.0->jupyterlab<4.3,>=4.2.0->notebook>=4.4.1->widgetsnbextension~=3.6.6->ipywidgets) (1.0.2)
Requirement already satisfied: h11<0.15,>=0.13 in /opt/anaconda3/lib/python3.12/site-packages (from httpcore==1.*->httpx>=0.25.0->jupyterlab<4.3,>=4.2.0->notebook>=4.4.1->widgetsnbextension~=3.6.6->ipywidgets) (0.14.0)
Requirement already satisfied: appnope in /opt/anaconda3/lib/python3.12/site-packages (from ipykernel>=6.5.0->jupyterlab<4.3,>=4.2.0->notebook>=4.4.1->widgetsnbextension~=3.6.6->ipywidgets) (0.1.3)
Requirement already satisfied: debugpy>=1.6.5 in /opt/anaconda3/lib/python3.12/site-packages (from ipykernel>=6.5.0->jupyterlab<4.3,>=4.2.0->notebook>=4.4.1->widgetsnbextension~=3.6.6->ipywidgets) (1.6.7)
Requirement already satisfied: nest-asyncio in /opt/anaconda3/lib/python3.12/site-packages (from ipykernel>=6.5.0->jupyterlab<4.3,>=4.2.0->notebook>=4.4.1->widgetsnbextension~=3.6.6->ipywidgets) (1.6.0)
Requirement already satisfied: psutil in /opt/anaconda3/lib/python3.12/site-packages (from ipykernel>=6.5.0->jupyterlab<4.3,>=4.2.0->notebook>=4.4.1->widgetsnbextension~=3.6.6->ipywidgets) (5.9.0)
Requirement already satisfied: MarkupSafe>=2.0 in /opt/anaconda3/lib/python3.12/site-packages (from jinja2>=3.0.3->jupyter-server<3,>=2.4.0->notebook>=4.4.1->widgetsnbextension~=3.6.6->ipywidgets) (2.1.3)
Requirement already satisfied: attrs>=22.2.0 in /opt/anaconda3/lib/python3.12/site-packages (from jsonschema>=4.18.0->jupyterlab-server<3,>=2.27.1->notebook>=4.4.1->widgetsnbextension~=3.6.6->ipywidgets) (23.1.0)
Requirement already satisfied: jsonschema-specifications>=2023.03.6 in /opt/anaconda3/lib/python3.12/site-packages (from jsonschema>=4.18.0->jupyterlab-server<3,>=2.27.1->notebook>=4.4.1->widgetsnbextension~=3.6.6->ipywidgets) (2023.7.1)
Requirement already satisfied: referencing>=0.28.4 in /opt/anaconda3/lib/python3.12/site-packages (from jsonschema>=4.18.0->jupyterlab-server<3,>=2.27.1->notebook>=4.4.1->widgetsnbextension~=3.6.6->ipywidgets) (0.30.2)
Requirement already satisfied: rpds-py>=0.7.1 in /opt/anaconda3/lib/python3.12/site-packages (from jsonschema>=4.18.0->jupyterlab-server<3,>=2.27.1->notebook>=4.4.1->widgetsnbextension~=3.6.6->ipywidgets) (0.10.6)
Requirement already satisfied: python-dateutil>=2.8.2 in /opt/anaconda3/lib/python3.12/site-packages (from jupyter-client>=7.4.4->jupyter-server<3,>=2.4.0->notebook>=4.4.1->widgetsnbextension~=3.6.6->ipywidgets) (2.9.0.post0)
Requirement already satisfied: platformdirs>=2.5 in /opt/anaconda3/lib/python3.12/site-packages (from jupyter-core!=5.0.*,>=4.12->jupyter-server<3,>=2.4.0->notebook>=4.4.1->widgetsnbextension~=3.6.6->ipywidgets) (3.10.0)
Requirement already satisfied: python-json-logger>=2.0.4 in /opt/anaconda3/lib/python3.12/site-packages (from jupyter-events>=0.9.0->jupyter-server<3,>=2.4.0->notebook>=4.4.1->widgetsnbextension~=3.6.6->ipywidgets) (2.0.7)
Requirement already satisfied: pyyaml>=5.3 in /opt/anaconda3/lib/python3.12/site-packages (from jupyter-events>=0.9.0->jupyter-server<3,>=2.4.0->notebook>=4.4.1->widgetsnbextension~=3.6.6->ipywidgets) (6.0.1)
Requirement already satisfied: rfc3339-validator in /opt/anaconda3/lib/python3.12/site-packages (from jupyter-events>=0.9.0->jupyter-server<3,>=2.4.0->notebook>=4.4.1->widgetsnbextension~=3.6.6->ipywidgets) (0.1.4)
Requirement already satisfied: rfc3986-validator>=0.1.1 in /opt/anaconda3/lib/python3.12/site-packages (from jupyter-events>=0.9.0->jupyter-server<3,>=2.4.0->notebook>=4.4.1->widgetsnbextension~=3.6.6->ipywidgets) (0.1.1)
Requirement already satisfied: beautifulsoup4 in /opt/anaconda3/lib/python3.12/site-packages (from nbconvert>=6.4.4->jupyter-server<3,>=2.4.0->notebook>=4.4.1->widgetsnbextension~=3.6.6->ipywidgets) (4.12.3)
Requirement already satisfied: bleach!=5.0.0 in /opt/anaconda3/lib/python3.12/site-packages (from nbconvert>=6.4.4->jupyter-server<3,>=2.4.0->notebook>=4.4.1->widgetsnbextension~=3.6.6->ipywidgets) (4.1.0)
Requirement already satisfied: defusedxml in /opt/anaconda3/lib/python3.12/site-packages (from nbconvert>=6.4.4->jupyter-server<3,>=2.4.0->notebook>=4.4.1->widgetsnbextension~=3.6.6->ipywidgets) (0.7.1)
Requirement already satisfied: jupyterlab-pygments in /opt/anaconda3/lib/python3.12/site-packages (from nbconvert>=6.4.4->jupyter-server<3,>=2.4.0->notebook>=4.4.1->widgetsnbextension~=3.6.6->ipywidgets) (0.1.2)
Requirement already satisfied: mistune<4,>=2.0.3 in /opt/anaconda3/lib/python3.12/site-packages (from nbconvert>=6.4.4->jupyter-server<3,>=2.4.0->notebook>=4.4.1->widgetsnbextension~=3.6.6->ipywidgets) (2.0.4)
Requirement already satisfied: nbclient>=0.5.0 in /opt/anaconda3/lib/python3.12/site-packages (from nbconvert>=6.4.4->jupyter-server<3,>=2.4.0->notebook>=4.4.1->widgetsnbextension~=3.6.6->ipywidgets) (0.8.0)
Requirement already satisfied: pandocfilters>=1.4.1 in /opt/anaconda3/lib/python3.12/site-packages (from nbconvert>=6.4.4->jupyter-server<3,>=2.4.0->notebook>=4.4.1->widgetsnbextension~=3.6.6->ipywidgets) (1.5.0)
Requirement already satisfied: tinycss2 in /opt/anaconda3/lib/python3.12/site-packages (from nbconvert>=6.4.4->jupyter-server<3,>=2.4.0->notebook>=4.4.1->widgetsnbextension~=3.6.6->ipywidgets) (1.2.1)
Requirement already satisfied: fastjsonschema>=2.15 in /opt/anaconda3/lib/python3.12/site-packages (from nbformat>=5.3.0->jupyter-server<3,>=2.4.0->notebook>=4.4.1->widgetsnbextension~=3.6.6->ipywidgets) (2.16.2)
Requirement already satisfied: charset-normalizer<4,>=2 in /opt/anaconda3/lib/python3.12/site-packages (from requests>=2.31->jupyterlab-server<3,>=2.27.1->notebook>=4.4.1->widgetsnbextension~=3.6.6->ipywidgets) (3.3.2)
Requirement already satisfied: urllib3<3,>=1.21.1 in /opt/anaconda3/lib/python3.12/site-packages (from requests>=2.31->jupyterlab-server<3,>=2.27.1->notebook>=4.4.1->widgetsnbextension~=3.6.6->ipywidgets) (2.2.3)
Requirement already satisfied: webencodings in /opt/anaconda3/lib/python3.12/site-packages (from bleach!=5.0.0->nbconvert>=6.4.4->jupyter-server<3,>=2.4.0->notebook>=4.4.1->widgetsnbextension~=3.6.6->ipywidgets) (0.5.1)
Requirement already satisfied: fqdn in /opt/anaconda3/lib/python3.12/site-packages (from jsonschema[format-nongpl]>=4.18.0->jupyter-events>=0.9.0->jupyter-server<3,>=2.4.0->notebook>=4.4.1->widgetsnbextension~=3.6.6->ipywidgets) (1.5.1)
Requirement already satisfied: isoduration in /opt/anaconda3/lib/python3.12/site-packages (from jsonschema[format-nongpl]>=4.18.0->jupyter-events>=0.9.0->jupyter-server<3,>=2.4.0->notebook>=4.4.1->widgetsnbextension~=3.6.6->ipywidgets) (20.11.0)
Requirement already satisfied: jsonpointer>1.13 in /opt/anaconda3/lib/python3.12/site-packages (from jsonschema[format-nongpl]>=4.18.0->jupyter-events>=0.9.0->jupyter-server<3,>=2.4.0->notebook>=4.4.1->widgetsnbextension~=3.6.6->ipywidgets) (2.1)
Requirement already satisfied: uri-template in /opt/anaconda3/lib/python3.12/site-packages (from jsonschema[format-nongpl]>=4.18.0->jupyter-events>=0.9.0->jupyter-server<3,>=2.4.0->notebook>=4.4.1->widgetsnbextension~=3.6.6->ipywidgets) (1.3.0)
Requirement already satisfied: webcolors>=24.6.0 in /opt/anaconda3/lib/python3.12/site-packages (from jsonschema[format-nongpl]>=4.18.0->jupyter-events>=0.9.0->jupyter-server<3,>=2.4.0->notebook>=4.4.1->widgetsnbextension~=3.6.6->ipywidgets) (24.11.1)
Requirement already satisfied: cffi>=1.0.1 in /opt/anaconda3/lib/python3.12/site-packages (from argon2-cffi-bindings->argon2-cffi>=21.1->jupyter-server<3,>=2.4.0->notebook>=4.4.1->widgetsnbextension~=3.6.6->ipywidgets) (1.17.1)
Requirement already satisfied: soupsieve>1.2 in /opt/anaconda3/lib/python3.12/site-packages (from beautifulsoup4->nbconvert>=6.4.4->jupyter-server<3,>=2.4.0->notebook>=4.4.1->widgetsnbextension~=3.6.6->ipywidgets) (2.5)
Requirement already satisfied: pycparser in /opt/anaconda3/lib/python3.12/site-packages (from cffi>=1.0.1->argon2-cffi-bindings->argon2-cffi>=21.1->jupyter-server<3,>=2.4.0->notebook>=4.4.1->widgetsnbextension~=3.6.6->ipywidgets) (2.21)
Requirement already satisfied: arrow>=0.15.0 in /opt/anaconda3/lib/python3.12/site-packages (from isoduration->jsonschema[format-nongpl]>=4.18.0->jupyter-events>=0.9.0->jupyter-server<3,>=2.4.0->notebook>=4.4.1->widgetsnbextension~=3.6.6->ipywidgets) (1.2.3)
Requirement already satisfied: scikit-learn in /opt/anaconda3/lib/python3.12/site-packages (1.5.1)
Requirement already satisfied: numpy>=1.19.5 in /opt/anaconda3/lib/python3.12/site-packages (from scikit-learn) (1.26.4)
Requirement already satisfied: scipy>=1.6.0 in /opt/anaconda3/lib/python3.12/site-packages (from scikit-learn) (1.13.1)
Requirement already satisfied: joblib>=1.2.0 in /opt/anaconda3/lib/python3.12/site-packages (from scikit-learn) (1.4.2)
Requirement already satisfied: threadpoolctl>=3.1.0 in /opt/anaconda3/lib/python3.12/site-packages (from scikit-learn) (3.5.0)

Learning objectives

  1. Write code that implements PCA.
  2. Write code that implements PCA for high-dimensional datasets

Let’s first import the packages we need for this week.

# PACKAGE: DO NOT EDIT THIS CELL
import numpy as np
import timeit
import matplotlib as mpl
# mpl.use('Agg')
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
from ipywidgets import interact
# PACKAGE: DO NOT EDIT THIS CELL
from sklearn.datasets import fetch_openml
images, labels = fetch_openml('mnist_784', version=1, return_X_y=True)
images = images.to_numpy()
# %matplotlib inline

Now, let’s plot a digit from the dataset:

plt.figure(figsize=(4,4))
plt.imshow(images[0].reshape(28,28), cmap='gray');
plt.show()

Before we implement PCA, we will need to do some data preprocessing. In this assessment, some of them will be implemented by you, others we will take care of. However, when you are working on real world problems, you will need to do all these steps by yourself.

The preprocessing steps we will do are 1. Convert unsigned interger 8 (uint8) encoding of pixels to a floating point number between 0 and 1. 2. Subtract from each image the mean \(\boldsymbol \mu\). 3. Scale each dimension of each image by \(\frac{1}{\sigma}\) where \(\sigma\) is the stardard deviation.

The steps above ensure that our images will have zero mean and one variance. These preprocessing steps are also known as Data Normalization or Feature Scaling.

1. PCA

Now we will implement PCA. Before we do that, let’s pause for a moment and think about the steps for performing PCA. Assume that we are performing PCA on some dataset \(\boldsymbol X\) for \(M\) principal components. We then need to perform the following steps, which we break into parts:

  1. Data normalization (normalize).
  2. Find eigenvalues and corresponding eigenvectors for the covariance matrix \(S\). Sort by the largest eigenvalues and the corresponding eigenvectors (eig).

After these steps, we can then compute the projection and reconstruction of the data onto the spaced spanned by the top \(n\) eigenvectors.

def normalize(X):
    """Normalize the given dataset X
    Args:
        X: ndarray, dataset

    Returns:
        (Xbar, mean, std): tuple of ndarray, Xbar is the normalized dataset
        with mean 0 and standard deviation 1; mean and std are the
        mean and standard deviation respectively.

    Note:
        You will encounter dimensions where the standard deviation is
        zero, for those when you do normalization the normalized data
        will be NaN. Handle this by setting using `std = 1` for those
        dimensions when doing normalization.
    """
    mu = np.mean(X, axis=0)
    std = np.std(X, axis=0)
    std_filled = std.copy()
    std_filled[std==0] = 1.
    Xbar = ((X-mu)/std_filled)
    return Xbar, mu, std

def eig(S):
    """Compute the eigenvalues and corresponding eigenvectors
        for the covariance matrix S.
    Args:
        S: ndarray, covariance matrix

    Returns:
        (eigvals, eigvecs): ndarray, the eigenvalues and eigenvectors

    Note:
        the eigenvals and eigenvecs should be sorted in descending
        order of the eigen values
    """
    eigvals, eigvecs = np.linalg.eig(S)
    k = np.argsort(eigvals)[::-1]
    return eigvals[k], eigvecs[:,k]

def projection_matrix(B):
    """Compute the projection matrix onto the space spanned by `B`
    Args:
        B: ndarray of dimension (D, M), the basis for the subspace

    Returns:
        P: the projection matrix
    """
    return (B @ np.linalg.inv(B.T @ B) @ B.T)

def PCA(X, num_components):
    """
    Args:
        X: ndarray of size (N, D), where D is the dimension of the data,
           and N is the number of datapoints
        num_components: the number of principal components to use.
    Returns:
        X_reconstruct: ndarray of the reconstruction
        of X from the first `num_components` principal components.
    """
    # first perform normalization on the digits so that they have zero mean and unit variance
    # Then compute the data covariance matrix S
    S = 1.0/len(X) * np.dot(X.T, X)

    # Next find eigenvalues and corresponding eigenvectors for S
    eig_vals, eig_vecs = eig(S)

    # find indices for the largest eigenvalues, use them to sort the eigenvalues and
    # corresponding eigenvectors. Take a look at the documenation fo `np.argsort`
    # (https://docs.scipy.org/doc/numpy/reference/generated/numpy.argsort.html),
    # which might be useful
    eig_vals, eig_vecs = eig_vals[:num_components], eig_vecs[:, :num_components]

    # dimensionality reduction of the original data
    B = np.real(eig_vecs)
    # Z = X.T.dot(W)
    # reconstruct the images from the lower dimensional representation
    reconst = (projection_matrix(B) @ X.T)
    return reconst.T
## Some preprocessing of the data
NUM_DATAPOINTS = 1000
X = (images.reshape(-1, 28 * 28)[:NUM_DATAPOINTS]) / 255.
Xbar, mu, std = normalize(X)
for num_component in range(1, 20):
    from sklearn.decomposition import PCA as SKPCA
    # We can compute a standard solution given by scikit-learn's implementation of PCA
    pca = SKPCA(n_components=num_component, svd_solver='full')
    sklearn_reconst = pca.inverse_transform(pca.fit_transform(Xbar))
    reconst = PCA(Xbar, num_component)
    np.testing.assert_almost_equal(reconst, sklearn_reconst)
    print(np.square(reconst - sklearn_reconst).sum())
2.9591152600708595e-25
5.463148533300457e-25
1.408735053395555e-24
1.2965345188177177e-24
1.3503663481091028e-24
8.92008612601377e-24
4.389638173732418e-24
2.1540659521801493e-24
1.779295762481672e-23
1.095931897896876e-23
2.945893804975438e-23
4.6077038690533556e-24
1.4489632306835883e-23
6.703160142504867e-23
1.0675673066729429e-23
2.5221095525701676e-23
5.69712442148524e-23
6.134549755904185e-24
6.570322026001618e-23

The greater number of of principal components we use, the smaller will our reconstruction error be. Now, let’s answer the following question:

How many principal components do we need in order to reach a Mean Squared Error (MSE) of less than \(100\) for our dataset?

We have provided a function in the next cell that computes the mean squared error (MSE), which will be useful for answering the question above.

def mse(predict, actual):
    """Helper function for computing the mean squared error (MSE)"""
    return np.square(predict - actual).sum(axis=1).mean()
loss = []
reconstructions = []

# iterate over different numbers of principal components, and compute the MSE
for num_component in range(1, 100):
    reconst = PCA(Xbar, num_component)
    error = mse(reconst, Xbar)
    reconstructions.append(reconst)
    print('n = {:d}, reconstruction_error = {:f}'.format(num_component, error))
    loss.append((num_component, error))

reconstructions = np.asarray(reconstructions)
reconstructions = reconstructions * std + mu # "unnormalize" the reconstructed image
loss = np.asarray(loss)
n = 1, reconstruction_error = 569.447737
n = 2, reconstruction_error = 536.059608
n = 3, reconstruction_error = 508.250286
n = 4, reconstruction_error = 487.018907
n = 5, reconstruction_error = 467.571610
n = 6, reconstruction_error = 451.178087
n = 7, reconstruction_error = 436.538484
n = 8, reconstruction_error = 422.992021
n = 9, reconstruction_error = 411.014047
n = 10, reconstruction_error = 399.584085
n = 11, reconstruction_error = 389.584568
n = 12, reconstruction_error = 379.839784
n = 13, reconstruction_error = 371.029342
n = 14, reconstruction_error = 362.366899
n = 15, reconstruction_error = 353.938557
n = 16, reconstruction_error = 345.925507
n = 17, reconstruction_error = 338.175478
n = 18, reconstruction_error = 330.708730
n = 19, reconstruction_error = 323.733031
n = 20, reconstruction_error = 316.938865
n = 21, reconstruction_error = 310.496983
n = 22, reconstruction_error = 304.201992
n = 23, reconstruction_error = 298.172601
n = 24, reconstruction_error = 292.313512
n = 25, reconstruction_error = 286.605986
n = 26, reconstruction_error = 281.249763
n = 27, reconstruction_error = 276.073512
n = 28, reconstruction_error = 271.090739
n = 29, reconstruction_error = 266.285363
n = 30, reconstruction_error = 261.536181
n = 31, reconstruction_error = 256.968481
n = 32, reconstruction_error = 252.536167
n = 33, reconstruction_error = 248.151479
n = 34, reconstruction_error = 243.860069
n = 35, reconstruction_error = 239.744423
n = 36, reconstruction_error = 235.681797
n = 37, reconstruction_error = 231.745039
n = 38, reconstruction_error = 227.906508
n = 39, reconstruction_error = 224.109339
n = 40, reconstruction_error = 220.497057
n = 41, reconstruction_error = 217.007581
n = 42, reconstruction_error = 213.601517
n = 43, reconstruction_error = 210.313706
n = 44, reconstruction_error = 207.052983
n = 45, reconstruction_error = 203.817955
n = 46, reconstruction_error = 200.617975
n = 47, reconstruction_error = 197.504203
n = 48, reconstruction_error = 194.441104
n = 49, reconstruction_error = 191.411091
n = 50, reconstruction_error = 188.489604
n = 51, reconstruction_error = 185.638727
n = 52, reconstruction_error = 182.800523
n = 53, reconstruction_error = 180.035721
n = 54, reconstruction_error = 177.273539
n = 55, reconstruction_error = 174.597458
n = 56, reconstruction_error = 171.989803
n = 57, reconstruction_error = 169.483188
n = 58, reconstruction_error = 166.996179
n = 59, reconstruction_error = 164.572977
n = 60, reconstruction_error = 162.177082
n = 61, reconstruction_error = 159.814261
n = 62, reconstruction_error = 157.509342
n = 63, reconstruction_error = 155.275526
n = 64, reconstruction_error = 153.045000
n = 65, reconstruction_error = 150.834582
n = 66, reconstruction_error = 148.699171
n = 67, reconstruction_error = 146.587905
n = 68, reconstruction_error = 144.507996
n = 69, reconstruction_error = 142.451446
n = 70, reconstruction_error = 140.434531
n = 71, reconstruction_error = 138.436785
n = 72, reconstruction_error = 136.472570
n = 73, reconstruction_error = 134.533930
n = 74, reconstruction_error = 132.651860
n = 75, reconstruction_error = 130.794802
n = 76, reconstruction_error = 128.982040
n = 77, reconstruction_error = 127.186982
n = 78, reconstruction_error = 125.420004
n = 79, reconstruction_error = 123.673373
n = 80, reconstruction_error = 121.940057
n = 81, reconstruction_error = 120.229814
n = 82, reconstruction_error = 118.547715
n = 83, reconstruction_error = 116.882256
n = 84, reconstruction_error = 115.286841
n = 85, reconstruction_error = 113.693019
n = 86, reconstruction_error = 112.118001
n = 87, reconstruction_error = 110.562349
n = 88, reconstruction_error = 109.025932
n = 89, reconstruction_error = 107.518042
n = 90, reconstruction_error = 106.055453
n = 91, reconstruction_error = 104.614065
n = 92, reconstruction_error = 103.188570
n = 93, reconstruction_error = 101.782066
n = 94, reconstruction_error = 100.389584
n = 95, reconstruction_error = 99.014701
n = 96, reconstruction_error = 97.680056
n = 97, reconstruction_error = 96.367284
n = 98, reconstruction_error = 95.067511
n = 99, reconstruction_error = 93.776217
import pandas as pd
# create a table showing the number of principal components and MSE
pd.DataFrame(loss).head()
0 1
0 1.0 569.447737
1 2.0 536.059608
2 3.0 508.250286
3 4.0 487.018907
4 5.0 467.571610

We can also put these numbers into perspective by plotting them.

fig, ax = plt.subplots()
ax.plot(loss[:,0], loss[:,1]);
ax.axhline(100, linestyle='--', color='r', linewidth=2)
ax.xaxis.set_ticks(np.arange(1, 100, 5));
ax.set(xlabel='num_components', ylabel='MSE', title='MSE vs number of principal components');
plt.show()

But numbers dont’t tell us everything! Just what does it mean qualitatively for the loss to decrease from around \(450.0\) to less than \(100.0\)?

Let’s find out! In the next cell, we draw the the leftmost image is the original dight. Then we show the reconstruction of the image on the right, in descending number of principal components used.

def show_num_components_reconst(image_idx):
    fig, ax = plt.subplots(figsize=(20., 20.))
    actual = X[image_idx]
    # concatenate the actual and reconstructed images as large image before plotting it
    x = np.concatenate([actual[np.newaxis, :], reconstructions[:, image_idx]])
    ax.imshow(np.hstack(x.reshape(-1, 28, 28)[np.arange(10)]),
              cmap='gray');
    ax.axvline(28, color='orange', linewidth=2)
    plt.show()
show_num_components_reconst(0)
show_num_components_reconst(1)
show_num_components_reconst(2)

def show_pca_digits(i=1):
    """Show the i th digit and its reconstruction at the last iteration"""
    plt.figure(figsize=(4,4))
    actual_sample = X[i].reshape(28,28)
    reconst_sample = (reconst[i, :] * std + mu).reshape(28, 28)
    plt.imshow(np.hstack([actual_sample, reconst_sample]), cmap='gray')
    plt.show()
show_pca_digits(0)
show_pca_digits(1)
show_pca_digits(2)

2. PCA for high-dimensional datasets

Sometimes, the dimensionality of our dataset may be larger than the number of samples we have. Then it might be inefficient to perform PCA with our implementation above. Instead, we can implement PCA in a more efficient manner, which we call “PCA for high dimensional data” (PCA_high_dim).

Below are the steps for performing PCA for high dimensional dataset 1. Compute the matrix \(\boldsymbol X\boldsymbol X^T\) (a \(N\) by \(N\) matrix with \(N \ll D\)) 2. Compute eigenvalues \(\lambda\)s and eigenvectors \(V\) for \(\boldsymbol X\boldsymbol X^T\) 3. Compute the eigenvectors for the original covariance matrix as \(\boldsymbol X^T\boldsymbol V\). Choose the eigenvectors associated with the M largest eigenvalues to be the basis of the principal subspace \(U\). 4. Compute the orthogonal projection of the data onto the subspace spanned by columns of \(\boldsymbol U\).

### PCA for high-dimensional datasets

def PCA_high_dim(X, n_components):
    """Compute PCA for small sample size but high-dimensional features.
    Args:
        X: ndarray of size (N, D), where D is the dimension of the sample,
           and N is the number of samples
        num_components: the number of principal components to use.
    Returns:
        X_reconstruct: (N, D) ndarray. the reconstruction
        of X from the first `num_components` pricipal components.
    """
    N, D = X.shape
    M = np.dot(X, X.T) / N
    eig_vals, eig_vecs = eig(M)
    eig_vals, eig_vecs = eig_vals[:n_components], eig_vecs[:, :n_components]
    U = (X.T @ (eig_vecs))
    answer = np.zeros((N, D))
    answer = ((U @ np.linalg.inv(U.T @ U) @ U.T) @ X.T).T
    return answer

Given the same dataset, PCA_high_dim and PCA should give the same output. Assuming we have implemented PCA, correctly, we can then use PCA to test the correctness of PCA_high_dim. Given the same dataset, PCA and PCA_high_dim should give identical results.

We can use this invariant to test our implementation of PCA_high_dim, assuming that we have correctly implemented PCA.

np.testing.assert_almost_equal(PCA(Xbar, 2), PCA_high_dim(Xbar, 2))

Now let’s compare the running time between PCA and PCA_high_dim.

Tips for running benchmarks or computationally expensive code:

When you have some computation that takes up a non-negligible amount of time. Try separating the code that produces output from the code that analyzes the result (e.g. plot the results, compute statistics of the results). In this way, you don’t have to recompute when you want to produce more analysis.

The next cell includes a function that records the time taken for executing a function f by repeating it for repeat number of times. You do not need to modify the function but you can use it to compare the running time for functions which you are interested in knowing the running time.

def time(f, repeat=10):
    times = []
    for _ in range(repeat):
        start = timeit.default_timer()
        f()
        stop = timeit.default_timer()
        times.append(stop-start)
    return np.mean(times), np.std(times)

We first benchmark the time taken to compute \(\boldsymbol X^T\boldsymbol X\) and \(\boldsymbol X\boldsymbol X^T\). Jupyter’s magic command %time is quite handy.

The next cell finds the running time for computing \(\boldsymbol X^T\boldsymbol X\) and \(\boldsymbol X\boldsymbol X^T\) for different dimensions of \(\boldsymbol X\).

times_mm0 = []
times_mm1 = []

# iterate over datasets of different size
for datasetsize in np.arange(4, 784, step=20):
    XX = Xbar[:datasetsize] # select the first `datasetsize` samples in the dataset
    # record the running time for computing X.T @ X
    mu, sigma = time(lambda : XX.T @ XX)
    times_mm0.append((datasetsize, mu, sigma))

    # record the running time for computing X @ X.T
    mu, sigma = time(lambda : XX @ XX.T)
    times_mm1.append((datasetsize, mu, sigma))

times_mm0 = np.asarray(times_mm0)
times_mm1 = np.asarray(times_mm1)

Having recorded the running time for computing X @ X.T and X @ X.T, we can plot them.

fig, ax = plt.subplots()
ax.set(xlabel='size of dataset', ylabel='running time')
bar = ax.errorbar(times_mm0[:, 0], times_mm0[:, 1], times_mm0[:, 2], label="$X^T X$ (PCA)", linewidth=2)
ax.errorbar(times_mm1[:, 0], times_mm1[:, 1], times_mm1[:, 2], label="$X X^T$ (PCA_high_dim)", linewidth=2)
ax.legend();

Alternatively, use the time magic command for benchmarking functions.

%time Xbar.T @ Xbar
%time Xbar @ Xbar.T
pass # Put this here so that our output does not show result of computing `Xbar @ Xbar.T`
CPU times: user 230 ms, sys: 3.78 ms, total: 234 ms
Wall time: 63.9 ms
CPU times: user 34.7 ms, sys: 8.6 ms, total: 43.3 ms
Wall time: 9.97 ms

Next we benchmark PCA, PCA_high_dim.

times0 = []
times1 = []

# iterate over datasets of different size
for datasetsize in np.arange(4, 784, step=100):
    XX = Xbar[:datasetsize]
    npc = 2
    mu, sigma = time(lambda : PCA(XX, npc), repeat=10)
    times0.append((datasetsize, mu, sigma))

    mu, sigma = time(lambda : PCA_high_dim(XX, npc), repeat=10)
    times1.append((datasetsize, mu, sigma))

times0 = np.asarray(times0)
times1 = np.asarray(times1)

Let’s plot the running time. Spend some time and think about what this plot means. We mentioned in lectures that PCA_high_dim are advantageous when we have dataset size \(N\) < data dimension \(M\). Although our plot does not for the two running time does not intersect exactly at \(N = M\), it does show the trend.

fig, ax = plt.subplots()
ax.set(xlabel='number of datapoints', ylabel='run time')
ax.errorbar(times0[:, 0], times0[:, 1], times0[:, 2], label="PCA", linewidth=2)
ax.errorbar(times1[:, 0], times1[:, 1], times1[:, 2], label="PCA_high_dim", linewidth=2)
ax.legend();

Again, with the magic command time.

%time PCA(Xbar, 2)
%time PCA_high_dim(Xbar, 2)
pass
CPU times: user 2.62 s, sys: 109 ms, total: 2.73 s
Wall time: 305 ms
CPU times: user 6.29 s, sys: 500 ms, total: 6.79 s
Wall time: 739 ms
Back to top

Reuse

Citation

For attribution, please cite this work as:
Li, Zeju. n.d. “Principal Component Analysis (PCA).” https://zerojumpline.github.io//teaching/2025-08-08-Pattern Recognition/code_1_pca.html.