{ "cells": [ { "cell_type": "markdown", "id": "307ca7c6-35be-4dcd-b335-58410aee5bd2", "metadata": { "tags": [] }, "source": [ "# Stereo Geometry\n", "\n", "This notebook visualizes the geometry between two views called epipolar geometry.\n", "\n", "**Subjects are covered:**\n", "1. **Definitions of epipolar geometry, the Fundamental Matrix, and the Essential Matrix.**\n", "2. **Visualizing epipolar geometry.**\n", "3. **8 point algorithm for computing the Fundamental matrix.**\n", "4. **Deriving relative camera poses from the essential matrix.**\n", "5. **Conclusion**\n", "6. **Sources**\n", " \n", "Section 1 is meant as a theoretical underpinning of the methods used. It can be skipped by readers only interested in the algorithm's implementation. Many of the descriptions in this section I have taken, sometimes verbatim, from [Multiple View Geometry in Computer Vision](https://www.robots.ox.ac.uk/~vgg/hzbook/) by Richard Hartley and Andrew Zisserman, [epipolar geometry Wikipedia articles](https://en.wikipedia.org/wiki/Essential_matrix), and [Computer Vision Algorithms and Applications](https://szeliski.org/Book/) by Richard Szeliski.\n", " \n", "**Why are we interested in these subjects?** \n", "Structure and depth are inherently ambiguous from single views. \n", "Understanding stereo geometry will allow us to overcome this ambiguity and estimate many useful values, such as: \n", "* **The 3D structure of a scene.** Given calibrated cameras, the 3D reconstruction can be known up to a similarity transform. \n", " Additionally, if a landmark with known dimensions and pose is observed, a Euclidean reconstruction can be computed. \n", " This allows for applications such as: \n", " * measuring objects in the scene (photogrammetry) \n", " * collision avoidance for autonomous robots (depth perception)\n", " * inserting objects into a scene (virtual reality)\n", " * superimposing textures onto the scene (virtual reality)\n", " * creating 3D models of real-world objects (metric reconstruction).\n", "* **The relative pose of a camera.** Knowing the relative pose will allow for a more targeted search of point correspondences. Stereo geometry provides closed-form solutions for calculating relative camera poses from point correspondences. Also, when performing camera pose estimation with iterative non-linear methods (e.g. Levenberg Marquardt in notebook 2), it is crucial to have an initial pose estimate.\n", "\n", "**Further reading** \n", "For a comprehensive treatment of epipolar geometry, see chapters nine to eleven of the textbook [Multiple view geometry in computer vision](https://www.robots.ox.ac.uk/~vgg/hzbook/) by Richard Hartley and Andrew Zisserman. For this subject, I would not recommend video lectures. The notations in many lectures was inconsistent, and I eventually found it was best understood through reading and drawing." ] }, { "cell_type": "markdown", "id": "8f8978a0-913b-453f-877d-aa236e782038", "metadata": { "jp-MarkdownHeadingCollapsed": true, "tags": [] }, "source": [ "# Epipolar Geometry\n", "Epipolar geometry is the intrinsic geometry between two views. It is independent of scene structure and only depends on the cameras' internal parameters and relative pose. \n", "This geometry is essentially the geometry of the intersection of the two camera image planes with a pencil of planes having the baseline as axis. Here, the baseline is the line joining the camera centers. (images taken from [Multiple View Geometry in Computer Vision](https://www.robots.ox.ac.uk/~vgg/hzbook/)).\n", "\n", "
\n", " \n", "
\n", " \n", "\n", "$ \\mathbf {C,C}'$ - The camera projection centers \n", "$ \\mathbf {X}$ - A 3D point \n", "$ \\mathbf {x,x'}$ - The 2D images of $X$ \n", "$ \\mathbf {\\pi}$ - The epipolar plane \n", "$ \\mathbf {e, e'}$ - The epipoles \n", "$ \\mathbf {l, l'}$ - The epipolar lines \n", "$ \\mathbf {P, P'}$ - The camera matrices\n", "\n", "* The epipole is the point of intersection of the line joining the camera centers (the baseline) with the image plane.\n", "* Equivalently, the epipole is the image in one view of the other view's camera center.\n", "* $ \\mathbf {X,x,x',C, C'}$ are coplanar. This plane is denoted as $ \\mathbf{\\pi}$. The rays back-projected from $ \\mathbf {x}$ and $ \\mathbf {x'}$ intersect at $ \\mathbf {X}$, and also lie in $ \\mathbf {\\pi}$. \n", "* The epipolar plane must contain the baseline. Therefore, there is a one-parameter family (a pencil) of epipolar planes. \n", "* The epipolar line is the intersection of an epipolar plane with the image plane. \n", "* All epipolar lines intersect at the epipole. \n", "* An epipolar plane defines the correspondence between the two epipolar lines. \n", "* A segment of the epipolar line can be found by projecting the other camera's back-projected ray of the image of $ \\mathbf {X}$.\n", "* For each point $ \\mathbf {x}$ in one image, there exists a corresponding epipolar line $ \\mathbf {l'}$ in the other image. Any point $ \\mathbf {x'}$ in the second image matching point $ \\mathbf {x}$ must lie on the epipolar line $ \\mathbf {l'}$.\n", "* **Recap.** In homogenous coordinates, the cross product between two points is the line passing through those points. The cross product of two lines is the point where they intersect. The dot product of a line and a point on that line is always $0$.\n" ] }, { "cell_type": "markdown", "id": "ecc29f0a-74cd-4ac0-8df6-79d6cda9bb9a", "metadata": {}, "source": [ "# The Fundamental Matrix\n", "The fundamental matrix is the algebraic representation of epipolar geometry. That is the above variables and their interrelations are captured in this matrix.\n", "\n", "Mostly simply: \n", "\n", "An epipolar line is a projection in the second image of the ray from the point $ \\mathbf {x}$ through the camera center $ \\mathbf {C}$ of the first camera. Thus, there is a map \n", "\n", "$$ \\mathbf {x \\rightarrow l'}$$ \n", "This mapping is a projective mapping from points to lines, which can be represented by the Fundamental matrix $\\mathbf{F}$. \n", "\n", "$$ \\mathbf{l'} = \\mathbf{Fx} $$\n", "\n", "Because each corresponding point $\\mathbf{x'}$ lies on $\\mathbf{l'}$ and has a dot product of 0, $\\mathbf{F}$ defines the constraint\n", "\n", "$$ \\mathbf{x'} ^{\\mathbf{T}} \\mathbf{F} \\mathbf{x} = 0$$\n", "\n", "\n", "### Algebraic Derivation of the Fundamental Matrix \n", "This derivation is taken from the book Multiple View Geometry in Computer Vision. \n", "\n", "The ray back-projected from $\\mathbf{x}$ is obtained by solving $\\mathbf{PX = x}$. The one-parameter family of solutions is of the form \n", "\n", "$$ \\mathbf{X}(\\lambda) = \\mathbf{P^{+} x} + \\lambda \\mathbf{C}$$\n", " \n", "Keep in mind these are homogenous coordinates, meaning the last coordinate is rescaled to 1. Essentially, we are choosing a point on a line segment between camera center $\\mathbf{C}$ and 3D point $\\mathbf{P^{+} x}$. In particular, two 3D points on the ray are $\\mathbf{P^{+} x}$ at $\\lambda = 0$, and $\\mathbf{C}$ at $\\lambda = \\infty$. Here $\\mathbf{P^{+}}$ is the psuedo-inverse of $\\mathbf{P}$, and $\\mathbf{C}$ is its null-vector, namely the camera center, defined by $\\mathbf{PC}=\\mathbf{0}$. These two points are imaged by the second camera $\\mathbf{P'}$ at $\\mathbf{P'}\\mathbf{P^{+} x}$ and $\\mathbf{P'C}$ respectively. The epipolar line is the line joining these two projeted points, namely $\\mathbf{l'} = (\\mathbf{P'C})\\times (\\mathbf{P'}\\mathbf{P^{+} x})$. The point $\\mathbf{P'C}$ is the epipole in the second image, namely the projection of the first camera center, and may be denoted by $\\mathbf{e'}$. Thus, $\\mathbf{l'} = [\\mathbf{e'}]_{\\times} (\\mathbf{P'}\\mathbf{P^{+}) x} = \\mathbf{Fx} $, where $\\mathbf{F}$ is the matrix \n", "\n", "$$ \\mathbf{F} = [\\mathbf{e'}]_{\\times} (\\mathbf{P'}\\mathbf{P^{+})} $$ \n", "\n", "Thus formalizing $ \\mathbf {x \\rightarrow l'}$ to the equation \n", "\n", "$$ \\mathbf{l'} = \\mathbf{Fx}$$\n", "\n", "Given that any point $w$ on $\\mathbf{l'}$ will have the dot product $w \\cdot \\mathbf{l'} = 0$, we can infer the following: \n", "The fundamental matrix satisfies the condition that for any pair of corresponding points $\\mathbf{x} \\leftrightarrow \\mathbf{x'}$ in the two images\n", "\n", "$$ \\mathbf{x'} ^{\\mathbf{T}} \\mathbf{F} \\mathbf{x} = 0$$\n", "\n", "The importance of the relation is that it gives a way of characterizing the fundamental matrix without reference to the camera matrices, i.e. only in terms of corresponding image points. This enables $\\mathbf{F}$ to be computed from image correspondences alone. To compute matrix $\\mathbf{F}$ at least $7$ correspondences are\n", "required.\n", "\n", "### Properties \n", "\n", "* $\\mathbf{F}$ is a $3 \\times 3$ matrix\n", "* $\\mathbf{F}$ is rank 2\n", "* If $\\mathbf{F}$ is the fundamental matrix of the pair of cameras $(\\mathbf{P}, \\mathbf{P'})$, then\n", "$\\mathbf{F^{T}}$ is the fundamental matrix of the pair in the opposite order $(\\mathbf{P'}, \\mathbf{P})$.\n", "* $ \\mathbf{l'} = \\mathbf{Fx}$ \n", "* $ \\mathbf{l} = \\mathbf{F^Tx'}$ \n", "* For any point $ \\mathbf{x}$ (other than $ \\mathbf{e}$) the epipolar line $ \\mathbf{l'} = \\mathbf{F} \\mathbf{x}$ \n", "contains epipole $\\mathbf{e'}$\n", "* $\\mathbf{e'^T} (\\mathbf{Fx}) = (\\mathbf{e'^TF})\\mathbf{x} = 0$ for all $\\mathbf{x}$.\n", "* $\\mathbf{e'}$ is the left null-vector of $\\mathbf{F}$.\n", "* $\\mathbf{e}$ is the right null-vector of $\\mathbf{F}$.\n", "* $\\mathbf{e'} = \\mathbf{P'C}$\n", "* $\\mathbf{e} = \\mathbf{PC'}$\n", "* $\\mathbf{Fe} = 0$\n", "* $\\mathbf{F^{T}e'} = 0$\n", "* $\\mathbf{l'} = \\mathbf{F}[\\mathbf{e}]_{\\times}\\mathbf{l}$ \n", "* $\\mathbf{l} = \\mathbf{F^T}[\\mathbf{e'}]_{\\times}\\mathbf{l'}$ " ] }, { "cell_type": "markdown", "id": "08646dc9-b944-4e74-9efd-a742f107b329", "metadata": {}, "source": [ "# The Essential Matrix\n", "The essential matrix is the specialization of the fundamental matrix to the case of\n", "normalized image coordinates. Historically, the essential matrix was introduced (by Longuet-Higgins) before the fundamental matrix, and the fundamental\n", "matrix may be thought of as the generalization of the essential matrix in which the\n", "(inessential) assumption of calibrated cameras is removed. The essential matrix is defined as \n", "\n", "$$ \\mathbf{E} = \\mathbf{K^{'T}FK}$$\n", "\n", "The relation between the essential matrix and corresponding points is essentially ( ;] ) the same as for the fundamental matrix. \n", "\n", "$$ \\mathbf{x'} ^{\\mathbf{T}} \\mathbf{F} \\mathbf{x} = 0$$ \n", "\n", "$$ \\mathbf{x'} \\mathbf{K^{'-T}K^{'T}FKK^{-1}} \\mathbf{x} = 0$$ \n", "\n", "$$ \\mathbf{y'} \\mathbf{K^{'T}FK} \\mathbf{y} = 0 $$\n", "\n", "$$ \\mathbf{y'} \\mathbf{E} \\mathbf{y} = 0 $$\n", "\n", "Where $\\mathbf{y}$ and $\\mathbf{y'}$ are the **normalized coordinates**. These can be thought of as the image plane coordinates for a camera with the identity matrix as camera intrinsics." ] }, { "cell_type": "markdown", "id": "03a80e73-33d1-4797-a43c-80f3cc303970", "metadata": {}, "source": [ "# Visualizing epipolar geometry" ] }, { "cell_type": "markdown", "id": "1e160425-3519-414b-b9cb-aa9cd9d4177a", "metadata": {}, "source": [ "We will now visualize epipolar geometry by plotting the epipolar lines for stereo point correspondences. \n", "For this demonstration, we use a stereo setup from notebook 1.\n", "\n", "
\n", " \n", "
\n" ] }, { "cell_type": "code", "execution_count": null, "id": "0077de5e-4a1b-4543-a7e9-d27ffe4ea772", "metadata": {}, "outputs": [], "source": [ "from notebook_functions import (init_3d_plot, \n", " plot_chessboard, \n", " plot_camera_wireframe, \n", " plot_picture,\n", " project_points_to_picture,\n", " object_points,\n", " images,\n", " get_stereo_setup_with_correspondences, \n", " triangulate)\n", "import ipyvolume as ipv\n", "import numpy as np \n", "import cv2\n", "import matplotlib.pyplot as plt" ] }, { "cell_type": "code", "execution_count": null, "id": "550750ef-afe1-4cad-ac05-5a844491efb4", "metadata": {}, "outputs": [], "source": [ "images, extrinsics, cam_centers, intrinsics, match_coords, object_points = get_stereo_setup_with_correspondences()\n", "\n", "init_3d_plot()\n", "plot_chessboard(object_points)\n", "vis_scale = 5\n", "\n", "for idx, (cam_center, extrinsic, image) in enumerate(zip(cam_centers, extrinsics, images)):\n", " inv_extrinsic = np.linalg.inv(extrinsic)\n", " inv_intrinsics = np.linalg.inv(intrinsics)\n", " plot_camera_wireframe(cam_center, vis_scale, inv_extrinsic)\n", " plot_picture(image, inv_extrinsic, vis_scale)\n", " \n", "ipv.show()" ] }, { "cell_type": "markdown", "id": "55aa96e8-835a-425c-916b-5bb592dcfd0b", "metadata": { "tags": [] }, "source": [ "## Calculate the fundamental matrix from camera matrices\n", "We will calculate the fundamental matrix using $ \\mathbf{F} = [\\mathbf{e'}]_{\\times} (\\mathbf{P'}\\mathbf{P^{+})} $ and $\\mathbf{e'} = \\mathbf{P'C}$. \n", "Where $ [T_{\\times}] = \\begin{pmatrix} 0 & -t_z & t_y\\\\ t_z & 0 & -t_x\\\\ -t_y & t_x & 0 \\end{pmatrix} $" ] }, { "cell_type": "code", "execution_count": null, "id": "cdb422c9-b382-4f02-a6d8-29747251b325", "metadata": {}, "outputs": [], "source": [ "extrinsic_1, extrinsic_2 = extrinsics\n", "extrinsic_1_orig = extrinsic_1\n", "extrinsic_2_orig = extrinsic_2\n", "cam_center_1, cam_center_2 = cam_centers\n", "cam_center_1_orig = cam_center_1\n", "cam_center_2_orig = cam_center_2\n", "proj_1 = intrinsics[:3, :3] @ extrinsic_1[:3, :4]\n", "proj_2 = intrinsics[:3, :3] @ extrinsic_2[:3, :4]\n", "pinv_proj_1 = np.linalg.pinv(proj_1)\n", "pinv_proj_2 = np.linalg.pinv(proj_2)\n", "e_1 = proj_1 @ cam_center_2[:4]\n", "e_2 = proj_2 @ cam_center_1[:4]\n", "e_1 = e_1 / e_1[2]\n", "e_2 = e_2 / e_2[2]\n", "x, y, z = e_2\n", "cross_e_2 = np.array([[ 0, -z, y],\n", " [ z, 0, -x],\n", " [-y, x, 0]])\n", "F = cross_e_2 @ (proj_2 @ pinv_proj_1)\n", "\n", "print(f'Rank of F is very near 2 (ignoring numerical rounding errors): {np.isclose(np.linalg.svd(F)[1][2], 0)}')\n", "print(f'Fe≈0 and F trans e\\'≈ 0: {np.all(F @ e_1 < 0.0001), np.all(F.T @ e_2 < 0.0001)}')" ] }, { "cell_type": "markdown", "id": "b5bc4e93-29f0-410a-8899-68070c9d16b1", "metadata": {}, "source": [ "## Draw the epipolar lines" ] }, { "cell_type": "markdown", "id": "f4251af4-7abd-47f6-a061-975f96d1d131", "metadata": {}, "source": [ "The epipolar lines will now be drawn using the equation $ \\mathbf{l'} = \\mathbf{Fx}$ and $ \\mathbf{l} = \\mathbf{F^Tx'}$." ] }, { "cell_type": "code", "execution_count": null, "id": "18a2820d-bc9b-410d-aaf8-ca7141461ab3", "metadata": {}, "outputs": [], "source": [ "def draw_homogenous_line(image, line): \n", " \"\"\" Draws a line represented by homogenous coordinates onto an image.\n", " \n", " Args: \n", " image (np.ndarray): An image array. \n", " line (np.ndarray): A 2D line represented by homgenous coordinates.\n", " Returns: \n", " image (np.ndarray): The image with the line drawn on it.\n", " \"\"\" \n", " height, width, _ = image.shape\n", " a, b, c = line\n", " y_at_x0 = int(-c/b)\n", " y_at_xwidth = int((-a * width - c)/b)\n", " image = cv2.line(image, (0, y_at_x0), (width, y_at_xwidth), [255, 255, 255], 5)\n", " return image\n", "\n", "def plot_epipolar_lines(image_1, image_2, points_1, points_2, F):\n", " \"\"\" Plots point correspondences along with epipolar lines. \n", " \n", " Args: \n", " image_{1|2} (np.ndarray): The stereo images. \n", " points_{1|2} (np.ndarray): The stereo point correspondences. \n", " F (no.ndarray): The fundamental matrix.\n", " \"\"\"\n", " for point_1, point_2 in zip(points_1, points_2): \n", " line_1 = F.T @ point_2[:3]\n", " line_2 = F @ point_1[:3]\n", " image_1 = cv2.circle(image_1, point_1[:2].astype(int), 15, [255,255,255], -1)\n", " image_2 = cv2.circle(image_2, point_2[:2].astype(int), 15, [255,255,255], -1)\n", " image_1 = draw_homogenous_line(image_1, line_1)\n", " image_2 = draw_homogenous_line(image_2, line_2)\n", "\n", " image_concat = cv2.hconcat([image_1, image_2])\n", " image_concat = cv2.cvtColor(image_concat, cv2.COLOR_BGR2RGB)\n", " plt.figure(figsize=(20,20))\n", " plt.imshow(image_concat)\n", " plt.show() " ] }, { "cell_type": "code", "execution_count": null, "id": "c8f791cc-1fb7-4151-ba04-fc748f61d46b", "metadata": {}, "outputs": [], "source": [ "im_1_points = match_coords[0]\n", "im_2_points = match_coords[1]\n", "image_1 = images[0]\n", "image_2 = images[1]\n", "plot_epipolar_lines(image_1.copy(), image_2.copy(), im_1_points, im_2_points, F)" ] }, { "cell_type": "markdown", "id": "da65714c-e545-4b41-9cd2-684e4c5d4a44", "metadata": { "tags": [] }, "source": [ "\n", "# 8 point algorithm - Fundamental matrix from point correspondences\n", "\n", "The 8 point algorithm is a straightforward approach for calculating the Fundamental matrix $\\mathbf{F}$.\n", "\n", "1. **Points**. Find $8$ or more stereo point correspondences. Here we will use SIFT features.\n", "2. **Linear solution**. Use an SVD to find the least-squares solution to $\\mathbf{F}$ given the $8$ or more correspondence constraints $\\mathbf{x'} ^{\\mathbf{T}} \\mathbf{F} \\mathbf{x} = 0$. That is, for SVD$= \\mathbf {U\\Sigma V^{T}}$, take the last column of $\\mathbf{V}$ as $\\mathbf{F}$.*\n", "3. **Enforce Rank 2**. For noisy data, the linear solution $\\mathbf{F}$ will not be of rank $2$ in general. To enforce this, use the SVD to perform a PCA and keep only the first $2$ principal axes when reconstructing $\\mathbf{F}$. \n", " \n", "*-See the appendix for an intuitive explanation and proof of why:\n", "* The Singular Value Decomposition can be used to find a least-squares solution.\n", "* The rank of the Fundamental matrix must be $2$. " ] }, { "cell_type": "markdown", "id": "ac68eb23-de2a-4eb7-b77a-e9794cf7bd3c", "metadata": {}, "source": [ "### Constraint matrix\n", "\n", "$\n", "\\begin{align}\n", "\\mathbf{x'} ^{\\mathbf{T}} \\mathbf{F} \\mathbf{x} &= 0 \\\\\\\\\n", "\\begin{bmatrix}\n", "x' & y' & w'\n", "\\end{bmatrix} \n", "\\begin{bmatrix}\n", "f_1 & f_2 & f_3 \\\\\n", "f_4 & f_5 & f_6 \\\\\n", "f_7 & f_8 & f_9\n", "\\end{bmatrix} \n", "\\begin{bmatrix}\n", "x \\\\ \n", "y \\\\ \n", "w\n", "\\end{bmatrix} &= 0 \\\\\\\\\n", "x' x f_1 + x' y f_2 + x' w f_3 \\\\\n", "+ y' x f_4 + y' y f_5 + y' w f_6 \\\\\n", "+ w' x f_7 + w' y f_8 + w' w f_9 &= 0 \\\\\\\\\n", "\\begin{bmatrix}\n", "x'x & x'y & x'w & y'x & y'y & y'w & w'x & w'y & w'w \n", "\\end{bmatrix}\n", "\\begin{bmatrix}\n", "f_1 \\\\ f_2 \\\\ f_3 \\\\ f_4 \\\\ f_5 \\\\ f_6 \\\\ f_7 \\\\ f_8 \\\\ f_9\n", "\\end{bmatrix} &= \\begin{bmatrix} 0 \\end{bmatrix}\n", "\\end{align}\n", "$\n", "\n", "The last equation is a linear homogeneous equation of form $\\mathbf{A}b = 0$. Adding $7$ more constraints to $\\mathbf{A}$ allows us to compute $b$. In this case, $b$ is the rolled-out fundamental matrix $\\mathbf{F}$." ] }, { "cell_type": "code", "execution_count": null, "id": "b04a1858-5d01-48fa-819c-a855e031b90a", "metadata": {}, "outputs": [], "source": [ "# Linear solution\n", "constraint_matrix = list()\n", "for point_1, point_2 in zip(im_1_points, im_2_points): \n", " x1, y1, w1, _ = point_1\n", " x2, y2, w2, _ = point_2\n", " constraint = [x2*x1, x2*y1, x2*w1, y2*x1, y2*y1, y2*w1, w2*x1, w2*y1, w2*w1]\n", " constraint_matrix.append(constraint)\n", "constraint_matrix = np.array(constraint_matrix)\n", "u, s, v_t = np.linalg.svd(constraint_matrix)\n", "least_squares = v_t[-1] # last column of V\n", "F = least_squares.reshape((3, 3))\n", "\n", "# Enforce rank 2 \n", "u, s, v_t = np.linalg.svd(F)\n", "s = np.diag(s) \n", "s[2, 2] = 0\n", "F = u @ s @ v_t\n", "plot_epipolar_lines(image_1.copy(), image_2.copy(), im_1_points, im_2_points, F)" ] }, { "cell_type": "markdown", "id": "5a4aa479-9234-44fe-a8f4-f51245cb82be", "metadata": {}, "source": [ "# Deriving relative camera pose from the essential matrix." ] }, { "cell_type": "markdown", "id": "e97e7d48-44be-4614-aaaa-612e913a77ce", "metadata": {}, "source": [ "For a proof of this result refer to 9.6.2 in Multiple View Geometry in Computer Vision and [this Wikipedia article](https://en.wikipedia.org/wiki/Essential_matrix#Extracting_rotation_and_translation). The proof is out-of-scope and would require too much space.\n", "\n", "For a given essential matrix $\\mathbf{E}$ with SVD decomposition $\\mathbf{E} = \\mathbf{U} \\text{diag}(1,1,0)\\mathbf{V^T}$ and first camera matrix $\\mathbf{P} = [\\mathbf{I} | \\mathbf{0}]$, there are our possible choices for the second camera $\\mathbf{P'}$, namely \n", "\n", "$$ \\mathbf{P'} = [\\mathbf{UWV^T} | +\\mathbf{u_3}] \\hspace{0.5em} \\text{ or } \\hspace{0.5em} [\\mathbf{UWV^T} | -\\mathbf{u_3}] \\hspace{0.5em} \\text{ or } \\hspace{0.5em} [\\mathbf{UW^TV^T} | +\\mathbf{u_3}] \\hspace{0.5em} \\text{ or } \\hspace{0.5em} [\\mathbf{UW^TV^T} | -\\mathbf{u_3}]$$ \n", "\n", "Where $\\mathbf{W} = \\begin{bmatrix} 0 & -1 & 0 \\\\\n", " 1 & 0 & 0 \\\\\n", " 0 & 0 & 1 \\end{bmatrix} $\n", " \n", "This 4 solution ambiguity can be reduced to a single solution. This is the solution that places all points in front of the cameras after triangulation. \n", "We compute all 4 solutions and check which solution has points with only positive z coordinates relative to the camera. " ] }, { "cell_type": "code", "execution_count": null, "id": "6bdf247f-9b1e-45eb-8965-11868766fdcc", "metadata": {}, "outputs": [], "source": [ "def decompose_essential(essential):\n", " \"\"\" Decomposes an essential matrix into a relative camera pose. \n", " \n", " Args:\n", " essential (np.ndarray): The essential matrix. \n", " Returns: \n", " pose (np.ndarray): The extrinsic matrix of the second camera.\n", " \"\"\"\n", " w = np.array([[0, -1, 0], \n", " [1, 0, 0], \n", " [0, 0, 1]])\n", " u, s, v_t = np.linalg.svd(essential)\n", "\n", " # Translation hypothesis\n", " u_3 = u[:, 2, np.newaxis]\n", " \n", " # Rotation hypothesis\n", " rot_a = u @ w @ v_t\n", " rot_b = u @ w.T @ v_t\n", " rot_a = -rot_a if np.linalg.det(rot_a) < 0 else rot_a\n", " rot_b = -rot_b if np.linalg.det(rot_b) < 0 else rot_b\n", "\n", " # 4 motion hypothoses as extrinsic matrices\n", " extr_2_a = np.hstack([rot_a, u_3])\n", " extr_2_b = np.hstack([rot_a, -u_3])\n", " extr_2_c = np.hstack([rot_b, u_3])\n", " extr_2_d = np.hstack([rot_b, -u_3])\n", " \n", " # Assume cam 1 = [I | 0]\n", " extr_1 = np.array([[1, 0, 0, 0],\n", " [0, 1, 0, 0], \n", " [0, 0, 1, 0]])\n", " cam_center_1 = np.zeros(3)\n", " inv_intrinsics = np.linalg.inv(intrinsics[:3, :3])\n", " inv_extrinsic_1 = np.linalg.pinv(extr_1)\n", " vec_in_cam_ref_1 = inv_intrinsics @ im_1_points.T[:3]\n", " vec_in_world_1 = inv_extrinsic_1 @ vec_in_cam_ref_1\n", " pose = None\n", " \n", " # Triangulate points for each of the 4 solutions\n", " for extr_2 in [extr_2_a, extr_2_b, extr_2_c, extr_2_d]:\n", " cam_center_2 = -extr_2[:3, :3] @ extr_2[:, 3].T\n", " inv_extrinsic_2 = np.linalg.pinv(extr_2)\n", " vec_in_cam_ref_2 = inv_intrinsics @ im_2_points.T[:3]\n", " vec_in_world_2 = inv_extrinsic_2 @ vec_in_cam_ref_2\n", " rel_zs = np.array([])\n", " for vec_1, vec_2 in zip(vec_in_world_1.T, vec_in_world_2.T):\n", " triangulated, _, _ = triangulate(cam_center_1, vec_1, cam_center_2, vec_2)\n", " triangulated = np.append(triangulated, 1)\n", " x1, y1, z1 = extr_1 @ triangulated # Triangulated point in camera 1 ref\n", " x2, y2, z2 = extr_2 @ triangulated # Triangulated point in camera 2 ref\n", " rel_zs = np.append(rel_zs, [z1, z2])\n", " if (rel_zs > 0).all():\n", " pose = extr_2\n", " \n", " return pose" ] }, { "cell_type": "markdown", "id": "86aebd26-676d-45a4-b5e1-571e7448019e", "metadata": { "tags": [] }, "source": [ "# Visualize relative pose inferred from the essential matrix\n", "The ground truth pose of the second camera is shown in red. \n", "The inferred pose of the second camera is shown in blue. \n", "As can be seen, the pose can be very decently inferred from point correspondences. \n", "The careful reader will notice that the relative translation is artificially scaled by a \"magic\" number. \n", "This is because the essential matrix has no information on the scale of the relative camera translation. \n", "For details on why this is, see [this post](https://stackoverflow.com/questions/69742520/extracting-the-scale-of-translation-vector-that-i-got-from-the-essential-matrix/69980810#69980810). \n", "We do however know the scale of the scene due to the chessboard calibration object, which is how I manually found the translation scale 8.6." ] }, { "cell_type": "code", "execution_count": null, "id": "bbe308d9-8bf7-464c-b7aa-35c7ad971db8", "metadata": {}, "outputs": [], "source": [ "init_3d_plot()\n", "plot_chessboard(object_points)\n", "plot_camera_wireframe(cam_center_2_orig, vis_scale, np.linalg.inv(extrinsic_2_orig), color='red')\n", "\n", "essential = intrinsics[:3, :3].T @ F @ intrinsics[:3, :3]\n", "rel_extrinsic_2 = decompose_essential(essential)\n", "\n", "# Convert relative pose from essential matrix decomposition to absolute poses. \n", "rel_extrinsic_2[:, 3] *= 8.6\n", "extrinsic_2 = rel_extrinsic_2 @ extrinsic_1\n", "cam_center_2 = -extrinsic_2[:3, :3].T @ extrinsic_2[:3, 3] \n", "extrinsic_2 = np.vstack([extrinsic_2, [0, 0, 0, 1]])\n", "extrinsics[1] = extrinsic_2\n", "cam_centers[1] = cam_center_2\n", "\n", "for idx, (cam_center, extrinsic, image) in enumerate(zip(cam_centers, extrinsics, images)):\n", " inv_extrinsic = np.linalg.pinv(extrinsic)\n", " inv_intrinsics = np.linalg.inv(intrinsics)\n", " plot_camera_wireframe(cam_center, vis_scale, inv_extrinsic)\n", " plot_picture(image, inv_extrinsic, vis_scale)\n", " \n", "ipv.show()" ] }, { "cell_type": "markdown", "id": "8b2b1507-2ed3-49d2-99e8-6f625e09e312", "metadata": { "tags": [] }, "source": [ "## Conclusion\n", "In this notebook we \n", "* visualized the epipolar lines for stereo point correspondences \n", "* computed the fundamental matrix using the 8 point algorithm\n", "* decomposed the essential matrix into a relative pose of a camera\n", "\n", "Having an intuition of how epipolar geometry works will serve us well. \n", "It is both a tool for problem-solving and a theoretical basis for many computer vision algorithms. \n" ] }, { "cell_type": "markdown", "id": "5860dd9a-9bcd-4b34-9252-bbf0e2c985ab", "metadata": {}, "source": [ "## Sources" ] }, { "cell_type": "markdown", "id": "1d5c6367-e052-4393-ba14-510b8337d51d", "metadata": {}, "source": [ "1. [Szeliski, Richard. \"Computer vision: algorithms and applications.\" Springer Science & Business Media, 2010](https://szeliski.org/Book/)\n", "2. [Andrew, Alex M. \"Multiple view geometry in computer vision.\" Kybernetes, 2001](https://www.robots.ox.ac.uk/~vgg/hzbook/)" ] }, { "cell_type": "markdown", "id": "a0c3a335-fd6b-4065-b8df-cd94c05c5401", "metadata": {}, "source": [ "## Appendix \n", " \n", "#### Why does the SVD contain the least-squares solution? \n", "This section gives an intuitive proof of why for a homogenous equation (or data) matrix $\\mathbf{A}$ in $\\mathbf{A}b = 0$, the last column of the SVD's $\\mathbf{V}$ matrix is the least-squares solution.\n", "For a homogenous least-squares problem, we are interested in finding a parameter vector $b$ such that\n", "\n", "$$ \\mathbf{A}b = 0 $$\n", "\n", "subject to the constraint $ \\lVert b \\rVert ^{2}=1$ to avoid the arbitrary solution. Because there may be no exact solution, this is phrased as a minimization problem. \n", "\n", "$$ \\arg \\underset{b}{\\min} \\lVert \\mathbf{A}b \\rVert^2_2 $$\n", "\n", "\n", "The SVD decomposes a matrix into two orthonormal matrices $ \\mathbf {U, V^{T}} $ and one diagonal matrix with non-negative real numbers $\\mathbf{\\Sigma}$. Interpreting the SVD as a PCA, $\\mathbf{V}$ encodes the principal axes of the data and $\\mathbf{\\Sigma}$ the amount of variance of the data when projected onto these principal axes in descending order, i.e. $\\sigma_1 \\geq \\sigma_2 \\geq \\dots \\geq \\sigma_n$(for details on the SVD $\\leftrightarrow$PCA connection, see [this Princeton tutorial](https://www.cs.princeton.edu/courses/archive/spring12/cos598C/svdchapter.pdf)). \n", "\n", "$$ \\mathbf{A} \\rightarrow \\mathbf {U\\Sigma V^{T}} $$\n", "\n", "Matrix multiplication with an orthonormal matrix does not affect the norm of a vector, thus we can remove $\\mathbf{U}$ from the equation. \n", "\n", "$$ \\arg \\underset{b}{\\min} \\lVert \\mathbf {U \\Sigma V^{T}}b \\rVert = \\arg \\underset{b}{\\min} \\lVert \\mathbf {\\Sigma V^{T}}b \\rVert$$\n", "\n", "We further substitute with $y = \\mathbf{V^{T}} b$, with constraint $ \\lVert y \\rVert ^{2}=1$ due to orthonormality of $\\mathbf{V^{T}}$.\n", "\n", "$$ \\arg \\underset{b}{\\min} \\lVert \\mathbf {\\Sigma V^{T}}b \\rVert =arg \\underset{y}{\\min} \\lVert \\mathbf {\\Sigma} y \\rVert $$ \n", "\n", "Because the singular values $\\sigma$ in $\\mathbf{\\Sigma}$ encode the variance in descending order, the optimal $y$ is trivially $(0,\\dots,0,1)^T$. \n", "$$ y = (0,\\dots,0,1)^T$$\n", "$$ \\mathbf{V^{T}} b = (0,\\dots,0,1)^T$$\n", "$$ \\mathbf{V V^{T}}b = \\mathbf{V} (0,\\dots,0,1)^T $$ \n", "$$ b = \\mathbf{V} (0,\\dots,0,1)^T $$ \n", "$$ b = \\text{last column of } \\mathbf{V} $$ \n", "making the optimal $b$ the last column of $\\mathbf{V}$. " ] }, { "cell_type": "markdown", "id": "5e7c6893-59fe-4759-b712-9ae0e0137411", "metadata": {}, "source": [ "\n", "## Why is the fundamental matrix rank 2? \n", "This proof was taken from [this Quora post by Samarth Brahmbhatt](https://www.quora.com/Why-is-the-fundamental-matrix-in-computer-vision-rank-2). Additional references to other proofs have been added where needed.\n", "\n", "#### Intuition\n", "A fundamental matrix is given by the equation $ \\mathbf{x'}^T \\mathbf{F} \\mathbf{x} = 0 $. Now consider an epipolar line $ l' = \\mathbf{F} \\mathbf{x} $. The right epipole $ \\mathbf{e'} $ lies on this line, so $ \\mathbf{e'}^T l' = 0 $ or $ \\mathbf{e'} ^T \\mathbf{F} \\mathbf{x} = 0 $ for all $ \\mathbf{x} $. This implies that $ \\mathbf{e'} ^T \\mathbf{F} = 0 $. Similarly, one can prove that $ \\mathbf{F} \\mathbf{e} = 0 $. Hence $ \\mathbf{F} $ has a null space which is not just the zero vector. So $ \\mathbf{F} $ is rank deficient.\n", "\n", "#### Proof\n", "The proof that $ \\mathbf{F} $ has of rank exactly 2 comes from the fact that $ \\mathbf{F} $ is constructed from the essential matrix $ \\mathbf{E}$: \n", "\n", "$$ \\mathbf{F} = (\\mathbf{K'^{-1}})^T \\mathbf{E} \\mathbf{K^{-1}} $$ where the $ \\mathbf{K} $'s are intrinsic matrices of the two cameras. Now, $ \\mathbf{E} = [T_{\\times}]R $ where $ R $ is the rotation matrix relating the two camera co-ordinate systems and $ [T_{\\times}] = \\begin{pmatrix} 0 & -t_z & t_y\\\\ t_z & 0 & -t_x\\\\ -t_y & t_x & 0 \\end{pmatrix} $. For a proof of this identity, see the [wikipedia page](https://en.wikipedia.org/wiki/Essential_matrix#Derivation_and_definition). A little bit of manipulation will show that one column of $ [T_{\\times}] $ is a linear combination of the other two columns. So $ [T_{\\times}] $ has rank 2. For a proof of this see [this row reduced solution](https://www.wolframalpha.com/input/?i2d=true&i=row+echelon+form+of+%7B%7B0%2C-z%2Cy%7D%2C%7Bz%2C0%2C-x%7D%2C%7B-y%2Cx%2C0%7D%7D).\n", "\n", "Hence any matrix that you construct by multiplying other matrices with $ [T_{\\times}] $ (such as $ \\mathbf{E} $ and $ \\mathbf{F} $) will also have at most rank 2." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.5" } }, "nbformat": 4, "nbformat_minor": 5 }