Intro - In case you don't know what a homography (computer vision) is
I'm working with some image processing problems where I extract a bunch of points of interest in two images which may or may not have the same subject.
So here's an example of two matching images and the points of interest that may match (and do match in this example):
Using a C++ library called OpenCV I can get a best-fitting homography matrix that transforms points from image 1 into those from image 2. The idea is then to take the inverse of this matrix and use that to warp the perspective of image 2 so that it looks like image 1. The matrix for this particular example is:
So here's an example of a bad match. First the original keypoint pairings
and now with the second image warped according to the homography that was found:
Obviously this is rather "unreasonable"
The question
I need a way to tell if a homography is reasonable or not. My linear algebra isn't that great, but my best guesses would be to check:
- The eigenvalues. I'd guess that these need to be real and positive, and not insanely far off 1. For my reasonable homography example they are [0.47, 0.74, 1.01]. For my unreasonable homography they are [66.83, 1.02 + 2.08i, 1.02 - 2.08i]. I just don't know if I'm right though. Would a reasonable homography have negative or complex eigenvalues? I think I know what a negative means (something like a flip along a certain axis), but I'm not sure what a complex eigenvalue means here.
- The determinant. I'm just guessing this shouldn't be insanely far off 1 and should be positive. For my reasonable homography example it is 0.35. For my unreasonable homography it is 359.08. BUT I have found unreasonable homographies with determinants fairly close to 1.
So can someone please comment on my ideas and perhaps introduce better ones?
1 Answer
I think I've almost figured this one out. In the comments @BenGrossmann mentions the condition number as a good way of evaluating how much a matrix "squishes" space. This is great, but it didn't work right away.
The condition number of $H$ (the homography matrix) for my "reasonable" example was on the order of 4e5, which indicates an extreme amount of "squishing".
So to understand what's going on we need to understand what the homography matrix actually does. HINT: It isn't the only thing required for the transformation. It's a 3x3 matrix because we are actually transforming the points of a plane in 3d space:
So consider a point on our input plane $[x', y']^\intercal$. When applying a homography we actually put the plane at $z=1$ so that point is: $[x', y', 1]^\intercal$. Then there's a two step process to get to our transformed point:
- Get $[x, y, z]^\intercal = H[x', y', 1]^\intercal$. But we aren't done because now our output plane has a varying $z$ component. So the next step explains how to get back into the 2D world of an image.
- Divide by z to put the point back at $z=1$ to get $[x/z, y/z, 1]$ (not explicitly shown in diagram). This makes sense because if the transformed plane's normal is not parallel to $\hat{z}$, then that means visually that parts of it are further away from the "observer" and therefore appear smaller.
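The two steps above can be sketched in plain Python. This is only an illustration, not OpenCV's implementation: the function name `apply_homography` and the example matrix `H_far` are made up here, and $H$ is assumed to be a 3x3 nested list in row-major order.

```python
def apply_homography(H, point):
    """Map a 2D point through a 3x3 homography H (nested lists, row-major)."""
    xp, yp = point
    # Step 1: lift the point onto the plane z = 1 and multiply by H.
    x = H[0][0] * xp + H[0][1] * yp + H[0][2]
    y = H[1][0] * xp + H[1][1] * yp + H[1][2]
    z = H[2][0] * xp + H[2][1] * yp + H[2][2]
    if abs(z) < 1e-12:
        raise ValueError("point maps to infinity (z is zero)")
    # Step 2: divide by z to bring the point back to the plane z = 1.
    return (x / z, y / z)


# Hypothetical example: a matrix that only pushes the plane out to z = 2.
H_far = [[1, 0, 0], [0, 1, 0], [0, 0, 2]]
print(apply_homography(H_far, (10.0, 20.0)))  # (5.0, 10.0)
```

The division in step 2 is what makes the output shrink when $z$ grows, which matches the "further away looks smaller" intuition.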
In more mathematical terms, we say that the homography applies the transformation of each point up to a scale factor (warning: the input points and output points are flipped in this screenshot relative to my notation):
So now that we have the computer vision specifics out of the way we can go back to the math. We now know it's not sensible to calculate the condition number of all of $H$ because it's possible to have a perfectly reasonable transformation which translates the input plane along the $z$ axis without touching the $x$ and $y$ coordinates like so:
$$ H = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 5 \end{bmatrix} $$
In which case the plane would be translated 4 units in the direction of $\hat{z}$. This would not be an overly unreasonable homography. It would just mean that we make an object look further away. But the condition number would be 5 which indicates a fair bit of squishing.
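To spell out the arithmetic for this example (the symbols $\kappa$ for the 2-norm condition number and $\sigma$ for the singular values are my notation, not something from the question):

$$ H \begin{bmatrix} x' \\ y' \\ 1 \end{bmatrix} = \begin{bmatrix} x' \\ y' \\ 5 \end{bmatrix} \;\longrightarrow\; \begin{bmatrix} x'/5 \\ y'/5 \\ 1 \end{bmatrix}, \qquad \kappa(H) = \frac{\sigma_{\max}}{\sigma_{\min}} = \frac{5}{1} = 5 $$

Since $H$ is diagonal, its singular values are just the absolute values of the diagonal entries (1, 1 and 5). The net effect on the image is a uniform shrink by a factor of 5, which is harmless even though the condition number is not close to 1.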
As another example, let's look at the actual homography matrix from my "reasonable" homography:
$$ H = \begin{bmatrix} 0.73 & -0.4 & 197.41 \\ 0.15 & 0.45 & -57.61 \\ 0 & 0 & 1.17 \end{bmatrix} $$
The mechanism here is different. The values 197.41 and -57.61 indicate a translation in the $x$, $y$ directions. The input image is about 512x512 pixels so these translations are not so unreasonable. But the matrix is ill-conditioned (condition number is ~4e5) because small changes in $z$ yield large changes in the response. But since $z$ is fixed, this should not matter.
The trick then is to take the condition number of the first two columns of $H$ only. Then we are omitting the $z$-dependence of the transformation, which should be irrelevant because we know $z$ will always be fixed to 1 on the input.
When I do this I get a condition number of 1.7 for my "reasonable" homography, and condition numbers of 5+ for "unreasonable" homographies.
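Here is a minimal sketch of that computation in plain Python, assuming the 2-norm condition number (largest singular value over smallest). The function name `cond_first_two_columns` is made up for illustration. It avoids a full SVD by using the fact that the squared singular values of the 3x2 block are the eigenvalues of its 2x2 Gram matrix, which have a closed form.

```python
import math


def cond_first_two_columns(H):
    """2-norm condition number of the 3x2 matrix made of H's first two columns."""
    a = [row[0] for row in H]  # first column
    b = [row[1] for row in H]  # second column
    # Gram matrix A^T A = [[p, q], [q, r]]; its eigenvalues are sigma^2.
    p = sum(v * v for v in a)
    q = sum(u * v for u, v in zip(a, b))
    r = sum(v * v for v in b)
    mean = (p + r) / 2
    radius = math.hypot((p - r) / 2, q)
    lam_max = mean + radius
    lam_min = mean - radius
    if lam_min <= 0:
        return math.inf  # columns are (numerically) linearly dependent
    return math.sqrt(lam_max / lam_min)


# The rounded "reasonable" matrix from this answer.
H_good = [[0.73, -0.4, 197.41], [0.15, 0.45, -57.61], [0, 0, 1.17]]
print(cond_first_two_columns(H_good))
```

With the rounded entries shown above this comes out at roughly 1.8, close to the 1.7 quoted (the difference is presumably down to rounding of the matrix entries).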
I think the last step, which I have not addressed here, is to have some limit on how far along the $z$-axis I should allow my output to be translated. After all, if I translate the output to $z=100$, then the resulting image will be tiny, and therefore rather unreasonable. I believe that by simply looking at the bottom right entry of the matrix, we know a fair bit about the amount of translation. Combining that knowledge with a reasonable condition number for the first two columns of $H$, we should have a pair of descriptive criteria for the purpose of the question.
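The two criteria could be combined along these lines. This is a rough sketch, not a tested recipe: the name `looks_reasonable` is made up, the `max_cond=5.0` default only reflects the "5+" I saw for my bad examples, and the `h33_range` bounds are arbitrary placeholders that would need tuning on real data.

```python
import math


def looks_reasonable(H, max_cond=5.0, h33_range=(0.5, 2.0)):
    """Heuristic check; both thresholds are illustrative assumptions."""
    a = [row[0] for row in H]
    b = [row[1] for row in H]
    # Squared singular values of the first two columns, via the Gram matrix.
    p = sum(v * v for v in a)
    q = sum(u * v for u, v in zip(a, b))
    r = sum(v * v for v in b)
    mean = (p + r) / 2
    radius = math.hypot((p - r) / 2, q)
    lam_min = mean - radius
    cond = math.sqrt((mean + radius) / lam_min) if lam_min > 0 else math.inf
    # Criterion 1: the x/y part must not squish the plane too much.
    # Criterion 2: the bottom-right entry must not push the plane too far in z.
    low, high = h33_range
    return cond < max_cond and low <= H[2][2] <= high


H_good = [[0.73, -0.4, 197.41], [0.15, 0.45, -57.61], [0, 0, 1.17]]
H_tiny = [[1, 0, 0], [0, 1, 0], [0, 0, 100]]
print(looks_reasonable(H_good), looks_reasonable(H_tiny))  # True False
```

Note that this ignores the first two entries of the bottom row, which also affect $z$ when they are non-zero, so it is only a starting point.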
I'm still not 100% sure I'm right so I'll leave this question open and hopefully someone will comment on my reasoning.