Inverse Function Theorem as Newton Iterations
In Newton's Method we attempt to find zeros of a function (that is, points in the inverse image of the point ${0}$) by moving our previous guess $x_i$ by $$x_{i+1} = x_i + \frac{- f(x_i)}{f'(x_i)} = x_i + \frac{0 - f(x_i)}{f'(x_i)}.$$ In one dimension, the linear transformation given by the derivative at $x_i$ is just multiplication by the number $f'(x_i)$ (and division by $f'(x_i)$ is just application of the inverse linear transformation). But if we write the above in the more general notation of a linear transformation $Df_{x_i}$, we would write $$x_{i+1} = x_i + (Df_{x_i})^{-1}(0 - f(x_i)).$$
This choice of $x_{i+1}$ makes quite a lot of sense even if we move to the general case where $f$ is a function from $\mathbb{R}^n \to \mathbb{R}^n$ and $Df$ is invertible. We are seeing how much we missed $0$ by with our latest guess $x_i$, and moving in the direction which the derivative tells us is most likely to get us from $f(x_i)$ to $0$, i.e. $(Df_{x_i})^{-1}(0-f(x_i))$. (Recall the derivative $Df_{x_0}$ is a linear transformation relating directions and speeds of movements near $x_0$ to directions and speeds of movement near $f(x_0)$.)
In fact, we can use Newton's Method to prove the Inverse Function Theorem. In attempting to find an inverse function for $f$, we choose a generic point $y$ (where before we had zero) and run the same algorithm. So we use the formula $$x_{i+1}= x_i + (Df_{x_i})^{-1}(y - f(x_i)).$$ Note that this algorithm terminates iff we we have found a preimage of $y$ (assuming that $Df_{x_i}$ is always invertible). As in the one-dimensional case, we try to establish convergence of this algorithm by showing that it is a contraction, i.e. that $|x_{i+1}-x_i| \leq c |x_i - x_{i-1}|$, for $c \in (0,1)$ fixed. In doing so, the initial selection of $y$ is crucial: this process may not work for all $y$, but, as we will show, it will always work if we choose $y$ close enough to $f(x_0)$.
Note that one of the consequences of our clever choice of $x_{i+1}$ is that $$\text{The linear approximation to } y_i=f(x_i) \text{ at } x_{i-1} \text{ is }y.$$ For $$f(x_i) = f(x_{i-1} + (x_i - x_{i-1})) = f(x_{i-1}) + Df_{x_{i-1}}(x_i - x_{i-1}) + E$$ $$= f(x_{i-1}) + Df_{x_{i-1}}((Df_{x_{i-1}})^{-1}(y - y_{i-1})) + E = y + E,$$ where $E$ is the error of the linear approximation, which is supposed to be $o(|x_i - x_{i-1}|)$ by the definition of the derivative. This fact is the heart of the proof, and our proof will succeed because we will choose a neighborhood $N$ of $x_0$ on which $E$ is a uniformly small proportion of $|x_i - x_{i-1}|$. Indeed, taking the derivative with respect to $t$ along a linear path from $x$ to $x'$ and integrating gives us an estimate $$|E| = |f(x) - f(x') - Df_{x'}(x-x')| = \left | \int_0^1 \frac{d}{dt}\left(f(x' + t(x-x')) \right)dt - Df_{x'}(x-x') \right| $$ $$ \left| \int_0^1 Df_{x' + t(x-x')}(x-x') - Df_{x'}(x-x') \, dt \right| \leq \int_0^1 \|Df_{x' + t(x-x')} - Df_{x'}\|\cdot |x-x'|\, dt$$ $$|E| \leq \epsilon |x-x'|, \tag{1}$$ for $x$, $x'$ in some neighborhood $N_\epsilon$ of $x_0$ where $Df$ does not change too much. Since $y$ is the linear approximation to $f(x_i)$ at $f(x_{i-1})$, we can write this as $$|y-f(x_i)| < \epsilon |x_i - x_{i-1}|,$$ as long as our iterations keep us in $N_\epsilon$.
Later, when we show that the inverse function we find is differentiable, we will find it convenient to get a lower bound for $|f(x) - f(z)|$ for $x$ and $z$ in our neighborhood, as in $$|f(x) - f(z)| \geq \frac{1}{2} \|(Df_{x_0})\|_{\text{min}} \cdot |x - z|.$$ Since we are trying to show that $f$ is one-one on a neighborhood, it is not surprising that we would want to make sure $f$ doesn't start mapping points too close together. (This is an easy mean value theorem estimate.)
The iteration proceeds by a sort of back-and-forth motion. We move forward by $f$, and back by $(Df_{x_i})^{-1}$ (or more precisely, by $(y_i \mapsto x_i + (Df_{x_i})^{-1}(y- y_i)$). In both directions we need to make sure things don't change too much: if we change too much going forth or going back, we might not obtain a contraction mapping, or we might wind up outside $N_\epsilon$. We control "forth" by (1), and "back" by getting an upper bound $\|(Df_x)^{-1}\|< M$ on $N_\epsilon$. Then, as we have seen, $$|y - f(x_i)| \leq \epsilon |x_i - x_{i-1}| , \quad \text{ and }$$ $$|x_{i+1} - x_i| = |(Df_{x_i})^{-1}(y - f(x_i))| \leq M |y - f(x_i)|.$$ In other words, going back we pick up at most a factor of $M$, and going forth we can get a factor of $\epsilon$. So we just need to choose $\epsilon$ small enough to counteract $M$, and a little left over to give us a contraction mapping. Also crucial here is the choice of $y$ so that $y-f(x_0)$ is small enough already so that $M|y - f(x_0)|$ is not a far enough distance to get us out of $N_\epsilon$.
Somewhat more rigorously, take $N$ to be a compact neighborhood of $x_0$ where $Df_{x}$ remains invertible and $\|(Df_x)^{-1}\|< M$, and then select a smaller open ball $N_\epsilon$ about $x_0$ with radius $\rho$ such that $$\|Df_x - Df_z\| < \frac{(1/3)}{M} \quad \text{and}$$ $$|f(x) - f(z)| \geq \frac12 \|Df_{x_0}\|_{\text{min}} \cdot |x - z|$$ when $x$ and $z$ are in $N_\epsilon$ (note the former gives us (1)). Now choose $y$ close enough to $f(x_0)$ so that $|y-f(x_0)| < \frac{\rho/2}{M}$, and I claim we'll have a contraction mapping.
With those details left to the reader, we have shown that for $y$ in a $\frac{\rho/2}{M}$ neighborhood of $f(x_0)$, $y$ has a unique preimage in $N_\epsilon$. Therefore there is an inverse function defined on this neighborhood. (And of course $(f \circ f^{-1})(B_{f(x_0)}(\frac{\rho/2}{M})) \subset B_{f(x_0)}(\frac{\rho/2}{M})$, so we have a bijection of open sets.) To show that $f^{-1}$ is differentiable with derivative $(Df_x)^{-1}$, suppose $x \mapsto y$ and $x+h \mapsto y + k$. Then we have by the definition of differentiability that $Df_x(h) - k$ is $o(|h|)$. So $$\frac{|h - (Df_x)^{-1}(h)|}{|k|}= \frac{|(Df_x)^{-1}(Df_x(h) - k)|}{|k|},$$ and $|k| \geq (1/2)\|Df_{x_0}\|_{\text{min}}|h|$ by above, so $$\leq \frac{M}{(1/2)\|Df_{x_0}\|_{\text{min}}} \frac{|Df_x(h) - k|}{|h|},$$ and of course $|h|$ goes to zero as $|k|$ goes to zero, so we are good.










