Majorization in a Nutshell

I use majorization in many of my papers. So a relatively short general introduction with some examples may be useful. Here we go. Keep reading. If the math looks weird, refresh your browser.

The problem we try to solve throughout is to construct a convergent and stable iterative algorithm to minimize a function $f:X\rightarrow\mathbb{R}$ over $X\subseteq\mathbb{R}^n$.

A majorization scheme for $f$ on $X$ is a function $g:X\otimes X\rightarrow\mathbb{R}$ such that

• $g(x,y)\geq f(x)$ for all $x,y\in X$.
• $g(x,x)=f(x)$ for all $x\in X$.

In other words $f(x)=\min_{y\in X}g(x,y)$ and $x\in\mathop{\text{argmin}}_{y\in X}g(x,y)$. A majorization scheme is strict if $g(x,y)<f(x)$ for all $x\not =y$ in $X$.

A majorization scheme generates an iterative majorization algorithm by the rule

The key result in majorization theory is the sandwich inequality

If the function $f$ is bounded below, the iterates stay in a compact set, the majorization scheme is continuous, and all minima are attained at unique points, then we have convergence to a fixed point $x_\infty\in X$, i.e. a point with $x_\infty=\mathop{\text{argmin}}_{x\in X}g(x,x_\infty)$. More precisely, the sequence $f(x^{(k)})$ converges, each accumulation point of the sequence $x^{(k)}$ we generate is a fixed point, and all accumulation points have the same function value. The assumptions for convergence can be relaxed a great deal, but we will not specify these in detail.

Finally, note that we use majorization for minimization problems. In the same way we can use minorization for maximization problems. This is the reason majorization algorithms are usually called MM algorithms these days.

Suppose $X$ is the whole of $\mathbb{R}^n$, and $f$ and $g$ are differentiable at a fixed point $x$. Then $\mathcal{D}_1g(x,x)=\mathcal{D}f(x)$ and if $f$ and $g$ are twice differentiable $\mathcal{D}_{11}g(x,x)\gtrsim\mathcal{D}^2f(x)$ in the Loewner sense. This follows from the fact that $g(x,y)-f(x)$ attains its minimum, equal to zero, over $x\in X$ at $x=y$. In fact the majorization relation implies that $\mathcal{D}^2f(x)=\mathcal{D}_{11}g(x,x)+\mathcal{D}_{12}g(x,x)$ and $\mathcal{D}_{21}g(x,x)+\mathcal{D}_{22}g(x,x)=0$ for all $x$.

If $A$ is the algorithmic map that computes the successor of $y$, i.e. $A(y)=\mathop{\text{argmin}}_{x\in X}g(x,y)$, then

If the spectral norm $\rho(x)$, the eigenvalue of largest modulus, of $\mathcal{D}A(x)$ is strictly between zero and one, then the majorization algorithm converges linearly with rate $\rho(x)$.

In this section we will try to give some idea of the tools that are available to construct majorization schemes and of the corresponding majorization algorithms they generate.

Any inequality of the form $h(x,y)\geq f(x)+f(y)$, with equality if and only if $x=y$ leads to a strict majorization scheme with $g(x,y)=h(x,y)-f(y)$. This also covers inequalities of the form $h(x,y)\leq f(x)+f(y)$ by multiplying by -1, and inequalities of the form $h(x,y)\geq f(x)f(y)$ (with positive functions) by taking logarithms.

Consider, for example, minimization of

where the $h_i(x)=x'A_ix-2x'b_i+c_i$ and we assume the quadratics $h_i(x)$ are positive for all $x$. This has as a special case the Weber problem of locating a point $x$ at minimum weighted distance fom a number of given points $z_i$. In statistics it is also known as computing the multivariate median.

The AM-GM inequality says

and thus $f(x)\leq g(x,y)$ with a majorization scheme

which is quadratic in $x$, and thus easy to minimize. The majorization algorithm is

In multidimensional scaling, or MDS, we minimize a function of the form

where again the $A_i$ are positive semi-definite. Suppose without loss of generality the $\delta_i$ have weighted sum of squares equal to 1. Then

with

The Cauchy-Schwartz inequality gives

and thus we have a majorization scheme defined by

with

The majorization algorithm is

Suppose $-\infty=a_0<a_1<\cdots<a_n<a_{m+1}=+\infty$ are numbers that partition the real line into $(m+1)$ intervals. Suppose that a discrete random variable takes the values $1,\cdots,m+1$ with probabilities

with

From a random sample of size $n$ we have observed proportions $p_j$, which gives a log-likelihood

We will now use Jensen’s inequality in the form

which gives

It follows that in step $k$ of the majorization algorithm we minimize

where $E_j^{(k)}$ and $V_j^{(k)}$ are the mean and variance of the truncated normal on $(a_{j-1},a_j)$ with parameters $\mu^{(k)}$ and $\sigma^{(k)}$. The minimizers are

and

In this case the majorization algorithm gives the same results as the EM algorithm, which is not surprising because EM algorithms are majorization algorithms based on Jensen’s inequality.

Instead of specific inequalities we can also base majorization schemes on more general tools, such as properties of convex or concave functions. If $f$ is concave, for example,

for any $z\in\partial f(y)$, the subgradient at $y$. This is known as tangential minorization, because a concave function is below any of its tangents, in the same way as a convex function is above its tangents.

Suppose $\mathcal{K}_j$, for $j=1,\cdots,m$, are convex cones in $\mathbb{R}^n$. Cone $\mathcal{K}_j$ defines the possible transformations of $n$ observations on variable $j$. Also $\mathcal{S}$ is the unit sphere in $\mathbb{R}^n$.

Given $x_j\in\mathcal{K}_j\cap\mathcal{S}$ we can compute the correlation matrix $R(x)=\{r_{j\ell}(x)\}=\{x_j'x_\ell\}$. Now suppose

where $h$ is a convex real-valued function on the space of correlation matrices, which we call an aspect. Multivariate analysis abounds with convex gain functions, such as the sum of the $p$ largest eigenvalues, the negative log-determinant, the squared multiple correlation of one variable with the others, and the maximum of the multinormal log-likelihood.

By concavity

with $G(y)\in\partial h(y)$. Thus we have a minorization scheme and the minorization algorithm maximizes the quadratic

over the $x_j\in\mathcal{K}_j\cap\mathcal{S}$. Computing transformed variables in this way is sometimes known as optimal scaling.

We can use the definition of convexity directly to majorize $f$ by a separable majorization scheme, a weighted sum of functions of one variable. Suppose the function $f$ we must minimize is defined by

where $h$ is a convex function of a single variable, and $w$ is a vector of positive numbers.

If $y$ is another vector of $n$ positive numbers we can write

and if $g$ is defined as

then, by the definition of convexity, $f(x)\leq g(x,y)$. Also, clearly, $f(x)=g(x,x)$ and thus we have us a separable majorization scheme.

Alternatively, for any positive vector $\pi$ with elements adding up to one,

and the majorization scheme is defined as

Another, even more general tool, is Taylor’s theorem for a sufficiently many times differentiable function.

If we use the Langrange form of the remainder we can write

for some $\xi$ on the line connecting $x$ and $y$. Thus

gives a majorization scheme, but the scheme may not be simple to minimize. If, however, $\mathcal{D}^2f(x)\lesssim A$ for all $x\in X$ then the majorization scheme can be chosen as the quadratic

and the majorization algorithm is

with $\mathcal{P}$ the metric projection on $X$.

A simple example is logistic regression. The negative log-likelihood is

with

This simplifies to

and thus

and

Now $\pi_i(x)(1-\pi_i(x))\leq\frac14$, and thus

This simple bound can be used to get a quadratic majorization scheme.

Most majorization algorithms discussed so far have a linear convergence rate, because $\mathcal{D}f^2(x)$ is strictly smaller, in the Loewner sense, than $\mathcal{D}_{11}g(x,x)$. If we can find majorization schemes $g$ with the same second order derivatives as the objective function $f$, then we will have a superlinear convergence rate.

Taylor’s theorem, with the Lagrange remainder, tells us that for three-times differentiable functions

with apologies for the notation. Here again $\xi=\lambda x+(1-\lambda)y$ for some $0\leq\lambda\leq 1$.

We get a majorization scheme by bounding the cubic term $\mathcal{D}^3f(\xi)\{(x-y)^3\}.$ This can be done in various ways, but most of them lead to majorization schemes that are difficult to minimize.

One of the more promising ones uses a bound of the form

with $\|x-y\|=\sqrt{(x-y)'(x-y)}$.

This leads to minimization of the majorization scheme

and to solving the system

Let $\delta=\|x-y\|$. The solution for $x$ is a regularized Newton step of the form

where $\delta$ solves the single-variable secular equation

The secular equation can be solved efficiently, starting from the eigen-decomposition of $\mathcal{D}^2f(y)$. Suppose $L\Lambda L'$ is one such eigen-decomposition, and define $g:=L'\mathcal{D}f(y)$. Then we must solve

We again use logistic regression as an example. For the negative log-likelihood

Now $|\pi_i(x)(1-\pi_i(x))(1-2\pi_i(x))|\leq \frac{1}{18}\sqrt{3}$ so that

Thus

The bound is crude, so it may take some iterations before superlinear convergence takes over.