Note: This is a working paper which will be expanded/updated frequently. All suggestions for improvement are welcome. The directory deleeuwpdx.net/pubfolders/constraints has a pdf version, the bib file, the complete Rmd file, and the R and C code (if applicable).

# 1 Introduction

Majorization algorithms or MM algorithms are an increasingly popular class of techniues for iteratively solving the problem of minimizing a real valued function $$f$$ over a set $$\mathcal{X}\subseteq\mathbb{R}^n$$. Some of the pertinent early references are De Leeuw (1994) and W. Heiser (1995), and there are recent book-length treatments in De Leeuw (2016) and Lange (2016).

Our treatment here is more general than the usual one. A majorization algorithm is defined by a majorization scheme, which is

• a point-to-set map $$\mathcal{Z}:\mathcal{X}\rightrightarrows\mathcal{X}$$ with $$y\in\mathcal{Z}(y)$$ for all $$y\in\mathcal{X}$$.

• a function $$g:\mathcal{X}\otimes\mathcal{X}\Rightarrow\mathbb{R}$$ such that $$f(x)\leq g(x,y)$$ for all $$y\in\mathcal{X}$$ and $$x\in\mathcal{Z}(y)$$ and $$g(y,y)=f(y)$$ for all $$y\in\mathcal{X}$$.

The update map of a majorization scheme is $\begin{equation}\label{E:omega} \omega(y)\mathop{=}^\Delta\mathop{\mathbf{argmin}}\ \{g(x,y)\mid x\in\mathcal{Z}(y)\}. \end{equation}$

Iteration $$k+1$$ of a majorization algorithm updates the current solution $$x^{(k)}$$by the rule

• if $$x^{(k)}\in\omega(x^{(k)})$$ stop,
• else $$x^{(k+1)}\in\omega(x^{(k)})$$.

The algorithm is stable, in the sense that it either stops at a fixed point $$x\in\omega(x)$$ or, by the sandwich inequality, $f(x^{(k+1)})\leq g(x^{(k+1)},x^{(k)})<g(x^{(k)},x^{(k)})=f(x^{(k)})$ an iteration strictly decreases the objective function $$f$$. Our definitions are more general than the usual ones, because $$\mathcal{Z}(x^{(k)})$$ can be different for each $$k$$ and we only require $$f(x)\leq g(x,x^{(k)})$$ for $$x\in\mathcal{Z}(x^{(k)})$$.

Convergence of majorization algorithms to a fixed point $$x\in\omega(x)$$ is guaranteed under conditions that are satisfied in many applications (De Leeuw (2016), Lange (2016)). For most majorization algorithms used so far convergence is linear, with a convergence rate equal to the spectral norm, i.e. the eigenvalue of maximum modulus, of the derivative $$\mathcal{D}\omega(y)$$ at the fixed point $$y$$. This paper is about computing that derivative in several majorization situations.

There is a huge literature on derivatives of the update map in the area of parametric mathematical programming, where the update map is usually called the solution map. Good overviews of the initial developments of the field are in Bank et al. (1982) and Fiacco (1983). Levitin (1994) is a truly virtuoso and exhaustive treatment, primarily using the tools of classical analysis. Recent modern book length treatments, using the more abstract framework of modern variational analysis, are in Bonnans and Shapiro (2000), Luderer, Minchenko, and Satsura (2002), and Dontchev and Rockafellar (2014). Excellent introductory overviews of the field are Giorgi and Zuccotti (1918) and the lecture notes of Still (2018).

# 2 No Constraints

If there are no constraints on $$x$$ the derivative of the update map in a majorization algorithm follows from a straightforward application of the implicit function theorem (IFT). More precisely, the IFT guarantees the existence of the derivative, and then the actual derivative is easily computed. The result below, which applies the IFT to majorization algorithms, is taken from De Leeuw (2016).

Define $$\mathcal{Z}(y)=\mathbb{R}^n$$ for all $$y$$. We assume that $$g$$ is two times continuously differentiable in a neighborhood of the fixed point $$(\omega(y),y)$$.

At a minimum $\mathcal{D}_1g(\omega(y),y)=0$ Differentiating again $\mathcal{D}_{11}g(\omega,y)\mathcal{D}\omega(y)+\mathcal{D}_{12}g(\omega,y)=0,$ and thus, assuming the inverse exists, $\mathcal{D}\omega(y)=-[\mathcal{D}_{11}g(\omega(y),y)]^{-1}\mathcal{D}_{12}g(\omega(y),y).$ So far, nothing new. This is just the usual formula for the derivatives of an implicit function. But in our case $$g$$ is a majorization scheme for $$f$$. Thus $$g(x,y)-f(x)$$ has a minimum over $$x$$ at $$y$$, with minimum value zero. This implies that $\mathcal{D}_1g(y,y)=\mathcal{D}f(y)$ for all $$y$$, and thus, assuming $$f$$ is also twice differentiable, $\mathcal{D}_{11}g(y,y)+\mathcal{D}_{12}g(y,y)=\mathcal{D}^2f(y),$ Majorization also implies that $$\mathcal{D}_{11}g(y,y)\succeq\mathcal{D}^2f(y)$$ in the Loewner sense, i.e. $$\mathcal{D}_{11}g(y,y)-\mathcal{D}^2f(y)$$ is positive semi-definite. Thus $$\mathcal{D}_{12}g(y,y)$$ is symmetric and negative semi-definite.

Combining majorization with the derivative from the IFT shows that at a fixed point $$y$$, $\mathcal{D}\omega(y)=-[\mathcal{D}_{11}g(y,y)]^{-1}\mathcal{D}_{12}g(y,y)=I-[\mathcal{D}_{11}g(y,y)]^{-1}\mathcal{D}^2f(y).$

## 2.1 Example: SMACOF

The Euclidean least squares metric multidimensional scaling problem is the minimization, after a change of coordinates detailed in De Leeuw (1993), of the function $f(x)=1-\sum_{k=1}^K w_k\delta_k\sqrt{x'A_kx}+\frac12x'x,$ where the $$A_i$$ are positive semi-definite matrices of order $$n$$ that add up to the identity. The $$w_k$$ and $$\delta_k$$ are positive weights and dissimilarities. It follows that, provided $$x'A_kx>0$$ for all $$k$$, that $\mathcal{D}f(x)=x-B(x)x,$ where $B(x)\mathop{=}^\Delta\sum_{k=1}^K w_k\frac{\delta_k}{\sqrt{x'A_kx}}A_k$ Also $\mathcal{D}^2f(x)=I-H(x),$ where $H(x)\mathop{=}^\Delta\sum_{k=1}^K w_k\frac{\delta_k}{\sqrt{x'A_kx}}\left\{A_k-\frac{A_kxx'A_k}{x'A_kx}\right\}.$ The SMACOF majorization algorithm (De Leeuw (1977), De Leeuw and Heiser (1977), De Leeuw and Heiser (1980), De Leeuw (1984), De Leeuw (1988), Borg and Groenen (2005), De Leeuw and Mair (2009)) uses the Cauchy-Schwartz inequality in the form $\sqrt{x'A_kx}\geq\frac{1}{\sqrt{y'A_ky}}x'A_ky.$ This implies that $$g$$, defined as $g(x,y)=1-x'B(y)y+\frac12 x'x$ majorizes $$f$$ at $$y$$, which leads to the algorithm $x^{(k+1)}=B(x^{(k)})x^{(k)}.$ By direct calculation the derivatives of the algorithmic map at a fixed point $$y$$ is $$H(y)$$. From our previous implicit function formulas we get $\mathcal{D}\omega(y)=I-[\mathcal{D}_{11}g(y,y)]^{-1}\mathcal{D}^2f(y)=I-(I)^{-1}(I-H(y))=H(y),$ while also $$\mathcal{D}_{12}g(y,y)=-H(y)$$.

# 3 Equality Constraints

Suppose $$h_j:\mathbb{R}^n\Rightarrow\mathbb{R}$$ are $$m$$ continuously differentiable functions, and $\mathcal{X}=\{x\in\mathbb{R}^n\mid h_j(x)=0\}.$ We assume that for each $$y$$ the minimum of $$g$$ over $$x\in\mathcal{X}$$ is unique and attained at $$\omega(y)$$, and that the $$m$$ vectors $$\mathcal{D}h_j(\omega(y))$$ are linearly independent. At a solution of the stationary equations we have \begin{align*} \mathcal{D}_1g(\omega(y),y)+\sum_{j=1}^m\lambda_j(y)\mathcal{D}h_j(\omega(y))&=0,\\ h_j(\omega(y))&=0, \end{align*}

where the $$\lambda_j$$ are Lagrange multipliers, which are uniquely defined because of the linear independence assumption.

Define $A\mathop{=}^\Delta\mathcal{D}_{11}g(\omega(y),y)+\sum_{j=1}^m\lambda_j(y)\mathcal{D}^2h_j(\omega(y)),$ and \begin{align*} B&\mathop{=}^\Delta\mathcal{D}_1h(\omega(y)),\\ E&\mathop{=}^\Delta\mathcal{D}_{12}g(\omega(y),y),\\ F&\mathop{=}^\Delta\mathcal{D}\omega(y),\\ G&\mathop{=}^\Delta\mathcal{D}\lambda(y). \end{align*}

Differentiating the stationary equations gives

$\begin{bmatrix} \mathcal{D}_{11}g(\omega(y),y)+\sum_{j=1}^m\lambda_j(y)\mathcal{D}^2h_j(\omega(y))&\mathcal{D}_1h(\omega(y))\\\mathcal{D}_1h(\omega(y))&0 \end{bmatrix} \begin{bmatrix} \mathcal{D}\omega(y)\\ \mathcal{D}\lambda(y) \end{bmatrix} = \begin{bmatrix} -\mathcal{D}_{12}g(\omega(y),y)\\ 0 \end{bmatrix}$ \begin{align*} AF+B'G&=-E,\\ BF&=0, \end{align*} which has the solution, assuming $$A$$ is invertible, \begin{align*} G&=-(BA^{-1}B')^{-1}BA^{-1}E,\\ F&=-A^{-1}(I-B'(BA^{-1}B')^{-1}BA^{-1})E. \end{align*}

The convergence rate of the algorithm is the eigenvalue of $$F$$ with largest modulus, evaluated at the fixed point where $$\omega(y)=y$$.

Also, majorization of $$f$$ by $$g$$ implies that $$\mathcal{D}_1g(x,x)=\mathcal{D}f(x)$$ for all $$x$$, and thus at a fixed point $-E=\mathcal{D}_{12}g(x,x)=\mathcal{D}_{11}g(x,x)-\mathcal{D}^2f(x).$ This, in turn, implies that $$-E$$ is symmetric and positive semi-definite.

## 3.1 Example: Power Method

Consider the problem of minimizing $$f$$ with $$f(x)=-\frac12 x'Vx$$, with $$V$$ positive definite of order $$n$$, over all $$x$$ satisfying $$x'x=1$$ and $$Ux=0$$, with $$U$$ an $$m\times n$$ matrix of rank $$m$$.

Since $$-\frac12(x-y)'V(x-y)\leq 0$$ we see that $$g$$ with $$g(x,y)=\frac12y'Vy-x'Vy$$ is a suitable majorization function. In this case we can derive an explicit formula for the update map, and consequently we can check if the results correspond with the ones derived from our version of the IFT for the Langrangian.

The Langrangian stationary equations for solving the majorization iterations are \begin{align*} -Vy+\lambda x+U'\mu&=0,\\ x'x&=1,\\ Ux&=0. \end{align*}

Thus $$\mu=(UU')^{-1}UVy$$ and $$QVy=\lambda x$$, with $$Q$$ the orthogonal projector $Q\mathop{=}^\Delta I-U'(UU')^{-1}U.$ We find $$\lambda=\sqrt{y'VQVy}$$, and the update map is $\omega(y)=\frac{1}{\lambda}QVy.$ Thus the majorization algorithm is the power method for the matrix $$QV$$, which converges to its largest eigenvalue and corresponding right eigenvector.

Taking derivatives of the update map $\mathcal{D}\omega(y)=\frac{1}{\lambda}\left\{QV-\frac{1}{\lambda^2}QVyy'VQV\right\},$ At a stationary point $$QVy=\lambda y$$, and thus $\mathcal{D}\omega(y)=\frac{1}{\lambda}\{QV-yy'V\}$ Note that if $$y$$ is a right eigenvector of $$QV$$ then $$Vy$$ is a left eigenvector. The convergence rate of the majorization algorithm is the ratio of the second largest to the largest eigenvalue of $$QV$$.

In addition we find $$\mathcal{D}\mu(y)=(UU')^{-1}UV$$ and $$\mathcal{D}\lambda(y)=\frac{1}{\lambda}VQVy$$, which is $$\mathcal{D}\lambda(y)=Vy$$ at a stationary point.

Now let’s see if we get the same solution by solving the system \begin{align*} AF+B'G&=-E,\\ BF&=0. \end{align*} For our example \begin{align*} A&=\lambda I,\\ B&=\begin{bmatrix}x'\\U\end{bmatrix},\\ E&=-V,\\ \end{align*}

We must have $\lambda F+\begin{bmatrix}x&U'\end{bmatrix}G=V,$ and $\begin{bmatrix} x'\\U\end{bmatrix}F=0.$ For the derivatives of the Lagrange multipliers we find $G=\begin{bmatrix}x'V\$$UU')^{-1}UV\end{bmatrix},$ which corresponds with our results using the explicit formulas. Also $\lambda F=V-\begin{bmatrix}x&U'\end{bmatrix}G=V-xx'V-U(UU')^{-1}UV=QV-xx'V,$ which is indeed what we found using the explicit expression for \(\omega$$.

## 3.2 Example: Least Squares with Quadratic Constraint

Consider the problem of minimizing $f(x)=\frac12(x-u)'V(x-u)$ over all $$x$$ satisfying $$\frac12 x'Wx=1$$. Here $$W$$ is assumed to be positive definite. This problem, or some variation of it, has been studied by many authors, starting with Forsythe and Golub (1965), Spjøtvoll (1972), Gander (1981).

We use majorization to construct an iterative algorithm. Suppose $$\kappa$$ is such that $$V\preceq\kappa W$$. Then $f(x)\leq f(y)+(x-y)'V(y-z)+\frac12\kappa(x-y)'W(x-y).$ We can minimize the majorization function over $$\frac12 x'Wx=1$$ by minimizing the linear function $$x'V(y-z)-\kappa x'Wy$$ over $$\frac12x'Wx=1$$. The minimum is attained for $x\propto y-\frac{1}{\kappa}W^{-1}V(y-z),$ and we find the update by normalizing. The derivative of the update function, computed directly, is $\mathcal{D}\omega(y)=\left[I-\frac{yy'W}{y'Wy}\right]\left[I-\frac{1}{\kappa}W^{-1}V\right]$ The Lagrangian equations for the majorization problem are \begin{align*} V(y-z)-\kappa Wy+\lambda(y)W\omega(y)&=0,\\ \frac12\omega(y)'W\omega(y)&=1 \end{align*} Thus \begin{align*} \lambda'(y)W\omega(y)+\lambda(y)W\mathcal{D}\omega(y)&=0,\\ \omega(y)'W\mathcal{D}\omega(y)&=0. \end{align*}

$\mathcal{D}\omega(y)=-\frac{\lambda'(y)}{\lambda(y)}\omega(y)$ Thus $$\lambda'(y)=\omega(y)'W\omega(y)$$ and $$\lambda(y)=\mathcal{D}\omega(y)'W\mathcal{D}\omega(y)$$.

The IFT with \begin{align*} A&=\lambda W\\ B&=Wx,\\ E&=V-\kappa W \end{align*}

$F=-(\lambda W)^{-1}(I-x'W(Wx(\lambda W)^{-1}xW)^{-1}Wx(\lambda W)^{-1})(V-\kappa W)=\\ -\lambda^{-2} W^{-1}(I-\frac{\lambda}{x'W^{-1}x}xx'W)^{-1}x'W^{-1}(V-\kappa W)$

# 4 Parametric Equality Constraints

A more general problem is to compute, and study, $\omega(y)\mathop{=}^\Delta\mathop{\mathbf{argmin}}\ \{g(x,y)\mid H(x,y)=0\},$ $\mathcal{X}(y)=\{x\in\mathbb{R}^n\mid\mathop{\forall}_{j=1}^mh_j(x,y)=0\}.$ where the constraints also depend on the parameter $$y$$.The relevance for majorization algorithms may not be immediately obvious. When treating inequality constraints in a subsequent section, however, the formulas in this section will be useful.

The Lagrangian equations are

\begin{align*} \mathcal{D}_1g(\omega(y),y)+\sum_{j=1}^m\lambda_j(y)\mathcal{D}_1h_j(\omega(y),y)&=0,\\ H(\omega(y),y)&=0. \end{align*} Differentiating again gves $\begin{multline*} \mathcal{D}_{11}g(\omega(y),y)\mathcal{D}x(y)+\mathcal{D}_{12}g(\omega(y),y)+\\ \sum_{j=1}^m\lambda_j(y)\{\mathcal{D}_{11}h_j(\omega(y),y)\mathcal{D}\omega(y)+\mathcal{D}_{12}h(\omega(y),y)\}+\\ \sum_{j=1}^m\mathcal{D}\lambda_j(y)\mathcal{D}_1h_j(\omega(y),y)=0, \end{multline*}$ and $\mathcal{D}_1h_j(\omega(y),y)\mathcal{D}\omega(y)+\mathcal{D}_2h_j(\omega(y),y)=0.$ Define $A\mathop{=}^\Delta\mathcal{D}_{11}g(\omega(y),y)+\sum_{j=1}^m\lambda_j(y)\mathcal{D}_{11}h_j(\omega(y),y),$ and \begin{align*} B&\mathop{=}^\Delta\mathcal{D}_1h(\omega(y),y),\\ C&\mathop{=}^\Delta\mathcal{D}_2h(\omega(y),y),\\ D&\mathop{=}^\Delta\sum_{j=1}^m\lambda_j(y)\mathcal{D}_{12}h_j(\omega(y),y),\\\ E&\mathop{=}^\Delta\mathcal{D}_{12}g(\omega(y),y),\\ F&\mathop{=}^\Delta\mathcal{D}\omega(y),\\ G&\mathop{=}^\Delta\mathcal{D}\lambda(y). \end{align*} We must solve \begin{align*} AF+B'G=-(E+D),\\ BF=-C. \end{align*}

We have $$F=-A^{-1}(B'G+E+D)$$ and $$BA^{-1}(B'G+E+D)=C$$. Thus $G=(BA^{-1}B')^{-1}\{C-BA^{-1}(E+D)\},$ and $F=-A^{-1}B'(BA^{-1}B')^{-1}C-\{A^{-1}-A^{-1}B'(BA^{-1}B')^{-1}BA^{-1}\}(E+D).$ We recover the results for the basic case from the previous section by setting $$C=0$$ and $$D=0$$.

# 5 Inequality Constraints, Part I

If the minimization problem solved in the majorization step involves inequality constraints $\omega(y)\mathop{=}^\Delta\mathop{\mathbf{argmin}}\ \{g(x,y)\mid H(x)=0\ \wedge K(x)\geq 0\},$ $\mathcal{X}(y)=\{x\in\mathbb{R}^n\mid\mathop{\forall}_{j=1}^\ell h_j(x,y)= 0, \mathop{\forall}_{j=\ell+1}^mh_j(x,y)\geq 0\}.$ then we cannot expect differentiability any more.

Of course an inequality constraint like $$h_j(x)\geq 0$$ can be rewritten as an equality constraint $$h_j(x)-z_j^2=0$$, where the $$z_j$$ are additional variables.

Just consider the very simple example $\theta(y)=\min\ \{(x-y)^2\mid x\geq 0\}=\begin{cases}0&\text{ if }y\geq 0,\\ y^2&\text{ otherwise},\end{cases}$ and $\omega(y)=\mathop{\mathbf{argmin}}\ \{(x-y)^2\mid x\geq 0\}=\begin{cases}y&\text{ if }y\geq 0,\\0&\text{ otherwise}.\end{cases}$ The value function $$\theta$$ is differentiable at zero, but not twice differentiable. The solution function $$\omega$$ is continuous, but not differentiable at zero. At zero it does, however, have a left derivative equal to zero and a right derivative equal to one.

Minimizing $$g(x,y)=(x^2-y^2)^2$$ over $$x\geq 0$$ gives $$\theta$$ identically equal to zero, and $$\omega(y)=|y|$$, continuous but not differentiable at zero. For $$g(x,y)=(x-y^2)^2$$ and $$x\geq 0$$ we again find $$\theta$$ identically equal to zero b but now $$\omega(y)=y^2$$, which is differentiable. These small examples show that in general the solution function will not be differentiable, but may be directionally differentiable.

Convert

$$k_j(x)-z_j^2=0$$

If our original problem is to minimize $$f$$ over all $$x$$ satisfying $$h_j(x)=0$$ and $$p_\ell(x)\leq 0$$, then it may make sense to majorize both $$f$$ and the $$k_\ell$$.

If $$g:\mathbb{R}\otimes\mathbb{R}\Rightarrow\mathbb{R}$$ majorizes $$f$$ at $$y$$ and $$q_\ell:\mathbb{R}\otimes\mathbb{R}\Rightarrow\mathbb{R}$$ majorizes $$k_ell$$ at $$y$$, then the majorization algorithm is $x^{(k+1)}=\mathop{\mathbf{argmin}}\ \{g(x,x^{(k)}\mid h_j(x)=0\ \wedge\ q_\ell(x,y^{(k)})\leq 0\}$ In fact we will look at the more general problem $\omega(y)=\mathop{\mathbf{argmin}}_x\ \{g(x,y)\mid h_j(x,y)=0\ \wedge\ p_\ell(x,y)\leq 0\}$ From the Karush-Kuhn-Tucker necessary conditions for this minimization problem $\mathcal{D}_1g(\omega(y),y)+\sum_{j=1}^m\lambda_j(y)\mathcal{D}_1h_j(\omega(y),y)+\sum_{\ell=1}^r\gamma_\ell(y)\mathcal{D}_1p_\ell(\omega(y),y)=0$ \begin{align*} h_j(\omega(y),y)&=0,\\ p_\ell(\omega(y),y)&\leq 0,\\ \gamma_\ell(y)&\geq 0,\\ \gamma_\ell(y)p_\ell(\omega(y),y)&=0. \end{align*}

# 6 Inequality Constraints, Part II

If the conditions of the previous section are not satisfied we cannot guarantee differentiability, and we need to take a step back. First, we need to use the version of the linear convergence theorem for fixed point iterations from Sheng and Xu (2001). Then we need to step down a level from continuous differentiability to directional differentiability and the Clark Jacobian, following Dempe and Pallaschke (1997).