Note: This is a working paper which will be expanded/updated frequently. All suggestions for improvement are welcome. The directory has a pdf version, the bib file, the complete Rmd file with the code chunks, and the R source code.

1 Introduction

This book is about the least squares loss function \[\begin{equation}\label{E:stress}\sigma(X)=\frac12\mathop{\sum}_{1\leq i<j\leq n} w_{ij}(\delta_{ij}-d_{ij}(X))^2, \end{equation}\] defined on \(\mathbb{R}^{n\times p}\), the space of all \(n\times p\) matrices. We follow Kruskal (1964) and call \(\sigma(X)\) the stress of configuration \(X\). Minimizing stress over p-dimensional configurations is the pMDS problem.

In \(\eqref{E:stress}\) the matrices of weights \(W=\{w_{ij}\}\) and dissimilarities \(\Delta=\{\delta_{ij}\}\) are symmetric, non-negative, and hollow (zero diagonal). The matrix-valued function \(D(X)=\{d_{ij}(X)\}\) contains Euclidean distances between the rows of the configuration \(X\), which are the coordinates of \(n\) points in \(\mathbb{R}^p\). Thus \(D(X)\) is also symmetric, non-negative, and hollow.
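As a concrete illustration of \(\eqref{E:stress}\), stress can be computed directly from its definition. This is a minimal sketch in Python/NumPy, not the book's R code, and the toy data are made up:

```python
import numpy as np

def stress(X, delta, W):
    """Raw stress: half the weighted sum of squared residuals
    between dissimilarities and configuration distances."""
    n = X.shape[0]
    # Euclidean distances between the rows of X
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))
    i, j = np.triu_indices(n, k=1)          # pairs with i < j
    return 0.5 * np.sum(W[i, j] * (delta[i, j] - D[i, j]) ** 2)

# toy data: 4 points in the plane, all weights one
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
W = np.ones((4, 4)) - np.eye(4)             # symmetric, hollow
D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))
delta = D.copy()                            # dissimilarities fit perfectly
print(stress(X, delta, W))                  # → 0.0
```

Since the dissimilarities here equal the configuration distances exactly, the residuals vanish and stress is zero; perturbing \(\Delta\) makes it positive.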

Two important special cases of pMDS are Unidimensional Scaling, which is 1MDS, and Full-dimensional Scaling, which is nMDS. Because of their importance, they also have their very own acronyms UDS and FDS, and they have their own chapters in this book.

In pMDS we minimize stress. This means that we are trying to find the global minimum over p-dimensional configurations. In practice, however, our algorithms find local minima, which may or may not be global. In this paper we will summarize what we know about global and local minima, and what we know about local maxima and saddle points.

2 Notation

We start with some convenient notation, first introduced in De Leeuw (1977). Vector \(e_i\) has \(n\) elements, with element \(i\) equal to \(+1\), and all other elements zero. \(A_{ij}\) is the matrix \((e_i-e_j)(e_i-e_j)'\), which means elements \((i,i)\) and \((j,j)\) are equal to \(+1\), while \((i,j)\) and \((j,i)\) are \(-1\). Thus \[\begin{equation}\label{E:anot} d_{ij}^2(X)=(e_i-e_j)'XX'(e_i-e_j)=\text{tr}\ X'A_{ij}X=\text{tr}\ A_{ij}C, \end{equation}\] with \(C=XX'\).
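The identity \(\eqref{E:anot}\) is easy to verify numerically; a small Python/NumPy sketch (illustrative, with random data):

```python
import numpy as np

n, p = 4, 2
rng = np.random.default_rng(0)
X = rng.standard_normal((n, p))
C = X @ X.T                                # the matrix C = XX'

def A(i, j, n):
    """A_ij = (e_i - e_j)(e_i - e_j)'."""
    e = np.zeros(n)
    e[i], e[j] = 1.0, -1.0
    return np.outer(e, e)

# d_ij^2(X) = tr A_ij C for every pair i < j
for i in range(n):
    for j in range(i + 1, n):
        d2 = np.sum((X[i] - X[j]) ** 2)
        assert np.isclose(d2, np.trace(A(i, j, n) @ C))
```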

We also define \[\begin{equation}\label{E:V} V=\mathop{\sum}_{1\leq i<j\leq n}w_{ij}A_{ij}, \end{equation}\] and the matrix-valued function \(B(\bullet)\) with \[\begin{equation}\label{E:B} B(X)=\mathop{\sum}_{d_{ij}(X)>0}w_{ij}\frac{\delta_{ij}}{d_{ij}(X)}A_{ij}, \end{equation}\] and \(B(X)=0\) if \(X=0\). If we assume, without loss of generality, that \[ \frac12\mathop{\sum}_{1\leq i<j\leq n}w_{ij}\delta_{ij}^2=1, \] then \[\begin{equation}\label{E:S} \sigma(X)=1-\text{tr}\ X'B(X)X+\frac12\text{tr}\ X'VX. \end{equation}\]
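The matrix form \(\eqref{E:S}\) of stress can be checked against the definition \(\eqref{E:stress}\); a Python/NumPy sketch (illustrative random data, with \(\Delta\) scaled to satisfy the normalization):

```python
import numpy as np

def dist(Z):
    return np.sqrt(((Z[:, None, :] - Z[None, :, :]) ** 2).sum(axis=2))

n, p = 4, 2
rng = np.random.default_rng(1)
X = rng.standard_normal((n, p))
W = np.ones((n, n)) - np.eye(n)
iu = np.triu_indices(n, k=1)
delta = dist(rng.standard_normal((n, p)))
delta /= np.sqrt(0.5 * np.sum(W[iu] * delta[iu] ** 2))   # (1/2) sum w delta^2 = 1

def A(i, j):
    e = np.zeros(n); e[i], e[j] = 1.0, -1.0
    return np.outer(e, e)

V = sum(W[i, j] * A(i, j) for i, j in zip(*iu))
D = dist(X)
B = sum(W[i, j] * delta[i, j] / D[i, j] * A(i, j)
        for i, j in zip(*iu) if D[i, j] > 0)

# sigma(X) = 1 - tr X'B(X)X + (1/2) tr X'VX agrees with the definition
sigma_matrix = 1 - np.trace(X.T @ B @ X) + 0.5 * np.trace(X.T @ V @ X)
sigma_direct = 0.5 * np.sum(W[iu] * (delta[iu] - D[iu]) ** 2)
assert np.isclose(sigma_matrix, sigma_direct)
```

Here \(\text{tr}\ X'B(X)X\) reproduces \(\sum w_{ij}\delta_{ij}d_{ij}(X)\) and \(\text{tr}\ X'VX\) reproduces \(\sum w_{ij}d_{ij}^2(X)\).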

We also suppose, without loss of generality, that \(W\) is irreducible, so that the pMDS problem does not separate into a number of smaller pMDS problems. For symmetric matrices irreducibility means that we cannot find a permutation matrix \(\Pi\) such that \(\Pi'W\Pi\) is the direct sum of a number of smaller matrices.

\(V\) is symmetric with non-positive off-diagonal elements. It is doubly-centered (rows and columns add up to zero) and thus weakly diagonally dominant. It follows that it is positive semi-definite (see Varga (1962), section 1.5). Because of irreducibility it has rank \(n-1\), and the vectors in its null space are all proportional to \(e\), the vector with all elements equal to \(+1\). The matrix \(B(X)\) is also symmetric, positive semi-definite, and doubly-centered for each \(X\). It may not be irreducible, because for example \(B(0)=0\).
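These properties of \(V\) can be verified numerically; a Python/NumPy sketch (an irreducible \(W\) with random positive weights, made up for illustration):

```python
import numpy as np

n = 5
rng = np.random.default_rng(2)
W = np.triu(rng.uniform(0.5, 1.5, (n, n)), 1)
W = W + W.T                                # symmetric, hollow, all w_ij > 0

def A(i, j):
    e = np.zeros(n); e[i], e[j] = 1.0, -1.0
    return np.outer(e, e)

V = sum(W[i, j] * A(i, j) for i in range(n) for j in range(i + 1, n))

assert np.allclose(V.sum(axis=0), 0)       # doubly centered
assert np.allclose(V @ np.ones(n), 0)      # e is in the null space
evals = np.linalg.eigvalsh(V)
assert evals[0] > -1e-10                   # positive semi-definite
assert np.sum(evals > 1e-10) == n - 1      # rank n - 1
```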

3 Differentiability

Obviously \[ \mathcal{D}\eta^2(X)=2VX. \] Additionally \(\rho(\bullet)\) is differentiable at \(X\) if and only if \(d_{ij}(X)>0\) for all \(i<j\) with \(w_{ij}\delta_{ij}>0\). In that case \[ \mathcal{D}\rho(X)=B(X)X. \] Result 13: \(\sigma(\bullet)\) is differentiable at \(X\) if and only if \(d_{ij}(X)>0\) for all \(i<j\) with \(w_{ij}\delta_{ij}>0\). In that case \[ \mathcal{D}\sigma(X)=(V-B(X))X. \] Result 14: If \(\sigma(\bullet)\) is differentiable at the stationary point \(X\), then \((V-B(X))X=0\), or, in fixed point form, \(X=V^+B(X)X\).
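The fixed point form is the update rule of the SMACOF algorithm of De Leeuw (1977): iterating the Guttman transform \(X\leftarrow V^+B(X)X\) never increases stress. A minimal Python/NumPy sketch, not the book's R code, with made-up data and iteration count:

```python
import numpy as np

def dist(Z):
    return np.sqrt(((Z[:, None, :] - Z[None, :, :]) ** 2).sum(axis=2))

n, p = 6, 2
rng = np.random.default_rng(3)
W = np.ones((n, n)) - np.eye(n)
iu = np.triu_indices(n, k=1)
delta = dist(rng.standard_normal((n, p)))
delta /= np.sqrt(0.5 * np.sum(W[iu] * delta[iu] ** 2))   # normalization

def A(i, j):
    e = np.zeros(n); e[i], e[j] = 1.0, -1.0
    return np.outer(e, e)

V = sum(W[i, j] * A(i, j) for i, j in zip(*iu))
Vplus = np.linalg.pinv(V)                  # Moore-Penrose inverse V^+

def Bmat(Z):
    D = dist(Z)
    return sum(W[i, j] * delta[i, j] / D[i, j] * A(i, j)
               for i, j in zip(*iu) if D[i, j] > 0)

def sigma(Z):
    return 0.5 * np.sum(W[iu] * (delta[iu] - dist(Z)[iu]) ** 2)

X = rng.standard_normal((n, p))
values = [sigma(X)]
for _ in range(50):
    X = Vplus @ Bmat(X) @ X                # fixed point update X <- V^+ B(X) X
    values.append(sigma(X))

# stress never increases along the iterations
assert all(a >= b - 1e-12 for a, b in zip(values, values[1:]))
```

A limit point of this iteration is a stationary point in the sense of Result 14, though not necessarily a global minimum.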

4 DC Functions

Following De Leeuw (1977) we also define \[\begin{align} \rho(X)&=\mathop{\sum}_{1\leq i<j\leq n} w_{ij}\delta_{ij}d_{ij}(X)=\text{tr}\ X'B(X)X,\\ \eta^2(X)&=\mathop{\sum}_{1\leq i<j\leq n} w_{ij}d_{ij}^2(X)=\text{tr}\ X'VX. \end{align}\]

Result 1: [DC]

  1. \(\rho(\bullet)\) is convex, homogeneous of degree one, non-negative, and continuous.
  2. \(\eta^2(\bullet)\) is convex, homogeneous of degree two, non-negative, quadratic, and continuous.
  3. \(\sigma(\bullet)\) is a difference of convex functions (a.k.a. a DC function or a delta-convex function).

See Hiriart-Urruty (1988) or Veselý and Zajíček (1989) for a general discussion of DC functions.

From the literature we know \(\sigma(\bullet)\) is locally Lipschitz, has directional derivatives and nonempty subdifferentials everywhere, and is twice differentiable almost everywhere. In fact \(\sigma(\bullet)\) is infinitely many times differentiable at all \(X\) that have \(d_{ij}(X)>0\) for all \(i<j\) with \(w_{ij}\delta_{ij}>0\).
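The convexity claims of Result 1 are easy to check numerically, for instance midpoint convexity of \(\rho\) and \(\eta^2\) on random segments. A Python/NumPy sketch (illustrative data):

```python
import numpy as np

def dist(Z):
    return np.sqrt(((Z[:, None, :] - Z[None, :, :]) ** 2).sum(axis=2))

n, p = 5, 2
rng = np.random.default_rng(5)
W = np.ones((n, n)) - np.eye(n)
iu = np.triu_indices(n, k=1)
delta = dist(rng.standard_normal((n, p)))
delta /= np.sqrt(0.5 * np.sum(W[iu] * delta[iu] ** 2))   # normalization

def rho(Z):  return np.sum(W[iu] * delta[iu] * dist(Z)[iu])
def eta2(Z): return np.sum(W[iu] * dist(Z)[iu] ** 2)

# midpoint convexity on random segments
for _ in range(100):
    X, Y = rng.standard_normal((2, n, p))
    M = 0.5 * (X + Y)
    assert rho(M) <= 0.5 * (rho(X) + rho(Y)) + 1e-12
    assert eta2(M) <= 0.5 * (eta2(X) + eta2(Y)) + 1e-12
```

Convexity of \(\rho\) follows because each \(d_{ij}(X)\) is a norm of a linear function of \(X\); convexity of \(\eta^2\) follows because \(V\) is positive semi-definite.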

5 Homogeneity

Result 2: [Bounded] At a local minimum point \(X\) of \(\sigma(\bullet)\) we have \(\eta(X)\leq\sqrt{2}\).

Proof: Homogeneity gives \(\sigma(\lambda X)=1-\lambda\rho(X)+\frac12\lambda^2\eta^2(X)\). If \(X\) is a local minimum then the minimum over \(\lambda\) is attained at \(\lambda=1\), i.e. we must have \(\rho(X)=\eta^2(X)\). By Cauchy-Schwarz and the normalization \(\frac12\sum w_{ij}\delta_{ij}^2=1\) we have \(\rho(X)\leq\sqrt{2}\,\eta(X)\) for all \(X\), and thus at a local minimum \(\eta^2(X)\leq\sqrt{2}\,\eta(X)\), i.e. \(\eta(X)\leq\sqrt{2}\). The bound is attained at a perfectly fitting configuration, where \(d_{ij}(X)=\delta_{ij}\) and thus \(\eta^2(X)=2\). \(\blacksquare\)
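The homogeneity relations used in this proof can be verified numerically; a Python/NumPy sketch (toy data, normalized as in the Notation section):

```python
import numpy as np

def dist(Z):
    return np.sqrt(((Z[:, None, :] - Z[None, :, :]) ** 2).sum(axis=2))

n, p = 5, 2
rng = np.random.default_rng(4)
W = np.ones((n, n)) - np.eye(n)
iu = np.triu_indices(n, k=1)
delta = dist(rng.standard_normal((n, p)))
delta /= np.sqrt(0.5 * np.sum(W[iu] * delta[iu] ** 2))   # normalization

def rho(Z):   return np.sum(W[iu] * delta[iu] * dist(Z)[iu])
def eta2(Z):  return np.sum(W[iu] * dist(Z)[iu] ** 2)
def sigma(Z): return 0.5 * np.sum(W[iu] * (delta[iu] - dist(Z)[iu]) ** 2)

X = rng.standard_normal((n, p))
lam = 1.7
assert np.isclose(rho(lam * X), lam * rho(X))            # degree one
assert np.isclose(eta2(lam * X), lam ** 2 * eta2(X))     # degree two
assert np.isclose(sigma(lam * X),
                  1 - lam * rho(X) + 0.5 * lam ** 2 * eta2(X))
# minimizing over lambda gives lambda* = rho/eta^2, where rho = eta^2 holds
lam_star = rho(X) / eta2(X)
assert np.isclose(rho(lam_star * X), eta2(lam_star * X))
```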

Result 3: [Local Maxima] \(X=0\) is the unique local maximum point of \(\sigma(\bullet)\) and \(\sigma(0)=1\) is the unique local maximum.

Proof: By homogeneity \(\sigma(\lambda X)=1-\lambda\rho(X)+\frac12\lambda^2\eta^2(X)\), a convex quadratic in \(\lambda\). A nonconstant convex quadratic has no interior maximum, so on the ray \(\lambda\geq 0\) through any \(X\neq 0\) the maximum can only occur at the boundary \(\lambda=0\), i.e. at the origin. \(\blacksquare\)

Result 4: [Unbounded] \(\sigma(\bullet)\) is unbounded and consequently has no global maximum.

Proof: Take any \(X\) with \(\eta(X)>0\). Then \(\sigma(\lambda X)=1-\lambda\rho(X)+\frac12\lambda^2\eta^2(X)\rightarrow\infty\) as \(\lambda\rightarrow\infty\). \(\blacksquare\)

Note that the unboundedness of stress also follows from the proof of result 3.
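Results 3 and 4 together say that along any ray stress has its maximum \(\sigma(0)=1\) at the origin and grows without bound. Numerically (a Python/NumPy sketch with illustrative data):

```python
import numpy as np

def dist(Z):
    return np.sqrt(((Z[:, None, :] - Z[None, :, :]) ** 2).sum(axis=2))

n, p = 5, 2
rng = np.random.default_rng(6)
W = np.ones((n, n)) - np.eye(n)
iu = np.triu_indices(n, k=1)
delta = dist(rng.standard_normal((n, p)))
delta /= np.sqrt(0.5 * np.sum(W[iu] * delta[iu] ** 2))   # normalization

def sigma(Z):
    return 0.5 * np.sum(W[iu] * (delta[iu] - dist(Z)[iu]) ** 2)

X = rng.standard_normal((n, p))
vals = [sigma(lam * X) for lam in (0.0, 1.0, 10.0, 100.0)]
# vals[0] is sigma(0) = 1; stress blows up as lambda grows
```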

6 Picture

Here is a picture of stress on a two-dimensional subspace which illustrates both result 3 and result 15.

We first make a global perspective plot, over the range \((-2.5,+2.5)\).
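The computation behind such a plot evaluates stress on a grid over a two-dimensional subspace of configuration space. The configurations used for the book's picture are not given in this excerpt, so this Python/NumPy sketch spans the subspace with arbitrary random directions \(X\) and \(Y\):

```python
import numpy as np

def dist(Z):
    return np.sqrt(((Z[:, None, :] - Z[None, :, :]) ** 2).sum(axis=2))

n, p = 4, 2
rng = np.random.default_rng(7)
W = np.ones((n, n)) - np.eye(n)
iu = np.triu_indices(n, k=1)
delta = dist(rng.standard_normal((n, p)))
delta /= np.sqrt(0.5 * np.sum(W[iu] * delta[iu] ** 2))   # normalization

def sigma(Z):
    return 0.5 * np.sum(W[iu] * (delta[iu] - dist(Z)[iu]) ** 2)

# stress on the plane spanned by X and Y, over the range (-2.5, +2.5)
X, Y = rng.standard_normal((2, n, p))
grid = np.linspace(-2.5, 2.5, 51)
surface = np.array([[sigma(a * X + b * Y) for b in grid] for a in grid])
center = surface[25, 25]                   # the origin, where sigma(0) = 1
```

The resulting surface matrix can then be rendered as a perspective plot, e.g. with `persp` in R.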