Note: This is a working paper which will be expanded/updated frequently. All suggestions for improvement are welcome, and some would be really benificial. For example, I only use base R graphics, nothing more fancy, because base is all I know. The directory has a pdf version, the complete Rmd file with all code chunks, the bib file, and the R source code.

1 Theory

In Multidimensional Scaling (MDS) we minimize a loss function, called Kruskal’s stress, which is defined by (Kruskal (1964a), Kruskal (1964b)) \[\begin{equation} \sigma(Z):=\frac12\sum_{i=1}^n\sum_{j=1}^nw_{ij}(\delta_{ij}-d_{ij}(Z))^2\label{E:loss} \end{equation}\]

over the \(n\times p\) configurations \(Z\). Here the weights \(W\) and the dissimilarities \(\Delta\) are known symmetric, non-negative, and hollow matrices of numbers. The matrix \(D(Z)\) in \(\eqref{E:loss}\) has the (Euclidean) distances between the \(n\) points in the configuration. Thus if \(e_i\) are unit vectors, with zeroes everywhere, except for element \(i\), which is one, and \(A_{ij}:=(e_i-e_j)(e_i-e_j)'\), then \(d_{ij}^2(Z)=\mathbf{tr}\ ZA_{ij}Z'\).

Stress is a complicated function with potentially a large number of local minima and saddle points. In order to study the behavior of stress we shall look at configurations of the form \(Z=\alpha X+\beta Y\), with \(X\) and \(Y\) known, fixed, and centered configurations. This makes stress a function of the two variables \(\alpha\) and \(\beta\), and we can use standard contour and perspective plots to study the function. This extends earlier work by De Leeuw (1993). We assume linear independence, in the sense that \(\alpha X+\beta Y=0\) if and only if \(\alpha=\beta=0\).

First some simplications. Define the \(2\times 2\) matrices \[ V_{ij}:=\begin{bmatrix}\mathbf{tr}\ XA_{ij}X'&\mathbf{tr}\ XA_{ij}Y'\\\mathbf{tr}\ YA_{ij}X'&\mathbf{tr}\ YA_{ij}Y'\end{bmatrix}, \] and define \(\gamma\) as the vector with elements \(\alpha\) and \(\beta\). Now \[\begin{equation} \sigma(\gamma)=1-\sum_{i=1}^n\sum_{j=1}^nw_{ij}\delta_{ij}\sqrt{\gamma'V_{ij}\gamma}+\frac12\gamma'V_{\star\star}\gamma,\label{E:gamma} \end{equation}\]

where we have assumed for convenience that \[ \frac12\sum_{i=1}^n\sum_{j=1}^nw_{ij}\delta_{ij}^2=1, \] and where \[ V_{\star\star}:=\sum_{i=1}^n\sum_{j=1}^n w_{ij}V_{ij}. \] Note that if all \(w_{ij}\) are one, then \[ V_{\star\star}=2n\begin{bmatrix}X'X&X'Y\\Y'X&Y'Y\end{bmatrix}. \]

We now make a change of variables, using the Cholesky decomposition \(V_{\star\star}=S'S\), with \(S\) upper-triangular. Define \(\theta:=S\gamma\) and \(U_{ij}:=(S')^{-1}V_{ij}S^{-1}\). Then \[\begin{equation} \sigma(\theta)=1-\sum_{i=1}^n\sum_{j=1}^nw_{ij}\delta_{ij}\sqrt{\theta'U_{ij}\theta}+\frac12\theta'\theta.\label{E:theta} \end{equation}\] In MDS we usually (De Leeuw (1977)) also define the matrix-valued function \[\begin{equation} B(\theta):=\sum_{i=1}^n\sum_{j=1}^nw_{ij}\frac{\delta_{ij}}{d_{ij}(\theta)}U_{ij},\label{E:bdef} \end{equation}\] with \(d_{ij}(\theta):=\sqrt{\theta'U_{ij}\theta}\) and the function \(\rho(\theta):=\theta'B(\theta)\theta\). Definition \(\eqref{E:bdef}\) can be extended using subgradients if \(d_{ij}(\theta)=0\) for some \(i,j\) (see De Leeuw (1977)). For our purposes we can define \(B(\theta)\) more generally by omitting the terms with \(d_{ij}(\theta)=0\). The definition of \(\rho\) allows us to write \[\begin{equation} \sigma(\theta)=1-\rho(\theta)+\frac12\theta'\theta.\label{E:short} \end{equation}\]

If \(d_{ij}(\theta)>0\) for all \(i,j\) such that \(w_{ij}\delta_{ij}>0\) then stress is differentiable at \(\theta\) and \(\mathcal{D}\sigma(\theta)=\theta-B(\theta)\theta\), so that stationary points are defined as solutions of \(\theta=B(\theta)\theta\). The more general definition is \(\theta\in\partial\rho(\theta)\). De Leeuw (1984) shows that stress is always differentiable at local minima.

At a stationary point \(\theta\) is an eigenvector of \(B(\theta)\) with eigenvalue equal to one. If this is the largest eigenvalue, then \(\theta\) actually gives the global minimum of stress (De Leeuw, Groenen, and Mair (2016)). Also observe that at a stationary point \(\rho(\theta)=\theta'\theta\), and thus \(\eqref{E:short}\) implies \(\theta'\theta\leq 2\). All stationary values are within a circle or sphere with radius \(\sqrt{2}\).

The smacof algorithm (De Leeuw (1977), De Leeuw and Mair (2009)) is the iterative algorithm \[\begin{equation} \theta^{(k+1)}=B(\theta^{(k)})\theta^{(k)}. \end{equation}\]

Note that the algorithm is self-scaling, in the sense that all points on a ray through the origin have the same update. Thus \(B(\lambda\theta)(\lambda\theta)= B(\theta)\theta\) for all \(\lambda\not= 0\).

If stress is differentiable at \(\theta\) then \(\mathcal{D}^2\sigma(\theta)=I-H(\theta)\), where \[\begin{equation} H(\theta):=\sum_{i=1}^n\sum_{j=1}^nw_{ij}\frac{\delta_{ij}}{d_{ij}(\theta)}\left\{U_{ij}-\frac{U_{ij}\theta\theta'U_{ij}}{\theta'U_{ij}\theta}\right\}. \end{equation}\]

Note that \(\mathcal{D}\rho(\theta)=B(\theta)\theta\) and \(\mathcal{D}^2\rho(\theta)=H(\theta)\). Also note that \(H(\lambda\theta)=|\lambda|^{-1}H(\theta)\), and thus for any \(\theta\) there is a \(\lambda(\theta)\) such that \(\mathcal{D}^2\sigma(\lambda\theta)\) is positive definite for all \(\lambda>\lambda(\theta)\).

The smacof algorithm convergences to a solution \(\theta\) of the stationary equations, with a linear convergence rate equal to the largest eigenvalue of \(H(\theta)\) (see De Leeuw (1988)).

Note that \(\mathcal{D}^2\sigma(\theta)\theta=\theta\), which means that Newton’s method takes the form \[\begin{equation} \theta^{(k+1)}=(I-H(\theta^{(k)})^{-1}B(\theta^{(k)})\theta^{(k)}.\label{E:newton} \end{equation}\]

Note that Newton is definitely not self-scaling. Equation \(\eqref{E:newton}\) also suggest a natural safeguarded version of Newton’s method, using \(I-\eta H\), where \(0\leq\eta\leq 1\). For \(\eta=0\) this is a smacof step, for \(\eta=1\) this is a Newton step.

On the line through \(\theta\) and the origin the function \(\sigma\) is a convex quadratic. Thus at every point, except at the origin, there is at least one direction of descent, and consequently stress has only a single local maximum at \(\theta=0\), equal to one. In the differentiable case this is also clear from \(\mathcal{D}^2\sigma(\theta)\theta=\theta\), which says that the Hessian has at least one positive eigenvalue.

Because stress is an even function, with \(\sigma(\theta)=-\sigma(\theta)\) for all \(\theta\), minima come in pairs.

2 Example

Our first example has \(n=4\), all weights equal to one, and all dissimilarities equal. The same example has been analyzed by De Leeuw (1988), De Leeuw (1993), Trosset and Mathar (1997). For this example the global minimum in two dimensions has its four points in the corners of a square. That is our \(X\), which has stress 0.02859548. Our \(Y\) is another stationary point, which has three points in the corners of an equilateral triangle and the fourth point in the center of the triangle. Its stress is 0.0669873. Another way of looking at the two configurations is that \(X\) are four points equally spaced on a circle, and \(Y\) are three points equally spaced on a circle with the fourth point in the center of the circle. De Leeuw (1988) erroneously claims that \(Y\) is a non-isolated local minimum of stress, but Trosset and Mathar (1997) have shown there exists a descent direction at \(Y\), and thus \(Y\) is actually a saddle point. Of course the stationary points defined by \(X\) and \(Y\) are far from unique, because we can distribute the four points over the various corners in many ways.

The example is chosen in such a way that there are non-zero \(\alpha\) and \(\beta\) such that \(d_{12}(\alpha X+\beta Y)=0\). In fact \(d_{12}\) is the only distance that can be made zero by a non-trivial linear combination.

Note that we have used \(\sigma\) for three different functions. The first one with argument \(Z\) is defined on configuration space, the second one with argument \(\gamma\) on coefficient space, and the third one with argument \(\theta\) also on coefficient space. This is a slight abuse of notation, rather innocuous, but we have to keep it in mind.

From lemma 1we see that \(\mathcal{D}\sigma(X)=\mathcal{D}\sigma(Y)=0\) then \(\mathcal{D}\sigma(1,0)=\mathcal{D}\sigma(0,1)=0\). Thus stationary points in configuration space are preserved as stationary points in coefficient space, but the reverse implication may not be true. If \(\mathcal{D}^2\sigma(X)\) and \(\mathcal{D}^2\sigma(Y)\) are positive semi-definite, then so are \(\mathcal{D}^2\sigma(1,0)\) and \(\mathcal{D}^2\sigma(0,1)\). Thus local minima are preserved. But it is entirely possible that \(\mathcal{D}^2\sigma(X)\) and/or \(\mathcal{D}^2\sigma(Y)\) are indefinite, and that \(\mathcal{D}^2\sigma(1,0)\) and/or \(\mathcal{D}^2\sigma(0,1)\) are positive semi-definite. Thus saddle points in configuration space can be mapped into local minima in coefficient space. As we will see this actually happens with \(Y\), the equilateral triangle with center, in our example.

2.1 Global Pictures

We first make a global perspective plot, over the range \((-2.5,+2.5)\).