Abstract

We discuss two even, convex, non-negative smooth approximations of the absolute value function and apply them to construct MM algorithms for least absolute deviation regression. Both uniform and sharp quadratic majorizations are constructed. As an example we use the Boston housing data. In our example sharp quadratic majorization is typically 10-20 times as fast as uniform quadratic majorization.

**Note:** This is a working paper which will be expanded/updated frequently. All suggestions for improvement are welcome. The directory gifi.stat.ucla.edu/absapp has a pdf version, the bib file, the complete Rmd file with the code chunks, and the R source code.

Various problems in computational statistics involve absolute values. We mention \(\ell_1\) regression, with the median as a special case, unidimensional scaling, support vector machines, and regression with \(\ell_1\) penalty terms. Using absolute values can create problems for optimization algorithms used in fitting, because the absolute value function is not differentiable at zero. Smooth parametric approximations to the absolute value function have been suggested by, among many others, Ramirez et al. (2014), Voronin, Ozkaya, and Yoshida (2015), and Bagul (2017).

It seemed a useful exercise to use some of these approximations to construct majorization (nowadays MM) algorithms for various statistical techniques.

The square root approximation, discussed by Ramirez et al. (2014), is \[ f_\epsilon(x)\mathop{=}\limits^\Delta\sqrt{x^2+\epsilon^2}. \] Clearly \(f_\epsilon(x)>|x|\) for all \(x\), and the maximum of \(f_\epsilon(x)-|x|\) is \(\epsilon\), attained at zero. Also \[ \lim_{x\rightarrow\infty} f_\epsilon(x)-|x|=\lim_{x\rightarrow-\infty} f_\epsilon(x) - |x| = 0. \]

The first derivative is

\[ f'_\epsilon(x)=\frac{x}{\sqrt{x^2+\epsilon^2}}. \] This is plotted, for the same values of \(\epsilon\), in figure 2. We see that \(f'_\epsilon\) is an increasing function of \(x\), and thus \(f_\epsilon\) is convex.

The second derivative, plotted in figure 3, is

\[ f''_\epsilon(x)=\frac{1}{\sqrt{x^2+\epsilon^2}}\frac{\epsilon^2}{x^2+\epsilon^2}. \]

The second derivative is non-negative, with the horizontal axis as its asymptote. It attains its maximum, equal to \(\epsilon^{-1}\), at zero. This bound on the second derivative is the main reason for using \(\epsilon^2\) in the definition of \(f_\epsilon\).
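As a quick numerical sanity check (a sketch, not part of the original analysis), the approximation and the two bounds just mentioned can be verified in R:

```r
## f_eps(x) = sqrt(x^2 + eps^2) and its first two derivatives, as given above
feps   <- function (x, eps) sqrt (x ^ 2 + eps ^ 2)
fepsd1 <- function (x, eps) x / sqrt (x ^ 2 + eps ^ 2)
fepsd2 <- function (x, eps) eps ^ 2 / (x ^ 2 + eps ^ 2) ^ (3 / 2)

eps <- 0.1
x   <- seq (-2, 2, length.out = 1001)  # grid containing zero
max (feps (x, eps) - abs (x))          # the gap is eps, attained at zero
max (fepsd2 (x, eps))                  # the bound 1 / eps, attained at zero
```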

The following convolution approximation to \(|x|\) was suggested by Voronin, Ozkaya, and Yoshida (2015). \[\begin{equation}\label{E:g} g_\epsilon(x)\mathop{=}\limits^\Delta\frac{1}{\epsilon\sqrt{2\pi}}\int_{-\infty}^{+\infty}|x-y|\exp\{-\frac12\frac{y^2}{\epsilon^2}\}dy. \end{equation}\] Carrying out the integration gives \[ g_\epsilon(x)=x\left(2\Phi\left(\frac{x}{\epsilon}\right)-1\right)+2\epsilon\phi\left(\frac{x}{\epsilon}\right). \] Because \(g_\epsilon\) is a weighted sum (integral) of a set of convex functions it is convex. We have \(g_\epsilon(x)>|x|\), and the maximum of \(g_\epsilon(x)-|x|\) is equal to \(\epsilon\sqrt{\frac{2}{\pi}}\), attained at zero.

Voronin, Ozkaya, and Yoshida (2015) show that as \(\epsilon\rightarrow 0\) we have both uniform and \(L_1\) convergence of \(g_\epsilon\) to \(|x|\). Of course other convolution approximations with similar scale families (a.k.a. *mollifiers*), whose support concentrates at zero while the total mass stays equal to one as the scale goes to zero, can also be used.

The first derivative, plotted in figure 5, is

\[ g'_\epsilon(x)=2\Phi\left(\frac{x}{\epsilon}\right)-1. \] The second derivative \[ g''_\epsilon(x)=\frac{2}{\epsilon}\phi\left(\frac{x}{\epsilon}\right) \] is positive and unimodal, with a maximum equal to \(\frac{1}{\epsilon}\sqrt{\frac{2}{\pi}}\), attained at zero.
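The same kind of sanity check (again a sketch, not part of the original analysis) can be done for the convolution approximation, using pnorm and dnorm for \(\Phi\) and \(\phi\):

```r
## g_eps in closed form, using the normal cdf (pnorm) and density (dnorm)
geps   <- function (x, eps) x * (2 * pnorm (x / eps) - 1) + 2 * eps * dnorm (x / eps)
gepsd2 <- function (x, eps) (2 / eps) * dnorm (x / eps)

eps <- 0.1
x   <- seq (-2, 2, length.out = 1001)  # grid containing zero
max (geps (x, eps) - abs (x))          # the gap is eps * sqrt(2 / pi), at zero
max (gepsd2 (x, eps))                  # the bound sqrt(2 / pi) / eps, at zero
```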

Consider the problem of minimizing \[\begin{equation}\label{E:l1loss} \sigma(\beta)=\sum_{i=1}^nw_i|y_i-x_i'\beta| \end{equation}\] This problem has been studied in statistics for many, many years. See, for example, Dielman (2005) or Farebrother (2013). Our contribution to the computational aspects of the problem is modest, because we are mainly interested in its connection to majorization theory.

Let’s start with the square root approximation \(f_\epsilon\). The first majorization we construct is based on the AM/GM inequality. Because for each \(\beta,\tilde\beta\) we have \[ \sqrt{\{(y_i-x_i'\beta)^2+\epsilon^2\}\{(y_i-x_i'\tilde\beta)^2+\epsilon^2\}} \leq\frac12\left[\{(y_i-x_i'\beta)^2+\epsilon^2\}+\{(y_i-x_i'\tilde\beta)^2+\epsilon^2\}\right], \] it follows that \[ \sum_{i=1}^nw_if_\epsilon(y_i-x_i'\beta)\leq\frac12\sum_{i=1}^nw_i\frac{1}{\sqrt{(y_i-x_i'\tilde\beta)^2+\epsilon^2}}\{(y_i-x_i'\beta)^2+(y_i-x_i'\tilde\beta)^2+2\epsilon^2\}. \] Thus if \(V(\beta)\) is the diagonal matrix with diagonal elements \[ v_{ii}(\beta)=\frac{1}{\sqrt{(y_i-x_i'\beta)^2+\epsilon^2}} \] then majorization leads to an iteratively reweighted least squares algorithm. The update formula is \[ \beta^{(k+1)}=(X'WV(\beta^{(k)})X)^{-1}X'WV(\beta^{(k)})y. \]
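As a minimal sketch of this update (with simulated toy data, not the Boston data; the full routine l1fn2() is listed at the end of this paper), one iteratively reweighted least squares step is:

```r
## One step of the IRLS update from the AM/GM majorization:
## beta_new = (X'WVX)^{-1} X'WV y, where V has diagonal elements
## 1 / sqrt((y_i - x_i'beta)^2 + eps^2). Toy data for illustration only.
set.seed (1)
x     <- cbind (1, rnorm (20))
y     <- drop (x %*% c(1, 2)) + rnorm (20)
w     <- rep (1, 20)
eps   <- 1e-2
beta0 <- lm.wfit (x, y, w)$coefficients        # least squares start
v     <- 1 / sqrt ((y - drop (x %*% beta0)) ^ 2 + eps ^ 2)
beta1 <- lm.wfit (x, y, w * v)$coefficients    # one majorization step
## the approximating loss sum(w * sqrt(r^2 + eps^2)) cannot increase
```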

We apply this to the Boston housing data from the UCI repository (Newman et al. (1998)), loaded from the mlbench package (Leisch and Dimitriadou (2010)).

```
data("BostonHousing")
y <- BostonHousing$medv
z <- BostonHousing[, 1:13]
x <- cbind (1, apply(z, 2, as.numeric))
w <- rep (1, length(y))
```

We use the R function l1fn2(), listed at the end of this paper.

`h1 <- l1fn2 (x, y, aeps = 1e-2, itmax = 10000)`

We have convergence of function values, to a change of less than \(10^{-10}\), in 530 iterations. The output below gives both \(\sigma_\epsilon\), the approximating loss function that we minimize, and \(\sigma\) from \(\eqref{E:l1loss}\), the actual \(\ell_1\) loss function that we approximate.
```
## eps: 1.0e-03 itel: 530 sigma_e: 1559.812228 sigma: 1559.709732
## beta:
## 14.633179 -0.144086 0.036871 0.019540 1.278130 -8.961015 5.324724
## -0.030748 -1.035830 0.183490 -0.010219 -0.728994 0.011279 -0.300423
```

The second majorization uses an upper bound for the second derivative, as in Voss and Eckhardt (1980), Böhning and Lindsay (1988), Lange (2016), p. 100, or De Leeuw (2016), chapter 11. It can be applied both to our square root \(f_\epsilon\) and convolution \(g_\epsilon\) approximations. First the square root. Because

\[ f_\epsilon(x)\leq f_\epsilon(y)+\frac{y}{\sqrt{y^2+\epsilon^2}}(x-y)+\frac{1}{2\epsilon}(x-y)^2 \] we have \[ \sum_{i=1}^nw_if_\epsilon(y_i-x_i'\beta)\leq\sum_{i=1}^nw_if_\epsilon(y_i-x_i'\tilde\beta)-\sum_{i=1}^nw_i\left\{\frac{y_i-x_i'\tilde\beta}{\sqrt{(y_i-x_i'\tilde\beta)^2+\epsilon^2}}\right\}x_i'(\beta-\tilde\beta)+\frac{1}{2\epsilon}\sum_{i=1}^nw_i(x_i'(\beta-\tilde\beta))^2, \] which gives the majorization algorithm \[ \beta^{(k+1)}=(X'WX)^{-1}X'W(X\beta^{(k)}+\epsilon F'_\epsilon(\beta^{(k)})), \] where \(F'_\epsilon(\beta)\) is a vector with elements \[ \mathcal{D}f_\epsilon(y_i-x_i'\beta)=\frac{y_i-x_i'\beta}{\sqrt{(y_i-x_i'\beta)^2+\epsilon^2}}. \] The corresponding expression for the convolution approximation \(g_\epsilon\) is \[ \sum_{i=1}^nw_ig_\epsilon(y_i-x_i'\beta)\leq\sum_{i=1}^nw_ig_\epsilon(y_i-x_i'\tilde\beta)-\sum_{i=1}^nw_i\left\{2\Phi(\frac{y_i-x_i'\tilde\beta}{\epsilon})-1\right\}x_i'(\beta-\tilde\beta)+\frac12\frac{1}{\epsilon}\sqrt{\frac{2}{\pi}}\sum_{i=1}^nw_i(x_i'(\beta-\tilde\beta))^2. \] The majorization algorithm is \[ \beta^{(k+1)}=(X'WX)^{-1}X'W(X\beta^{(k)}+\epsilon\sqrt{\frac{\pi}{2}}G'_\epsilon(\beta^{(k)})), \] where \(G'_\epsilon(\beta)\) is a vector with elements \(2\Phi\left(\frac{y_i-x_i'\beta}{\epsilon}\right)-1\). Given the shape of the second derivatives of the approximating functions, we expect this algorithm to converge slowly. The uniform upper bound is only good very close to the origin, and horribly bad everywhere else. The update function only makes a very small move in the descent direction. We illustrate the slow convergence with two analyses of the Boston housing data, one using the square root and one using the convolution approximation, using the R routines l1fu2() and l1gu2().
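One step of the uniform-bound update for \(f_\epsilon\) can be sketched as follows (again with simulated toy data, not the Boston data; the full routine l1fu2() is listed at the end of this paper):

```r
## One step of the uniform quadratic majorization for f_eps:
## beta_new = (X'WX)^{-1} X'W (X beta + eps * F'(beta)), i.e. a weighted
## least squares fit to the working response. Toy data for illustration.
set.seed (1)
x     <- cbind (1, rnorm (20))
y     <- drop (x %*% c(1, 2)) + rnorm (20)
w     <- rep (1, 20)
eps   <- 1e-2
beta0 <- lm.wfit (x, y, w)$coefficients
r     <- y - drop (x %*% beta0)
fp    <- r / sqrt (r ^ 2 + eps ^ 2)    # elements of F'(beta0)
beta1 <- lm.wfit (x, drop (x %*% beta0) + eps * fp, w)$coefficients
```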

For the \(f_\epsilon\) approximation we find

`h2 <- l1fu2 (x, y, aeps = 1e-2, itmax= 100000)`

```
## eps: 1.0e-03 itel: 31800 sigma_e: 1559.812229 sigma: 1559.709719
## beta:
## 14.636104 -0.144089 0.036873 0.019553 1.278381 -8.963905 5.324655
## -0.030749 -1.035922 0.183485 -0.010218 -0.729064 0.011278 -0.300400
```

We see that convergence to our default precision requires 31800 iterations. The results for the convolution approximation are similar.

`h3 <- l1gu2 (x, y, aeps = 1e-2, itmax= 100000)`

```
## eps: 1.0e-02 itel: 16848 sigma_e: 1559.744234 sigma: 1559.708984
## beta:
## 14.780373 -0.144247 0.037000 0.020306 1.292046 -9.124745 5.323192
## -0.030801 -1.041139 0.183103 -0.010149 -0.732826 0.011262 -0.299007
```

Suppose \(\eta\) is a majorization scheme for \(\tau\), i.e. \(\eta:\mathbb{R}^n\otimes\mathbb{R}^n\rightarrow\mathbb{R}\) and \(\tau:\mathbb{R}^n\rightarrow\mathbb{R}\) such that \(\eta(x,y)\geq\tau(x)\) for all \(x,y\) and \(\eta(x,x)=\tau(x)\) for all \(x\). Also assume sufficient differentiability and uniqueness of the minima in majorization iterations. De Leeuw (1994) (also see De Leeuw (2016)) shows that the convergence rate of the majorization algorithm is the largest eigenvalue of the matrix \[ I-\{\mathcal{D}_{11}\eta(x,x)\}^{-1}\mathcal{D}^2\tau(x)=-\{\mathcal{D}_{11}\eta(x,x)\}^{-1}\mathcal{D}_{12}\eta(x,x), \] evaluated at the fixed point \(x\). This is the Jacobian of the update function, which maps \(x^{(k)}\) to its successor \(x^{(k+1)}\).

Note that in our majorizations \(\eta\) is quadratic in \(x\). Thus the second derivative \(\mathcal{D}_{11}\eta\) does not depend on \(x\), and is simply the matrix of the quadratic form.

If \(\sigma_\epsilon^f(\beta)\mathop{=}\limits^\Delta\sum_{i=1}^n w_if_\epsilon(y_i-x_i'\beta)\) then \[ \mathcal{D}^2\sigma_\epsilon^f(\beta)=\epsilon^2\sum_{i=1}^n w_i\left\{(y_i-x_i'\beta)^2+\epsilon^2\right\}^{-\frac32}x_i^{\ }x_i' \] If \(\sigma_\epsilon^g(\beta)\mathop{=}\limits^\Delta\sum_{i=1}^n w_ig_\epsilon(y_i-x_i'\beta)\) then \[ \mathcal{D}^2\sigma_\epsilon^g(\beta)=\frac{2}{\epsilon}\sum_{i=1}^n w_i\phi(\frac{y_i-x_i'\beta}{\epsilon})x_i^{\ }x_i' \]
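As a check on the first of these expressions (a sketch with simulated toy data and an assumed finite difference step), the Hessian of \(\sigma_\epsilon^f\) can be compared with a central difference approximation:

```r
## Compare the closed-form Hessian of sigma_eps^f with a central
## difference approximation. Toy data and step h are illustrative choices.
set.seed (1)
x    <- cbind (1, rnorm (10))
y    <- drop (x %*% c(1, 2)) + rnorm (10)
w    <- rep (1, 10)
eps  <- 0.5
beta <- c(1, 2)
sig  <- function (b) sum (w * sqrt ((y - drop (x %*% b)) ^ 2 + eps ^ 2))
r    <- y - drop (x %*% beta)
hess <- crossprod (x, (eps ^ 2 * w / (r ^ 2 + eps ^ 2) ^ (3 / 2)) * x)
h    <- 1e-4
num  <- matrix (0, 2, 2)
for (i in 1:2) for (j in 1:2) {
  ei <- h * (1:2 == i)
  ej <- h * (1:2 == j)
  num[i, j] <- (sig (beta + ei + ej) - sig (beta + ei - ej) -
                sig (beta - ei + ej) + sig (beta - ei - ej)) / (4 * h ^ 2)
}
max (abs (hess - num))   # small: the two Hessians agree
```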

The function computeThatRate() gives the theoretical convergence rates at the \(\ell_1\) solution for the Boston housing data for various values of \(\epsilon\) and for all four quadratic majorizations (uniform-f, uniform-g, non-uniform-f, and non-uniform-g).

```
epss <- c (5, 2, 1, .5, .1, .01, .005)
beta <- h1$beta
computeThatRate (x, y, w, beta, epss)
```

```
## [1] 5.0000000000 0.6576324826 0.5723976134 0.4279092359 0.3873481792
## [1] 2.0000000000 0.8225953967 0.7835762713 0.5415263594 0.5286119029
## [1] 1.0000000000 0.8963063025 0.8710636781 0.6125974817 0.5956478849
## [1] 0.5000000000 0.9444461085 0.9288920175 0.7051830266 0.6940155591
## [1] 0.1000000000 0.9889511967 0.9881274205 0.8800781159 0.9004846261
## [1] 0.0100000000 0.9998468914 0.9999785172 0.9872466313 0.9985891937
## [1] 0.0050000000 0.9999785958 0.9999999998 0.9964372125 0.9999999747
```

From the output we see that the convergence rate tends to one as \(\epsilon\downarrow 0\). Thus asymptotically our quadratic majorization algorithms will have a sublinear convergence rate.

As we have already seen in our example, the non-uniform methods converge much faster than the uniform ones. Throughout this report we have chosen conventional values for \(\epsilon\), without taking into account the scale of the data or the closeness of the fit. This is probably not a good idea, and especially in this large example our conventional \(\epsilon\)’s could very well be too small.

We check this more closely by using non-uniform majorization for seven different values of \(\epsilon\). The results are below. Epsilon is in the first column, followed by the number of iterations, the value of the approximating function at its minimum, and the value of the least absolute deviation function itself.

`## [1] 5.000000 15.000000 3212.724906 1580.815597`

`## [1] 2.000000 25.000000 2027.412038 1565.003116`

`## [1] 1.000000 32.000000 1725.433167 1561.816194`

`## [1] 0.500000 35.000000 1615.242147 1560.740384`

`## [1] 0.100000 89.000000 1563.678895 1559.955323`

`## [1] 0.050000 131.000000 1561.000966 1559.831086`

`## [1] 0.010000 530.000000 1559.812228 1559.709732`

For larger values of \(\epsilon\) the values of \(\sigma(\beta)\) and \(\sigma_\epsilon^f(\beta)\) are very different. But the value of \(\sigma(\beta)\) is already quite close to its minimum value, and that is of course what we are interested in. This is illustrated also in figure 7, where the seven solutions for \(\beta\) are plotted. Only the first one, for \(\epsilon\) equal to 5, is visibly different; the rest are virtually indistinguishable. Thus if you want 10 decimals of precision in your regression analysis then by all means choose \(\epsilon\) small. But seeing a regression analysis with 10 decimals of precision tells us more about the data analyst than it tells us about the data.

For completeness we repeat the analysis for the convolution approximation. The results are very much the same.

`## [1] 5.000000 13.000000 2660.097037 1589.022400`

`## [1] 2.000000 22.000000 1815.279469 1565.920065`

`## [1] 1.000000 27.000000 1634.626462 1561.803819`

`## [1] 0.500000 32.000000 1580.827133 1560.899992`

`## [1] 0.100000 110.000000 1560.955511 1560.006532`

`## [1] 0.050000 123.000000 1560.122239 1559.837335`

`## [1] 0.010000 335.000000 1559.744234 1559.708994`

In future reports we may consider applications to nonlinear \(\ell_1\), to the lasso, to support vector machines, and to unidimensional scaling.

```
l1fu2 <- function (x, y, w = rep(1, length(y)), bold = NULL,
                   aeps = 1e-3, eps = 1e-10, itmax = 1000,
                   verbose = FALSE) {
  ffunc <- function (x, y, b, aeps) {
    z <- y - drop(x %*% b)
    return (sqrt (z ^ 2 + aeps ^ 2))
  }
  fgrad <- function (x, y, b, aeps) {
    z <- y - drop(x %*% b)
    return (z / sqrt(z ^ 2 + aeps ^ 2))
  }
  if (is.null(bold)) {
    bold <- lm.wfit (x, y, w)$coefficients
  }
  zold <- ffunc (x, y, bold, aeps)
  fold <- sum (w * zold)
  vold <- fgrad (x, y, bold, aeps)
  itel <- 1
  repeat {
    bnew <- lm.wfit (x, drop(x %*% bold) + aeps * vold, w)$coefficients
    znew <- ffunc (x, y, bnew, aeps)
    fnew <- sum (w * znew)
    vnew <- fgrad (x, y, bnew, aeps)
    if (verbose)
      cat("Iteration: ", formatC(itel, width = 3, format = "d"),
          "old: ",  formatC(fold, digits = 8, width = 12, format = "f"),
          "fnew: ", formatC(fnew, digits = 8, width = 12, format = "f"),
          "\n")
    if (((fold - fnew) < eps) || (itel == itmax))
      break
    vold <- vnew
    fold <- fnew
    bold <- bnew
    itel <- itel + 1
  }
  ff <- sum (w * abs (y - drop (x %*% bnew)))
  return(list(itel = itel, f = fnew, ff = ff, beta = bnew))
}
```

```
l1fn2 <- function (x, y, w = rep(1, length(y)), bold = NULL,
                   aeps = 1e-3, eps = 1e-10, itmax = 1000,
                   verbose = FALSE) {
  if (is.null(bold)) {
    bold <- lm.wfit (x, y, w)$coefficients
  }
  zold <- sqrt ((y - drop (x %*% bold)) ^ 2 + aeps ^ 2)
  fold <- sum (w * zold)
  vold <- w / zold
  itel <- 1
  repeat {
    bnew <- lm.wfit (x, y, vold)$coefficients
    znew <- sqrt ((y - drop (x %*% bnew)) ^ 2 + aeps ^ 2)
    fnew <- sum (w * znew)
    vnew <- w / znew
    if (verbose)
      cat("Iteration: ", formatC(itel, width = 3, format = "d"),
          "old: ",  formatC(fold, digits = 8, width = 12, format = "f"),
          "fnew: ", formatC(fnew, digits = 8, width = 12, format = "f"),
          "\n")
    if (((fold - fnew) < eps) || (itel == itmax))
      break
    vold <- vnew
    fold <- fnew
    itel <- itel + 1
  }
  ff <- sum (w * abs (y - drop (x %*% bnew)))
  return(list(itel = itel, f = fnew, ff = ff, beta = bnew))
}
```

```
l1gu2 <- function (x, y, w = rep(1, length(y)), bold = NULL,
                   aeps = 1e-3, eps = 1e-10, itmax = 1000,
                   verbose = FALSE) {
  cfunc <- function (x, y, b, aeps) {
    z <- y - drop (x %*% b)
    return (z * (2 * pnorm(z / aeps) - 1) + 2 * aeps * dnorm(z / aeps))
  }
  cgrad <- function (x, y, b, aeps) {
    z <- y - drop(x %*% b)
    return (2 * pnorm(z / aeps) - 1)
  }
  bnd <- aeps * sqrt (pi / 2)
  if (is.null(bold)) {
    bold <- lm.wfit (x, y, w)$coefficients
  }
  zold <- cfunc (x, y, bold, aeps)
  fold <- sum (w * zold)
  vold <- cgrad (x, y, bold, aeps)
  itel <- 1
  repeat {
    bnew <- lm.wfit (x, drop(x %*% bold) + bnd * vold, w)$coefficients
    znew <- cfunc (x, y, bnew, aeps)
    fnew <- sum (w * znew)
    vnew <- cgrad (x, y, bnew, aeps)
    if (verbose)
      cat("Iteration: ", formatC(itel, width = 3, format = "d"),
          "old: ",  formatC(fold, digits = 8, width = 12, format = "f"),
          "fnew: ", formatC(fnew, digits = 8, width = 12, format = "f"),
          "\n")
    if (((fold - fnew) < eps) || (itel == itmax))
      break
    vold <- vnew
    fold <- fnew
    bold <- bnew
    itel <- itel + 1
  }
  ff <- sum (w * abs (y - drop (x %*% bnew)))
  return(list(itel = itel, f = fnew, ff = ff, beta = bnew))
}
```

```
l1gn2 <- function (x, y, w = rep(1, length(y)), bold = NULL,
                   aeps = 1e-3, eps = 1e-10, itmax = 1000,
                   verbose = FALSE) {
  cfunc <- function (x, y, b, aeps) {
    z <- y - drop (x %*% b)
    return (z * (2 * pnorm(z / aeps) - 1) + 2 * aeps * dnorm(z / aeps))
  }
  cgrad <- function (x, y, b, aeps) {
    z <- y - drop(x %*% b)
    return (2 * pnorm(z / aeps) - 1)
  }
  if (is.null(bold)) {
    bold <- lm.wfit (x, y, w)$coefficients
  }
  zold <- cfunc (x, y, bold, aeps)
  fold <- sum (w * zold)
  vold <- cgrad (x, y, bold, aeps) / (y - drop(x %*% bold))
  itel <- 1
  repeat {
    bnew <- lm.wfit (x, y, w * vold)$coefficients
    znew <- cfunc (x, y, bnew, aeps)
    fnew <- sum (w * znew)
    vnew <- cgrad (x, y, bnew, aeps) / (y - drop(x %*% bnew))
    if (verbose)
      cat("Iteration: ", formatC(itel, width = 3, format = "d"),
          "old: ",  formatC(fold, digits = 8, width = 12, format = "f"),
          "fnew: ", formatC(fnew, digits = 8, width = 12, format = "f"),
          "\n")
    if (((fold - fnew) < eps) || (itel == itmax))
      break
    vold <- vnew
    fold <- fnew
    bold <- bnew
    itel <- itel + 1
  }
  ff <- sum (w * abs (y - drop (x %*% bnew)))
  return(list(itel = itel, f = fnew, ff = ff, beta = bnew))
}
```

```
computeThatRate <- function (x, y, w, beta, epss) {
  for (eps in epss) {
    r <- y - drop (x %*% beta)
    v <- 1 / ((r ^ 2 + eps ^ 2) ^ (3 / 2))
    d2_tau_f <- crossprod (x, v * w * x) * (eps ^ 2)
    v <- dnorm (r / eps)
    d2_tau_g <- (2 / eps) * crossprod (x, v * w * x)
    d2_eta_f_uq <- crossprod (x, w * x) / eps
    d2_eta_g_uq <- sqrt (2 / pi) * crossprod (x, w * x) / eps
    v <- 1 / sqrt (r ^ 2 + eps ^ 2)
    d2_eta_f_nq <- crossprod (x, w * v * x)
    v <- (2 * pnorm (r / eps) - 1) / r
    d2_eta_g_nq <- crossprod (x, w * v * x)
    r_f_uq <- 1 - min (eigen (solve (d2_eta_f_uq, d2_tau_f))$values)
    r_g_uq <- 1 - min (eigen (solve (d2_eta_g_uq, d2_tau_g))$values)
    r_f_nq <- 1 - min (eigen (solve (d2_eta_f_nq, d2_tau_f))$values)
    r_g_nq <- 1 - min (eigen (solve (d2_eta_g_nq, d2_tau_g))$values)
    print (c(eps, r_f_uq, r_g_uq, r_f_nq, r_g_nq))
  }
}
```

Bagul, Y.J. 2017. “A Smooth Transcendental Approximation to |x|.” *International Journal of Mathematical Sciences and Engineering Applications* 11: 213–17.

Böhning, D., and B.G. Lindsay. 1988. “Monotonicity of Quadratic-approximation Algorithms.” *Annals of the Institute of Statistical Mathematics* 40 (4): 641–63.

De Leeuw, J. 1994. “Block Relaxation Algorithms in Statistics.” In *Information Systems and Data Analysis*, edited by H.H. Bock, W. Lenski, and M.M. Richter, 308–24. Berlin: Springer Verlag. http://www.stat.ucla.edu/~deleeuw/janspubs/1994/chapters/deleeuw_C_94c.pdf.

———. 2016. *Block Relaxation Methods in Statistics*. Bookdown. https://bookdown.org/jandeleeuw6/bras/.

De Leeuw, J., and K. Lange. 2009. “Sharp Quadratic Majorization in One Dimension.” *Computational Statistics and Data Analysis* 53: 2471–84. http://www.stat.ucla.edu/~deleeuw/janspubs/2009/articles/deleeuw_lange_A_09.pdf.

Dielman, T.E. 2005. “Least Absolute Value Regression: Recent Contributions.” *Journal of Statistical Computation and Simulation* 75 (4): 263–86.

Farebrother, R. W. 2013. *\(L_1\)-Norm and \(L_\infty\)-Norm Estimation*. Springer Briefs in Statistics. Springer Verlag.

Lange, K. 2016. *MM Optimization Algorithms*. SIAM.

Leisch, F., and F. Dimitriadou. 2010. *mlbench: Machine Learning Benchmark Problems*.

Newman, D.J., S. Hettich, C.L. Blake, and C.J. Merz. 1998. “UCI Repository of machine learning databases.” University of California, Irvine, Dept. of Information and Computer Sciences. http://www.ics.uci.edu/~mlearn/MLRepository.html.

Ramirez, C., R. Sanchez, V. Kreinovich, and M. Argaez. 2014. “\(\sqrt{x^2+\mu}\) is the Most Computationally Efficient Smooth Approximation to |x|.” *Journal of Uncertain Systems* 8: 205–10.

Voronin, S., G. Ozkaya, and Y. Yoshida. 2015. “Convolution Based Smooth Approximations to the Absolute Value Function with Application to Non-smooth Regularization.” https://arxiv.org/abs/1408.6795.

Voss, H., and U. Eckhardt. 1980. “Linear Convergence of Generalized Weiszfeld’s Method.” *Computing* 25: 243–51.