---
title: "Variety in the Implementation of Nonlinear Least Squares Program Codes"
author:
- John C Nash, retired professor, University of Ottawa
date: "16/02/2021"
output:
  pdf_document:
    keep_tex: false
    toc: true
bibliography: ImproveNLS.bib
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
## ??require(bookdown) # language engine to display text??
```
# Abstract
There are many ways to structure a Gauss-Newton style nonlinear least squares
program code. In organizing and documenting the nearly half-century of programs
in the Nashlib collection associated with @jncnm79, the author realized that this
variety could be an instructive subject for software designers.
# Underlying algorithms
## Gauss Newton method
An overview of the method is provided by Wikipedia:
\url{https://en.wikipedia.org/wiki/Gauss%E2%80%93Newton_algorithm}
The method was described by C. F. Gauss in 1809 in *Theoria motus corporum
coelestium in sectionibus conicis solem ambientum*.
In calculus we learn that a stationary point (local maximum, minimum
or saddle point) of a function $f$ occurs where its gradient (or first
derivative) is zero. In multiple dimensions, this means that every element
of the gradient vector $g$ is zero. The so-called Newton method
uses the second derivatives in the Hessian matrix $H$ defined as
$$ H_{i,j} = \partial^2{f}/ {\partial{x_i} \partial{x_j}}$$
where $f$ is a function of several variables $x_i$, where $i=1,...,n$, written
collectively as $x$. The Newton equations are defined as
$$ H \delta = - g $$
where $H$ and $g$ are evaluated at a guess or estimate of $x$, which is then updated
to
$$ x \leftarrow x + \delta $$
and we iterate until there is no change in $x$.
There are some adjustments when our function $f(x)$ can be written as a sum of squares
$$ f(x) = \sum_{i=1}^n {r_i(x)^2} = r^t r $$
The final vector product form is the one I favour because it is the least fussy to write and read.
In particular, when we try to evaluate the Hessian of this sum of squares function, we see that it is
$$ H_{j,k} = 2 \sum_{i=1}^n \left[ (\partial r_i/\partial{x_j})\,(\partial r_i/\partial{x_k}) + r_i \,\partial^2 r_i/(\partial{x_j} \partial{x_k}) \right] $$
If we define the **Jacobian** matrix
$$ J_{i,j} = \partial r_i/\partial{x_j} $$
then the gradient is
$$ g = 2 J^t r $$
and the first part of the Hessian is
$$ J^t J $$
Arguing that the residuals should be "small", Gauss proposed that the
Newton equations could be approximated by ignoring
the second term (the part involving the residuals times the second derivatives
of the residuals). This gives the **Gauss-Newton**
equations (cancelling the factor 2)
$$ (J^t J) \delta = - J^t r $$
We can use this in an iteration similar to the Newton iteration. Note that these
equations are equivalent to the linear least squares problem that can be written
$$ J \delta \approx - r $$
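As a minimal numerical illustration (with made-up data and starting values, not an example from the text), the Gauss-Newton step can be obtained either from the normal equations or from the equivalent linear least squares problem:
```{r gnstep, echo=TRUE}
# A Gauss-Newton step for the model y ~ a*exp(b*t), with r = y - a*exp(b*t).
# Illustrative data and starting values only.
set.seed(1)
t <- 1:8
y <- 2.5 * exp(0.3 * t) + rnorm(8, sd = 0.1)
x <- c(a = 1, b = 0.1)                              # current parameter estimates
r <- y - x[["a"]] * exp(x[["b"]] * t)               # residuals
J <- cbind(a = -exp(x[["b"]] * t),                  # dr/da
           b = -x[["a"]] * t * exp(x[["b"]] * t))   # dr/db
delta1 <- solve(crossprod(J), -crossprod(J, r))     # (J^t J) delta = -J^t r
delta2 <- qr.coef(qr(J), -r)                        # J delta ~ -r  (least squares)
cbind(delta1, delta2)                               # the two forms agree
```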
## Hartley's method
Conditions for convergence of the Newton iteration and the Gauss-Newton
iteration are rather a nuisance to verify, and in any case we want methods
that solve the original problem generally, not only those problems whose
special features happen to allow the default Gauss-Newton iteration to succeed.
@Hartley61 proposed that rather than use an iteration
$$ x \leftarrow x + \delta$$
one could do a search along the direction $\delta$. Ideally we would want to minimize
$$ f(x + \alpha \delta) $$
with respect to the (scalar) parameter $\alpha$. However, typically a value of $\alpha$ that
permits a reduction in the sum of squares $f(x)$ is accepted and the iteration repeated.
That is, if we have a steplength quantity $fac$, we use
$$ x_{new} = x + fac * \delta $$
Clearly there are many tactical choices that give rise to a variety of particular algorithms.
One concern is that the direction $\delta$ may give no lower value of the sum of squares, since
there is an approximation involved in using $J^t J$ rather than $H$.
In `nls.c`, the Gauss-Newton direction is computed (`incr` -- see nls-flowchart.txt)
and a step-halving "line search" is used. This is a form of the method of @Hartley61,
though Hartley recommends a quadratic (parabolic) line search in which two "new" points
are evaluated, followed by a third try at the suggested minimum of the parabola fitted
through the existing and two new sum-of-squares values.
In `nls.c` we also see that once a satisfactory "new" set of parameters has been
obtained, we reset the $fac$ value to the minimum of $2 * fac$ and 1.0. Oddly, there
is no provision for the user to set the tactical choices of the maximum, the reduction
or the increase in $fac$, despite our experience that other choices can be advantageous
for some problems. Moreover, $x_{new}$ is deemed satisfactory if its *deviance* (essentially
the weighted sum of squares of the residuals) is **no greater than** the current best
value. Our experience is that it may be better to insist on a decrease in the value,
or even to use an "acceptable point" approach that sets the size of the decrease using
a measure based on the gradient of the deviance projected on $\delta$.
<!-- ?? Should we explain more. Possibly incorporate into code? -->
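A sketch of the step-halving tactic in R might look like the following. This is an illustration only, not the `nls.c` source; `devfn()` is an assumed function returning the deviance at a set of parameters, and `step_halve()` is a hypothetical name.
```{r stephalve}
# Step-halving along a search direction incr, as described above; a sketch,
# not the nls.c source. devfn() is an assumed deviance function.
step_halve <- function(x, incr, devfn, minFactor = 1/1024) {
  dev0 <- devfn(x)
  fac <- 1
  repeat {
    xnew <- x + fac * incr
    if (devfn(xnew) <= dev0) return(list(x = xnew, fac = fac))
    fac <- fac / 2
    if (fac < minFactor) stop("step factor reduced below 'minFactor'")
  }
}
```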
## Marquardt stabilized Gauss-Newton
The method of @Marquardt1963 is perhaps the most important development in nonlinear least squares
apart from the Gauss-Newton method itself. There are several ways to view the method.
First, considering that the $J^t J$ may be effectively singular in the computational
environment at hand, the Gauss-Newton method may be unable to compute a search
step that reduces the sum of squared residuals. The right hand side of the
normal equations, namely $- J^t r$, is (apart from a factor of 2) the negative gradient $-g$ of the sum of squares.
Thus a solution of
<!-- Note mathbf use below -->
$$ \mathbf{1}_n \delta = - g $$
is clearly a step in the downhill (negative gradient) direction. And if we solve
$$ \lambda \mathbf{1}_n \delta = - g $$
various values of $\lambda$ will produce steps along the **steepest descents**
direction. Solutions of the Levenberg-Marquardt equations
$$ (J^t J + \lambda \mathbf{1}_n) \delta = A \delta = - J^t r $$
can be thought of as yielding a step $\delta$ that merges the Gauss-Newton and
steepest-descents direction. This approach was actually first suggested by
@Levenberg1944, but it is my opinion that while the ideas are sound, Levenberg
very likely never tested them in practice. Before computers were commonly
available, this was not unusual, and many papers were written with computational
ideas that were not tested or were given only cursory trial.
Note that we can define a matrix $K$ which is $J$ augmented below by a suitable diagonal
matrix, namely $\sqrt{\lambda}$ times the unit matrix $\mathbf{1}_n$. Then the
Levenberg-Marquardt equations are equivalent to
$$ K \delta \approx - r_{augmented} $$
where we have added $n$ zeros to $r$ to make $r_{augmented}$. Similar adjustments
are used for scaled Marquardt methods mentioned below.
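A small numerical check (with arbitrary data) that this augmented least squares form reproduces the Levenberg-Marquardt equations:
```{r lmcheck, echo=TRUE}
# Check that the augmented least squares form reproduces the
# Levenberg-Marquardt normal equations (arbitrary illustrative data).
set.seed(1)
m <- 6; n <- 3
J <- matrix(rnorm(m * n), m, n)
r <- rnorm(m)
lambda <- 0.1
K <- rbind(J, sqrt(lambda) * diag(n))    # J augmented by sqrt(lambda) * I_n
r_aug <- c(r, rep(0, n))                 # r augmented by n zeros
delta_ls <- qr.coef(qr(K), -r_aug)       # solve K delta ~ -r_aug
delta_ne <- solve(crossprod(J) + lambda * diag(n), -crossprod(J, r))
all.equal(as.numeric(delta_ls), as.numeric(delta_ne))
```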
Practical Marquardt methods require us to specify an initial $\lambda$ and ways
to adjust its value. For example, if $$SS(x) = r^t r = f(x)$$ is the function to
minimize to "fit" our nonlinear model:
- set an initial $\lambda$ of 0.0001
- **A** Compute $J$ and $g$
- **B** Set up the Marquardt equations with the current $\lambda$, solve for $\delta$, and compute a new set of parameters $$x_{new} = x + \delta $$
- if $x_{new} \ne x$, compute $SS(x_{new})$, else quit (**termination**)
- if $SS(x_{new}) < SS(x)$, replace $x$ with $x_{new}$, replace $\lambda$
with $0.4*\lambda$, then retry from **A**
- ELSE increase $\lambda$ to $10*\lambda$ and retry from **B** ($J$ and $g$ do
NOT need to be recomputed)
In this scheme, an algorithm is defined by the initial value, decrease factor
(e.g., 0.4) and increase factor (e.g., 10) for $\lambda$, but there are a number of
annoying computational
details concerning underflow or overflow and how we measure equivalence of
iterates. That is, we need to ensure $\lambda$ is not so "small" that multiplying
by 10 is meaningless, and we need to know what "not equal" means for floating
point vectors in the arithmetic in use.
An example of choices to deal with these details:
- if $\lambda$ is smaller than some pre-set tolerance, set it to the tolerance.
(Otherwise do not worry if it is tiny, as long as the sum of squares still gets
reduced.)
- when comparing $x_{new}$ and $x$, we can use a modest number, e.g., 100.0, which
we will call $OFFSET$. In the arithmetic of the system at hand, if the floating-point
number $(x_{new} + OFFSET)$ matches bit for bit $(x + OFFSET)$, then we take the
unmodified numbers as equal.
Many practitioners dislike using IF statements comparing floating point
numbers, and I have had a number of exchanges with other workers about this over the years.
However, I have never seen any cases where this approach caused difficulties.
When $x$ is tiny, we compare $OFFSET$ to
itself and declare equality. When $x$ is very large, $OFFSET$ does not change it and
we compare $x$ with a value that is identical to it. Moreover, we use very little
extra code.
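Putting the scheme and safeguards together, a minimal sketch in R might look like the following. This is an illustration only, not the code of `nlsr` or any other package; `resfn()` and `jacfn()` are assumed user-supplied functions returning the residual vector and the Jacobian, and the unit-matrix stabilization is used.
```{r marqsketch}
# A bare-bones Marquardt loop following the scheme above; a sketch only,
# not the code of nlsr or any other package. resfn() and jacfn() are
# assumed user-supplied functions giving the residual vector and Jacobian.
marq <- function(x, resfn, jacfn, maxit = 100, lamtol = 1e-12, OFFSET = 100) {
  lambda <- 1e-4
  r <- resfn(x); ss <- sum(r^2)
  for (it in seq_len(maxit)) {
    J <- jacfn(x)                                     # step A: J and g
    A <- crossprod(J); g <- crossprod(J, r)
    repeat {                                          # step B: solve and try
      delta <- solve(A + lambda * diag(ncol(J)), -g)
      xnew <- x + as.numeric(delta)
      if (all((xnew + OFFSET) == (x + OFFSET)))       # "no change" test
        return(list(x = x, ss = ss, iterations = it))
      rnew <- resfn(xnew); ssnew <- sum(rnew^2)
      if (ssnew < ss) {                               # success: accept point
        x <- xnew; r <- rnew; ss <- ssnew
        lambda <- max(0.4 * lambda, lamtol)           # reduce lambda (floored)
        break
      }
      lambda <- 10 * lambda                           # failure: increase lambda
    }
  }
  list(x = x, ss = ss, iterations = maxit)
}
```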
### Automatic scaling
In @Marquardt1963, it is shown that the use of the diagonal elements of $J^t J$
instead of $\mathbf{1}_n$ in the Marquardt stabilized Gauss-Newton equations is equivalent
to having a similar scale for all the parameters. Call this diagonal matrix $D$.
Generally this is the preferred
approach to Levenberg-Marquardt stabilizations. However, in low-precision
arithmetic, elements of the diagonal may underflow to zero and cause issues of
singularity in the equations to be solved. A simple workaround is to slightly
modify the equations to
$$ (J^t J + \lambda (D + \phi \mathbf{1}_n)) \delta = - J^t r $$
where $\phi$ is a modest number such as 1.
Note that we form $J^t J$ in `nlfb()` to provide the scaled stabilization,
which is quite a lot of computational work. The experimental package
`nlsralt` has been built to add functions `nlxbx()` and `nlfbx()` that
use ONLY the fixed element (i.e., unit matrix) stabilizations. A very
preliminary timing (2021-06-26) using the Hobbs unscaled problem shows
a slight timing improvement for the simplified version, with an equivalent
solution.
## Others
There have been many proposed approaches to nonlinear least squares.
### Spiral
@Jones70Spiral is a method ...
### Approaches to Newton's method
Newton (and his sometime associate Raphson) applied the algorithm we attribute
to them to a very special case, and in a manner unrecognizable as that described
above. (See \url{https://en.wikipedia.org/wiki/Newton%27s_method}).
The main attraction of the Gauss-Newton method is that the Jacobian only needs first
derivatives, which are generally much, much easier to compute than the full second
derivatives of the Hessian. However, there have been suggestions of
Newton-like approaches, possibly for special types of models where some shortcut
to the Hessian is available. These may or may
not make sense in the context of
a given user or group of users. I regard these approaches as part of the
computational infrastructure of the special cases rather than as general nonlinear
least squares methods.
### Hybrid methods
The central idea of Hartley's method is a line search and that of Marquardt a
parametrized stabilization that can be shown to reduce a sum of squares for large
enough stabilization parameter. There is nothing to prevent use of a line
search in addition to the Marquardt stabilization, and there are, of course,
several reasonable choices for line search. Also we could solve the (possibly
stabilized) Gauss-Newton equations by several quite different numerical linear
algebra techniques, or even approximations. In certain situations, some of these
choices may be important, but here I will not pursue them further.
# Sources of implementation variety
The sources of variety in implementation include:
- structuring of the algorithm, that is, how we set up and sequence the
parts of the overall computation
- Linking of data to the problem. In particular, R allows **subsetting**
of data at quite a high level, but this capability is almost totally
hidden from the user or programmer, so that the feature could be dangerous to
use. There appears, moreover, to be no example or test with `nls()`
to show how to use this facility, even though `subset` is an argument
to the `nls()` function. This is discussed further below.
- programming language
- possible operating environment features
- solver for the least squares or linear equations sub-problems
- structure of storage for the solver, that is, compact or full
- sequential or full creation of the Jacobian and residual, since
it may be done in parts
- how the Jacobian is computed or approximated
- higher level presentation of the problem to the computer, as
in R's `nls` versus packages `minpack.lm` and `nlsr`, or as
a general optimization problem.
## Algorithm structure
The R function `nls()` and the package `minpack.lm` both set up the nonlinear
least squares computation for estimating a model specified as a **formula** (an
R object class) by building an `nlsModel` object, which we will label `m`. This
object is a set of functions that provide the information needed to compute the
residuals, the Jacobian, the search direction, and many other objects required for
the nonlinear least squares solution, as well as ancillary information used in
the post-solution analysis and reporting.
In the package `nlsr` (@nlsr-manual), the function `nlxb()` calls `model2rjfun()`
to create functions `res()` and `jac()` that are used in `nlfb()` (the function
`nls.lm()` in `minpack.lm` is very similar) to find a
nonlinear least squares solution. However, post-solution information is built
explicitly AFTER a solution is found.
The essential difference is that `nls()` and `minpack.lm` build a toolbox, `m`,
while `nlsr` creates its required objects as they are needed for the Marquardt
computations.
### Issues with this structure
A major criticism of this approach is that the collection of information needed
at any one time -- the "environment" in which residuals and jacobians are calculated,
as well as the solution methods -- is not at all well-documented. If the software is
to be updated or maintained, the location and purpose of the many elements that
reside in different environments are difficult to determine without much effort.
In my opinion, this renders the structure prone to collapse if there are infrastructure
or other changes. Documentation would go a long way to correcting this problem, but
even better would be clearer structuring. Separation of the problem from the method
would also be useful. At the time of writing of this document, the Gauss-Newton
METHOD is embedded in the functions created by `nlsModel()` for a given PROBLEM.
## Programming language
We have versions of Nashlib Algorithm 23 in BASIC, Fortran, Pascal, and R,
with Python pending. There may be dialects of these programming languages also,
giving rise to other variations.
## Data linkage to software
Quite reasonably, linking particular data to our problem is mediated by software,
and hence is influenced by the programming language. This is particularly true
for R, since R allows us to **subset** data. Essentially this provides an index
to the rows of data, most easily thought of in terms of a data frame structure.
### Data subsets
The `nls()` function has an argument `subset` and a test included in the
`Croucher-example1.R` file of this project showed this works as expected.
However, nowhere have we found a published example of the use of the
`subset` argument with `nls()`. Worse, it clearly must be applied to
many of the functions used by `nls()` but makes no explicit appearance. Thus it
is VERY difficult to ensure subsetting is correctly applied when we wish to modify
the code, for instance, to provide newer features, especially more reliable
solvers.
Note that the `subset` ARGUMENT to `nls()` is not the same as the `subset()` function
that is usable to restrict the scope of an R object.
OUR HYPOTHESIS: `nls()` uses `subset` to automatically adjust the data to which all
the relevant functions are applied. We note that `minpack.lm::nlsLM` also has a `subset` argument.
Package `nlsr` does not at the time of writing (2021-7-12), but is expected to use the **weights** mechanism
of the next subsection of this document in a future release.
### Other approaches to subsetting
In order to provide a sane and well-understood mechanism for subsetting, there
are alternatives for implementation.
- **via weights**: We can give each "row" that we do NOT wish included a weight of 0.
This does require us to compute the number of observations as the count of
non-zero weights, e.g., `sum(weights > 0)`; see the sketch following the proposal below.
Incidentally, this underlines the need to ensure weights are non-negative. ??now in nlsj, need to add to nlsr code.??
- **via preprocessing**: We could prepare a function to pre-process the problem and
create a data collection that is appropriately subset. This may offend against
the spirit of R/S, but it has the merit that we can easily reassure ourselves
that we have worked with the correct information, correctly used in our methods.
?? Should add this as an option.
PROPOSAL: use the "weights" approach if `nlsj` is called with `subset` active,
else recommend the preprocessing approach. The preprocessor could also do a lot
of checking, which we might leave out of `nlsj()` in some cases. As of 2021-7-12,
the weights mechanism is working in a trial `nlsj()` and should be transferable
to package `nlsr` fairly easily.
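As an illustration of the weights idea (made-up data and model, and using `nls()` itself rather than `nlsj()` or `nlsr`):
```{r wtsubset, echo=TRUE}
# Zero weights exclude rows from the fit; illustrative data and model only.
set.seed(123)
dwt <- data.frame(t = 1:10, y = 2 * exp(0.2 * (1:10)) + rnorm(10, sd = 0.1))
w <- as.numeric(dwt$t <= 8)          # keep the first 8 rows, drop the last 2
fitw <- nls(y ~ a * exp(b * t), data = dwt, start = list(a = 1, b = 0.2),
            weights = w)
coef(fitw)
sum(w > 0)                           # effective number of observations
```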
## Operating environment
Generally, for modern computers, the operating system and its localization do not
have much influence on the way in which we set up nonlinear least squares computations,
though the actual performance of the computations may well be influenced by such factors.
I will comment below about some issues where speed and size of data storage may favour
some approaches over others. However, such issues are less prominent today than in
the past.
One aspect of the computational environment that has proved important to me has
been that of low-precision arithmetic. In the 1970s, it was common to have only
the equivalent of 6 or 7 (decimal) digits of precision. In such environments,
where, moreover, guard digits and underflow and overflow behaviour were highly
variable features, it was quite common that arguments to functions like exp()
and log() were inadmissible for reasons of computability in the local system
rather than underlying mathematics. In such environments, where there were also
storage limitations, it was important to use very robust algorithms implemented
in ways that were simple but "safe". This led to a number of choices in the
development of the codes in @jncnm79. However, it has turned out that these
choices fare rather well even in more capable computational systems.
## Solver for the least squares or linear equations sub-problems
There are many choices of solver for the Gauss-Newton or related equations
in a nonlinear least squares program. We will consider some of the options.
Note that the choice to set up the normal equations will direct part of the
algorithm selection.
### Solution of the linear normal equations
The Gauss-Newton or Marquardt equations are a set of linear equations. Moreover,
the coefficient matrix is symmetric and non-negative definite, so both general
and specialized methods for linear equations can be applied. Furthermore, one can also
set up the solution without forming a coefficient matrix $A$
by using several matrix decomposition methods. Thus there are many possible procedures.
- Gauss elimination with partial or complete pivoting: The choice of partial or
complete pivoting greatly changes the program code. However, neither of these
choices uses the underlying structure and non-negative definite nature of
the matrix $A$.
- Variants of Gauss elimination that build matrix decompositions: We can apply
elimination to produce an LU decomposition of $A$, and this may offer some
computational advantages, though in the present case such approaches do not
seem to be ideal.
- Gauss-Jordan inversion: Here we compute the inverse of $A$ and then work with
this inverse. That is, we use
$$\delta = - A^{-1} g$$
This uses much more computational effort than simply solving for $\delta$,
though there is an interesting, and very tiny, method by @BauerReinsch71.
The sheer ingenuity of this code was magnetic, but it has proved to be of
limited value.
- Choleski Decomposition and back substitution: There are a number of varieties
of Choleski decomposition, but they all take advantage of the symmetric and
non-negative definite structure of $A$. A particularly compact form was used
in @Nash1987a, though it is debatable whether the saving in storage was worth
the complexity of the code and its documentation. (A small sketch using base R follows this list.)
- Eigendecompositions of the sum of squares and cross-products (SSCP)
matrix: the eigenvalues of real, symmetric matrices
such as $A$ can be computed in many ways and the eigendecomposition used to
solve the Gauss-Newton equations. The computations are much more costly than
simple solution of the equations, but the knowledge of the eigenvalues allows
us to avoid, or at least be aware of, conditioning issues.
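As a small illustration of the Choleski route mentioned above (arbitrary data; base R's `chol()`, `forwardsolve()` and `backsolve()` do the work):
```{r cholsketch, echo=TRUE}
# Solve the stabilized normal equations via Choleski factorization
# (arbitrary illustrative data).
set.seed(42)
J <- matrix(rnorm(20), 10, 2)
r <- rnorm(10)
lambda <- 1e-4
A <- crossprod(J) + lambda * diag(2)    # J^t J + lambda * I, positive definite
g <- crossprod(J, r)                    # J^t r
R <- chol(A)                            # A = R^t R with R upper triangular
delta <- backsolve(R, forwardsolve(t(R), -g))
all.equal(as.numeric(delta), as.numeric(solve(A, -g)))
```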
### Solution of the least squares sub-problem by matrix decomposition
Here we propose to solve the problem using the matrix $J$ or an augmented form $K$
directly. There are a number of approaches that can be used, which we
will divide into two classes, QR and SVD methods.
The QR approach aims to decompose the working matrix (call it $K$,
with $m$ rows and $n$ columns)
$$ K = Q R $$
where $Q$ is $m$ by $n$ orthogonal so
$$ Q^t Q = I_n $$
and $R$ is upper triangular and $n$ by $n$. Some variants have $Q$ an $m$ by $m$ orthogonal
matrix and $R$ upper triangular with $m - n$ rows of zeros appended below.
There are many ways of computing such decompositions:
- Householder reflections with or without pivoting options
- Givens plane rotations, similarly
- Gram-Schmidt orthogonalization of the columns of $K$, either directly (one
column at a time orthogonalized against all previous columns) or in a modified
fashion that proceeds row-wise. There are a number of detailed choices.
SVD approaches compute a decomposition
$$ K = U S V^t$$
where $U$ and $V$ have various orthogonality properties and $S$ is a diagonal matrix
with elements called the singular values (the square roots of the eigenvalues
of $K^t K$ actually). There are many variations depending on whether $U$ is
a full $m$ by $m$ matrix, limited to $m$ by $n$ or even truncated to $m$ by $r$ where
$r$ is the rank of $K$.
A detail in choosing our approach is the mechanism of inclusion of the
Marquardt (or Marquardt Nash) stabilization. It turns out to be remarkably
easy, if a little tedious. We simply augment $J$ by a diagonal matrix that
has elements $\sqrt{\lambda}$ times the diagonal stabilization elements
we wish to use. Then $K^t K$ is the appropriate $A$ for the Marquardt version
of the Gauss-Newton equations.
As we explain below, Marquardt unfortunately suggested a scaled
stabilization in which we add multiples of the diagonal elements of $J^t J$, as in the normal equations approach.
This requires that we perform at least a partial computation of the SSCP matrix,
which is relatively costly. For this reason we plan ?? to explore some proxies
for these elements.
### Avoiding re-computation of the Jacobian
The "original" Gauss-Newton iteration offers no guidance in how to deal with a
situation where $x_{new}$ has a poorer fit than $x$. @Hartley61 and @Marquardt1963,
along with most other practical methods, need to recompute the (weighted) residuals
and consequent loss function (sum of squares or deviance) for each value of a step-length
or stabilization parameter ($\lambda$) until either a "better" point $x_{new}$ is found
or we terminate the search.
Thus it is useful to be able to separate the computation of residuals from that of the
Jacobian. However, the structure of R is such that it is much easier to have the Jacobian
as an **attribute** of the residuals object. Indeed, it is the "gradient" attribute for
historical reasons. This is the setup for both the finite difference approximations of
`nlspkg::numericDeriv()` (the Jacobian routine in base R for nls() in file `./src/library/stats/R/nls.R`)
as well as the `rjfun()` generated by `nlsr::model2rjfun()`. Nevertheless, it is likely worth
avoiding duplication if possible, though the computation of Jacobian elements uses a lot of the
same steps as the computation of the residuals.
?? need to explore timings
Evidence is needed as to which de-duplication efforts give worthwhile savings, and in what
situations:
- saving residual values when computing derivatives via finite differences
- separation of the Jacobian routine completely from that of the residuals
- other measures?
## Storage structure
If the choice of approach to Gauss-Newton or Marquardt is to build the normal
equations and hence the sum of squares and cross products (SSCP) matrix, we
know by construction that this is a symmetric matrix and also non-negative definite.
(Generally, unless we have a strict linear dependence between the columns of
the Jacobian, we can say "positive definite".)
If we use an SSCP form, we can apply algorithms that specifically take advantage of both these
properties, including programs that store the working matrix in special formats.
Some of these are Algorithms 7, 8 and 9 of Nashlib (@jncnm79). Algorithms 7 and 8 are the
Cholesky decomposition and back-solution using a vector of length `n*(n+1)/2`
to store just the lower triangle of the SSCP matrix. Algorithm 9 inverts this
matrix *in situ*.
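The packing scheme itself is easily illustrated (a sketch of the storage layout only, not the Nashlib code; `pack_lower()` and `unpack_lower()` are illustrative names):
```{r packlower, echo=TRUE}
# Row-wise packing of the lower triangle of a symmetric n x n matrix into a
# vector of length n*(n+1)/2; an illustration of the storage scheme only.
pack_lower <- function(A) {
  n <- nrow(A)
  v <- numeric(n * (n + 1) / 2)
  for (i in 1:n) for (j in 1:i) v[i * (i - 1) / 2 + j] <- A[i, j]
  v
}
unpack_lower <- function(v, n) {
  A <- matrix(0, n, n)
  for (i in 1:n) for (j in 1:i) A[i, j] <- A[j, i] <- v[i * (i - 1) / 2 + j]
  A
}
n <- 4
A <- crossprod(matrix(rnorm(n * n), n, n))     # symmetric, non-negative definite
all.equal(unpack_lower(pack_lower(A), n), A)   # round trip recovers A
```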
The original @jncnm79 Algorithm 23 (Marquardt nonlinear least squares solution) computes
the SSCP matrix $J^t J$ and solves the Marquardt-Nash augmented normal equations
with the Cholesky approach. This was continued in the Second Edition @nlacatvn1060620 and
in @jnmws87. However, in the now defunct @jnnlmrt12 and successor @nlsr-manual, the choice
has been to use a QR decomposition of an augmented Jacobian. `nls()` uses a QR decomposition
of the Jacobian without augmentation (referred to as the "gradient", however).
The particular QR calculations in these packages and in R-base are quite well-hidden in internal
code, complicating comparisons of storage, complexity and performance. The present exposition
is an attempt to provide some documentation.
### Storage in Nash/Walker-Smith (1987) BASIC code
In @Nash1987a, each major iteration generates the matrix $A = J^t J$ and the gradient $g = J^t r$ (so the right-hand side is $-g$)
at parameters $x$, using functions for the residuals and Jacobian.
The lower triangle of $A$ is stored as a vector of $n (n+1)/2$ elements (row-wise ordering of $C$).
The diagonal elements are then multiplied by $(1 + \lambda)$ and $\lambda * \phi$ is added to each,
where $\lambda$ is our Marquardt parameter and $\phi$ is the adjustment suggested in @jn77ima to
overcome potential underflow to zero of the accumulation of the diagonal elements in low-precision
arithmetic. Let us call the augmented SSCP matrix $C$.
Solving $C \delta = -g$, we compute the new sum of squares at $x + \delta$. If this value is greater than
or equal to the current best (i.e., lowest) sum of squares, we increase $\lambda$ and generate a new
$C$ matrix. Note that we do not, however, need to build a new SSCP matrix or gradient $g$, just make the diagonal
adjustment.
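A sketch of that diagonal adjustment, applied directly to a packed lower triangle such as the one produced in the packing sketch earlier (`adjust_packed()` is an illustrative name, not the BASIC code):
```{r packadjust, echo=TRUE}
# The Marquardt-Nash diagonal adjustment applied directly to a packed
# lower triangle (illustrative function name, not the BASIC code).
adjust_packed <- function(v, n, lambda, phi = 1) {
  dpos <- cumsum(1:n)                           # positions of diagonal elements
  v[dpos] <- v[dpos] * (1 + lambda) + lambda * phi
  v
}
```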
The code in @Nash1987a is much more complicated than this description suggests because of
- bounds and masks on the parameters
- tests for "non-computability", where we allow the user to provide an explicit flag that the function cannot
be computed at the trial parameters. This was included to avoid some of the weaknesses in evaluation of
special functions, e.g., `log()` for small arguments or `exp()` for large ones, or for problem specific
situations where some parameter combinations might be considered inadmissible.
- situations where rounding and truncation errors make the Cholesky decomposition return with an indication
that $C$ is "singular", at least in the particular computational arithmetic.
### Approach in package `nlsr`
In @nlsr-manual, the approach to the augmented Gauss-Newton equations is nominally equivalent to that in @Nash1987a,
except that we perform the solution using a QR decomposition, in fact a modification of the DQRDC routine from
@lapack99, which uses Householder transformations.
One of the issues with creating the SSCP matrix $A$ is that we lose information in finite arithmetic
(See @jncnm79, Chapter 5). A QR approach to the traditional Gauss-Newton codes is to solve
$$ J \delta = -r $$
That is, we assume a generalized inverse of $J$ can be applied to both sides of the Gauss-Newton equations.
To move to the Marquardt variant of the Gauss-Newton, we need to augment $J$. Let us call this augmented matrix
$K$. We will also need to augment $r$ by an appropriate number of zeros. For the traditional Marquardt method, we
need to use the square roots of the diagonal elements of $J^t J$, that is, an $n$ by $n$
diagonal matrix $Z1$ where
$$ Z1_{k,k} = \sqrt{\lambda} \sqrt{(J^t J)_{k,k}} $$
and $r$ is extended by $n$ zeros. For the Nash modification, we add a further $n$ zeros to $r$ and augment
the $K$ matrix with another $n$ by $n$ block $Z2$ which is a diagonal matrix where the diagonal elements are
$$ Z2_{k,k} = \sqrt{\lambda} \sqrt{\phi} $$
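A sketch of this augmentation with arbitrary data (and a check against the equivalent normal-equations form):
```{r kaugment, echo=TRUE}
# Build the augmented matrix K = rbind(J, Z1, Z2) and check it against the
# equivalent normal-equations form (arbitrary illustrative data).
set.seed(2)
m <- 8; n <- 3
J <- matrix(rnorm(m * n), m, n)
r <- rnorm(m)
lambda <- 0.01; phi <- 1
D  <- diag(crossprod(J))                          # diagonal of J^t J (a vector)
Z1 <- sqrt(lambda) * diag(sqrt(D))                # scaled (Marquardt) block
Z2 <- sqrt(lambda * phi) * diag(n)                # Nash modification block
K  <- rbind(J, Z1, Z2)
r_aug <- c(r, rep(0, 2 * n))
delta <- qr.coef(qr(K), -r_aug)
delta_ne <- solve(crossprod(J) + lambda * (diag(D) + phi * diag(n)),
                  -crossprod(J, r))
all.equal(as.numeric(delta), as.numeric(delta_ne))
```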
Unfortunately, there is a need in each major iteration to compute at least the diagonal elements of the
SSCP matrix. It may, in fact, be more efficient to omit the $Z1$ augmentation. Package `nlsralt`, as
mentioned, does this with functions `nlxbx()` (for formula-specified models) and `nlfbx()` (for problems
specified with `res()` and `jac()` functions). This gives the following preliminary example using the
Hobbs weed problem.
<!-- https://bookdown.org/yihui/rmarkdown-cookbook/source-script.html -->
<!-- ```{r, include=FALSE} -->
<!-- source("your-script.R", local = knitr::knit_global()) -->
<!-- # or sys.source("your-script.R", envir = knitr::knit_global()) -->
<!-- ``` -->
??FIXME??
```{r hobbtime1}
source("Tests/Examples/HobbsTiming.R", echo=TRUE)
```
### Other storage approaches
?? do we want to report any other approaches here?
In particular, should we look at minpack.lm to see how the Marquardt stabilization
is applied? This is a very complicated program, since it calls C to call Fortran,
and the Fortran is layered in a way that each part is well-documented but the
general flow and linkage of the parts is non-obvious.
## Sequential or full Jacobian computation
We could compute a row of the Jacobian plus the corresponding residual element
and process this before computing the next row, and so on. This means the full Jacobian
does not have to be stored. In Nashlib, Algorithms 3 and 4, we used row-wise data
entry in linear least squares via Givens' triangularization (QR decomposition),
with the possibility of extending the QR to a singular value decomposition.
The SSCP matrix can also be formed row-wise. We do not
anticipate that any of these approaches will offer advantages that are so
great that we would choose them.
## Analytic or approximate Jacobian
`nls()` and `minpack.lm` use finite difference approximations to compute the Jacobian.
In the case of `minpack.lm` it seems that the standard `numericDeriv()` (from the
set of routines in file `nls.R`) is used initially for the "gradient", but within
the iteration, the Fortran routine lmdif.f computes its own approximations for
the Jacobian, adding another layer of potential confusion.
The `nlsr` package, however, attempts to compute analytic derivatives using symbolic
and analytic derivative possibilities from other parts of R. A great advantage of
these is that they are evaluated at a given point, and do not require that a finite step
be taken, which risks violating a constraint or producing an inadmissible
set of arguments for special functions. In my experience, the use of analytic derivatives
is not much faster than the finite difference approximations, and generally finite
difference methods iterate as quickly as the "exact" derivative methods to the
neighbourhood of a solution. It is in allowing good estimates of convergence
criteria that the analytic derivatives seem to show their advantage, i.e., knowing
we are at a solution, rather than in the speed of getting there.
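A small comparison of the two routes (illustrative model and values; `deriv()` and `numericDeriv()` are base R):
```{r jaccompare, echo=TRUE}
# Analytic Jacobian via deriv() versus finite differences via numericDeriv();
# illustrative model and parameter values.
t <- 1:5; a <- 2; b <- 0.3
fj <- deriv(~ a * exp(-b * t), c("a", "b"), function.arg = c("a", "b", "t"))
Ja <- attr(fj(a, b, t), "gradient")      # analytic derivatives
nd <- numericDeriv(quote(a * exp(-b * t)), c("a", "b"))
Jn <- attr(nd, "gradient")               # finite difference approximation
max(abs(Ja - Jn))                        # agreement to finite-difference accuracy
```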
## Interfacing to problems
R allows the nonlinear least squares problem to be presented via a formula
for the model. This is generally very attractive to users, since they can
specify their problems in ways that often accord with how they think about them.
Many (most?) researchers do not write program code, so preparing residual and
Jacobian functions is foreign to them.
From the point of view of package or system developers, however, specifications as
formulas pose some particular issues:
- translating a formula to a function for its actual evaluation may be tricky,
since we must essentially parse the formula. R has some tools for this that
we use.
- even if we can translate a formula to a function, we need to check that we
have all the data and parameter inputs to enable the evaluation. Moreover,
we need to check if the inputs are valid or admissible.
- checking for correctness of inputs and translations can be considered to
take up most of the code in a reliable and robust package.
# Saving storage
The obvious ways to reduce storage are:
- use a row-wise generation of the Jacobian in either a Givens' QR or SSCP approach.
This saves space for the Jacobian as well as the working matrices of
the Gauss-Newton or Marquardt iterations;
- if the number of parameters to estimate is large enough, use a normal equations
approach with compact storage of the lower triangle of the SSCP matrix.
However, the scale of the saving is really very small in comparison to
the size of most programs.
# Measuring performance
Do we plan to use only timing? Or size of code and memory used? ??
# Test problems
```{r settest15}
# set parameters for a 3-parameter logistic test problem
set.seed(123456)
a <- 1
b <- 2
c <- 3
np <- 15
tt <- 1:np
yy <- 100*a/(1+10*b*exp(-0.1*c*tt))   # noise-free model values
plot(tt, yy, type='l')
set.seed(123456)
ev <- runif(np)
ev <- ev - mean(ev)                   # centred uniform disturbances
y1 <- yy + ev                         # small noise
points(tt, y1, type='p', col="green")
y2 <- yy + 5*ev                       # moderate noise
points(tt, y2, type='p', col="blue")
y3 <- yy + 10*ev                      # large noise
lg3d15 <- data.frame(tt, yy, y1, y2, y3)
points(tt, y3, type='p', col="red")
library(nlsr)
sol0 <- nlxb(yy ~ a0/(1+b0*exp(-c0*tt)), data=lg3d15, start=list(a0=1, b0=1, c0=1))
print(sol0)
sol1 <- nlxb(y1 ~ a1/(1+b1*exp(-c1*tt)), data=lg3d15, start=list(a1=1, b1=1, c1=1))
print(sol1)
sol2 <- nlxb(y2 ~ a2/(2+b2*exp(-c2*tt)), data=lg3d15, start=list(a2=1, b2=1, c2=1))
print(sol2)
sol3 <- nlxb(y3 ~ a3/(3+b3*exp(-c3*tt)), data=lg3d15, start=list(a3=1, b3=1, c3=1))
print(sol3)
```
The following is a larger dataset version of this test.
```{r settest150, eval=FALSE}
np <- 150
tt <- (1:np)/10
yy <- 100*a/(1+10*b*exp(-0.1*c*tt))
set.seed(123456)
ev <- runif(np)
ev <- ev - mean(ev)
y1 <- yy + ev
y2 <- yy + 5*ev
y3 <- yy + 10*ev
lg3d150 <- data.frame(tt, yy, y1, y2, y3)
np <- 1500
tt <- (1:np)/100
yy <- 100*a/(1+10*b*exp(-0.1*c*tt))
set.seed(123456)
ev <- runif(np)
ev <- ev - mean(ev)
y1 <- yy + ev
y2 <- yy + 5*ev
y3 <- yy + 10*ev
lg3d1500 <- data.frame(tt, yy, y1, y2, y3)
f0 <- yy ~ a0/(1+b0*exp(-c0*tt))
f1 <- y1 ~ a1/(1+b1*exp(-c1*tt))
f2 <- y2 ~ a2/(2+b2*exp(-c2*tt))
f3 <- y3 ~ a3/(3+b3*exp(-c3*tt))
```
# Implementation comparisons
Here we want to explore these ideas. First we will examine the sub-problem of solving
the Gauss-Newton equations or their Marquardt variants.
## Linear least squares and storage considerations
Without going into too many details, we will present the linear least squares
problem as
$$ A x \approx b $$
In this case $A$ is an $m$ by $n$ matrix with $m \ge n$ and $b$ a vector of length $m$.
We write **residuals** as
$$ r = A x - b $$
or as
$$ r_1 = b - A x $$
Then we wish to minimize the sum of squares $r^t r$. This problem does not necessarily
have a unique solution, but the **minimal length least squares solution**, namely
the $x$ with smallest $x^t x$ among those minimizing $r^t r$, is unique.
### Example setup and run in lm()
Let us set up a simple problem in R:
```{r ls1, echo=TRUE}
# simple linear least squares examples
v <- 1:6
v2 <- v^2
vx <- v+5
one <- rep(1,6)
Ad <- data.frame(one, v, v2)
A <- as.matrix(Ad)
print(A)
Ax <- as.matrix(data.frame(one, v, vx, v2))
print(Ax)
y <- -3 + v + v2
print(y)
set.seed(12345)
ee <- rnorm(6)
ee <- ee - mean(ee)
ye <- y + 0.5*ee
print(ye)
sol1 <- lm.fit(A, y)
print(sol1)
cat("Residual SS=",as.numeric(crossprod(sol1$residuals)),"\n")
sol1e <- lm.fit(A,ye)
print(sol1e)
crossprod(sol1e$residuals)
sol2<-lm.fit(Ax,y)
# Note the NA in the coefficients -- Ax is effectively singular
print(sol2)
S2 <- sol2$coefficients
J <- which(is.na(S2))
S2[J]<-0
crossprod(S2)
```
The above uses the intermediate code in function `lm.fit()`. This uses a QR solver, but
it is written as a wrapper in C calling a Fortran routine (in the Fortran 77 dialect).
?? Should we put in the structure and where the code is located?
I believe that the structure of lm() predates the availability of the family
of `qr.xxx()` functions for R. These allow us to access the QR approach
directly.
```{r ls2, echo=TRUE}
x <- qr.solve(A,y)
print(x)
xe<-qr.solve(A, ye)
print(xe)
# But then we get an error when we try to solve the singular system
# This was NOT caught above.
xx <- try(qr.solve(Ax,y))
print(xx)
xxe<-try(qr.solve(Ax, ye))
print(xxe)
```
## Traditional Normal equations approach
The historically traditional method for solving the **linear** least squares problem was
to apply calculus to set the partial derivatives of the sum of squares with respect to
each parameter to zero. This forms the **normal equations**
$$ A^t A x = A^t b $$
This was attractive to early computational workers, since while $A$ is $m$ by $n$, $A^t A$
is only $n$ by $n$. Unfortunately, this **sum of squares and cross-products** (SSCP) matrix
can make the solution less reliable, and this is discussed with examples in @jncnm79 and
@nlacatvn1060620.
```{r lm5, echo=TRUE}
AtA <- t(A)%*%A
print(AtA)
Aty <- t(A)%*%y
print(t(Aty))
x<-solve(AtA,Aty)
print(t(x))
```
Let us try this with the extended matrix `Ax`
```{r lm6, echo=TRUE}
AxA <- t(Ax)%*%Ax
print(AxA)
Axy <- t(Ax)%*%y
print(t(Axy))
xx<-try(solve(AxA,Axy))
print(t(xx))
```
### Dealing with singularity
The `lm.fit()` function has a parameter `singular.ok` which defaults to TRUE.
By setting this to FALSE, we get an error:
```{r lm3, echo=TRUE}
sol2b<-try(lm.fit(Ax,y, singular.ok=FALSE))
print(sol2b)
```
We've already seen above that the solution `sol2` has a size of
```{r num1}
as.numeric(crossprod(S2))
```
The solution `sol2` is, it turns out, not unique, since the variables `v` and
`vx` and `one` are related, namely
```{r num2, echo=TRUE}
print(vx - (v+5*one))
```
Thus we can find a "new" solution as follows and show (essentially) the same
residuals
```{r num3, echo=TRUE}
res2<-Ax%*%S2-y
print(t(res2))
print(S2)
S2b<-S2+c(10,2,-2,0)
res2b<-Ax%*%S2b-y
print(t(res2b))
cat("Sum of squares of S2b=", as.numeric(crossprod(S2b)),"\n")
```
### Approximate solution -- Ridge regression
An approximate solution via the normal equations can be found by adding a small
diagonal matrix to the sum of squares and
cross products `AxA`.
```{r lm7, echo=TRUE}
AxA <- AxA + diag(rep(1e-10,4))
print(AxA)
xxx<-try(solve(AxA,Axy))
print(t(xxx))
print(t(Ax %*% xxx - y))
```
We note that this solution is rather different from `S2` or `S2b`.
There is a large (and often misguided and confusing) literature
about this sort of approach under the title **ridge regression**.
Note that the Marquardt stabilization of the Gauss-Newton equations
uses the same general idea.
### Dealing with singularity in QR solutions
We already saw that we got errors when trying to use `qr.solve()`, but there
are ways to use the QR decomposition that overcome the singularity. Here
are some illustrations. Note that the sizes of the solutions as measured by
the sums of squares of the coefficients (ignoring the NA using `na.rm=TRUE`)
are different. Moreover, they are different from the minimum length solution
in the next subsection.
```{r qrsing1, echo=TRUE}
## This fails
xx <- try(qr.solve(Ax,y))
print(xx)
## So does this
xxe<-try(qr.solve(Ax, ye))
print(xxe)
## Let us compute a QR decomposition, using LINPACK then LAPACK
qrd1<-qr(Ax, LAPACK=FALSE)
qrd2<-qr(Ax, LAPACK=TRUE)
# and get the solutions
xx1 <- try(qr.coef(qrd1,y))
print(xx1)
xx2 <- try(qr.coef(qrd2,y))
print(xx2)
# and compute the sum of squares of the coefficients (size of solution)
try(sum(xx1^2))
try(sum(xx1^2, na.rm=TRUE))
try(sum(xx2^2))
```
### Minimum length least squares solution
The minimum length least squares solution, which is unique,
is found using the Moore-Penrose inverse of `Ax`
(the article https://en.wikipedia.org/wiki/Moore%E2%80%93Penrose_inverse gives a quite good
overview) which can be computed from the Singular Value Decomposition. R has a function `svd()`
which is adequate to our needs here.
```{r lm4, echo=TRUE}
Z<-svd(Ax) # get the svd
D1 <- 1/Z$d # invert S, the singular values diagonal matrix
print(D1)
D1[4]<-0 # to remove linear dependency (small singvals are set to zero)