Header

Inversion questions from 2004 summer school in Golden, CO
(This Q&A list is a bit ad-hoc, but may be useful to others studying geophysical inverse theory. It was the result of a list drawn up during the summer school. Many thanks to those who submitted questions and answers. Note additional resources and links on Andy Ganse's geophysical inversion resources page link at left.)

  1. Which inversion algorithm is best for my nonlinear model?

  2. How to set up prior distributions?

  3. Does error/uncertainty of model always obey Gauss distribution?

  4. How to estimate the distribution of modeling error?

  5. Things to consider when quantitatively comparing different models & their uncertainties?

  6. Thoughts on how to graphically visualize uncertain results?

  7. Where does the production of scientific results end and the advising of decision-makers begin?

  8. Is it possible to measure errors in one prestack migration section?

  9. In statistical inversion, how do we define a persistant feature?

  10. What are some practical ways to incorporate hard constraints?

  11. How do you choose the parametrization of your model?

  12. What is the difference between "inversion" and "optimization"?

  13. Is there a systematic method of analyzing residuals to indicate deficiencies in the model complexities? (Ie can looking at the residuals tell something about how to improve the model?)

  14. What is the formal (mathematical) definition of the term "seismic image"?

  15. What is the difference between "probability" and "statistics"? (with reference to the lecture saying probability was for the forward problem and statistics was for the inverse problem)

  16. How do we interpret resolution of non-ordered model parameters or data? (eg density, soundspeed, attenuation, and so on for a halfspace as opposed to say soundspeed vs depth)

  17. What is the difference between an "inverse problem" and a "parameter estimation" problem?

  18. What are Cramer-Rao bounds and why are they often mentioned with respect to inversion results?

  19. Can one handle non-Gaussian data distributions in classical / frequentist inversion, or must we resort to Bayesian inversion for that?

  20. We heard a number of times for Bayesian inversion about using the posterior distribution from one run as the prior distribution of another run. So if I keep doing that for a problem, will the resulting posterior distribution always converge?







  1. Which inversion algorithm is best for my nonlinear model?
    • from Alexandra Newman, CSM, 07/05/2004 07:46 PM:

      This is a very difficult question that cannot be answered in general terms. Whereas there are two main algorithms for linear optimization models (with continuous-valued variables), there are many algorithms to handle nonlinear models. The algorithm you use will depend on a variety of aspects of the mathematical structure of your problem, including:

      • the density of the constraint ("A") matrix
      • whether the nonlinearity is in the objective function, the constraints, or both
      • whether the model is quadratic, or whether, at the least, you are minimizing a convex function over a convex set
      • the degree of nonlinearity in a nonconvex model (how ill-behaved your functional forms f, g, and h are)
      • whether the model is unconstrained or not
      • whether the constraints, if they exist, are equality or inequality constraints (or both)
      • whether your model contains integer variables

      As a rough (and very incomplete) guide:

      • Eldad suggested Gauss-Newton for geophysical inverse problems.
      • The Stanford Business Software webpage (www.sbsi-sol-optimize.com) suggests Minos for large-scale nonlinear optimization models containing a nonlinear objective and linear constraints with a sparse A matrix.
      • CPLEX will solve quadratic (convex) optimization problems.
      • BARON and MINLPBB are two more recent algorithms that are supposed to be able to handle nonlinear problems with integer-valued variables.

      The neos server (www-neos.mcs.anl.gov/neos) allows you to submit problems (in a certain format, such as AMPL) and it will "decide" which software is best for your particular problem instance. This also allows you to test software before you invest in it. There are 11 nonlinear solvers for models with continuous variables, and two solvers for nonlinear models with integer valued variables.


    • from Brian Borchers, NMT, 07/06/2004 05:40 PM:

      There is no simple answer to this question, except that you should talk to an optimization expert, especially if you think that the computations are going to be time consuming.

      Some questions that I would consider in selecting an optimization method are:

      • Does the problem have special structure that can be exploited? Is it linear or nonlinear? Is it a least squares problem? Is the objective function separable into parts that are independent of each other? Are there constraints or is the problem unconstrained? For linear problems, does the matrix have Toeplitz structure (this occurs in deconvolution problems and is a very nice property to have.)
      • In what form do I have the objective function and constraint functions? Are they given to me in matrix form? As functions written out in mathematical form? As computer subroutines with source code? As black box subroutines that I can't understand?
      • How large is the problem? Is the number of decisions variables in the hundreds, thousands, or much larger?

      The problems that come up in inversion are typically least squares problems, but they may be linear or nonlinear, and the size can range from small to quite large. As a starting point, here are my suggestions for some classes of unconstrained least squares problems (including least squares problems with Tikhonov regularization, since the TR term is just another least squares term.) For the nonlinear problems, I've assumed that the problems of nonconvexity are not severe, so that simply finding a local minimum or performing multistart is adequate. If the problem is strongly nonconvex, then you need to think about that a lot more. In the folowing, by "small", I mean less than about 1,000 variables.

      Small, linear, least squares: Use the singular value decomposition. This gives you the most flexibility with respect to Tikhonov regularization and can be much more accurate than solving the normal equations.
      Large, linear, least squares: Use conjugate gradients or other more sophisticated iterative methods that can take advantage of sparsity in the matrix.
      Small, nonlinear, least squares: Use the Levenberg-Marquardt method. The Gauss-Newton method is another option but less robust than LM.
      Large, nonlinear, least squares: Lots of options here. This is a situation where I would definitely recommend getting in touch with an optimization expert. I'd also recommend working with experts of large scale computing. Limited memory versions of Gauss-Newton or the Levenberg-Marquardt method use tricks to avoid computing the full Hessian matrix. These include the truncated Newton method and the limited memory BFGS method.

      Another general approach avoids Newton's method entirely by just using the gradient of the objective method. The conjugate method is most commonly used. The method of steepest descent typically performs very badly in practice and is definitely not recommended.



  2. How to set up prior distributions?
    • from Andy Ganse, APL-UW, 07/06/2004 10:48 AM:

      Note that a number of issues relating to this question were addressed in Alberto Maliverno's presentation on the last day of the program, which came after this question was posted... Please look at his presentation material for more info.



  3. Does error/uncertainty of model always obey Gauss distribution?
    • from Brian Borchers, NMT, 07/06/2004 05:16 PM:

      No.

      In the Bayesian approach with a multivariate normal prior and multivariate normal data (with known covariance), and a linear relationship between model and data, you get a multivariate normal posterior. If the relationship is weakly nonlinear, then it may be useful to linearize around the MAP solution to get a multivariate normal approximation to the posterior distribution. If the problem is strongly nonlinear, or the prior is nonnormal, or the data are nonnormal, then the posterior distribution won't be normal.

      In the classical approach, if you had to regularize the problem, then reguarlization bias creeps in, and all bets are off. However, if you solve a linear problem, and have normal data errors with known covariances, then you can construct confidence intervals based on the normal distribution. If the problem is weakly nonlinear, then it may also be appropriate to construct confidence intervals based on the normal distribution.




  4. How to estimate the distribution of modeling error?



  5. Things to consider when quantitatively comparing different models
    & their uncertainties?




  6. Thoughts on how to graphically visualize uncertain results?



  7. Where does the production of scientific results end and the
    advising of decision-makers begin?

    • from Alexandra Newman, CSM, 07/05/2004 07:46 PM:

      I would argue that this is not a two-phase process, but rather an iterative process. When I formulate an optimization model for someone
      in the real world, I take as much information as they give me, build a model, present a result, refine the model based on their reaction to the result, etc., and this may go on for quite a few iterations. Eventually, the user of the solution is relatively happy, yet may still need to tweak the result in order to account for aspects that are difficult to quantify.



  8. Is it possible to measure errors in one prestack migration
    section?



  9. In statistical inversion, how do we define a persistant feature?


  10. What are some practical ways to incorporate hard constraints?
    • from Alexandra Newman, CSM, 07/05/2004 07:46 PM:

      I am not sure if this means "difficult to model", "difficult to satisfy", or "constraints that are not soft". And, I would also like some clarification as to whether this is "incorporate with respect to an optimization model", or exactly how the constraints are to be incorporated.

      If the constraints are difficult to satisfy and a seemingly-appropriate algorithm will not solve well with the constraints, there are ways to allow a violation of the constraints with a penalty, and place this penalty in the objective, i.e., turn the constraints from "hard" to "soft".

      Mathematically, if we have an optimization model as follows:

      min cx

      subject to Ax = b

      and we want to make the constraint set Ax = b soft, we can add two vectors of elastic variables to the model, let's call them z and w, where z would be the undersatisfaction of the constraint and w the oversatisfaction, and then reformulate the model as follows:

      min cx + z + w

      subject to Ax + z - w = b.

      w, z >= 0.

      Now, any undersatisfaction in Ax = b for a particular constraint is picked up by the corresponding z variable, whereas oversatisfaction is picked up with the relevant w variable. There is a penalty for using both variables in the objective. Of course, one can make the penalty greater than 1.

      For inequality constraints, one need only use the w's or z's, depending on the direction of the inequality.


    • from Andy Ganse, APL-UW, 07/06/2004 10:55 AM:

      I think part of this question came from a conversation a few of us students were having. As far as myself was concerned, I'd mentioned confusion in using an optimization package (NPSOL), not being sure whether to implement a "hard" constraint (eg density definitely >= 0) in the "soft" constraint functionality that comes built in to NPSOL, or to try to implement it as an additional term in my objective function via a "log-barrier" function (set to blow up very quickly once you pass the bound, but perhaps that in fact makes it a soft constraint anyway?) as had been suggested to me previously. And then we got in a big muddled discussion of just how these things should be implemented and then the break was over...



  11. How do you choose the parametrization of your model?


  12. What is the difference between "inversion" and "optimization"?
    • from Alexandra Newman, CSM, 07/05/2004 07:46 PM:

      If we again consider the optimization problem:

      min cx

      subject to Ax = b

      inversion of A (the only matrix here) will give a set of solutions to the vector of decision variables, x. However, this is a set of solutions, and, if Ax = b is uniquely determined, the only set of solutions. Note we did not even consider the objective when setting:

      x = A^{-1}b

      In the event that A is underdetermined and there are many such solutions to the system Ax = b, and if we would like to choose among all such x vectors the one that makes us "best off", in our case above, minimizes cx, then we need to optimize, i.e., sort through all the x vectors that satisfy Ax = b in some (hopefully) systematic way to find the one that minimizes the expression cx. An optimization algorithm will do this "systematic sorting".


    • from Brian Borchers, NMT, 07/06/2004 05:09 PM:

      Optimization is the process of minimizing or maximizing a function, possibly subject to additional constraints.

      Optimization often appears within inversion, because the statistical approaches to inversion, including "least squares", "maximum likelihood estimation", and "maximum a posteriori solutions" all involve optimization problems.

      Another place where optimization appears in inverse problems is in finding the maximum/minimum values of parameters that are consistent with a set of observations. This is a case of minimizing or maximizing the model parameter subject to the constraint that the misfit between the data and the model predictions be less than some tolerance.

      However, optimization is also used in many other areas that have nothing to do with parameter estimation and inverse problems. For example, optimization techniques are often used in engineering design to find the best (cheapest, best performing, or whatever) design for a system.



  13. Is there a systematic method of analyzing residuals to indicate deficiencies in the model complexities? (Ie can looking at the residuals tell something about how to improve the model?)


  14. What is the formal (mathematical) definition of the term "seismic image"?


  15. What is the difference between "probability" and "statistics"? (with reference to the lecture saying probability was for the forward problem and statistics was for the inverse problem)


  16. How do we interpret resolution of non-ordered model parameters or data? (eg density, soundspeed, attenuation, and so on for a halfspace as opposed to say soundspeed vs depth)
    • from Brian Borchers, NMT, 07/06/2004 05:03 PM:

      I'd argue that you shouldn't even try to do this.

      In practice, many problems with a small number of unordered parameters are well conditioned parameter estimation problems that don't require regularization. For such problems, it's relatively easy to compute and plot confidence intervals or confidence regions or their Bayesian equivalents.

      In my experience, in those cases where the problem is very badly conditioned, this is typically due to inclusion of redundant parameters in the model that simply aren't required. Rewriting the problem in terms of the minimal number of parameters can often resolve the multicollinearity ("multicollinearity" is staticiain talk for ill-conditioning.)



  17. What is the difference between an "inverse problem" and a "parameter estimation" problem?
    • from Brian Borchers, NMT, 07/06/2004 04:58 PM:

      A "parameter estimation problem" is a problem in which a finite collection of parameters is to be estimated.

      To mathematicians, "inverse problem" typically means a problem in which the unknown is a function rather than a finite collection of parameters. In this world, all discretized versions of inverse problems are simply parameter estimation problems.

      Some people who work with discretized versions of inverse problems have begun to refer these problems as discrete inverse problems or discrete ill-posed inverse problems. This terminology is used for example in Per Hansen's book.

      Within the world of discretized problems, I think it's most important to make the distinction between problems that are well conditioned (and can be solved without regularization) and those problems that are so poorly conditioned that some kind of regularization is required.


  18. What are Cramer-Rao bounds and why are they often mentioned with respect to inversion results?


  19. Can one handle non-Gaussian data distributions in classical / frequentist inversion, or must we resort to Bayesian inversion for that?
    • from Brian Borchers, NMT, 07/06/2004 04:51 PM:

      In the classical world, the approach to use with non-Gaussian data distributions is "maximum likelihood estimation." This produces an estimate that happens to be the same as the MAP solution for a uniform (flat) prior in the Bayesian case.

      If regularization is added, then you have what is called "maximum penalized likelihood estimation."



  20. We heard a number of times for Bayesian inversion about using the posterior distribution from one run as the prior distribution of another run. So if I keep doing that for a problem, will the resulting posterior distribution always converge?
    • from Brian Borchers, NMT, 07/06/2004 04:47 PM:

      Converge to what? Ideally, you'd like the posterior distribution to converge to a delta function, indicating that precisely one model is consistent with the data and prior assumptions.

      Unfortunately, this doesn't typically happen in practice. Some things that can go wrong:

      • Due to errors in your modeling assumptions, or changes in environmental conditions, the new data that you get might be inconsistent with previous data.
      • The new data might not provide much additional information, so the posterior distribution doesn't change much (or doesn't change at all) from the prior. As an extreme example, if you repeatedly weigh the paperweight on my desk, it isn't going to give you any useful information about the depth of the core-mantle boundary.
      • Due to the ill-posed nature of many inverse problems, additional data simply don't provide enough information to significantly tighten the posterior distribution. (This is akin to the situation in the classical approach where the regularization bias is much larger than the uncertainty propagated from the data to the estimated model.)


      One common situation is one in which you repeatedly perform the same experiment (as in "stacking") Suppose that we have a discrete linear inverse problem, with a multivariate normal and multivariate normal distribution for the measurements. You can show that in this case, repeating the measurements is exactly equivalent to taking one more accurate measurement. Unfortunately, the effective measurement standard deviation is reduced at a rate proportional to one over the square root of the number of repititions. e.g. 100 repititions get you a factor of 10 reduction in the effective standard deviation of the measurement.

      This will help some, but as you keep repeating the measurements, you'll typically find that the prior uncertainty and ill-posed nature of the inverse problem conspire to keep the posterior distribution broad. For well posed problems (e.g. simple linear regression) things will work out much better.

      I'd encourage the student to actually try this with a simple linear inverse problem.