See details for model. Should likely be followed by function uncollapsePibble. Notation: N is number of samples, D is number of multinomial categories, and Q is number of covariates.

optimMaltipooCollapsed(
  Y,
  upsilon,
  Theta,
  X,
  KInv,
  U,
  init,
  ellinit,
  n_samples = 2000L,
  calcGradHess = TRUE,
  b1 = 0.9,
  b2 = 0.99,
  step_size = 0.003,
  epsilon = 1e-06,
  eps_f = 1e-10,
  eps_g = 1e-04,
  max_iter = 10000L,
  verbose = FALSE,
  verbose_rate = 10L,
  decomp_method = "cholesky",
  eigvalthresh = 0,
  jitter = 0
)

Arguments

Y

D x N matrix of counts

upsilon

(must be > D)

Theta

D-1 x Q matrix the prior mean for regression coefficients

X

Q x N matrix of covariates

KInv

D-1 x D-1 symmetric positive-definite matrix

U

a PQxQ matrix of stacked variance components

init

D-1 x N matrix of initial guess for eta used for optimization

ellinit

P vector of initial guess for ell used for optimization

n_samples

number of samples for Laplace Approximation (=0 very fast as no inversion or decomposition of Hessian is required)

calcGradHess

if n_samples=0 should Gradient and Hessian still be calculated using closed form solutions?

b1

(ADAM) 1st moment decay parameter (recommend 0.9) "aka momentum"

b2

(ADAM) 2nd moment decay parameter (recommend 0.99 or 0.999)

step_size

(ADAM) step size for descent (recommend 0.001-0.003)

epsilon

(ADAM) parameter to avoid divide by zero

eps_f

(ADAM) normalized function improvement stopping criteria

eps_g

(ADAM) normalized gradient magnitude stopping criteria

max_iter

(ADAM) maximum number of iterations before stopping

verbose

(ADAM) if true will print stats for stopping criteria and iteration number

verbose_rate

(ADAM) rate to print verbose stats to screen

decomp_method

decomposition of hessian for Laplace approximation 'eigen' (more stable-slightly, slower) or 'cholesky' (less stable, faster, default)

eigvalthresh

threshold for negative eigenvalues in decomposition of negative inverse hessian (should be <=0)

jitter

(default: 0) if >0 then adds that factor to diagonal of Hessian before decomposition (to improve matrix conditioning)

Value

List containing (all with respect to found optima)

  1. LogLik - Log Likelihood of collapsed model (up to proportionality constant)

  2. Gradient - (if calcGradHess=true)

  3. Hessian - (if calcGradHess=true) of the POSITIVE log posterior

  4. Pars - Parameter value of eta

  5. Samples - (D-1) x N x n_samples array containing posterior samples of eta based on Laplace approximation (if n_samples>0)

  6. VCScale - value of e^ell_i at optima

  7. logInvNegHessDet - the log determinant of the covariacne of the Laplace approximation, useful for calculating marginal likelihood

Details

Notation: Let Z_j denote the J-th row of a matrix Z. Model: $$Y_j \sim Multinomial(Pi_j)$$ $$Pi_j = Phi^{-1}(Eta_j)$$ $$Eta \sim T_{D-1, N}(upsilon, Theta\\*X, K, A)$$

Where A = (I_N + e^ell_1\*X\*U_1\*X' + ... + e^ell_P\*X\*U_P\*X' ), K is a D-1xD-1 covariance and Phi is ALRInv_D transform.

Gradient and Hessian calculations are fast as they are computed using closed form solutions. That said, the Hessian matrix can be quite large [N\(D-1) x N\(D-1)] and storage may be an issue.

Note: Warnings about large negative eigenvalues can either signal that the optimizer did not reach an optima or (more commonly in my experience) that the prior / degrees of freedom for the covariance (given by parameters upsilon and KInv) were too specific and at odds with the observed data. If you get this warning try the following.

  1. Try restarting the optimization using a different initial guess for eta

  2. Try decreasing (or even increasing)step_size (by increments of 0.001 or 0.002) and increasing max_iter parameters in optimizer. Also can try increasing b1 to 0.99 and decreasing eps_f by a few orders of magnitude

  3. Try relaxing prior assumptions regarding covariance matrix. (e.g., may want to consider decreasing parameter upsilon closer to a minimum value of D)

  4. Try adding small amount of jitter (e.g., set jitter=1e-5) to address potential floating point errors.

References

S. Ruder (2016) An overview of gradient descent optimization algorithms. arXiv 1609.04747

See also