The IWLS algorithm used to fit conditional logit models

The package “mclogit” fits conditional logit models by maximum likelihood. It maximizes the log-likelihood function using an iterative weighted least-squares (IWLS) algorithm that follows the one used by the glm.fit() function from the “stats” package of R (Nelder and Wedderburn 1972; McCullagh and Nelder 1989; R Core Team 2023).

If $\pi_{ij}$ is the probability that individual $i$ chooses alternative $j$ from his/her choice set $\mathcal{S}_i$, where

$$ \pi_{ij}=\frac{\exp(\eta_{ij})}{\sum_{k\in\mathcal{S}_i}\exp(\eta_{ik})} $$

and if $y_{ij}$ is the dummy variable that equals 1 if individual $i$ chooses alternative $j$ and equals 0 otherwise, the log-likelihood function (given that the choices are independent across individuals, conditional on the $\pi_{ij}$) can be written as

$$ \ell = \sum_{i,j} y_{ij}\ln\pi_{ij} = \sum_{i,j} y_{ij}\eta_{ij} - \sum_i \ln\left(\sum_j\exp(\eta_{ij})\right) $$

If the data are aggregated in terms of counts, such that $n_{ij}$ is the number of individuals that share the same choice set and the same choice probabilities $\pi_{ij}$ and have chosen alternative $j$, the log-likelihood is (again assuming independent choices given the $\pi_{ij}$)

$$ \ell = \sum_{i,j} n_{ij}\ln\pi_{ij} = \sum_{i,j} n_{ij}\eta_{ij} - \sum_i n_{i+}\ln\left(\sum_j\exp(\eta_{ij})\right) $$

where $n_{i+}=\sum_{j\in\mathcal{S}_i}n_{ij}$.
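
As a concrete illustration, here is a small R sketch (with made-up counts; this is not mclogit's internal code) that computes the choice probabilities and checks that the two forms of the aggregated log-likelihood agree:

```r
## Toy example (assumed data): two choice sets with three alternatives each.
set <- rep(1:2, each = 3)                 # choice-set membership
eta <- c(0.2, 1.1, -0.5, 0.0, 0.3, 0.9)   # linear predictors eta_ij
n   <- c(10, 25, 5, 8, 12, 20)            # counts n_ij

## pi_ij = exp(eta_ij) / sum_{k in S_i} exp(eta_ik), computed within each
## choice set (subtracting the maximum for numerical stability)
pr <- ave(eta, set, FUN = function(e) exp(e - max(e)) / sum(exp(e - max(e))))

## Both forms of the log-likelihood coincide:
ll1 <- sum(n * log(pr))
ll2 <- sum(n * eta) -
  sum(tapply(n, set, sum) * log(tapply(exp(eta), set, sum)))
all.equal(ll1, ll2)  # TRUE
```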

If

$$ \eta_{ij} = \alpha_1 x_{1ij} + \cdots + \alpha_r x_{rij} = \boldsymbol{x}_{ij}'\boldsymbol{\alpha} $$

then the gradient of the log-likelihood with respect to the coefficient vector α is

$$ \frac{\partial\ell}{\partial\boldsymbol{\alpha}} = \sum_{i,j} \frac{\partial\eta_{ij}}{\partial\boldsymbol{\alpha}} \frac{\partial\ell}{\partial\eta_{ij}} = \sum_{i,j} \boldsymbol{x}_{ij} (n_{ij}-n_{i+}\pi_{ij}) = \sum_{i,j} \boldsymbol{x}_{ij} n_{i+} (y_{ij}-\pi_{ij}) = \boldsymbol{X}'\boldsymbol{N}(\boldsymbol{y}-\boldsymbol{\pi}) $$

and the Hessian is

$$ \frac{\partial^2\ell}{\partial\boldsymbol{\alpha}\,\partial\boldsymbol{\alpha}'} = \sum_{i,j,k} \frac{\partial\eta_{ij}}{\partial\boldsymbol{\alpha}} \frac{\partial^2\ell}{\partial\eta_{ij}\,\partial\eta_{ik}} \frac{\partial\eta_{ik}}{\partial\boldsymbol{\alpha}'} = - \sum_{i,j,k} \boldsymbol{x}_{ij}\, n_{i+}\pi_{ij}(\delta_{jk}-\pi_{ik})\, \boldsymbol{x}_{ik}' = - \boldsymbol{X}'\boldsymbol{W}\boldsymbol{X} $$

Here $y_{ij}=n_{ij}/n_{i+}$, $\boldsymbol{N}$ is a diagonal matrix with diagonal elements $n_{i+}$, and $\boldsymbol{W}$ is block-diagonal with blocks $\boldsymbol{W}_i=n_{i+}\left(\operatorname{diag}(\boldsymbol{\pi}_i)-\boldsymbol{\pi}_i\boldsymbol{\pi}_i'\right)$, where $\boldsymbol{\pi}_i$ collects the probabilities $\pi_{ij}$, $j\in\mathcal{S}_i$.
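
The middle expression of the Hessian relies on the second derivatives of $\ell$ with respect to the linear predictors; these follow from differentiating the choice probabilities, a step worth making explicit:

$$ \frac{\partial\pi_{ij}}{\partial\eta_{ik}} = \pi_{ij}(\delta_{jk}-\pi_{ik}) \qquad\text{so that}\qquad \frac{\partial^2\ell}{\partial\eta_{ij}\,\partial\eta_{ik}} = -n_{i+}\frac{\partial\pi_{ij}}{\partial\eta_{ik}} = -n_{i+}\pi_{ij}(\delta_{jk}-\pi_{ik}) $$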

Newton-Raphson iterations then take the form

$$ \boldsymbol{\alpha}^{(s+1)} = \boldsymbol{\alpha}^{(s)} - \left( \frac{\partial^2\ell}{\partial\boldsymbol{\alpha}\partial\boldsymbol{\alpha}'} \right)^{-1} \frac{\partial\ell}{\partial\boldsymbol{\alpha}} = \boldsymbol{\alpha}^{(s)} + \left( \boldsymbol{X}'\boldsymbol{W}\boldsymbol{X} \right)^{-1} \boldsymbol{X}'\boldsymbol{N}(\boldsymbol{y}-\boldsymbol{\pi}) $$

where $\boldsymbol{\pi}$ and $\boldsymbol{W}$ are evaluated at $\boldsymbol{\alpha}=\boldsymbol{\alpha}^{(s)}$.

Pre-multiplying both sides by $\boldsymbol{X}'\boldsymbol{W}\boldsymbol{X}$ gives

$$ \boldsymbol{X}'\boldsymbol{W}\boldsymbol{X}\boldsymbol{\alpha}^{(s+1)} = \boldsymbol{X}'\boldsymbol{W}\boldsymbol{X}\boldsymbol{\alpha}^{(s)} + \boldsymbol{X}'\boldsymbol{N}(\boldsymbol{y}-\boldsymbol{\pi}) = \boldsymbol{X}'\boldsymbol{W}\left(\boldsymbol{X}\boldsymbol{\alpha}^{(s)} + \boldsymbol{W}^-\boldsymbol{N}(\boldsymbol{y}-\boldsymbol{\pi})\right) = \boldsymbol{X}'\boldsymbol{W}\boldsymbol{y}^* $$

where $\boldsymbol{W}^-$ is a generalized inverse of $\boldsymbol{W}$ and $\boldsymbol{y}^*$ is a “working response vector” with elements

$$ y_{ij}^*=\boldsymbol{x}_{ij}'\boldsymbol{\alpha}^{(s)}+\frac{y_{ij}-\pi_{ij}}{\pi_{ij}} $$
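
To make the equivalence concrete, the following self-contained R sketch (toy data and variable names are assumptions for illustration; this is not mclogit's internal code) performs one update both as a Newton-Raphson step and by solving the weighted least-squares equation with the working response:

```r
## One coefficient update, two ways (toy data as in the earlier sketch)
set <- rep(1:2, each = 3)
X   <- cbind(c(0.2, 1.1, -0.5, 0.0, 0.3, 0.9),   # two covariates x_ij
             c(1, 0, 0, 0, 1, 0))
n   <- c(10, 25, 5, 8, 12, 20)
n_i <- ave(n, set, FUN = sum)                    # n_{i+}
y   <- n / n_i                                   # y_ij = n_ij / n_{i+}

alpha <- c(0, 0)                                 # current value alpha^(s)
eta   <- drop(X %*% alpha)
pr    <- ave(eta, set, FUN = function(e) exp(e - max(e)) / sum(exp(e - max(e))))

## Block-diagonal W with blocks n_{i+} (diag(pi_i) - pi_i pi_i')
W <- matrix(0, length(pr), length(pr))
for (i in unique(set)) {
  idx <- which(set == i)
  W[idx, idx] <- n_i[idx][1] * (diag(pr[idx]) - tcrossprod(pr[idx]))
}

## Newton-Raphson form: alpha + (X'WX)^{-1} X'N(y - pi)
a_newton <- alpha + drop(solve(crossprod(X, W %*% X),
                               crossprod(X, n_i * (y - pr))))

## Working-response form: solve X'WX alpha = X'W y*
ystar <- eta + (y - pr) / pr
a_wls <- drop(solve(crossprod(X, W %*% X), crossprod(X, W %*% ystar)))
all.equal(a_newton, a_wls)  # TRUE: both forms give the same update
```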

The IWLS algorithm thus involves the following steps:

  1. Create some suitable starting values for $\boldsymbol{\pi}$, $\boldsymbol{W}$, and $\boldsymbol{y}^*$

  2. Construct the “working dependent variable” $\boldsymbol{y}^*$

  3. Solve the equation

    $$ \boldsymbol{X}'\boldsymbol{W}\boldsymbol{X}\boldsymbol{\alpha} = \boldsymbol{X}'\boldsymbol{W}\boldsymbol{y}^* $$

    for $\boldsymbol{\alpha}$.

  4. Compute updated $\boldsymbol{\eta}$, $\boldsymbol{\pi}$, $\boldsymbol{W}$, and $\boldsymbol{y}^*$.

  5. Compute the updated value for the log-likelihood or the deviance

    $$ d=2\sum_{i,j}n_{ij}\ln\frac{y_{ij}}{\pi_{ij}} $$

  6. If the decrease in the deviance (or the increase in the log-likelihood) is smaller than a given tolerance criterion (typically $\Delta d \le 10^{-7}$), stop the algorithm and declare it converged; otherwise go back to step 2 with the updated value of $\boldsymbol{\alpha}$. A bare-bones R version of this loop is sketched below.
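
Putting the steps together, the loop might look as follows. This is a sketch under the toy-data assumptions used above, not mclogit's actual implementation; for simplicity it starts from uniform probabilities rather than the package's count-based starting values described next:

```r
## Bare-bones IWLS loop (sketch; toy data as above, not mclogit's code)
set <- rep(1:2, each = 3)
X   <- cbind(c(0.2, 1.1, -0.5, 0.0, 0.3, 0.9),
             c(1, 0, 0, 0, 1, 0))
n   <- c(10, 25, 5, 8, 12, 20)
n_i <- ave(n, set, FUN = sum)
y   <- n / n_i

softmax_by_set <- function(eta, set)
  ave(eta, set, FUN = function(e) exp(e - max(e)) / sum(exp(e - max(e))))

make_W <- function(pr, set, n_i) {        # block-diagonal weight matrix
  W <- matrix(0, length(pr), length(pr))
  for (i in unique(set)) {
    idx <- which(set == i)
    W[idx, idx] <- n_i[idx][1] * (diag(pr[idx]) - tcrossprod(pr[idx]))
  }
  W
}

eta <- rep(0, length(n))                  # step 1: uniform starting values
dev_old <- Inf
for (s in 1:25) {
  pr    <- softmax_by_set(eta, set)
  ystar <- eta + (y - pr) / pr            # step 2: working dependent variable
  W     <- make_W(pr, set, n_i)
  alpha <- drop(solve(crossprod(X, W %*% X),   # step 3: solve X'WX a = X'W y*
                      crossprod(X, W %*% ystar)))
  eta   <- drop(X %*% alpha)              # step 4: update eta and pi
  pr    <- softmax_by_set(eta, set)
  dev   <- 2 * sum(n * log(y / pr))       # step 5: deviance (all n_ij > 0 here)
  if (dev_old - dev < 1e-7) break         # step 6: convergence check
  dev_old <- dev
}
alpha
```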

The starting values for the algorithm used by the mclogit package are constructed as follows:

  1. Set

    $$ \eta_{ij}^{(0)} = \ln (n_{ij}+\tfrac12) - \frac1{q_i}\sum_{k\in\mathcal{S}_i}\ln (n_{ik}+\tfrac12) $$

    (where $q_i$ is the size of the choice set $\mathcal{S}_i$)

  2. Compute the starting values of the choice probabilities $\pi_{ij}^{(0)}$ according to the first equation above

  3. Compute initial values of the working dependent variable according to

    $$ y_{ij}^{*(0)} = \eta_{ij}^{(0)}+\frac{y_{ij}-\pi_{ij}^{(0)}}{\pi_{ij}^{(0)}} $$
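
In R, these three steps might look as follows (a sketch with assumed toy counts, mirroring the construction above; not mclogit's internals):

```r
## Sketch of the count-based starting values (toy counts as above)
set <- rep(1:2, each = 3)
n   <- c(10, 25, 5, 8, 12, 20)
n_i <- ave(n, set, FUN = sum)
y   <- n / n_i

## Step 1: eta^(0)_ij = ln(n_ij + 1/2) minus its mean over the choice set
eta0 <- log(n + 0.5) - ave(log(n + 0.5), set, FUN = mean)

## Step 2: pi^(0) via the softmax formula from the beginning of the section
pr0 <- ave(eta0, set, FUN = function(e) exp(e - max(e)) / sum(exp(e - max(e))))

## Step 3: initial working dependent variable
ystar0 <- eta0 + (y - pr0) / pr0
```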

References

McCullagh, P., and J. A. Nelder. 1989. Generalized Linear Models. Monographs on Statistics & Applied Probability. Boca Raton et al.: Chapman & Hall/CRC.
Nelder, J. A., and R. W. M. Wedderburn. 1972. “Generalized Linear Models.” Journal of the Royal Statistical Society. Series A (General) 135 (3): 370–84. https://doi.org/10.2307/2344614.
R Core Team. 2023. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.