Statement of Research Goals and Accomplishments
Measurements in the social and behavioral sciences are often discrete or categorical (e.g., gender, race, occupation, type of high school program attended, the response option selected on a survey, ratings made by individuals on an integer scale). Unobserved or latent variables are often assumed to underlie observed measurements on discrete variables. My scholarship lies at the intersection of statistical models for multivariate discrete data and psychometrics. The general problem that motivates my research is how to represent and model associations between discretely measured variables in a meaningful and appropriate way, including situations where latent variables are hypothesized to lead to observed behavior.
In my initial work on developing models for multivariate categorical data, I proposed a family of models, ``3--mode association'' models, which consists of four general classes of models (Anderson, 1996). These models are generalizations of log-linear models (the standard approach for modeling multivariate discrete data) and Leo Goodman's multidimensional row-column or RC(M) association model. I have extended my initial work on models for categorical data in the following directions:
- Association models as latent variable models
- Estimation and Reproducibility
- Distance-based models
- Social Networks
Latent variable models
In education and psychology, theories often postulate that unobserved constructs underlie observed behavior. Log-multiplicative association models had originally been interpreted informally as latent variable models; however, they were never widely adopted in education and psychology. It was relatively well known in the RC(M) and log-multiplicative association model literature that when data arise from a multivariate normal distribution but observations are measured discretely, the model implied for the observed data is log-multiplicative. A major contribution that I added was showing that log-multiplicative association models can be derived from statistical graphical models where the observed variables are discrete and the unobserved ones are continuous (Anderson & Vermunt, 2000); that is, I provided a formal latent variable model. I further expanded on this by adding observed numerical or metric (``continuous'') variables to the system of observed and latent variables (Anderson & Böckenholt, 2000). The models proposed until this point could only represent 2--way associations or interactions between observed variables. This limitation was removed in Anderson (2002), where I proposed three ways of representing higher-way associations.
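For orientation, Goodman's RC(M) association model for a two-way table with expected counts $F_{ij}$ can be written as follows. The notation is one common convention and is an assumption here, not necessarily the notation of the papers cited above.

```latex
\log F_{ij} = \lambda + \lambda_i^{R} + \lambda_j^{C}
            + \sum_{m=1}^{M} \phi_m \, \mu_{im} \, \nu_{jm}
```

Here $\mu_{im}$ and $\nu_{jm}$ are scale values for row $i$ and column $j$ on dimension $m$, and $\phi_m$ is an association parameter; the identification constraints (typically centering and scaling of the scale values) are omitted. The model is log-multiplicative because products of unknown parameters enter the logarithm of the expected counts.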
A major class of latent variable models exists that models the relationship between discretely measured variables where the observed association is postulated to be due to latent continuous variables, namely item response theory (IRT) models. In Anderson and Yu (2007) I show that the assumptions that I made in deriving log-multiplicative models from a latent graphical model are identical to those made by Paul Holland (1993), who started with item response theory principles. This connection has both philosophical and practical implications. Using the statistical graphical models, an implication is that the items or variables chosen define the latent variable, which is the opposite of the construction of most latent variable models. More importantly, in Anderson and Yu (2007) I provide a second way to derive log-multiplicative models for observed data, one that postulates the existence of a latent construct that leads to observed behavior. I empirically show that log-multiplicative association models for dichotomous items behave nearly the same as standard IRT models (i.e., Rasch and two-parameter logistic models). I extended this second derivation in Anderson, Li and Vermunt (2007, almost in press), where I generalize it to the situation of polytomous items and multidimensional models in the Rasch family.
A very small modification of the proof in Anderson, Li and Vermunt (2007, almost in press) holds the key to showing that log-multiplicative association (LMA) models can be derived from compensatory multidimensional IRT models, which include many IRT models as special cases (e.g., two-parameter logistic, Bock's nominal response model, models with covariates).
Bringing together the several ways that LMA models can incorporate covariates and the formal latent variable interpretation of LMA models that I have added to the literature, I explicitly discuss how covariates can be added to the LMA model and what this means in terms of the latent variable model. In Tettegah and Anderson (2007) I did this treating the LMA model as a formative model; however, using the conditional specification (i.e., a reflective model), covariates can be added in many ways (e.g., in models for the latent variable(s), item difficulties, particular response option types, and more). Comparing my earlier writing to my most recent writing on the latent variable interpretation of log-multiplicative models shows a shift from presenting and using log-multiplicative models only as formative models to presenting them as either formative or reflective models. The latter removes the philosophical argument against LMA models as IRT models, but an estimation obstacle remains.
Estimation and Reproducibility
The parameters of log-multiplicative association models are typically estimated by maximum likelihood estimation (MLE) implemented with a Newton-Raphson type algorithm. Although this works well for small problems, it does not work for moderate to large problems. I developed an algorithm to fit LMA models to large data sets. It works very well, is remarkably flexible, and yields parameter estimates that are nearly indistinguishable from the maximum likelihood estimates. The next problem I faced was explaining why it worked. Starting with the special case of LMA models that correspond to models in the Rasch family, I extended a proof in the literature on estimation to prove that I was performing pseudo-likelihood estimation. Since the estimation method is pseudo-likelihood estimation, the parameter estimates are consistent and asymptotically normal, and robust standard errors can be computed. In the case of Rasch models, the method can be implemented in most standard statistical packages. In Anderson, Li and Vermunt (2007, almost in press), I present this special case of the algorithm, along with a description of how to use an R package that we developed for estimating parameters of Rasch models for polytomous (or dichotomous) items and multidimensional (or unidimensional) models. I have also implemented the method in SAS and within the next year will put examples on my web-site showing how to do it.
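The pseudo-likelihood idea can be illustrated in the simplest setting. The sketch below is not the algorithm or the R package described above; it assumes dichotomous items and a log-linear by linear association model with a single common association parameter `phi`, for which the conditional log-odds of each item given the others is `lambda_i + phi * (rest score)`. Under that assumption, maximizing the pseudo-likelihood reduces to one logistic regression on stacked person-by-item data. All function names are hypothetical.

```python
import numpy as np

def simulate_lma(n, lam, phi, rng):
    """Draw n binary response patterns from
    P(y) proportional to exp(sum_i lam_i*y_i + phi*sum_{i<j} y_i*y_j)."""
    k = len(lam)
    patterns = np.array([[(r >> i) & 1 for i in range(k)]
                         for r in range(2 ** k)])
    tot = patterns.sum(axis=1)
    pairs = tot * (tot - 1) / 2.0          # equals sum_{i<j} y_i*y_j for 0/1 data
    logp = patterns @ lam + phi * pairs
    p = np.exp(logp - logp.max())
    p /= p.sum()
    return patterns[rng.choice(len(patterns), size=n, p=p)]

def stack_rest_scores(Y):
    """Stack the data: one row per (person, item) pair.
    Columns are k item indicators followed by the rest score."""
    n, k = Y.shape
    rest = Y.sum(axis=1, keepdims=True) - Y   # total score minus the item itself
    X = np.zeros((n * k, k + 1))
    X[np.arange(n * k), np.tile(np.arange(k), n)] = 1.0
    X[:, k] = rest.ravel()
    return X, Y.ravel().astype(float)

def fit_logistic(X, y, n_iter=50, tol=1e-8):
    """Plain Newton-Raphson for logistic regression (no regularization)."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))
        grad = X.T @ (y - p)
        hess = X.T @ (X * (p * (1.0 - p))[:, None])
        step = np.linalg.solve(hess, grad)
        beta += step
        if np.max(np.abs(step)) < tol:
            break
    return beta

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    Y = simulate_lma(5000, np.array([-1.0, -0.5, 0.0, 0.5]), 0.5, rng)
    X, y = stack_rest_scores(Y)
    est = fit_logistic(X, y)
    print("item parameters:", est[:-1], "association:", est[-1])
```

Because the rows of the stacked data are not independent, the standard errors reported by an ordinary logistic regression routine would not be valid; robust (sandwich) standard errors, as mentioned above, are needed and are not computed in this sketch.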
Not only is the application of pseudo-likelihood estimation for fitting IRT models to data new, it is capable of doing what is known as item factor analysis. My method can estimate models with high dimensionality, which is a limiting problem for marginal maximum likelihood estimation, and the method is extremely flexible, which is not true of Bayesian estimation methods. The general algorithm will need additional programming to be fully useful; however, a set of macros for R, SAS or some other computing environment will suffice. Given such a set of macros, a very wide array of models can be fit.
Reproducibility in the quantitative and statistics literatures is a growing movement. For new developments to have an impact on practice, software that implements them should be developed that is usable by applied researchers. Toward this end, I have created a web-site that contains program code and macros (see http://www.ed.uiuc.edu/faculty/cja/homepage/software_index.html). All of the code found on the web-site can be run using standard or readily available software. Examples of the code that can be found there include R (and soon SAS) code for pseudo-likelihood estimation of log-linear by linear association models (i.e., a family of Rasch models) (Anderson, Li & Vermunt, 2007), a SAS macro that implements a new logistic regression model diagnostic procedure (Anderson & Rutkowski, 2007 in press), SAS macros for estimating restricted singular value decompositions for sets of matrices (de Rooij & Anderson, 2007 in press), and all input code for models reported in papers that I have published since 2002.
Distance Models
My work with distance (scaling) models has primarily been collaborative with my Dutch colleagues, including papers with Jeroen Vermunt (Vermunt & Anderson, 2005), Pieter Kroonenberg (Kroonenberg & Anderson, 2006), and Mark de Rooij (de Rooij & Anderson, 2007 in press).
In my dissertation, I compared the performance of the 3-mode association models, which I proposed, to 3-mode correspondence analysis. The development of 3-mode association models was published in Anderson (1996), but I did not publish the comparison with 3-mode correspondence analysis. In Kroonenberg and Anderson (2006), we compare the performance of 3-mode correspondence analysis with models from my more recent research (i.e., LMA as latent variable models). An interesting result is that the association models fit better than the correspondence analysis models and yield results similar to those of multiple correspondence analysis in terms of the scaling of categories.
Association models can be reparameterized in terms of a distance model, which yields a more direct way to interpret plots of scale values in terms of odds ratios. In a project with Mark de Rooij that uses this distance-based idea, my major contribution deals with sets of tables and the estimation of the models. I wrote SAS macros adapting a procedure proposed by Kiers and ten Berge to fit restricted singular value decompositions to sets of 2-way tables (de Rooij & Anderson, 2007 in press).
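The reparameterization rests on a simple algebraic identity. It is sketched here for an RC(M)-type model in assumed notation (expected counts $F_{ij}$, row and column scale values $\mu_{im}$ and $\nu_{jm}$, association parameters $\phi_m > 0$), which is not necessarily the notation of the cited paper.

```latex
\begin{align*}
\mu_{im}\nu_{jm} &= \tfrac{1}{2}\mu_{im}^{2} + \tfrac{1}{2}\nu_{jm}^{2}
                  - \tfrac{1}{2}\,(\mu_{im}-\nu_{jm})^{2}, \\
\log F_{ij} &= \lambda + \lambda_i^{R} + \lambda_j^{C}
             + \sum_{m=1}^{M} \phi_m \mu_{im}\nu_{jm} \\
            &= \lambda + \tilde\lambda_i^{R} + \tilde\lambda_j^{C}
             - \sum_{m=1}^{M} \tfrac{\phi_m}{2}\,(\mu_{im}-\nu_{jm})^{2},
\end{align*}
% with the squared terms absorbed into the main effects:
% \tilde\lambda_i^{R} = \lambda_i^{R} + \sum_m \tfrac{\phi_m}{2}\mu_{im}^{2},
% \tilde\lambda_j^{C} = \lambda_j^{C} + \sum_m \tfrac{\phi_m}{2}\nu_{jm}^{2}.
```

The main effects cancel in any odds ratio, so the log odds ratio for rows $i, i'$ and columns $j, j'$ depends only on the four weighted squared distances between the corresponding row and column points, which is what makes the plotted distances directly interpretable.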
Social Networks
In Templin, Moon-Ho, Anderson and Wasserman (2003), I proposed a random effects p* model (i.e., a hierarchical exponential random graph model) to study the relationships between actors within a network when there is a set of networks. The set of networks can be thought of as a random sample of networks. The idea was mine, but the nuts-and-bolts work was done by Templin and Moon-Ho, who were at the time graduate students at UIUC.
