Hi Nikolay and Yulia,
Awesome. I think we're talking about slightly different things with regard to #3. Here's an extremely brief explanation of the infit/outfit statistics from the Rasch perspective: http://www.rasch.org/rmt/rmt16... as well as a partially useful excerpt from Wright, B. D., & Masters, G. N. (1982). Rating Scale Analysis. Chicago, IL: MESA Press: http://www.rasch.org/rmt/rmt34....
The infit/outfit statistics are used when deciding whether to retain or drop an item from the scoring/item bank/calibration, and to some degree the person-level analogs of these statistics could be useful for detecting possible test irregularities (e.g., a student with low theta answering difficult questions correctly but easier questions at chance levels, etc.). Nothing I've read thus far has discussed an omnibus-style goodness-of-fit test, and the discussion (at least among the folks from the Rasch camp) tends to be framed in reverse: testing how well the data fit the model, rather than how well the model fits the data.
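For anyone following along, here is a minimal sketch of how those two mean-square statistics are computed for dichotomous Rasch data. The abilities, difficulties, and sample sizes below are made up for illustration; in practice theta and b would come from a calibration run. Outfit is the unweighted mean of squared standardized residuals (sensitive to outlying off-target responses), while infit weights the residuals by the model variance (sensitive to misfit on well-targeted items):

```python
import numpy as np

# Hypothetical setup: person abilities (theta) and item difficulties (b)
# are assumed known here; normally they come from a Rasch calibration.
rng = np.random.default_rng(42)
theta = rng.normal(0.0, 1.0, size=500)          # person abilities
b = np.array([-1.5, -0.5, 0.0, 0.5, 1.5])       # item difficulties

# Rasch model probability of a correct response, shape (persons, items)
P = 1.0 / (1.0 + np.exp(-(theta[:, None] - b[None, :])))
X = (rng.random(P.shape) < P).astype(float)     # simulated 0/1 responses

W = P * (1.0 - P)                # model variance of each response
Z2 = (X - P) ** 2 / W            # squared standardized residuals

# Outfit MSQ: unweighted mean square, per item
outfit = Z2.mean(axis=0)
# Infit MSQ: information-weighted mean square, per item
infit = ((X - P) ** 2).sum(axis=0) / W.sum(axis=0)

print("outfit:", np.round(outfit, 2))
print("infit: ", np.round(infit, 2))
```

Because the responses here are simulated from the model itself, both statistics should hover near their expected value of 1; items flagged for dropping in a real analysis would show values well above or below that.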
In either case, this is awesome and timely (there was a comparison of Stata's Bayesian capabilities to JAGS and Stan on Andrew Gelman's blog today). If I may throw out another potential idea for a future blog post: if it isn't too troublesome, anything that demonstrates fitting latent class models and/or mixture measurement models would be truly outstanding.
Thanks again,
Billy