Markov Chain Monte Carlo Simulation in R

Author’s name
Date:
Summary
Markov chain Monte Carlo (MCMC) simulations have been used increasingly over the last two
decades. Aimed at a broad audience, this report provides guidelines for reporting Monte Carlo
studies in that field. Such experiments, more commonly referred to as Monte Carlo or simulation
studies, are used to study the behavior of statistical methods and measures under controlled
situations. Although recent computing and methodological advances, including simulation based
on Markov chains, have made the simulation process more efficient, such experiments remain
limited by their finite nature and hence are subject to uncertainty: when a simulation is run
more than once, different results are obtained. However, little emphasis has been placed in the
published literature on reporting this uncertainty, referred to here as Monte Carlo error, or on
justifying the number of replications used. The dataset for the simulation is selected from the
datasets built into R. This research presents a series of simple and practical criteria for
estimating Monte Carlo error and for determining the number of replications required to achieve
a desired level of accuracy. The issues and methods are demonstrated with two simple examples,
one of which evaluates algorithm output using diagnostic tools to obtain the maximum likelihood
estimator. For the parameters of a logistic regression fitted in rjags we obtain 95% confidence
intervals. The results suggest that in many settings Monte Carlo error may be more substantial
than traditionally thought and should be accounted for to provide the best Bayesian analysis
using the MCMC approach.
Keywords: Markov Chain Monte Carlo; Simulation; rjags; Diagnostic tool; Markov
Chains
Introduction
Markov chain Monte Carlo (MCMC) methods are now routinely applied to fit complex models in many
disciplines. Their popularity is mainly due to wide usage in Bayesian statistics, though they
also play a crucial role in frequentist inference. Two critical questions that MCMC researchers
have addressed are where to start and when to stop the simulation. The main idea of MCMC is that
when simulating from a target density π is so difficult that the ordinary Monte Carlo method
based on independent and identically distributed (iid) samples cannot be used for making
inference on π, it may still be possible to construct a Markov chain {yn}, n ≥ 0, with
stationary density π and to use it to form Monte Carlo estimators. This report provides an
introduction to simulation using MCMC and to the construction of Markov chains using rjags and
the Gibbs sampler. General-purpose versions of those algorithms are available in R, and several
R packages implement MCMC algorithms specific to a finite number of statistical models. This
project does not dwell on model development but rather on analysis of the Markov chains obtained
after running the algorithms, in order to assess their convergence.
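As a minimal sketch of the idea that a Markov chain with stationary density π can stand in for iid draws, the following base R code runs a random-walk Metropolis sampler. It is an illustrative stand-in and not the rjags or Gibbs sampler used in this report; the target (a standard normal), the starting value, the step size and the burn-in length are all assumptions made for the example.

```r
# Random-walk Metropolis sampler for a target density known up to a constant.
# Illustrative only: the report itself relies on rjags, not on this function.
metropolis <- function(log_target, init, n_iter, step = 1) {
  chain <- numeric(n_iter)
  current <- init
  for (i in seq_len(n_iter)) {
    proposal <- current + rnorm(1, mean = 0, sd = step)
    # Accept with probability min(1, pi(proposal) / pi(current))
    if (log(runif(1)) < log_target(proposal) - log_target(current)) {
      current <- proposal
    }
    chain[i] <- current
  }
  chain
}

set.seed(1)
# Hypothetical target: standard normal log-density, up to an additive constant
log_pi <- function(y) -0.5 * y^2

chain <- metropolis(log_pi, init = 5, n_iter = 20000, step = 2)
kept <- chain[-(1:1000)]  # drop early draws influenced by the starting value

# Monte Carlo estimators of the mean and standard deviation of pi
est_mean <- mean(kept)
est_sd <- sd(kept)
print(c(mean = est_mean, sd = est_sd))
```

Because successive draws are dependent, the estimates carry more Monte Carlo error than the same number of iid draws would, which is why the starting value and the stopping time matter.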
Although these algorithms have been widely used in Bayesian analysis, they are also employed by
frequentists in missing-data and dependent-data problems where the likelihood involves
high-dimensional integrals. While MCMC algorithms allow a large expansion of the class of
candidate models for a given dataset, they also suffer from well-known and potentially serious
disadvantages. Researchers face difficulties in deciding when their use is appropriate and in
drawing conclusions about convergence. It is not clear how to know that a particular sample
truly represents the underlying stationary distribution of the Markov chain.
In this research we approached the problem of Bayesian analysis by assessing MCMC algorithms
through diagnostic tools applied to the output produced. The majority of applied work takes this
approach to the convergence problem. Employing a large number of parallel, independent chains
led to simple moment, quantile and density estimates. Since the stationary distribution is
unknown in practice, this same basic difficulty will plague any convergence diagnostic, which
has led some researchers to conclude that all such diagnostics are fundamentally unsound.
In this study, we introduced the MCMC convergence diagnostics. For each category of diagnostic,
we briefly reviewed the theoretical basis and discussed the practicality of implementation. We
also classified the methods according to whether they measure the convergence of univariate
quantities, multivariate quantities or the full joint distribution, and according to whether
their results are quantitative or qualitative in nature. Finally, we assessed the extent to
which each addressed the competing issues of bias and variance in the resulting estimated
features of the stationary distribution obtained from the women dataset.
A dataset is a collection of data with definite features. A data point, by contrast, is a single
instance within the dataset, here the women dataset; it is sometimes referred to as an object or
record. Any column head within a dataset, whether it holds text, numeric or categorical data, is
referred to as an attribute. Height and weight, the two attributes of the women dataset, are
examples of numeric data.
Body
1. Illustration of the “women” dataset

In RStudio, the available public datasets can be found and viewed. We surveyed them with the
data() function, which lists the datasets shipped with the software. The factors considered in
choosing the women dataset were quality, quantity, availability, gaps, and whether the data
could be used in MCMC simulation without changing the objective of this research. For this
study, the sample dataset has 15 data points with two attributes: height and weight. It has a
well-defined structure with 15 rows and two columns along with their heads. Each of the fifteen
rows in the women dataset represents a data point with the same structure. The variables used in
this case were weight and height. Both heads hold numeric attributes.
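The structure described above can be checked directly in R. This is a short sketch using only base R and the built-in women dataset; the variable names n_points and attribute_names are introduced here for illustration.

```r
# Load the built-in dataset and inspect its structure
data(women)

str(women)    # two numeric columns
dim(women)    # rows and columns
head(women)   # first few data points

n_points <- nrow(women)          # number of data points (rows)
attribute_names <- names(women)  # column heads, i.e. the attributes
print(n_points)
print(attribute_names)
```

Calling data() with no arguments lists every dataset available in the attached packages, which is how the candidate datasets can be surveyed.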
Before performing any statistical analysis on the numerical dataset, it is important to
understand its nature. This can be done as part of illustrating the dataset, using exploratory
techniques that help to uncover its properties. With the help of these approaches, we are able
to check the center of the data, probabilities, skewness, spread and the presence of outliers.
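One possible way to carry out these checks in base R is sketched below. The skewness() and describe() helpers are defined here for the example (they are not functions from a package), and outliers are flagged with the usual boxplot rule, which is an assumption about what counts as an outlier.

```r
data(women)

# Moment-based sample skewness, written out to avoid extra packages
skewness <- function(x) {
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

# Center, spread, skewness and number of boxplot-rule outliers for one variable
describe <- function(x) {
  c(mean = mean(x),
    median = median(x),
    sd = sd(x),
    iqr = IQR(x),
    skewness = skewness(x),
    n_outliers = length(boxplot.stats(x)$out))
}

# One column of results per attribute of the dataset
summary_table <- sapply(women, describe)
print(round(summary_table, 3))
```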
2. Overall features of the statistical model such as the role of the parameters and the
inferential goals of the analysis
The overall aim of the statistical model is to simulate the mean, the standard deviation and a
regression analysis using the posterior approach in MCMC. We compute the mean and sd of the
simulated values to check whether they are close to the values we input. The simulation is
repeated a few times to confirm that the estimates are indeed not exactly equal to the values of
the generating distribution. par() is a function that allows us to modify the settings used by
low-level plotting functions. When we repeat the trials for the random numbers with a small
sample size using apply(), the spread of the experimental trials shifts from run to run. With
small samples the estimated sd lies further from the input value, and as the sample size grows
the standard deviation between the four repeated trials reduces. For the regression model
y ~ N(a + b*x, sd), a is the intercept and b the slope, y is our response variable (height) and
x the explanatory variable. It is a linear regression model with normally distributed errors of
standard deviation sd. The explanatory variable is fixed and the coefficients a and b are given.
Figure 1 shows the degree of uncertainty when the linear regression simulation is repeated many
times.
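The repeated regression simulation can be sketched as follows. Following the paragraph above, y plays the role of height and the fixed explanatory variable x is taken to be weight from the women dataset; that pairing, the values of a, b and sigma, and the number of replications are assumptions chosen for illustration, not values reported in this study.

```r
data(women)
set.seed(42)

# Hypothetical "true" parameter values, for illustration only
a <- 25.7      # intercept
b <- 0.287     # slope
sigma <- 0.5   # standard deviation of the normal errors
x <- women$weight  # explanatory variable, held fixed across replications

# One replication: simulate y ~ N(a + b * x, sigma) and refit the regression
simulate_fit <- function() {
  y <- rnorm(length(x), mean = a + b * x, sd = sigma)
  coef(lm(y ~ x))
}

# Repeat many times; each row holds one fitted intercept and slope
fits <- t(replicate(500, simulate_fit()))
colnames(fits) <- c("intercept", "slope")

print(colMeans(fits))       # close to, but not exactly, a and b
print(apply(fits, 2, sd))   # Monte Carlo spread of the estimates
```

The spread of the fitted lines across replications is the uncertainty that a figure of repeated regression fits would display; calling abline() on each row of fits would overlay the simulated lines.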
Figure 1: Deterministic model or linear regression
3. Illustrate the main inferential findings (Bayesian point and interval estimation,
hypothesis testing)
Figure 2: Inferential findings and estimations

Summary statistics were computed using R commands, which provide the following outputs: minimum,
mean, maximum, median, intercept and slope estimates, standard deviation, standard error,
t-value, residuals, R-squared, degrees of freedom, p-value and F-statistic. Figure 2 displays
the values generated in R under JAGS. The mean of random normal values based on the women data
is computed for 100 draws. The same procedure is repeated four times with 100 draws each, and a
histogram is drawn for each repetition. Figure 3 shows four histograms for the random numbers of
weight and height of women. We find four different means. The values are close to 110 but not
equal to 110. We also simulate from the distribution with varying n, that is, 10, 25, 100 and
1000. From this we see that as n increases, the simulated mean gets closer to the expected value.
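The effect of n on the simulated mean can be quantified as a Monte Carlo error. The sketch below uses the value 110 mentioned above as the mean of the generating distribution; the standard deviation of 10 and the choice of 200 repetitions per sample size are assumptions, since the report does not state them.

```r
set.seed(123)

true_mean <- 110  # value the simulated means cluster around in the text
true_sd <- 10     # hypothetical; not stated in the report
sizes <- c(10, 25, 100, 1000)

# For each n, repeat the experiment 200 times and record the sample mean;
# the standard deviation of those means is the Monte Carlo error at that n
mc_error <- sapply(sizes, function(n) {
  means <- replicate(200, mean(rnorm(n, mean = true_mean, sd = true_sd)))
  sd(means)
})
names(mc_error) <- sizes

print(round(mc_error, 3))
```

Under these assumptions the error shrinks roughly in proportion to 1/sqrt(n), which gives a simple way to choose the number of draws needed for a desired accuracy.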
Figure 4: Histograms for normal random distribution of women dataset
4. Discuss one possible alternative statistical model and illustrate results of model
comparison through DIC and/or marginal likelihood
An alternative to Bayesian analysis using MCMC simulation alone is the family of selection
models. The Deviance Information Criterion (DIC) was developed as a Bayesian counterpart of the
AIC, the most popular model comparison approach in frequentist analysis. Its popularity stems in
part from this similarity to the AIC. It is computed as a function of the likelihood. For
selection models, the DIC has to be constructed for missing data. One version is based on the
full-data likelihood (DICf), the one used in our analysis, while the other is based on the
observed-data likelihood (DICO). The DIC is a Bayesian measure of model fit that is penalized
for model complexity. We can compare the two models using this approach. To begin with, DICf is
harder to modify in practice than the observed-data version. Insights from DICO can be
complemented by the other version to provide the components needed for comparison of the
selection models.
The two criteria, based on likelihoods that condition on the observed data and on the observed
and missing data respectively, can provide information about the comparative fit of various
groups of models. They require different plug-ins, algorithms and sample sizes. With larger
values, the DIC indicates poorer fit; in our comparison the Bayesian MCMC model had the best
fit. DICO is used to compare models with the same model of missingness, while the
model-of-missingness part of the conditional version (DICC) is used to compare models with the
same model of interest. DICO cannot be computed using WinBUGS alone, because in general the
required expectations cannot be evaluated directly from the output of a standard MCMC run. For
these, either “nested” MCMC or some other simulation method is required. However, for Bayesian
analysis using MCMC simulation in rjags, we found that no other methods were needed.
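To make the penalized-fit idea concrete, the sketch below computes the standard DIC, defined as the posterior mean deviance plus the effective number of parameters pD, from posterior draws. It is a toy illustration and not the missing-data DICf or DICO discussed above: the model (a normal mean for height with the standard deviation treated as known and a flat prior) is an assumption chosen so that posterior draws can be generated directly without a sampler. For models fitted in JAGS, rjags provides dic.samples() for the same purpose.

```r
data(women)
set.seed(7)

y <- women$height
s <- sd(y)       # assumption: treated as a known standard deviation
n <- length(y)

# With a flat prior on mu the posterior is N(mean(y), s^2 / n); draw from it
mu_draws <- rnorm(5000, mean = mean(y), sd = s / sqrt(n))

# Deviance: minus twice the log-likelihood at a given value of mu
deviance_at <- function(mu) -2 * sum(dnorm(y, mean = mu, sd = s, log = TRUE))

d_bar <- mean(sapply(mu_draws, deviance_at))  # posterior mean deviance
d_hat <- deviance_at(mean(mu_draws))          # deviance at the posterior mean
p_d <- d_bar - d_hat                          # effective number of parameters
dic <- d_bar + p_d                            # smaller values indicate better fit

print(c(Dbar = d_bar, pD = p_d, DIC = dic))
```

In this one-parameter model pD comes out close to 1, which matches the interpretation of pD as a complexity penalty.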
5. Illustration of the features of the MCMC convergence diagnostics and error control.
The majority of MCMC users address the convergence problem by applying diagnostic tools to the
output produced by running their samplers. The coda R package gives practitioners many popular
diagnostics for assessing the convergence of MCMC output. In our study, effective sample size
(effectiveSize()) was judged to be the most appropriate. After giving a brief overview of the
area, we provided an expository review of 15 convergence diagnostics, describing the theoretical
basis and practical implementation of each. Comparing their performance in two simple models, we
concluded that all of the methods can fail to detect the sorts of convergence failure that they
were designed to identify. We therefore recommended a combination of strategies aimed at
evaluating and accelerating MCMC sampler convergence, including applying diagnostic procedures
to a small number of parallel chains, monitoring autocorrelations and cross-correlations, and
modifying parametrizations or sampling algorithms appropriately. In general, we emphasize that
it is not possible to say with certainty that a finite sample from an MCMC algorithm is
representative of an underlying stationary distribution.
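The diagnostics named above can be sketched in base R on a small number of parallel chains. This is not the coda output used in the study: the chains come from a toy random-walk Metropolis sampler with a standard normal target, and the potential scale reduction factor and effective sample size are computed in their basic textbook forms, all of which are assumptions made for the example. With coda loaded, gelman.diag() and effectiveSize() give more refined versions of the same quantities.

```r
set.seed(2024)

# Toy random-walk Metropolis chain for a standard normal target (illustrative)
run_chain <- function(init, n_iter, step = 1) {
  out <- numeric(n_iter)
  cur <- init
  for (i in seq_len(n_iter)) {
    prop <- cur + rnorm(1, mean = 0, sd = step)
    # log acceptance ratio for the target exp(-y^2 / 2)
    if (log(runif(1)) < 0.5 * (cur^2 - prop^2)) cur <- prop
    out[i] <- cur
  }
  out
}

# Four parallel chains from dispersed starting values, burn-in removed
n_iter <- 5000
burn <- 500
starts <- c(-10, -3, 3, 10)
chains <- sapply(starts, function(s) run_chain(s, n_iter)[-(1:burn)])

# Potential scale reduction factor (basic Gelman-Rubin form)
n <- nrow(chains)
W <- mean(apply(chains, 2, var))   # within-chain variance
B <- n * var(colMeans(chains))     # between-chain variance
r_hat <- sqrt(((n - 1) / n * W + B / n) / W)

# Autocorrelations of the first chain, truncated at the first negative value,
# and the crude effective sample size they imply
rho <- acf(chains[, 1], lag.max = 50, plot = FALSE)$acf[-1]
first_neg <- which(rho < 0)[1]
if (!is.na(first_neg)) rho <- rho[seq_len(first_neg - 1)]
ess <- n / (1 + 2 * sum(rho))

print(c(r_hat = r_hat, ess = ess))
```

A value of r_hat near 1 and an effective sample size well below the number of retained draws are what one would expect here; matplot(chains, type = "l") draws the corresponding trace plots. None of these checks proves convergence, which is the caveat stressed above.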
Figure 5: Autocorrelation
Figure 6: Trace plots of simulation iterations
Conclusion
A Monte Carlo Bayesian analysis is not a purely theoretical affair, although its results might
help to answer theoretical questions. It is rather a pseudo-empirical experiment that uses
sample data generated in R. A key element of Markov chain Monte Carlo simulation is that many
sample datasets Z are generated for the estimation of some statistical model, in our case two
regression models Mj. Monte Carlo research can be characterized as experimental, where the
research questions have a known format. In our investigation the question was: what is the
relationship between the explanatory variable x (women's height) and the response or
performance variable y (women's weight)? The large volume of results is summarized and
displayed clearly, so as to give direct answers to the questions posed and to link them to the
stated expectations regarding the relationship between x and y. The free R software environment
for statistical computation and graphics, used here with JAGS, provides an abundance of
graphical options worth considering. Recommendations on the graphical display of data, with
illustrations, can be found above.
The DIC was developed as a Bayesian counterpart of the AIC, the most popular model comparison
approach in frequentist analysis. Its popularity stems in part from this similarity to the AIC.
It is computed as a function of the likelihood. For selection models, the Deviance Information
Criterion (DIC) has to be constructed for missing data. Our MCMC analysis addressed the
convergence problem by applying diagnostic tools to the output produced from the sampled data.
We recommended a combination of strategies aimed at evaluating and accelerating MCMC sampler
convergence, including applying diagnostic procedures to a small number of parallel chains,
monitoring autocorrelations and cross-correlations, and modifying parametrizations or sampling
algorithms appropriately.
