
Markov chain Monte Carlo Simulation in R

Author’s name

Date:

Summary

Markov chain Monte Carlo (MCMC) simulations have been used increasingly over the last two decades. Addressed to a broad audience, this report provides guidelines for reporting Monte Carlo studies in that field. Such experiments, more commonly referred to as Monte Carlo or simulation studies, are used to study the behavior of statistical methods and measures under controlled conditions. Although recent computing and methodological advances have made the simulation process more efficient, such experiments remain limited by their finite nature and are therefore subject to uncertainty: when a simulation is run more than once, different results are obtained. However, the published literature has placed little emphasis on reporting this uncertainty, referred to here as Monte Carlo error, or on justifying the number of replications used. The dataset for the simulation is taken from the datasets built into R. This report presents a series of simple and practical criteria for estimating Monte Carlo error and for determining the number of replications required to achieve a desired level of accuracy. The issues and methods are demonstrated with two simple examples: one evaluates algorithm output using diagnostic tools to obtain the maximum likelihood estimator, and the other obtains 95% confidence intervals for the parameters of a logistic regression in rjags. The results suggest that, in many settings, Monte Carlo error may be more substantial than traditionally thought, which matters for a sound Bayesian analysis using the MCMC approach.

Keywords: Markov Chain Monte Carlo; Simulation; rjags; Diagnostic tool; Markov Chains

Introduction

Markov chain Monte Carlo (MCMC) methods are now routinely applied to fit complex models in many disciplines. Their popularity is mainly due to their wide use in Bayesian statistics, though they also play a crucial role in frequentist inference. Two critical questions that MCMC researchers have addressed are where to start and when to stop the simulation. The main idea of MCMC is as follows: if simulating from a target density π is so difficult that the ordinary Monte Carlo method, based on independent and identically distributed (iid) samples, cannot be used to make inferences about π, it may still be possible to construct a Markov chain {y_n}, n ≥ 0, with stationary density π and to use its output to form Monte Carlo estimators. This report gives an introduction to simulation using MCMC and to the construction of Markov chains using rjags and the Gibbs sampler. General-purpose implementations of these algorithms are available in R, and several R packages implement MCMC algorithms specific to a limited number of statistical models. This project does not dwell on model development but rather on the analysis of the Markov chains obtained after running the algorithms, in order to assess their convergence.
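
The main idea described above can be illustrated with a minimal sketch of a random-walk Metropolis sampler in base R. It is not the rjags or Gibbs sampler used in this report: the standard normal target, the proposal standard deviation, the starting value and the burn-in length are all assumptions chosen only for illustration.

```r
# Random-walk Metropolis sampler for a target density known up to a constant.
# Illustrative target: standard normal (an assumption, not the report's model).
set.seed(1)
log_target <- function(y) -0.5 * y^2   # log pi(y), up to an additive constant

n_iter <- 20000
chain <- numeric(n_iter)
chain[1] <- 5                          # deliberately poor starting value

for (i in 2:n_iter) {
  proposal <- chain[i - 1] + rnorm(1, mean = 0, sd = 1)
  log_ratio <- log_target(proposal) - log_target(chain[i - 1])
  # Accept with probability min(1, pi(proposal) / pi(current))
  chain[i] <- if (log(runif(1)) < log_ratio) proposal else chain[i - 1]
}

burn_in <- 1000                        # assumed burn-in length
draws <- chain[-seq_len(burn_in)]

mc_mean <- mean(draws)  # Monte Carlo estimator of the mean under pi (true value 0)
mc_sd <- sd(draws)      # Monte Carlo estimator of the sd under pi (true value 1)
```

The chain is started far from the bulk of the target on purpose, so the two questions raised above (where to start and when to stop) are visible in even this small example.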

Although MCMC algorithms have been most widely used in Bayesian analysis, they are also employed by frequentists in missing-data and dependent-data settings where the likelihood involves complex high-dimensional integrals. While MCMC algorithms greatly expand the class of candidate models for a given dataset, they also suffer from well-known and potentially serious drawbacks. Researchers face difficulties in deciding when it is appropriate to use them and in drawing conclusions about convergence. It is not easy to know whether a particular sample truly represents the underlying stationary distribution of the Markov chain.

In this research we approached the problem by applying diagnostic tools to the output produced by MCMC algorithms. The majority of applied work takes this approach to the convergence problem. Employing a large number of parallel, independent chains made it possible to obtain simple moment, quantile and density estimates. Since the stationary distribution is unknown in practice, the same basic difficulty will plague any convergence diagnostic; indeed, this has led some researchers to conclude that all such diagnostics are fundamentally unsound.
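
The use of parallel, independent chains can be sketched as follows. The chain here is a simple autoregressive Markov chain whose stationary distribution is known to be standard normal, used as a stand-in for real sampler output; the function name run_chain, the starting values, the chain length and the burn-in are hypothetical choices.

```r
# Several independent chains started from dispersed values.
# The AR(1) update below has a N(0, 1) stationary distribution (stand-in chain).
set.seed(2024)
run_chain <- function(start, n_iter = 5000, phi = 0.8) {
  y <- numeric(n_iter)
  y[1] <- start
  for (t in 2:n_iter) {
    y[t] <- phi * y[t - 1] + sqrt(1 - phi^2) * rnorm(1)
  }
  y
}

starts <- c(-10, -3, 3, 10)            # dispersed starting values (assumed)
chains <- sapply(starts, run_chain)    # one column per chain
kept <- chains[-(1:500), ]             # discard an assumed burn-in of 500

chain_means <- colMeans(kept)          # per-chain estimates should agree
pooled <- as.vector(kept)
moment_est <- c(mean = mean(pooled), sd = sd(pooled))
quantile_est <- quantile(pooled, c(0.025, 0.5, 0.975))
density_est <- density(pooled)         # kernel density estimate
```

Agreement between the per-chain means is only informal evidence of convergence. Because the stationary distribution is known in this toy case, the estimates can be checked directly, which is not possible in practice.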

In this study, we introduced the MCMC convergence diagnostics. For each, we briefly reviewed its theoretical basis and discussed how practical it is to implement. We also classified the methods according to whether they measure the convergence of univariate quantities, multivariate quantities or the full joint distribution, and according to whether their results are quantitative or qualitative in nature. Finally, we assessed the extent to which each addresses the competing issues of bias and variance in the resulting estimated features of the stationary distribution for the women dataset.

A dataset is a collection of data with definite features. A data point, by contrast, is a single instance in the dataset, here the women dataset. Data points are sometimes referred to as objects or records. Any column (header) within a dataset, whether it contains text, numeric or categorical data, is referred to as an attribute. An attribute can hold, for example, Boolean (logical) or numeric values; both attributes of the women dataset are numeric.

Body

1. Illustration of the “women” dataset


In RStudio, the built-in public datasets can be found and viewed. We surveyed them with the data() function, which lists the datasets available in the software. The factors considered in choosing the women dataset were quality, quantity, availability, gaps and whether the data could be used in an MCMC simulation without changing the objective of this research. For this study we used a sample dataset of 15 data points with two attributes: height and weight. It has a well-defined structure of 15 rows and two columns with their headers. Each of the fifteen rows in the women dataset represents a data point with the same structure. The variables in this case are weight and height, and both columns hold numeric values.
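
The structure described above can be checked directly in R. This is a short sketch using only base R and the built-in women dataset; the object names n_points, n_attributes and column_types are introduced here for illustration.

```r
# Load the built-in dataset and inspect its structure.
data(women)                    # ships with R in the 'datasets' package

str(women)                     # compact overview: 15 obs. of 2 variables
n_points <- nrow(women)        # number of data points (rows)
n_attributes <- ncol(women)    # number of attributes (columns)
column_types <- sapply(women, class)

head(women)                    # first few records
```

Calling data() with no arguments lists all datasets available in the attached packages, which is how the survey mentioned above can be reproduced.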

Before performing any statistical analysis on the numerical dataset, it is important to understand its nature. This can be done as part of illustrating the dataset. To that end, we use exploratory techniques that help to uncover the properties of the data. With these approaches we are able to check the center of the data, probabilities, skewness, spread and the presence of outliers.
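
A minimal sketch of such an exploration in base R is given below. The skewness helper is hand-written (base R has no built-in skewness function) and its name is hypothetical; outliers are flagged with the usual boxplot rule, which is one convention among several.

```r
data(women)

# Center: minimum, quartiles, median, mean and maximum for each attribute
summary(women)

# Spread: standard deviation of each attribute
spread <- sapply(women, sd)

# Skewness: moment-based coefficient (hypothetical helper, not a base R function)
skewness <- function(x) {
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}
skew <- sapply(women, skewness)

# Outliers: points beyond 1.5 * IQR from the hinges (boxplot rule)
outliers <- lapply(women, function(x) boxplot.stats(x)$out)
```

A skewness close to zero indicates a roughly symmetric attribute, and an empty outliers entry means the boxplot rule flags no point for that attribute.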

2. Overall features of the statistical model such as the role of the parameters and the

inferential goals of the analysis

The overall goal of the statistical model is to simulate the mean, the standard deviation and a regression analysis using the posterior approach in MCMC. We compute the mean and sd of the simulated values to check whether they are close to the values we input. The simulation is repeated a few times to confirm that the estimates are close to, but not exactly equal to, the values of the generating distribution. The par() function sets graphical parameters that control how the plots are drawn. When we repeat the trials for the random numbers with a small sample size using the apply function, the spread of the experimental trials shifts from one repetition to the next. When we take large sample sizes and repeat the trials, the standard deviation between the four repetitions reduces. For the regression model Y ~ N(a + b*x, sd), a is the intercept and b is the slope, Y is our response variable (height) and x is the explanatory variable. It is a linear regression model with normally distributed errors of standard deviation sd. Our explanatory variable is fixed, and the coefficients a and b are given. Figure 1 shows the degree of uncertainty in the linear regression when the simulation is repeated many times.

FIGURE 1: Deterministic model or linear regression
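
The regression simulation can be sketched in base R as follows. The values of a, b and s, the number of repetitions, and the choice of the height column as the fixed explanatory variable are assumptions made only for this illustration, and simulate_fit is a hypothetical helper.

```r
# Repeatedly simulate Y ~ N(a + b * x, s) for fixed x and refit the line.
set.seed(42)
x <- women$height                  # fixed explanatory variable (assumed choice)
a <- -87.5                         # hypothetical intercept
b <- 3.45                          # hypothetical slope
s <- 1.5                           # hypothetical error standard deviation

simulate_fit <- function() {
  y <- rnorm(length(x), mean = a + b * x, sd = s)
  coef(lm(y ~ x))                  # fitted intercept and slope for this draw
}

fits <- t(replicate(500, simulate_fit()))   # 500 x 2 matrix of estimates
coef_means <- colMeans(fits)                # close to (a, b) on average
coef_sds <- apply(fits, 2, sd)              # simulation-to-simulation spread
```

The columns of fits vary from run to run even though a, b and x never change, which is the uncertainty that repeated simulation makes visible.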

3. Illustrate the main inferential findings (Bayesian point and interval estimation,
hypothesis testing)

Figure 2: Inferential findings and estimates


Summary statistics were computed with R commands, which provide the following outputs: minimum, mean, maximum, median, intercept, standard deviation, standard error, t-value, residuals, R-squared, degrees of freedom, p-value and F-statistic. Figure 2 displays the values generated in R under JAGS. The mean of random normal values based on the women data is computed from 100 draws. The same procedure is then repeated four times, each with 100 draws, and a histogram is drawn for each repetition. Figure 3 shows four histograms for the random numbers of weight and height of women. We find four different means for the same distribution. The values are close to 110 but not equal to 110. We also run simulations of the distribution with varying n, that is, 10, 25, 100 and 1000. From this we see that, as n increases, the simulated mean gets closer to the expected value.

Figure 4: Histograms for the normal random distribution of the women dataset
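
The effect of n on the simulated mean, and the Monte Carlo error that comes with it, can be sketched as below. The mean of 110 and the sample sizes come from the text; the standard deviation sigma, the number of replications R and the tolerance tol are assumed values.

```r
# Monte Carlo error of the sample mean for several sample sizes.
set.seed(123)
mu <- 110                          # expected value quoted in the text
sigma <- 10                        # assumed standard deviation
n_values <- c(10, 25, 100, 1000)
R <- 2000                          # assumed number of replications per n

mc_error <- sapply(n_values, function(n) {
  means <- replicate(R, mean(rnorm(n, mean = mu, sd = sigma)))
  sd(means)                        # spread of the mean across replications
})

comparison <- data.frame(n = n_values,
                         mc_error = mc_error,
                         theory = sigma / sqrt(n_values))

# Smallest n for which the standard error sigma / sqrt(n) is at most tol
tol <- 0.5
n_required <- ceiling((sigma / tol)^2)
```

The mc_error column shrinks roughly like sigma / sqrt(n), so halving the error requires about four times as many draws. The same reasoning gives a simple rule for choosing the number of draws needed to reach a desired accuracy.
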
4. Discuss one possible alternative statistical model and illustrate results of model
comparison through DIC and/or marginal likelihood

One alternative to the Bayesian analysis using MCMC simulation is the class of selection models. The Deviance Information Criterion (DIC) was developed as a Bayesian counterpart of the AIC, the most popular model comparison approach in frequentist analysis. Its popularity owes much to its similarity to the AIC, and it is computed as a function of the likelihood. Here we consider the construction of the DIC for missing data. There are two versions: one is based on the full-data likelihood (DICf), which is the one used in our analysis, and the other on the observed-data likelihood (DICO). The DIC is a Bayesian measure of model fit that is penalized for model complexity. We can compare the two models using this approach. For a start, unlike the observed-data version, DICf is hard to modify in practice. Insights from the DICO can be complemented by those from the other version to provide the components needed for comparing selection models.
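
As a worked illustration of how the basic DIC is assembled from posterior draws, the sketch below uses a deliberately simple model: a normal mean for the weight column with the standard deviation treated as known and a flat prior on the mean. This is an assumption made so that the posterior can be sampled directly in base R. It is not the missing-data DICf or DICO discussed above, nor the model fitted in rjags, and deviance_at is a hypothetical helper.

```r
# DIC = D(theta_bar) + 2 * p_D, with p_D = mean deviance - D(theta_bar).
set.seed(7)
y <- women$weight
n <- length(y)
s <- sd(y)                         # treated as known (simplifying assumption)

# Posterior of mu under a flat prior and known s: N(mean(y), s^2 / n)
mu_draws <- rnorm(20000, mean = mean(y), sd = s / sqrt(n))

# Deviance: minus twice the log-likelihood at a given mu
deviance_at <- function(mu) -2 * sum(dnorm(y, mean = mu, sd = s, log = TRUE))

D_draws <- vapply(mu_draws, deviance_at, numeric(1))
D_bar <- mean(D_draws)                  # posterior mean deviance
D_hat <- deviance_at(mean(mu_draws))    # deviance at the posterior mean
p_D <- D_bar - D_hat                    # effective number of parameters
DIC <- D_hat + 2 * p_D                  # equivalently D_bar + p_D
```

With one free parameter, p_D should come out close to 1, which is a useful sanity check on the calculation. Lower DIC values indicate a better trade-off between fit and complexity.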

The two criteria, DICf and DICO, which are based on likelihoods conditional on the observed and missing data, can provide information about the comparative fit of various groups of models. They require different plug-ins, algorithms and sample sizes. Larger DIC values indicate poorer fit; in our comparison the Bayesian MCMC simulation gave the best model fit. The DICO is used to compare models with the same model of missingness, while the corresponding part of DICC is used to compare those with the same model of interest. The DICO cannot be computed using WinBUGS alone because, in general, the required expectations cannot be evaluated directly from the output of a standard MCMC run. For these, either “nested” MCMC or some other simulation method is required. However, for the Bayesian analysis using MCMC simulation in rjags, we found that no other methods were needed.

5. Illustration of the features of the MCMC convergence diagnostics and error control.

The majority of MCMC users address the convergence problem by applying diagnostic tools to the output produced by running their samplers. The coda R package gives the practitioner many popular diagnostics for assessing the convergence of MCMC output. In our study, the effective sample size (effectiveSize()) was judged to be the most appropriate. After a brief overview of the area, we provide an expository review of 15 convergence diagnostics, describing the theoretical basis and practical implementation of each. We then compare their performance in two simple models and conclude that all of the methods can fail to detect the sorts of convergence failure they were designed to identify. We therefore recommend a combination of strategies aimed at evaluating and accelerating MCMC sampler convergence, including applying diagnostic procedures to a small number of parallel chains, monitoring autocorrelations and cross-correlations, and modifying parametrizations or sampling algorithms appropriately. In general, we emphasize that it is not possible to say with certainty that a finite sample from an MCMC algorithm is representative of the underlying stationary distribution.
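
The link between autocorrelation and effective sample size can be sketched in base R. The helper ess_ar below is a hypothetical, simplified estimator based on an autoregressive fit, similar in spirit to what effectiveSize() in coda reports but not a replacement for it; the simulated chains are stand-ins for real sampler output.

```r
# Effective sample size from an AR fit: N * var(x) / (spectral density at 0).
set.seed(99)
n_iter <- 5000
correlated <- as.numeric(arima.sim(model = list(ar = 0.9), n = n_iter))
independent <- rnorm(n_iter)

ess_ar <- function(x) {
  fit <- ar(x, aic = TRUE)                       # AR order chosen by AIC
  spec0 <- fit$var.pred / (1 - sum(fit$ar))^2    # AR estimate of spectrum at 0
  length(x) * var(x) / spec0
}

ess_correlated <- ess_ar(correlated)     # far fewer than n_iter "useful" draws
ess_independent <- ess_ar(independent)   # roughly n_iter

# Lag-1 autocorrelation of the correlated chain
lag1 <- acf(correlated, lag.max = 1, plot = FALSE)$acf[2]
```

Calling plot(correlated, type = "l") and acf(correlated) produces a trace plot and an autocorrelation plot for the simulated chain. A strongly autocorrelated chain carries much less information than its length suggests, which is why the effective sample size is reported alongside the number of iterations.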

FIGURE 5: Autocorrelation

FIGURE 6: Trace plots of the simulation iterations

Conclusion

A Monte Carlo Bayesian analysis is not a purely theoretical affair, although its results might help to answer theoretical questions. It is rather a pseudo-empirical experiment that uses sample data generated in R. A key element of Markov chain Monte Carlo simulations is that many sets of sample data Z are generated for the estimation of some statistical model, in our case two regression models Mj. Monte Carlo research can be characterized as experimental, with research questions that have a known format. In our investigation the question is: what is the relationship between the explanatory variable x (women's height) and the response or performance variable y (women's weight)? The large number of results should be clearly summarized and displayed, so as to facilitate direct answers to the questions posed and to link them to the stated expectations regarding the relationship between x and y. The free R software environment for statistical computing and graphics, used here with JAGS, provides an abundance of graphical options worth considering. Recommendations on the graphical display of data, with illustrations, can be found in the sections above.

The DIC was developed as a Bayesian counterpart of the AIC, the most popular model comparison approach in frequentist analysis. Its popularity owes much to its similarity to the AIC, and it is computed as a function of the likelihood. We considered the construction of the Deviance Information Criterion (DIC) for missing data. MCMC users address the convergence problem by applying diagnostic tools to the output produced by running their samplers. We recommended combining strategies aimed at evaluating and accelerating MCMC sampler convergence, including applying diagnostic procedures to a small number of parallel chains, monitoring autocorrelations and cross-correlations, and modifying parametrizations or sampling algorithms appropriately.
