Simulation is one of the most common applications of programming. In statistics, simulation is almost a necessity when the results depend on a model that may not be entirely correct, or on an asymptotic conclusion whose small-sample properties are unknown. In the September 2006 issue of JSAS, 16 of the 25 theory and methods articles used simulation.

In R's index of functions and datasets, there are many functions that generate random deviates. Most of their names start with r, so they are easy to find. Here are some of them (a few illustrative calls follow the list):

- rbeta (for the beta random variable)
- rbinom (for the binomial random variable)
- rexp (for the exponential random variable)
- rf (for the F random variable)
- rgamma (for the gamma random variable)
- rgeom (for the geometric random variable)
- rhyper (for the hypergeometric random variable)
- rlnorm (for the lognormal random variable)
- rlogis (for the logistic random variable)
- rmvbin (for the multivariate binary random variable; provided by the bindata package, not base R)
- rmvnorm (for the multivariate normal random variable; not in base R, it is provided by the mvtnorm package, and MASS offers the similar mvrnorm)
- rnbinom (for the negative binomial random variable)
- rnorm (for the normal random variable)
- rpois (for the Poisson random variable)
- runif (for the uniform random variable)
- rweibull (for the Weibull random variable)
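Each generator takes the number of deviates as its first argument, followed by the parameters of the distribution. A few illustrative calls (the parameter values are arbitrary):

```
rbinom(5, size=10, prob=0.3)    # five Binomial(10, 0.3) deviates
rpois(5, lambda=2)              # five Poisson(2) deviates
rgamma(5, shape=4, scale=2.5)   # five gamma deviates
```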

One very important step in a simulation is to set the initial seed for the random number generator. In R this is done with set.seed(x), where x is an integer. (Some packages provide their own seeding function, such as rngseed(x) in the norm package, which works the same way.)
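Setting the same seed before two runs reproduces exactly the same deviates, which is what makes a simulation repeatable; a quick check:

```
set.seed(12345)
u1 <- runif(3)
set.seed(12345)     # reset to the same seed
u2 <- runif(3)
identical(u1, u2)   # TRUE: the stream of deviates is reproducible
```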

Here are the two most important random deviate generators.

- The uniform distribution (random numbers): runif(n, min=, max=), where the defaults are min=0 and max=1; i.e., runif(n) generates n Uniform(0,1) random numbers.
- The normal distribution (random numbers): rnorm(n, mean=, sd=), with defaults mean=0 and sd=1.
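For example, with and without the defaults:

```
runif(5)                    # five Uniform(0,1) deviates
runif(5, min=-1, max=1)     # five Uniform(-1,1) deviates
rnorm(5, mean=10, sd=2)     # five N(10, 4) deviates
```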

Let us start with four different distributions.

postscript("Four_distr.ps") par(mfrow=c(2,2)) set.seed(12345) x<-runif(1000) hist(x) y<-rnorm(1000) hist(y) z<-rgamma(1000,shape=4,scale=2.5) # f(x) = C x^(4-1)exp(-x/2.5) hist(z) w<-rbeta(1000,5,2) # f(x) = C x^(5-1)*(1-x)^(2-1) hist(w) dev.off() <\PRE>Four_distr.ps Here are first order autoregressive time series with different correlation coefficient. (The model is given in the comments of the program.)

Here are first-order autoregressive time series with different correlation coefficients. (The models are given in the comments of the program.)

```
# Different shapes of an autoregressive process
# Saved as autoreg3.s
# Various forms related to (1-0.8B)(1+0.3B)Y(t)=Z(t) in #1-#3:
# 1  Y(t)=0.5Y(t-1)+Z(t)
# 2  Y(t)=0.5Y(t-1)+0.24Y(t-2)+Z(t)
# 3  Y(t)=0.9Y(t-1)+Z(t)
# 4  Y(t)=-0.8Y(t-1)+Z(t)
postscript("timeseries.ps")
par(mfrow=c(2,2))        # four figures in the output
t<-c(1:200)              # generate t from 1 to 200
set.seed(12345)          # set the seed of the random number generator
try.ar1<-function(n,a1,a2) {
  y<-c(1:n); y[1]<-0; y[2]<-0
  for(j in c(3:n)) y[j]<-a1*y[j-1]+a2*y[j-2]+rnorm(1)
  y
}
n<-200
a1<-0.5;  a2<-0.0
y1<-try.ar1(n,a1,a2)
ts.plot(y1,main="AR(1), 0.5")    # plot the time series y1
a1<-0.5;  a2<-0.24
y2<-try.ar1(n,a1,a2)
ts.plot(y2,main="AR(2), 0.5, 0.24")
a1<-0.9;  a2<-0.0
y3<-try.ar1(n,a1,a2)
ts.plot(y3,main="AR(1), 0.9")
a1<- -0.8; a2<-0.0
y4<-try.ar1(n,a1,a2)
ts.plot(y4,main="AR(1), -0.8")
dev.off()
```
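For comparison, the stats package that ships with R has a built-in ARMA simulator; a sketch reproducing model #2 with arima.sim:

```
# Simulate Y(t)=0.5Y(t-1)+0.24Y(t-2)+Z(t) with the built-in generator.
set.seed(12345)
y <- arima.sim(model=list(ar=c(0.5, 0.24)), n=200)
ts.plot(y, main="AR(2) via arima.sim")
```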

postscript("robustness.ps") par(mfrow=c(2,2)) set.seed(123456) n=20 x<-runif(n)*5 y1<-c(1:n) e1<-rnorm(n) # normal residuals for(i in c(1:n)) y1[i]=1+0.5*x[i]+e1[i] result1<-lm(y1~x) summary(result1) plot(x,y1) abline(lsfit(x,y1),lty=2) c=sqrt(12.0) e2<-(runif(n)-0.5)*c # uniform residuals y2<-c(1:n) for(i in c(1:n)) y2[i]=1+0.5*x[i]+e2[i] result2<-lm(y2~x) summary(result2) plot(x,y2) abline(lsfit(x,y2),lty=2) e3<-rexp(n)-1 # exponential residuals y3<-c(1:n) for(i in c(1:n)) y3[i]=1+0.5*x[i]+e3[i] result3<-lm(y3~x) summary(result3) plot(x,y3) abline(lsfit(x,y3),lty=2) dev.off()The outputs from the summary all show significance of the slope and intercept. The p-values for the slopes are 0.015, 0.0002, and 0.015 for normal, uniform and exponential errors respectively. From the range of the residuals, we can see that y3 has much longer tail than the other two. This also shows the merit of mixture of graphics and data analysis. The when the seed is changed to set.seed(1234567), we have the following graphs. The p-values are 0.00004, 0.0001 and 0.0005 for normal, uniform and exponential errors respectively. It is surprising it have such a large variation.

A typical output from summary() is

>summary(result1) Residuals: Min 1Q Median 3Q Max -1.18925 -0.55789 0.07722 0.54001 1.80446 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 1.7557 0.3935 4.461 0.000302 *** x 0.3194 0.1184 2.699 0.014688 * --- Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1 Residual standard error: 0.7897 on 18 degrees of freedom Multiple R-Squared: 0.2881, Adjusted R-squared: 0.2485 F-statistic: 7.284 on 1 and 18 DF, p-value: 0.01469## Repeated simulations

The previous example tells us that we need to run more than one experiment to check the accuracy of an estimate. In other words, we need to examine the distribution of the estimates. To do this, we need to know how to extract pieces of the lm() output and summarize them. The manual usually provides this information; in the example below, see the statement slope[j]<-fit$coef[2].
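A quick way to discover which components a fitted lm object offers is to list its names. A small sketch, reusing result1 from the robustness example above:

```
names(result1)     # components such as "coefficients", "residuals", ...
result1$coef       # intercept and slope; $coef abbreviates $coefficients
result1$coef[2]    # the slope alone, as used in robust() below
```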

robust<-function(ns,n,error,a,c) { slope<-c(1:ns) x<-runif(n)*10 y1<-c(1:n) for (j in c(1:ns)) { e1<-(error(n)-a)*c # normalizing residuals for(i in c(1:n)) y1[i]=1+0.5*x[i]+e1[i] fit<-lm(y1~x) slope[j]=fit$coef[2] } mean<-mean(slope) std<-var(slope)^0.5 list(mean,std) } set.seed(12345) test1<-robust(1000,10,rnorm,a=0,c=1) # Note the error # function in the function statement test2<-robust(1000,10,runif,a=0.5,c=3.464) # 3.464 = sqrt(12) test3<-robust(1000,10,rexp,a=1,c=1)The results showsmean std normal 0.515 0.125 uniform 0.490 0.158 exp 0.502 0.110 <\PRE>Apparently, the regression estimate is not sensitive to the error distribution. It would be better if we can check the relation between the estimated slope and its standard error. Hopefully, the worse the estimate, the larger the standard error. In other words, the confidence interval coverage is robust. Unfortunately, there is no way to recover the standrad error from the R output (see summary(result1).