Lesson 4: Simulation Using R

Simulation is one of the most common applications of programming. In statistics, simulation is almost indispensable when the results depend on a model that may not be entirely correct, or on an asymptotic conclusion whose small-sample properties are unknown. In the September 2006 issue of JASA there were 25 theory and methods articles; 16 of them used simulation.

In the R Functions & datasets index there are many functions that generate random deviates. Most of their names start with r, so they are easy to find. Some of them are listed below.

One very important step in a simulation is to set the initial seed of the random number generator. In R this is done with set.seed(x), where x is an integer (the rngseed(x) function mentioned in some references plays the same role); rerunning a program with the same seed reproduces exactly the same random deviates.
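
As a quick illustration of this reproducibility (a minimal sketch, not part of the original notes):

set.seed(12345)
runif(3)          # three uniform deviates
set.seed(12345)
runif(3)          # the same three deviates again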

Here are the two most important random deviate generators.

  1. The uniform distribution (random numbers): runif(n, min=, max=); the defaults are min=0 and max=1, so runif(n) generates n random numbers on (0, 1).
  2. The normal distribution (random numbers): rnorm(n,mean=,sd=) with default mean=0 and sd=1.
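
A few of the other generators that follow the same naming convention (all of them are in base R; the parameter values below are only illustrative):

rbinom(10, size=20, prob=0.3)    # binomial counts
rpois(10, lambda=4)              # Poisson counts
rexp(10, rate=1)                 # exponential
rgamma(10, shape=4, scale=2.5)   # gamma
rbeta(10, 5, 2)                  # beta
rt(10, df=5)                     # Student t
rchisq(10, df=3)                 # chi-squared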

Simple simulations

Let us start with four different distributions.

postscript("Four_distr.ps")
par(mfrow=c(2,2))
set.seed(12345)
x<-runif(1000)
hist(x)
y<-rnorm(1000)
hist(y)
z<-rgamma(1000,shape=4,scale=2.5) # f(x) = C x^(4-1)exp(-x/2.5)
hist(z)
w<-rbeta(1000,5,2)  # f(x) = C x^(5-1)*(1-x)^(2-1)
hist(w)
dev.off()
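
A quick sanity check that is not in the original program: the sample moments can be compared with the theoretical ones, e.g. the gamma(shape=4, scale=2.5) deviates should have mean about 4*2.5 = 10 and variance about 4*2.5^2 = 25, and the beta(5, 2) deviates should have mean about 5/(5+2) = 0.714.

mean(z); var(z)   # roughly 10 and 25
mean(w)           # roughly 0.714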

[Figure: Four_distr.ps]

Here are autoregressive time series with different coefficients. (The models are given in the comments of the program.)

# Different shapes of an autoregressive process
#                                             Saved as autoreg3.s

# Models simulated below (model 2 is (1-0.8B)(1+0.3B)Y(t)=Z(t)):
# 1 Y(t)=0.5Y(t-1)+Z(t)
# 2 Y(t)=0.5Y(t-1)+0.24Y(t-2)+Z(t)
# 3 Y(t)=0.9Y(t-1)+Z(t)
# 4 Y(t)=-0.8Y(t-1)+Z(t)

postscript("timeseries.ps")
par(mfrow=c(2,2))          # Four figures in the output

t<-c(1:200);               # Generate t from 1 to 200
set.seed(12345);           # Set seed from random number generatots

par(mfrow=c(2,2))

try.ar1<-function(n,a1,a2)
       { 
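       # Simulate Y(t)=a1*Y(t-1)+a2*Y(t-2)+Z(t) with Z(t) iid N(0,1); the series starts at y[1]=y[2]=0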
       y<-c(1:n); y[1]<-0; y[2]<-0
       for(j in c(3:n)) y[j]<-a1*y[j-1]+a2*y[j-2]+rnorm(1) 
       y}
n<-200
a1<- 0.5 
a2<- 0.0
y1<-try.ar1(n,a1,a2)
ts.plot(y1,main="AR(1), 0.5")           # Plot the time series y1

a1<- 0.5
a2<- 0.24
y2<-try.ar1(n,a1,a2)
ts.plot(y2,main="AR(2), 0.5, 0.24")

a1<- 0.9
a2<- 0.0
y3<-try.ar1(n,a1,a2)
ts.plot(y3,main="AR(1), 0.9")                

a1<- -0.8
a2<- 0.0
y4<-try.ar1(n,a1,a2)
ts.plot(y4,main="AR(1), -0.8")
dev.off()   

[Figure: timeseries.ps]
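
As a side note (not part of the original program), base R can generate the same kind of series without the explicit loop via arima.sim(); for example, model 2 above could be simulated as follows (y2b is just an illustrative name):

set.seed(12345)
y2b<-arima.sim(model=list(ar=c(0.5,0.24)), n=200)   # AR(2) with coefficients 0.5 and 0.24
ts.plot(y2b, main="AR(2) via arima.sim")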

Suppose we are interested in how robust the least squares estimate b=(X'X)^(-1)X'Y is when the error terms are not normal. The two alternative error distributions are the uniform (a short-tailed distribution) and the exponential (a long-tailed distribution). Both are adjusted so that the errors have mean 0 and variance 1: a uniform(0,1) deviate has mean 1/2 and variance 1/12, so it is centered at 0.5 and multiplied by sqrt(12); an exponential(1) deviate has mean 1 and variance 1, so it only needs to be shifted by 1.

postscript("robustness.ps")
par(mfrow=c(2,2))
set.seed(123456)
n=20
x<-runif(n)*5
y1<-c(1:n)
e1<-rnorm(n)  # normal residuals
for(i in c(1:n)) y1[i]=1+0.5*x[i]+e1[i]
result1<-lm(y1~x)
summary(result1)
plot(x,y1)
abline(lsfit(x,y1),lty=2)

c=sqrt(12.0)
e2<-(runif(n)-0.5)*c   # uniform residuals
y2<-c(1:n) 
for(i in c(1:n)) y2[i]=1+0.5*x[i]+e2[i]
result2<-lm(y2~x)
summary(result2)
plot(x,y2)
abline(lsfit(x,y2),lty=2)

e3<-rexp(n)-1   # exponential residuals
y3<-c(1:n) 
for(i in c(1:n)) y3[i]=1+0.5*x[i]+e3[i]
result3<-lm(y3~x)
summary(result3)
plot(x,y3)
abline(lsfit(x,y3),lty=2)
dev.off()
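
The scaling constants can be verified numerically; this quick check is not part of the original program:

u<-(runif(100000)-0.5)*sqrt(12)   # centered and scaled uniform errors
mean(u); var(u)                   # approximately 0 and 1
e<-rexp(100000)-1                 # centered exponential errors
mean(e); var(e)                   # approximately 0 and 1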

[Figure: robust1.ps]

The outputs from summary() all show significance of the slope and intercept. The p-values for the slope are 0.015, 0.0002, and 0.015 for normal, uniform, and exponential errors respectively. From the range of the residuals we can see that y3 has a much longer tail than the other two; this also shows the merit of mixing graphics with data analysis. When the seed is changed to set.seed(1234567), we get the following graphs, and the p-values become 0.00004, 0.0001, and 0.0005 for normal, uniform, and exponential errors respectively. It is surprising that there is such a large variation.

[Figure: robust2.ps]

A typical output from summary() is



> summary(result1)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.18925 -0.55789  0.07722  0.54001  1.80446 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   1.7557     0.3935   4.461 0.000302 ***
x             0.3194     0.1184   2.699 0.014688 *  
---
Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1 

Residual standard error: 0.7897 on 18 degrees of freedom
Multiple R-Squared: 0.2881,	Adjusted R-squared: 0.2485 
F-statistic: 7.284 on 1 and 18 DF,  p-value: 0.01469 
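
The numbers in this table can also be extracted programmatically rather than read off the screen; a small sketch (not in the original notes) using the result1 object from the robustness program above:

summary(result1)$coefficients                      # the coefficient matrix shown above
summary(result1)$coefficients["x","Pr(>|t|)"]      # p-value of the slope
summary(result1)$coefficients["x","Std. Error"]    # standard error of the slope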

Repeated simulations

The previous example tells us that we need to do more than one experiment to check the accuracy of an estimate; in other words, we need to look at the distribution of the estimates. To do this, we need to know how to extract quantities from the R lm() output and summarize them. The manual usually provides this information. For the lm() example, see the statement slope[j]=fit$coef[2] below.

robust<-function(ns,n,error,a,c) {
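# ns = number of simulated samples, n = sample size per sample,
# error = random deviate generator (e.g. rnorm), a and c center and
# scale the errors so that they have mean 0 and variance 1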
slope<-c(1:ns)
x<-runif(n)*10
y1<-c(1:n)
for (j in c(1:ns))    {
   e1<-(error(n)-a)*c  # normalizing residuals
   for(i in c(1:n)) y1[i]=1+0.5*x[i]+e1[i]
   fit<-lm(y1~x)
   slope[j]=fit$coef[2]  }
mean<-mean(slope)
std<-sd(slope)
list(mean=mean,std=std)
}
set.seed(12345)
test1<-robust(1000,10,rnorm,a=0,c=1)  # Note the error 
                                      # function in the function statement
test2<-robust(1000,10,runif,a=0.5,c=3.464)  # 3.464 = sqrt(12)
test3<-robust(1000,10,rexp,a=1,c=1)

The results show (mean and standard deviation of the 1000 estimated slopes; the true slope is 0.5):

           mean   std 
normal     0.515  0.125
uniform    0.490  0.158
exp        0.502  0.110
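
Because the error generator is passed in as an argument, other error distributions can be tried without changing robust(); for instance, a t distribution with 5 degrees of freedom (its variance is 5/3, so it is scaled by sqrt(3/5)). This is only an illustrative sketch, not part of the original experiment:

set.seed(12345)
test4<-robust(1000,10,function(n) rt(n,df=5),a=0,c=sqrt(3/5))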


Apparently, the regression estimate is not sensitive to the error distribution. It would be better if we could also check the relation between the estimated slope and its standard error; hopefully, the worse the estimate, the larger the standard error, so that the confidence interval coverage is robust. The standard error of the slope is not part of fit$coef, but it can be recovered from the summary of the fit (see the output of summary(result1) above and the sketch below).
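
One possible way to check this, sketched here as an illustration only (robust.se, out, and cover are hypothetical names, not from the original notes): record the standard error of each slope with summary(fit)$coefficients[2,2] and compare it with the estimation error and the confidence interval coverage.

robust.se<-function(ns,n,error,a,c) {
   # Same idea as robust(), but also keeps the standard error of each slope
   slope<-c(1:ns); se<-c(1:ns)
   x<-runif(n)*10
   for (j in c(1:ns))    {
      e1<-(error(n)-a)*c
      y1<-1+0.5*x+e1                            # true intercept 1, true slope 0.5
      fit<-lm(y1~x)
      slope[j]<-fit$coef[2]
      se[j]<-summary(fit)$coefficients[2,2]     # standard error of the slope
      }
   list(slope=slope,se=se)
}
set.seed(12345)
out<-robust.se(1000,10,rnorm,a=0,c=1)
plot(out$se,abs(out$slope-0.5))                 # larger SE should go with larger estimation error
cover<-abs(out$slope-0.5)<=qt(0.975,10-2)*out$se
mean(cover)                                     # empirical coverage of the nominal 95% CI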
