Lesson 2: Writing functions and using R statistical functions

This lecture covers the following section of An Introduction to R:

Writing your own functions

The basic idea of writing a function in R can be demonstrated by the following example, which computes and prints the mean and variance of n simulated random normal variates.

 
 mixsim1 <- function (n) {
     x <- c(1:n)
     for(i in c(1:n)) {x[i] <- 0.5*(rnorm(1)+3*rnorm(1))}
     mx <- mean(x); vx <- var(x)
     cat(" mean of x is ",mx, "\n")
     cat(" var of x is ",vx,"\n")  } # "\n" for a new line

 mixsim1(20)
 mean of x is  1.083920 
 var of x is  2.487287 
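Because rnorm() draws new random numbers on every call, your values will differ from those shown. If you want reproducible output, you can fix the random seed first (a standard R facility, not part of the original example):

 set.seed(1)     # fix the state of the random number generator
 mixsim1(20)     # the printed mean and variance are now reproducible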
To return the computed values as a data set, use list():
 mixsim2 <- function (n){
     x <- c(1:n)
     y <- c(1:n)
     for(i in c(1:n)){
      x[i] <- 0.5*(rnorm(1)+3*rnorm(1))
      y[i] <- x[i]*2.0
     }
     list(x1=x, y1=y)
 }

a <- mixsim2(4)
a
$x1:
[1] -1.6163232 -1.4401603 -1.2824844 0.3978134

$y1:
[1] -3.2326465 -2.8803205 -2.5649688 0.7956267

> a$x1

[1] -1.6163232 -1.4401603 -1.2824844 0.3978134

In other words, the output is split into the named components a$x1 and a$y1.
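A minimal sketch of other ways to inspect and extract the components of such a list (names(), str() and [[ ]] are standard R facilities; a is the object created above):

 names(a)       # "x1" "y1"  -- the component names
 str(a)         # compact summary of the structure of a
 a[["y1"]]      # same as a$y1, using the name as a character string
 a$y1[2]        # the second element of component y1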

In my experience, R is not well suited to writing long programs; the error messages are just too vague. One thing that helps with debugging is to put cat() calls inside a function to print intermediate values, as sketched below. Fortunately, R has thousands of canned functions, so a short program is usually enough for many statistical jobs. A list of functions can be found in Functions & Datasets. Here are some examples.
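A minimal debugging sketch along these lines (mixsim1dbg is a hypothetical name; the extra cat() inside the loop prints each intermediate value and can be removed once the function works):

 mixsim1dbg <- function (n) {
     x <- c(1:n)
     for(i in c(1:n)) {
         x[i] <- 0.5*(rnorm(1)+3*rnorm(1))
         cat("i =", i, " x[i] =", x[i], "\n")   # intermediate values, for debugging only
     }
     cat(" mean of x is ", mean(x), "\n")
     cat(" var of x is ", var(x), "\n")
 }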

Two sample tests

From pp. 43-45 of An Introduction to R.
> A<-scan()
79.98 80.04 80.02 80.04 80.03 80.03 80.004 79.97
80.05 80.03 80.02 80.00 80.02

> B<-scan()
80.02 79.94 79.98 79.97 79.97 80.03  79.95 79.97
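Equivalently (not part of the original transcript), the same two vectors can be created directly with c() instead of typing at the scan() prompt:

 A <- c(79.98, 80.04, 80.02, 80.04, 80.03, 80.03, 80.004, 79.97,
        80.05, 80.03, 80.02, 80.00, 80.02)
 B <- c(80.02, 79.94, 79.98, 79.97, 79.97, 80.03, 79.95, 79.97)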

> library(ctest)
> t.test(A,B)    # Welch Two sample t-test

	Welch Two Sample t-test

data:  A and B 
t = 3.0467, df = 11.899, p-value = 0.01024
alternative hypothesis: true difference in means is not equal to 0 
95 percent confidence interval:
 0.01115442 0.06734558 
sample estimates:
mean of x mean of y 
 80.01800  79.97875 

> var.test(A,B)  #F test to compare two variances

	F test to compare two variances

data:  A and B 
F = 0.5678, num df = 12, denom df = 7, p-value = 0.3713
alternative hypothesis: true ratio of variances is not equal to 1 
95 percent confidence interval:
 0.1216915 2.0477498 
sample estimates:
ratio of variances 
         0.5677919 


> t.test(A,B,var.equal=TRUE) #Two Sample t-test assuming equal variance

	Two Sample t-test

data:  A and B 
t = 3.2658, df = 19, p-value = 0.004067
alternative hypothesis: true difference in means is not equal to 0 
95 percent confidence interval:
 0.01409501 0.06440499 
sample estimates:
mean of x mean of y 
 80.01800  79.97875 
> wilcox.test(A,B) #Wilcoxon rank sum test with continuity correction

	Wilcoxon rank sum test with continuity correction

data:  A and B 
W = 87, p-value = 0.01157
alternative hypothesis: true mu is not equal to 0 
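The printed report is not the only way to use these results: each test returns a list (an object of class "htest") whose components can be extracted by name. A minimal sketch (note that in recent versions of R these tests live in the automatically loaded stats package, so library(ctest) may no longer be needed):

 tt <- t.test(A, B)      # store the Welch test result
 names(tt)               # the available components
 tt$p.value              # just the p-value
 tt$conf.int             # the 95 percent confidence interval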

Simple linear regression (analysis of covariance)


Description

lm() is used to fit linear models. It can be used to carry out regression, single stratum analysis of variance and analysis of covariance (although aov may provide a more convenient interface for these).

Usage

 lm(formula, data, subset, weights, ...)

Arguments

 formula   a symbolic description of the model to be fitted.
 data      an optional data frame containing the variables in the model.
 weights   an optional vector of weights to be used in the fitting process. If specified,
           weighted least squares is used with the given weights (that is, minimizing
           sum(w*e^2)); otherwise ordinary least squares is used.

An example: Morley's measurements of the speed of light (the recorded Speed is the last three digits in 299*** km/sec). There are 5 experiments (Expt), each with 20 replicates (Run).

> data(morley)
> morley
    Expt Run Speed
1      1   1   850
2      1   2   740
3      1   3   900
4      1   4  1070
5      1   5   930
6      1   6   850
7      1   7   950
..................
95     5  15   810
96     5  16   940
97     5  17   950
98     5  18   800
99     5  19   810
100    5  20   870
> morley$Expt<-factor(morley$Expt) # Define Expt as a factor (discrete)
> attach(morley)
> fm<-lm(Speed~Run+Expt, data=morley) # Run is assumed to be continuous.
> summary(fm)        #print the outputs
Residuals:
     Min       1Q   Median       3Q      Max 
-257.768  -43.329    1.926   42.250  158.713 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 912.6947    21.5119  42.427  < 2e-16 ***
Run          -0.3519     1.2937  -0.272 0.786222    
Expt2       -53.0000    23.5900  -2.247 0.026999 *  
Expt3       -64.0000    23.5900  -2.713 0.007930 ** 
Expt4       -88.5000    23.5900  -3.752 0.000304 ***
Expt5       -77.5000    23.5900  -3.285 0.001432 ** 
---
Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1 

Residual standard error: 74.6 on 94 degrees of freedom
Multiple R-Squared: 0.1536,	Adjusted R-squared: 0.1086 
F-statistic: 3.412 on 5 and 94 DF,  p-value: 0.007075  

Note that Expt1 is used as the reference level for Expt2-Expt5. Run gets only one coefficient because we did not include the Expt*Run interaction term (a sketch of fitting the interaction is given after the aov output below). There are other outputs, such as the residuals, but the analysis-of-variance table is not among them. To get the table, you have to use

 
> am<-aov(Speed~Run+Expt, data=morley)
> summary(am)
            Df Sum Sq Mean Sq F value   Pr(>F)   
Run          1    412     412   0.074 0.786222   
Expt         4  94514   23629   4.246 0.003336 **
Residuals   94 523098    5565                    
---
Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1 
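If you do want the Expt*Run interaction mentioned above, it can simply be added to the model formula; a minimal sketch using the same morley data and the factor Expt defined earlier (output not shown):

 fm2 <- lm(Speed ~ Expt*Run, data=morley)   # main effects plus the Expt:Run interaction
 anova(fm2)                                 # the ANOVA table now has an Expt:Run line
 summary(fm2)                               # separate Run slopes via the interaction terms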

Numerical integration

The following program is given on page 53 of An Introduction to R.
# Numerical integration.

area<-function(f,a,b,eps=1.0e-06,lim=10)
 {
      # Adaptive integration: recursively refine the trapezoidal rule until
      # the change is below eps or the recursion limit lim is reached.
      fun1<-function(f,a,b,fa,fb,a0,eps,lim,fun)
     {
      d<-(a+b)/2                  # midpoint of the current interval
      h<-(b-a)/4
      fd<-f(d)
      a1<-h*(fa+fd)               # trapezoidal area over [a,d]
      a2<-h*(fd+fb)               # trapezoidal area over [d,b]

      if(abs(a0-a1-a2) < eps||lim==0)
           return(a1+a2)          # refinement changed little: accept
      else {
           return(fun(f,a,d,fa,fd,a1,eps,lim-1,fun)+
                  fun(f,d,b,fd,fb,a2,eps,lim-1,fun))
           }
      }
  fa<-f(a)
  fb<-f(b)
  a0<-((fa+fb)*(b-a))/2           # trapezoidal area over the whole interval
  fun1(f,a,b,fa,fb,a0,eps,lim,fun1)
 }
                
f0<-function(x) {
    (1/sqrt(2*3.14))*exp(-x^2/2)   # standard normal density (3.14 used as an approximation to pi)
    }

areanorm<-area(f0,-6,1.96)
areanorm


[1] 0.97525
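As a rough check (this line is not part of the original example), the same probability can be obtained from the normal distribution function pnorm(); the small discrepancy from 0.97525 comes from using 3.14 instead of pi in f0:

 pnorm(1.96) - pnorm(-6)    # P(-6 < Z < 1.96) for a standard normal Z; about 0.9750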

You may also use integrate() (see Functions & Datasets):

 integrate(functn, lower_bound, upper_bound, minpts, maxpts, eps, passing parameters)

where minpts and maxpts are the minimum (maximum) number of function evaluations. An example: integrating a beta function.


> fbeta<-function(x,alpha,beta){x^(alpha-1)*(1-x)^(beta-1)}
> byIn<-integrate(fbeta,0,1,100,500,0.01,alpha=3.5,beta=1.5)
> byIn
0.1227233 with absolute error < 0.078
> byGm<-exp(lgamma(3.5)+lgamma(1.5)-lgamma(5)) # By gamma functions
> byGm
[1] 0.1227185 
> d=byIn-byGm # To print the difference

An error message will appear. What happens is that byIn is not a single number: the line "0.1227233 with absolute error < 0.078" is just how the result of integrate() is printed. Looking at the details in help(integrate) tells you that the integral estimate is stored in the component value (and the estimated error in abs.error). Now

> byIn$value-byGm  # byIn$val will also work
[1] 4.79645e-06

Minimization of a function with nlm()

The description of this procedure is on page 70 of An Introduction to R. nlm() minimizes a function; to maximize a function, minimize its negative.

# Example for MLE (logistic regression) Dobson (pp. 108-110)

x<-c(1.69,1.72,1.76,1.78,1.81,1.84,1.86,1.88)
y<-c(   6,  13,  18,  18,  52,  53,  61,  60)
n<-c(  59,  60,  62,  56,  63,  59,  62,  60)
p<-c(0,0)      # define the dimension of p 
fn<-function(p){   # negative of the binomial (logistic) log-likelihood
    sum(-(y*(p[1]+p[2]*x)-n*log(1+exp(p[1]+p[2]*x))+log(choose(n,y))))
}
out<-nlm(fn,p=c(-50,20),hessian=T)
out
$minimum
[1] 25.6519

$estimate
[1] -60.33438  33.96939

$gradient
[1] -2.987290e-06 -5.480607e-06

$hessian
          [,1]     [,2]
[1,]  59.74214 106.4543
[2,] 106.45430 189.7951

$code
[1] 2

$iterations
[1] 15
> solve(out$hessian) # inverse of the information matrix (estimated variance-covariance matrix of the estimates)
          [,1]       [,2]
[1,]  30.38361 -17.041884
[2,] -17.04188   9.563904
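A minimal follow-up sketch (not in the original notes): the square roots of the diagonal of this inverse give approximate standard errors, and since fn is the negative binomial log-likelihood, the fit can be cross-checked with glm():

 sqrt(diag(solve(out$hessian)))               # approximate standard errors of the two estimates
 glm(cbind(y, n - y) ~ x, family = binomial)  # coefficients should be close to out$estimate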
