Spatio-Temporal Analysis Using coefclust

Set-Up

First, load the coefclust package.

set.seed(1)
library(coefclust)

Load in the toy dataset for the spatio-temporal analysis that comes from the package, SE_FakeData_SpTM. Explore the dataset using the str() function:

data("SE_FakeData_SpTm")
str(SE_FakeData_SpTm)

## 'data.frame':    616 obs. of  11 variables:
##  $ State : Factor w/ 7 levels "Alabama","Florida",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ County: Factor w/ 438 levels "Abbeville","Adams",..: 17 21 24 36 40 51 54 58 69 77 ...
##  $ FIPS  : int  1001 1003 1005 1007 1009 1011 1013 1015 1017 1019 ...
##  $ long  : num  -86.6 -87.7 -85.4 -87.1 -86.6 ...
##  $ lat   : num  32.5 30.7 31.9 33 34 ...
##  $ x1    : num  -1.692 0.407 -3.864 7.074 3.073 ...
##  $ x2    : num  -2.141 -0.0214 -4.3699 5.7659 3.4897 ...
##  $ x3    : num  -3.213 0.759 -2.858 5.157 0.564 ...
##  $ y1    : num  -1.824 0.567 -0.144 16.095 -0.67 ...
##  $ y2    : num  -2.044 -0.212 -0.81 16.156 -1.007 ...
##  $ y3    : num  2.489 1.072 -0.161 1.858 1.154 ...

This dataset has 11 variables: State, County, FIPS code (as an integer value), long (longitude of the centroid for each county), lat (latitude of the centroid for each county), 3 covariates (x1, x2, x3), and three responses (y1, y2, y2).

Next, we extract the longitude and latitude into separate objects, long and lat:

long <- SE_FakeData_SpTm$long; lat <- SE_FakeData_SpTm$lat

We must first prepare the data into lists. First we prepare the list yList, where each element of the list will contain a single series of the three responses:

yList <-list()
yList[[1]] <- SE_FakeData_SpTm$y1
yList[[2]] <- SE_FakeData_SpTm$y2
yList[[3]] <- SE_FakeData_SpTm$y3
str(yList)

## List of 3
##  $ : num [1:616] -1.824 0.567 -0.144 16.095 -0.67 ...
##  $ : num [1:616] -2.044 -0.212 -0.81 16.156 -1.007 ...
##  $ : num [1:616] 2.489 1.072 -0.161 1.858 1.154 ...

Next, we prepare the list XList. Each element of XList will have two columns. The first vector is 1 repeated the length of the number of observations. This first column corresponds to the intercept, $\beta_0$. The second column corresponds to the x covariate.

XList <-list()
XList[[1]] <- cbind(rep(1,length(long)), SE_FakeData_SpTm$x1)
XList[[2]] <- cbind(rep(1,length(long)), SE_FakeData_SpTm$x2)
XList[[3]] <- cbind(rep(1,length(long)), SE_FakeData_SpTm$x3)
str(XList)

## List of 3
##  $ : num [1:616, 1:2] 1 1 1 1 1 1 1 1 1 1 ...
##  $ : num [1:616, 1:2] 1 1 1 1 1 1 1 1 1 1 ...
##  $ : num [1:616, 1:2] 1 1 1 1 1 1 1 1 1 1 ...

We start by setting the maximum radius (MR) to 300 and the number of simulation (M) to 2. In a full simulation, we recommend setting M=1000, but have only set it to 2 here for illustration:

MR <- 300
M <- 2

Detectiong Multiple Clusters Sequentially Using Simultaneous Detection

We use the function Find.Clusters.SI.ST() for detectiong multiple clusters sequentially using simultaneous detection. This function will take the list of responses we created (yList) as the first argument and the list of covariates (XList) as the second argument. The lists of resposnes and covariates are followed by the longitude coordinate, latitude coordinate, the maximum radius for potential clusters MR, the number of simulations M, we set overlap=FALSE to not allow for overlapping clusters, and set alpha=0.05 as the Type I error rate:

#Find Multiple Clusters Sequentially via the Simulatneous Detection 
Clusters_simST <- Find.Clusters.SI.ST(yList, XList, long, lat, MR, M, overlap=FALSE, alpha=0.05)
str(Clusters_simST)

## List of 3
##  $ Clusters : num [1, 1:5] 11 92.8697 0.3333 0.0352 16
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : NULL
##   .. ..$ : chr [1:5] "center" "radius" "pval" "elapsed" ...
##  $ Coef     :List of 2
##   ..$ Coeff_History:List of 3
##   .. ..$ : num [1:2, 1] 0.0395 0.062
##   .. .. ..- attr(*, "dimnames")=List of 2
##   .. .. .. ..$ : chr [1:2] "beta_0" "beta_1"
##   .. .. .. ..$ : chr "0"
##   .. ..$ : num [1:2, 1] 0.1129 0.0619
##   .. .. ..- attr(*, "dimnames")=List of 2
##   .. .. .. ..$ : chr [1:2] "beta_0" "beta_1"
##   .. .. .. ..$ : chr "0"
##   .. ..$ : num [1:2, 1] 0.1091 0.0186
##   .. .. ..- attr(*, "dimnames")=List of 2
##   .. .. .. ..$ : chr [1:2] "beta_0" "beta_1"
##   .. .. .. ..$ : chr "0"
##   ..$ Coeff_Table  : num [1:2, 1:3] 0.0395 0.062 0.1129 0.0619 0.1091 ...
##   .. ..- attr(*, "dimnames")=List of 2
##   .. .. ..$ : chr [1:2] "beta_0" "beta_1"
##   .. .. ..$ : NULL
##  $ Indicator:List of 1
##   ..$ : logi [1:616] TRUE FALSE FALSE TRUE FALSE FALSE ...

Clusters_simST$Clusters

##      center   radius      pval    elapsed n_obs
## [1,]     11 92.86966 0.3333333 0.03516667    16

Clusters_simST$Coef

## $Coeff_History
## $Coeff_History[[1]]
##                 0
## beta_0 0.03945850
## beta_1 0.06195994
## 
## $Coeff_History[[2]]
##                 0
## beta_0 0.11293727
## beta_1 0.06194135
## 
## $Coeff_History[[3]]
##                 0
## beta_0 0.10911750
## beta_1 0.01859798
## 
## 
## $Coeff_Table
##              [,1]       [,2]       [,3]
## beta_0 0.03945850 0.11293727 0.10911750
## beta_1 0.06195994 0.06194135 0.01859798

table(Clusters_simST$Indicator)

## 
## FALSE  TRUE 
##   600    16

The output is an object Clusters_simST which is a list of 3 elements:

Clusters_simST$Clusters (first element of list): provides the resulting identified cluster, with information about the center ID, radius, p-value for the cluster, elapsed computation time, and number of observations inside the detected cluster.
Clusters_simST$Coef (second element of list): provides estimates for beta_0 (intercept) and beta_1 (slope) for each time period.
Clusters_simST$Indicator (third element of list): is Boolean vector that is TRUE if the given center is inside the cluster and FALSE if it is not.

To identify the cluster by it’s county name, we can column bind the information from the SE_FakeData_SpTm dataset with the identified cluster information:

cbind(SE_FakeData_SpTm[Clusters_simST$Clusters[,1],1:2],Clusters_simST$Clusters)

##      State  County center   radius      pval    elapsed n_obs
## 11 Alabama Chilton     11 92.86966 0.3333333 0.03516667    16

Detecting Multiple Clusters Sequentially Using Two-Stage Detection

We use the functions Find.Clusters.TS.ST1(), Find.Clusters.TS.ST2() for detectiong multiple clusters sequentially using simultaneous detection. Like the Find.Clusters.SI.ST() function for simultaneous detection above, this function will take the list of responses we created (yList) as the first argument and the list of covariates (XList) as the second argument. The lists of resposnes and covariates are followed by the longitude coordinate, latitude coordinate, the maximum radius for potential clusters MR, the number of simulations M, we set overlap=FALSE to not allow for overlapping clusters, and set alpha=0.05 as the Type I error rate.

The results from the first stage (using Find.Clusters.TS.ST1()) are stored as the object Clusters_ts1N. This is then used as an argument for the function Find.Clusters.TS.ST2() where we specify the clusters identified in the first stage as Cls1st=Clusters_ts1N.

#Find Multiple Clusters Sequentially via the Two--Stage Detections
Clusters_ts1N <- Find.Clusters.TS.ST1(yList, XList, long, lat, MR, M, overlap=FALSE, alpha=0.05)
Clusters_ts2N <- Find.Clusters.TS.ST2(yList, XList, long, lat, MR, M, Cls1st=Clusters_ts1N, overlap=FALSE, alpha=0.05)

#Explore the spatio-temporal identified clusters (via two-stage detection)
str(Clusters_ts2N)

## List of 3
##  $ Clusters : num [1, 1:6] 11 88.699 0.333 0.048 15 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : NULL
##   .. ..$ : chr [1:6] "center" "radius" "pval" "elapsed" ...
##  $ Coef     :List of 2
##   ..$ Coeff_History:List of 3
##   .. ..$ : num [1:2, 1] 0.0395 0.062
##   .. .. ..- attr(*, "dimnames")=List of 2
##   .. .. .. ..$ : chr [1:2] "beta_0" "beta_1"
##   .. .. .. ..$ : chr "0"
##   .. ..$ : num [1:2, 1] 0.1129 0.0619
##   .. .. ..- attr(*, "dimnames")=List of 2
##   .. .. .. ..$ : chr [1:2] "beta_0" "beta_1"
##   .. .. .. ..$ : chr "0"
##   .. ..$ : num [1:2, 1] 0.1091 0.0186
##   .. .. ..- attr(*, "dimnames")=List of 2
##   .. .. .. ..$ : chr [1:2] "beta_0" "beta_1"
##   .. .. .. ..$ : chr "0"
##   ..$ Coeff_Table  : num [1:2, 1:3] 0.0395 0.062 0.1129 0.0619 0.1091 ...
##   .. ..- attr(*, "dimnames")=List of 2
##   .. .. ..$ : chr [1:2] "beta_0" "beta_1"
##   .. .. ..$ : NULL
##  $ Indicator:List of 1
##   ..$ : logi [1:616] TRUE FALSE FALSE TRUE FALSE FALSE ...

Clusters_ts2N$Clusters

##      center   radius      pval elapsed n_obs stage
## [1,]     11 88.69924 0.3333333   0.048    15     1

Clusters_ts2N$Coef

## $Coeff_History
## $Coeff_History[[1]]
##                 0
## beta_0 0.03945850
## beta_1 0.06195994
## 
## $Coeff_History[[2]]
##                 0
## beta_0 0.11293727
## beta_1 0.06194135
## 
## $Coeff_History[[3]]
##                 0
## beta_0 0.10911750
## beta_1 0.01859798
## 
## 
## $Coeff_Table
##              [,1]       [,2]       [,3]
## beta_0 0.03945850 0.11293727 0.10911750
## beta_1 0.06195994 0.06194135 0.01859798

table(Clusters_ts2N$Indicator)

## 
## FALSE  TRUE 
##   592    24

The final output of the spatio-temporal two-stage detection is stored in the object Clusters_ts2N, which is a list of 3 elements:

Clusters_ts2N$Clusters (first element of list): provides the resulting identified cluster, with information about the center ID, radius, p-value, elapsed computation time, and number of observations in the cluster.
Clusters_ts2N$Coef (second element of list): provides estimates for beta_0 (intercept) and beta_1 (slope) for each time period.
Clusters_ts2N$Indicator (third element of list): is Boolean vector that is TRUE if the given center is inside the cluster and FALSE if it is not.

To identify the cluster by it’s county name, we can column bind the information from the SE_FakeData_SpTm dataset with the identified cluster information:

cbind(SE_FakeData_SpTm[Clusters_ts2N$Clusters[,1],1:2],Clusters_ts2N$Clusters)

##      State  County center   radius      pval elapsed n_obs stage
## 11 Alabama Chilton     11 88.69924 0.3333333   0.048    15     1

Bonferroni Correction

In order to adjust for multiple testing across the two stages, the Bonferroni correction can be applied to the p-value associated with the detected cluster. To do so, we specify alpha=(0.05/2):

### With the Bonferroni correction
Clusters_ts1B <- Find.Clusters.TS.ST1(yList, XList, long, lat, MR, M, overlap=FALSE, alpha=(0.05/2))
Clusters_ts2B <- Find.Clusters.TS.ST2(yList, XList, long, lat, MR, M, Cls1st=Clusters_ts1B, overlap=FALSE, alpha=(0.05/2))

To identify the cluster by it’s county name, we can column bind the information from the SE_FakeData_SpTm dataset with the identified cluster information:

cbind(SE_FakeData_SpTm[Clusters_ts2B$Clusters[,1],1:2],Clusters_ts2B$Clusters)

##      State  County center   radius      pval elapsed n_obs stage
## 11 Alabama Chilton     11 88.69924 0.3333333    0.05    15     1

Spatio-Temporal Analysis Using coefclust

Junho Lee, Maria Kamenetsky

Set-Up

Detectiong Multiple Clusters Sequentially Using Simultaneous Detection

Detecting Multiple Clusters Sequentially Using Two-Stage Detection

Bonferroni Correction