Introduction to coefclust

Junho Lee, Maria Kamenetsky

First, load the coefclust package.

library(coefclust)

Next load the toy dataset that comes with the package, SoutheastFakeData:

data("SE_FakeData")
str(SE_FakeData)
## 'data.frame':    616 obs. of  7 variables:
##  $ State : Factor w/ 7 levels "Alabama","Florida",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ County: Factor w/ 438 levels "Abbeville","Adams",..: 17 21 24 36 40 51 54 58 69 77 ...
##  $ FIPS  : int  1001 1003 1005 1007 1009 1011 1013 1015 1017 1019 ...
##  $ long  : num  -86.6 -87.7 -85.4 -87.1 -86.6 ...
##  $ lat   : num  32.5 30.7 31.9 33 34 ...
##  $ x     : num  -1.692 0.407 -3.864 7.074 3.073 ...
##  $ y     : num  -1.833 -0.428 -0.506 14.84 0.417 ...

From the str() function, we see that the toy dataset has 7 variables: State, County, FIPS code (as an integer value), long - the longitude of the centroid for each county, lat - the latitude of the centroid for each county, x, and y - the vector of responses.

Next, set initial x and y inputs, the longitude and latitude, maximum radius (MR) in km, and number of simulations (M). y will be equivalent to the response (y) in the dataset SE_FakeData. For X, we must first column bind a vector of 1’s for the intercept.

In a full simulation, we recommend setting M=1000. However for illustration, we set M=2 here.

y <- SE_FakeData$y
X <- cbind(rep(1,length(y)), SE_FakeData$x)
head(X)
##      [,1]       [,2]
## [1,]    1 -1.6916972
## [2,]    1  0.4066972
## [3,]    1 -3.8642180
## [4,]    1  7.0744111
## [5,]    1  3.0729092
## [6,]    1 -2.2670492
long <- SE_FakeData$long; lat <- SE_FakeData$lat
MR <- 300
M <- 2

Find Multiple Clusters Sequentially via the Simultaneous Detection

To find multiple clusters sequentially via simultaneous detection, we use the function Find.Clusters.Simul(). This function takes as arguments the response vector, y, the design matrix X we created above, the longitude and latitude coordinates for the centroid for each county, MR which is the maximum radius for a potential clusters, M the number of simulations, we set overlap=TRUE to allow for overlapping clusters, and set alpha=0.05 as the Type I error rate.

Clusters_sim <- Find.Clusters.Simul(y, X, long, lat, MR, M, overlap=TRUE, alpha=0.05)
str(Clusters_sim)
## List of 3
##  $ Clusters: num [1, 1:5] 11 92.8697 0.3333 0.0302 16
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : NULL
##   .. ..$ : chr [1:5] "center" "radius" "pval" "elapsed" ...
##  $ Coef    : num [1:2, 1] 0.089 0.0461
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:2] "beta_0" "beta_1"
##   .. ..$ : chr "0"
##  $ clsL    :List of 1
##   ..$ : logi [1:616] TRUE FALSE FALSE TRUE FALSE FALSE ...
Clusters_sim$Clusters
##      center   radius      pval    elapsed n_obs
## [1,]     11 92.86966 0.3333333 0.03016667    16
Clusters_sim$Coef
##                 0
## beta_0 0.08902763
## beta_1 0.04611213

The output is object Clusters_sim, which is a list of three elements.

To identify the cluster by it’s county name, we can column bind the information from the SE_FakeData dataset with the identified cluster information:

cbind(SE_FakeData[Clusters_sim$Clusters[,1],1:2],Clusters_sim$Clusters)
##      State  County center   radius      pval    elapsed n_obs
## 11 Alabama Chilton     11 92.86966 0.3333333 0.03016667    16

Find Multiple Clusters Sequentially via the Two-Stage Detection

To find multiple clusters sequentially via two-stage detection, we use the functions Find.Clusters.TStg1() (stage 1) and Find.Clusters.TStg2() (stage 2).

Find.Clusters.TStg1() takes as arguments the response vector, y, the design matrix X we created above, the longitude and latitude coordinates for the centroid for each county, MR which is the maximum radius for a potential clusters, M the number of simulations, we set overlap=TRUE to allow for overlapping clusters, and set alpha=0.05 as the Type I error rate. The object created by this function (Clusters_ts1) becomes an input for the stage 2 function, Find.Clusters.TStg2().

Find.Clusters.TStg2() takes as arguments y, the design matrix X we created above, the longitude and latitude coordinates for the centroid for each county, MR which is the maximum radius for a potential clusters, M the number of simulations, Cls1st which is the output of the first stage and we specify as Clusters_ts1, we set overlap=TRUE to allow for overlapping clusters, and set alpha=0.05 as the Type I error rate.

#Perform two-stage detection
Clusters_ts1 <- Find.Clusters.TStg1(y, X, long, lat, MR, M, overlap=TRUE, alpha=0.05)
Clusters_ts2 <- Find.Clusters.TStg2(y, X, long, lat, MR, M, Clusters_ts1, overlap=TRUE, alpha=0.05)

#Explore identified clusters (via two-stage detection)
Clusters_ts2$Clusters
##      center   radius      pval elapsed n_obs stage
## [1,]     11 92.86966 0.3333333  0.0235    16     1
Clusters_ts2$Coef
##                 0
## beta_0 0.08902763
## beta_1 0.04611213

The final output of the two-stage detection is in the object Clusters_ts2, which is a list of 3 elements:

To identify the cluster by it’s county name, we can column bind the information from the SE_FakeData dataset with the identified cluster information:

cbind(SE_FakeData[Clusters_ts2$Clusters[,1],1:2],Clusters_ts2$Clusters)
##      State  County center   radius      pval elapsed n_obs stage
## 11 Alabama Chilton     11 92.86966 0.3333333  0.0235    16     1