Introduction to coefclust

From the str() function, we see that the toy dataset has 7 variables: State, County, FIPS code (as an integer value), long - the longitude of the centroid for each county, lat - the latitude of the centroid for each county, x, and y - the vector of responses.

Next, set initial x and y inputs, the longitude and latitude, maximum radius (MR) in km, and number of simulations (M). y will be equivalent to the response (y) in the dataset SE_FakeData. For X, we must first column bind a vector of 1’s for the intercept.

In a full simulation, we recommend setting M=1000. However for illustration, we set M=2 here.

Find Multiple Clusters Sequentially via the Simultaneous Detection

To find multiple clusters sequentially via simultaneous detection, we use the function Find.Clusters.Simul(). This function takes as arguments the response vector, y, the design matrix X we created above, the longitude and latitude coordinates for the centroid for each county, MR which is the maximum radius for a potential clusters, M the number of simulations, we set overlap=TRUE to allow for overlapping clusters, and set alpha=0.05 as the Type I error rate.

Clusters_sim <- Find.Clusters.Simul(y, X, long, lat, MR, M, overlap=TRUE, alpha=0.05)
str(Clusters_sim)

## List of 3
##  $ Clusters: num [1, 1:5] 11 92.8697 0.3333 0.0302 16
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : NULL
##   .. ..$ : chr [1:5] "center" "radius" "pval" "elapsed" ...
##  $ Coef    : num [1:2, 1] 0.089 0.0461
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:2] "beta_0" "beta_1"
##   .. ..$ : chr "0"
##  $ clsL    :List of 1
##   ..$ : logi [1:616] TRUE FALSE FALSE TRUE FALSE FALSE ...

Clusters_sim$Clusters

##      center   radius      pval    elapsed n_obs
## [1,]     11 92.86966 0.3333333 0.03016667    16

Clusters_sim$Coef

##                 0
## beta_0 0.08902763
## beta_1 0.04611213

The output is object Clusters_sim, which is a list of three elements.

Clusters_sim$Clusters (first element of list): provides the resulting identified cluster, with information about the center ID, radius, p-value, elapsed computation time, and number of observations in the cluster.
Clusters_sim$Coef (second element of list): provides estimates for beta_0 (intercept) and beta_1.
Clusters_sim$clsL: Boolean vector of TRUE/FALSE indicating locations identified to be inside the cluster (TRUE) or outside of the cluster (FALSE).

To identify the cluster by it’s county name, we can column bind the information from the SE_FakeData dataset with the identified cluster information:

cbind(SE_FakeData[Clusters_sim$Clusters[,1],1:2],Clusters_sim$Clusters)

##      State  County center   radius      pval    elapsed n_obs
## 11 Alabama Chilton     11 92.86966 0.3333333 0.03016667    16

Find Multiple Clusters Sequentially via the Two-Stage Detection

To find multiple clusters sequentially via two-stage detection, we use the functions Find.Clusters.TStg1() (stage 1) and Find.Clusters.TStg2() (stage 2).

Find.Clusters.TStg1() takes as arguments the response vector, y, the design matrix X we created above, the longitude and latitude coordinates for the centroid for each county, MR which is the maximum radius for a potential clusters, M the number of simulations, we set overlap=TRUE to allow for overlapping clusters, and set alpha=0.05 as the Type I error rate. The object created by this function (Clusters_ts1) becomes an input for the stage 2 function, Find.Clusters.TStg2().

Find.Clusters.TStg2() takes as arguments y, the design matrix X we created above, the longitude and latitude coordinates for the centroid for each county, MR which is the maximum radius for a potential clusters, M the number of simulations, Cls1st which is the output of the first stage and we specify as Clusters_ts1, we set overlap=TRUE to allow for overlapping clusters, and set alpha=0.05 as the Type I error rate.

#Perform two-stage detection
Clusters_ts1 <- Find.Clusters.TStg1(y, X, long, lat, MR, M, overlap=TRUE, alpha=0.05)
Clusters_ts2 <- Find.Clusters.TStg2(y, X, long, lat, MR, M, Clusters_ts1, overlap=TRUE, alpha=0.05)

#Explore identified clusters (via two-stage detection)
Clusters_ts2$Clusters

##      center   radius      pval elapsed n_obs stage
## [1,]     11 92.86966 0.3333333  0.0235    16     1

Clusters_ts2$Coef

##                 0
## beta_0 0.08902763
## beta_1 0.04611213

The final output of the two-stage detection is in the object Clusters_ts2, which is a list of 3 elements:

Clusters_sim$Clusters (first element of list): provides the resulting identified cluster, with information about the center ID, radius, p-value, elapsed computation time, and number of observations in the cluster.
Clusters_sim$Coef (second element of list): provides estimates for beta_0 (intercept) and beta_1.
Clusters_sim$clsL: Boolean vector of TRUE/FALSE indicating locations identified to be inside the cluster (TRUE) or outside of the cluster (FALSE).

To identify the cluster by it’s county name, we can column bind the information from the SE_FakeData dataset with the identified cluster information:

cbind(SE_FakeData[Clusters_ts2$Clusters[,1],1:2],Clusters_ts2$Clusters)

##      State  County center   radius      pval elapsed n_obs stage
## 11 Alabama Chilton     11 92.86966 0.3333333  0.0235    16     1

Introduction to coefclust

Junho Lee, Maria Kamenetsky

Find Multiple Clusters Sequentially via the Simultaneous Detection

Find Multiple Clusters Sequentially via the Two-Stage Detection