First, load the coefclust
package.
library(coefclust)
Next load the toy dataset that comes with the package, SoutheastFakeData:
data("SE_FakeData")
str(SE_FakeData)
## 'data.frame': 616 obs. of 7 variables:
## $ State : Factor w/ 7 levels "Alabama","Florida",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ County: Factor w/ 438 levels "Abbeville","Adams",..: 17 21 24 36 40 51 54 58 69 77 ...
## $ FIPS : int 1001 1003 1005 1007 1009 1011 1013 1015 1017 1019 ...
## $ long : num -86.6 -87.7 -85.4 -87.1 -86.6 ...
## $ lat : num 32.5 30.7 31.9 33 34 ...
## $ x : num -1.692 0.407 -3.864 7.074 3.073 ...
## $ y : num -1.833 -0.428 -0.506 14.84 0.417 ...
From the str()
function, we see that the toy dataset has 7 variables: State, County, FIPS code (as an integer value), long - the longitude of the centroid for each county, lat - the latitude of the centroid for each county, x, and y - the vector of responses.
Next, set initial x
and y
inputs, the longitude and latitude, maximum radius (MR
) in km, and number of simulations (M
). y
will be equivalent to the response (y
) in the dataset SE_FakeData
. For X
, we must first column bind a vector of 1’s for the intercept.
In a full simulation, we recommend setting M=1000
. However for illustration, we set M=2
here.
y <- SE_FakeData$y
X <- cbind(rep(1,length(y)), SE_FakeData$x)
head(X)
## [,1] [,2]
## [1,] 1 -1.6916972
## [2,] 1 0.4066972
## [3,] 1 -3.8642180
## [4,] 1 7.0744111
## [5,] 1 3.0729092
## [6,] 1 -2.2670492
long <- SE_FakeData$long; lat <- SE_FakeData$lat
MR <- 300
M <- 2
To find multiple clusters sequentially via simultaneous detection, we use the function Find.Clusters.Simul()
. This function takes as arguments the response vector, y
, the design matrix X
we created above, the longitude and latitude coordinates for the centroid for each county, MR
which is the maximum radius for a potential clusters, M
the number of simulations, we set overlap=TRUE
to allow for overlapping clusters, and set alpha=0.05
as the Type I error rate.
Clusters_sim <- Find.Clusters.Simul(y, X, long, lat, MR, M, overlap=TRUE, alpha=0.05)
str(Clusters_sim)
## List of 3
## $ Clusters: num [1, 1:5] 11 92.8697 0.3333 0.0302 16
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : NULL
## .. ..$ : chr [1:5] "center" "radius" "pval" "elapsed" ...
## $ Coef : num [1:2, 1] 0.089 0.0461
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:2] "beta_0" "beta_1"
## .. ..$ : chr "0"
## $ clsL :List of 1
## ..$ : logi [1:616] TRUE FALSE FALSE TRUE FALSE FALSE ...
Clusters_sim$Clusters
## center radius pval elapsed n_obs
## [1,] 11 92.86966 0.3333333 0.03016667 16
Clusters_sim$Coef
## 0
## beta_0 0.08902763
## beta_1 0.04611213
The output is object Clusters_sim
, which is a list of three elements.
Clusters_sim$Clusters
(first element of list): provides the resulting identified cluster, with information about the center ID, radius, p-value, elapsed computation time, and number of observations in the cluster.Clusters_sim$Coef
(second element of list): provides estimates for beta_0
(intercept) and beta_1
.Clusters_sim$clsL
: Boolean vector of TRUE/FALSE indicating locations identified to be inside the cluster (TRUE) or outside of the cluster (FALSE).To identify the cluster by it’s county name, we can column bind the information from the SE_FakeData
dataset with the identified cluster information:
cbind(SE_FakeData[Clusters_sim$Clusters[,1],1:2],Clusters_sim$Clusters)
## State County center radius pval elapsed n_obs
## 11 Alabama Chilton 11 92.86966 0.3333333 0.03016667 16
To find multiple clusters sequentially via two-stage detection, we use the functions Find.Clusters.TStg1()
(stage 1) and Find.Clusters.TStg2()
(stage 2).
Find.Clusters.TStg1()
takes as arguments the response vector, y
, the design matrix X
we created above, the longitude and latitude coordinates for the centroid for each county, MR
which is the maximum radius for a potential clusters, M
the number of simulations, we set overlap=TRUE
to allow for overlapping clusters, and set alpha=0.05
as the Type I error rate. The object created by this function (Clusters_ts1
) becomes an input for the stage 2 function, Find.Clusters.TStg2()
.
Find.Clusters.TStg2()
takes as arguments y
, the design matrix X
we created above, the longitude and latitude coordinates for the centroid for each county, MR
which is the maximum radius for a potential clusters, M
the number of simulations, Cls1st
which is the output of the first stage and we specify as Clusters_ts1
, we set overlap=TRUE
to allow for overlapping clusters, and set alpha=0.05
as the Type I error rate.
#Perform two-stage detection
Clusters_ts1 <- Find.Clusters.TStg1(y, X, long, lat, MR, M, overlap=TRUE, alpha=0.05)
Clusters_ts2 <- Find.Clusters.TStg2(y, X, long, lat, MR, M, Clusters_ts1, overlap=TRUE, alpha=0.05)
#Explore identified clusters (via two-stage detection)
Clusters_ts2$Clusters
## center radius pval elapsed n_obs stage
## [1,] 11 92.86966 0.3333333 0.0235 16 1
Clusters_ts2$Coef
## 0
## beta_0 0.08902763
## beta_1 0.04611213
The final output of the two-stage detection is in the object Clusters_ts2
, which is a list of 3 elements:
Clusters_sim$Clusters
(first element of list): provides the resulting identified cluster, with information about the center ID, radius, p-value, elapsed computation time, and number of observations in the cluster.Clusters_sim$Coef
(second element of list): provides estimates for beta_0
(intercept) and beta_1
.Clusters_sim$clsL
: Boolean vector of TRUE/FALSE indicating locations identified to be inside the cluster (TRUE) or outside of the cluster (FALSE).To identify the cluster by it’s county name, we can column bind the information from the SE_FakeData
dataset with the identified cluster information:
cbind(SE_FakeData[Clusters_ts2$Clusters[,1],1:2],Clusters_ts2$Clusters)
## State County center radius pval elapsed n_obs stage
## 11 Alabama Chilton 11 92.86966 0.3333333 0.0235 16 1