Please report any issues here.
Messages from clusso
have been suppressed for this tutorial, but there are many messages generated that we hope the user will find helpful. Feedback is welcome here.
Load the clusso
library:
library("clusso")
If you haven't already, you can install clusso
directly from GitHub:
library("devtools")
devtools::install_github("mkamenet3/clusso")
clusso
In this example, we will use simulated data of breast cancer incidence counts across 208 prefects (municipalities) in Japan across 5 time periods. These data are based on original incidence counts, but we have added noise to these data.
We load four data sets that come with the clusso
package:
#load data
data("jbc")
data("utmJapan")
To explore the 4 data sets, we apply the head()
function to each of the data sets.
utmJapan
: data set containing a unique identifier for each prefect centroid (id), x-coordinate in UTM (utmx), and y-coordinate in UTM (utmy).jbc
: data set containing a unique identifier for each centroid (id), period of observation (categorical variable with 5 levels, period), death count (death), expected death count (expdeath). These data are simulated (with some noise) based on observed data.#inspect
head(utmJapan)
## id utmx utmy
## 1 9201 399786.6 4047756
## 2 9202 360917.0 4023885
## 3 9203 385175.1 4025749
## 4 9204 371603.4 4018172
## 5 9205 388154.2 4047900
## 6 9206 375023.3 4068053
head(jbc)
## id period death expdeath covar1 covar2 covar3 covar4 covar5
## 1 9201 7906 35 36.75088 458.5877 0.01894360 34 1 0
## 2 9201 7922 53 41.78682 470.2921 0.08078908 40 1 0
## 3 9201 7938 41 47.41040 490.6435 0.02267590 32 0 1
## 4 9201 7954 66 53.36025 501.6789 0.03259346 35 0 0
## 5 9201 7970 73 58.70772 531.8578 0.01600942 32 0 1
## 6 9202 7906 15 19.77900 521.2707 0.05811436 31 0 0
We first convert period
to a factor in the data frame jbc
:
jbc$period <- as.factor(jbc$period)
When using clusso
, there are global parameters that need to be set first.
We take the easting and northing coordinates from the utmJapan
data set, set the easting coordinate, utmx
, to x
and the northing coordinate, utmy
, to y
. Each is divided by 1000 to change the scale from meters to kilometers.
x <- utmJapan$utmx/1000
y <- utmJapan$utmy/1000
Below, we set the maximum radius for a cluster to 20 km (rMax=20
) based on scientific knowledge of the area and our definition of a meaningful cluster. We set the number of unique time periods to 5 (Time=5
), and set a floor for overdispersion to avoid estimating underdispersion (overdispfloor=TRUE
).
rMax <- 20
Time <- 5
overdispfloor <- TRUE
If you do not have expected counts readily available, they can be calculated using the auxiliary function calcexpected()
, from the clusso
package.
As an example, we generate some data and apply calcexpected()
. The output is a dataframe with an additional variable, expected
, which contains the expected counts:
set.seed(2)
period <- rep(c("1","2"), times=5)
observed <- MASS::rnegbin(n = length(period), mu=20, theta=2)
pop <- MASS::rnegbin(n = length(period), mu=200, theta=2)
ids <- rep(1:5, each=2)
calcexpected(observed, pop, period, ids)
## ids periods population observed expected
## 1 1 1 298 5 19.288026
## 2 1 2 112 10 7.249191
## 3 2 1 455 2 29.449838
## 4 2 2 67 7 4.336570
## 5 3 1 87 11 5.631068
## 6 3 2 50 12 3.236246
## 7 4 1 314 10 20.323625
## 8 4 2 165 17 10.679612
## 9 5 1 207 40 13.398058
## 10 5 2 99 6 6.407767
In this data example, we already have expected counts and will use the dataframe jbc
.
clusso
To perform the regularized spatial and spatio-temporal cluster detection using the Lasso penalty, we use the function clusso()
. In the first argument, we set the name of the dataframe, df=jbc
. In the second argument we set the expected counts equal to the name of expected counts in the jbc
dataframe (expected = expdeath
). In the third argument, we set the observed counts equal to the name of the observed counts in the jbc
dataframe (observed = death
). In the fourth argument, we set the time period equal to the name of the time periods in the jbc
dataframe (timeperiod = period
). This variable has already been converted to a factor in the data-cleaning step above (jbc$period <- as.factor(jbc$period)
). In this example, we do not want to include other covariates as unpenalized terms in the model so we set covars = FALSE
.
The next two arguments are x
and y
, which are the UTM easting and northing coordinates, scaled to be in kilometers. The next argument is the maximum radius (in km) for the potential clusters. We set rMax = rMax
, which we defined above as 20km. The argument utm
specifies that our coordinates are indeed in UTM coordinates (utm = TRUE
). For latitude-longitude coordinates, we would set utm = FALSE
. The is analysis
argument defines the type of analysis to be performed. Options include space
, spacetime
, or both
. Here, we set analysis
equal to "both", which will perform both the spatial and spatio-temporal analyses. The argument model
is the type of model to be used. Currently (2020-09-14), either the Poisson or binomial models can be specified. In this example, we specify the Poisson model. The final required argument is maxclust
. This is the maximum number of actual clusters expected in the given area. This is based on scientific knowledge of the area. For the Japanese breast cancer data, we set maxclust=11
. If the number of identified clusters exceeds maxclust
, a warning will encourage you to increase your set maxclust
.
The output of clusso()
is assigned to the object resreal
. resreal
is a large list of lists. The recommended way to explore the results is to explore the names(resreal)
and select each sub-list of interest.
head(jbc)
## id period death expdeath covar1 covar2 covar3 covar4 covar5
## 1 9201 7906 35 36.75088 458.5877 0.01894360 34 1 0
## 2 9201 7922 53 41.78682 470.2921 0.08078908 40 1 0
## 3 9201 7938 41 47.41040 490.6435 0.02267590 32 0 1
## 4 9201 7954 66 53.36025 501.6789 0.03259346 35 0 0
## 5 9201 7970 73 58.70772 531.8578 0.01600942 32 0 1
## 6 9202 7906 15 19.77900 521.2707 0.05811436 31 0 0
system.time(resreal <- clusso(jbc, expected = expdeath, observed=death,timeperiod = period,
covars=FALSE, id = id, x= x,y = y, rMax = rMax,
utm=TRUE, analysis="both", model="poisson",maxclust=11))
## Running Poisson both spatial and spatio-temporal model(s).
## Creating radius-based potential clusters
## No covariates found
## Number of potential clusters to scan through: 66870
## Path selection: information criteria
## Lasso complete - extracting estimates and paths
## Overdispersion estimate: 2.1011
## No covariates found
## Number of potential clusters to scan through: 66870
## Path selection: information criteria
## Lasso complete - extracting estimates and paths
## No covariates found
## Number of potential clusters to scan through: 66870
## Path selection: information criteria
## Lasso complete - extracting estimates and paths
## Overdispersion estimate: 2.1011
## No covariates found
## Number of potential clusters to scan through: 66870
## Path selection: information criteria
## Lasso complete - extracting estimates and paths
## All models ran successfully
## user system elapsed
## 25.36 0.34 28.33
We can use the function clussopretty()
to create a nice table for the number of clusters detected:
clussopretty(resreal, analysis="both", model="poisson",clusteridentify=FALSE)
## model analysistype numclust.AIC numclust.AICc numclust.BIC
## 1 Poisson Space 8 8 4
## 2 Poisson Space-Time 8 8 4
## 3 Quasi-Poisson Space 4 4 3
## 4 Quasi-Poisson Space-Time 4 4 3
With the argument clusteridentify=TRUE
, centroid indices (by time period) are extracted. Since these lists are very long, output is left to the user to explore:
#identify cluster IDs
clussopretty(resreal, analysis="both", model="poisson",clusteridentify=TRUE)
#identify cluster IDs where RR is greater or equal to 1.05
clussopretty(resreal, analysis="both", model="poisson", clusteridentify=TRUE, clusterRR = 1.05)
#identify cluster IDS that differ from the background rate
clussopretty(resreal, analysis="both", model="poisson", clusteridentify=TRUE, clusterRR = "background")
We can plot the output using clussoplot()
to visualize the coefficient paths, where each line is the path for a potential cluster and the final selected clusters are identified with dashed vertical lines:
clussoplot(resreal, analysis="both",model="poisson", Time=5, cv=FALSE)
We continue with the same data set as above, but now explore results by \(k\)-fold cross-validation. All initial inputs are assumed to be the same. We specify a total of 5 folds using the argument cv=5
in the clusso()
function.
#perform clusso using cv (5 folds)
system.time(rescv <- clusso(jbc, expected = expdeath, observed=death,timeperiod = period,
id=id, covars=FALSE, x= x,y = y, rMax = rMax, utm=TRUE,
analysis="both", model="poisson",maxclust=11, cv=5))
## Running cross-validation method for path selection. For AIC, AICc, and BIC, set `cv = NULL`
## Running Poisson both spatial and spatio-temporal model(s).
## Creating radius-based potential clusters
## No covariates found
## Number of potential clusters to scan through: 66870
## Path selection: cross-validation
## No covariates found
## Number of potential clusters to scan through: 66870
## Path selection: cross-validation
## No covariates found
## Number of potential clusters to scan through: 66870
## Path selection: cross-validation
## No covariates found
## Number of potential clusters to scan through: 66870
## Path selection: cross-validation
## All models ran successfully
## user system elapsed
## 80.56 0.86 82.15
clussopretty(rescv, analysis="both", model="poisson",clusteridentify=FALSE, cv=TRUE)
## model analysistype numclust.cv
## 1 Poisson Space 7
## 2 Poisson Space-Time 13
## 3 Quasi-Poisson Space 7
## 4 Quasi-Poisson Space-Time 13
We continue with the Japanese breast cancer data set imported above, but now consider covariates in the model that will be unpenalized.
We explore the first 6 observations of the jbc
dataframe using the head()
function.
#Set-up
head(jbc)
## id period death expdeath covar1 covar2 covar3 covar4 covar5
## 1 9201 7906 35 36.75088 458.5877 0.01894360 34 1 0
## 2 9201 7922 53 41.78682 470.2921 0.08078908 40 1 0
## 3 9201 7938 41 47.41040 490.6435 0.02267590 32 0 1
## 4 9201 7954 66 53.36025 501.6789 0.03259346 35 0 0
## 5 9201 7970 73 58.70772 531.8578 0.01600942 32 0 1
## 6 9202 7906 15 19.77900 521.2707 0.05811436 31 0 0
There are 5 additional covariates, covar1
- covar5
that we will want to adjust for in our Poisson or quasi-Poisson regression as unpenalized terms.
We now set the argument covars=TRUE
to tell clusso()
to identify the covariates and to include them as unpenalized terms:
system.time(rescovars<- clusso(df=jbc, expected = expdeath, observed=death,
timeperiod = period, id=id, covars=TRUE,
x= x,y = y, rMax = rMax, utm=TRUE, analysis="both",
model="poisson",maxclust=11,overdispfloor = TRUE,
cv=NULL, collapsetime = FALSE))
## Running Poisson both spatial and spatio-temporal model(s).
## Creating radius-based potential clusters
## Running with covariates
## Number of potential clusters to scan through: 66870
## Path selection: information criteria
## Lasso complete - extracting estimates and paths
## Overdispersion estimate: 2.0996
## Running with covariates
## Number of potential clusters to scan through: 66870
## Path selection: information criteria
## Lasso complete - extracting estimates and paths
## Running with covariates
## Number of potential clusters to scan through: 66870
## Path selection: information criteria
## Lasso complete - extracting estimates and paths
## Overdispersion estimate: 2.0996
## Running with covariates
## Number of potential clusters to scan through: 66870
## Path selection: information criteria
## Lasso complete - extracting estimates and paths
## All models ran successfully
## user system elapsed
## 22.63 0.28 22.97
Use clussopretty()
to create a nice table of clusters detected:
clussopretty(rescovars, analysis="both", model="poisson",clusteridentify = FALSE)
## model analysistype numclust.AIC numclust.AICc numclust.BIC
## 1 Poisson Space 7 7 3
## 2 Poisson Space-Time 7 7 3
## 3 Quasi-Poisson Space 7 7 3
## 4 Quasi-Poisson Space-Time 7 7 3
We can also extract the coefficients for the unpenalized terms and undo the log-link by exponentiating:
out <- clussopretty(rescovars, analysis="both", model="poisson",covars=TRUE)
out$table.coefs
## IC model analysistype time_period1 time_period2 time_period3
## 1 (Q)AIC Poisson Space 0 0 0
## 2 (Q)AIC Poisson Space-Time 0 0 0
## 3 (Q)AIC Poisson Space 0 0 0
## 4 (Q)AIC Poisson Space-Time 0 0 0
## 5 (Q)AICc Quasi-Poisson Space 0 0 0
## 6 (Q)AICc Quasi-Poisson Space-Time 0 0 0
## 7 (Q)AICc Quasi-Poisson Space 0 0 0
## 8 (Q)AICc Quasi-Poisson Space-Time 0 0 0
## 9 (Q)BIC Poisson Space 0 0 0
## 10 (Q)BIC Poisson Space-Time 0 0 0
## 11 (Q)BIC Poisson Space 0 0 0
## 12 (Q)BIC Poisson Space-Time 0 0 0
## 13 (Q)AIC Quasi-Poisson Space 0 0 0
## 14 (Q)AIC Quasi-Poisson Space-Time 0 0 0
## 15 (Q)AIC Quasi-Poisson Space 0 0 0
## 16 (Q)AIC Quasi-Poisson Space-Time 0 0 0
## 17 (Q)AICc Poisson Space 0 0 0
## 18 (Q)AICc Poisson Space-Time 0 0 0
## 19 (Q)AICc Poisson Space 0 0 0
## 20 (Q)AICc Poisson Space-Time 0 0 0
## 21 (Q)BIC Quasi-Poisson Space 0 0 0
## 22 (Q)BIC Quasi-Poisson Space-Time 0 0 0
## 23 (Q)BIC Quasi-Poisson Space 0 0 0
## 24 (Q)BIC Quasi-Poisson Space-Time 0 0 0
## time_period4 time_period5 covar1 covar2 covar3 covar4
## 1 0 0 -0.0002554145 -0.01482589 0.003543421 -0.03695785
## 2 0 0 -0.0002554145 -0.01482589 0.003543421 -0.03695785
## 3 0 0 -0.0002554145 -0.01482589 0.003543421 -0.03695785
## 4 0 0 -0.0002554145 -0.01482589 0.003543421 -0.03695785
## 5 0 0 -0.0002554145 -0.01482589 0.003543421 -0.03695785
## 6 0 0 -0.0002554145 -0.01482589 0.003543421 -0.03695785
## 7 0 0 -0.0002554145 -0.01482589 0.003543421 -0.03695785
## 8 0 0 -0.0002554145 -0.01482589 0.003543421 -0.03695785
## 9 0 0 -0.0002017837 -0.01581189 0.002950320 -0.03921182
## 10 0 0 -0.0002017837 -0.01581189 0.002950320 -0.03921182
## 11 0 0 -0.0002017837 -0.01581189 0.002950320 -0.03921182
## 12 0 0 -0.0002017837 -0.01581189 0.002950320 -0.03921182
## 13 0 0 -0.0002554145 -0.01482589 0.003543421 -0.03695785
## 14 0 0 -0.0002554145 -0.01482589 0.003543421 -0.03695785
## 15 0 0 -0.0002554145 -0.01482589 0.003543421 -0.03695785
## 16 0 0 -0.0002554145 -0.01482589 0.003543421 -0.03695785
## 17 0 0 -0.0002554145 -0.01482589 0.003543421 -0.03695785
## 18 0 0 -0.0002554145 -0.01482589 0.003543421 -0.03695785
## 19 0 0 -0.0002554145 -0.01482589 0.003543421 -0.03695785
## 20 0 0 -0.0002554145 -0.01482589 0.003543421 -0.03695785
## 21 0 0 -0.0002017837 -0.01581189 0.002950320 -0.03921182
## 22 0 0 -0.0002017837 -0.01581189 0.002950320 -0.03921182
## 23 0 0 -0.0002017837 -0.01581189 0.002950320 -0.03921182
## 24 0 0 -0.0002017837 -0.01581189 0.002950320 -0.03921182
## covar5
## 1 -0.05838234
## 2 -0.05838234
## 3 -0.05838234
## 4 -0.05838234
## 5 -0.05838234
## 6 -0.05838234
## 7 -0.05838234
## 8 -0.05838234
## 9 -0.05465438
## 10 -0.05465438
## 11 -0.05465438
## 12 -0.05465438
## 13 -0.05838234
## 14 -0.05838234
## 15 -0.05838234
## 16 -0.05838234
## 17 -0.05838234
## 18 -0.05838234
## 19 -0.05838234
## 20 -0.05838234
## 21 -0.05465438
## 22 -0.05465438
## 23 -0.05465438
## 24 -0.05465438
out$table.expcoefs
## IC model analysistype time_period1 time_period2 time_period3
## 1 (Q)AIC Poisson Space 1 1 1
## 2 (Q)AIC Poisson Space-Time 1 1 1
## 3 (Q)AIC Poisson Space 1 1 1
## 4 (Q)AIC Poisson Space-Time 1 1 1
## 5 (Q)AICc Quasi-Poisson Space 1 1 1
## 6 (Q)AICc Quasi-Poisson Space-Time 1 1 1
## 7 (Q)AICc Quasi-Poisson Space 1 1 1
## 8 (Q)AICc Quasi-Poisson Space-Time 1 1 1
## 9 (Q)BIC Poisson Space 1 1 1
## 10 (Q)BIC Poisson Space-Time 1 1 1
## 11 (Q)BIC Poisson Space 1 1 1
## 12 (Q)BIC Poisson Space-Time 1 1 1
## 13 (Q)AIC Quasi-Poisson Space 1 1 1
## 14 (Q)AIC Quasi-Poisson Space-Time 1 1 1
## 15 (Q)AIC Quasi-Poisson Space 1 1 1
## 16 (Q)AIC Quasi-Poisson Space-Time 1 1 1
## 17 (Q)AICc Poisson Space 1 1 1
## 18 (Q)AICc Poisson Space-Time 1 1 1
## 19 (Q)AICc Poisson Space 1 1 1
## 20 (Q)AICc Poisson Space-Time 1 1 1
## 21 (Q)BIC Quasi-Poisson Space 1 1 1
## 22 (Q)BIC Quasi-Poisson Space-Time 1 1 1
## 23 (Q)BIC Quasi-Poisson Space 1 1 1
## 24 (Q)BIC Quasi-Poisson Space-Time 1 1 1
## time_period4 time_period5 covar1 covar2 covar3 covar4 covar5
## 1 1 1 0.9997446 0.9852835 1.003550 0.9637168 0.9432892
## 2 1 1 0.9997446 0.9852835 1.003550 0.9637168 0.9432892
## 3 1 1 0.9997446 0.9852835 1.003550 0.9637168 0.9432892
## 4 1 1 0.9997446 0.9852835 1.003550 0.9637168 0.9432892
## 5 1 1 0.9997446 0.9852835 1.003550 0.9637168 0.9432892
## 6 1 1 0.9997446 0.9852835 1.003550 0.9637168 0.9432892
## 7 1 1 0.9997446 0.9852835 1.003550 0.9637168 0.9432892
## 8 1 1 0.9997446 0.9852835 1.003550 0.9637168 0.9432892
## 9 1 1 0.9997982 0.9843125 1.002955 0.9615470 0.9468123
## 10 1 1 0.9997982 0.9843125 1.002955 0.9615470 0.9468123
## 11 1 1 0.9997982 0.9843125 1.002955 0.9615470 0.9468123
## 12 1 1 0.9997982 0.9843125 1.002955 0.9615470 0.9468123
## 13 1 1 0.9997446 0.9852835 1.003550 0.9637168 0.9432892
## 14 1 1 0.9997446 0.9852835 1.003550 0.9637168 0.9432892
## 15 1 1 0.9997446 0.9852835 1.003550 0.9637168 0.9432892
## 16 1 1 0.9997446 0.9852835 1.003550 0.9637168 0.9432892
## 17 1 1 0.9997446 0.9852835 1.003550 0.9637168 0.9432892
## 18 1 1 0.9997446 0.9852835 1.003550 0.9637168 0.9432892
## 19 1 1 0.9997446 0.9852835 1.003550 0.9637168 0.9432892
## 20 1 1 0.9997446 0.9852835 1.003550 0.9637168 0.9432892
## 21 1 1 0.9997982 0.9843125 1.002955 0.9615470 0.9468123
## 22 1 1 0.9997982 0.9843125 1.002955 0.9615470 0.9468123
## 23 1 1 0.9997982 0.9843125 1.002955 0.9615470 0.9468123
## 24 1 1 0.9997982 0.9843125 1.002955 0.9615470 0.9468123
out$table.clusters
## model analysistype numclust.AIC numclust.AICc numclust.BIC
## 1 Poisson Space 7 7 3
## 2 Poisson Space-Time 7 7 3
## 3 Quasi-Poisson Space 7 7 3
## 4 Quasi-Poisson Space-Time 7 7 3