Introduction to clusso

Introduction

Please report any issues here.

Messages from clusso have been suppressed for this tutorial, but there are many messages generated that we hope the user will find helpful. Feedback is welcome here.

Load the clusso library:

library("clusso")

If you haven't already, you can install clusso directly from GitHub:

library("devtools")
devtools::install_github("mkamenet3/clusso")

Using clusso

In this example, we will use simulated data of breast cancer incidence counts across 208 prefects (municipalities) in Japan across 5 time periods. These data are based on original incidence counts, but we have added noise to these data.

Prepare Data

We load four data sets that come with the clusso package:

#load data
data("jbc")
data("utmJapan")

To explore the 4 data sets, we apply the head() function to each of the data sets.

  1. utmJapan: data set containing a unique identifier for each prefect centroid (id), x-coordinate in UTM (utmx), and y-coordinate in UTM (utmy).
  2. jbc: data set containing a unique identifier for each centroid (id), period of observation (categorical variable with 5 levels, period), death count (death), expected death count (expdeath). These data are simulated (with some noise) based on observed data.
#inspect
head(utmJapan)
##     id     utmx    utmy
## 1 9201 399786.6 4047756
## 2 9202 360917.0 4023885
## 3 9203 385175.1 4025749
## 4 9204 371603.4 4018172
## 5 9205 388154.2 4047900
## 6 9206 375023.3 4068053
head(jbc)
##     id period death expdeath   covar1     covar2 covar3 covar4 covar5
## 1 9201   7906    35 36.75088 458.5877 0.01894360     34      1      0
## 2 9201   7922    53 41.78682 470.2921 0.08078908     40      1      0
## 3 9201   7938    41 47.41040 490.6435 0.02267590     32      0      1
## 4 9201   7954    66 53.36025 501.6789 0.03259346     35      0      0
## 5 9201   7970    73 58.70772 531.8578 0.01600942     32      0      1
## 6 9202   7906    15 19.77900 521.2707 0.05811436     31      0      0

We first convert period to a factor in the data frame jbc:

jbc$period <- as.factor(jbc$period)

Set global parameters

When using clusso, there are global parameters that need to be set first.

We take the easting and northing coordinates from the utmJapan data set, set the easting coordinate, utmx, to x and the northing coordinate, utmy, to y. Each is divided by 1000 to change the scale from meters to kilometers.

x <- utmJapan$utmx/1000
y <- utmJapan$utmy/1000

Below, we set the maximum radius for a cluster to 20 km (rMax=20) based on scientific knowledge of the area and our definition of a meaningful cluster. We set the number of unique time periods to 5 (Time=5), and set a floor for overdispersion to avoid estimating underdispersion (overdispfloor=TRUE).

rMax <- 20 
Time <- 5
overdispfloor <- TRUE

If you do not have expected counts readily available, they can be calculated using the auxiliary function calcexpected(), from the clusso package.

As an example, we generate some data and apply calcexpected(). The output is a dataframe with an additional variable, expected, which contains the expected counts:

set.seed(2)
period <- rep(c("1","2"), times=5)
observed <- MASS::rnegbin(n = length(period), mu=20, theta=2)
pop <-  MASS::rnegbin(n = length(period), mu=200, theta=2)
ids <- rep(1:5, each=2)

calcexpected(observed, pop, period, ids)
##    ids periods population observed  expected
## 1    1       1        298        5 19.288026
## 2    1       2        112       10  7.249191
## 3    2       1        455        2 29.449838
## 4    2       2         67        7  4.336570
## 5    3       1         87       11  5.631068
## 6    3       2         50       12  3.236246
## 7    4       1        314       10 20.323625
## 8    4       2        165       17 10.679612
## 9    5       1        207       40 13.398058
## 10   5       2         99        6  6.407767

In this data example, we already have expected counts and will use the dataframe jbc.

Fitting using clusso

To perform the regularized spatial and spatio-temporal cluster detection using the Lasso penalty, we use the function clusso(). In the first argument, we set the name of the dataframe, df=jbc. In the second argument we set the expected counts equal to the name of expected counts in the jbc dataframe (expected = expdeath). In the third argument, we set the observed counts equal to the name of the observed counts in the jbc dataframe (observed = death). In the fourth argument, we set the time period equal to the name of the time periods in the jbc dataframe (timeperiod = period). This variable has already been converted to a factor in the data-cleaning step above (jbc$period <- as.factor(jbc$period)). In this example, we do not want to include other covariates as unpenalized terms in the model so we set covars = FALSE.

The next two arguments are x and y, which are the UTM easting and northing coordinates, scaled to be in kilometers. The next argument is the maximum radius (in km) for the potential clusters. We set rMax = rMax, which we defined above as 20km. The argument utm specifies that our coordinates are indeed in UTM coordinates (utm = TRUE). For latitude-longitude coordinates, we would set utm = FALSE. The is analysis argument defines the type of analysis to be performed. Options include space, spacetime, or both. Here, we set analysis equal to "both", which will perform both the spatial and spatio-temporal analyses. The argument model is the type of model to be used. Currently (2020-09-14), either the Poisson or binomial models can be specified. In this example, we specify the Poisson model. The final required argument is maxclust. This is the maximum number of actual clusters expected in the given area. This is based on scientific knowledge of the area. For the Japanese breast cancer data, we set maxclust=11. If the number of identified clusters exceeds maxclust, a warning will encourage you to increase your set maxclust.

The output of clusso() is assigned to the object resreal. resreal is a large list of lists. The recommended way to explore the results is to explore the names(resreal) and select each sub-list of interest.

head(jbc)
##     id period death expdeath   covar1     covar2 covar3 covar4 covar5
## 1 9201   7906    35 36.75088 458.5877 0.01894360     34      1      0
## 2 9201   7922    53 41.78682 470.2921 0.08078908     40      1      0
## 3 9201   7938    41 47.41040 490.6435 0.02267590     32      0      1
## 4 9201   7954    66 53.36025 501.6789 0.03259346     35      0      0
## 5 9201   7970    73 58.70772 531.8578 0.01600942     32      0      1
## 6 9202   7906    15 19.77900 521.2707 0.05811436     31      0      0
system.time(resreal <- clusso(jbc, expected = expdeath, observed=death,timeperiod = period,
                              covars=FALSE, id = id, x= x,y = y, rMax =  rMax, 
                              utm=TRUE, analysis="both", model="poisson",maxclust=11))
## Running Poisson both spatial and spatio-temporal model(s).
## Creating radius-based potential clusters
## No covariates found
## Number of potential clusters to scan through:  66870
## Path selection: information criteria
## Lasso complete - extracting estimates and paths
## Overdispersion estimate: 2.1011
## No covariates found
## Number of potential clusters to scan through:  66870
## Path selection: information criteria
## Lasso complete - extracting estimates and paths
## No covariates found
## Number of potential clusters to scan through:  66870
## Path selection: information criteria
## Lasso complete - extracting estimates and paths
## Overdispersion estimate: 2.1011
## No covariates found
## Number of potential clusters to scan through:  66870
## Path selection: information criteria
## Lasso complete - extracting estimates and paths
## All models ran successfully
##    user  system elapsed 
##   25.36    0.34   28.33

We can use the function clussopretty() to create a nice table for the number of clusters detected:

clussopretty(resreal, analysis="both", model="poisson",clusteridentify=FALSE)
##           model analysistype numclust.AIC numclust.AICc numclust.BIC
## 1       Poisson        Space            8             8            4
## 2       Poisson   Space-Time            8             8            4
## 3 Quasi-Poisson        Space            4             4            3
## 4 Quasi-Poisson   Space-Time            4             4            3

With the argument clusteridentify=TRUE, centroid indices (by time period) are extracted. Since these lists are very long, output is left to the user to explore:

#identify cluster IDs
clussopretty(resreal, analysis="both", model="poisson",clusteridentify=TRUE)
#identify cluster IDs where RR is greater or equal to 1.05
clussopretty(resreal, analysis="both", model="poisson", clusteridentify=TRUE, clusterRR = 1.05)
#identify cluster IDS that differ from the background rate
clussopretty(resreal, analysis="both", model="poisson", clusteridentify=TRUE, clusterRR = "background")

We can plot the output using clussoplot() to visualize the coefficient paths, where each line is the path for a potential cluster and the final selected clusters are identified with dashed vertical lines:

clussoplot(resreal, analysis="both",model="poisson", Time=5, cv=FALSE)

Comparison to Cross-Validation

We continue with the same data set as above, but now explore results by \(k\)-fold cross-validation. All initial inputs are assumed to be the same. We specify a total of 5 folds using the argument cv=5 in the clusso() function.

#perform clusso using cv (5 folds)
system.time(rescv <- clusso(jbc, expected = expdeath, observed=death,timeperiod = period, 
                            id=id, covars=FALSE, x= x,y = y, rMax =  rMax, utm=TRUE, 
                            analysis="both", model="poisson",maxclust=11, cv=5))
## Running cross-validation method for path selection. For AIC, AICc, and BIC, set `cv = NULL`
## Running Poisson both spatial and spatio-temporal model(s).
## Creating radius-based potential clusters
## No covariates found
## Number of potential clusters to scan through:  66870
## Path selection: cross-validation
## No covariates found
## Number of potential clusters to scan through:  66870
## Path selection: cross-validation
## No covariates found
## Number of potential clusters to scan through:  66870
## Path selection: cross-validation
## No covariates found
## Number of potential clusters to scan through:  66870
## Path selection: cross-validation
## All models ran successfully
##    user  system elapsed 
##   80.56    0.86   82.15
clussopretty(rescv, analysis="both", model="poisson",clusteridentify=FALSE, cv=TRUE)
##           model analysistype numclust.cv
## 1       Poisson        Space           7
## 2       Poisson   Space-Time          13
## 3 Quasi-Poisson        Space           7
## 4 Quasi-Poisson   Space-Time          13

Example with Covariates

We continue with the Japanese breast cancer data set imported above, but now consider covariates in the model that will be unpenalized.

We explore the first 6 observations of the jbc dataframe using the head() function.

#Set-up
head(jbc)
##     id period death expdeath   covar1     covar2 covar3 covar4 covar5
## 1 9201   7906    35 36.75088 458.5877 0.01894360     34      1      0
## 2 9201   7922    53 41.78682 470.2921 0.08078908     40      1      0
## 3 9201   7938    41 47.41040 490.6435 0.02267590     32      0      1
## 4 9201   7954    66 53.36025 501.6789 0.03259346     35      0      0
## 5 9201   7970    73 58.70772 531.8578 0.01600942     32      0      1
## 6 9202   7906    15 19.77900 521.2707 0.05811436     31      0      0

There are 5 additional covariates, covar1 - covar5 that we will want to adjust for in our Poisson or quasi-Poisson regression as unpenalized terms.

We now set the argument covars=TRUE to tell clusso() to identify the covariates and to include them as unpenalized terms:

system.time(rescovars<- clusso(df=jbc, expected = expdeath, observed=death,
                               timeperiod = period, id=id, covars=TRUE,
                               x= x,y = y, rMax =  rMax, utm=TRUE, analysis="both", 
                               model="poisson",maxclust=11,overdispfloor = TRUE,
                               cv=NULL, collapsetime = FALSE))
## Running Poisson both spatial and spatio-temporal model(s).
## Creating radius-based potential clusters
## Running with covariates
## Number of potential clusters to scan through:  66870
## Path selection: information criteria
## Lasso complete - extracting estimates and paths
## Overdispersion estimate: 2.0996
## Running with covariates
## Number of potential clusters to scan through:  66870
## Path selection: information criteria
## Lasso complete - extracting estimates and paths
## Running with covariates
## Number of potential clusters to scan through:  66870
## Path selection: information criteria
## Lasso complete - extracting estimates and paths
## Overdispersion estimate: 2.0996
## Running with covariates
## Number of potential clusters to scan through:  66870
## Path selection: information criteria
## Lasso complete - extracting estimates and paths
## All models ran successfully
##    user  system elapsed 
##   22.63    0.28   22.97

Use clussopretty() to create a nice table of clusters detected:

clussopretty(rescovars, analysis="both", model="poisson",clusteridentify = FALSE)
##           model analysistype numclust.AIC numclust.AICc numclust.BIC
## 1       Poisson        Space            7             7            3
## 2       Poisson   Space-Time            7             7            3
## 3 Quasi-Poisson        Space            7             7            3
## 4 Quasi-Poisson   Space-Time            7             7            3

We can also extract the coefficients for the unpenalized terms and undo the log-link by exponentiating:

out <- clussopretty(rescovars, analysis="both", model="poisson",covars=TRUE)
out$table.coefs
##         IC         model analysistype time_period1 time_period2 time_period3
## 1   (Q)AIC       Poisson        Space            0            0            0
## 2   (Q)AIC       Poisson   Space-Time            0            0            0
## 3   (Q)AIC       Poisson        Space            0            0            0
## 4   (Q)AIC       Poisson   Space-Time            0            0            0
## 5  (Q)AICc Quasi-Poisson        Space            0            0            0
## 6  (Q)AICc Quasi-Poisson   Space-Time            0            0            0
## 7  (Q)AICc Quasi-Poisson        Space            0            0            0
## 8  (Q)AICc Quasi-Poisson   Space-Time            0            0            0
## 9   (Q)BIC       Poisson        Space            0            0            0
## 10  (Q)BIC       Poisson   Space-Time            0            0            0
## 11  (Q)BIC       Poisson        Space            0            0            0
## 12  (Q)BIC       Poisson   Space-Time            0            0            0
## 13  (Q)AIC Quasi-Poisson        Space            0            0            0
## 14  (Q)AIC Quasi-Poisson   Space-Time            0            0            0
## 15  (Q)AIC Quasi-Poisson        Space            0            0            0
## 16  (Q)AIC Quasi-Poisson   Space-Time            0            0            0
## 17 (Q)AICc       Poisson        Space            0            0            0
## 18 (Q)AICc       Poisson   Space-Time            0            0            0
## 19 (Q)AICc       Poisson        Space            0            0            0
## 20 (Q)AICc       Poisson   Space-Time            0            0            0
## 21  (Q)BIC Quasi-Poisson        Space            0            0            0
## 22  (Q)BIC Quasi-Poisson   Space-Time            0            0            0
## 23  (Q)BIC Quasi-Poisson        Space            0            0            0
## 24  (Q)BIC Quasi-Poisson   Space-Time            0            0            0
##    time_period4 time_period5        covar1      covar2      covar3      covar4
## 1             0            0 -0.0002554145 -0.01482589 0.003543421 -0.03695785
## 2             0            0 -0.0002554145 -0.01482589 0.003543421 -0.03695785
## 3             0            0 -0.0002554145 -0.01482589 0.003543421 -0.03695785
## 4             0            0 -0.0002554145 -0.01482589 0.003543421 -0.03695785
## 5             0            0 -0.0002554145 -0.01482589 0.003543421 -0.03695785
## 6             0            0 -0.0002554145 -0.01482589 0.003543421 -0.03695785
## 7             0            0 -0.0002554145 -0.01482589 0.003543421 -0.03695785
## 8             0            0 -0.0002554145 -0.01482589 0.003543421 -0.03695785
## 9             0            0 -0.0002017837 -0.01581189 0.002950320 -0.03921182
## 10            0            0 -0.0002017837 -0.01581189 0.002950320 -0.03921182
## 11            0            0 -0.0002017837 -0.01581189 0.002950320 -0.03921182
## 12            0            0 -0.0002017837 -0.01581189 0.002950320 -0.03921182
## 13            0            0 -0.0002554145 -0.01482589 0.003543421 -0.03695785
## 14            0            0 -0.0002554145 -0.01482589 0.003543421 -0.03695785
## 15            0            0 -0.0002554145 -0.01482589 0.003543421 -0.03695785
## 16            0            0 -0.0002554145 -0.01482589 0.003543421 -0.03695785
## 17            0            0 -0.0002554145 -0.01482589 0.003543421 -0.03695785
## 18            0            0 -0.0002554145 -0.01482589 0.003543421 -0.03695785
## 19            0            0 -0.0002554145 -0.01482589 0.003543421 -0.03695785
## 20            0            0 -0.0002554145 -0.01482589 0.003543421 -0.03695785
## 21            0            0 -0.0002017837 -0.01581189 0.002950320 -0.03921182
## 22            0            0 -0.0002017837 -0.01581189 0.002950320 -0.03921182
## 23            0            0 -0.0002017837 -0.01581189 0.002950320 -0.03921182
## 24            0            0 -0.0002017837 -0.01581189 0.002950320 -0.03921182
##         covar5
## 1  -0.05838234
## 2  -0.05838234
## 3  -0.05838234
## 4  -0.05838234
## 5  -0.05838234
## 6  -0.05838234
## 7  -0.05838234
## 8  -0.05838234
## 9  -0.05465438
## 10 -0.05465438
## 11 -0.05465438
## 12 -0.05465438
## 13 -0.05838234
## 14 -0.05838234
## 15 -0.05838234
## 16 -0.05838234
## 17 -0.05838234
## 18 -0.05838234
## 19 -0.05838234
## 20 -0.05838234
## 21 -0.05465438
## 22 -0.05465438
## 23 -0.05465438
## 24 -0.05465438
out$table.expcoefs
##         IC         model analysistype time_period1 time_period2 time_period3
## 1   (Q)AIC       Poisson        Space            1            1            1
## 2   (Q)AIC       Poisson   Space-Time            1            1            1
## 3   (Q)AIC       Poisson        Space            1            1            1
## 4   (Q)AIC       Poisson   Space-Time            1            1            1
## 5  (Q)AICc Quasi-Poisson        Space            1            1            1
## 6  (Q)AICc Quasi-Poisson   Space-Time            1            1            1
## 7  (Q)AICc Quasi-Poisson        Space            1            1            1
## 8  (Q)AICc Quasi-Poisson   Space-Time            1            1            1
## 9   (Q)BIC       Poisson        Space            1            1            1
## 10  (Q)BIC       Poisson   Space-Time            1            1            1
## 11  (Q)BIC       Poisson        Space            1            1            1
## 12  (Q)BIC       Poisson   Space-Time            1            1            1
## 13  (Q)AIC Quasi-Poisson        Space            1            1            1
## 14  (Q)AIC Quasi-Poisson   Space-Time            1            1            1
## 15  (Q)AIC Quasi-Poisson        Space            1            1            1
## 16  (Q)AIC Quasi-Poisson   Space-Time            1            1            1
## 17 (Q)AICc       Poisson        Space            1            1            1
## 18 (Q)AICc       Poisson   Space-Time            1            1            1
## 19 (Q)AICc       Poisson        Space            1            1            1
## 20 (Q)AICc       Poisson   Space-Time            1            1            1
## 21  (Q)BIC Quasi-Poisson        Space            1            1            1
## 22  (Q)BIC Quasi-Poisson   Space-Time            1            1            1
## 23  (Q)BIC Quasi-Poisson        Space            1            1            1
## 24  (Q)BIC Quasi-Poisson   Space-Time            1            1            1
##    time_period4 time_period5    covar1    covar2   covar3    covar4    covar5
## 1             1            1 0.9997446 0.9852835 1.003550 0.9637168 0.9432892
## 2             1            1 0.9997446 0.9852835 1.003550 0.9637168 0.9432892
## 3             1            1 0.9997446 0.9852835 1.003550 0.9637168 0.9432892
## 4             1            1 0.9997446 0.9852835 1.003550 0.9637168 0.9432892
## 5             1            1 0.9997446 0.9852835 1.003550 0.9637168 0.9432892
## 6             1            1 0.9997446 0.9852835 1.003550 0.9637168 0.9432892
## 7             1            1 0.9997446 0.9852835 1.003550 0.9637168 0.9432892
## 8             1            1 0.9997446 0.9852835 1.003550 0.9637168 0.9432892
## 9             1            1 0.9997982 0.9843125 1.002955 0.9615470 0.9468123
## 10            1            1 0.9997982 0.9843125 1.002955 0.9615470 0.9468123
## 11            1            1 0.9997982 0.9843125 1.002955 0.9615470 0.9468123
## 12            1            1 0.9997982 0.9843125 1.002955 0.9615470 0.9468123
## 13            1            1 0.9997446 0.9852835 1.003550 0.9637168 0.9432892
## 14            1            1 0.9997446 0.9852835 1.003550 0.9637168 0.9432892
## 15            1            1 0.9997446 0.9852835 1.003550 0.9637168 0.9432892
## 16            1            1 0.9997446 0.9852835 1.003550 0.9637168 0.9432892
## 17            1            1 0.9997446 0.9852835 1.003550 0.9637168 0.9432892
## 18            1            1 0.9997446 0.9852835 1.003550 0.9637168 0.9432892
## 19            1            1 0.9997446 0.9852835 1.003550 0.9637168 0.9432892
## 20            1            1 0.9997446 0.9852835 1.003550 0.9637168 0.9432892
## 21            1            1 0.9997982 0.9843125 1.002955 0.9615470 0.9468123
## 22            1            1 0.9997982 0.9843125 1.002955 0.9615470 0.9468123
## 23            1            1 0.9997982 0.9843125 1.002955 0.9615470 0.9468123
## 24            1            1 0.9997982 0.9843125 1.002955 0.9615470 0.9468123
out$table.clusters
##           model analysistype numclust.AIC numclust.AICc numclust.BIC
## 1       Poisson        Space            7             7            3
## 2       Poisson   Space-Time            7             7            3
## 3 Quasi-Poisson        Space            7             7            3
## 4 Quasi-Poisson   Space-Time            7             7            3