Skip to contents

Emerging hot spot Analysis (EHSA) is a technique that falls under exploratory spatial data analysis (ESDA). It combines the traditional ESDA technique of hot spot analysis using the Getis-Ord Gi* statistic with the traditional time-series Mann-Kendall test for monotonic trends.

The goal of EHSA is to evaluate how hot and cold spots are changing over time. It helps us answer the questions: are they becoming increasingly hotter, are they cooling down, or are they staying the same?

In brief, EHSA works by calculating the Gi* for each time period. The series of Gi* at each location is treated as a time-series and evaluated for a trend using the Mann-Kendall statistic. The Gi* and the Mann-Kendall are compared together to create 17 unique classifications to help better understand how the locations have changed over time.

In this vignette we walk through the Getis-Ord Gi*, the Mann-Kendall, and how the two work together to in EHSA.

Gettis-Ord Gi*

The Gettis-Ord Gi and Gi* (pronounced gee-eye-star) are one of the earliest LISAs. The Gi and Gi* measures are typically reported as a Z-score where high values indicate a high-high cluster and negative Z-scores indicate a low-low cluster. There are no high-low and low-high classifications like the local Moran.

“The Gi statistic consist of a ratio of the weighted average of the values in the neighboring locations, to the sum of all values, not including the value at the location (\(x_i\))” (Local Spatial Autocorrelation (2), GeoDa Center).

\[ G_i = \frac{\sum_{j \ne i}W_{ij}X_j} {\sum_{j \ne i}X_j} \]

The Gi* statistic includes the focal (or self, or ith) observation in the neighborhood.

\[ G_i* = \frac{\sum_{j}W_{ij}X_j} {\sum_{j}X_j} \]

Calculating the local Gi*

To calculate the local Gi* using sfdep, we have to be especially aware of the neighbors list when we create it. By default, when we create a neighbors list, we exclude the self from the neighbors list because it seems a little silly to say “I am my own neighbor.” However, if we want to calculate the local Gi*, we must be sure to explicitly add it using include_self()

Here, we follow the example laid out by the GeoDa center documentation to calculate and plot the local Gi* statistic for donations using the Guerry dataset.

First we create a neighbor list ensuring that the self is included and then create the weights list from the new neighbors list.

library(sfdep)
library(dplyr)

guerry_nb <- guerry |> 
  mutate(
    nb = include_self(st_contiguity(geometry)),
         wt = st_weights(nb)
    ) 

Following, we calculate the local Gi* using local_gstar_perm() on the donations column which creates a new data frame column called gi_star. We then unnest it using tidyr::unnest().

donat_gistar <- guerry_nb |> 
  transmute(gi_star = local_gstar_perm(donations, nb, wt, nsim = 199)) |> 
  tidyr::unnest(gi_star)

donat_gistar
#> Simple feature collection with 85 features and 8 fields
#> Geometry type: MULTIPOLYGON
#> Dimension:     XY
#> Bounding box:  xmin: 47680 ymin: 1703258 xmax: 1031401 ymax: 2677441
#> CRS:           NA
#> # A tibble: 85 × 9
#>    gi_star   e_gi     var_gi p_value  p_sim p_folded_sim skewness kurtosis
#>      <dbl>  <dbl>      <dbl>   <dbl>  <dbl>        <dbl>    <dbl>    <dbl>
#>  1  -1.49  0.0113 0.0000136   -1.38  0.167          0.06    0.03     1.10 
#>  2  -0.482 0.0118 0.00000938  -0.502 0.616          0.67    0.335    0.733
#>  3   0.652 0.0126 0.00000716   0.433 0.665          0.61    0.305    0.731
#>  4  -1.65  0.0101 0.00000891  -1.47  0.141          0.08    0.04     0.760
#>  5  -1.08  0.0117 0.0000149   -1.14  0.252          0.12    0.06     0.957
#>  6  -2.38  0.0112 0.0000113   -1.87  0.0614         0.02    0.01     0.688
#>  7  -0.362 0.0118 0.0000136   -0.414 0.679          0.8     0.4      0.882
#>  8  -0.720 0.0102 0.0000109   -0.429 0.668          0.75    0.375    0.627
#>  9  -1.45  0.0111 0.00000995  -1.33  0.185          0.13    0.065    0.770
#> 10  -1.32  0.0106 0.00000942  -1.05  0.295          0.25    0.125    1.02 
#> # ℹ 75 more rows
#> # ℹ 1 more variable: geometry <MULTIPOLYGON>

Lastly, we classify the clusters using a combination of mutate() and case_when() which is then piped into a ggplot map. While not a perfect recreation of the GeoDa map, it is very close—the differences likely due to conditional permutation (see conditional permutation vignette for more on significance calculation).

library(ggplot2)

donat_gistar |> 
  mutate(cluster = case_when(
    p_folded_sim > 0.05 ~ "Not Significant",
    p_folded_sim <= 0.05 & gi_star < 0 ~ "Low",
    p_folded_sim <= 0.05 & gi_star > 0 ~ "High"
  )) |> 
  ggplot(aes(fill = cluster)) +
  geom_sf(lwd = 0.2, color = "black") +
  scale_fill_manual(values = c("High" = "red",
                               "Low" = "Blue", 
                               "Not Significant" = "white")) +
  theme_void()

In EHSA, we calculate this statistic for each time period.

Mann-Kendall Test

With the calculations of the local Gi* for each unit of time for every geography, we can evaluate how the hotspots change over time. We incorporate time-series analysis through the use of the Mann-Kendall (MK) Trend test. The MK test is a nonparametric test that checks for monotonic trends.

If you don’t recall the term monotonic from your calculus class, that’s okay. A monotonic series or function is one that only increases (or decreases) and never changes direction. So long as the function either stays flat or continues to increase, it is monotonic.

The benefit of the test being nonparametric is that the trend can be non-linear. And further, the MK test does not check if a time-series is strictly monotonic—e.g. every subsequent unit is increasing or decreasing—but if the overall series is increasing or decreasing.

Below is an example of a monotonic upward trend which is not linear.

series <- c(0, 1, 1.4, 1.5, 1.6, 5, 5, 5.5, 8, seq(8, 8.5, length.out = 5))

plot(series, type = "l")

For a more realistic example, let’s take a look a the economics dataset from ggplot2 which contains data from FRED. We’ll use the psavert variable which contains data about personal savings rates in the United States.

econ <- ggplot2::economics

ggplot(econ, aes(date, psavert)) +
  geom_line() +
  theme_light()

From a visual inspection, we can tell that overall the trend is decreasing. The MK test can confirm this. We use Kendall::MannKendall() to test if the trend is significantly decreasing.

Kendall::MannKendall(econ$psavert)
#> tau = -0.644, 2-sided pvalue =< 2.22e-16

The above results show the tau and p-value. Tau ranges between -1 and 1 where -1 is a perfectly decreasing series and 1 is a perfectly increasing series. The p-value is extremely small indicating that the null-hypothesis (that there is no trend) can be rejected. The tau is -0.64 which indicates a moderate downward trend.

The whole game

We’ve gone over briefly the local Gi* and the Mann-Kendall Test. EHSA combines the both of these to evaluate if there are trends in hot or cold spots over time.

EHSA utilizes a spacetime cube, or spatio-temporal full grid. Please see spacetime vignette for more on spacetime representations. In short, a spacetime cube contains a regular time-series for each location in a dataset—i.e. there are \(n \times m\) observations where n is the number of locations and m is the number of times.

For each time-slice (i.e. the complete set of geometries for a given time) the local Gi* is calculated. Say we have 10 times, we will have for each i \(G_{i,t=1}^*, \ldots, G_{i,t=10}^*\).

Then, given each location’s time-series of Gi*, they are assessed using the MK test.

A hands on example

We can compute the EHSA manually using sfdep and spacetime classes. In this example we utilize Ecometric Data from the Boston Area Research Initiative (data and documentation). These data are generated from 911 calls in Boston from 2010 - 2020 and aggregated to the Census Block Group (CBG) level. The column value indicates the calculated prevalence which is derived from Exploratory Factor Analysis and population projections.

Data

The data are stored in the internal data file bos-ecometric.csv and bos-ecometric.geojson. We will read these in to df and geo and will create a new spacetime object called bos with them.

Note: 2022-05-31 this data is available only in the development version, so please install with remotes::install_github("josiahparry/sfdep").

# Create objects to store file paths 
df_fp <- system.file("extdata", "bos-ecometric.csv", package = "sfdep")
geo_fp <- system.file("extdata", "bos-ecometric.geojson", package = "sfdep")

# read in data 
df <- readr::read_csv(df_fp, col_types = "ccidD")
geo <- sf::st_read(geo_fp)
#> Reading layer `bos-ecometric' from data source 
#>   `/tmp/RtmpjUxQna/temp_libpath151111eb2be80/sfdep/extdata/bos-ecometric.geojson' 
#>   using driver `GeoJSON'
#> Simple feature collection with 168 features and 1 field
#> Geometry type: POLYGON
#> Dimension:     XY
#> Bounding box:  xmin: -71.19115 ymin: 42.22788 xmax: -70.99445 ymax: 42.3974
#> Geodetic CRS:  NAD83


# Create spacetime object called `bos`
bos <- spacetime(df, geo, 
                 .loc_col = ".region_id",
                 .time_col = "time_period")

bos
#> spacetime ────
#> Context:`data`
#> 168 locations `.region_id`
#> 10 time periods `time_period`
#> ── data context ────────────────────────────────────────────────────────────────
#> # A tibble: 1,680 × 5
#>    .region_id  ecometric  year value time_period
#>  * <chr>       <chr>     <int> <dbl> <date>     
#>  1 25025010405 Guns       2010  0.35 2010-01-01 
#>  2 25025010405 Guns       2011  0.89 2011-01-01 
#>  3 25025010405 Guns       2012  1.2  2012-01-01 
#>  4 25025010405 Guns       2013  0.84 2013-01-01 
#>  5 25025010405 Guns       2014  1.5  2014-01-01 
#>  6 25025010405 Guns       2015  1.15 2015-01-01 
#>  7 25025010405 Guns       2016  1.48 2016-01-01 
#>  8 25025010405 Guns       2017  1.64 2017-01-01 
#>  9 25025010405 Guns       2018  0.49 2018-01-01 
#> 10 25025010405 Guns       2019  0.17 2019-01-01 
#> # ℹ 1,670 more rows

Local Gi*

The first step in EHSA is to calculate the local Gi* for each time period. But that, however, requires a couple of preceding steps. Before we calculate Gi* we need to identify neighbors and weights.

To do this, we activate the geometry context and create two new columns nb and wt. Then we will activate the data context again and copy over the nb and wt columns to each time-slice using set_nbs() and set_wts()row order is very important so do not rearrange the observations after using set_nbs() or set_wts().

bos_nb <- bos |> 
  activate("geometry") |> 
  mutate(
    nb = include_self(st_contiguity(geometry)),
    wt = st_weights(nb)
  ) |> 
  set_nbs("nb") |> 
  set_wts("wt")

head(bos_nb)
#> spacetime ────
#> Context:`data`
#> 168 locations `.region_id`
#> 10 time periods `time_period`
#> ── data context ────────────────────────────────────────────────────────────────
#> # A tibble: 6 × 7
#>   .region_id  ecometric  year value time_period nb        wt       
#>   <chr>       <chr>     <int> <dbl> <date>      <list>    <list>   
#> 1 25025010405 Guns       2010  0.35 2010-01-01  <int [9]> <dbl [9]>
#> 2 25025010404 Guns       2010  0    2010-01-01  <int [4]> <dbl [4]>
#> 3 25025010801 Guns       2010  0    2010-01-01  <int [5]> <dbl [5]>
#> 4 25025010702 Guns       2010  0.46 2010-01-01  <int [6]> <dbl [6]>
#> 5 25025010204 Guns       2010  0    2010-01-01  <int [4]> <dbl [4]>
#> 6 25025010802 Guns       2010  0    2010-01-01  <int [5]> <dbl [5]>

This dataset now has neighbors and weights for each time-slice. We can use these new columns to manually calculate the local Gi* for each location. We can do this by grouping by time_period and using local_gstar_perm(). After which, we unnest the new data frame column gi_star.

gi_stars <- bos_nb |> 
  group_by(time_period) |> 
  mutate(gi_star = local_gstar_perm(value, nb, wt)) |> 
  tidyr::unnest(gi_star)


gi_stars |> 
  select(.region_id, time_period, gi_star, p_folded_sim)
#> # A tibble: 1,680 × 4
#> # Groups:   time_period [10]
#>    .region_id  time_period gi_star p_folded_sim
#>    <chr>       <date>        <dbl>        <dbl>
#>  1 25025010405 2010-01-01   -0.987        0.276
#>  2 25025010404 2010-01-01   -1.37         0.348
#>  3 25025010801 2010-01-01   -1.48         0.208
#>  4 25025010702 2010-01-01   -0.863        0.304
#>  5 25025010204 2010-01-01   -1.32         0.392
#>  6 25025010802 2010-01-01   -1.48         0.236
#>  7 25025010104 2010-01-01   -1.68         0.1  
#>  8 25025000703 2010-01-01   -1.44         0.288
#>  9 25025000504 2010-01-01   -1.42         0.076
#> 10 25025000704 2010-01-01   -1.62         0.152
#> # ℹ 1,670 more rows

With these Gi* measures we can then evaluate each location for a trend using the Mann-Kendall test.

Mann-Kendall Test

Let’s first take only a single location, 25025010403, and evaluate the Mann-Kendall statistic for it by hand. Then we can extend this to each location.

Note that there is a bit of finagling the MK test to make it work nicely in a dplyr::summarise() call.

cbg <- gi_stars |> 
  ungroup() |> 
  filter(.region_id == "25025010403") |> 
  select(.region_id, time_period, year, gi_star)

We can also visualized this trend.

# visualize trend
ggplot(cbg, aes(time_period, gi_star)) +
  geom_line() +
  theme_light()

cbg |> 
  summarise(mk = list(unclass(Kendall::MannKendall(gi_star)))) |> 
  tidyr::unnest_wider(mk)
#> # A tibble: 1 × 5
#>     tau    sl     S     D  varS
#>   <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 0.378 0.152    17    45   125

In the above result, sl is the p-value. This result tells us that there is a slight upward but insignificant trend.

We can replicate this for each location by using group_by().

ehsa_manual <- gi_stars |> 
  group_by(.region_id) |> 
  summarise(mk = list(unclass(Kendall::MannKendall(gi_star)))) |> 
  tidyr::unnest_wider(mk)

# arrange to show significant emerging hot/cold spots
emerging <- ehsa_manual |> 
  arrange(sl, abs(tau)) |> 
  slice(1:5)

# show top 5 locations
emerging
#> # A tibble: 5 × 6
#>   .region_id     tau        sl     S     D  varS
#>   <chr>        <dbl>     <dbl> <dbl> <dbl> <dbl>
#> 1 25025120201 -1     0.0000830   -45    45   125
#> 2 25025090700  0.822 0.00128      37    45   125
#> 3 25025060600  0.733 0.00421      33    45   125
#> 4 25025100601 -0.733 0.00421     -33    45   125
#> 5 25025000201 -0.689 0.00729     -31    45   125

Here we can see there are a few locations that are becoming significantly colder and others that are becoming significantly warmer.

Using emerging_hotspot_analysis()

While we can do the calculations manually as above, this is limited in two ways. Primarily that in the above example we used spatial neighbors only. Whereas in EHSA we can—and likely should—incorporate the time-lag of our spatial neighbors. Secondly, there are classifications proposed by ESRI which help us understand how each location is changing over time. Both of these are handled by the emerging_hotspot_analysis() function.

This emerging_hotspot_analysis() takes a spacetime object x, and the quoted name of the variable of interested in .var at minimum. We can specify the number of time lags using the argument k which is set to 1 by default.

# conduct EHSA
ehsa <- emerging_hotspot_analysis(
  x = bos, 
  .var = "value", 
  k = 1, 
  nsim = 99
)

# preview some values 
head(ehsa)
#> # A tibble: 6 × 4
#>   location       tau p_value classification     
#>   <chr>        <dbl>   <dbl> <chr>              
#> 1 25025010405  0.111  0.721  sporadic coldspot  
#> 2 25025010404 -0.333  0.210  sporadic coldspot  
#> 3 25025010801 -0.200  0.474  sporadic coldspot  
#> 4 25025010702 -0.600  0.0200 sporadic coldspot  
#> 5 25025010204 -0.467  0.0736 persistent coldspot
#> 6 25025010802 -0.333  0.210  sporadic coldspot

# evaluate the classifications of our hotspots
count(ehsa, classification)
#> # A tibble: 8 × 2
#>   classification           n
#>   <chr>                <int>
#> 1 consecutive hotspot     17
#> 2 intensifying hotspot     3
#> 3 no pattern detected     20
#> 4 oscilating hotspot       1
#> 5 persistent coldspot     11
#> 6 persistent hotspot       7
#> 7 sporadic coldspot       83
#> 8 sporadic hotspot        26