The image processing and analysis workflow consists of the following steps.

Step 1. Set up your environment
Step 2. Prepare training and test data sets
Step 3. Perform exploratory data analysis (EDA)
Step 4. Train and evaluate the RF model
Step 5. Compute local interpretable model-agnostic explanations (LIME)

Step 1. Set up your environment

Before getting started, install and load the packages that we will use in this blog post.

rm(list=ls())
# Load the following packages
library(rgdal) # provides an interface to the Geospatial Data Abstraction Library (GDAL)         
library(raster) # provides important functions for importing and handling raster data
library(mapview)  # provides functions to visualize geospatial data
library(sf) # represents simple features as native R objects
library(randomForest) # implements Breiman’s RF algorithm
library(dplyr) # provides basic data transformation functions
library(ggplot2) # provides extension of visualizations
library(corrplot) # provides a graphical display of a correlation matrix
library(gridExtra) # provides user-level functions to arrange multiple grid-based plots on a page
library(tmaptools) # provides a set of tools for reading and processing spatial data
library(tmap) # provides a flexible and easy to use approach to create thematic maps
library(plotly) # creates interactive web-based graphs
library(lime)   #  provides local interpretation for ML models 
library(caret) # provides procedures for training and evaluating machine learning models
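
If any of these packages are missing, install them from CRAN first. Here is a minimal convenience sketch (the package vector simply mirrors the library() calls above):

# Install any packages that are not yet available (run once)
pkgs <- c("rgdal", "raster", "mapview", "sf", "randomForest", "dplyr", "ggplot2",
          "corrplot", "gridExtra", "tmaptools", "tmap", "plotly", "lime", "caret")
install.packages(pkgs[!pkgs %in% installed.packages()[, "Package"]])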

Next, set the working directory. Note that this is the directory (folder) that contains the raster variables and the training data.

#  set up the working directory
setwd("/home/kamusoko/Documents/Projects/Zimbabwe/Zimbabwe_Forest_Mapping/Zimbabwe_Landcover_2020/Provinces/Midlands/Mufungabusi_FR/Classification/Sentinel-2")

# Load the raster layers that will be used for modeling
VV_dry <- brick("Muf_S1_median_Asc_VV_Jul_Oct_21.tif")
VH_dry <- brick("Muf_S1_median_Asc_VH_Jul_Oct_21.tif")
Forest_height <- brick("Forest_height.tif")
Elevation <- brick("Elevation.tif")
Slope <- brick("Muf_Slope.tif") # note: Slope is not included in the stack below
LSTemp <- brick("LST_Jul_Oct2021.tif")
NDVI <- brick("S2_MedianNDVI_DS_21.tif")
NDWI <- brick("S2_MedianNDWI_DS_21.tif")
CSI <- brick("S2_MedianCSI_DS_21.tif")
dNBR <- brick("dNBR.tif")

# Combine or stack the raster layers.
rvars <- stack(VV_dry, VH_dry, Forest_height, Elevation, LSTemp, NDVI, NDWI, CSI, dNBR)

# Create an object "ta_data" and import the training data
ta_data <- readOGR(getwd(), "Muf_Forest_Fire_Risk_2021")
## OGR data source with driver: ESRI Shapefile 
## Source: "/home/kamusoko/Documents/Projects/Zimbabwe/Zimbabwe_Forest_Mapping/Zimbabwe_Landcover_2020/Provinces/Midlands/Mufungabusi_FR/Classification/Sentinel-2", layer: "Muf_Forest_Fire_Risk_2021"
## with 50 features
## It has 2 fields
## Integer64 fields read as strings:  id
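
Note that the rgdal package has since been retired from CRAN. If readOGR() is unavailable in your setup, a sketch of an equivalent import using sf (converting back to a Spatial object so the rest of the sp-based workflow is unchanged) is:

# Alternative: read the shapefile with sf and coerce to a Spatial object
ta_data <- as(st_read(getwd(), "Muf_Forest_Fire_Risk_2021"), "Spatial")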

Check the attributes of the raster variables (rvars) and the training data.

# Check the raster variables and training data
print(rvars)
## class      : RasterStack 
## dimensions : 3115, 6567, 20456205, 9  (nrow, ncol, ncell, nlayers)
## resolution : 10, 10  (x, y)
## extent     : 658990, 724660, 7945640, 7976790  (xmin, xmax, ymin, ymax)
## crs        : +proj=utm +zone=35 +south +datum=WGS84 +units=m +no_defs 
## names      : VV, VH,  b1, Elevation, LST_Jul_Oct2021, S2_MedianNDVI_DS_21, S2_MedianNDWI_DS_21, S2_MedianCSI_DS_21, nd 
## min values :  ?,  ?,   0,         ?,               ?,                   ?,                   ?,                  ?,  ? 
## max values :  ?,  ?, 255,         ?,               ?,                   ?,                   ?,                  ?,  ?
summary(ta_data)
## Object of class SpatialPolygonsDataFrame
## Coordinates:
##         min       max
## x  659472.3  724236.2
## y 7946454.9 7976554.9
## Is projected: TRUE 
## proj4string :
## [+proj=utm +zone=35 +south +datum=WGS84 +units=m +no_defs]
## Data attributes:
##        id          Class   
##  0      : 1   Fire    :23  
##  1      : 1   Non_fire:27  
##  10     : 1                
##  11     : 1                
##  12     : 1                
##  13     : 1                
##  (Other):44

The training data has 23 fire and 27 non-fire polygons. Let’s create a balanced training data set.

We will balance the classes in two steps. First, create a vector of the unique class values.

# Create a vector of unique class values
uniqueClasses <- unique(ta_data$Class)
uniqueClasses
## [1] Fire     Non_fire
## Levels: Fire Non_fire

Second, sample the same number of random points inside the polygons of each class.

# Set the seed
set.seed(27)
for (i in 1:length(uniqueClasses)) {
  # Select only polygons from the current class
  class_data <- subset(ta_data, Class == uniqueClasses[i])
  # Sample random points inside these polygons
  classpts <- spsample(class_data, type = "random", n = 200)
  # Add a class column to the SpatialPoints object
  classpts$class <- rep(uniqueClasses[i], length(classpts))
  if (i == 1) {
    xy <- classpts
  } else {
    xy <- rbind(xy, classpts)
  }
}
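
If you prefer the sf workflow, the same balanced sampling can be sketched with st_sample(). Note that this is only an illustration: for polygon geometries, st_sample() may return slightly more or fewer points than requested.

# Sketch: balanced random sampling per class with sf (alternative to spsample)
ta_sf <- st_as_sf(ta_data)
xy_sf <- do.call(rbind, lapply(levels(ta_sf$Class), function(cl) {
  pts <- st_sample(ta_sf[ta_sf$Class == cl, ], size = 200)
  st_sf(class = cl, geometry = pts)
}))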

Let’s plot the xy data set (random points) on the DEM.

# Plot the random points on the DEM
tmap_mode("view")
## tmap mode set to interactive viewing
tm_shape(rvars$Elevation) + # Elevation in meters
  tm_raster(style = "quantile") +
  tm_layout(legend.outside = TRUE) +
  # Add training sample points
  tm_shape(xy) + tm_dots(size = 0.05)
## stars object downsampled to 1452 by 689 cells. See tm_shape manual (argument raster.downsample)

Step 2. Prepare training and test data sets

We are going to use the extract() function to extract values from the raster variables (rvars) at the training sample points.

# Set seed
set.seed(27)
# Extract the predictor values at the training sample points
train_samples <- extract(rvars, y = xy, cellnumbers = FALSE)
train_samples <- data.frame(response = xy$class, train_samples)

# Check the structure of the training data
str(train_samples)
## 'data.frame':    400 obs. of  10 variables:
##  $ response           : Factor w/ 2 levels "Fire","Non_fire": 1 1 1 1 1 1 1 1 1 1 ...
##  $ VV                 : num  -12.39 -13.92 -11.42 -9.52 -11.05 ...
##  $ VH                 : num  -19.7 -20.7 -18.2 -15.4 -16.7 ...
##  $ b1                 : num  0 0 3 7 6 3 0 0 6 6 ...
##  $ Elevation          : num  1221 1191 1287 1246 1185 ...
##  $ LST_Jul_Oct2021    : num  29.2 28.3 30.6 24.7 27.6 ...
##  $ S2_MedianNDVI_DS_21: num  0.211 0.221 0.265 0.503 0.309 ...
##  $ S2_MedianNDWI_DS_21: num  -0.16785 -0.19608 -0.20304 -0.00889 -0.17852 ...
##  $ S2_MedianCSI_DS_21 : num  0.464 0.429 0.385 0.325 0.368 ...
##  $ nd                 : num  101 103 165 206 247 ...

Let’s use the complete.cases() function to check for missing values.

# Check for missing values
train_samples_complete <- complete.cases(train_samples)
all(train_samples_complete)
## [1] TRUE

Next, check the class balance of the training points using the table() function.

# Check the number of training points
table(train_samples$response)
## 
##     Fire Non_fire 
##      200      200

Next, check for duplicate samples in the training points. Note that extract() was called with cellnumbers = FALSE above, so there is no cells column to test; instead, we check for duplicated rows.

# Check for any duplicated rows
any(duplicated(train_samples))
## [1] FALSE

The training points contain no duplicates. Next, we will split the data into training and test sets using the createDataPartition() function from the caret package. The training set will comprise 60% of the samples, and the test set the remaining 40%.

# Set seed and split the training data
set.seed(27)
inTraining <- createDataPartition(train_samples$response, p = .60, list = FALSE)
training <- train_samples[inTraining, ]
testing  <- train_samples[-inTraining, ]
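
Because createDataPartition() samples within each class, the 50:50 class balance is preserved in both sets. A quick check, which should show 120 points per class in training and 80 per class in testing:

# Confirm the class balance after the split
table(training$response)
table(testing$response)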

Step 3. Perform exploratory data analysis (EDA)

Performing exploratory data analysis (EDA) before training machine learning models is important because it reveals problems such as outliers, skewed distributions, and highly correlated predictors. We start by checking the descriptive statistics of the training data set using the summary() function.

# Check summary statistics
summary(training) 
##      response         VV                VH               b1        
##  Fire    :120   Min.   :-16.184   Min.   :-23.59   Min.   : 0.000  
##  Non_fire:120   1st Qu.:-12.305   1st Qu.:-19.34   1st Qu.: 0.000  
##                 Median :-11.201   Median :-17.29   Median : 3.000  
##                 Mean   :-11.344   Mean   :-17.74   Mean   : 3.067  
##                 3rd Qu.:-10.130   3rd Qu.:-16.00   3rd Qu.: 6.000  
##                 Max.   : -8.039   Max.   :-13.82   Max.   :11.000  
##    Elevation    LST_Jul_Oct2021 S2_MedianNDVI_DS_21 S2_MedianNDWI_DS_21
##  Min.   :1011   Min.   :22.91   Min.   :0.1535      Min.   :-0.31273   
##  1st Qu.:1187   1st Qu.:25.74   1st Qu.:0.2268      1st Qu.:-0.20775   
##  Median :1232   Median :27.49   Median :0.3113      Median :-0.14589   
##  Mean   :1223   Mean   :27.31   Mean   :0.3337      Mean   :-0.14397   
##  3rd Qu.:1274   3rd Qu.:28.88   3rd Qu.:0.4311      3rd Qu.:-0.07992   
##  Max.   :1321   Max.   :30.83   Max.   :0.6117      Max.   : 0.03975   
##  S2_MedianCSI_DS_21       nd         
##  Min.   :0.2324     Min.   :-105.28  
##  1st Qu.:0.3339     1st Qu.:  86.02  
##  Median :0.3711     Median : 140.24  
##  Mean   :0.3731     Mean   : 149.83  
##  3rd Qu.:0.4106     3rd Qu.: 200.97  
##  Max.   :0.5379     Max.   : 440.58

Next, we are going to use the cor() function to check the correlation between predictor variables.

# Check band correlations
bandCorrelations <- cor(training[, 2:10])
bandCorrelations
##                             VV         VH         b1  Elevation LST_Jul_Oct2021
## VV                   1.0000000  0.9282080  0.6439457  0.3182611      -0.6523490
## VH                   0.9282080  1.0000000  0.7092236  0.2629269      -0.7247826
## b1                   0.6439457  0.7092236  1.0000000  0.3955019      -0.8099542
## Elevation            0.3182611  0.2629269  0.3955019  1.0000000      -0.3965351
## LST_Jul_Oct2021     -0.6523490 -0.7247826 -0.8099542 -0.3965351       1.0000000
## S2_MedianNDVI_DS_21  0.6980735  0.7803658  0.7729157  0.3864639      -0.8201790
## S2_MedianNDWI_DS_21  0.5794321  0.6750357  0.7314010  0.2694921      -0.7933359
## S2_MedianCSI_DS_21  -0.5564565 -0.5828334 -0.5192182 -0.3919507       0.5246208
## nd                   0.3550289  0.3570384  0.2185974 -0.0806910      -0.1085367
##                     S2_MedianNDVI_DS_21 S2_MedianNDWI_DS_21 S2_MedianCSI_DS_21
## VV                            0.6980735           0.5794321         -0.5564565
## VH                            0.7803658           0.6750357         -0.5828334
## b1                            0.7729157           0.7314010         -0.5192182
## Elevation                     0.3864639           0.2694921         -0.3919507
## LST_Jul_Oct2021              -0.8201790          -0.7933359          0.5246208
## S2_MedianNDVI_DS_21           1.0000000           0.8326319         -0.7865036
## S2_MedianNDWI_DS_21           0.8326319           1.0000000         -0.3174576
## S2_MedianCSI_DS_21           -0.7865036          -0.3174576          1.0000000
## nd                            0.2168401           0.1451293         -0.2035316
##                             nd
## VV                   0.3550289
## VH                   0.3570384
## b1                   0.2185974
## Elevation           -0.0806910
## LST_Jul_Oct2021     -0.1085367
## S2_MedianNDVI_DS_21  0.2168401
## S2_MedianNDWI_DS_21  0.1451293
## S2_MedianCSI_DS_21  -0.2035316
## nd                   1.0000000

We will display a mixed correlation plot (coefficients in the lower triangle and colors in the upper triangle) to make the relationships between the predictors easier to see.

# Display a mixed correlation plot
corrplot.mixed(bandCorrelations, lower.col = "black", number.cex = .7, upper = "color")
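
Several predictors are strongly correlated (for example, VV and VH at about 0.93). RF is fairly robust to collinearity, but if you want to drop redundant predictors, caret's findCorrelation() flags candidates above a chosen cutoff. A minimal sketch (the 0.9 cutoff is a judgment call, not a rule):

# Flag predictors with absolute pairwise correlations above 0.9
findCorrelation(bandCorrelations, cutoff = 0.9, names = TRUE)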

Step 4. Train and evaluate the random forest (RF) model

We will set up repeated 10-fold cross-validation for model tuning and then train the random forest (RF) model as shown below.

# Set up the model tuning parameters
fitControl <- trainControl(method = "repeatedcv", number = 10, repeats = 10)

# Set up a timer to check how long it takes to train the RF model
timeStart <- proc.time()
set.seed(27)
# Train the RF model
rf_model <- train(response ~ ., data = training,
                  method = "rf",
                  trControl = fitControl,
                  prox = TRUE,
                  fitBest = FALSE,
                  localImp = TRUE,
                  returnData = TRUE)
proc.time() - timeStart
##    user  system elapsed 
##  40.138   0.062  40.201
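
Repeated cross-validation fits many forests (10 folds × 10 repeats for each mtry value), so training can be slow on larger data sets. If needed, caret can run the resampling in parallel; a sketch, assuming the doParallel package is installed:

# Optional: register a parallel backend before calling train()
library(doParallel)
cl <- makePSOCKcluster(4) # adjust the number of workers to your machine
registerDoParallel(cl)
# ... call train() as above, then release the workers:
stopCluster(cl)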

Next, check the RF model performance using the print() function.

# Check the RF model performance
print(rf_model)
## Random Forest 
## 
## 240 samples
##   9 predictor
##   2 classes: 'Fire', 'Non_fire' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 10 times) 
## Summary of sample sizes: 216, 216, 216, 216, 216, 216, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##   2     0.7329167  0.4658333
##   5     0.7320833  0.4641667
##   9     0.7233333  0.4466667
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
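
You can also visualize the cross-validated accuracy across the mtry values that caret tried:

# Plot cross-validated accuracy against mtry
ggplot(rf_model)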

Next, check the parameters of the best model.

# Check the parameters of the best model
rf_model$finalModel
## 
## Call:
##  randomForest(x = x, y = y, mtry = param$mtry, localImp = TRUE,      proximity = TRUE, fitBest = FALSE, returnData = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 2
## 
##         OOB estimate of  error rate: 28.75%
## Confusion matrix:
##          Fire Non_fire class.error
## Fire       87       33       0.275
## Non_fire   36       84       0.300

Next, calculate variable importance and display the top predictor variables.

# Compute RF variable importance
rf_varImp <- varImp(rf_model)
ggplot(rf_varImp) # Display the most important predictor variables

Next, we will assess the RF model performance on the held-out test set.

# Prepare a confusion matrix
pred_rf <- predict(rf_model, newdata = testing)
confusionMatrix(data = pred_rf, testing$response)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Fire Non_fire
##   Fire       62       22
##   Non_fire   18       58
##                                          
##                Accuracy : 0.75           
##                  95% CI : (0.6755, 0.815)
##     No Information Rate : 0.5            
##     P-Value [Acc > NIR] : 8.762e-11      
##                                          
##                   Kappa : 0.5            
##                                          
##  Mcnemar's Test P-Value : 0.6353         
##                                          
##             Sensitivity : 0.7750         
##             Specificity : 0.7250         
##          Pos Pred Value : 0.7381         
##          Neg Pred Value : 0.7632         
##              Prevalence : 0.5000         
##          Detection Rate : 0.3875         
##    Detection Prevalence : 0.5250         
##       Balanced Accuracy : 0.7500         
##                                          
##        'Positive' Class : Fire           
## 
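
The accuracy and kappa statistics above are computed at a fixed 0.5 probability threshold. For a threshold-free summary, you can also compute an ROC curve and the area under it (AUC) from the predicted class probabilities; a sketch, assuming the pROC package is installed:

# Compute the AUC from the predicted fire probabilities on the test set
library(pROC)
prob_rf <- predict(rf_model, newdata = testing, type = "prob")
auc(roc(testing$response, prob_rf[, "Fire"]))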

We will use the RF model to map woodland fire severity risk using the predict() function.

# Predict woodland fire severity risk
timeStart <- proc.time() # measure computation time
lc_rf <- predict(rvars, rf_model, type = "prob")
proc.time() - timeStart # user time and system time
##    user  system elapsed 
## 326.313   3.211 329.523

We are going to display a woodland fire severity risk map using the tmap package.

# Display woodland fire severity risk map using the tmap and tmaptools packages
tmap_mode("view")
## tmap mode set to interactive viewing
tm_shape(lc_rf) +
  tm_raster(style = "quantile", n = 7, palette = get_brewer_pal("Oranges", n = 7, plot = FALSE)) +
  tm_layout(legend.outside = TRUE)
## stars object downsampled to 1452 by 689 cells. See tm_shape manual (argument raster.downsample)

Finally, save the woodland fire severity risk map in “img” or “geotiff” format.

# Save the woodland fire severity risk map in "img" or "geotiff" format
writeRaster(lc_rf, filename = "Muf_Wd_Fire_Severity_Risk.img", datatype = 'FLT4S',
            index = 1, na.rm = TRUE, progress = "window", overwrite = TRUE)
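
To save a GeoTIFF instead, change the file extension and format:

# Equivalent GeoTIFF export
writeRaster(lc_rf, filename = "Muf_Wd_Fire_Severity_Risk.tif", format = "GTiff",
            datatype = 'FLT4S', overwrite = TRUE)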

Step 5. Compute local interpretable model-agnostic explanations (LIME)

In this step, we will use the lime package to explain individual predictions. First, we remove the response variable, since lime expects the predictors without the response.

# Remove the response variable (response) 
training_x <- dplyr::select(training, -response)
testing_x <- dplyr::select(testing, -response)

training_y <- dplyr::select(training, response)
testing_y <- dplyr::select(testing, response)

Next, let’s build the explainer using the lime() function.

# Create an explainer object
set.seed(27)
explainer_rf <- lime(training_x, rf_model, n_bins = 5, quantile_bins = TRUE)
summary(explainer_rf)
##                      Length Class  Mode     
## model                23     train  list     
## preprocess            1     -none- function 
## bin_continuous        1     -none- logical  
## n_bins                1     -none- numeric  
## quantile_bins         1     -none- logical  
## use_density           1     -none- logical  
## feature_type          9     -none- character
## bin_cuts              9     -none- list     
## feature_distribution  9     -none- list

Next, we will use the explain() function to generate explanations for the test observations.

timeStart <- proc.time() # measure computation time
# Explain new observations
explanation_rf <- explain(
  x = testing_x, 
  explainer = explainer_rf, 
  n_permutations = 5000,
  dist_fun = "euclidean",
  kernel_width = .75,
  n_features = 10, 
  feature_select = "highest_weights",
  n_labels = 2
  )
proc.time() - timeStart # user time and system time
##    user  system elapsed 
##  49.768   2.524  52.293

Finally, let’s visualize the explanations for cases (single pixels) 1 to 50.

# Create a plot to visualize the explanations
plot_features(explanation_rf[1:50, ], ncol = 2)
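
With 50 cases and two labels, the feature-level plot gets crowded. As a more compact alternative, the lime package also provides plot_explanations(), which condenses the same explanations into a heatmap:

# Compact heatmap view of the same explanations
plot_explanations(explanation_rf[1:50, ])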