The following are the image processing and analysis steps.
Step 1. Set up your environment Step 2. Prepare training and test data sets Step 3. Perform exploratory data analysis (EDA) Step 4. Train and evaluate the RF model Step 5. Compute local interpretable model-agnostic explanations (LIME)
Before getting started, install and load the necessary packages that we will use in this blog.
rm(list=ls())
# Load the following packages
library(rgdal) # provides an interface to the Geospatial Data Abstraction Library (GDAL)
library(raster) # provides important functions for importing and handling raster data
library(mapview) # provides functions to visualize geospatial data
## Warning: multiple methods tables found for 'crop'
## Warning: multiple methods tables found for 'extend'
library(sf) # represents simple features as native R objects
library(randomForest) # implements Breiman’s RF algorithm
library(dplyr) # provides basic data transformation functions
library(ggplot2) # provides extension of visualizations
library(corrplot) # provides a graphical display of a correlation matrix
library(gridExtra) # provides user-level functions to arrange multiple grid-based plots on a page
library(tmaptools) # provides a set of tools for reading and processing spatial data
library(tmap) # provides a flexible and easy to use approach to create thematic maps
library(plotly) # creates interactive web-based graphs
library(lime) # provides local interpretation for ML models
library(caret) # provides procedures for training and evaluating machine learning models
## Warning: package 'lattice' was built under R version 4.1.0
Next, set up your work environment. Note, this is the directory or folder that contains raster variables and training data.
# set up the working directory
setwd("/home/kamusoko/Documents/Projects/Zimbabwe/Zimbabwe_Forest_Mapping/Zimbabwe_Landcover_2020/Provinces/Midlands/Mufungabusi_FR/Classification/Sentinel-2")
# Create a list of raster bands that will be used for modeling
VV_dry <- brick("Muf_S1_median_Asc_VV_Jul_Oct_21.tif")
VH_dry <- brick("Muf_S1_median_Asc_VH_Jul_Oct_21.tif")
Forest_height <-brick("Forest_height.tif")
Elevation <-brick("Elevation.tif")
Slope <-brick("Muf_Slope.tif")
LSTemp <- brick("LST_Jul_Oct2021.tif")
NDVI <-brick("S2_MedianNDVI_DS_21.tif")
NDWI <- brick("S2_MedianNDWI_DS_21.tif")
CSI <- brick("S2_MedianCSI_DS_21.tif")
dNBR <- brick("dNBR.tif")
# Combine or stack the raster layers.
rvars <- stack(VV_dry, VH_dry, Forest_height, Elevation, LSTemp, NDVI, NDWI, CSI, dNBR)
# Create an object “ta_data” and import the training data
ta_data <- readOGR(getwd(), "Muf_Forest_Fire_Risk_2021")
## OGR data source with driver: ESRI Shapefile
## Source: "/home/kamusoko/Documents/Projects/Zimbabwe/Zimbabwe_Forest_Mapping/Zimbabwe_Landcover_2020/Provinces/Midlands/Mufungabusi_FR/Classification/Sentinel-2", layer: "Muf_Forest_Fire_Risk_2021"
## with 50 features
## It has 2 fields
## Integer64 fields read as strings: id
Check the attributes of the raster variables (rvars) and the training data.
# Check the raster variables and training data
print(rvars)
## class : RasterStack
## dimensions : 3115, 6567, 20456205, 9 (nrow, ncol, ncell, nlayers)
## resolution : 10, 10 (x, y)
## extent : 658990, 724660, 7945640, 7976790 (xmin, xmax, ymin, ymax)
## crs : +proj=utm +zone=35 +south +datum=WGS84 +units=m +no_defs
## names : VV, VH, b1, Elevation, LST_Jul_Oct2021, S2_MedianNDVI_DS_21, S2_MedianNDWI_DS_21, S2_MedianCSI_DS_21, nd
## min values : ?, ?, 0, ?, ?, ?, ?, ?, ?
## max values : ?, ?, 255, ?, ?, ?, ?, ?, ?
summary(ta_data)
## Object of class SpatialPolygonsDataFrame
## Coordinates:
## min max
## x 659472.3 724236.2
## y 7946454.9 7976554.9
## Is projected: TRUE
## proj4string :
## [+proj=utm +zone=35 +south +datum=WGS84 +units=m +no_defs]
## Data attributes:
## id Class
## 0 : 1 Fire :23
## 1 : 1 Non_fire:27
## 10 : 1
## 11 : 1
## 12 : 1
## 13 : 1
## (Other):44
The training data has 23 fire and 27 non-fire polygons. Let’s create a balanced training data set.
Following are the steps. First, we will create a vector of unique class values.
# Create a vector of unique class values
uniqueClasses <- unique(ta_data$Class)
uniqueClasses
## [1] Fire Non_fire
## Levels: Fire Non_fire
Second, we will sample the same number of random points inside all polygons.
# Set the seed
set.seed(27)
for (i in 1:length(uniqueClasses)) {
# Select only ploygons from ghe current class
class_data <- subset(ta_data, Class == uniqueClasses[i])
# Get random points for these polygons
classpts <- spsample(class_data, type = "random", n = 200)
# Add class column to the SpatialPoints object
classpts$class <- rep(uniqueClasses[i], length(classpts))
if (i == 1) {
xy <- classpts
} else {
xy <- rbind(xy, classpts)
}
}
## Warning in proj4string(obj): CRS object has comment, which is lost in output; in tests, see
## https://cran.r-project.org/web/packages/sp/vignettes/CRS_warnings.html
## Warning in proj4string(obj): CRS object has comment, which is lost in output; in tests, see
## https://cran.r-project.org/web/packages/sp/vignettes/CRS_warnings.html
Let’s plot xy data set (random points) on the DEM.
# Plot the random points on the DEM
tmap_mode("view")
## tmap mode set to interactive viewing
tm_shape(rvars$Elevation) + # Elevation in meters
tm_raster(style= "quantile") +
tm_layout(legend.outside = TRUE) +
# Add training sample points
tm_shape(xy) + tm_dots(size = 0.05)
## stars object downsampled to 1452 by 689 cells. See tm_shape manual (argument raster.downsample)
We are going to use the extract() function to extract values from the different raster variables (rvars).
# Set seed
set.seed(27)
# Extract reflectance values from the raster variables and training sample points
train_samples <- extract(rvars, y = xy, cellnumbers = FALSE)
train_samples = data.frame(response = xy$class, train_samples)
# Check the structure of the training data
str(train_samples)
## 'data.frame': 400 obs. of 10 variables:
## $ response : Factor w/ 2 levels "Fire","Non_fire": 1 1 1 1 1 1 1 1 1 1 ...
## $ VV : num -12.39 -13.92 -11.42 -9.52 -11.05 ...
## $ VH : num -19.7 -20.7 -18.2 -15.4 -16.7 ...
## $ b1 : num 0 0 3 7 6 3 0 0 6 6 ...
## $ Elevation : num 1221 1191 1287 1246 1185 ...
## $ LST_Jul_Oct2021 : num 29.2 28.3 30.6 24.7 27.6 ...
## $ S2_MedianNDVI_DS_21: num 0.211 0.221 0.265 0.503 0.309 ...
## $ S2_MedianNDWI_DS_21: num -0.16785 -0.19608 -0.20304 -0.00889 -0.17852 ...
## $ S2_MedianCSI_DS_21 : num 0.464 0.429 0.385 0.325 0.368 ...
## $ nd : num 101 103 165 206 247 ...
Let’s use the complete.cases() function to check for missing values.
# Check for missing values
train_samples_complete <-complete.cases(train_samples)
(train_samples_complete)
## [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [16] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [31] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [46] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [61] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [76] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [91] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [106] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [121] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [136] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [151] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [166] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [181] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [196] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [211] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [226] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [241] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [256] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [271] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [286] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [301] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [316] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [331] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [346] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [361] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [376] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [391] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
Next, check the training points using the table() function.
# Check the number of training points
table(train_samples$response)
##
## Fire Non_fire
## 200 200
Next, check for any duplicates in the training points.
# Check for any duplicates
any(duplicated(train_samples$cells))
## [1] FALSE
The training points contain no duplicates. Next, we will split the training data into training and test sets based on the createDataPartition() function from the caret package. The training data will comprise 60%, while the test data set will include 40%.
# Set seed and split training data
set.seed(27)
inTraining <- createDataPartition(train_samples$response, p = .60, list = FALSE)
training <- train_samples[ inTraining,]
testing <- train_samples[-inTraining,]
Performing exploratory data analysis (EDA) before training machine learning models is very important. We start by checking the descriptive statistics of the training data set using the summary() function.
# Check summary statistics
summary(training)
## response VV VH b1
## Fire :120 Min. :-16.184 Min. :-23.59 Min. : 0.000
## Non_fire:120 1st Qu.:-12.305 1st Qu.:-19.34 1st Qu.: 0.000
## Median :-11.201 Median :-17.29 Median : 3.000
## Mean :-11.344 Mean :-17.74 Mean : 3.067
## 3rd Qu.:-10.130 3rd Qu.:-16.00 3rd Qu.: 6.000
## Max. : -8.039 Max. :-13.82 Max. :11.000
## Elevation LST_Jul_Oct2021 S2_MedianNDVI_DS_21 S2_MedianNDWI_DS_21
## Min. :1011 Min. :22.91 Min. :0.1535 Min. :-0.31273
## 1st Qu.:1187 1st Qu.:25.74 1st Qu.:0.2268 1st Qu.:-0.20775
## Median :1232 Median :27.49 Median :0.3113 Median :-0.14589
## Mean :1223 Mean :27.31 Mean :0.3337 Mean :-0.14397
## 3rd Qu.:1274 3rd Qu.:28.88 3rd Qu.:0.4311 3rd Qu.:-0.07992
## Max. :1321 Max. :30.83 Max. :0.6117 Max. : 0.03975
## S2_MedianCSI_DS_21 nd
## Min. :0.2324 Min. :-105.28
## 1st Qu.:0.3339 1st Qu.: 86.02
## Median :0.3711 Median : 140.24
## Mean :0.3731 Mean : 149.83
## 3rd Qu.:0.4106 3rd Qu.: 200.97
## Max. :0.5379 Max. : 440.58
Next, we are going to use the cor() function to check the correlation between predictor variables.
# Check band correlations
bandCorrelations <- cor(training[, 2:10])
bandCorrelations
## VV VH b1 Elevation LST_Jul_Oct2021
## VV 1.0000000 0.9282080 0.6439457 0.3182611 -0.6523490
## VH 0.9282080 1.0000000 0.7092236 0.2629269 -0.7247826
## b1 0.6439457 0.7092236 1.0000000 0.3955019 -0.8099542
## Elevation 0.3182611 0.2629269 0.3955019 1.0000000 -0.3965351
## LST_Jul_Oct2021 -0.6523490 -0.7247826 -0.8099542 -0.3965351 1.0000000
## S2_MedianNDVI_DS_21 0.6980735 0.7803658 0.7729157 0.3864639 -0.8201790
## S2_MedianNDWI_DS_21 0.5794321 0.6750357 0.7314010 0.2694921 -0.7933359
## S2_MedianCSI_DS_21 -0.5564565 -0.5828334 -0.5192182 -0.3919507 0.5246208
## nd 0.3550289 0.3570384 0.2185974 -0.0806910 -0.1085367
## S2_MedianNDVI_DS_21 S2_MedianNDWI_DS_21 S2_MedianCSI_DS_21
## VV 0.6980735 0.5794321 -0.5564565
## VH 0.7803658 0.6750357 -0.5828334
## b1 0.7729157 0.7314010 -0.5192182
## Elevation 0.3864639 0.2694921 -0.3919507
## LST_Jul_Oct2021 -0.8201790 -0.7933359 0.5246208
## S2_MedianNDVI_DS_21 1.0000000 0.8326319 -0.7865036
## S2_MedianNDWI_DS_21 0.8326319 1.0000000 -0.3174576
## S2_MedianCSI_DS_21 -0.7865036 -0.3174576 1.0000000
## nd 0.2168401 0.1451293 -0.2035316
## nd
## VV 0.3550289
## VH 0.3570384
## b1 0.2185974
## Elevation -0.0806910
## LST_Jul_Oct2021 -0.1085367
## S2_MedianNDVI_DS_21 0.2168401
## S2_MedianNDWI_DS_21 0.1451293
## S2_MedianCSI_DS_21 -0.2035316
## nd 1.0000000
We will display a mixed plot (numbers and plots) to understand the correlation between the predictors.
# Display mixed plot
corrplot.mixed(bandCorrelations,lower.col="black", number.cex = .7, upper = "color")
We will specify the random forest (RF) model as shown below.
# Set up the model tuning parameters
fitControl<- trainControl(method = "repeatedcv", number = 10, repeats = 10)
# Set-up timer to check how long it takes to run the RF model
timeStart <- proc.time()
set.seed(27)
# Train the RF model
rf_model <- train(response ~.,data = training,
method = "rf",
trControl = fitControl,
prox = TRUE,
fitBest = FALSE,
localImp = TRUE,
returnData = TRUE)
proc.time() - timeStart
## user system elapsed
## 40.138 0.062 40.201
Next, check the RF model performance using the print() function.
# Check the RF model performance
print(rf_model)
## Random Forest
##
## 240 samples
## 9 predictor
## 2 classes: 'Fire', 'Non_fire'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 10 times)
## Summary of sample sizes: 216, 216, 216, 216, 216, 216, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.7329167 0.4658333
## 5 0.7320833 0.4641667
## 9 0.7233333 0.4466667
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
Next, check the parameters of the best model.
# check the parameters of the best model.
rf_model$fin
##
## Call:
## randomForest(x = x, y = y, mtry = param$mtry, localImp = TRUE, proximity = TRUE, fitBest = FALSE, returnData = TRUE)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 28.75%
## Confusion matrix:
## Fire Non_fire class.error
## Fire 87 33 0.275
## Non_fire 36 84 0.300
Next, calculate and display the top predictor variables.
# Compute RF variable importance
rf_varImp <- varImp(rf_model)
ggplot(rf_varImp) # Display the most important predictor variables
Next, we will assess the RF model performance using new testing data.
# Prepare a confusion matrix
pred_rf <- predict(rf_model, newdata = testing)
confusionMatrix(data = pred_rf, testing$response)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Fire Non_fire
## Fire 62 22
## Non_fire 18 58
##
## Accuracy : 0.75
## 95% CI : (0.6755, 0.815)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : 8.762e-11
##
## Kappa : 0.5
##
## Mcnemar's Test P-Value : 0.6353
##
## Sensitivity : 0.7750
## Specificity : 0.7250
## Pos Pred Value : 0.7381
## Neg Pred Value : 0.7632
## Prevalence : 0.5000
## Detection Rate : 0.3875
## Detection Prevalence : 0.5250
## Balanced Accuracy : 0.7500
##
## 'Positive' Class : Fire
##
We will use the RF model to woodland fire severity risk using the predict() function.
# Predict woodland fire severity risk
timeStart<- proc.time() # measure computation time
lc_rf <- predict(rvars, rf_model, type = "prob")
proc.time() - timeStart # user time and system time
## user system elapsed
## 326.313 3.211 329.523
We are going to display a woodland fire severity risk map using the tmap package.
# Display woodland fire severity risk map using the tmap and tmaptools packages
tmap_mode("view")
## tmap mode set to interactive viewing
tm_shape(lc_rf) + tm_raster(style= "quantile", n=7, palette=get_brewer_pal("Oranges", n = 7, plot=FALSE)) +
tm_layout(legend.outside = TRUE)
## stars object downsampled to 1452 by 689 cells. See tm_shape manual (argument raster.downsample)
Finally, save the woodland fire severity risk map in “img” or “geotiff” format.
# Save the woodland fire severity risk map in “img” or “geotiff” formats
writeRaster(lc_rf, filename="Muf_Wd_Fire_Severity_Risk.img", datatype='FLT4S', index=1, na.rm=TRUE, progress="window", overwrite=TRUE)
In this step, we will use the LIME package to help us explain individual predictions. First, we remove the response variable since LIME needs data without a response variable.
# Remove the response variable (response)
training_x <- dplyr::select(training, -response)
testing_x <- dplyr::select(testing, -response)
training_y <- dplyr::select(training, response)
testing_y <- dplyr::select(testing, response)
Next, let’s build the explainer using the LIME() function.
# Create an explainer object
set.seed(27)
explainer_rf <- lime(training_x, rf_model, n_bins = 5, quantile_bins = TRUE)
summary(explainer_rf)
## Length Class Mode
## model 23 train list
## preprocess 1 -none- function
## bin_continuous 1 -none- logical
## n_bins 1 -none- numeric
## quantile_bins 1 -none- logical
## use_density 1 -none- logical
## feature_type 9 -none- character
## bin_cuts 9 -none- list
## feature_distribution 9 -none- list
Next, we will use the explain() function to predict and evaluate the explanations.
timeStart<- proc.time() # measure computation time
# Explain new observations
explanation_rf <- explain(
x = testing_x,
explainer = explainer_rf,
n_permutations = 5000,
dist_fun = "euclidean",
kernel_width = .75,
n_features = 10,
feature_select = "highest_weights",
n_labels = 2
)
proc.time() - timeStart # user time and system time
## user system elapsed
## 49.768 2.524 52.293
Finally, let’s visualize cases (single pixels) between rows 1 and 50.
# Create a plot to visualize the explanaitions
plot_features(explanation_rf[1:50, ], ncol = 2)