The following are the image processing and analysis steps.

Step 1. Set up your environment Step 2. Prepare training and test data sets Step 3. Perform exploratory data analysis (EDA) Step 4. Train and evaluate the RF model Step 5. Compute local interpretable model-agnostic explanations (LIME)

Step 1. Set up your environment

Before getting started, install and load the necessary packages that we will use in this lab.

rm(list=ls())
# Load the following packages
library(rgdal) # provides an interface to the Geospatial Data Abstraction Library (GDAL)         
library(raster) # provides important functions for importing and handling raster data
library(mapview)  # provides functions to visualize geospatial data
## Warning: multiple methods tables found for 'crop'
## Warning: multiple methods tables found for 'extend'
library(sf) # represents simple features as native R objects
library(randomForest) # implements Breiman’s RF algorithm
library(dplyr) # provides basic data transformation functions
library(ggplot2) # provides extension of visualizations
library(corrplot) # provides a graphical display of a correlation matrix
library(gridExtra) # provides user-level functions to arrange multiple grid-based plots on a page
library(tmaptools) # provides a set of tools for reading and processing spatial data
library(tmap) # provides a flexible and easy to use approach to create thematic maps
library(plotly) # creates interactive web-based graphs
library(lime)   #  provides local interpretation for ML models 
library(caret) # provides procedures for training and evaluating machine learning models
## Warning: package 'lattice' was built under R version 4.1.0

Next, set up your work environment. Note, this is the directory or folder that contains raster variables and training data.

#  set up the working directory
setwd("/home/kamusoko/Documents/GIS Company/2020-2021/ai.geolabs/Courses and Training/Online Course Bundles/1_Self-paced Courses/1_Beginner and Intermediate/Geospatial analysis and Modeling/Modeling/Regression_ML/Data/Raster/GEE")

# Create a list of raster bands that will be used for modeling
Elevation <- brick("Elevation.tif") # ASTER DEM
Slope <- brick("Slope.tif") # Derived from ASTER DEM
TPI <- brick("TPI.tif")  # Topographic Position Index (Derived from ASTER DEM)
TRI <- brick("TRI.tif") # Topographic Ruggedness Index (Derived from ASTER DEM)
Cross_curvature <- brick("Cross_curvature.tif") # Derived from ASTER DEM
Profile_curvature <- brick("Profile_curvature.tif") # Derived from ASTER DEM
Asc_VV <- brick("S1_median_Asc_VV.tif") # Ascending VV
Asc_VH <- brick("S1_median_Asc_VH.tif") # Ascending VV

# Combine or stack the raster layers.
rvars <- stack(Elevation, Slope, TPI, TRI, Cross_curvature, Profile_curvature, Asc_VV, Asc_VH)

# Create an object “ta_data” and import the training data
ta_data <- readOGR(getwd(), "Chimanimani_Landslides_2019_Poly")
## OGR data source with driver: ESRI Shapefile 
## Source: "/home/kamusoko/Documents/GIS Company/2020-2021/ai.geolabs/Courses and Training/Online Course Bundles/1_Self-paced Courses/1_Beginner and Intermediate/Geospatial analysis and Modeling/Modeling/Regression_ML/Data/Raster/GEE", layer: "Chimanimani_Landslides_2019_Poly"
## with 111 features
## It has 1 fields

Check the attributes of the imagery and the training data.

# Check the raster variables and training data
print(rvars)
## class      : RasterStack 
## dimensions : 8523, 7525, 64135575, 8  (nrow, ncol, ncell, nlayers)
## resolution : 10, 10  (x, y)
## extent     : 430870, 506120, 7773710, 7858940  (xmin, xmax, ymin, ymax)
## crs        : +proj=utm +zone=36 +south +datum=WGS84 +units=m +no_defs 
## names      : Elevation, Slope,   TPI,   TRI, Cross_curvature, Profile_curvature, S1_median_Asc_VV, S1_median_Asc_VH 
## min values :    -32768,     ?,     ?,     ?,               ?,                 ?,                ?,                ? 
## max values :     32767,     ?,     ?,     ?,               ?,                 ?,                ?,                ?
summary(ta_data)
## Object of class SpatialPolygonsDataFrame
## Coordinates:
##         min       max
## x  432127.7  502322.8
## y 7775268.4 7858554.3
## Is projected: TRUE 
## proj4string :
## [+proj=utm +zone=36 +south +datum=WGS84 +units=m +no_defs]
## Data attributes:
##        Class   
##  Landslide:48  
##  Others   :63

The training data has 48 landslides and 63 other classes. Let’s create a balanced training data set.

Following are the steps. First, we will create a vector of unique land cover values.

# Create a vector of unique land cover values
uniqueClasses <- unique(ta_data$Class)
uniqueClasses
## [1] Landslide Others   
## Levels: Landslide Others

Second, we will sample the same number of random points inside all polygons.

# Set the seed
set.seed(27)
for (i in 1:length(uniqueClasses)) {
  # Select only ploygons from ghe current class
  class_data <- subset(ta_data, Class == uniqueClasses[i])
  # Get random points for these polygons
  classpts <- spsample(class_data, type = "random", n = 200)
  # Add class column to the SpatialPoints object
  classpts$class <- rep(uniqueClasses[i], length(classpts))
  if (i == 1) {
    xy <- classpts
  } else {
    xy <- rbind(xy, classpts)
      }
}
## Warning in proj4string(obj): CRS object has comment, which is lost in output; in tests, see
## https://cran.r-project.org/web/packages/sp/vignettes/CRS_warnings.html

## Warning in proj4string(obj): CRS object has comment, which is lost in output; in tests, see
## https://cran.r-project.org/web/packages/sp/vignettes/CRS_warnings.html

Next, let’s plot xy data set (random points) on the Sentinel-1 imagery.

# Set-up timer
timeStart <- proc.time()
# Plot the random points on Sentinel-1 imagery
plotRGB(rvars, r = 7, g = 8, b = 7, stretch = "lin")
#plot(rvars, 4)
points(xy)

proc.time() - timeStart
##    user  system elapsed 
##   1.072   0.088   1.159

Step 2. Prepare training and test data sets

We are going to use the extract() function to extract values from the different raster variables (rvars).

# Set seed
set.seed(27)
# Extract reflectance values from the raster variables and training sample points
train_samples <- extract(rvars, y = xy, cellnumbers = TRUE)
train_samples = data.frame(response = xy$class, train_samples)

# Check the structure of the training data
str(train_samples)
## 'data.frame':    400 obs. of  10 variables:
##  $ response         : Factor w/ 2 levels "Landslide","Others": 1 1 1 1 1 1 1 1 1 1 ...
##  $ cells            : num  27735844 37412269 27555348 38209830 38691307 ...
##  $ Elevation        : num  1311 1429 1548 1405 1402 ...
##  $ Slope            : num  19.29 4.52 34.22 0 28.81 ...
##  $ TPI              : num  2.625 0.625 -7.375 0 -4.125 ...
##  $ TRI              : num  2.625 0.625 7.375 0 4.125 ...
##  $ Cross_curvature  : num  -1.18e-17 4.60e-03 -9.64e-02 0.00 0.00 ...
##  $ Profile_curvature: num  0.0589 0.0185 -0.0508 0 -0.074 ...
##  $ S1_median_Asc_VV : num  -8.56 -11.93 -12.02 -8.66 -12.81 ...
##  $ S1_median_Asc_VH : num  -17.2 -19.3 -18 -15.8 -19.4 ...

Next, check the training points using the table() function.

# Check the number of training points
table(train_samples$response)
## 
## Landslide    Others 
##       200       200

Next, check for any duplicates in the training points.

# Check for any duplicates
any(duplicated(train_samples$cells))
## [1] TRUE

The training points contain some duplicates. Therefore, we need to remove the duplicates.

train_samples <- train_samples[!duplicated(train_samples$cells), -2]
names(train_samples)
## [1] "response"          "Elevation"         "Slope"            
## [4] "TPI"               "TRI"               "Cross_curvature"  
## [7] "Profile_curvature" "S1_median_Asc_VV"  "S1_median_Asc_VH"

Next, let’s check the number of the remaining training points using the table() function.

# Check the number of remaining training points
table(train_samples$response)
## 
## Landslide    Others 
##       173       199

Next, we will split the training data into training and test sets based on the createDataPartition() function from the caret package. The training data will comprise 60%, while the test data set will include 40%.

# Set seed and split training data
set.seed(27)
inTraining <- createDataPartition(train_samples$response, p = .60, list = FALSE)
training <- train_samples[ inTraining,]
testing  <- train_samples[-inTraining,]

Step 3. Perform exploratory data analysis (EDA)

Performing exploratory data analysis (EDA) before training machine learning models is very important. We start by checking the descriptive statistics of the training data set using the summary() function.

# Check summary statistics
summary(training) 
##       response     Elevation          Slope             TPI           
##  Landslide:104   Min.   : 475.0   Min.   : 0.000   Min.   :-10.37500  
##  Others   :120   1st Qu.: 795.5   1st Qu.: 2.992   1st Qu.: -0.90625  
##                  Median :1397.0   Median : 8.111   Median :  0.00000  
##                  Mean   :1245.7   Mean   :12.550   Mean   : -0.07533  
##                  3rd Qu.:1448.0   3rd Qu.:16.699   3rd Qu.:  0.75000  
##                  Max.   :2063.0   Max.   :58.782   Max.   : 12.37500  
##       TRI         Cross_curvature      Profile_curvature   S1_median_Asc_VV 
##  Min.   : 0.000   Min.   :-0.1333309   Min.   :-0.076975   Min.   :-19.612  
##  1st Qu.: 0.500   1st Qu.: 0.0000000   1st Qu.:-0.019777   1st Qu.:-12.026  
##  Median : 1.125   Median : 0.0000000   Median : 0.000000   Median :-10.829  
##  Mean   : 1.986   Mean   :-0.0003023   Mean   :-0.001516   Mean   :-10.673  
##  3rd Qu.: 2.531   3rd Qu.: 0.0000000   3rd Qu.: 0.019704   3rd Qu.: -9.342  
##  Max.   :12.375   Max.   : 0.1333309   Max.   : 0.076975   Max.   : -3.562  
##  S1_median_Asc_VH 
##  Min.   :-25.031  
##  1st Qu.:-18.571  
##  Median :-17.019  
##  Mean   :-17.125  
##  3rd Qu.:-15.537  
##  Max.   : -9.718

Next, we are going to use the cor() function to check the correlation between predictor variables.

# Check band correlations
bandCorrelations <- cor(training[, 2:9])
bandCorrelations
##                      Elevation        Slope         TPI         TRI
## Elevation          1.000000000  0.418822974 0.002669579  0.39227577
## Slope              0.418822974  1.000000000 0.005908755  0.97389996
## TPI                0.002669579  0.005908755 1.000000000  0.03281532
## TRI                0.392275768  0.973899959 0.032815319  1.00000000
## Cross_curvature   -0.032075640 -0.040534014 0.543317569 -0.05408707
## Profile_curvature -0.015323367 -0.034033600 0.859319394 -0.01837885
## S1_median_Asc_VV   0.086447399 -0.080316685 0.058886532 -0.13075549
## S1_median_Asc_VH  -0.023755091 -0.053238763 0.048697798 -0.09663791
##                   Cross_curvature Profile_curvature S1_median_Asc_VV
## Elevation             -0.03207564       -0.01532337       0.08644740
## Slope                 -0.04053401       -0.03403360      -0.08031669
## TPI                    0.54331757        0.85931939       0.05888653
## TRI                   -0.05408707       -0.01837885      -0.13075549
## Cross_curvature        1.00000000        0.23535510       0.10882554
## Profile_curvature      0.23535510        1.00000000       0.08203996
## S1_median_Asc_VV       0.10882554        0.08203996       1.00000000
## S1_median_Asc_VH       0.10152396        0.06985885       0.94868730
##                   S1_median_Asc_VH
## Elevation              -0.02375509
## Slope                  -0.05323876
## TPI                     0.04869780
## TRI                    -0.09663791
## Cross_curvature         0.10152396
## Profile_curvature       0.06985885
## S1_median_Asc_VV        0.94868730
## S1_median_Asc_VH        1.00000000

We will display a mixed plot (numbers and plots) to understand the correlation between the predictors.

# Display  mixed plot
corrplot.mixed(bandCorrelations,lower.col="black", number.cex = .7, upper = "color")

Step 4. Train and evaluate the random forest (RF) model

We will specify the random forest (RF) model as shown below.

# Set up the model tuning parameters
fitControl<- trainControl(method = "repeatedcv", number = 10, repeats = 10)

# Set-up timer to check how long it takes to run the RF model
timeStart <- proc.time()
set.seed(27)
# Train the RF model
rf_model <- train(response ~.,data = training,
                  method = "rf",
                  trControl = fitControl,
                  prox = TRUE,
                  fitBest = FALSE,
                  localImp = TRUE,
                  returnData = TRUE)
proc.time() - timeStart
##    user  system elapsed 
##  31.871   0.272  32.147

Next, check the RF model performance using the print() function.

# Check the RF model performance
print(rf_model)
## Random Forest 
## 
## 224 samples
##   8 predictor
##   2 classes: 'Landslide', 'Others' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 10 times) 
## Summary of sample sizes: 201, 201, 202, 202, 201, 202, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##   2     0.8841700  0.7683833
##   5     0.9060474  0.8117782
##   8     0.9015217  0.8023295
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 5.

Next, check the parameters of the best model.

# check the parameters of the best model.
rf_model$fin
## 
## Call:
##  randomForest(x = x, y = y, mtry = param$mtry, localImp = TRUE,      proximity = TRUE, fitBest = FALSE, returnData = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 5
## 
##         OOB estimate of  error rate: 10.27%
## Confusion matrix:
##           Landslide Others class.error
## Landslide        96      8  0.07692308
## Others           15    105  0.12500000

Next, calculate and display the top predictor variables.

# Compute RF variable importance
rf_varImp <- varImp(rf_model)
ggplot(rf_varImp) # Display the most important predictor variables

Next, we will assess the RF model performance using new testing data.

# Prepare a confusion matrix
pred_rf <- predict(rf_model, newdata = testing)
confusionMatrix(data = pred_rf, testing$response)
## Confusion Matrix and Statistics
## 
##            Reference
## Prediction  Landslide Others
##   Landslide        62      9
##   Others            7     70
##                                           
##                Accuracy : 0.8919          
##                  95% CI : (0.8304, 0.9369)
##     No Information Rate : 0.5338          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.7832          
##                                           
##  Mcnemar's Test P-Value : 0.8026          
##                                           
##             Sensitivity : 0.8986          
##             Specificity : 0.8861          
##          Pos Pred Value : 0.8732          
##          Neg Pred Value : 0.9091          
##              Prevalence : 0.4662          
##          Detection Rate : 0.4189          
##    Detection Prevalence : 0.4797          
##       Balanced Accuracy : 0.8923          
##                                           
##        'Positive' Class : Landslide       
## 

We will use the RF model to landslide risk using the predict() function.

# Predict land cover
timeStart<- proc.time() # measure computation time
lc_rf <- predict(rvars, rf_model, type = "prob")
proc.time() - timeStart # user time and system time
##    user  system elapsed 
## 754.959   8.803 763.772

We are going to display a landslide risk map using the tmap package.

# Display landslide risk map using the tmap and tmaptools packages
tmap_mode("view")
## tmap mode set to interactive viewing
tm_shape(lc_rf) + tm_raster(style= "quantile", n=7, palette=get_brewer_pal("Reds", n = 7, plot=FALSE)) +
tm_layout(legend.outside = TRUE)
## stars object downsampled to 940 by 1065 cells. See tm_shape manual (argument raster.downsample)

Finally, save the landslide risk map in “img” or “geotiff” format.

# Save the landslide risk map in “img” or “geotiff” formats
writeRaster(lc_rf, filename="Chiman_Landslide_Risk.img", datatype='FLT4S', index=1, na.rm=TRUE, progress="window", overwrite=TRUE)

Step 5. Local Interpretable Model-agnostic Explanations (LIME)

In this step, we will use the LIME package to help us explain individual predictions. First, we remove the response variable since LIME needs data without a response variable.

# Remove the response variable (response) 
training_x <- dplyr::select(training, -response)
testing_x <- dplyr::select(testing, -response)

training_y <- dplyr::select(training, response)
testing_y <- dplyr::select(testing, response)

Next, let’s build the explainer using the LIME() function.

# Create an explainer object
set.seed(27)
explainer_rf <- lime(training_x, rf_model, n_bins = 5, quantile_bins = TRUE)
summary(explainer_rf)
##                      Length Class  Mode     
## model                23     train  list     
## preprocess            1     -none- function 
## bin_continuous        1     -none- logical  
## n_bins                1     -none- numeric  
## quantile_bins         1     -none- logical  
## use_density           1     -none- logical  
## feature_type          8     -none- character
## bin_cuts              8     -none- list     
## feature_distribution  8     -none- list

Next, we will use the explain() function to predict and evaluate the explanations.

timeStart<- proc.time() # measure computation time
# Explain new observations
explanation_rf <- explain(
  x = testing_x, 
  explainer = explainer_rf, 
  n_permutations = 5000,
  dist_fun = "euclidean",
  kernel_width = .75,
  n_features = 8, 
  feature_select = "highest_weights",
  n_labels = 2
  )
proc.time() - timeStart # user time and system time
##    user  system elapsed 
##  49.452   0.452  49.908

Finally, let’s visualize cases (single pixels) between rows 1 and 50.

# Create a plot to visualize the explanaitions
plot_features(explanation_rf[1:50, ], ncol = 2)