The following are the image processing and analysis steps.
Step 1. Set up your environment Step 2. Prepare training and test data sets Step 3. Perform exploratory data analysis (EDA) Step 4. Train and evaluate the RF model Step 5. Compute local interpretable model-agnostic explanations (LIME)
Before getting started, install and load the necessary packages that we will use in this lab.
rm(list=ls())
# Load the following packages
library(rgdal) # provides an interface to the Geospatial Data Abstraction Library (GDAL)
library(raster) # provides important functions for importing and handling raster data
library(mapview) # provides functions to visualize geospatial data
## Warning: multiple methods tables found for 'crop'
## Warning: multiple methods tables found for 'extend'
library(sf) # represents simple features as native R objects
library(randomForest) # implements Breiman’s RF algorithm
library(dplyr) # provides basic data transformation functions
library(ggplot2) # provides extension of visualizations
library(corrplot) # provides a graphical display of a correlation matrix
library(gridExtra) # provides user-level functions to arrange multiple grid-based plots on a page
library(tmaptools) # provides a set of tools for reading and processing spatial data
library(tmap) # provides a flexible and easy to use approach to create thematic maps
library(plotly) # creates interactive web-based graphs
library(lime) # provides local interpretation for ML models
library(caret) # provides procedures for training and evaluating machine learning models
## Warning: package 'lattice' was built under R version 4.1.0
Next, set up your work environment. Note, this is the directory or folder that contains raster variables and training data.
# set up the working directory
setwd("/home/kamusoko/Documents/GIS Company/2020-2021/ai.geolabs/Courses and Training/Online Course Bundles/1_Self-paced Courses/1_Beginner and Intermediate/Geospatial analysis and Modeling/Modeling/Regression_ML/Data/Raster/GEE")
# Create a list of raster bands that will be used for modeling
Elevation <- brick("Elevation.tif") # ASTER DEM
Slope <- brick("Slope.tif") # Derived from ASTER DEM
TPI <- brick("TPI.tif") # Topographic Position Index (Derived from ASTER DEM)
TRI <- brick("TRI.tif") # Topographic Ruggedness Index (Derived from ASTER DEM)
Cross_curvature <- brick("Cross_curvature.tif") # Derived from ASTER DEM
Profile_curvature <- brick("Profile_curvature.tif") # Derived from ASTER DEM
Asc_VV <- brick("S1_median_Asc_VV.tif") # Ascending VV
Asc_VH <- brick("S1_median_Asc_VH.tif") # Ascending VV
# Combine or stack the raster layers.
rvars <- stack(Elevation, Slope, TPI, TRI, Cross_curvature, Profile_curvature, Asc_VV, Asc_VH)
# Create an object “ta_data” and import the training data
ta_data <- readOGR(getwd(), "Chimanimani_Landslides_2019_Poly")
## OGR data source with driver: ESRI Shapefile
## Source: "/home/kamusoko/Documents/GIS Company/2020-2021/ai.geolabs/Courses and Training/Online Course Bundles/1_Self-paced Courses/1_Beginner and Intermediate/Geospatial analysis and Modeling/Modeling/Regression_ML/Data/Raster/GEE", layer: "Chimanimani_Landslides_2019_Poly"
## with 111 features
## It has 1 fields
Check the attributes of the imagery and the training data.
# Check the raster variables and training data
print(rvars)
## class : RasterStack
## dimensions : 8523, 7525, 64135575, 8 (nrow, ncol, ncell, nlayers)
## resolution : 10, 10 (x, y)
## extent : 430870, 506120, 7773710, 7858940 (xmin, xmax, ymin, ymax)
## crs : +proj=utm +zone=36 +south +datum=WGS84 +units=m +no_defs
## names : Elevation, Slope, TPI, TRI, Cross_curvature, Profile_curvature, S1_median_Asc_VV, S1_median_Asc_VH
## min values : -32768, ?, ?, ?, ?, ?, ?, ?
## max values : 32767, ?, ?, ?, ?, ?, ?, ?
summary(ta_data)
## Object of class SpatialPolygonsDataFrame
## Coordinates:
## min max
## x 432127.7 502322.8
## y 7775268.4 7858554.3
## Is projected: TRUE
## proj4string :
## [+proj=utm +zone=36 +south +datum=WGS84 +units=m +no_defs]
## Data attributes:
## Class
## Landslide:48
## Others :63
The training data has 48 landslides and 63 other classes. Let’s create a balanced training data set.
Following are the steps. First, we will create a vector of unique land cover values.
# Create a vector of unique land cover values
uniqueClasses <- unique(ta_data$Class)
uniqueClasses
## [1] Landslide Others
## Levels: Landslide Others
Second, we will sample the same number of random points inside all polygons.
# Set the seed
set.seed(27)
for (i in 1:length(uniqueClasses)) {
# Select only ploygons from ghe current class
class_data <- subset(ta_data, Class == uniqueClasses[i])
# Get random points for these polygons
classpts <- spsample(class_data, type = "random", n = 200)
# Add class column to the SpatialPoints object
classpts$class <- rep(uniqueClasses[i], length(classpts))
if (i == 1) {
xy <- classpts
} else {
xy <- rbind(xy, classpts)
}
}
## Warning in proj4string(obj): CRS object has comment, which is lost in output; in tests, see
## https://cran.r-project.org/web/packages/sp/vignettes/CRS_warnings.html
## Warning in proj4string(obj): CRS object has comment, which is lost in output; in tests, see
## https://cran.r-project.org/web/packages/sp/vignettes/CRS_warnings.html
Next, let’s plot xy data set (random points) on the Sentinel-1 imagery.
# Set-up timer
timeStart <- proc.time()
# Plot the random points on Sentinel-1 imagery
plotRGB(rvars, r = 7, g = 8, b = 7, stretch = "lin")
#plot(rvars, 4)
points(xy)
proc.time() - timeStart
## user system elapsed
## 1.072 0.088 1.159
We are going to use the extract() function to extract values from the different raster variables (rvars).
# Set seed
set.seed(27)
# Extract reflectance values from the raster variables and training sample points
train_samples <- extract(rvars, y = xy, cellnumbers = TRUE)
train_samples = data.frame(response = xy$class, train_samples)
# Check the structure of the training data
str(train_samples)
## 'data.frame': 400 obs. of 10 variables:
## $ response : Factor w/ 2 levels "Landslide","Others": 1 1 1 1 1 1 1 1 1 1 ...
## $ cells : num 27735844 37412269 27555348 38209830 38691307 ...
## $ Elevation : num 1311 1429 1548 1405 1402 ...
## $ Slope : num 19.29 4.52 34.22 0 28.81 ...
## $ TPI : num 2.625 0.625 -7.375 0 -4.125 ...
## $ TRI : num 2.625 0.625 7.375 0 4.125 ...
## $ Cross_curvature : num -1.18e-17 4.60e-03 -9.64e-02 0.00 0.00 ...
## $ Profile_curvature: num 0.0589 0.0185 -0.0508 0 -0.074 ...
## $ S1_median_Asc_VV : num -8.56 -11.93 -12.02 -8.66 -12.81 ...
## $ S1_median_Asc_VH : num -17.2 -19.3 -18 -15.8 -19.4 ...
Next, check the training points using the table() function.
# Check the number of training points
table(train_samples$response)
##
## Landslide Others
## 200 200
Next, check for any duplicates in the training points.
# Check for any duplicates
any(duplicated(train_samples$cells))
## [1] TRUE
The training points contain some duplicates. Therefore, we need to remove the duplicates.
train_samples <- train_samples[!duplicated(train_samples$cells), -2]
names(train_samples)
## [1] "response" "Elevation" "Slope"
## [4] "TPI" "TRI" "Cross_curvature"
## [7] "Profile_curvature" "S1_median_Asc_VV" "S1_median_Asc_VH"
Next, let’s check the number of the remaining training points using the table() function.
# Check the number of remaining training points
table(train_samples$response)
##
## Landslide Others
## 173 199
Next, we will split the training data into training and test sets based on the createDataPartition() function from the caret package. The training data will comprise 60%, while the test data set will include 40%.
# Set seed and split training data
set.seed(27)
inTraining <- createDataPartition(train_samples$response, p = .60, list = FALSE)
training <- train_samples[ inTraining,]
testing <- train_samples[-inTraining,]
Performing exploratory data analysis (EDA) before training machine learning models is very important. We start by checking the descriptive statistics of the training data set using the summary() function.
# Check summary statistics
summary(training)
## response Elevation Slope TPI
## Landslide:104 Min. : 475.0 Min. : 0.000 Min. :-10.37500
## Others :120 1st Qu.: 795.5 1st Qu.: 2.992 1st Qu.: -0.90625
## Median :1397.0 Median : 8.111 Median : 0.00000
## Mean :1245.7 Mean :12.550 Mean : -0.07533
## 3rd Qu.:1448.0 3rd Qu.:16.699 3rd Qu.: 0.75000
## Max. :2063.0 Max. :58.782 Max. : 12.37500
## TRI Cross_curvature Profile_curvature S1_median_Asc_VV
## Min. : 0.000 Min. :-0.1333309 Min. :-0.076975 Min. :-19.612
## 1st Qu.: 0.500 1st Qu.: 0.0000000 1st Qu.:-0.019777 1st Qu.:-12.026
## Median : 1.125 Median : 0.0000000 Median : 0.000000 Median :-10.829
## Mean : 1.986 Mean :-0.0003023 Mean :-0.001516 Mean :-10.673
## 3rd Qu.: 2.531 3rd Qu.: 0.0000000 3rd Qu.: 0.019704 3rd Qu.: -9.342
## Max. :12.375 Max. : 0.1333309 Max. : 0.076975 Max. : -3.562
## S1_median_Asc_VH
## Min. :-25.031
## 1st Qu.:-18.571
## Median :-17.019
## Mean :-17.125
## 3rd Qu.:-15.537
## Max. : -9.718
Next, we are going to use the cor() function to check the correlation between predictor variables.
# Check band correlations
bandCorrelations <- cor(training[, 2:9])
bandCorrelations
## Elevation Slope TPI TRI
## Elevation 1.000000000 0.418822974 0.002669579 0.39227577
## Slope 0.418822974 1.000000000 0.005908755 0.97389996
## TPI 0.002669579 0.005908755 1.000000000 0.03281532
## TRI 0.392275768 0.973899959 0.032815319 1.00000000
## Cross_curvature -0.032075640 -0.040534014 0.543317569 -0.05408707
## Profile_curvature -0.015323367 -0.034033600 0.859319394 -0.01837885
## S1_median_Asc_VV 0.086447399 -0.080316685 0.058886532 -0.13075549
## S1_median_Asc_VH -0.023755091 -0.053238763 0.048697798 -0.09663791
## Cross_curvature Profile_curvature S1_median_Asc_VV
## Elevation -0.03207564 -0.01532337 0.08644740
## Slope -0.04053401 -0.03403360 -0.08031669
## TPI 0.54331757 0.85931939 0.05888653
## TRI -0.05408707 -0.01837885 -0.13075549
## Cross_curvature 1.00000000 0.23535510 0.10882554
## Profile_curvature 0.23535510 1.00000000 0.08203996
## S1_median_Asc_VV 0.10882554 0.08203996 1.00000000
## S1_median_Asc_VH 0.10152396 0.06985885 0.94868730
## S1_median_Asc_VH
## Elevation -0.02375509
## Slope -0.05323876
## TPI 0.04869780
## TRI -0.09663791
## Cross_curvature 0.10152396
## Profile_curvature 0.06985885
## S1_median_Asc_VV 0.94868730
## S1_median_Asc_VH 1.00000000
We will display a mixed plot (numbers and plots) to understand the correlation between the predictors.
# Display mixed plot
corrplot.mixed(bandCorrelations,lower.col="black", number.cex = .7, upper = "color")
We will specify the random forest (RF) model as shown below.
# Set up the model tuning parameters
fitControl<- trainControl(method = "repeatedcv", number = 10, repeats = 10)
# Set-up timer to check how long it takes to run the RF model
timeStart <- proc.time()
set.seed(27)
# Train the RF model
rf_model <- train(response ~.,data = training,
method = "rf",
trControl = fitControl,
prox = TRUE,
fitBest = FALSE,
localImp = TRUE,
returnData = TRUE)
proc.time() - timeStart
## user system elapsed
## 31.871 0.272 32.147
Next, check the RF model performance using the print() function.
# Check the RF model performance
print(rf_model)
## Random Forest
##
## 224 samples
## 8 predictor
## 2 classes: 'Landslide', 'Others'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 10 times)
## Summary of sample sizes: 201, 201, 202, 202, 201, 202, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.8841700 0.7683833
## 5 0.9060474 0.8117782
## 8 0.9015217 0.8023295
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 5.
Next, check the parameters of the best model.
# check the parameters of the best model.
rf_model$fin
##
## Call:
## randomForest(x = x, y = y, mtry = param$mtry, localImp = TRUE, proximity = TRUE, fitBest = FALSE, returnData = TRUE)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 5
##
## OOB estimate of error rate: 10.27%
## Confusion matrix:
## Landslide Others class.error
## Landslide 96 8 0.07692308
## Others 15 105 0.12500000
Next, calculate and display the top predictor variables.
# Compute RF variable importance
rf_varImp <- varImp(rf_model)
ggplot(rf_varImp) # Display the most important predictor variables
Next, we will assess the RF model performance using new testing data.
# Prepare a confusion matrix
pred_rf <- predict(rf_model, newdata = testing)
confusionMatrix(data = pred_rf, testing$response)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Landslide Others
## Landslide 62 9
## Others 7 70
##
## Accuracy : 0.8919
## 95% CI : (0.8304, 0.9369)
## No Information Rate : 0.5338
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.7832
##
## Mcnemar's Test P-Value : 0.8026
##
## Sensitivity : 0.8986
## Specificity : 0.8861
## Pos Pred Value : 0.8732
## Neg Pred Value : 0.9091
## Prevalence : 0.4662
## Detection Rate : 0.4189
## Detection Prevalence : 0.4797
## Balanced Accuracy : 0.8923
##
## 'Positive' Class : Landslide
##
We will use the RF model to landslide risk using the predict() function.
# Predict land cover
timeStart<- proc.time() # measure computation time
lc_rf <- predict(rvars, rf_model, type = "prob")
proc.time() - timeStart # user time and system time
## user system elapsed
## 754.959 8.803 763.772
We are going to display a landslide risk map using the tmap package.
# Display landslide risk map using the tmap and tmaptools packages
tmap_mode("view")
## tmap mode set to interactive viewing
tm_shape(lc_rf) + tm_raster(style= "quantile", n=7, palette=get_brewer_pal("Reds", n = 7, plot=FALSE)) +
tm_layout(legend.outside = TRUE)
## stars object downsampled to 940 by 1065 cells. See tm_shape manual (argument raster.downsample)
Finally, save the landslide risk map in “img” or “geotiff” format.
# Save the landslide risk map in “img” or “geotiff” formats
writeRaster(lc_rf, filename="Chiman_Landslide_Risk.img", datatype='FLT4S', index=1, na.rm=TRUE, progress="window", overwrite=TRUE)
In this step, we will use the LIME package to help us explain individual predictions. First, we remove the response variable since LIME needs data without a response variable.
# Remove the response variable (response)
training_x <- dplyr::select(training, -response)
testing_x <- dplyr::select(testing, -response)
training_y <- dplyr::select(training, response)
testing_y <- dplyr::select(testing, response)
Next, let’s build the explainer using the LIME() function.
# Create an explainer object
set.seed(27)
explainer_rf <- lime(training_x, rf_model, n_bins = 5, quantile_bins = TRUE)
summary(explainer_rf)
## Length Class Mode
## model 23 train list
## preprocess 1 -none- function
## bin_continuous 1 -none- logical
## n_bins 1 -none- numeric
## quantile_bins 1 -none- logical
## use_density 1 -none- logical
## feature_type 8 -none- character
## bin_cuts 8 -none- list
## feature_distribution 8 -none- list
Next, we will use the explain() function to predict and evaluate the explanations.
timeStart<- proc.time() # measure computation time
# Explain new observations
explanation_rf <- explain(
x = testing_x,
explainer = explainer_rf,
n_permutations = 5000,
dist_fun = "euclidean",
kernel_width = .75,
n_features = 8,
feature_select = "highest_weights",
n_labels = 2
)
proc.time() - timeStart # user time and system time
## user system elapsed
## 49.452 0.452 49.908
Finally, let’s visualize cases (single pixels) between rows 1 and 50.
# Create a plot to visualize the explanaitions
plot_features(explanation_rf[1:50, ], ncol = 2)