
Exploratory Data Analysis: The Critical Link to Successful Implementation of Geospatial Machine Learning

Introduction

Geospatial data such as remotely sensed imagery is often taken at face value. Many remote sensing researchers and analysts assume that remotely sensed and reference data are accurate. This assumption does not hold, because errors are introduced during data acquisition and during the compilation of reference data. Today, the remote sensing community has greater access to free and open remotely sensed imagery, powerful machine learning algorithms, and cloud computing platforms (Google Earth Engine, Microsoft Planetary Computer). As a result, remote sensing researchers and analysts have an increased capacity to produce geospatial information using machine learning algorithms. However, there is a persistent tendency to run machine learning algorithms without understanding the quality of the data or the assumptions underlying the algorithms. This mistake is especially common among inexperienced remote sensing researchers and analysts.

The Importance of Exploratory Data Analysis

Exploratory data analysis (EDA) is essential before training machine learning models. EDA refers to the process of using summary statistics and graphical representations (histograms, scatter plots, box plots, etc.) to perform preliminary data analysis, identify patterns, detect anomalies and outliers, check predictor correlations, test hypotheses, and verify assumptions. The purpose of EDA is to understand the overall distribution and uncover potential problems with a given data set. In addition, EDA provides the capacity to find new insights in a data set and improve the quality of the overall research or project.
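
Two of the checks listed above, detecting outliers and checking predictor correlations, can be sketched in a few lines. The example below is only an illustration, not a method taken from this article: it assumes Tukey's 1.5 × IQR rule as the outlier criterion and Pearson's r as the correlation measure, and the function names (`iqr_outliers`, `pearson`) and the band reflectance values are hypothetical.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return the values that fall outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartiles
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_x, mean_y = statistics.fmean(x), statistics.fmean(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical surface reflectance samples for two predictor bands.
# The last pixel is deliberately extreme (for example, an unmasked cloud).
red = [0.08, 0.09, 0.10, 0.11, 0.09, 0.10, 0.12, 0.95]
nir = [0.30, 0.32, 0.35, 0.36, 0.31, 0.34, 0.38, 0.97]

print("Suspect red values:", iqr_outliers(red))
print("Red/NIR correlation:", round(pearson(red, nir), 3))
```

Flagged values are candidates for inspection rather than automatic removal, and a strong correlation between two predictors suggests that one of them may be redundant.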

Exploratory Data Analysis Methods

Remote sensing researchers and analysts can use simple EDA or more complex exploratory spatial data methods, depending on the research or project. In most cases, simple univariate and multivariate methods provide enough insight before running machine learning algorithms. For example, remote sensing researchers and analysts can explore their data using summary statistics (mean, median, minimum, maximum, etc.), histograms and density plots, box-and-whisker plots, and scatterplot matrices. In general, it is better to keep EDA simple so that more time is spent on understanding the data set than on the tooling. Once the data set is understood, analysts can clean it, remove redundant data, and deal with missing data.
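
As a rough illustration of the simple univariate methods mentioned above, the sketch below computes summary statistics and equal-width histogram counts using only the Python standard library. The helper names (`summarize`, `histogram`), the use of `None` to mark missing samples, and the NDVI values are assumptions made for this example, not part of any particular package or data set.

```python
import statistics


def summarize(values):
    """Summary statistics for one variable; None marks a missing sample."""
    present = [v for v in values if v is not None]
    return {
        "n": len(present),
        "missing": len(values) - len(present),
        "min": min(present),
        "max": max(present),
        "mean": statistics.fmean(present),
        "median": statistics.median(present),
        "stdev": statistics.stdev(present),
    }


def histogram(values, bins=5):
    """Equal-width bin counts as (lower_edge, upper_edge, count) tuples."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 if all values are equal
    counts = [0] * bins
    for v in values:
        # The maximum value is placed in the last bin.
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    return [(lo + i * width, lo + (i + 1) * width, c)
            for i, c in enumerate(counts)]


# Hypothetical NDVI samples for one land cover class, with one missing value.
ndvi = [0.21, 0.25, 0.31, None, 0.35, 0.38, 0.42, 0.44, 0.47, 0.71]

stats = summarize(ndvi)
for name, value in stats.items():
    print(f"{name:>8}: {value}")

# A text histogram: one '#' per sample in each bin.
for lower, upper, count in histogram([v for v in ndvi if v is not None]):
    print(f"{lower:.2f} to {upper:.2f} | {'#' * count}")
```

Comparing the mean with the median and scanning the bin counts gives a quick sense of skew and of whether a class has enough samples across its range.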

Next Steps

EDA is an essential component of data-centric explainable machine learning. My book, ‘Data-centric Explainable Machine Learning for Land Cover Classification: A Practical Guide in R’, has a short section on EDA. If you want to learn more about the book, please check the information at:

Data-centric Explainable Machine Learning