Patrick Gray (patrick.c.gray at duke) - https://github.com/patrickcgray

Chapter 5: Classification of Land Cover

Introduction

In this chapter we will classify the Sentinel-2 image we've been working with using a supervised classification approach which incorporates the training data we worked with in chapter 4. Specifically, we will be using Naive Bayes. Naive Bayes predicts the probability that a data point belongs to each class, and the class with the highest probability is taken as the most likely class. It gets these probabilities by using Bayes' Theorem, which describes the probability of a feature based on prior knowledge of conditions that might be related to that feature. Naive Bayes is quite fast when compared to some other machine learning approaches (e.g., SVM can be quite computationally intensive). This isn't to say that it is the best per se; rather it is [...]

Okay looks good! Our raster dataset is ready.

Now our goal is to get the pixels from the raster as outlined in each shapefile. Our training data, the shapefile we've worked with, contains one main field we care about:

* a Classname field (String datatype)

Combined with the innate location information of polygons in a shapefile, we have all that we need for pairing labels with the information in our raster.

However, in order to pair up our vector data with our raster pixels, we will need a way of co-aligning the datasets in space. We'll do this using the rasterio mask function, which takes in a dataset and a polygon and then outputs a numpy array with the pixels in the polygon.

Let's run through an example:

    full_dataset.crs

    CRS.from_epsg(4326)

Open up our shapefile and check its crs:

    import geopandas as gpd

    shapefile = gpd.read_file("../data/rcr/rcr_landcover.shp")
    shapefile.crs

    {'init': 'epsg:32618'}

Remember the projections don't match!
Let's use some geopandas magic to reproject all our shapefiles to lat, long:

    shapefile = shapefile.to_crs({'init': 'epsg:4326'})
    shapefile.crs

    {'init': 'epsg:4326'}

    len(shapefile)

    23

Now we want to extract the geometry of each feature in the shapefile in GeoJSON format:

    # this generates a list of shapely geometries
    geoms = shapefile.geometry.values
    # let's grab a single shapely geometry to check
    geometry = geoms[2]
    print(type(geometry))
    print(geometry)
    # transform to GeoJSON format
    from shapely.geometry import mapping
    feature = [mapping(geometry)]  # can also do this using polygon.__geo_interface__
    print(type(feature))
    print(feature)

    <class 'shapely.geometry.polygon.Polygon'>
    POLYGON ((-76.67593927883173 34.69487548849214, -76.67573882771855 34.69451319913902, -76.67666934555091 34.69360077384821, -76.67676946161477 34.69421769352402, -76.67593927883173 34.69487548849214))
    <class 'list'>
    [{'type': 'Polygon', 'coordinates': (((-76.67593927883173, 34.69487548849214), (-76.67573882771855, 34.694513199139024), (-76.6766693455509, 34.69360077384821), (-76.67676946161477, 34.69421769352402), (-76.67593927883173, 34.69487548849214)),)}]

Now let's extract the raster values within the polygon using the rasterio mask() function:

    from rasterio.mask import mask

    out_image, out_transform = mask(full_dataset, feature, crop=True)
    out_image.shape

    (8, 18, 13)

Okay, those look like the right dimensions for our training data. 8 bands and 6x8 pixels seems reasonable given our earlier explorations.

We'll be doing a lot of memory intensive work so let's clean up and close this dataset:

    full_dataset.close()

Building the Training Data for scikit-learn

Now let's do it for all features in the shapefile and create an array X that has all the pixels and an array y that has all the training labels.
    import numpy as np
    import rasterio

    X = np.array([], dtype=np.int8).reshape(0, 8)  # pixels for training
    y = np.array([], dtype=str)  # labels for training (np.string_ was removed in NumPy 2.0)

    # extract the raster values within the polygon
    with rasterio.open(img_fp) as src:
        band_count = src.count
        for index, geom in enumerate(geoms):
            feature = [mapping(geom)]

            # the mask function returns an array of the raster pixels within this feature
            out_image, out_transform = mask(src, feature, crop=True)
            # eliminate all the pixels with 0 values for all 8 bands (masked-out pixels)
            out_image_trimmed = out_image[:, ~np.all(out_image == 0, axis=0)]
            # eliminate all the pixels with 255 values for all 8 bands
            out_image_trimmed = out_image_trimmed[:, ~np.all(out_image_trimmed == 255, axis=0)]
            # reshape the array to [pixel count, bands]; the trimmed array is
            # [bands, pixel count], so it must be transposed (a plain reshape
            # would mix values from different pixels into one row)
            out_image_reshaped = out_image_trimmed.T
            # append the labels to the y array
            y = np.append(y, [shapefile["Classname"][index]] * out_image_reshaped.shape[0])
            # stack the pixels onto the pixel array
            X = np.vstack((X, out_image_reshaped))

Pairing Y with X

Now that we have the image we want to classify (our X feature inputs), and the land cover labels (our y labeled data), let's check to make sure they match in size so we can feed them to Naive Bayes:

    # What are our classification labels?
    labels = np.unique(shapefile["Classname"])
    print('The training data include {n} classes: {classes}\n'.format(n=labels.size, classes=labels))

    # We will need a "X" matrix containing our features, and a "y" array containing our labels
    print('Our X matrix is sized: {sz}'.format(sz=X.shape))
    print('Our y array is sized: {sz}'.format(sz=y.shape))

    The training data include 6 classes: ['Emergent Wetland' 'Forested Wetland' 'Herbaceous' 'Sand' 'Subtidal Haline' 'WetSand']

    Our X matrix is sized: (598, 8)
    Our y array is sized: (598,)

It all looks good! Let's explore the spectral signatures of each class now to make sure they're actually separable, since all we're going by in this classification is pixel values.
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots(1, 3, figsize=[20, 8])

    # numbers 1-8
    band_count = np.arange(1, 9)

    classes = np.unique(y)
    for class_type in classes:
        band_intensity = np.mean(X[y == class_type, :], axis=0)
        # plot them as lines
        ax[0].plot(band_count, band_intensity, label=class_type)
        ax[1].plot(band_count, band_intensity, label=class_type)
        ax[2].plot(band_count, band_intensity, label=class_type)

    # Add some axis labels
    ax[0].set_xlabel('Band #')
    ax[0].set_ylabel('Reflectance Value')
    ax[1].set_ylabel('Reflectance Value')
    ax[1].set_xlabel('Band #')
    ax[2].set_ylabel('Reflectance Value')
    ax[2].set_xlabel('Band #')
    # ax[0].set_ylim(32, 38)
    ax[1].set_ylim(32, 38)
    ax[2].set_ylim(70, 140)
    ax[1].legend(loc="upper right")
    # Add a title
    ax[0].set_title('Band Intensities Full Overview')
    ax[1].set_title('Band Intensities Lower Ref Subset')
    ax[2].set_title('Band Intensities Higher Ref Subset')

    Text(0.5, 1.0, 'Band Intensities Higher Ref Subset')

They look okay but emergent wetland and water look quite similar! They're going to be difficult to differentiate.

Let's make a quick helper function that converts the class labels into integer indices.

    def str_class_to_int(class_array):
        class_array[class_array == 'Subtidal Haline'] = 0
        class_array[class_array == 'WetSand'] = 1
        class_array[class_array == 'Emergent Wetland'] = 2
        class_array[class_array == 'Sand'] = 3
        class_array[class_array == 'Herbaceous'] = 4
        class_array[class_array == 'Forested Wetland'] = 5
        return class_array.astype(int)

Training the Classifier

Now that we have our X matrix of feature inputs (the spectral bands) and our y array (the labels), we can train our model.

See the scikit-learn documentation for the usage of the Gaussian Naive Bayes classifier (GaussianNB).
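Before handing the data to scikit-learn, it can help to see roughly what a Gaussian Naive Bayes classifier computes. The sketch below is not scikit-learn's implementation. It is a minimal NumPy illustration of the idea from the introduction (pick the class with the highest probability under Bayes' Theorem), run on a tiny made-up two-band dataset. The function names `fit_gaussian_nb` and `predict_gaussian_nb`, the variance floor `eps`, and the demo values are all assumptions made for illustration.

```python
import numpy as np

def fit_gaussian_nb(X, y, eps=1e-9):
    """Estimate per-class priors, band means, and band variances."""
    classes = np.unique(y)
    priors = np.array([np.mean(y == c) for c in classes])
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    # eps keeps variances away from zero (illustrative value, an assumption)
    variances = np.array([X[y == c].var(axis=0) for c in classes]) + eps
    return classes, priors, means, variances

def predict_gaussian_nb(X, classes, priors, means, variances):
    """Pick the class with the highest posterior for every row of X."""
    # log P(class) + sum over bands of log Normal(x_band | mean, variance)
    diff = X[:, None, :] - means[None, :, :]
    log_likelihood = -0.5 * np.sum(
        np.log(2.0 * np.pi * variances)[None, :, :]
        + diff ** 2 / variances[None, :, :],
        axis=2,
    )
    log_posterior = np.log(priors)[None, :] + log_likelihood
    return classes[np.argmax(log_posterior, axis=1)]

# Tiny made-up training set: two "bands", two classes
X_demo = np.array([[10.0, 80.0], [12.0, 82.0], [11.0, 79.0],
                   [40.0, 20.0], [42.0, 22.0], [41.0, 19.0]])
y_demo = np.array(["Water", "Water", "Water", "Sand", "Sand", "Sand"])

model = fit_gaussian_nb(X_demo, y_demo)
demo_prediction = predict_gaussian_nb(
    np.array([[11.0, 81.0], [39.0, 21.0]]), *model
)
print(demo_prediction)
```

The "naive" part is the sum over bands: each band is treated as independent given the class, which is why the method is fast but can struggle when classes differ mainly in how bands vary together.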
    from sklearn.naive_bayes import GaussianNB

    gnb = GaussianNB()
    gnb.fit(X, y)

    GaussianNB(priors=None, var_smoothing=1e-09)

It is that simple to train a classifier in scikit-learn! The hard part is often validation and interpretation.

Predicting on the image

With our Naive Bayes classifier fit, we can now proceed by trying to classify the entire image.

We're only going to open the subset of the image we viewed above because otherwise it is computationally too intensive for most users.

    from rasterio.plot import show
    from rasterio.plot import show_hist
    from rasterio.windows import Window
    from rasterio.plot import reshape_as_raster, reshape_as_image

    with rasterio.open(img_fp) as src:
        # may need to reduce this image size if your kernel crashes, takes a lot of memory
        img = src.read()[:, 150:600, 250:1400]

    # Take our full image and reshape into long 2d array (nrow * ncol, nband) for classification
    print(img.shape)
    reshaped_img = reshape_as_image(img)
    print(reshaped_img.shape)

    (8, 450, 1150)
    (450, 1150, 8)

Now we can predict for each pixel in our image:

    class_prediction = gnb.predict(reshaped_img.reshape(-1, 8))

    # Reshape our classification map back into a 2D matrix so we can visualize it
    class_prediction = class_prediction.reshape(reshaped_img[:, :, 0].shape)

Because our shapefile came with the labels as strings, we want to convert them to a numpy array with ints using the helper function we made earlier.

    class_prediction = str_class_to_int(class_prediction)

Let's visualize it!

First we'll make a colormap so we can visualize the classes, which are just encoded as integers, in more logical colors. Don't worry too much if this code is confusing! It can be a little clunky to specify colormaps for matplotlib.
    def color_stretch(image, index):
        colors = image[:, :, index].astype(np.float64)
        for b in range(colors.shape[2]):
            colors[:, :, b] = rasterio.plot.adjust_band(colors[:, :, b])
        return colors

    # find the highest pixel value in the prediction image
    n = int(np.max(class_prediction))

    # next setup a colormap for our map
    colors = dict((
        (0, (48, 156, 214, 255)),   # Blue - Water
        (1, (139, 69, 19, 255)),    # Brown - WetSand
        (2, (96, 19, 134, 255)),    # Purple - Emergent Wetland
        (3, (244, 164, 96, 255)),   # Tan - Sand
        (4, (206, 224, 196, 255)),  # Lime - Herbaceous
        (5, (34, 139, 34, 255)),    # Forest Green - Forest
    ))

    # Put 0 - 255 as float 0 - 1
    for k in colors:
        v = colors[k]
        _v = [_v / 255.0 for _v in v]
        colors[k] = _v

    index_colors = [colors[key] if key in colors else
                    (255, 255, 255, 0) for key in range(0, n+1)]

    cmap = plt.matplotlib.colors.ListedColormap(index_colors, 'Classification', n+1)

Now show the classified map next to the RGB image!

    fig, axs = plt.subplots(2, 1, figsize=(10, 7))

    img_stretched = color_stretch(reshaped_img, [4, 3, 2])
    axs[0].imshow(img_stretched)

    axs[1].imshow(class_prediction, cmap=cmap, interpolation='none')

    fig.show()

This looks pretty good!

Let's generate a map of Normalized Difference Water Index (NDWI) and NDVI just to compare with our output map.

NDWI is similar to NDVI but for identifying water.
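For reference, both indices can be written as normalized band differences. The symbols are shorthand introduced here (Green, Red, and NIR for the green, red, and near-infrared band values), and the NDWI shown is the green/NIR form, which is the one the code that follows computes.

```latex
\mathrm{NDWI} = \frac{\mathrm{Green} - \mathrm{NIR}}{\mathrm{Green} + \mathrm{NIR}},
\qquad
\mathrm{NDVI} = \frac{\mathrm{NIR} - \mathrm{Red}}{\mathrm{NIR} + \mathrm{Red}}
```

For non-negative band values both indices fall between -1 and 1. Water absorbs strongly in the near-infrared, so it tends toward high NDWI, while healthy vegetation reflects strongly in the near-infrared and tends toward high NDVI.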
    with rasterio.open(img_fp) as src:
        green_band = src.read(3)
        red_band = src.read(4)
        nir_band = src.read(8)

    ndwi = (green_band.astype(float) - nir_band.astype(float)) / (green_band.astype(float) + nir_band.astype(float))
    ndvi = (nir_band.astype(float) - red_band.astype(float)) / (red_band.astype(float) + nir_band.astype(float))

Subset them to our area of interest:

    ndwi = ndwi[150:600, 250:1400]
    ndvi = ndvi[150:600, 250:1400]

Display all four maps:

    fig, axs = plt.subplots(2, 2, figsize=(15, 7))

    img_stretched = color_stretch(reshaped_img, [3, 2, 1])
    axs[0,0].imshow(img_stretched)

    axs[0,1].imshow(class_prediction, cmap=cmap, interpolation='none')

    ndwi_plot = axs[1,0].imshow(ndwi, cmap="RdYlGn")
    axs[1,0].set_title("NDWI")
    fig.colorbar(ndwi_plot, ax=axs[1,0])

    ndvi_plot = axs[1,1].imshow(ndvi, cmap="RdYlGn")
    axs[1,1].set_title("NDVI")
    fig.colorbar(ndvi_plot, ax=axs[1,1])

    plt.show()

Looks pretty good! Areas that are high on the NDWI ratio are generally classified as water and areas high on NDVI are forest and herbaceous. It does seem like the wetland areas (e.g. the bottom right island complex) aren't being picked up, so it might be worth experimenting with other algorithms!

Let's take a closer look at the Duke Marine Lab and the tip of the Rachel Carson Reserve.

    fig, axs = plt.subplots(1, 2, figsize=(15, 15))

    img_stretched = color_stretch(reshaped_img, [3, 2, 1])
    axs[0].imshow(img_stretched[0:180, 160:350])

    axs[1].imshow(class_prediction[0:180, 160:350], cmap=cmap, interpolation='none')

    fig.show()

This actually doesn't look half bad! Land cover mapping is a complex problem and one where there are many approaches and tools for improving a map.
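The visual impression above, that high-NDWI areas are generally classified as water, can be turned into a rough number. The sketch below is one possible way to do that and is not part of the original workflow: the helper name `fraction_above`, the threshold of 0, and the small demo arrays are assumptions for illustration. Class 0 stands in for water, matching the integer coding used by `str_class_to_int`.

```python
import numpy as np

def fraction_above(index_map, class_map, class_id, threshold=0.0):
    """Share of pixels labelled class_id whose index value exceeds threshold."""
    in_class = class_map == class_id
    if not in_class.any():
        # no pixels of this class, so the share is undefined
        return float("nan")
    return float(np.mean(index_map[in_class] > threshold))

# Made-up 2x3 example: class 0 stands in for water
demo_classes = np.array([[0, 0, 4],
                         [0, 5, 4]])
demo_ndwi = np.array([[0.6, 0.4, -0.3],
                      [-0.1, -0.5, -0.2]])

water_share = fraction_above(demo_ndwi, demo_classes, class_id=0)
print(water_share)
```

With the real arrays, a call along the lines of `fraction_above(ndwi, class_prediction, class_id=0)` would give the share of predicted water pixels with positive NDWI. This is only a sanity check between two products derived from the same image, not an accuracy measure.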
Testing an Unsupervised Classification Algorithm

Let's also try an unsupervised classification algorithm, k-means clustering, in the scikit-learn library (documentation).

K-means (wikipedia page) aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, serving as a prototype of the cluster.

    from sklearn.cluster import KMeans

    bands, rows, cols = img.shape

    k = 10  # num of clusters

    kmeans_predictions = KMeans(n_clusters=k, random_state=0).fit(reshaped_img.reshape(-1, 8))

    kmeans_predictions_2d = kmeans_predictions.labels_.reshape(rows, cols)

    # Now show the classmap next to the image
    fig, axs = plt.subplots(1, 2, figsize=(15, 8))

    img_stretched = color_stretch(reshaped_img, [3, 2, 1])
    axs[0].imshow(img_stretched)

    axs[1].imshow(kmeans_predictions_2d)

Wow, this looks like it was better able to distinguish some areas like the wetland and submerged sand than our supervised classification approach! But supervised usually does better with some tuning, and luckily there are lots of ways to think about improving our supervised method.

Wrapup

We've seen how we can use scikit-learn to implement the Naive Bayes classifier for land cover classification. A couple of future directions that immediately follow this tutorial include:

* Extend the lessons learned in the visualization chapter to explore the class separability along various dimensions of the data. For example, plot bands against each other and label each point in the scatter plot a different color according to the training data label.
* Add additional features - would using NDVI as well as the spectral bands improve our classification?
* scikit-learn includes many machine learning classifiers -- are any of these better than Naive Bayes for our goal? SVM? Nearest Neighbors? Others?
* In this example we only use 8-bit imagery; 16 or 32 bit may contain more information that helps distinguish the classes.
* Our training data was created using ultra-high resolution drone imagery. A good deal of error could be coming from the fact that the training samples don't line up exactly with the classes in this imagery. Editing the training shapefile to be better matched to this image could lead to major improvement.
* This approach only leverages the spectral information in the Sentinel-2 image. What would happen if we looked into some spatial information metrics, like incorporating moving window statistics?
* And while more advanced, deep learning methods (like in the next chapter!!) could lead to major improvements in this classification!

Quantitative Accuracy Assessments!

We examined our maps for qualitative accuracy, but we'll need to perform a proper accuracy assessment based on a probability sample to conclude anything about the accuracy of the entire area. With the information from the accuracy assessment, we will be able not only to tell how good the map is, but more importantly we will be able to come up with statistically defensible unbiased estimates with confidence intervals of the land cover class areas in the map. For more information, see Olofsson et al., 2013.

In the next chapter (link to webpage or Notebook) we'll explore how we can classify land cover on a larger scale and more accurately with deep neural networks. We'll also use some more quantitative accuracy assessment methods.
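As a small preview of what a quantitative assessment involves, the sketch below builds a confusion matrix and the usual summary accuracies from two label arrays. It is a simplified illustration under stated assumptions: the labels are made up, the function name `confusion_matrix_counts` is hypothetical, and the reference labels are assumed to come from a probability sample that is independent of the training polygons. It uses plain pixel counts and does not implement the area-weighted estimators or confidence intervals discussed by Olofsson et al.

```python
import numpy as np

def confusion_matrix_counts(reference, predicted, n_classes):
    """Count matrix with reference classes as rows and predicted classes as columns."""
    matrix = np.zeros((n_classes, n_classes), dtype=int)
    for ref, pred in zip(reference, predicted):
        matrix[ref, pred] += 1
    return matrix

# Made-up labels for 8 sample pixels (integer class codes 0-2)
reference = np.array([0, 0, 0, 1, 1, 2, 2, 2])
predicted = np.array([0, 0, 1, 1, 1, 2, 0, 2])

cm = confusion_matrix_counts(reference, predicted, n_classes=3)

# Overall accuracy: correctly labelled samples / all samples
overall_accuracy = np.trace(cm) / cm.sum()
# Producer's accuracy: correct / reference total for each class (rows)
producers_accuracy = np.diag(cm) / cm.sum(axis=1)
# User's accuracy: correct / predicted total for each class (columns)
users_accuracy = np.diag(cm) / cm.sum(axis=0)

print(cm)
print(overall_accuracy)
print(producers_accuracy)
print(users_accuracy)
```

Producer's and user's accuracy are reported per class because a single overall figure can hide a class, such as the wetlands here, that is rarely predicted correctly.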
