Masking

Masking is a common operation in raster processing. It is setting certain cells to the NoData value. This is usually done to remove low-quality observations from the raster processing. Another related use case is to “clip” a raster to a given polygon.

Masking Example

Let’s demonstrate masking with a pair of bands of Sentinel-2 data. The measurement bands we will use, blue and green, have no defined NoData. They share quality information from a separate file called the scene classification (SCL), which delineates areas of missing data and probable clouds. For more information on this, see the Sentinel-2 algorithm overview. Figure 3 tells us how to interpret the scene classification. For this example, we will exclude NoData, defective pixels, probable clouds, and cirrus clouds: values 0, 1, 8, 9, and 10.

Sentinel-2 Scene Classification Values

Credit: Sentinel-2 algorithm overview

The first step is to create a catalog with our band of interest and the SCL band. We read the data from the catalog, so all tiles are aligned across rows.

from pyspark.sql import Row

blue_uri = 'https://s22s-test-geotiffs.s3.amazonaws.com/luray_snp/B02.tif'
green_uri = 'https://s22s-test-geotiffs.s3.amazonaws.com/luray_snp/B03.tif'
scl_uri = 'https://s22s-test-geotiffs.s3.amazonaws.com/luray_snp/SCL.tif'
cat = spark.createDataFrame([Row(blue=blue_uri, green=green_uri, scl=scl_uri),])
unmasked = spark.read.raster(cat, catalog_col_names=['blue', 'green', 'scl'])
unmasked.printSchema()
root
 |-- blue_path: string (nullable = false)
 |-- green_path: string (nullable = false)
 |-- scl_path: string (nullable = false)
 |-- blue: struct (nullable = true)
 |    |-- tile_context: struct (nullable = true)
 |    |    |-- extent: struct (nullable = false)
 |    |    |    |-- xmin: double (nullable = false)
 |    |    |    |-- ymin: double (nullable = false)
 |    |    |    |-- xmax: double (nullable = false)
 |    |    |    |-- ymax: double (nullable = false)
 |    |    |-- crs: struct (nullable = false)
 |    |    |    |-- crsProj4: string (nullable = false)
 |    |-- tile: tile (nullable = false)
 |-- green: struct (nullable = true)
 |    |-- tile_context: struct (nullable = true)
 |    |    |-- extent: struct (nullable = false)
 |    |    |    |-- xmin: double (nullable = false)
 |    |    |    |-- ymin: double (nullable = false)
 |    |    |    |-- xmax: double (nullable = false)
 |    |    |    |-- ymax: double (nullable = false)
 |    |    |-- crs: struct (nullable = false)
 |    |    |    |-- crsProj4: string (nullable = false)
 |    |-- tile: tile (nullable = false)
 |-- scl: struct (nullable = true)
 |    |-- tile_context: struct (nullable = true)
 |    |    |-- extent: struct (nullable = false)
 |    |    |    |-- xmin: double (nullable = false)
 |    |    |    |-- ymin: double (nullable = false)
 |    |    |    |-- xmax: double (nullable = false)
 |    |    |    |-- ymax: double (nullable = false)
 |    |    |-- crs: struct (nullable = false)
 |    |    |    |-- crsProj4: string (nullable = false)
 |    |-- tile: tile (nullable = false)
unmasked.select(rf_cell_type('blue'), rf_cell_type('scl')).distinct()
rf_cell_type(blue) rf_cell_type(scl)
[uint16raw] [uint8raw]

Define CellType for Masked Tile

Because there is not a NoData already defined for the blue band, we must choose one. In this particular example, the minimum value is greater than zero, so we can use 0 as the NoData value. We will construct a new CellType object to represent this.

blue_min = unmasked.agg(rf_agg_stats('blue').min.alias('blue_min'))
print('Nonzero minimum value in the blue band:', blue_min.first())

blue_ct = unmasked.select(rf_cell_type('blue')).distinct().first()[0][0]
masked_blue_ct = CellType(blue_ct).with_no_data_value(0)
masked_blue_ct.cell_type_name
Nonzero minimum value in the blue band: Row(blue_min=3.0)
'uint16ud0'

We next convert the blue band to this cell type.

converted = unmasked.select('scl', 'green', rf_convert_cell_type('blue', masked_blue_ct).alias('blue'))

Apply Mask from Quality Band

Now we set cells of our blue column to NoData for all locations where the scl tile is in our set of undesirable values. This is the actual masking operation.

from pyspark.sql.functions import lit

masked = converted.withColumn('blue_masked', rf_mask_by_values('blue', 'scl', [0, 1, 8, 9, 10]))
masked

Showing only top 5 rows.

scl green blue blue_masked

We can verify that the number of NoData cells in the resulting blue_masked column matches the total of the boolean mask tile to ensure our logic is correct.

masked.select(rf_no_data_cells('blue_masked'), rf_tile_sum(rf_local_is_in('scl', [0, 1, 8, 9, 10])))

Showing only top 5 rows.

rf_no_data_cells(blue_masked) rf_tile_sum(rf_local_is_in(scl, array(0, 1, 8, 9, 10)))
4475 4475.0
0 0.0
0 0.0
0 0.0
5 5.0

It’s also nice to view a sample. The white regions are areas of NoData.

sample = masked.orderBy(-rf_no_data_cells('blue_masked')).select(rf_tile('blue_masked'), rf_tile('scl')).first()
display(sample[0])

Blue band masked against selected SCL values

And the original SCL data. The bright yellow is a cloudy region in the original image.

display(sample[1])

SCL tile for above

Transferring Mask

We can now apply the same mask from the blue column to the green column. Note here we have supressed the step of explicitly checking what a “safe” NoData value for the green band should be.

masked.withColumn('green_masked', rf_mask(rf_convert_cell_type('green', masked_blue_ct), 'blue_masked'))  \
      .orderBy(-rf_no_data_cells('blue_masked'))

Showing only top 5 rows.

scl green blue blue_masked green_masked