Raster Join


A common operation for raster data is reprojecting or warping the data to a different CRS with a specific transform. In many use cases, the particulars of the warp operation depend on another set of raster data. Furthermore, the warp is done to put both sets of raster data to a common set of grid to enable manipulation of the datasets together.

In RasterFrames, you can perform a Raster Join on two DataFrames containing raster data. The operation will perform a spatial join based on the CRS and extent data in each DataFrame. By default it is a left join and uses an intersection operator. For each candidate row, all tile columns on the right hand side are warped to match the left hand side’s CRS, extent, and dimensions. Warping relies on GeoTrellis library code and uses nearest neighbor resampling method. The operation is also an aggregate, with multiple intersecting right-hand side tiles merged into the result. There is no guarantee about the ordering of tiles used to select cell values in the case of overlapping tiles. When using the raster DataSource you will automatically get the CRS and extent information needed to do this operation.

Example Code

Because the raster join is a distributed spatial join, indexing of both DataFrames using the spatial index is crucial for performance.

# Southern Mozambique December 29, 2016
modis = spark.read.raster('s3://astraea-opendata/MCD43A4.006/21/11/2016297/MCD43A4.A2016297.h21v11.006.2016306075821_B01.TIF',
                          spatial_index_partitions=True) \
                  .withColumnRenamed('proj_raster', 'modis')

landsat8 = spark.read.raster('https://landsat-pds.s3.us-west-2.amazonaws.com/c1/L8/167/077/LC08_L1TP_167077_20161015_20170319_01_T1/LC08_L1TP_167077_20161015_20170319_01_T1_B4.TIF',
                             spatial_index_partitions=True) \
                  .withColumnRenamed('proj_raster', 'landsat')

rj = landsat8.raster_join(modis)

# Show some non-empty tiles
rj.select('landsat', 'modis', 'crs', 'extent') \
  .filter(rf_data_cells('modis') > 0) \
  .filter(rf_tile_max('landsat') > 0)

Showing only top 5 rows.

landsat modis crs extent
[+proj=utm +zone=36 +datum=WGS84 +units=m +no_defs ] [525945.0, -2813925.0, 533625.0, -2806245.0]
[+proj=utm +zone=36 +datum=WGS84 +units=m +no_defs ] [495225.0, -2652645.0, 502905.0, -2644965.0]
[+proj=utm +zone=36 +datum=WGS84 +units=m +no_defs ] [641145.0, -2675685.0, 648825.0, -2668005.0]
[+proj=utm +zone=36 +datum=WGS84 +units=m +no_defs ] [525945.0, -2798565.0, 533625.0, -2790885.0]
[+proj=utm +zone=36 +datum=WGS84 +units=m +no_defs ] [541305.0, -2798565.0, 548985.0, -2790885.0]

Additional Options

The following optional arguments are allowed:

  • left_extent - the column on the left-hand DataFrame giving the extent of the tile columns
  • left_crs - the column on the left-hand DataFrame giving the CRS of the tile columns
  • right_extent - the column on the right-hand DataFrame giving the extent of the tile columns
  • right_crs - the column on the right-hand DataFrame giving the CRS of the tile columns
  • join_exprs - a single column expression as would be used in the on parameter of join

Note that the join_exprs will override the join behavior described above. By default the expression is equivalent to:

    st_reproject(st_geometry(right[right_extent]), right[right_crs], left[left_crs])