Reading Raster Data

RasterFrames registers a DataSource named raster that enables reading of GeoTIFFs (and other formats when GDAL is installed) from arbitrary URIs. The raster DataSource operates on either a single raster file location or another DataFrame, called a catalog, containing pointers to many raster file locations.

RasterFrames can also read from GeoTrellis catalogs and layers.

Single Rasters

The simplest way to use the raster reader is with a single raster from a single URI or file. In the examples that follow we’ll be reading from a Sentinel-2 scene stored in an AWS S3 bucket.

rf = spark.read.raster('https://s22s-test-geotiffs.s3.amazonaws.com/luray_snp/B02.tif')
rf.printSchema()
root
 |-- proj_raster_path: string (nullable = false)
 |-- proj_raster: struct (nullable = true)
 |    |-- tile_context: struct (nullable = true)
 |    |    |-- extent: struct (nullable = false)
 |    |    |    |-- xmin: double (nullable = false)
 |    |    |    |-- ymin: double (nullable = false)
 |    |    |    |-- xmax: double (nullable = false)
 |    |    |    |-- ymax: double (nullable = false)
 |    |    |-- crs: struct (nullable = false)
 |    |    |    |-- crsProj4: string (nullable = false)
 |    |-- tile: tile (nullable = false)

The file at the address above is a valid Cloud Optimized GeoTIFF (COG), which RasterFrames fully supports. RasterFrames will take advantage of the optimizations in the COG format to enable more efficient reading compared to non-COG GeoTIFFs.

Let’s unpack the proj_raster column and look at the contents in more detail. It contains a CRS, a spatial extent measured in that CRS, and a two-dimensional array of numeric values called a tile.

crs = rf.select(rf_crs("proj_raster").alias("value")).first()
print("CRS", crs.value.crsProj4)
CRS +proj=utm +zone=17 +datum=WGS84 +units=m +no_defs
rf.select(
    rf_extent("proj_raster").alias("extent"),
    rf_tile("proj_raster").alias("tile")
)

Showing only top 5 rows.

extent tile
[699960.0, 4284660.0, 715320.0, 4300020.0]
[715320.0, 4284660.0, 730680.0, 4300020.0]
[730680.0, 4284660.0, 746040.0, 4300020.0]
[746040.0, 4284660.0, 761400.0, 4300020.0]
[761400.0, 4284660.0, 776760.0, 4300020.0]

You can also see that the single raster has been broken out into many arbitrary non-overlapping regions. Doing so takes advantage of parallel in-memory reads from the cloud hosted data source and allows Spark to work on manageable amounts of data per task. The following code fragment shows us how many subtiles were created from a single source image.

counts = rf.groupby(rf.proj_raster_path).count()
counts
proj_raster_path count
https://s22s-test-geotiffs.s3.amazonaws.com/luray_snp/B02.tif 64

Let’s select a single tile and view it. The tile preview image as well as the string representation provide some basic information about the tile: its dimensions as numbers of columns and rows and the cell type, or data type of all the cells in the tile. For more about cell types, refer to this discussion.

tile = rf.select(rf_tile("proj_raster")).first()[0]
display(tile)

Multiple Singleband Rasters

In this example, we show the reading two bands of Landsat 8 imagery (red and near-infrared), combining them with rf_normalized_difference to compute NDVI, a common measure of vegetation health. As described in the section on catalogs, image URIs in a single row are assumed to be from the same scene/granule, and therefore compatible. This pattern is commonly used when multiple bands are stored in separate files.

bands = [f'B{b}' for b in [4, 5]]
uris = [f'https://landsat-pds.s3.us-west-2.amazonaws.com/c1/L8/014/032/LC08_L1TP_014032_20190720_20190731_01_T1/LC08_L1TP_014032_20190720_20190731_01_T1_{b}.TIF' for b in bands]
catalog = ','.join(bands) + '\n' + ','.join(uris)

rf = (spark.read.raster(catalog, bands) 
    # Adding semantic names 
    .withColumnRenamed('B4', 'red').withColumnRenamed('B5', 'NIR')  
    # Adding tile center point for reference 
    .withColumn('longitude_latitude', st_reproject(st_centroid(rf_geometry('red')), rf_crs('red'), lit('EPSG:4326')))
    # Compute NDVI  
    .withColumn('NDVI', rf_normalized_difference('NIR', 'red'))
    # For the purposes of inspection, filter out rows where there's not much vegetation  
    .where(rf_tile_sum('NDVI') > 10000) 
    # Order output 
    .select('longitude_latitude', 'red', 'NIR', 'NDVI')) 
display(rf)

Showing only top 5 rows.

longitude_latitude red NIR NDVI
POINT (-75.64310549921628 41.35507991091…
POINT (-75.55129747458508 41.35555632722…
POINT (-75.64242580157753 41.28590485893…
POINT (-75.55071479581207 41.28638012434…
POINT (-75.45900176161878 41.28678233172…