Reading Raster Data

RasterFrames registers a DataSource named raster that enables reading of GeoTIFFs (and other formats when GDAL is installed) from arbitrary URIs. The raster DataSource operates on either a single raster file location or another DataFrame, called a catalog, containing pointers to many raster file locations.

RasterFrames can also read from GeoTrellis catalogs and layers.

Single Rasters

The simplest way to use the raster reader is with a single raster from a single URI or file. In the examples that follow we’ll be reading from a Sentinel-2 scene stored in an AWS S3 bucket.

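from pyrasterframes.utils import create_rf_spark_session
spark = create_rf_spark_session()  # Spark session with RasterFrames enabled
rf = spark.read.raster('https://rasterframes.s3.amazonaws.com/samples/luray_snp/B02.tif')
rf.printSchema()

root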
 |-- proj_raster_path: string (nullable = false)
 |-- proj_raster: struct (nullable = true)
 |    |-- tile_context: struct (nullable = true)
 |    |    |-- extent: struct (nullable = false)
 |    |    |    |-- xmin: double (nullable = false)
 |    |    |    |-- ymin: double (nullable = false)
 |    |    |    |-- xmax: double (nullable = false)
 |    |    |    |-- ymax: double (nullable = false)
 |    |    |-- crs: struct (nullable = false)
 |    |    |    |-- crsProj4: string (nullable = false)
 |    |-- tile: tile (nullable = false)

The file at the address above is a valid Cloud Optimized GeoTIFF (COG), which RasterFrames fully supports. RasterFrames will take advantage of the optimizations in the COG format to enable more efficient reading compared to non-COG GeoTIFFs.

Let’s unpack the proj_raster column and look at the contents in more detail. It contains a CRS, a spatial extent measured in that CRS, and a two-dimensional array of numeric values called a tile.

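from pyrasterframes.rasterfunctions import rf_crs, rf_extent, rf_tile

# Extract the CRS of the first row and print its PROJ.4 string
crs = rf.select(rf_crs("proj_raster").alias("value")).first()
print("CRS", crs.value.crsProj4)

CRS +proj=utm +zone=17 +datum=WGS84 +units=m +no_defs
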
rf.select(
    rf_extent("proj_raster").alias("extent"),
    rf_tile("proj_raster").alias("tile")
)

Showing only top 5 rows.

extent [xmin, ymin, xmax, ymax]               tile (not shown)
[807480.0, 4207860.0, 809760.0, 4223220.0]
[792120.0, 4269300.0, 807480.0, 4284660.0]
[699960.0, 4223220.0, 715320.0, 4238580.0]
[807480.0, 4190220.0, 809760.0, 4192500.0]
[715320.0, 4223220.0, 730680.0, 4238580.0]
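To make the three parts of proj_raster concrete, here is a minimal pure-Python sketch of a record holding a CRS string, an extent, and a two-dimensional tile. The class names `Extent` and `ProjRaster` and the `cell_size` helper are hypothetical and exist only for illustration; they are not RasterFrames types. The extent and CRS string are copied from the output above, while the 38 x 256 placeholder tile shape is an assumption chosen so that the cells come out square, not a value read from the file.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Extent:
    """Bounding box expressed in the units of the CRS (hypothetical helper)."""
    xmin: float
    ymin: float
    xmax: float
    ymax: float


@dataclass
class ProjRaster:
    """Illustrative stand-in for one proj_raster value (not the real type)."""
    crs_proj4: str
    extent: Extent
    tile: List[List[int]]  # rows of cell values

    def cell_size(self) -> Tuple[float, float]:
        """Ground size of one cell: extent size divided by the cell count."""
        rows = len(self.tile)
        cols = len(self.tile[0])
        width = self.extent.xmax - self.extent.xmin
        height = self.extent.ymax - self.extent.ymin
        return (width / cols, height / rows)


example = ProjRaster(
    crs_proj4="+proj=utm +zone=17 +datum=WGS84 +units=m +no_defs",
    extent=Extent(807480.0, 4207860.0, 809760.0, 4223220.0),
    tile=[[0] * 38 for _ in range(256)],  # assumed shape, placeholder values
)
print(example.cell_size())
```

The point of the sketch is that the extent only has meaning together with the CRS, and that the cell size follows from the extent and the tile dimensions rather than being stored separately.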

The output also shows that the single raster has been broken out into many rows, each containing an arbitrary, non-overlapping region of the source. Doing so takes advantage of parallel in-memory reads from the cloud-hosted data source and allows Spark to work on a manageable amount of data per row. The map below shows downsampled imagery with the bounds of the individual tiles.

Note

The image contains visible “seams” between the tile extents due to reprojection and downsampling used to create the image. The native imagery in the DataFrame does not contain any gaps in the source raster’s coverage.
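The way a single raster is split into rows of non-overlapping tiles can be sketched in plain Python. This is an illustration of regular grid tiling under stated assumptions, not the RasterFrames implementation. The function name `tile_extents` is hypothetical, and the overall extent, the 1830 x 1830 cell count, and the 256 x 256 tile size are all assumptions inferred from the tile extents listed earlier, not values reported by the reader.

```python
import math


def tile_extents(xmin, ymin, xmax, ymax, cols, rows, tile_cols=256, tile_rows=256):
    """Split a raster grid into tile extents, row by row from the top-left.

    Tiles on the right and bottom edges are smaller when the raster size is
    not a multiple of the tile size, so no tile overlaps another and none
    extends past the raster bounds.
    """
    cell_w = (xmax - xmin) / cols
    cell_h = (ymax - ymin) / rows
    extents = []
    for r in range(math.ceil(rows / tile_rows)):
        for c in range(math.ceil(cols / tile_cols)):
            col0, row0 = c * tile_cols, r * tile_rows
            col1 = min(col0 + tile_cols, cols)  # clip at the right edge
            row1 = min(row0 + tile_rows, rows)  # clip at the bottom edge
            # Row 0 is the top of the raster, so y decreases as rows increase.
            extents.append((
                xmin + col0 * cell_w,
                ymax - row1 * cell_h,
                xmin + col1 * cell_w,
                ymax - row0 * cell_h,
            ))
    return extents


# Assumed overall extent and grid size (inferred, not read from the file).
tiles = tile_extents(699960.0, 4190220.0, 809760.0, 4300020.0, cols=1830, rows=1830)
print(len(tiles))
```

Under these assumptions the narrow extents in the listing, such as the one that is only 2280 units wide, correspond to the clipped tiles along the right and bottom edges of the grid.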