Writing Raster Data

RasterFrames is oriented toward large-scale analyses of spatial data. The primary output of such an analysis could be a statistical summary, a machine learning model, or some other result that is generally much smaller than the input dataset.

However, there are times in any analysis where writing a representative sample of the work in progress provides valuable feedback on the current state of the process and results.

Tile Samples

RasterFrames has convenience methods for quickly visualizing tiles (see the discussion of the RasterFrame schema for an orientation to the concept) when inspecting a subset of the data in a notebook.

In an IPython or Jupyter interpreter, a Tile object will be displayed as an image with limited metadata.

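import pyrasterframes
from pyrasterframes.utils import create_rf_spark_session
from pyrasterframes.rasterfunctions import *

# Spark session with the RasterFrames extensions enabled
spark = create_rf_spark_session()

def scene(band):
    b = str(band).zfill(2) # converts int 2 to '02'
    return 'https://modis-pds.s3.amazonaws.com/MCD43A4.006/11/08/2019059/' \
             'MCD43A4.A2019059.h11v08.006.2019072203257_B{}.TIF'.format(b)
spark_df = spark.read.raster(scene(2), tile_dimensions=(128, 128))
tile = spark_df.select(rf_tile('proj_raster').alias('tile')).first()['tile']
tile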
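
The URL helper can be checked on its own, without a Spark session, before handing the result to the reader. The following is a minimal sketch; the name scene_url is a hypothetical variant of the scene function above, rewritten with an f-string, and is not part of RasterFrames.

```python
def scene_url(band: int) -> str:
    """Build the URL for one band of the MODIS scene used in this section.

    Hypothetical stand-alone variant of the `scene` helper.
    """
    # zfill(2) left-pads the band number with zeros: 2 -> '02', 11 -> '11'
    b = str(band).zfill(2)
    return (
        'https://modis-pds.s3.amazonaws.com/MCD43A4.006/11/08/2019059/'
        f'MCD43A4.A2019059.h11v08.006.2019072203257_B{b}.TIF'
    )


# Print the URL that would be passed to the raster reader for band 2
print(scene_url(2))
```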

DataFrame Samples

Within an IPython or Jupyter interpreter, Spark and Pandas DataFrames containing a column of tiles will be rendered with the tile samples discussed above. Simply import the rf_ipython submodule to enable enhanced HTML rendering of these DataFrame types.

import pyrasterframes.rf_ipython

samples = spark_df \
    .select(
        rf_extent('proj_raster').alias('extent'),
        rf_tile('proj_raster').alias('tile'),
    )\
    .select('extent.*', 'tile') \
    .limit(3)
samples
xmin                ymin                xmax                ymax            tile
-7783653.637667     1052646.4919514267  -7724349.609951427  1111950.519667  (image)
-7724349.609951427  1052646.4919514267  -7665045.582235854  1111950.519667  (image)
-7665045.582235853  1052646.4919514267  -7605741.55452028   1111950.519667  (image)

GeoTIFFs

GeoTIFF is one of the most common file formats for spatial data, providing flexibility in data encoding, representation, and storage. RasterFrames provides a specialized Spark DataFrame writer for rendering a RasterFrame to a GeoTIFF.

One downside to GeoTIFF is that it is not a format designed for big data. To create a GeoTIFF, all the data to be encoded has to be in the memory of one computer (in Spark parlance, this is a “collect”), limiting its maximum size substantially compared to what a full cluster environment can hold. When rendering GeoTIFFs in RasterFrames, you must either specify the dimensions of the output raster, or deliberately limit the size of the collected data.
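
To get a feel for what "fits in the memory of one computer" means, the uncompressed size of a raster can be estimated from its dimensions. This is only a back-of-the-envelope sketch: the helper raster_bytes is hypothetical, and both the 2 bytes per cell and the 2400 x 2400 full-scene size are assumptions made for illustration, not values read from the data.

```python
def raster_bytes(cols: int, rows: int, bands: int = 1, bytes_per_cell: int = 2) -> int:
    """Uncompressed size, in bytes, of a raster held entirely in memory.

    Hypothetical helper; bytes_per_cell=2 assumes a 16-bit cell type.
    """
    return cols * rows * bands * bytes_per_cell


# A small single-band overview
overview = raster_bytes(256, 256)

# An assumed full-resolution scene of 2400 x 2400 cells
full_scene = raster_bytes(2400, 2400)

print(f"overview:   {overview / 1024:.0f} KiB")
print(f"full scene: {full_scene / 1024 ** 2:.1f} MiB")
```

A single scene is small either way; the estimate matters when a DataFrame spans many scenes or many bands, since the total grows linearly with each.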

Fortunately, we can use the cluster computing capability to downsample the data to a more manageable size. For the sake of example, let’s render an overview of the scene band loaded above as a small raster, reprojecting it to latitude and longitude coordinates on the WGS84 reference ellipsoid (aka EPSG:4326).

import os

outfile = os.path.join('/tmp', 'geotiff-overview.tif')
spark_df.write.geotiff(outfile, crs='EPSG:4326', raster_dimensions=(256, 256))
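
The 256 x 256 output above is square regardless of the shape of the area it covers. If the aspect ratio matters, one option is to derive the output dimensions from the extent. The sketch below uses a hypothetical helper, fit_dimensions, that is not part of RasterFrames; it assumes the extent is expressed in the target CRS and that raster_dimensions is ordered as (columns, rows).

```python
def fit_dimensions(xmin, ymin, xmax, ymax, max_side=256):
    """Return (cols, rows) whose longer side is max_side cells and whose
    shape follows the aspect ratio of the given extent.

    Hypothetical helper for illustration only.
    """
    width = xmax - xmin
    height = ymax - ymin
    if width <= 0 or height <= 0:
        raise ValueError("extent must have positive width and height")
    # Scale so that the longer side of the extent maps to max_side cells
    scale = max_side / max(width, height)
    cols = max(1, round(width * scale))
    rows = max(1, round(height * scale))
    return cols, rows


# An extent twice as wide as it is tall yields a 2:1 raster
print(fit_dimensions(0.0, 0.0, 2.0, 1.0))
```

The resulting tuple could then be passed as the raster_dimensions argument in place of the fixed (256, 256).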

We can view the written file with rasterio:

import rasterio
from rasterio.plot import show, show_hist

with rasterio.open(outfile) as src:
    # View raster
    show(src, adjust='linear')
    # View data distribution
    show_hist(src, bins=50, lw=0.0, stacked=False, alpha=0.6,
        histtype='stepfilled', title="Overview Histogram")

If there are multiple tile or projected raster columns in the DataFrame, the GeoTIFF writer will write each one as a separate band in the file. Each band in the output will be tagged with its input column name for reference.

GeoTrellis Layers

GeoTrellis is one of the key libraries upon which RasterFrames is built. It provides a Scala language API for working with geospatial raster data, and it defines a tile layer storage format for persisting imagery mosaics. RasterFrames can write data from a RasterFrameLayer into a GeoTrellis layer through its geotrellis DataSource, which supports both reading and writing GeoTrellis layers.

An example is forthcoming. In the meantime, the GeoTrellisDataSourceSpec test code may be a helpful reference.

Parquet

You can write a RasterFrame to the Apache Parquet format. This format is designed to efficiently persist and query columnar data in a distributed file system, such as HDFS. It also provides benefits when working in single-node (or “local”) mode, such as organizing the data to suit defined query patterns.

spark_df.withColumn('exp', rf_expm1('proj_raster')) \
    .write.mode('append').parquet('hdfs:///rf-user/sample.pq')