Raster Catalogs

While interesting processing can be done on a single raster file, RasterFrames shines when catalogs of raster data are to be processed. In its simplest form, a catalog is a list of URLs referencing raster files. This list can be a Spark DataFrame, Pandas DataFrame, CSV file or CSV string. The catalog is input into the raster DataSource described in the next page, which creates tiles from the rasters at the referenced URLs.

A catalog can have one or two dimensions:

  • One-D: A single column contains raster URLs across the rows, and all referenced rasters represent the same band. For example, a column of URLs to Landsat 8 near-infrared rasters covering Europe. Each row represents a different place and/or time.
  • Two-D: Multiple columns contain raster URLs. Each column contains URLs for a single band, and each row represents the same place and time. For example, red-, green-, and blue-band columns for scenes covering Europe. Each row represents a single scene with the same resolution, extent, CRS, etc. across the row.

Creating a Catalog

This section provides some examples of catalog creation, as well as introducing some experimental catalogs built into RasterFrames. Reading raster data represented by a catalog is covered in more detail in the next page.


A single URL is the simplest form of a catalog.

import pandas as pd

file_uri = "/data/raster/myfile.tif"

# Pandas DF
my_cat = pd.DataFrame({'B01': [file_uri]})

# equivalent Spark DF
from pyspark.sql import Row
my_cat = spark.createDataFrame([Row(B01=file_uri)])

# equivalent CSV string
my_cat = "B01\n{}".format(file_uri)

A single column represents the same content type with different observations along the rows. In this example it is band 1 of MODIS surface reflectance, which is visible red. The location of the images is the same, indicated by the granule identifier h04v09, but the dates differ: 2018185 (July 4, 2018) and 2018188 (July 7, 2018).

scene1_B01 = "https://modis-pds.s3.amazonaws.com/MCD43A4.006/04/09/2018185/MCD43A4.A2018185.h04v09.006.2018194032851_B01.TIF"
scene2_B01 = "https://modis-pds.s3.amazonaws.com/MCD43A4.006/04/09/2018188/MCD43A4.A2018188.h04v09.006.2018198232008_B01.TIF"

# a pandas DF
one_d_cat_pd = pd.DataFrame({'B01': [scene1_B01, scene2_B01]})

# equivalent spark DF
one_d_cat_df = spark.createDataFrame([Row(B01=scene1_B01), Row(B01=scene2_B01)])

# equivalent CSV string
one_d_cat_csv = '\n'.join(['B01', scene1_B01, scene2_B01])

This is what it looks like in DataFrame form:

B01
https://modis-pds.s3.amazonaws.com/MCD43A4.006/04/09/2018185/MCD43A4.A2018185.h04v09.006.2018194032851_B01.TIF
https://modis-pds.s3.amazonaws.com/MCD43A4.006/04/09/2018188/MCD43A4.A2018188.h04v09.006.2018198232008_B01.TIF

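As an aside, the granule and acquisition date discussed above can be parsed out of such URLs with plain Python. The regular expression below is an illustration based on the `.AYYYYDDD.hHHvVV.` pattern visible in the file names, where the date is encoded as a year plus day of year:

```python
import re
from datetime import datetime, timedelta

url = ("https://modis-pds.s3.amazonaws.com/MCD43A4.006/04/09/2018185/"
       "MCD43A4.A2018185.h04v09.006.2018194032851_B01.TIF")

# Acquisition date is encoded as AYYYYDDD (year + day of year);
# the granule as hHHvVV (horizontal/vertical tile indices).
m = re.search(r'\.A(\d{4})(\d{3})\.(h\d{2}v\d{2})\.', url)
year, doy, granule = int(m.group(1)), int(m.group(2)), m.group(3)
acq_date = datetime(year, 1, 1) + timedelta(days=doy - 1)

print(granule, acq_date.date())  # h04v09 2018-07-04
```

This confirms that day 185 of 2018 is July 4, as stated above.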
In this example, multiple columns represent multiple content types (bands) across multiple scenes. In each row, the scene is the same: granule id h04v09 on July 4 or July 7, 2018. The first column is band 1, red, and the second is band 2, near infrared.

scene1_B01 = "https://modis-pds.s3.amazonaws.com/MCD43A4.006/04/09/2018185/MCD43A4.A2018185.h04v09.006.2018194032851_B01.TIF"
scene1_B02 = "https://modis-pds.s3.amazonaws.com/MCD43A4.006/04/09/2018185/MCD43A4.A2018185.h04v09.006.2018194032851_B02.TIF"
scene2_B01 = "https://modis-pds.s3.amazonaws.com/MCD43A4.006/04/09/2018188/MCD43A4.A2018188.h04v09.006.2018198232008_B01.TIF"
scene2_B02 = "https://modis-pds.s3.amazonaws.com/MCD43A4.006/04/09/2018188/MCD43A4.A2018188.h04v09.006.2018198232008_B02.TIF"

# Pandas DF
two_d_cat_pd = pd.DataFrame([
    {'B01': scene1_B01, 'B02': scene1_B02},
    {'B01': scene2_B01, 'B02': scene2_B02}
])

# equivalent Spark DF
two_d_cat_df = spark.createDataFrame([
    Row(B01=scene1_B01, B02=scene1_B02),
    Row(B01=scene2_B01, B02=scene2_B02)
])

# equivalent CSV string
two_d_cat_csv = '\n'.join(['B01,B02', scene1_B01 + "," + scene1_B02, scene2_B01 + "," + scene2_B02])

This is what it looks like in DataFrame form:

B01 B02
https://modis-pds.s3.amazonaws.com/MCD43A4.006/04/09/2018185/MCD43A4.A2018185.h04v09.006.2018194032851_B01.TIF https://modis-pds.s3.amazonaws.com/MCD43A4.006/04/09/2018185/MCD43A4.A2018185.h04v09.006.2018194032851_B02.TIF
https://modis-pds.s3.amazonaws.com/MCD43A4.006/04/09/2018188/MCD43A4.A2018188.h04v09.006.2018198232008_B01.TIF https://modis-pds.s3.amazonaws.com/MCD43A4.006/04/09/2018188/MCD43A4.A2018188.h04v09.006.2018198232008_B02.TIF
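The three forms above describe the same catalog. As a quick sanity check, the CSV string form can be parsed back into a pandas DataFrame to confirm it round-trips to the same two-row, two-column shape:

```python
import io
import pandas as pd

scene1_B01 = "https://modis-pds.s3.amazonaws.com/MCD43A4.006/04/09/2018185/MCD43A4.A2018185.h04v09.006.2018194032851_B01.TIF"
scene1_B02 = "https://modis-pds.s3.amazonaws.com/MCD43A4.006/04/09/2018185/MCD43A4.A2018185.h04v09.006.2018194032851_B02.TIF"
scene2_B01 = "https://modis-pds.s3.amazonaws.com/MCD43A4.006/04/09/2018188/MCD43A4.A2018188.h04v09.006.2018198232008_B01.TIF"
scene2_B02 = "https://modis-pds.s3.amazonaws.com/MCD43A4.006/04/09/2018188/MCD43A4.A2018188.h04v09.006.2018198232008_B02.TIF"

two_d_cat_csv = '\n'.join(['B01,B02',
                           scene1_B01 + "," + scene1_B02,
                           scene2_B01 + "," + scene2_B02])

# Parsing the CSV string back yields the two-row, two-column catalog.
parsed = pd.read_csv(io.StringIO(two_d_cat_csv))
print(parsed.shape)  # (2, 2)
```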

Using External Catalogs

The concept of a catalog is much more powerful when we consider examples beyond constructing the DataFrame, and instead read the data from an external source. Here’s an extended example of reading a cloud-hosted CSV file containing MODIS scene metadata and transforming it into a catalog. The metadata describing the content of each URL is an important aspect of processing raster data.

from pyspark import SparkFiles
from pyspark.sql import functions as F


# First distribute a scene-list CSV to the cluster. The file name below is
# illustrative; substitute the scene list you wish to read.
scene_list_file = "scene_list.csv"
spark.sparkContext.addFile("https://modis-pds.s3.amazonaws.com/MCD43A4.006/" + scene_list_file)

scene_list = spark.read \
    .format("csv") \
    .option("header", "true") \
    .load(SparkFiles.get(scene_list_file))
scene_list.show(5)

Showing only top 5 rows.

date download_url gid
2018-07-04 00:00:00 https://modis-pds.s3.amazonaws.com/MCD43A4.006/04/09/2018185/index.html MCD43A4.A2018185.h04v09.006.2018194032851
2018-07-04 00:00:00 https://modis-pds.s3.amazonaws.com/MCD43A4.006/01/09/2018185/index.html MCD43A4.A2018185.h01v09.006.2018194032819
2018-07-04 00:00:00 https://modis-pds.s3.amazonaws.com/MCD43A4.006/06/03/2018185/index.html MCD43A4.A2018185.h06v03.006.2018194032807
2018-07-04 00:00:00 https://modis-pds.s3.amazonaws.com/MCD43A4.006/03/09/2018185/index.html MCD43A4.A2018185.h03v09.006.2018194032826
2018-07-04 00:00:00 https://modis-pds.s3.amazonaws.com/MCD43A4.006/08/09/2018185/index.html MCD43A4.A2018185.h08v09.006.2018194032839

Observe that the scene list file has URIs to index.html files in the download_url column. The image URIs are in the same directory. The filenames are of the form ${gid}_B${band}.TIF. The next code chunk builds these URIs, which completes our catalog.

modis_catalog = scene_list \
    .withColumn('base_url',
        F.concat(F.regexp_replace('download_url', 'index.html$', ''), 'gid')
    ) \
    .withColumn('B01', F.concat('base_url', F.lit("_B01.TIF"))) \
    .withColumn('B02', F.concat('base_url', F.lit("_B02.TIF"))) \
    .withColumn('B03', F.concat('base_url', F.lit("_B03.TIF")))
modis_catalog.show(5)

Showing only top 5 rows.

date download_url gid base_url B01 B02 B03
2018-07-04 00:00:00 https://modis-pds.s3.amazonaws.com/MCD43A4.006/04/09/2018185/index.html MCD43A4.A2018185.h04v09.006.2018194032851 https://modis-pds.s3.amazonaws.com/MCD43A4.006/04/09/2018185/MCD43A4.A2018185.h04v09.006.2018194032851 https://modis-pds.s3.amazonaws.com/MCD43A4.006/04/09/2018185/MCD43A4.A2018185.h04v09.006.2018194032851_B01.TIF https://modis-pds.s3.amazonaws.com/MCD43A4.006/04/09/2018185/MCD43A4.A2018185.h04v09.006.2018194032851_B02.TIF https://modis-pds.s3.amazonaws.com/MCD43A4.006/04/09/2018185/MCD43A4.A2018185.h04v09.006.2018194032851_B03.TIF
2018-07-04 00:00:00 https://modis-pds.s3.amazonaws.com/MCD43A4.006/01/09/2018185/index.html MCD43A4.A2018185.h01v09.006.2018194032819 https://modis-pds.s3.amazonaws.com/MCD43A4.006/01/09/2018185/MCD43A4.A2018185.h01v09.006.2018194032819 https://modis-pds.s3.amazonaws.com/MCD43A4.006/01/09/2018185/MCD43A4.A2018185.h01v09.006.2018194032819_B01.TIF https://modis-pds.s3.amazonaws.com/MCD43A4.006/01/09/2018185/MCD43A4.A2018185.h01v09.006.2018194032819_B02.TIF https://modis-pds.s3.amazonaws.com/MCD43A4.006/01/09/2018185/MCD43A4.A2018185.h01v09.006.2018194032819_B03.TIF
2018-07-04 00:00:00 https://modis-pds.s3.amazonaws.com/MCD43A4.006/06/03/2018185/index.html MCD43A4.A2018185.h06v03.006.2018194032807 https://modis-pds.s3.amazonaws.com/MCD43A4.006/06/03/2018185/MCD43A4.A2018185.h06v03.006.2018194032807 https://modis-pds.s3.amazonaws.com/MCD43A4.006/06/03/2018185/MCD43A4.A2018185.h06v03.006.2018194032807_B01.TIF https://modis-pds.s3.amazonaws.com/MCD43A4.006/06/03/2018185/MCD43A4.A2018185.h06v03.006.2018194032807_B02.TIF https://modis-pds.s3.amazonaws.com/MCD43A4.006/06/03/2018185/MCD43A4.A2018185.h06v03.006.2018194032807_B03.TIF
2018-07-04 00:00:00 https://modis-pds.s3.amazonaws.com/MCD43A4.006/03/09/2018185/index.html MCD43A4.A2018185.h03v09.006.2018194032826 https://modis-pds.s3.amazonaws.com/MCD43A4.006/03/09/2018185/MCD43A4.A2018185.h03v09.006.2018194032826 https://modis-pds.s3.amazonaws.com/MCD43A4.006/03/09/2018185/MCD43A4.A2018185.h03v09.006.2018194032826_B01.TIF https://modis-pds.s3.amazonaws.com/MCD43A4.006/03/09/2018185/MCD43A4.A2018185.h03v09.006.2018194032826_B02.TIF https://modis-pds.s3.amazonaws.com/MCD43A4.006/03/09/2018185/MCD43A4.A2018185.h03v09.006.2018194032826_B03.TIF
2018-07-04 00:00:00 https://modis-pds.s3.amazonaws.com/MCD43A4.006/08/09/2018185/index.html MCD43A4.A2018185.h08v09.006.2018194032839 https://modis-pds.s3.amazonaws.com/MCD43A4.006/08/09/2018185/MCD43A4.A2018185.h08v09.006.2018194032839 https://modis-pds.s3.amazonaws.com/MCD43A4.006/08/09/2018185/MCD43A4.A2018185.h08v09.006.2018194032839_B01.TIF https://modis-pds.s3.amazonaws.com/MCD43A4.006/08/09/2018185/MCD43A4.A2018185.h08v09.006.2018194032839_B02.TIF https://modis-pds.s3.amazonaws.com/MCD43A4.006/08/09/2018185/MCD43A4.A2018185.h08v09.006.2018194032839_B03.TIF
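To see why this works, the same substitution can be reproduced for a single row in plain Python, mirroring the regexp_replace and concat calls above outside of Spark:

```python
import re

download_url = "https://modis-pds.s3.amazonaws.com/MCD43A4.006/04/09/2018185/index.html"
gid = "MCD43A4.A2018185.h04v09.006.2018194032851"

# Drop the trailing index.html, then append the granule id and band suffix.
base_url = re.sub(r'index\.html$', '', download_url) + gid
b01 = base_url + "_B01.TIF"

print(b01)
```

The result matches the B01 value shown in the first row of the table above.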

Using Built-in Catalogs

RasterFrames comes with two experimental catalogs over the AWS PDS Landsat 8 and MODIS repositories. They are created by downloading the latest scene lists and building up the appropriate band URI columns as in the prior example.

Note: The first time you run these may take some time, as the catalogs are large and have to be downloaded. However, they are cached and subsequent invocations should be faster.


MODIS

modis_catalog = spark.read.format('aws-pds-modis-catalog').load()
modis_catalog.printSchema()
 |-- product_id: string (nullable = false)
 |-- acquisition_date: timestamp (nullable = false)
 |-- granule_id: string (nullable = false)
 |-- gid: string (nullable = false)
 |-- B01: string (nullable = true)
 |-- B01qa: string (nullable = true)
 |-- B02: string (nullable = true)
 |-- B02qa: string (nullable = true)
 |-- B03: string (nullable = true)
 |-- B03qa: string (nullable = true)
 |-- B04: string (nullable = true)
 |-- B04qa: string (nullable = true)
 |-- B05: string (nullable = true)
 |-- B05qa: string (nullable = true)
 |-- B06: string (nullable = true)
 |-- B06qa: string (nullable = true)
 |-- B07: string (nullable = true)
 |-- B07qa: string (nullable = true)

Landsat 8

The Landsat 8 catalog includes a richer set of metadata describing the contents of each scene.

l8 = spark.read.format('aws-pds-l8-catalog').load()
l8.printSchema()
 |-- product_id: string (nullable = false)
 |-- entity_id: string (nullable = false)
 |-- acquisition_date: timestamp (nullable = false)
 |-- cloud_cover_pct: float (nullable = false)
 |-- processing_level: string (nullable = false)
 |-- path: short (nullable = false)
 |-- row: short (nullable = false)
 |-- bounds_wgs84: struct (nullable = false)
 |    |-- minX: double (nullable = false)
 |    |-- maxX: double (nullable = false)
 |    |-- minY: double (nullable = false)
 |    |-- maxY: double (nullable = false)
 |-- B1: string (nullable = true)
 |-- B2: string (nullable = true)
 |-- B3: string (nullable = true)
 |-- B4: string (nullable = true)
 |-- B5: string (nullable = true)
 |-- B6: string (nullable = true)
 |-- B7: string (nullable = true)
 |-- B8: string (nullable = true)
 |-- B9: string (nullable = true)
 |-- B10: string (nullable = true)
 |-- B11: string (nullable = true)
 |-- BQA: string (nullable = true)
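Columns like cloud_cover_pct and bounds_wgs84 make scene selection cheap before any raster I/O. Here is a minimal sketch using pandas with hypothetical values; in the real catalog, which is a Spark DataFrame, the equivalent is a filter on the cloud_cover_pct column:

```python
import pandas as pd

# Hypothetical rows mimicking a few Landsat 8 catalog columns; the real
# catalog comes from spark.read.format('aws-pds-l8-catalog').load().
l8_pd = pd.DataFrame({
    'product_id': ['scene_1', 'scene_2'],          # placeholder ids
    'cloud_cover_pct': [12.5, 67.0],
    'B4': ['s3://bucket/scene1_B4.TIF', 's3://bucket/scene2_B4.TIF'],
})

# Select relatively cloud-free scenes before doing any raster I/O.
clear = l8_pd[l8_pd.cloud_cover_pct < 20.0]
print(len(clear))  # 1
```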