Raster Catalogs
While interesting processing can be done on a single raster file, RasterFrames shines when processing catalogs of raster data. In its simplest form, a catalog is a list of URLs referencing raster files. This list can be a Spark DataFrame, a Pandas DataFrame, a CSV file, or a CSV string. The catalog is the input to the raster DataSource described on the next page, which creates tiles from the rasters at the referenced URLs.
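As a quick preview (the reader is covered in detail on the next page), a catalog is handed to the raster DataSource roughly like this; `catalog` stands in for any of the forms described below:

# Preview sketch: read tiles from a catalog. Assumes a SparkSession
# initialized with pyrasterframes; `catalog` is any of the forms below.
rf = spark.read.raster(catalog, catalog_col_names=['B01'])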
A catalog can have one or two dimensions:
- One-D: A single column contains raster URLs, one per row. All referenced rasters represent the same band. For example, a column of URLs to Landsat 8 near-infrared rasters covering Europe. Each row represents a different place and time.
- Two-D: Multiple columns contain raster URLs. Within a column, every URL references the same band; within a row, every URL references the same place and time. For example, red-, green-, and blue-band columns for scenes covering Europe. Each row represents a single scene, with the same resolution, extent, CRS, etc. across the row.
Creating a Catalog
This section provides some examples of catalog creation, and introduces some experimental catalogs built into RasterFrames. Reading the raster data referenced by a catalog is covered in more detail on the next page.
One-D
A single URL is the simplest form of a catalog.
import pandas as pd
from pyspark.sql import Row

file_uri = "/data/raster/myfile.tif"
# Pandas DF
my_cat = pd.DataFrame({'B01': [file_uri]})
# equivalent Spark DF
my_cat = spark.createDataFrame([Row(B01=file_uri)])
# equivalent CSV string
my_cat = "B01\n{}".format(file_uri)
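A quick sanity check of the Spark DataFrame form shows the expected single string column (output reproduced as comments):

my_cat.printSchema()
# root
#  |-- B01: string (nullable = true)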
A single column represents the same content type, with different observations along the rows. In this example it is band 1 of MODIS surface reflectance, which is visible red. The location of the images is the same, indicated by the granule identifier h04v09, but the dates differ: 2018185 (July 4, 2018) and 2018188 (July 7, 2018).
scene1_B01 = "https://modis-pds.s3.amazonaws.com/MCD43A4.006/04/09/2018185/MCD43A4.A2018185.h04v09.006.2018194032851_B01.TIF"
scene2_B01 = "https://modis-pds.s3.amazonaws.com/MCD43A4.006/04/09/2018188/MCD43A4.A2018188.h04v09.006.2018198232008_B01.TIF"
# Pandas DF
one_d_cat_pd = pd.DataFrame({'B01': [scene1_B01, scene2_B01]})
# equivalent Spark DF
one_d_cat_df = spark.createDataFrame([Row(B01=scene1_B01), Row(B01=scene2_B01)])
# equivalent CSV string
one_d_cat_csv = '\n'.join(['B01', scene1_B01, scene2_B01])
This is what it looks like in DataFrame form:
one_d_cat_df
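B01 |
---|
https://modis-pds.s3.amazonaws.com/MCD43A4.006/04/09/2018185/MCD43A4.A2018185.h04v09.006.2018194032851_B01.TIF |
https://modis-pds.s3.amazonaws.com/MCD43A4.006/04/09/2018188/MCD43A4.A2018188.h04v09.006.2018198232008_B01.TIF |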
Two-D
In this example, multiple columns represent multiple content types (bands) across multiple scenes. Within each row the scene is the same: granule id h04v09 on July 4 or July 7, 2018. The first column is band 1, red, and the second is band 2, near infrared.
scene1_B01 = "https://modis-pds.s3.amazonaws.com/MCD43A4.006/04/09/2018185/MCD43A4.A2018185.h04v09.006.2018194032851_B01.TIF"
scene1_B02 = "https://modis-pds.s3.amazonaws.com/MCD43A4.006/04/09/2018185/MCD43A4.A2018185.h04v09.006.2018194032851_B02.TIF"
scene2_B01 = "https://modis-pds.s3.amazonaws.com/MCD43A4.006/04/09/2018188/MCD43A4.A2018188.h04v09.006.2018198232008_B01.TIF"
scene2_B02 = "https://modis-pds.s3.amazonaws.com/MCD43A4.006/04/09/2018188/MCD43A4.A2018188.h04v09.006.2018198232008_B02.TIF"
# Pandas DF
two_d_cat_pd = pd.DataFrame([
    {'B01': scene1_B01, 'B02': scene1_B02},
    {'B01': scene2_B01, 'B02': scene2_B02}
])
# equivalent Spark DF
two_d_cat_df = spark.createDataFrame([
Row(B01=scene1_B01, B02=scene1_B02),
Row(B01=scene2_B01, B02=scene2_B02)
])
# equivalent CSV string
two_d_cat_csv = '\n'.join(['B01,B02', scene1_B01 + "," + scene1_B02, scene2_B01 + "," + scene2_B02])
This is what it looks like in DataFrame form:
two_d_cat_df
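B01 | B02 |
---|---|
https://modis-pds.s3.amazonaws.com/MCD43A4.006/04/09/2018185/MCD43A4.A2018185.h04v09.006.2018194032851_B01.TIF | https://modis-pds.s3.amazonaws.com/MCD43A4.006/04/09/2018185/MCD43A4.A2018185.h04v09.006.2018194032851_B02.TIF |
https://modis-pds.s3.amazonaws.com/MCD43A4.006/04/09/2018188/MCD43A4.A2018188.h04v09.006.2018198232008_B01.TIF | https://modis-pds.s3.amazonaws.com/MCD43A4.006/04/09/2018188/MCD43A4.A2018188.h04v09.006.2018198232008_B02.TIF |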
Using External Catalogs
The concept of a catalog is much more powerful when we move beyond constructing the DataFrame by hand and instead read the catalog from an external source. Here's an extended example of reading a cloud-hosted CSV file containing MODIS scene metadata and transforming it into a catalog. The metadata describing the content of each URL is an important aspect of processing raster data.
from pyspark import SparkFiles
from pyspark.sql import functions as F
spark.sparkContext.addFile("https://modis-pds.s3.amazonaws.com/MCD43A4.006/2018-07-04_scenes.txt")
scene_list = spark.read \
    .format("csv") \
    .option("header", "true") \
    .load(SparkFiles.get("2018-07-04_scenes.txt"))
scene_list
Showing only top 5 rows.
date | download_url | gid |
---|---|---|
2018-07-04 00:00:00 | https://modis-pds.s3.amazonaws.com/MCD43A4.006/04/09/2018185/index.html | MCD43A4.A2018185.h04v09.006.2018194032851 |
2018-07-04 00:00:00 | https://modis-pds.s3.amazonaws.com/MCD43A4.006/01/09/2018185/index.html | MCD43A4.A2018185.h01v09.006.2018194032819 |
2018-07-04 00:00:00 | https://modis-pds.s3.amazonaws.com/MCD43A4.006/06/03/2018185/index.html | MCD43A4.A2018185.h06v03.006.2018194032807 |
2018-07-04 00:00:00 | https://modis-pds.s3.amazonaws.com/MCD43A4.006/03/09/2018185/index.html | MCD43A4.A2018185.h03v09.006.2018194032826 |
2018-07-04 00:00:00 | https://modis-pds.s3.amazonaws.com/MCD43A4.006/08/09/2018185/index.html | MCD43A4.A2018185.h08v09.006.2018194032839 |
Observe that the scene list file has URIs to index.html files in the download_url column. The image URIs are in the same directory, with filenames of the form ${gid}_B${band}.TIF. The next code chunk builds these URIs, which completes our catalog.
modis_catalog = scene_list \
    .withColumn('base_url',
                F.concat(F.regexp_replace('download_url', 'index.html$', ''), 'gid')) \
    .withColumn('B01', F.concat('base_url', F.lit("_B01.TIF"))) \
    .withColumn('B02', F.concat('base_url', F.lit("_B02.TIF"))) \
    .withColumn('B03', F.concat('base_url', F.lit("_B03.TIF")))
modis_catalog
Showing only top 5 rows.
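Because the catalog is an ordinary DataFrame, it can be filtered before reading. A minimal sketch, assuming the spark.read.raster reader described on the next page and the h04v09 granule used earlier:

# Hypothetical usage: narrow the catalog to one granule, then read tiles.
h04v09_cat = modis_catalog.where(F.col('gid').contains('h04v09'))
rf = spark.read.raster(h04v09_cat, catalog_col_names=['B01', 'B02', 'B03'])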
Using Built-in Catalogs
RasterFrames comes with two experimental catalogs over the AWS PDS Landsat 8 and MODIS repositories. They are created by downloading the latest scene lists and building up the appropriate band URI columns, as in the prior example.
Note: The first time you run these, it may take some time, because the catalogs are large and have to be downloaded. They are cached, however, so subsequent invocations should be faster.
MODIS
modis_catalog = spark.read.format('aws-pds-modis-catalog').load()
modis_catalog.printSchema()
root
|-- product_id: string (nullable = false)
|-- acquisition_date: timestamp (nullable = false)
|-- granule_id: string (nullable = false)
|-- gid: string (nullable = false)
|-- B01: string (nullable = true)
|-- B01qa: string (nullable = true)
|-- B02: string (nullable = true)
|-- B02qa: string (nullable = true)
|-- B03: string (nullable = true)
|-- B03aq: string (nullable = true)
|-- B04: string (nullable = true)
|-- B04qa: string (nullable = true)
|-- B05: string (nullable = true)
|-- B05qa: string (nullable = true)
|-- B06: string (nullable = true)
|-- B06qa: string (nullable = true)
|-- B07: string (nullable = true)
|-- B07qa: string (nullable = true)
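The built-in catalog can be filtered on its metadata columns in the same way as the hand-built one. A sketch, assuming granule_id values match the h04v09 identifier seen in the file names (an assumption worth verifying against the actual catalog contents):

from pyspark.sql import functions as F

# Hypothetical filter: one granule on one date (value formats assumed).
subset = modis_catalog \
    .where(F.col('granule_id') == 'h04v09') \
    .where(F.col('acquisition_date').cast('date') == '2018-07-04')
rf = spark.read.raster(subset, catalog_col_names=['B01', 'B02'])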
Landsat 8
The Landsat 8 catalog includes a richer set of metadata describing the contents of each scene.
l8 = spark.read.format('aws-pds-l8-catalog').load()
l8.printSchema()
root
|-- product_id: string (nullable = false)
|-- entity_id: string (nullable = false)
|-- acquisition_date: timestamp (nullable = false)
|-- cloud_cover_pct: float (nullable = false)
|-- processing_level: string (nullable = false)
|-- path: short (nullable = false)
|-- row: short (nullable = false)
|-- bounds_wgs84: struct (nullable = false)
| |-- minX: double (nullable = false)
| |-- maxX: double (nullable = false)
| |-- minY: double (nullable = false)
| |-- maxY: double (nullable = false)
|-- B1: string (nullable = true)
|-- B2: string (nullable = true)
|-- B3: string (nullable = true)
|-- B4: string (nullable = true)
|-- B5: string (nullable = true)
|-- B6: string (nullable = true)
|-- B7: string (nullable = true)
|-- B8: string (nullable = true)
|-- B9: string (nullable = true)
|-- B10: string (nullable = true)
|-- B11: string (nullable = true)
|-- BQA: string (nullable = true)
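The extra metadata makes scene selection straightforward. For example, a sketch of picking low-cloud scenes whose footprint contains a point of interest; the coordinates and cloud-cover threshold here are illustrative, not from the source:

# Hypothetical filter: low cloud cover over a point of interest.
lon, lat = -77.0, 39.0  # illustrative coordinates
poi = l8 \
    .where(l8.cloud_cover_pct < 10.0) \
    .where((l8.bounds_wgs84.minX < lon) & (l8.bounds_wgs84.maxX > lon)) \
    .where((l8.bounds_wgs84.minY < lat) & (l8.bounds_wgs84.maxY > lat))
rf = spark.read.raster(poi, catalog_col_names=['B4', 'B5'])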