Scala and SQL
One of the great powers of RasterFrames is the ability to express computation in multiple programming languages. The content in this manual focuses on Python because it is the most commonly used language in data science and GIS analytics. However, Scala (the implementation language of RasterFrames) and SQL (commonly used in many domains) are also fully supported. Examples in Python can be mechanically translated into the other two languages without much difficulty once the naming conventions are understood.
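For example, the function names themselves do not change between languages; only the host syntax does. The sketch below is purely illustrative (it assumes a DataFrame, here called df, with a tile column named 'tile'; it is not part of the example that follows):

# Python: compute the per-tile mean of a tile column
df.select(rf_tile_mean('tile'))

# The same function name is used in the other languages:
#   SQL:   SELECT rf_tile_mean(tile) FROM df
#   Scala: df.select(rf_tile_mean($"tile"))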
In the sections below, we show the same example program in each language. The example computes the average NDVI per month for a single tile in Tanzania.
Python
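The steps below assume a RasterFrames-enabled SparkSession named spark, as used throughout this manual, along with the usual pyrasterframes and pyspark function imports. If you are starting a fresh session, a minimal setup sketch (not part of the original example) looks roughly like this:

from pyrasterframes.utils import create_rf_spark_session
from pyrasterframes.rasterfunctions import *   # rf_* and st_* functions used below
from pyspark.sql.functions import col, lit, month, year, dayofmonth

# Create a local SparkSession with RasterFrames functions and data sources registered
spark = create_rf_spark_session()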
Step 1: Load the catalog
modis = spark.read.format('aws-pds-modis-catalog').load()
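The catalog is an ordinary DataFrame, so you can inspect which metadata and band columns are available before filtering:

# List the catalog's columns (acquisition_date, granule_id, band assets, etc.)
modis.printSchema()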
Step 2: Down-select data by month
red_nir_monthly_2017 = modis \
    .select(
        col('granule_id'),
        month('acquisition_date').alias('month'),
        col('B01').alias('red'),
        col('B02').alias('nir')
    ) \
    .where(
        (year('acquisition_date') == 2017) &
        (dayofmonth('acquisition_date') == 15) &
        (col('granule_id') == 'h21v09')
    )
red_nir_monthly_2017.printSchema()
root
|-- granule_id: string (nullable = false)
|-- month: integer (nullable = false)
|-- red: string (nullable = true)
|-- nir: string (nullable = true)
Step 3: Read tiles
red_nir_tiles_monthly_2017 = spark.read.raster(
    red_nir_monthly_2017,
    catalog_col_names=['red', 'nir'],
    tile_dimensions=(256, 256)
)
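After this read, the red and nir catalog columns carry tiles read from the catalog's scene URIs, alongside the other catalog fields. A quick way to confirm the structure before aggregating:

# The 'red' and 'nir' columns now contain tiles rather than scene URIs
red_nir_tiles_monthly_2017.printSchema()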
Step 4: Compute aggregates
result = red_nir_tiles_monthly_2017 \
    .where(st_intersects(
        st_reproject(rf_geometry(col('red')), rf_crs(col('red')).crsProj4, rf_mk_crs('EPSG:4326')),
        st_makePoint(lit(34.870605), lit(-4.729727)))
    ) \
    .groupBy('month') \
    .agg(rf_agg_stats(rf_normalized_difference(col('nir'), col('red'))).alias('ndvi_stats')) \
    .orderBy(col('month')) \
    .select('month', 'ndvi_stats.*')
result
Showing only top 5 rows.
month | data_cells | no_data_cells | min | max | mean | variance |
---|---|---|---|---|---|---|
1 | 65523 | 13 | -0.2519809825673534 | 0.8644836272040303 | 0.4062596673810191 | 0.01407280838385605 |
2 | 65521 | 15 | -0.21232123212321233 | 0.8872390789756832 | 0.46738863804011127 | 0.011822698212885174 |
3 | 64425 | 1111 | -0.36211340206185566 | 0.9208261617900172 | 0.5811071411891395 | 0.011570245465885753 |
4 | 64236 | 1300 | -0.17252657399836469 | 0.922397476340694 | 0.5254596885897274 | 0.01087259231886667 |
5 | 60819 | 4717 | -0.19951338199513383 | 0.916626036079961 | 0.46471430090039234 | 0.01215478443067744 |
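Since the aggregate result has only one row per month, it is small enough to collect locally, for instance into pandas for charting. This is a sketch beyond the original example and assumes pandas (and matplotlib for plotting) are installed:

# Bring the per-month NDVI statistics to the driver as a pandas DataFrame
ndvi_pd = result.toPandas()

# Plot mean NDVI by month (pandas uses matplotlib as its plotting backend)
ndvi_pd.plot(x='month', y='mean', title='Mean NDVI by month')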
SQL
For convenience, we’re going to evaluate SQL from the Python environment. The SQL fragments should work in the spark-sql shell just the same.
def sql(stmt):
    return spark.sql(stmt)
Step 1: Load the catalog
sql("CREATE OR REPLACE TEMPORARY VIEW modis USING `aws-pds-modis-catalog`")
Step 2: Down-select data by month
sql("""
CREATE OR REPLACE TEMPORARY VIEW red_nir_monthly_2017 AS
SELECT granule_id, month(acquisition_date) as month, B01 as red, B02 as nir
FROM modis
WHERE year(acquisition_date) = 2017 AND day(acquisition_date) = 15 AND granule_id = 'h21v09'
""")
sql('DESCRIBE red_nir_monthly_2017')
col_name | data_type | comment |
---|---|---|
granule_id | string | |
month | int | |
red | string | |
nir | string | |
Step 3: Read tiles
sql("""
CREATE OR REPLACE TEMPORARY VIEW red_nir_tiles_monthly_2017
USING raster
OPTIONS (
catalog_table='red_nir_monthly_2017',
catalog_col_names='red,nir',
tile_dimensions='256,256'
)
""")
Step 4: Compute aggregates
grouped = sql("""
SELECT month, ndvi_stats.* FROM (
    SELECT month, rf_agg_stats(rf_normalized_difference(nir, red)) as ndvi_stats
    FROM red_nir_tiles_monthly_2017
    WHERE st_intersects(
        st_reproject(rf_geometry(red), rf_crs(red), 'EPSG:4326'),
        st_makePoint(34.870605, -4.729727))
    GROUP BY month
    ORDER BY month
)
""")
grouped
Showing only top 5 rows.
month | data_cells | no_data_cells | min | max | mean | variance |
---|---|---|---|---|---|---|
1 | 65523 | 13 | -0.2519809825673534 | 0.8644836272040303 | 0.4062596673810191 | 0.01407280838385605 |
2 | 65521 | 15 | -0.21232123212321233 | 0.8872390789756832 | 0.46738863804011127 | 0.011822698212885174 |
3 | 64425 | 1111 | -0.36211340206185566 | 0.9208261617900172 | 0.5811071411891395 | 0.011570245465885753 |
4 | 64236 | 1300 | -0.17252657399836469 | 0.922397476340694 | 0.5254596885897274 | 0.01087259231886667 |
5 | 60819 | 4717 | -0.19951338199513383 | 0.916626036079961 | 0.46471430090039234 | 0.01215478443067744 |
Scala
The latest Scala API documentation is available from the RasterFrames documentation site.
Step 1: Load the catalog
import geotrellis.proj4.LatLng
import org.locationtech.rasterframes._
import org.locationtech.rasterframes.datasource.raster._
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
implicit val spark = SparkSession.builder()
  .master("local[*]")
  .appName("RasterFrames")
  .withKryoSerialization
  .getOrCreate()
  .withRasterFrames

import spark.implicits._
val modis = spark.read.format("aws-pds-modis-catalog").load()
Step 2: Down-select data by month
val red_nir_monthly_2017 = modis
  .select($"granule_id", month($"acquisition_date") as "month", $"B01" as "red", $"B02" as "nir")
  .where(year($"acquisition_date") === 2017 && (dayofmonth($"acquisition_date") === 15) && $"granule_id" === "h21v09")
Step 3: Read tiles
val red_nir_tiles_monthly_2017 = spark.read.raster
  .fromCatalog(red_nir_monthly_2017, "red", "nir")
  .load()
Step 4: Compute aggregates
val result = red_nir_tiles_monthly_2017
  .where(st_intersects(
    st_reproject(rf_geometry($"red"), rf_crs($"red"), LatLng),
    st_makePoint(34.870605, -4.729727)
  ))
  .groupBy("month")
  .agg(rf_agg_stats(rf_normalized_difference($"nir", $"red")) as "ndvi_stats")
  .orderBy("month")
  .select("month", "ndvi_stats.*")

result.show()