Function Reference

For the most up-to-date list of user-defined functions operating on Tiles, see the API documentation for RasterFunctions.

The full Scala API documentation can be found here.

RasterFrames also provides SQL and Python bindings to many UDFs using the Tile column type. In Spark SQL, the functions are already registered in the SQL engine; they are usually prefixed with rf_. In Python, they are available in the pyrasterframes.rasterfunctions module.

The convention in this document will be to define the function signature as below, with its return type, the function name, and named arguments with their types.

ReturnDataType function_name(InputDataType argument1, InputDataType argument2)

List of Available SQL and Python Functions

Vector Operations

Various LocationTech GeoMesa UDFs for working with geometry-type columns are also provided in the SQL engine and within the pyrasterframes.rasterfunctions Python module. These are documented in the LocationTech GeoMesa Spark SQL documentation. These functions are all prefixed with st_.

RasterFrames provides two additional functions for vector geometry.

reproject_geometry

Python: Geometry reproject_geometry(Geometry geom, String origin_crs, String destination_crs)

SQL: rf_reproject_geometry

Reproject the vector geom from origin_crs to destination_crs. Both _crs arguments are either proj4 strings, EPSG codes, or OGC WKT for coordinate reference systems.

envelope

Python:

Struct[Double minX, Double maxX, Double minY, Double maxY] envelope(Geometry geom)

Python only. Extracts the bounding box (envelope) of the geometry.

See also GeoMesa st_envelope which returns a Geometry type.

Tile Metadata and Mutation

Functions to access and change the particulars of a tile: its shape and the data type of its cells. See the section on masking and nodata below for additional discussion of cell types.

cell_types

Python:

Array[String] cell_types()

SQL: rf_cell_types

Return an array of possible cell type names, as listed below. These names are used in other functions. See the discussion on nodata for additional details.

cell_types
bool
int8raw
int8
uint8raw
uint8
int16raw
int16
uint16raw
uint16
int32raw
int32
float32raw
float32
float64raw
float64

tile_dimensions

Python:

Struct[Int, Int] tile_dimensions(Tile tile)

SQL: rf_tile_dimensions

Get the number of columns and rows in the tile, as a Struct of cols and rows.

cell_type

Python:

Struct[String] cell_type(Tile tile)

SQL: rf_cell_type

Get the cell type of the tile. Available cell types can be retrieved with the cell_types function.

convert_cell_type

Python:

Tile convert_cell_type(Tile tileCol, String cellType)

SQL: rf_convert_cell_type

Convert tileCol to a different cell type.

resample

Python:

Tile resample(Tile tile, Double factor)
Tile resample(Tile tile, Int factor)
Tile resample(Tile tile, Tile shape_tile)

SQL: rf_resample

Change the tile dimensions. Passing a numeric factor scales the number of columns and rows in the tile: 1.0 keeps the same number of columns and rows; less than one downsamples the tile; greater than one upsamples it. Passing a shape_tile as the second argument outputs a tile with the same number of columns and rows as shape_tile. All resampling uses the nearest-neighbor method.
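The scaling rule can be sketched in plain Python as nearest-neighbor lookup on a 2D list. This is an illustration of the semantics only, not the RasterFrames implementation:

```python
# Illustrative nearest-neighbor resampling on a plain 2D list,
# mimicking the semantics of resample with a numeric factor.

def resample_nearest(tile, factor):
    """Scale tile dimensions by `factor` using nearest-neighbor lookup."""
    rows, cols = len(tile), len(tile[0])
    new_rows = max(1, int(rows * factor))
    new_cols = max(1, int(cols * factor))
    return [
        [tile[int(r / factor)][int(c / factor)] for c in range(new_cols)]
        for r in range(new_rows)
    ]

src = [[1, 2],
       [3, 4]]
up = resample_nearest(src, 2.0)    # upsample to 4x4
down = resample_nearest(up, 0.5)   # downsample back to 2x2
```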

Tile Creation

Functions to create a new Tile column, either from scratch or from existing data not yet in a tile.

tile_zeros

Python:

Tile tile_zeros(Int tile_columns, Int tile_rows, String cell_type_name)

SQL: rf_tile_zeros

Create a tile of shape tile_columns by tile_rows full of zeros, with the specified cell type. See function cell_types for valid values. All arguments are literal values and not column expressions.

tile_ones

Python:

Tile tile_ones(Int tile_columns, Int tile_rows, String cell_type_name)

SQL: rf_tile_ones

Create a tile of shape tile_columns by tile_rows full of ones, with the specified cell type. See function cell_types for valid values. All arguments are literal values and not column expressions.

make_constant_tile

Python:

Tile make_constant_tile(Numeric constant, Int tile_columns, Int tile_rows,  String cell_type_name)

SQL: rf_make_constant_tile

Create a tile of shape tile_columns by tile_rows full of constant, with the specified cell type. See function cell_types for valid values. All arguments are literal values and not column expressions.

rasterize

Python:

Tile rasterize(Geometry geom, Geometry tile_bounds, Int value, Int tile_columns, Int tile_rows)

SQL: rf_rasterize

Convert a vector Geometry geom into a Tile representation. The value will be “burned in” to the returned tile where geom intersects the tile_bounds. The returned tile will have shape tile_columns by tile_rows. Cells outside geom will be assigned a nodata value. The returned tile has cell type int32; note that value is of type Int.

Parameters tile_columns and tile_rows are literals, not column expressions. The others are column expressions.

Example use: in the code snippet below, you can visualize the tri and b geometries with tools like Wicket. The result is a right triangle burned into the tile, with nodata values shown as ∘.

spark.sql("""
SELECT rf_render_ascii(
        rf_rasterize(tri, b, 8, 10, 10))

FROM 
  ( SELECT st_geomFromWKT('POLYGON((1.5 0.5, 1.5 1.5, 0.5 0.5, 1.5 0.5))') AS tri,
           st_geomFromWKT('POLYGON((0.0 0.0, 2.0 0.0, 2.0 2.0, 0.0 2.0, 0.0 0.0))') AS b
   ) r
""").show(1, False)

-----------
|∘∘∘∘∘∘∘∘∘∘
∘∘∘∘∘∘∘∘∘∘
∘∘∘∘∘∘∘∘∘∘
∘∘∘∘∘∘∘ ∘∘
∘∘∘∘∘∘  ∘∘
∘∘∘∘∘   ∘∘
∘∘∘∘    ∘∘
∘∘∘     ∘∘
∘∘∘∘∘∘∘∘∘∘
∘∘∘∘∘∘∘∘∘∘|
-----------

array_to_tile

Python:

Tile array_to_tile(Array arrayCol, Int numCols, Int numRows)

Python only. Create a tile from a Spark SQL Array, filling values in row-major order.
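The row-major fill order can be illustrated in plain Python. The helper below is hypothetical (analogous to array_to_tile, not a RasterFrames API):

```python
# Row-major fill: how a flat array maps onto a num_cols x num_rows grid.

def array_to_grid(values, num_cols, num_rows):
    assert len(values) == num_cols * num_rows
    # each consecutive run of num_cols values becomes one row
    return [values[r * num_cols:(r + 1) * num_cols] for r in range(num_rows)]

grid = array_to_grid([0, 1, 2, 3, 4, 5], num_cols=3, num_rows=2)
```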

assemble_tile

Python:

Tile assemble_tile(Int colIndex, Int rowIndex, Numeric cellData, Int numCols, Int numRows, String cellType)

Python only. Create a Tile from a column of cell data with location indices. This function is the inverse of explode_tiles. Intended use is with a groupby, producing one row with a new tile per group. The numCols, numRows and cellType arguments are literal values, others are column expressions. Valid values for cellType can be found with function cell_types.

Masking and Nodata

In raster operations, the preservation and correct processing of missing observations is very important. The idea of missing data is often expressed as null or NaN. In raster data, missing observations are often termed NODATA; we will style them as nodata in this document. RasterFrames provides a variety of functions to manage and inspect nodata within tiles.

See also statistical summaries to get the count of data and nodata values per tile and in aggregate over a tile column: data_cells, no_data_cells, agg_data_cells, agg_no_data_cells.

It is important to note that not all cell types support the nodata representation: these are bool and any cell type whose name ends in raw.

For integral valued cell types, the nodata is marked by a special sentinel value. This can be a default, typically zero or the minimum value for the underlying data type. The nodata value can also be a user-defined value. For example if the value 4 is to be interpreted as nodata, the cell type will read ‘int32ud4’.

For float cell types, the nodata can either be NaN or a user-defined value; for example 'float32ud-999.9' would mean the value -999.9 is interpreted as a nodata.

For more reading about cell types and nodata, see the GeoTrellis documentation.
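The naming convention described above can be illustrated with a small parser. Note that parse_cell_type is a hypothetical helper for illustration only, not a RasterFrames function, and it omits the default-sentinel details:

```python
# Sketch of the cell type naming convention: base type, optional 'raw'
# suffix (no nodata support), or 'ud<value>' for a user-defined nodata
# sentinel, e.g. 'int32ud4' or 'float32ud-999.9'.

def parse_cell_type(name):
    for base in ('bool', 'int8', 'uint8', 'int16', 'uint16',
                 'int32', 'float32', 'float64'):
        if name.startswith(base):
            suffix = name[len(base):]
            if suffix == 'raw' or base == 'bool':
                return {'base': base, 'nodata': None}    # no nodata support
            if suffix.startswith('ud'):
                return {'base': base, 'nodata': float(suffix[2:])}
            # default sentinel (zero, type minimum, or NaN) not modeled here
            return {'base': base, 'nodata': 'default'}
    raise ValueError(name)
```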

mask

Python:

Tile mask(Tile tile, Tile mask)

SQL: rf_mask

Where the mask contains nodata, replace values in the tile with nodata.

Returned tile cell type will be coerced to one supporting nodata if it does not already.
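The cell-wise rule can be sketched in plain Python, with None standing in for nodata. This shows the semantics only, not the RasterFrames implementation:

```python
# Wherever the mask holds nodata (None here), the output takes nodata;
# elsewhere the data value passes through unchanged.

def mask_cells(data, mask):
    return [[d if m is not None else None
             for d, m in zip(drow, mrow)]
            for drow, mrow in zip(data, mask)]

data = [[1, 2], [3, 4]]
m    = [[9, None], [None, 9]]
```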

inverse_mask

Python:

Tile inverse_mask(Tile tile, Tile mask)

SQL: rf_inverse_mask

Where the mask does not contain nodata, replace values in tile with nodata.

mask_by_value

Python:

Tile mask_by_value(Tile data_tile, Tile mask_tile, Int mask_value)

SQL: rf_mask_by_value

Generate a tile with the values from data_tile, with nodata in cells where the mask_tile is equal to mask_value.

is_no_data_tile

Python:

Boolean is_no_data_tile(Tile tile)

SQL: rf_is_no_data_tile

Returns true if tile contains only nodata. By definition, it returns false if the cell type does not support nodata.

with_no_data

Python:

Tile with_no_data(Tile tile, Double no_data_value)

Python only. Return a tile column marking as nodata all cells equal to no_data_value.

The no_data_value argument is a literal Double, not a Column expression.

If the input tile already had a nodata value, the behavior depends on whether its cell type is floating point. For floating-point cell types, nodata values on the input tile remain nodata values on the output. For integral cell types, the previous nodata values become literal values.
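The cell-wise rule can be sketched in plain Python, again with None as nodata. Illustration of the semantics only:

```python
# Cells equal to the given value become nodata (None here).

def with_no_data_cells(tile, no_data_value):
    return [[None if v == no_data_value else v for v in row] for row in tile]

marked = with_no_data_cells([[0, 7], [7, 1]], 7)
```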

Map Algebra

Map algebra raster operations are element-wise operations between a tile and a scalar, between two tiles, or among many tiles.

Some of these functions have similar variations in the Python API:

  • local_op: applies op to two columns; the right-hand side can be a tile or a numeric column.
  • local_op_scalar: applies op to a tile and a literal scalar, coercing the tile to a floating-point type.
  • local_op_scalar_int: applies op to a tile and a literal scalar, without coercing the tile to a floating-point type.

We will provide all these variations for local_add and then suppress the rest in this document.

The SQL API does not require the local_op_scalar or local_op_scalar_int forms.
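The cell-wise rule behind these variants can be sketched in plain Python using local_add as the example op: the right-hand side is either another tile (element-wise) or a scalar broadcast to every cell. Semantics sketch only, not the RasterFrames implementation:

```python
# Element-wise add between two 2D lists, or broadcast of a scalar.

def local_add_cells(tile, rhs):
    if isinstance(rhs, list):                        # tile + tile
        return [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(tile, rhs)]
    return [[a + rhs for a in row] for row in tile]  # tile + scalar
```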

local_add

Python:

Tile local_add(Tile tile1, Tile rhs)
Tile local_add(Tile tile1, Int rhs)
Tile local_add(Tile tile1, Double rhs)

SQL: rf_local_add

Returns a tile column containing the element-wise sum of tile1 and rhs.

local_add_scalar

Python:

Tile local_add_scalar(Tile tile, Double scalar)

SQL: rf_local_add_scalar

Returns a tile column containing the element-wise sum of tile and scalar. If tile has an integral cell type, it will be coerced to floating point before addition; returns a float-valued tile.

local_add_scalar_int

Python:

Tile local_add_scalar_int(Tile tile, Int scalar)

SQL: rf_local_add_scalar_int

Returns a tile column containing the element-wise sum of tile and scalar. If tile has an integral cell type, an integral-type tile is returned.

local_subtract

Python:

Tile local_subtract(Tile tile1, Tile rhs)
Tile local_subtract(Tile tile1, Int rhs)
Tile local_subtract(Tile tile1, Double rhs)

SQL: rf_local_subtract

Returns a tile column containing the element-wise difference of tile1 and rhs.

local_multiply

Python:

Tile local_multiply(Tile tile1, Tile rhs)
Tile local_multiply(Tile tile1, Int rhs)
Tile local_multiply(Tile tile1, Double rhs)

SQL: rf_local_multiply

Returns a tile column containing the element-wise product of tile1 and rhs. This is not the matrix multiplication of tile1 and rhs.

local_divide

Python:

Tile local_divide(Tile tile1, Tile rhs)
Tile local_divide(Tile tile1, Int rhs)
Tile local_divide(Tile tile1, Double rhs)

SQL: rf_local_divide

Returns a tile column containing the element-wise quotient of tile1 and rhs.

normalized_difference

Python:

Tile normalized_difference(Tile tile1, Tile tile2)

SQL: rf_normalized_difference

Compute the normalized difference of the two tiles: (tile1 - tile2) / (tile1 + tile2). Result is always floating point cell type. This function has no scalar variant.
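The formula applied cell-wise is familiar from indices such as NDVI computed from near-infrared and red bands. A plain-Python sketch of the semantics (the band values below are made up for illustration):

```python
# (t1 - t2) / (t1 + t2), applied cell by cell.

def normalized_difference_cells(t1, t2):
    return [[(a - b) / (a + b) for a, b in zip(r1, r2)]
            for r1, r2 in zip(t1, t2)]

nir = [[0.8, 0.6]]   # hypothetical near-infrared reflectances
red = [[0.2, 0.2]]   # hypothetical red reflectances
ndvi = normalized_difference_cells(nir, red)
```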

local_less

Python:

Tile local_less(Tile tile1, Tile rhs)
Tile local_less(Tile tile1, Int rhs)
Tile local_less(Tile tile1, Double rhs)

SQL: rf_less

Returns a tile column containing the element-wise evaluation of whether tile1 is less than rhs.

local_less_equal

Python:

Tile local_less_equal(Tile tile1, Tile rhs)
Tile local_less_equal(Tile tile1, Int rhs)
Tile local_less_equal(Tile tile1, Double rhs)

SQL: rf_less_equal

Returns a tile column containing the element-wise evaluation of whether tile1 is less than or equal to rhs.

local_greater

Python:

Tile local_greater(Tile tile1, Tile rhs)
Tile local_greater(Tile tile1, Int rhs)
Tile local_greater(Tile tile1, Double rhs)

SQL: rf_greater

Returns a tile column containing the element-wise evaluation of whether tile1 is greater than rhs.

local_greater_equal

Python:

Tile local_greater_equal(Tile tile1, Tile rhs)
Tile local_greater_equal(Tile tile1, Int rhs)
Tile local_greater_equal(Tile tile1, Double rhs)

SQL: rf_greater_equal

Returns a tile column containing the element-wise evaluation of whether tile1 is greater than or equal to rhs.

local_equal

Python:

Tile local_equal(Tile tile1, Tile rhs)
Tile local_equal(Tile tile1, Int rhs)
Tile local_equal(Tile tile1, Double rhs)

SQL: rf_equal

Returns a tile column containing the element-wise equality of tile1 and rhs.

local_unequal

Python:

Tile local_unequal(Tile tile1, Tile rhs)
Tile local_unequal(Tile tile1, Int rhs)
Tile local_unequal(Tile tile1, Double rhs)

SQL: rf_unequal

Returns a tile column containing the element-wise inequality of tile1 and rhs.

round

Python:

Tile round(Tile tile)

SQL: rf_round

Round cell values to the nearest integer without changing the cell type.

exp

Python:

Tile exp(Tile tile)

SQL: rf_exp

Performs cell-wise exponential.

exp10

Python:

Tile exp10(Tile tile)

SQL: rf_exp10

Compute 10 to the power of cell values.

exp2

Python:

Tile exp2(Tile tile)

SQL: rf_exp2

Compute 2 to the power of cell values.

expm1

Python:

Tile expm1(Tile tile)

SQL: rf_expm1

Performs cell-wise exponential, then subtracts one. Inverse of log1p.

log

Python:

Tile log(Tile tile)

SQL: rf_log

Performs cell-wise natural logarithm.

log10

Python:

Tile log10(Tile tile)

SQL: rf_log10

Performs cell-wise logarithm with base 10.

log2

Python:

Tile log2(Tile tile)

SQL: rf_log2

Performs cell-wise logarithm with base 2.

log1p

Python:

Tile log1p(Tile tile)

SQL: rf_log1p

Performs the cell-wise natural logarithm of one plus the cell value. Inverse of expm1.
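expm1 and log1p form an inverse pair that stays numerically accurate for cell values near zero, which is why they exist alongside exp and log. A scalar illustration using Python's math module:

```python
import math

x = 1e-10
# log1p(expm1(x)) recovers x with error on the order of one ulp,
# whereas log(exp(x)) loses precision because exp(x) rounds near 1.0.
roundtrip = math.log1p(math.expm1(x))
```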

Tile Statistics

The following functions compute a statistical summary per row of a tile column. The statistics are computed across the cells of a single tile, within each DataFrame Row. Consider the following example.

import pyspark.sql.functions as F
from pyrasterframes.rasterfunctions import tile_sum

spark.sql("""
 SELECT 1 as id, rf_tile_ones(5, 5, 'float32') as t
 UNION
 SELECT 2 as id, rf_local_multiply(rf_tile_ones(5, 5, 'float32'), 3) as t
 """).select(F.col('id'), tile_sum(F.col('t'))).show()


+---+-----------+
| id|tile_sum(t)|
+---+-----------+
|  2|       75.0|
|  1|       25.0|
+---+-----------+

tile_sum

Python:

Double tile_sum(Tile tile)

SQL: rf_tile_sum

Computes the sum of cells in each row of column tile, ignoring nodata values.

tile_mean

Python:

Double tile_mean(Tile tile)

SQL: rf_tile_mean

Computes the mean of cells in each row of column tile, ignoring nodata values.

tile_min

Python:

Double tile_min(Tile tile)

SQL: rf_tile_min

Computes the min of cells in each row of column tile, ignoring nodata values.

tile_max

Python:

Double tile_max(Tile tile)

SQL: rf_tile_max

Computes the max of cells in each row of column tile, ignoring nodata values.

no_data_cells

Python:

Long no_data_cells(Tile tile)

SQL: rf_no_data_cells

Return the count of nodata cells in the tile.

data_cells

Python:

Long data_cells(Tile tile)

SQL: rf_data_cells

Return the count of data cells in the tile.

tile_stats

Python:

Struct[Long, Long, Double, Double, Double, Double] tile_stats(Tile tile)

SQL: rf_tile_stats

Computes the following statistics of cells in each row of column tile: data cell count, nodata cell count, minimum, maximum, mean, and variance. The minimum, maximum, mean, and variance are computed ignoring nodata values.
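The same summary can be computed on a plain list of cell values, treating None as nodata. This sketch assumes population variance; it illustrates the semantics only, not the RasterFrames implementation:

```python
# Data count, nodata count, min, max, mean, variance, with nodata (None)
# excluded from the numeric statistics.

def cell_stats(cells):
    data = [c for c in cells if c is not None]
    n = len(data)
    mean = sum(data) / n
    var = sum((c - mean) ** 2 for c in data) / n   # population variance
    return {'data_cells': n, 'no_data_cells': len(cells) - n,
            'min': min(data), 'max': max(data), 'mean': mean, 'variance': var}

s = cell_stats([1.0, 3.0, None, 5.0])
```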

tile_histogram

Python:

Struct[Struct[Long, Long, Double, Double, Double, Double], Array[Struct[Double, Long]]] tile_histogram(Tile tile)

SQL: rf_tile_histogram

Computes a statistical summary of cell values within each row of tile. The resulting column has the schema below. Note that several of the other tile statistics functions are convenience methods to extract parts of this result. Related is agg_approx_histogram, which computes the statistics across all rows in a group.

 |-- tile_histogram: struct (nullable = true)
 |    |-- stats: struct (nullable = true)
 |    |    |-- dataCells: long (nullable = false)
 |    |    |-- noDataCells: long (nullable = false)
 |    |    |-- min: double (nullable = false)
 |    |    |-- max: double (nullable = false)
 |    |    |-- mean: double (nullable = false)
 |    |    |-- variance: double (nullable = false)
 |    |-- bins: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- value: double (nullable = false)
 |    |    |    |-- count: long (nullable = false)

Aggregate Tile Statistics

These functions compute statistical summaries over all of the cell values across all rows in the DataFrame or group. The example below computes a single double-valued mean per month, across all data cells in the red_band tile column. This would return at most twelve rows.

from pyspark.sql.functions import month
from pyrasterframes.rasterfunctions import agg_mean

rf.groupby(month(rf.datetime)).agg(agg_mean(rf.red_band).alias('red_mean_monthly'))

Continuing our example from the Tile Statistics section, consider the following. Note that only a single row is returned. It is averaging 25 values of 1.0 and 25 values of 3.0, across the fifty cells in two rows.

import pyspark.sql.functions as F
from pyrasterframes.rasterfunctions import agg_mean

spark.sql("""
SELECT 1 as id, rf_tile_ones(5, 5, 'float32') as t
UNION
SELECT 2 as id, rf_local_multiply_scalar(rf_tile_ones(5, 5, 'float32'), 3) as t
""").agg(agg_mean(F.col('t'))).show(10, False)

+-----------+
|agg_mean(t)|
+-----------+
|2.0        |
+-----------+

agg_mean

Python:

Double agg_mean(Tile tile)

SQL: rf_agg_stats(tile).mean

Aggregates over the tile and returns the mean of cell values, ignoring nodata. Equivalent to agg_stats.mean.

agg_data_cells

Python:

Long agg_data_cells(Tile tile)

SQL: rf_agg_stats(tile).dataCells

Aggregates over the tile and returns the count of data cells. Equivalent to agg_stats.dataCells. Cf. data_cells; equivalent code:

rf.select(agg_data_cells(rf.tile).alias('agg_data_cell')).show()
# Equivalent to
rf.agg(F.sum(data_cells(rf.tile)).alias('agg_data_cell')).show()

agg_no_data_cells

Python:

Long agg_no_data_cells(Tile tile)

SQL: rf_agg_stats(tile).noDataCells

Aggregates over the tile and returns the count of nodata cells. Equivalent to agg_stats.noDataCells. Cf. no_data_cells, a row-wise count of nodata cells.

agg_stats

Python:

Struct[Long, Long, Double, Double, Double, Double] agg_stats(Tile tile)

SQL: rf_agg_stats

Aggregates over the tile and returns statistical summaries of cell values: number of data cells, number of nodata cells, minimum, maximum, mean, and variance. The minimum, maximum, mean, and variance ignore the presence of nodata.

agg_approx_histogram

Python:

Struct[Struct[Long, Long, Double, Double, Double, Double], Array[Struct[Double, Long]]] agg_approx_histogram(Tile tile)

SQL: rf_agg_approx_histogram

Aggregates over the tile and returns statistical summaries of the cell values, including a histogram, in the schema below. The bins array contains tuples of histogram values and counts. Typically values are plotted on the x-axis and counts on the y-axis.

Note that several of the other cell value statistics functions are convenience methods to extract parts of this result. Related is the tile_histogram function which operates on a single row at a time.

 |-- agg_approx_histogram: struct (nullable = true)
 |    |-- stats: struct (nullable = true)
 |    |    |-- dataCells: long (nullable = false)
 |    |    |-- noDataCells: long (nullable = false)
 |    |    |-- min: double (nullable = false)
 |    |    |-- max: double (nullable = false)
 |    |    |-- mean: double (nullable = false)
 |    |    |-- variance: double (nullable = false)
 |    |-- bins: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- value: double (nullable = false)
 |    |    |    |-- count: long (nullable = false)

Tile Local Aggregate Statistics

Local statistics compute the element-wise statistics across a DataFrame or group of tiles, resulting in a tile of the same dimensions.

Consider again our example from Tile Statistics and Aggregate Tile Statistics, this time applying agg_local_mean. We see that it computes the element-wise mean across the two rows. In this case it averages one value of 1.0 and one value of 3.0 to arrive at the element-wise mean, doing so twenty-five times, once for each position in the tile.

import pyspark.sql.functions as F
from pyrasterframes.rasterfunctions import agg_local_mean, tile_dimensions, explode_tiles

lam = spark.sql("""
SELECT 1 as id, rf_tile_ones(5, 5, 'float32') as t
UNION
SELECT 2 as id, rf_local_multiply(rf_tile_ones(5, 5, 'float32'), 3) as t
""").agg(agg_local_mean(F.col('t')).alias('l'))

# agg_local_mean returns a tile
lam.select(tile_dimensions(lam.l)).show()

+------------------+
|tile_dimensions(l)|
+------------------+
|            [5, 5]|
+------------------+

lam.select(explode_tiles(lam.l)).show(10, False)
+------------+---------+---+
|column_index|row_index|l  |
+------------+---------+---+
|0           |0        |2.0|
|1           |0        |2.0|
|2           |0        |2.0|
|3           |0        |2.0|
|4           |0        |2.0|
|0           |1        |2.0|
|1           |1        |2.0|
|2           |1        |2.0|
|3           |1        |2.0|
|4           |1        |2.0|
+------------+---------+---+
only showing top 10 rows

agg_local_max

Python:

Tile agg_local_max(Tile tile)

SQL: rf_agg_local_max

Compute the cell-local maximum operation over Tiles in a column.

agg_local_min

Python:

Tile agg_local_min(Tile tile)

SQL: rf_agg_local_min

Compute the cell-local minimum operation over Tiles in a column.

agg_local_mean

Python:

Tile agg_local_mean(Tile tile)

SQL: rf_agg_local_mean

Compute the cell-local mean operation over Tiles in a column.

agg_local_data_cells

Python:

Tile agg_local_data_cells(Tile tile)

SQL: rf_agg_local_data_cells

Compute the cell-local count of data cells over Tiles in a column. Returned tile has a cell type of int32.

agg_local_no_data_cells

Python:

Tile agg_local_no_data_cells(Tile tile)

SQL: rf_agg_local_no_data_cells

Compute the cell-local count of nodata cells over Tiles in a column. Returned tile has a cell type of int32.

agg_local_stats

Python:

Struct[Tile, Tile, Tile, Tile, Tile] agg_local_stats(Tile tile)

SQL: rf_agg_local_stats

Compute cell-local aggregate count, minimum, maximum, mean, and variance for a column of Tiles. Returns a struct of five tiles.

Converting Tiles

RasterFrames provides several ways to convert a tile into other data structures. See also functions for creating tiles.

explode_tiles

Python:

Int, Int, Numeric* explode_tiles(Tile* tile)

SQL: rf_explode_tiles

Create a row for each cell in tile columns. Multiple tile columns can be passed in, and the returned DataFrame will have one numeric column per input. There will also be columns for column_index and row_index. Inverse of assemble_tile. When using this function, be sure to include a unique row identifier in order to successfully invert the operation.
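The explode operation can be sketched in plain Python: each cell becomes one record carrying its column index, row index, and value. Semantics sketch only, not the RasterFrames implementation:

```python
# One (column_index, row_index, value) record per cell of a 2D tile.

def explode_cells(tile):
    return [(c, r, tile[r][c])
            for r in range(len(tile))
            for c in range(len(tile[0]))]

records = explode_cells([[1, 2], [3, 4]])
```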

explode_tiles_sample

Python:

Int, Int, Numeric* explode_tiles_sample(Double sample_frac, Long seed, Tile* tile)

Python only. As with explode_tiles, but taking a randomly sampled subset of cells. Equivalent to the below, but this implementation is optimized for speed. Parameter sample_frac should be between 0.0 and 1.0.

df.select(df.id, explode_tiles(df.tile1, df.tile2, df.tile3)) \
    .sample(False, 0.05, 8675309)
# Equivalent result, faster
df.select(df.id, explode_tiles_sample(0.05, 8675309, df.tile1, df.tile2, df.tile3))

tile_to_int_array

Python:

Array tile_to_int_array(Tile tile)

SQL: rf_tile_to_int_array

Convert Tile column to Spark SQL Array, in row-major order. Float cell types will be coerced to integral type by flooring.

tile_to_double_array

Python:

Array tile_to_double_array(Tile tile)

SQL: rf_tile_to_double_array

Convert tile column to Spark Array, in row-major order. Integral cell types will be coerced to doubles.

render_ascii

Python:

String render_ascii(Tile tile)

SQL: rf_render_ascii

Pretty print the tile values as plain text.