Raster Statistics

RasterFrames has a number of extension methods and columnar functions for performing analysis on tiles.

Tile Statistics

Tile Dimensions

Get the nominal dimensions (columns and rows) of each tile. Depending on the tiling, tiles along the edges of the raster may have different sizes.

scala> rf.select(rf.spatialKeyColumn, tile_dimensions($"tile")).show(3)
+-----------+---------------------+
|spatial_key|tile_dimensions(tile)|
+-----------+---------------------+
|     [6, 3]|           [128, 128]|
|     [4, 0]|           [128, 128]|
|     [0, 0]|           [128, 128]|
+-----------+---------------------+
only showing top 3 rows

Descriptive Statistics

NoData Counts

Count the number of NoData and non-NoData cells in each tile.

scala> rf.select(rf.spatialKeyColumn, no_data_cells($"tile"), data_cells($"tile")).show(3)
+-----------+-------------------+----------------+
|spatial_key|no_data_cells(tile)|data_cells(tile)|
+-----------+-------------------+----------------+
|     [6, 3]|              15688|             696|
|     [4, 0]|                  0|           16384|
|     [0, 0]|                  0|           16384|
+-----------+-------------------+----------------+
only showing top 3 rows
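
In the output above the two counts for each tile add up to the tile size, for example 15688 + 696 = 16384 = 128 × 128. The relationship can be sketched in plain Scala without Spark. This is an illustration only, not the RasterFrames implementation: the object NoDataCounts and its methods are hypothetical names, and it assumes cells are Doubles with NoData encoded as NaN.

```scala
// Hypothetical helper, not part of RasterFrames.
// Assumption: cells are Doubles and NoData is encoded as NaN.
object NoDataCounts {
  // Number of cells flagged as NoData.
  def noDataCells(cells: Array[Double]): Long =
    cells.count(_.isNaN).toLong

  // Number of cells holding a value; the two counts always sum to cells.length.
  def dataCells(cells: Array[Double]): Long =
    cells.length.toLong - noDataCells(cells)
}
```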

Tile Mean

Compute the mean cell value in each tile. Use tileMean for integral cell types, and tileMeanDouble for floating-point cell types (the example here uses the snake_case name tile_mean).

scala> rf.select(rf.spatialKeyColumn, tile_mean($"tile")).show(3)
+-----------+------------------+
|spatial_key|   tile_mean(tile)|
+-----------+------------------+
|     [6, 3]|10757.254310344828|
|     [4, 0]| 9883.589050292969|
|     [0, 0]|10338.119995117188|
+-----------+------------------+
only showing top 3 rows

Tile Summary Statistics

Compute a suite of summary statistics for each tile. Use tile_stats for integral cell types, and tile_stats_double for floating-point cell types.

scala> rf.withColumn("stats", tile_stats($"tile")).select(rf.spatialKeyColumn, $"stats.*").show(3)
+-----------+----------+-------------+------+-------+------------------+------------------+
|spatial_key|data_cells|no_data_cells|   min|    max|              mean|          variance|
+-----------+----------+-------------+------+-------+------------------+------------------+
|     [6, 3]|       696|        15688|7604.0|16143.0|10757.254310344822| 3271125.902280271|
|     [4, 0]|     16384|            0|7678.0|16464.0| 9883.589050292961|2163148.3790329304|
|     [0, 0]|     16384|            0|7291.0|23077.0| 10338.11999511721|3386469.0957086035|
+-----------+----------+-------------+------+-------+------------------+------------------+
only showing top 3 rows
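
To make the columns above concrete, here is a minimal plain-Scala sketch that computes the same six fields for a single array of cells. It is not the RasterFrames implementation. TileSummary is a hypothetical name, NoData is assumed to be NaN, and the variance is assumed to be the population variance (dividing by the number of data cells); the text here does not say which form tile_stats uses.

```scala
// Hypothetical container mirroring the columns of the stats struct shown above.
final case class TileSummary(
  dataCells: Long, noDataCells: Long,
  min: Double, max: Double, mean: Double, variance: Double)

object TileSummary {
  // Assumptions: NoData is NaN; variance is the population variance.
  // Returns None when the tile has no data cells at all.
  def of(cells: Array[Double]): Option[TileSummary] = {
    val data = cells.filterNot(_.isNaN)
    if (data.isEmpty) None
    else {
      val n = data.length
      val mean = data.sum / n
      val variance = data.map(v => (v - mean) * (v - mean)).sum / n
      val lo = data.reduce((a, b) => math.min(a, b))
      val hi = data.reduce((a, b) => math.max(a, b))
      Some(TileSummary(n.toLong, (cells.length - n).toLong, lo, hi, mean, variance))
    }
  }
}
```

NoData cells are counted but excluded from min, max, mean and variance, so a mostly empty edge tile still yields meaningful statistics for the cells it does contain.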

Histogram

The tile_histogram function computes a histogram over the data in each tile.

This example computes five quantile breaks for each tile.

scala> rf.select(tile_histogram($"tile")).map(_.quantileBreaks(5)).show(5, false)
+--------------------------------------------------------------------------------------------------+
|value                                                                                             |
+--------------------------------------------------------------------------------------------------+
|[8843.0, 9917.999999999985, 10658.999999999978, 11576.000000000005, 12501.000000000024]           |
|[8137.833333333333, 8854.666666666668, 9922.5, 10717.555555555555, 11448.666666666668]            |
|[7942.916666666666, 9022.333333333334, 10684.0, 11425.944444444445, 12198.166666666668]           |
|[9256.833333333332, 10300.466666666667, 10989.5, 11502.333333333334, 12133.222222222223]          |
|[9189.666666666666, 9733.037037037036, 10065.888888888889, 10452.666666666666, 10991.533333333333]|
+--------------------------------------------------------------------------------------------------+
only showing top 5 rows
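
As a rough illustration of what a quantile break is, the sketch below computes breaks exactly by sorting the cells and using the nearest-rank definition. This is a swapped-in, simpler technique: the histogram returned by tile_histogram may use a different (for example approximate or interpolating) method, as the fractional values above suggest. The object QuantileBreaks is a hypothetical name, and NoData is assumed to be NaN.

```scala
// Hypothetical helper, not the library's algorithm.
// Assumption: NoData is NaN and is ignored.
object QuantileBreaks {
  // Returns `num` breaks; break i (0-based) is the smallest value such that
  // at least (i + 1) / num of the data cells are less than or equal to it.
  def quantileBreaks(cells: Array[Double], num: Int): Array[Double] = {
    require(num > 0, "num must be positive")
    val sorted = cells.filterNot(_.isNaN) // filterNot returns a fresh array
    java.util.Arrays.sort(sorted)         // so sorting in place is safe
    if (sorted.isEmpty) Array.empty[Double]
    else Array.tabulate(num) { i =>
      // Nearest rank: ceil((i + 1) * n / num), computed with integer arithmetic.
      val rank = ((i + 1).toLong * sorted.length + num - 1) / num
      sorted((rank - 1).toInt)
    }
  }
}
```

With this definition the last break is always the maximum data value, and consecutive breaks delimit groups holding roughly equal numbers of cells.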

Aggregate Statistics

The agg_stats function computes the same summary statistics as tile_stats, but aggregates them over the whole RasterFrame.

scala> rf.select(agg_stats($"tile")).show()
+----------+-------------+------+-------+-----------------+------------------+
|data_cells|no_data_cells|   min|    max|             mean|          variance|
+----------+-------------+------+-------+-----------------+------------------+
|    387000|        71752|7209.0|39217.0|10160.48549870801|3315238.5311127007|
+----------+-------------+------+-------+-----------------+------------------+
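
One way to see how per-tile statistics can be rolled up into a single row is the pairwise merge sketched below. It is an illustration under stated assumptions, not necessarily how agg_stats is implemented: AggSummary is a hypothetical name, the variance is assumed to be the population variance, and the merge uses the standard formula for combining the mean and variance of two disjoint groups.

```scala
// Hypothetical container with the same fields as the aggregate row above.
final case class AggSummary(
  dataCells: Long, noDataCells: Long,
  min: Double, max: Double, mean: Double, variance: Double) {

  // Combine statistics of two disjoint sets of cells.
  // Assumption: `variance` is the population variance of the data cells.
  def merge(that: AggSummary): AggSummary = {
    val noData = noDataCells + that.noDataCells
    if (that.dataCells == 0) this.copy(noDataCells = noData)
    else if (dataCells == 0) that.copy(noDataCells = noData)
    else {
      val n = dataCells + that.dataCells
      val delta = that.mean - mean
      // Sum of squared deviations of the union, from the two groups' sums.
      val m2 = variance * dataCells + that.variance * that.dataCells +
        delta * delta * dataCells * that.dataCells / n
      AggSummary(
        n, noData,
        math.min(min, that.min), math.max(max, that.max),
        mean + delta * that.dataCells / n,
        m2 / n)
    }
  }
}
```

Because the merge is associative, the summaries of many tiles can be reduced in any order, for example with tiles.reduce(_ merge _), which is the property a distributed aggregation relies on.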

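A more involved example: compute an approximate histogram over the whole RasterFrame with agg_approx_histogram, extract the count for each bin label, and list the ten most populated bins.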

scala> rf.select(agg_approx_histogram($"tile")).
     |   map(h => for(v <- h.labels) yield(v, h.itemCount(v))).
     |   select(explode($"value") as "counts").
     |   select("counts._1", "counts._2").
     |   toDF("value", "count").
     |   orderBy(desc("count")).
     |   show(10)
+------------------+-----+
|             value|count|
+------------------+-----+
| 7905.780878889613|59871|
|  9693.36822122893|37138|
|10731.770891323657|33770|
| 10076.43293835417|27512|
| 8365.393423741412|26915|
|11646.288154754428|23883|
| 11084.84999789323|23733|
| 9021.338606741572|22250|
|10385.199022093442|22088|
|11359.327558440293|20491|
+------------------+-----+
only showing top 10 rows
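
The shape of that result, a list of (value, count) pairs ordered by count, can be reproduced in plain Scala with a much simpler histogram. The sketch below uses equal-width bins labelled by their centers, which is an assumption made for illustration; the labels in the output above are not evenly spaced, so agg_approx_histogram evidently bins differently. BinCounts is a hypothetical name and NoData is assumed to be NaN.

```scala
// Hypothetical helper using equal-width bins; not the library's binning.
// Assumption: NoData is NaN and is ignored.
object BinCounts {
  // Returns (bin center, count) for every non-empty bin, largest count first.
  def binCounts(cells: Array[Double], bins: Int): Seq[(Double, Long)] = {
    require(bins > 0, "bins must be positive")
    val data = cells.filterNot(_.isNaN)
    if (data.isEmpty) Seq.empty
    else {
      val lo = data.reduce((a, b) => math.min(a, b))
      val hi = data.reduce((a, b) => math.max(a, b))
      // Guard against a zero width when all cells share one value.
      val width = if (hi > lo) (hi - lo) / bins else 1.0
      val counts = new Array[Long](bins)
      data.foreach { v =>
        // The maximum value falls on the upper edge, so clamp it into the last bin.
        val i = math.min(((v - lo) / width).toInt, bins - 1)
        counts(i) += 1
      }
      counts.toSeq.zipWithIndex
        .collect { case (c, i) if c > 0 => (lo + (i + 0.5) * width, c) }
        .sortBy { case (_, c) => -c }
    }
  }
}
```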