# Raster Statistics

RasterFrames has a number of extension methods and columnar functions for performing analysis on tiles.

## Tile Statistics

### Tile Dimensions

Get the nominal tile dimensions. Depending on the tiling, tiles along the edges may differ in size from the rest.

``````scala
scala> rf.select(rf.spatialKeyColumn, tile_dimensions($"tile")).show(3)
+-----------+---------------------+
|spatial_key|tile_dimensions(tile)|
+-----------+---------------------+
|     [6, 3]|           [128, 128]|
|     [4, 0]|           [128, 128]|
|     [0, 0]|           [128, 128]|
+-----------+---------------------+
only showing top 3 rows

``````

### Descriptive Statistics

#### NoData Counts

Count the number of `NoData` and non-`NoData` cells in each tile.

``````scala
scala> rf.select(rf.spatialKeyColumn, no_data_cells($"tile"), data_cells($"tile")).show(3)
+-----------+-------------------+----------------+
|spatial_key|no_data_cells(tile)|data_cells(tile)|
+-----------+-------------------+----------------+
|     [6, 3]|              15688|             696|
|     [4, 0]|                  0|           16384|
|     [0, 0]|                  0|           16384|
+-----------+-------------------+----------------+
only showing top 3 rows

``````
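In the output above, the two counts for each tile add up to the tile's total cell count (for example, 15688 + 696 = 16384 = 128 × 128). The counting itself can be sketched in plain Scala without Spark. This is only an illustration, not the RasterFrames implementation: `NoDataCountSketch` and `countCells` are hypothetical names, and the sketch assumes integer cells that use `Int.MinValue` as the `NoData` sentinel.

```scala
// Plain-Scala sketch; no Spark or RasterFrames required.
object NoDataCountSketch {
  // Assumption: integer cells use Int.MinValue as the NoData sentinel.
  val NoData: Int = Int.MinValue

  /** Returns (noDataCells, dataCells) for a flat array of cell values. */
  def countCells(cells: Array[Int]): (Long, Long) = {
    val noData = cells.count(_ == NoData).toLong
    (noData, cells.length - noData)
  }
}
```

For instance, `NoDataCountSketch.countCells(Array(1, Int.MinValue, 3, Int.MinValue))` yields `(2, 2)`.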

#### Tile Mean

Compute the mean value in each tile. Use `tile_mean` for integral cell types, and `tile_mean_double` for floating point cell types.

``````scala
scala> rf.select(rf.spatialKeyColumn, tile_mean($"tile")).show(3)
+-----------+------------------+
|spatial_key|   tile_mean(tile)|
+-----------+------------------+
|     [6, 3]|10757.254310344828|
|     [4, 0]| 9883.589050292969|
|     [0, 0]|10338.119995117188|
+-----------+------------------+
only showing top 3 rows

``````

#### Tile Summary Statistics

Compute a suite of summary statistics for each tile. Use `tile_stats` for integral cell types, and `tile_stats_double` for floating point cell types.

``````scala
scala> rf.withColumn("stats", tile_stats($"tile")).select(rf.spatialKeyColumn, $"stats.*").show(3)
+-----------+----------+-------------+------+-------+------------------+------------------+
|spatial_key|data_cells|no_data_cells|   min|    max|              mean|          variance|
+-----------+----------+-------------+------+-------+------------------+------------------+
|     [6, 3]|       696|        15688|7604.0|16143.0|10757.254310344822| 3271125.902280271|
|     [4, 0]|     16384|            0|7678.0|16464.0| 9883.589050292961|2163148.3790329304|
|     [0, 0]|     16384|            0|7291.0|23077.0| 10338.11999511721|3386469.0957086035|
+-----------+----------+-------------+------+-------+------------------+------------------+
only showing top 3 rows

``````
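To make the columns above concrete, here is a minimal plain-Scala sketch of how such a summary could be derived from the data cells of a single tile. It is not the library's code: `TileStatsSketch`, `Stats`, and `stats` are hypothetical names, `Int.MinValue` is assumed to be the `NoData` sentinel, and the variance is assumed to be the population variance (dividing by the number of data cells), which may differ from what `tile_stats` actually computes.

```scala
object TileStatsSketch {
  // Assumption: integer cells use Int.MinValue as the NoData sentinel.
  val NoData: Int = Int.MinValue

  // Hypothetical container mirroring the columns in the output above.
  final case class Stats(dataCells: Long, noDataCells: Long,
                         min: Double, max: Double, mean: Double, variance: Double)

  /** Summarizes the data cells; returns None when every cell is NoData. */
  def stats(cells: Array[Int]): Option[Stats] = {
    val data = cells.filter(_ != NoData).map(_.toDouble)
    if (data.isEmpty) None
    else {
      val n = data.length
      val mean = data.sum / n
      // Assumption: population variance (divide by n, not n - 1).
      val variance = data.map(v => (v - mean) * (v - mean)).sum / n
      Some(Stats(n, cells.length - n, data.min, data.max, mean, variance))
    }
  }
}
```

Only data cells contribute to `min`, `max`, `mean`, and `variance` in this sketch; `NoData` cells are counted but otherwise ignored.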

### Histogram

The `tile_histogram` function computes a histogram over the data in each tile.

In this example we compute quantile breaks.

``````scala
scala> rf.select(tile_histogram($"tile")).map(_.quantileBreaks(5)).show(5, false)
+--------------------------------------------------------------------------------------------------+
|value                                                                                             |
+--------------------------------------------------------------------------------------------------+
|[8843.0, 9917.999999999985, 10658.999999999978, 11576.000000000005, 12501.000000000024]           |
|[8137.833333333333, 8854.666666666668, 9922.5, 10717.555555555555, 11448.666666666668]            |
|[7942.916666666666, 9022.333333333334, 10684.0, 11425.944444444445, 12198.166666666668]           |
|[9256.833333333332, 10300.466666666667, 10989.5, 11502.333333333334, 12133.222222222223]          |
|[9189.666666666666, 9733.037037037036, 10065.888888888889, 10452.666666666666, 10991.533333333333]|
+--------------------------------------------------------------------------------------------------+
only showing top 5 rows

``````
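As a rough illustration of what quantile breaks represent, the sketch below splits a set of values into `n` classes of roughly equal count and reports the upper value of each class. It uses a simple nearest-rank rule on exactly sorted data, which is an assumption made for clarity; the histogram returned by `tile_histogram` may approximate and interpolate, so its breaks will generally differ. `QuantileBreaksSketch` is a hypothetical name.

```scala
object QuantileBreaksSketch {
  /** Upper value of each of `n` equal-count classes, by nearest rank on sorted values. */
  def quantileBreaks(values: Seq[Double], n: Int): Seq[Double] = {
    require(n > 0 && values.nonEmpty, "need at least one class and one value")
    val sorted = values.sorted.toIndexedSeq
    (1 to n).map { i =>
      // Integer ceiling of i * size / n: the 1-based rank that covers i/n of the data.
      val rank = ((i.toLong * sorted.length + n - 1) / n).toInt
      sorted(rank - 1)
    }
  }
}
```

With the values 1.0 through 10.0 and `n = 5`, the breaks fall at 2.0, 4.0, 6.0, 8.0, and 10.0.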

## Aggregate Statistics

The `agg_stats` function computes the same summary statistics as `tile_stats`, but aggregates them over the whole RasterFrame.

``````scala
scala> rf.select(agg_stats($"tile")).show()
+----------+-------------+------+-------+-----------------+------------------+
|data_cells|no_data_cells|   min|    max|             mean|          variance|
+----------+-------------+------+-------+-----------------+------------------+
|    387000|        71752|7209.0|39217.0|10160.48549870801|3315238.5311127007|
+----------+-------------+------+-------+-----------------+------------------+

``````
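One way to think about the aggregate is as a pairwise merge of per-tile summaries. The following plain-Scala sketch shows such a merge using the standard formula for combining means and variances of two groups. It is an illustration under assumptions, not how `agg_stats` is implemented: `AggStatsSketch`, `Stats`, and `merge` are hypothetical names, and the variance is assumed to be the population variance.

```scala
object AggStatsSketch {
  // Hypothetical per-tile summary; field names mirror the output columns above.
  final case class Stats(dataCells: Long, noDataCells: Long,
                         min: Double, max: Double, mean: Double, variance: Double)

  /** Combines two summaries as if their cells had been pooled (population variance). */
  def merge(a: Stats, b: Stats): Stats = {
    val n = a.dataCells + b.dataCells
    val noData = a.noDataCells + b.noDataCells
    // A summary without data cells contributes only its NoData count.
    if (a.dataCells == 0) b.copy(noDataCells = noData)
    else if (b.dataCells == 0) a.copy(noDataCells = noData)
    else {
      val delta = b.mean - a.mean
      val mean = a.mean + delta * b.dataCells / n
      // Sum of squared deviations of each part, plus a between-part correction term.
      val m2 = a.variance * a.dataCells + b.variance * b.dataCells +
        delta * delta * a.dataCells * b.dataCells / n
      Stats(n, noData, math.min(a.min, b.min), math.max(a.max, b.max), mean, m2 / n)
    }
  }
}
```

Given a non-empty collection of per-tile summaries, `summaries.reduce(AggStatsSketch.merge)` would produce a single summary for all of them. Merging this way avoids holding every cell value in memory at once.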

A more involved example: extract bin counts from a computed `Histogram`.

``````scala
scala> rf.select(agg_approx_histogram($"tile")).
     |   map(h => for(v <- h.labels) yield(v, h.itemCount(v))).
     |   select(explode($"value") as "counts").
     |   select("counts._1", "counts._2").
     |   toDF("value", "count").
     |   orderBy(desc("count")).
     |   show(10)
+------------------+-----+
|             value|count|
+------------------+-----+
| 7905.780878889613|59871|
|  9693.36822122893|37138|
|10731.770891323657|33770|
| 10076.43293835417|27512|
| 8365.393423741412|26915|
|11646.288154754428|23883|
| 11084.84999789323|23733|
| 9021.338606741572|22250|
|10385.199022093442|22088|
|11359.327558440293|20491|
+------------------+-----+
only showing top 10 rows

``````
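The shape of the result above, pairs of a bin label and its count ordered by descending count, can be reproduced on ordinary collections. The sketch below is a simplification: it assigns values to fixed-width bins labeled by their lower edge, whereas the approximate histogram above appears to choose its own, unevenly spaced labels. `BinCountSketch` and `binCounts` are hypothetical names.

```scala
object BinCountSketch {
  /** Counts values in fixed-width bins; returns (bin lower edge, count), largest count first. */
  def binCounts(values: Seq[Double], width: Double): Seq[(Double, Int)] = {
    require(width > 0, "bin width must be positive")
    values
      .groupBy(v => math.floor(v / width) * width) // lower edge of the bin holding v
      .map { case (edge, vs) => (edge, vs.size) }
      .toSeq
      .sortBy { case (_, count) => -count }
  }
}
```

For example, `BinCountSketch.binCounts(Seq(1.0, 2.0, 11.0, 12.0, 13.0, 25.0), 10.0)` gives `(10.0, 3)`, `(0.0, 2)`, `(20.0, 1)`.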