# “NoData” Handling

## What is NoData?

In raster operations, the preservation and correct processing of missing observations is very important. In most DataFrames and in scientific computing, the idea of missing data is expressed as a `null`

or `NaN`

value. However, a great deal of raster data is stored for space efficiency, which typically leads to use of integral values with a “sentinel” value designated to represent missing observations. This sentinel value varies across data products and bands. In a generic sense, it is usually called the “NoData” value.

RasterFrames provides a variety of functions to inspect and manage NoData within *tiles*.

## Cell Types

To understand how NoData is handled in RasterFrames, we first need to understand the different underlying types of data called cell types. RasterFrames cell types are GeoTrellis `CellType`

s, so the GeoTrellis documentation is a valuable resource on how these are defined.

The `CellType`

class from the `rf_types`

submodule allows us to create a representation of any valid cell type. There are convenience methods to create instances for a variety of basic types.

```
from pyrasterframes.rf_types import CellType
CellType.bool()
CellType.int8()
CellType.uint8()
CellType.int16()
CellType.uint16()
CellType.int32()
CellType.float32()
CellType.float64()
```

We can also inspect the cell type of a given *tile* or `proj_raster`

column.

```
cell_types = spark.read.raster('https://s22s-test-geotiffs.s3.amazonaws.com/luray_snp/B02.tif') \
.select(rf_cell_type('proj_raster')).distinct()
cell_types
```

rf_cell_type(proj_raster) |
---|

[uint16raw] |

### Understanding Cell Types and NoData

We can use the methods on the `CellType`

class to learn more about a specific cell type. Let’s consider the cell type of our sample data above.

```
ct = CellType('uint16raw')
ct, ct.is_floating_point(), ct.has_no_data()
```

```
(uint16raw, False, False)
```

We can see that for the above data source, there is no defined NoData value. This means that each value is interpreted as a valid observation. Often such data is meant to be combined with another band indicating the quality of observations at each location. The lack of NoData is indicated by the `raw`

at the end of the type name. Consider for contrast the `uint16`

cell type.

```
from pyrasterframes.rf_types import CellType
ct = CellType('uint16')
ct, ct.is_floating_point(), ct.has_no_data(), ct.no_data_value()
```

```
(uint16, False, True, 0)
```

In this case, the minimum value of 0 is designated as the NoData value. For integral-valued cell types, the NoData is typically zero, the maximum, or the minimum value for the underlying data type. The NoData value can also be a user-defined value. In that case the value is designated with `ud`

.

```
CellType.uint16().with_no_data_value(99).cell_type_name
```

```
'uint16ud99'
```

Floating point types have `NaN`

as the NoData value by default. However, a user-defined NoData can be set.

```
print(CellType.float32().no_data_value())
print(CellType.float32().with_no_data_value(-99.9).no_data_value())
```

```
nan
-99.9
```

## NoData and Local Arithmetic

Let’s now explore how the presence of NoData affects local map algebra operations. To demonstrate the behavior, lets create two *tiles*. One *tile* will have values of 0 and 1, and the other will have values of just 0.

```
tile_size = 100
x = np.zeros((tile_size, tile_size), dtype='int16')
x[:,tile_size//2:] = 1
x = Tile(x)
y = Tile(np.zeros((tile_size, tile_size), dtype='int16'))
rf = spark.createDataFrame([Row(x=x, y=y)])
print('x')
display(x)
```

```
x
```

```
print('y')
display(y)
```

```
y
/anaconda3/lib/python3.7/site-packages/matplotlib/image.py:405: UserWarning: Warning: converting a masked element to nan.
dv = (np.float64(self.norm.vmax) -
/anaconda3/lib/python3.7/site-packages/matplotlib/image.py:406: UserWarning: Warning: converting a masked element to nan.
np.float64(self.norm.vmin))
/anaconda3/lib/python3.7/site-packages/matplotlib/image.py:413: UserWarning: Warning: converting a masked element to nan.
a_min = np.float64(newmin)
/anaconda3/lib/python3.7/site-packages/matplotlib/image.py:418: UserWarning: Warning: converting a masked element to nan.
a_max = np.float64(newmax)
<string>:6: UserWarning: Warning: converting a masked element to nan.
/anaconda3/lib/python3.7/site-packages/numpy/ma/core.py:711: UserWarning: Warning: converting a masked element to nan.
data = np.array(a, copy=False, subok=subok)
```

Now, let’s create a new column from `x`

with the value of 1 changed to NoData. Then, we will add this new column with NoData to the `y`

column. As shown below, the result of the sum also has NoData (represented in white). In general for local algebra operations, Data + NoData = NoData.

```
masked_rf = rf.withColumn('x_nd', rf_mask_by_value('x', 'x', lit(1)) )
masked_rf = masked_rf.withColumn('x_nd_y_sum', rf_local_add('x_nd', 'y'))
row = masked_rf.collect()[0]
print('x with NoData')
display(row.x_nd)
```

```
x with NoData
```

```
print('x with NoData plus y')
display(row.x_nd_y_sum)
```

```
x with NoData plus y
```

To see more information about possible operations on *tile* columns, see the local map algebra page and function reference.

## Changing a Tile’s NoData Values

One way to mask a *tile* is to make a new *tile* with a user defined NoData value. We will explore this method below. First, lets create a DataFrame from a *tile* with values of 0, 1, 2, and 3. We will use numpy to create a 100x100 *tile* with vertical bands containing values 0, 1, 2, and 3.

```
tile_size = 100
x = np.zeros((tile_size, tile_size), dtype='int16')
# setting the values of the columns
for i in range(4):
x[:, i*tile_size//4:(i+1)*tile_size//4] = i
x = Tile(x)
rf = spark.createDataFrame([Row(tile=x)])
display(x)
```

First, we mask the value of 1 by making a new *tile* column with the user defined cell type ‘uint16ud1’. Then, we mask out the value of two by making a *tile* column with the cell type ‘uint16ud2’.

```
def get_nodata_ct(nd_val):
return CellType('uint16').with_no_data_value(nd_val)
masked_rf = rf.withColumn('tile_nd_1',
rf_convert_cell_type('tile', get_nodata_ct(1))) \
.withColumn('tile_nd_2',
rf_convert_cell_type('tile_nd_1', get_nodata_ct(2))) \
```

```
collected = masked_rf.collect()
```

Let’s look at the new *tiles* we created. The *tile* named `tile_nd_1`

has the 1 values masked out as expected.

```
display(collected[0].tile_nd_1)
```

And the *tile* named `tile_nd_2`

has the values of 1 and 2 masked out. This is because we created the *tile* by setting a new user defined NoData value to `tile_nd_1`

, and the values previously masked out in `tile_nd_1`

stayed masked when creating `tile_nd_2`

.

```
display(collected[0].tile_nd_2)
```

## Combining Tiles with Different Data Types

RasterFrames supports having *tile* columns with different cell types in a single DataFrame. It is important to understand how these different cell types interact.

Let’s first create a RasterFrame that has columns of `float`

and `int`

cell type.

```
x = Tile((np.ones((100, 100))*2), CellType.float64())
y = Tile((np.ones((100, 100))*3), CellType.int32())
rf = spark.createDataFrame([Row(x=x, y=y)])
cell_types = rf.select(rf_cell_type('x'), rf_cell_type('y')).distinct()
cell_types
```

rf_cell_type(x) | rf_cell_type(y) |
---|---|

[float64] | [int32] |

When performing a local operation between *tile* columns with cell types `int`

and `float`

, the resulting *tile* cell type will be `float`

. In local algebra over two *tiles* of different “sized” cell types, the resulting cell type will be the larger of the two input *tiles’* cell types.

```
sums = rf.select(
rf_cell_type('x'),
rf_cell_type('y'),
rf_cell_type(rf_local_add('x', 'y')).alias('xy_sum'),
)
sums
```

rf_cell_type(x) | rf_cell_type(y) | xy_sum |
---|---|---|

[float64] | [int32] | [float64] |

Combining *tile* columns of different cell types gets a little trickier when user defined NoData cell types are involved. Let’s create two *tile* columns: one with a NoData value of 1, and one with a NoData value of 2 (using our previously defined `get_nodata_ct`

function).

```
x_nd_1 = Tile((np.ones((100, 100))*3), get_nodata_ct(1))
x_nd_2 = Tile((np.ones((100, 100))*3), get_nodata_ct(2))
rf_nd = spark.createDataFrame([Row(x_nd_1=x_nd_1, x_nd_2=x_nd_2)])
```

Let’s try adding the *tile* columns with different NoData values. When there is an inconsistent NoData value in the two columns, the NoData value of the right-hand side of the sum is kept. In this case, this means the result has a NoData value of 1.

```
rf_nd_sum = rf_nd.withColumn('x_nd_sum', rf_local_add('x_nd_2', 'x_nd_1'))
cell_types = rf_nd_sum.select(rf_cell_type('x_nd_sum')).distinct()
cell_types
```

rf_cell_type(x_nd_sum) |
---|

[uint16ud1] |

Reversing the order of the sum changes the NoData value of the resulting column to 2.

```
rf_nd_sum = rf_nd.withColumn('x_nd_sum', rf_local_add('x_nd_1', 'x_nd_2'))
cell_types = rf_nd_sum.select(rf_cell_type('x_nd_sum')).distinct()
cell_types
```

rf_cell_type(x_nd_sum) |
---|

[uint16ud2] |

## NoData Values in Aggregation

Let’s use the same *tile* as before to demonstrate how NoData values affect *tile* aggregations.

```
tile_size = 100
x = np.zeros((tile_size, tile_size))
for i in range(4):
x[:, i*tile_size//4:(i+1)*tile_size//4] = i
x = Tile(x, CellType.int16())
rf = spark.createDataFrame([Row(tile=x)])
display(x)
```

First we create the two new masked *tile* columns as before. One with only the value of 1 masked, and the other with and values of 1 and 2 masked.

```
masked_rf = rf.withColumn('tile_nd_1',
rf_convert_cell_type('tile', get_nodata_ct(1))) \
.withColumn('tile_nd_2',
rf_convert_cell_type('tile_nd_1', get_nodata_ct(2)))
```

The results of `rf_tile_sum`

vary on the *tiles* that were masked. This is because any cells with NoData values are ignored in the aggregation. Note that `tile_nd_2`

has the lowest sum, since it has the fewest amount of data cells.

```
sums = masked_rf.select(rf_tile_sum('tile'), rf_tile_sum('tile_nd_1'), rf_tile_sum('tile_nd_2'))
sums
```

rf_tile_sum(tile) | rf_tile_sum(tile_nd_1) | rf_tile_sum(tile_nd_2) |
---|---|---|

15000.0 | 12500.0 | 7500.0 |