RasterFrames® provides a DataFrame-centric view over arbitrary Earth-observation (EO) data, enabling spatiotemporal queries, map algebra raster operations, and compatibility with the ecosystem of Apache Spark ML algorithms. It provides APIs in Python, SQL, and Scala, and can scale from a laptop computer to a large distributed cluster, enabling global analysis with satellite imagery in a wholly new, flexible, and convenient way.
We have a millennia-long history of organizing information in tabular form. Typically, rows represent independent events or observations, and columns represent attributes and measurements from the observations. The forms have evolved, from hand-written agricultural records and transaction ledgers, to the advent of spreadsheets on the personal computer, and on to the creation of the DataFrame data structure as found in R Data Frames and Python Pandas. The table-oriented data structure remains a common and critical component of organizing data across industries, and—most importantly—it is the mental model employed by data scientists across diverse forms of modeling and analysis.
The evolution of the DataFrame form has continued with Spark SQL, which brings DataFrames to the big data distributed compute space. Through several novel innovations, Spark SQL enables data scientists to work with DataFrames too large for the memory of a single computer. As suggested by the name, these DataFrames are manipulatable via standard SQL, as well as the more general-purpose programming languages Python, R, Java, and Scala.
RasterFrames, an incubating Eclipse Foundation LocationTech project, brings together EO data access, cloud computing, and DataFrame-based data science. The recent explosion of EO data from public and private satellite operators presents both a huge opportunity and a huge challenge to the data analysis community. It is Big Data in the truest sense, and its footprint is rapidly getting bigger. According to a World Bank document on assets for post-disaster situation awareness[^1]:
Of the 1,738 operational satellites currently orbiting the earth (as of 9/17), 596 are earth observation satellites and 477 of these are non-military assets (i.e. available to civil society including commercial entities and governments for earth observation, according to the Union of Concerned Scientists). This number is expected to increase significantly over the next ten years. The 200 or so planned remote sensing satellites have a value of over 27 billion USD (Forecast International). This estimate does not include the burgeoning fleets of smallsats as well as micro, nano and even smaller satellites… All this enthusiasm has, not unexpectedly, led to a veritable fire-hose of remotely sensed data which is becoming difficult to navigate even for seasoned experts.
By using DataFrames as the core cognitive and compute data model for processing EO data, RasterFrames is able to deliver sophisticated computational and algorithmic capabilities in a tabular form that is familiar and accessible to the general computing public. Because it is built on Apache Spark, solutions prototyped on a laptop can be easily scaled to run on cluster and cloud compute resources. Apache Spark also provides integration between its DataFrame libraries and machine learning, with which RasterFrames is fully compatible.
RasterFrames introduces georectified raster imagery to Spark SQL. It quantizes scenes into chunks called tiles. Each tile contains a 2-D matrix of cell or pixel values along with information on how to numerically interpret those cells.
As shown in the figure below, a “RasterFrame” is a Spark DataFrame with one or more columns of type tile. A tile column typically represents a single frequency band of sensor data, such as “blue” or “near infrared”, but can also be quality assurance information, land classification assignments, or any other raster spatial data. Along with tile columns there is typically an
extent specifying the geographic location of the data, the map projection of that geometry (
crs), and a
timestamp column representing the acquisition time. These columns can all be used in the
WHERE clause when filtering.
|2019-02-28||[+proj=sinu +lon_0=0.0 +x_0=0.0 +y_0=0.0 +a=6371007.181 +b=6371007.181 +units=m ]||[-6834789.194217826, 163086.07621782666, -6716181.138786679, 281694.1316489733]|
|2019-02-28||[+proj=sinu +lon_0=0.0 +x_0=0.0 +y_0=0.0 +a=6371007.181 +b=6371007.181 +units=m ]||[-7427829.47137356, 281694.1316489733, -7309221.415942413, 400302.18708011997]|
|2019-02-28||[+proj=sinu +lon_0=0.0 +x_0=0.0 +y_0=0.0 +a=6371007.181 +b=6371007.181 +units=m ]||[-7309221.415942414, 163086.07621782666, -7190613.360511267, 281694.1316489733]|
|2019-02-28||[+proj=sinu +lon_0=0.0 +x_0=0.0 +y_0=0.0 +a=6371007.181 +b=6371007.181 +units=m ]||[-7665045.582235853, 637518.2979424134, -7546437.526804706, 756126.35337356]|
RasterFrames also includes support for working with vector data, such as GeoJSON. RasterFrames vector data operations let you filter with geospatial relationships like contains or intersects, mask cells, convert vectors to rasters, and more.
Raster data can be read from a number of sources. Through the flexible Spark SQL DataSource API, RasterFrames can be constructed from collections of imagery (including Cloud Optimized GeoTIFFs or COGS), GeoTrellis Layers, and from catalogs of large datasets like Landsat 8 and MODIS data sets on the AWS Public Data Set (PDS).
[^1]: Demystifying Satellite Assets for Post-Disaster Situation Awareness. World Bank via OpenDRI.org. Accessed November 28, 2018.