The related repository is https://github.com/biggis-project/biggis-infrastructure
The BigGIS architecture, as depicted in the figure above, consists of several layers, which are briefly described in the following.
Pipelines in BigGIS are modelled using StreamPipes, which allows users to transform and analyze data streams from sensor networks and other sources with little programming effort. We extend StreamPipes to support geo-spatial data analytics, e.g. on raster data.
To process and analyze geo-spatial data, BigGIS relies on multiple big data analytics frameworks. While Apache Flink is mainly used for sensor data, some application use cases also need to process geo-spatial raster or vector data in batches. Thus, we integrate Apache Spark together with GeoTrellis into our architectural design to process geographic data. GeoTrellis provides a number of operations to manipulate raster data, including map algebra operations. In addition, we provide data science notebooks (RStudio, Jupyter) for exploratory data analysis.
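To illustrate the idea behind map algebra, which GeoTrellis implements for raster tiles, the following self-contained Python sketch applies a "local" operation, i.e. one that combines two rasters cell by cell. The rasters and the `local_op` helper are illustrative only and do not reflect the GeoTrellis API:

```python
# Conceptual sketch of a "local" map-algebra operation:
# two rasters of equal shape are combined cell by cell.
# This is an illustration, not the GeoTrellis API.

def local_op(raster_a, raster_b, fn):
    """Apply fn to corresponding cells of two equally sized rasters."""
    return [
        [fn(a, b) for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(raster_a, raster_b)
    ]

# Example: cell-wise addition of two 2x2 tiles
tile1 = [[1, 2], [3, 4]]
tile2 = [[10, 20], [30, 40]]
print(local_op(tile1, tile2, lambda a, b: a + b))  # [[11, 22], [33, 44]]
```

Other map algebra classes (focal, zonal, global) generalize this pattern to neighborhoods, zones, or the whole raster.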
Middleware & Connectors
Apache Kafka is used as the primary message broker. It handles the communication between the data processing elements, i.e. nodes, within the analytics pipelines. In addition, ActiveMQ is available as a secondary message broker; currently, its main purpose is to provide an endpoint for the WebSocket connections required by the real-time dashboard of the StreamPipes UI.
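Conceptually, each processing element consumes from one topic and publishes to another, so elements stay decoupled and can be rewired into different pipelines. The following minimal in-memory sketch illustrates this topic-based publish/subscribe pattern; `MiniBroker` is a stand-in for the broker and not the actual Kafka client API:

```python
from collections import defaultdict

# In-memory stand-in for a message broker:
# topics map to lists of subscriber callbacks.
class MiniBroker:
    def __init__(self):
        self.subscribers = defaultdict(list)

    def subscribe(self, topic, callback):
        self.subscribers[topic].append(callback)

    def publish(self, topic, message):
        for callback in self.subscribers[topic]:
            callback(message)

# Two pipeline elements chained via topics, analogous to
# nodes in an analytics pipeline communicating over Kafka.
broker = MiniBroker()
results = []
broker.subscribe("raw-sensor", lambda m: broker.publish("enriched", {**m, "ok": True}))
broker.subscribe("enriched", results.append)
broker.publish("raw-sensor", {"temp": 21.5})
print(results)  # [{'temp': 21.5, 'ok': True}]
```

With a real broker such as Kafka, the topics are durable and partitioned, so consumers can scale out and replay messages, but the decoupling idea is the same.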
Internally, BigGIS uses a variety of storage backends for designated purposes.
- HDFS for the GeoTrellis catalog and Spark jobs.
- Exasol for fast access to stored data.
- CouchDB for pipelines, users and visualizations created in the dashboard.
- RDF4J (formerly Sesame) as a semantic backend of StreamPipes.
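For instance, a pipeline created in the dashboard can be persisted in CouchDB as a plain JSON document. The sketch below shows what such a document might look like; the field names and values are illustrative only, not the actual StreamPipes schema:

```python
import json

# Hypothetical pipeline document as it might be stored in CouchDB;
# field names are illustrative, not the actual StreamPipes schema.
pipeline_doc = {
    "_id": "pipeline-0001",
    "name": "Heat map from temperature sensors",
    "elements": [
        {"type": "source", "topic": "raw-sensor"},
        {"type": "processor", "operation": "aggregate"},
        {"type": "sink", "target": "dashboard"},
    ],
}

# CouchDB stores documents as JSON, so serialization round-trips cleanly.
serialized = json.dumps(pipeline_doc)
restored = json.loads(serialized)
print(restored["elements"][1]["operation"])  # aggregate
```

Storing pipelines as schemaless JSON documents fits CouchDB well, since pipeline definitions vary in structure depending on the elements they contain.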
Running these containers in a distributed manner requires a wide variety of technologies that must be integrated and managed throughout their lifecycle. To simplify deployment, our infrastructure is designed to run on Rancher as our container management platform. Rancher enables organizations to run and manage Docker and Kubernetes in production, providing four major components:
- Infrastructure Orchestration
- Container Orchestration and Scheduling
- Application Catalog
- Enterprise-Grade Control
That way, a distributed Rancher cluster can be set up in a couple of minutes (see the official Rancher documentation for more information).