Welcome to Stratosphere

Stratosphere is a DFG-funded research project investigating new paradigms for scalable, complex analytics on massively-parallel data sets. The Stratosphere System is available under the Apache License, Version 2.0. Feel free to download it, try it out and give feedback or ask for help on our GitHub page.

Layered Architecture

The Stratosphere System is an open-source cluster/cloud computing framework for Big Data analytics. It comprises a rich stack of components with different programming abstractions for complex analytics tasks:

  1. An extensible higher-level language (Meteor) to quickly compose queries for common and recurring use cases. Internally, Meteor scripts are translated into Sopremo algebra and optimized.
  2. A parallel programming model (PACT, an extension of MapReduce) to run user-defined operations. PACT is based on second-order functions and features an optimizer that chooses parallelization strategies.
  3. An efficient massively parallel runtime (Nephele) for fault-tolerant execution of acyclic data flows.

Meteor Language

Meteor is a textual higher-level language for rapid composition of queries. It uses a JSON-like data model, and its core features typical operations for the analysis and transformation of (semi-)structured nested data.
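
To make the idea of a JSON-like data model more concrete, here is a minimal sketch in plain Python. It is not Meteor syntax and does not use any Stratosphere API; the sample records, the field names, and the helper get_path are all hypothetical. It only illustrates the kind of filter and transform steps that are typical for (semi-)structured nested data, where fields may be nested or missing.

```python
import json

# Hypothetical sample data: nested, and not every record has every field.
records = json.loads("""
[
  {"name": "Ada", "address": {"city": "Berlin"}, "orders": [{"total": 30}, {"total": 12}]},
  {"name": "Bob", "address": {"city": "Paris"}, "orders": []},
  {"name": "Cy", "orders": [{"total": 7}]}
]
""")


def get_path(record, path, default=None):
    """Follow a list of keys into nested dicts, tolerating missing fields."""
    current = record
    for key in path:
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current


# Filter: keep only records that have at least one order.
with_orders = [r for r in records if r.get("orders")]

# Transform: project each nested record onto a flat summary.
summaries = [
    {
        "name": r["name"],
        "city": get_path(r, ["address", "city"], "unknown"),
        "spent": sum(order["total"] for order in r["orders"]),
    }
    for r in with_orders
]
```

The record without an address still passes through the transformation, which is the main practical difference from a fixed relational schema.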

The Meteor language is highly extensible and supports the addition of custom operations that integrate seamlessly with the syntax, which makes it possible to create problem-specific domain languages. Meteor queries are translated into Sopremo algebra, optimized, and transformed into PACT programs by the compiler.

PACT Programming Model

The PACT programming model is an extension of the well-known MapReduce programming model. PACT features a richer set of second-order functions (Map/Reduce/Match/CoGroup/Cross) that can be flexibly composed as DAGs into programs. PACT programs use a generic schema-free tuple data model to ease the composition of more complex programs.
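
The five second-order functions differ in how they hand records to the user-defined function. The sketch below models that in single-process Python over lists of (key, value) pairs. It is not the PACT API: the function names (pact_map, pact_reduce, and so on), the helper group_by_key, and the calling conventions are assumptions made for illustration, and the reading of Match as an equi-join and CoGroup as a per-key grouping of both inputs is a simplified interpretation.

```python
from collections import defaultdict
from itertools import product


def group_by_key(records):
    """Group (key, value) pairs into {key: [values]} in first-seen key order."""
    groups = defaultdict(list)
    for key, value in records:
        groups[key].append(value)
    return groups


def pact_map(records, udf):
    # Map: the user function sees one record at a time.
    return [out for record in records for out in udf(record)]


def pact_reduce(records, udf):
    # Reduce: the user function sees all values that share a key.
    return [out
            for key, values in group_by_key(records).items()
            for out in udf(key, values)]


def pact_match(left, right, udf):
    # Match: the user function sees each pair of records with equal keys.
    right_groups = group_by_key(right)
    return [out
            for key, left_value in left
            for right_value in right_groups.get(key, [])
            for out in udf(key, left_value, right_value)]


def pact_cogroup(left, right, udf):
    # CoGroup: the user function sees, per key, the value lists of both
    # inputs; either list may be empty.
    left_groups, right_groups = group_by_key(left), group_by_key(right)
    keys = list(left_groups) + [k for k in right_groups if k not in left_groups]
    return [out
            for key in keys
            for out in udf(key, left_groups.get(key, []), right_groups.get(key, []))]


def pact_cross(left, right, udf):
    # Cross: the user function sees every element of the Cartesian product.
    return [out for l, r in product(left, right) for out in udf(l, r)]


# Hypothetical inputs.
orders = [("alice", 30), ("bob", 5), ("alice", 12)]
cities = [("alice", "Berlin"), ("carol", "Paris")]

totals = pact_reduce(orders, lambda key, values: [(key, sum(values))])
joined = pact_match(orders, cities, lambda key, amount, city: [(key, amount, city)])
counts = pact_cogroup(orders, cities, lambda key, l, r: [(key, len(l), len(r))])
```

Because each user function returns a list, a call can emit zero, one, or many records, and the output of one function can feed another, which is how such calls compose into a DAG.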

PACT programs are parallelized by a cost-based compiler that picks data shipping and local processing strategies so as to minimize network and disk I/O. The compiler incorporates user code properties (when possible) to find better plans; it thus alleviates the need for many of the manual optimizations (such as job merging) that are typically required to create efficient MapReduce programs. Compiled PACT programs are executed by the Nephele Data Flow Engine.
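
As a rough illustration of what a cost-based choice between data shipping strategies can look like, the following toy sketch compares two common ways of bringing together two inputs that must be joined on a key. It is not the Stratosphere compiler: the strategy names ("broadcast", "repartition"), the cost formulas, and the function names are assumptions chosen for the example, and the model counts only bytes sent over the network while ignoring disk I/O and local processing strategies.

```python
def broadcast_cost(left_bytes, right_bytes, nodes):
    # Assumed model: replicate the smaller input to every other node,
    # and leave the larger input where it is.
    return min(left_bytes, right_bytes) * (nodes - 1)


def repartition_cost(left_bytes, right_bytes, nodes):
    # Assumed model: hash-partition both inputs by key; with uniform
    # hashing, a fraction (nodes - 1) / nodes of each input moves.
    return (left_bytes + right_bytes) * (nodes - 1) / nodes


def choose_shipping_strategy(left_bytes, right_bytes, nodes):
    """Return (strategy name, estimated network bytes) with the lowest cost."""
    candidates = {
        "broadcast": broadcast_cost(left_bytes, right_bytes, nodes),
        "repartition": repartition_cost(left_bytes, right_bytes, nodes),
    }
    best = min(candidates, key=candidates.get)
    return best, candidates[best]


# A large input joined with a small one on 10 nodes favors broadcasting.
small_side = choose_shipping_strategy(10**9, 10**6, nodes=10)

# Two inputs of similar size favor repartitioning.
equal_sides = choose_shipping_strategy(10**9, 10**9, nodes=10)
```

Even in this simplified model the better strategy depends on the input sizes, which is why the decision is left to a compiler with size estimates rather than fixed in the program.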

Nephele Data Flow Engine

Nephele is a massively parallel data flow engine that handles resource management, work scheduling, communication, and fault tolerance. Nephele can run on top of a cluster and manage the resources itself, or connect directly to an IaaS cloud service to allocate computing resources on demand.