Stratosphere is a DFG-funded research project investigating new paradigms for scalable, complex analytics on massively parallel data sets. The Stratosphere System is available under the Apache License, Version 2.0. Feel free to download it, try it out, and give feedback or ask for help on our GitHub page.
The Stratosphere System is an open-source cluster/cloud computing framework for Big Data analytics. It comprises a rich stack of components with different programming abstractions for complex analytics tasks:
Meteor is a textual higher-level language for rapid composition of queries. It uses a JSON-like data model and features, at its core, typical operations for the analysis and transformation of (semi-)structured nested data.
The Meteor language is highly extensible and supports the addition of custom operations that integrate fluently with the syntax, making it possible to create problem-specific domain languages. Meteor queries are translated into the Sopremo algebra, optimized, and transformed into PACT programs by the compiler.
The PACT programming model is an extension of the well-known MapReduce programming model. PACT features a richer set of second-order functions (Map/Reduce/Match/CoGroup/Cross) that can be flexibly composed as DAGs into programs. PACT programs use a generic, schema-free tuple data model to ease the composition of more complex programs.
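To make the five second-order functions more concrete, here is a minimal single-machine sketch in plain Python. It is not the PACT API: the function names (pact_map, pact_reduce, and so on), the (key, value) record layout, and the sample data are all assumptions made for illustration. It reflects one common reading of the semantics: Map calls the user function once per record, Reduce once per key group, Match once per pair of records with equal keys, CoGroup once per key with both groups, and Cross once per pair of the Cartesian product.

```python
from collections import defaultdict
from itertools import product


def group_by_key(records):
    """Collect the values of (key, value) records into one list per key."""
    groups = defaultdict(list)
    for key, value in records:
        groups[key].append(value)
    return groups


def pact_map(udf, records):
    # The user function sees one record at a time and may emit 0..n records.
    return [out for rec in records for out in udf(rec)]


def pact_reduce(udf, records):
    # The user function sees all values that share a key.
    groups = group_by_key(records)
    return [out for key in sorted(groups) for out in udf(key, groups[key])]


def pact_match(udf, left, right):
    # The user function sees every pair of values with equal keys (equi-join).
    lg, rg = group_by_key(left), group_by_key(right)
    return [out
            for key in sorted(lg.keys() & rg.keys())
            for lv, rv in product(lg[key], rg[key])
            for out in udf(key, lv, rv)]


def pact_cogroup(udf, left, right):
    # The user function sees both groups of a key; either group may be empty.
    lg, rg = group_by_key(left), group_by_key(right)
    return [out
            for key in sorted(lg.keys() | rg.keys())
            for out in udf(key, lg.get(key, []), rg.get(key, []))]


def pact_cross(udf, left, right):
    # The user function sees every pair of the Cartesian product.
    return [out for l, r in product(left, right) for out in udf(l, r)]


# Hypothetical sample data: (customer_id, amount) and (customer_id, name).
orders = [(1, 30), (2, 5), (1, 12)]
customers = [(1, "Ada"), (3, "Bob")]

# A small three-step data flow: filter, aggregate per customer, join with names.
large_orders = pact_map(lambda rec: [rec] if rec[1] >= 10 else [], orders)
totals = pact_reduce(lambda key, amounts: [(key, sum(amounts))], large_orders)
named_totals = pact_match(lambda key, total, name: [(name, total)], totals, customers)
# named_totals == [("Ada", 42)]
```

The three calls at the end form a simple chain; because each function only takes and returns lists of records, the same building blocks can be wired into a DAG with several inputs and branches.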
PACT programs are parallelized by a cost-based compiler that picks data shipping and local processing strategies such that network and disk I/O are minimized. The compiler incorporates user code properties (when possible) to find better plans; it thus alleviates the need for many manual optimizations (such as job merging) that one typically does to create efficient MapReduce programs. Compiled PACT programs are executed by the Nephele Data Flow Engine.
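As a rough illustration of what choosing a data shipping strategy by cost can mean, the following Python sketch compares three ways of shipping the inputs of a two-input operator and picks the one with the least estimated network traffic. The cost formulas, the strategy names, and the function choose_shipping_strategy are hypothetical simplifications for this example; they do not describe the cost model actually used by the Stratosphere compiler.

```python
def choose_shipping_strategy(left_bytes, right_bytes, num_nodes):
    """Pick the strategy with the lowest estimated network traffic.

    Assumed toy cost model (bytes sent over the network):
      - repartition both inputs by key: each record stays local with
        probability 1/num_nodes, so (num_nodes - 1)/num_nodes of all data moves
      - broadcast one input: that input is copied to every other node
    """
    costs = {
        "repartition_both": (left_bytes + right_bytes) * (num_nodes - 1) / num_nodes,
        "broadcast_left": left_bytes * (num_nodes - 1),
        "broadcast_right": right_bytes * (num_nodes - 1),
    }
    best = min(costs, key=costs.get)
    return best, costs[best]


# A large input joined with a tiny one: broadcasting the tiny side is cheapest.
small_side = choose_shipping_strategy(10**9, 10**3, num_nodes=10)

# Two inputs of equal size: repartitioning both is cheaper than any broadcast.
equal_sides = choose_shipping_strategy(10**9, 10**9, num_nodes=10)
```

A real optimizer would also weigh local strategies (for example sort-based versus hash-based processing) and disk I/O; this sketch only covers the network side of the decision.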
Nephele is a massively parallel data flow engine dealing with resource management, work scheduling, communication, and fault tolerance. Nephele can run on top of a cluster and govern the resources itself, or directly connect to an IaaS cloud service to allocate computing resources on demand.