Basics¶

Ophidia is a research project on big data analytics for eScience. It provides a framework for parallel I/O and data analysis, an array-based storage model and a hierarchical storage organization to partition and distribute multidimensional scientific datasets. Since the storage model does not rely on any scientific dataset file format, it can be exploited in different scientific domains and with very heterogeneous sets of data.

The physical design of the Ophidia storage model combines a key-value store approach built on top of a relational data store to manage multidimensional scientific data as data cubes. A data cube allows data to be modeled and viewed in multiple dimensions. It relies on the dimensions and facts concepts. The dimensions represent the entities with respect to which it is desired to keep records and perform analysis. The facts are numerical measures related to a central theme, by which it is desired to analyze relationships between dimensions.

In a typical scientific use case concerning the output of a global model simulation, latitude, longitude, depth, and time are dimensions; air pressure or sea surface temperature are facts.

Measure values are organized into arrays; each array is identified by a key obtained from the combination of some dimension values (called explicit dimensions), while the other dimensions (called implicit dimensions) identify a value through their position. The name explicit refers to the fact that the dimension identifier is explicitly stored/reported in the key part of the key-value row. For an implicit dimension, its identifiers are not stored in the key-value row at all. All the identifiers of an implicit dimension are implicitly mapped (by position) onto the measures stored in the value part of the key-value row. It is worth mentioning here that the values along the implicit dimensions are physically stored in contiguous locations. This aspect has, of course, some performance implications.

Coming back to the scientific use case mentioned before, a possible mapping could be the following:

time as implicit dimension;
latitude, longitude, depth could as explicit dimensions.

This mapping would allow a very efficient time-series based analysis through Ophidia.

From a physical point of view, a data cube is horizontally partitioned into several blocks (called fragments) that are distributed across multiple I/O nodes. Each I/O node hosts a set of I/O servers optimized to manage n-dimensional arrays. Theses I/O server manage a set of databases consisting of one or more data cube fragments.

Analytics tasks on data cubes (i.e. on all fragments associated to a data cube) are performed by the Ophidia analytics framework. It is responsible for atomically processing and manipulating data cubes, by providing a common way to run distributive tasks (operators) on large set of fragments.

Each operator in Ophidia:

represents a plugin of the framework,
implements (some/all of) the seven interfaces provided by the framework API,
is dynamically loaded at runtime by the framework to perform a specific data analytics task,
can be sequential, parallel or distributed,
can be data- metadata- or system- oriented depending on its main focus,
does not manage any general information related to monitoring, logging and bookkeeping.

Some examples of Ophidia operators are:

datacube sub-setting (slicing and dicing),
datacube aggregation,
datacube duplication,
datacube pivoting,
NetCDF file import and export.

Other operators can also handle more than one data cube at a time allowing intercomparison. More examples are available in the operators manual section.

In addition, Ophidia comes with an extensive set of primitives to operate on n-dimensional arrays (i.e. on the arrays contained in fragments). Currently available array-based functions allow data sub-setting, data aggregation (i.e. max, min, avg), array concatenation, algebraic expressions, predicate evaluation and compression routines (i.e. zlib). Core functions of well-known numerical libraries (e.g. GSL) have been included into the primitives. A complete reference is available in the primitives manual section.

In some scientific use cases, the management of processing chains or workflows with tens or hundreds of analytics operators can be a real challenge. Ophidia is able to handle workflow (DAG) of analytics operators through its server interface. Ophidia implements both inter- and intra- parallelism strategies to address performance. More detailed workflow documentation is also available in the workflow manual section.

Both single operators and complex workflow can run through the Ophidia terminal, the client application provided with the Ophidia analytics framework. APIs are also available for developers, both in C and Python.

The Ophidia code is available on GitHub under GPLv3 license at https://github.com/OphidiaBigData.

Tutorials are also available on YouTube for training purposes here. You can follow us on Twitter @OphidiaBigData.

If you want to join the project, develop some contributions, suggest new features, or start a collaboration with the Ophidia Team, please contact us.

Your feedback is very welcome!