Data model

In Ophidia, data are multidimensional and organized as cubes (or datacubes).

A cube consists of a measure and several dimensions: the measure corresponds to the numerical values that can be analyzed over the available dimensions. For instance, a measure could be temperature, a dimension could be time. In this case the cube includes the values of temperature for each time instant in a given period.

Storage model

The internal storage model is an evolution of the classic star schema adopted for data warehouses (the Dimensional Fact Model). Unlike the star schema, Ophidia can represent data as binary arrays, and the platform supports array-based data types. Most primitives process binary arrays and return a binary array as output. In addition, a key mapping over the set of foreign keys is adopted to save space: the fact table includes a single key column in place of many foreign-key columns, one for each dimension table.
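The key mapping can be pictured with a small sketch. The source does not specify how Ophidia computes the key, so the row-major encoding below is only an assumption made for illustration; the names encode_key, decode_key and the dimension sizes are hypothetical.

```python
from typing import Sequence, Tuple


def encode_key(indices: Sequence[int], sizes: Sequence[int]) -> int:
    """Collapse one index per dimension into a single integer key.

    Assumed row-major encoding, not necessarily the mapping used by Ophidia.
    """
    key = 0
    for index, size in zip(indices, sizes):
        if not 0 <= index < size:
            raise ValueError("index out of range for its dimension")
        key = key * size + index
    return key


def decode_key(key: int, sizes: Sequence[int]) -> Tuple[int, ...]:
    """Recover the per-dimension indices from a single key."""
    indices = []
    for size in reversed(sizes):
        key, index = divmod(key, size)
        indices.append(index)
    return tuple(reversed(indices))


# Hypothetical cube with 4 latitudes and 8 longitudes.
sizes = (4, 8)
key = encode_key((2, 5), sizes)   # one column instead of two foreign keys
indices = decode_key(key, sizes)  # back to (latitude index, longitude index)
```

With a mapping of this kind, a fact table row needs only one key column regardless of how many dimensions the key stands for.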

Ophidia assumes that the set of dimensions is divided into two parts: explicit dimensions and implicit dimensions. The former are handled as in a star schema; the latter refer to binary arrays.

In simple terms,

implicit dimensions are used as indexes within a binary array;
explicit dimensions identify each binary array.

Ophidia lets the user choose which dimensions of each cube are implicit and which are explicit, and provides some operators to change the data structure (see the commands below).

Although Ophidia is able to handle cubes with many implicit dimensions, the platform is optimized to process data with only one implicit dimension (see the primitives). In this case the dimension refers to the variability of data along the binary array.

As a rule of thumb, dimension types (explicit or implicit) should be chosen based on the workflow to be applied to the input datasets: operators working on arrays give the best performance, and advanced features are provided for array manipulation.

For instance, if you wish to analyse a number of time series, one for each cell of the spatial domain, a reasonable implicit dimension would be time, whereas explicit dimensions (e.g. latitude and longitude) would represent the spatial domain.
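The time-series layout just described can be mimicked in plain Python. This is only a conceptual sketch: the dictionary stands in for the fact table, the lists stand in for binary arrays, and the dimension sizes and values are made up.

```python
# Hypothetical toy cube: 2 latitudes x 3 longitudes x 4 time steps.
N_LAT, N_LON, N_TIME = 2, 3, 4


def measure(lat: int, lon: int, t: int) -> int:
    """Made-up measure value that encodes its own coordinates."""
    return 100 * lat + 10 * lon + t


# Explicit dimensions (lat, lon) identify each array;
# the implicit dimension (time) is the index within the array.
cube = {
    (lat, lon): [measure(lat, lon, t) for t in range(N_TIME)]
    for lat in range(N_LAT)
    for lon in range(N_LON)
}

# One time series per spatial cell.
series = cube[(1, 2)]

# An array-oriented operation: reduce each series along time.
time_mean = {cell: sum(values) / len(values) for cell, values in cube.items()}
```

Any analysis along time touches a single array per cell, which is why time is a natural implicit dimension for this workflow.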

Hierarchical data management

In order to manage large volumes of data and improve scalability, datasets in the Ophidia storage system are partitioned and (possibly) distributed over multiple analytics nodes.

Data partitioning consists of splitting the central fact table of the model described above into multiple smaller tables (chunks called fragments). In many cases this enables parallel data processing, since the fragments can be assigned to different Analytics Framework nodes and processed concurrently.

The fragments produced by the partitioning scheme are mapped onto a hierarchical structure that allows the user to optionally define:

the number of analytics nodes to be used (each node typically hosts a single I/O server instance);
the number of fragments on the same physical database (a single database is used per I/O server instance).
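The two-level hierarchy above can be sketched as follows. The source does not describe how rows are assigned to fragments, so the contiguous, near-even split used here is an assumption; the function name partition and the row count are hypothetical.

```python
from typing import Dict, List


def partition(n_rows: int, nhost: int, nfrag: int) -> List[Dict]:
    """Split n_rows fact-table rows over nhost hosts with nfrag fragments each.

    Assumes contiguous, near-even blocks; the real policy may differ.
    """
    total = nhost * nfrag
    base, extra = divmod(n_rows, total)
    layout = []
    start = 0
    for frag_id in range(total):
        # The first `extra` fragments take one additional row.
        size = base + (1 if frag_id < extra else 0)
        layout.append({
            "host": frag_id // nfrag,     # one database per host is assumed
            "fragment": frag_id % nfrag,  # position within that database
            "rows": range(start, start + size),
        })
        start += size
    return layout


# 3 hosts with 5 fragments each, over a hypothetical table of 100 rows.
layout = partition(n_rows=100, nhost=3, nfrag=5)
for frag in layout:
    print(frag["host"], frag["fragment"], len(frag["rows"]))
```

Each entry of the layout could then be processed independently, which is what makes concurrent execution on different nodes possible.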

Commands at a glance

How to import a NetCDF file using a partitioning schema with 3 hosts and 5 fragments per database (15 fragments in total)?

oph_importnc src_path=/path/to/file.nc;measure=foo;nhost=3;nfrag=5;

How to print the partitioning schema adopted for a cube?

oph_cubeschema

How to transform the innermost explicit dimension into the outermost implicit dimension?

oph_rollup

How to transform the outermost implicit dimension into the innermost explicit dimension?

oph_drilldown
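The effect of these two transformations on the data layout can be illustrated with a toy model. This sketch only mimics the reshaping on Python dictionaries and lists; it assumes flat, row-major arrays and says nothing about how the operators are implemented. The functions rollup and drilldown below are hypothetical helpers, not the Ophidia operators.

```python
from typing import Dict, List, Tuple

Cube = Dict[Tuple[int, ...], List[int]]


def rollup(cube: Cube) -> Cube:
    """Move the innermost explicit dimension into the arrays.

    Arrays sharing the remaining explicit indices are concatenated,
    so the moved dimension becomes the outermost implicit one.
    """
    rolled: Cube = {}
    for key in sorted(cube):
        rolled.setdefault(key[:-1], []).extend(cube[key])
    return rolled


def drilldown(cube: Cube, outer_size: int) -> Cube:
    """Inverse step: split each array along its outermost implicit dimension."""
    drilled: Cube = {}
    for key, values in cube.items():
        chunk = len(values) // outer_size
        for i in range(outer_size):
            drilled[key + (i,)] = values[i * chunk:(i + 1) * chunk]
    return drilled


# Toy cube: explicit (lat, lon) with 2 x 3 cells, implicit time with 4 steps.
cube = {
    (lat, lon): [100 * lat + 10 * lon + t for t in range(4)]
    for lat in range(2)
    for lon in range(3)
}

rolled = rollup(cube)            # explicit: lat; implicit: (lon, time)
restored = drilldown(rolled, 3)  # back to explicit (lat, lon)
```

After the roll-up each latitude owns a single array of 12 values ordered by longitude first and time second; the drill-down restores the original six arrays.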

How to exchange the order of two implicit dimensions (assuming that there are only two implicit dimensions)?

oph_permute dim_pos=2,1;

Note that the order of explicit dimensions cannot be exchanged.
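What exchanging two implicit dimensions means for a single array can be shown with a short sketch. It assumes the array is stored flat in row-major order, which the source does not state; permute_two is a hypothetical helper, not part of Ophidia.

```python
from typing import List


def permute_two(values: List[int], n_outer: int, n_inner: int) -> List[int]:
    """Swap the two implicit dimensions of a flat row-major array.

    The input is indexed as (outer, inner); the output as (inner, outer).
    """
    if len(values) != n_outer * n_inner:
        raise ValueError("array length does not match the dimension sizes")
    return [
        values[o * n_inner + i]
        for i in range(n_inner)
        for o in range(n_outer)
    ]


# Hypothetical array with implicit dimensions of sizes 2 (outer) and 3 (inner).
values = [0, 1, 2, 3, 4, 5]
swapped = permute_two(values, n_outer=2, n_inner=3)
```

Applying the swap a second time, with the sizes exchanged, returns the original array.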