Develop a new primitive

Ophidia comes with an extensive set of primitives to operate on n-dimensional arrays (i.e. on the arrays contained in fragments). To achieve flexibility requirements, primitives are designed as dynamic libraries in order to be plugged in different I/O servers with no effort. All plugins (currently available as sequential implementations) represent extensions of the SQL, so that they can be embedded in SQL statements to perform scientific data analysis tasks. Furthermore, since most plugins take an array as input and provide an array as output, it is possible to implement a very simple composition (nesting) of plugins to perform more semantically complex tasks (query).

The primitives have been designed and implemented as MySQL User-Defined Functions (UDF). See Adding New Functions to MySQL, from MySQL official documentation, to learn more about the UDF.

MySQL User-Defined Function

MySQL UDF (also called plugins) are shared objects written in C/C++ which are dynamically loaded into MySQL through a CREATE FUNCTION query. Once imported into the DBMS, the plugins can be used in SQL queries as all the other built-in functions.

Plugins can be developed in two fashions: simple or aggregate. Simple plugins work on a single row at a time (an example is MySQL ABS() function), while aggregate plugins work on groups of rows (an example is MySQL SUM() function).

For each new plugin, some UDF interfaces, based on the previous classification, must be implemented. Let’s assume we want to define a plugin called NEW_PLUGIN, the functions to implement are:

  • NEW_PLUGIN_init: to initialize the plugin and check the input arguments;
  • NEW_PLUGIN_clear (only for aggregate plugins): to reset the current aggregate value;
  • NEW_PLUGIN_add (only for aggregate plugins): to add the element to the current aggregate value;
  • NEW_PLUGIN: to execute the actual computation;
  • NEW_PLUGIN_deinit: to release resources allocated through the initialization function.

More details on MySQL UDF can be found in Adding a New User-Defined Function on the official MySQL documentation.

Sample implementation

Let’s look now at a sample primitive implementation: oph_square. This primitive computes the square value of each element inside the input binary array. Since this is a simple MySQL UDF, the code should contain at least the implementation of three interfaces:

  • oph_square_init: to initialize the primitives structures;
  • oph_square_deinit: to de-allocate resources;
  • oph_square: to perform the primitive computation.

To build the library, mysql and the standard C headers should be included. Below you can find a sample header for the primitive.

/* Standard C headers */
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <ctype.h>

/* Ophidia primitives core functions header */
#include "oph_core.h"

/* MySQL headers*/
#include <mysql.h> // It contains UDF-related symbols and data structures

// Function called at the start to initialize UDF-related structures
my_bool oph_square_init(UDF_INIT *initid, UDF_ARGS *args, char *message);

// Function called at the end to perform cleanup
void oph_square_deinit(UDF_INIT *initid);

// Main function called iteratively on each MySQL row (after the *_init function and before the *_deinit function)
char* oph_square(UDF_INIT *initid, UDF_ARGS *args, char *result, unsigned long *length, char *is_null, char *error);

The following sections will describe the code for each interface defined in the header. To build this library you can run a command like:

gcc -shared -o oph_square.so oph_square.c

Initialization Phase

This function is used to initialize structures and allocate memory required by the execution function and to check the input arguments type and number.

my_bool oph_square_init(UDF_INIT *initid, UDF_ARGS *args, char *message)
{
        // Input arguments must be 3
        if(args->arg_count != 3)
        {
                strcpy(message, "ERROR: Wrong arguments! oph_square(input_OPH_TYPE, output_OPH_TYPE, measure)");
                return 1;
        }

        // Input arguments must be strings
        int i;
        for(i = 0; i < 3; i++)
        {
                if(args->arg_type[i] != STRING_RESULT)
                {
                        strcpy(message, "ERROR: Wrong arguments to oph_square function");
                        return 1;
                }
        }

        // Initialize pointer that will point to a structure
        // that will be used as container of useful info about inputs
        initid->ptr = NULL;

        return 0;
}

Execution Phase

This is the main function of the primitive, where the actual computation is performed. Note that this function is executed on each row of the table selected by the query. The sequence of operation executed can be briefly summarized as:

  1. Error and input checks;
  2. Setup/reuse the initid->ptr structure;
  3. Setup/reuse the param->measure structure;
  4. Setup/reuse the param->result structure;
  5. Select compound type;
  6. Compute quare value;
  7. Set global struct values.
char* oph_square(UDF_INIT *initid, UDF_ARGS *args, char *result, unsigned long *length, char *is_null, char *error)
{
        // Check for errors occurred in the previous iteration or null inputs
        // Continue with the computation only if there were no errors and no null inputs
        if (*error)
        {
                *length=0;
                *is_null=0;
                *error=1;
                return NULL;
        }
        if (*is_null || !args->lengths[2])
        {
                *length=0;
                *is_null=1;
                *error=0;
                return NULL;
        }

        oph_generic_param_multi* param; // Structure used as container of useful info about inputs

        // During the 1st iteration of this function the pointer initid->ptr will be null
        // since we initialized it to null in the *_init function
        // We can then check if we are in the 1st iteration and allocate memory only 1 time for efficiency reasons
        if (!initid->ptr)
        {
                // Allocate memory for param
                param = (oph_generic_param_multi*)malloc(sizeof(oph_generic_param_multi));
                if (!param)
                {
                        pmesg(1, __FILE__, __LINE__, "Error in allocating parameters\n");
                        *length=0;
                        *is_null=0;
                        *error=1; // Raise an error and exit
                        return NULL;
                }

                // Initialize param
                param->measure = NULL;
                param->result = NULL;
                param->error = 0;
                param->extend = NULL;

                // Link local variable param to the UDF "global" variable initid->ptr to reuse it across iterations
                initid->ptr = param;
        }
        // From the 2nd iteration onwards we can skip the memory allocation phase
        // and reuse the UDF "global" variable initid->ptr linking it back to the local variable param
        else
        {
                param = initid->ptr;
        }

        // Check for uncaught errors
        if (param->error)
        {
                *length=0;
                *is_null=0;
                *error=1; // Raise an error and exit
                return NULL;
        }



        /* Setup structures to contain input data and metadata (BEGIN) */

        oph_multistring* input;

        // During the 1st iteration of this function the pointer param->measure will be null
        // according to the above code
        // We can then check if we are in the 1st iteration and allocate memory only 1 time for efficiency reasons
        if (!param->error && !param->measure)
        {
                // Interpret the 1st input parameter (the input_type)
                if(core_set_oph_multistring(&input, args->args[0], &(args->lengths[0])))
                {
                        param->error = 1;
                        pmesg(1, __FILE__, __LINE__, "Error setting structure for input data\n");
                        *length=0;
                        *is_null=0;
                        *error=1;
                        return NULL;
                }

                // Check for the correct number of input measures
                // In this case the primitive does not accept compound types,
                // so the correct number of measures is 1
                if (!input->islast)
                {
                        param->error = 1;
                        pmesg(1, __FILE__, __LINE__, "Wrong number of input measures\n");
                        *length=0;
                        *is_null=0;
                        *error=1;
                        return NULL;
                }

                // Save the input measure size in bytes (we assume this is a constant for all iterations)
                input->length = args->lengths[2];

                // Check for corrupted data (the number of bytes must be divisible by the size of a single element in the input array)
                if(input->length % input->blocksize)
                {
                        param->error = 1;
                        pmesg(1, __FILE__, __LINE__, "Wrong input type or data corrupted\n");
                        *length=0;
                        *is_null=0;
                        *error=1;
                        return NULL;
                }

                // Compute the number of elements in the array (we assume this is a constant for all iterations)
                input->numelem = input->length / input->blocksize;

                // Link variable param->measure to the variable input to reuse it across iterations
                param->measure = input;
        }
        // From the 2nd iteration onwards we can skip the memory allocation phase
        // and reuse the variable param->measure linking it back to the variable input
        else
        {
                input = param->measure;
        }

        // For each iteration we have to set input->content to point to the actual input array
        input->content = args->args[2];

        /* Setup structures to contain input data and metadata (END) */



        /* Setup structures to contain output data and metadata (BEGIN*/

        oph_multistring* output;

        // During the 1st iteration of this function the pointer param->result will be null
        // according to the above code
        // We can then check if we are in the 1st iteration and allocate memory only 1 time for efficiency reasons
        if (!param->error && !param->result)
        {
                // Interpret the 2nd input parameter (the output_type)
                if(core_set_oph_multistring(&output, args->args[1], &(args->lengths[1])))
                {
                        param->error = 1;
                        pmesg(1, __FILE__, __LINE__, "Error setting structure for output data\n");
                        *length=0;
                        *is_null=0;
                        *error=1;
                        return NULL;
                }

                // Check for the correct number of output measures
                // In this case the primitive does not produce compound types,
                // so the correct number of measures is 1 (as for input measures)
                if (output->num_measure != input->num_measure)
                {
                        param->error = 1;
                        pmesg(1, __FILE__, __LINE__, "Wrong number of output measures\n");
                        *length=0;
                        *is_null=0;
                        *error=1;
                        return NULL;
                }

                // The primitive produces the same number of output elements as those present in the input array
                output->numelem = input->numelem;

                // The effective size in bytes of the output array will depend on the specified output data type
                output->length = output->numelem * output->blocksize;

                // Allocate memory for the output array here for efficiency reasons
                output->content = (char *)malloc(output->length);
                if(!output->content)
                {
                        param->error = 1;
                        pmesg(1,  __FILE__, __LINE__, "Error allocating output array\n");
                        *length=0;
                        *is_null=0;
                        *error=1;
                        return NULL;
                }

                // Link variable param->result to the variable output to reuse it across iterations
                param->result = output;
        }
        // From the 2nd iteration onwards we can skip the memory allocation phase
        // and reuse the variable param->result linking it back to the variable output
        else
        {
                output = param->result;
        }

        /* Setup structures to contain output data and metadata (END) */


        /* The following code is general and covers the case of compound types.
         * Generally each operation must be performed on each corresponding singleton of the compound elements.
         * So we need to iterate over the input array and select corresponding singletons at each iteration.
         * In order to correctly read the proper singleton we have to use an offset given by a combination of the current element number
         * within the array and the current singleton number within that element.
         * Similarly for indexing the output array where we will write the results.
         */
    int i,k;
    double tmp = 0;
    size_t run_sum1 = 0;
    size_t run_sum2 = 0;
    for (i = 0; i < input->num_measure; i++) { // This loop will be executed only 1 time for this primitive

      // We have to take into account all the possible input data types
      switch (input->type[i]) {

      // In case of "oph_int"
      case OPH_INT:

              for (k = 0; k < i ; k++) { // This loop will not be executed for this primitive (we do not have compound types)
                      run_sum1 += input->elemsize[k];
                      run_sum2 += output->elemsize[k];
              }

              // Iterate over all (compound) elements
              for (k = 0; k < input->numelem; k++) {

                      // Read the proper element from the input array (a binary array casted to an integer array) into tmp
                      tmp = (double) *((int *) (input->content + (k * input->blocksize) + run_sum1));

                      // Compute the square of tmp (or write your own procedure)
                      tmp *= tmp;

                      // Write the result (with the proper output data type) into the output array
                      if(core_oph_type_cast(&tmp, output->content + (k * output->blocksize) + run_sum2, OPH_DOUBLE, output->type[i])) {
                              param->error = 1;
                              pmesg(1,  __FILE__, __LINE__, "Error writing output\n");
                              *length=0;
                              *is_null=0;
                              *error=1;
                              return NULL;
                      }
              }

              // Reset singleton counters
                        run_sum1 = 0;
                        run_sum2 = 0;

              break;

      // Do the same for all other admitted input data types
      case OPH_LONG:
              {}
              break;
      case OPH_FLOAT:
              {}
              break;
      case OPH_DOUBLE:
              {}
              break;

      // Otherwise raise an error
      default:
                        param->error = 1;
                        pmesg(1, __FILE__, __LINE__, "Input Type not allowed\n");
                        *length=0;
                        *is_null=0;
                        *error=1;
                        return NULL;
      }
        }

        // Properly set the length of the returned array
    *length=output->length;

    // Reset error flags
    *error=0;
    *is_null=0;

    // Return the output array
    return (result = output->content);
}

De-initialization Phase

This functions is used to release all resources allocate by the initialization function.

void oph_square_deinit(UDF_INIT *initid)
{
        // Free allocated space
        if(initid->ptr)
        {
                free_oph_generic_param_multi((oph_generic_param_multi*)initid->ptr);
                initid->ptr = NULL;
        }
}

Use the primitive

Once the primitive has been implemented, you can add it to your MySQL instance by simply running the following query:

CREATE FUNCTION oph_square RETURNS STRING SONAME 'oph_square.so';

Note that the plugin shared library should be placed under your MySQL plugin path, for example /usr/lib64/mysql/plugin/.

Now you can finally test your brand new primitive by running this query (substitute field and table with valid ):

SELECT oph_square('OPH_DOUBLE', 'OPH_DOUBLE', measure) FROM table;