Saturday, June 28, 2014

Micro ETL Foundation - A messaging system for Oracle Data Warehouse

Introduction

A goal of what I have called the Micro ETL Foundation (MEF) is to provide some simple and immediate solutions to the need of a Data Warehouse.
What could be more simple (and necessary) of a log message? It may seem strange to devote an entire article to describe how to report the message "Hello world". A message, famous for all the programmers of the computer world. But we find that what appears simple is actually a bit more complicated.
If we take the first paragraph of any programming book, (eg. java) the solution is expressed with a single line:


         System.out.println (\ Hello Word! ");
 

If we use another programming language (eg. PL/SQL Oracle), the solution is again expressed in a single line:

         Dbms_output.put_line ('Hello World');
 

What I want to emphasize is the incompleteness of the received message; In fact, the complexity lies not in the message itself: at the end it is a simple function call.The complexity is the lack of context. Complexity is the metadata.
Suppose that at this time are running more loading processes. The fact of receiving a warning or alert message is not enough.
What I want to know is when the message was sent, which job was running, how much time has elapsed since the previous message, which was the current procedure at the time of the message.
We find out that answer to all these questions is no longer trivial. When we go out from theoretical examples printed on the books and enter into the reality of the daily work,everything becomes more complicated.
We will see the theory and practice, then the code, which will give us an answer to contextual needs described above.
And the solution will be (surprise!), again expressed in a single line. The programming language used is the Oracle PL/SQL (simply because it is widely used in the Data Warehouse world and then I turn to a wider audience), but the techniques exposed are easily playable on any other RDBMS.
I forgot: we also see how to ensure that the message is sent via e-mail.
This, however, is not just for programmers. It 'an article that shows all the hard work behind a simple log message. Just imagine the work that is behind the complete loading process of a big Data Warehouse.

Definitions

Naming Convention
I spent a lot of words on the importance and need to always use a naming convention for the Data Warehouse projects. I summarize the concept. 

In general, the naming convention is the method by which you choose to assign a name to the various entities of the system being designed. This method should produce a name that must be able to represent immediately the semantic nature of the entities that we will use. It does not matter if they belong to a Data Warehouse system or to an operational system.

ETL Process
The ETL process is the set of programs that load data from external systems into the Data Warehouse tables.
Since this set can be very complex, it is extremely important to have a messaging system that gives me the
most information possible about the process.

Job
A job is a logical unit that describe a very specific task, such as loading a dimension of analysis or of a fact table or both.
A set of job-related between them is in turn a job that is part of a schedule.
It is activated at a preset time. In turn, a job consists of simple and sequential processing component that we call units.

Unit
The unit is the elementary processing component (so it is code) which, in turn, can call other procedures and functions that can be defined in a generic way, modules.

Execution
A job usually runs at night, but it could also run several times a day.
Each job execution is a run that has to be identified with a sequential number, which can be defined as exec counter.
For the moment are sufficient these definitions very short. Later we will analyze in more detail.

Requirements

The requirements are very simple. To have a basic messaging system that signals everything that it is useful to
monitor the execution of the ETL process.

It is a task of the designer to decide the content of these messages, which may simply be informative, as the number of rows processed, or of particular attention, such as the identification of errors or anomalies. In these cases it should be possible to send the message via e-mail.
The system should not merely store all messages generated, but must also have the context information.
The context is described in the paragraph relative to the definition of the ETL process.
More context information we insert, and more we can control the system and we will be more efficient and faster to resolve the problems.
I suppose a minimum knowledge of Oracle and SQL and PL/SQL languages.

Design

The design of the messaging component of the Micro ETL Foundation (MEF), consists of 3 tables, 2 sequences and 1 package.
We can see them as the basic components. The minimum necessary for the other MEF components that we will use in the future. The naming convention used is a smaller version of the one to be applied to all objects of a Data Warehouse. [See http://www.slideshare.net/jackbim/recipes-6-of-data-warehouse-naming-convention-techniques].
But it would be unreasonable to apply to MEF the same logic that is used to organize the hundreds of tables typical of a large Data Warehouse. Also we can use MEF in any type of project. We will use this simplified structure:

<entity name>=<area code>_ <type name>_<logical code>

The basic package, being common to all, will be further simplified by eliminating the <logical name>.
The area code is "MEF".
The table MEF_CFT is the table of system configuration.
The table MEF_EMAIL_CFT is the specific configuration table for e-mail addresses, the table MEF_MSG_LOT is one that will keep the text messages.
The sequence MEF_MSG_SEQ gives a sequence number to each message.
The sequence MEF_RUN_SEQ is used to numerically identify each job execution.
The Oracle package MEF is a library of programs that manages those objects.
All scripts for creating objects, can be downloaded from slideshare that show in a pragmatic and fun way the system [http://www.slideshare.net/jackbim/recipe-7-of-data-warehouse-a-messaging-system-for-oracle-dwh-2].
The scripts are very minimal, with only the main structures, leaving the reader with the completion of all other
accessory structures such as indexes, constraints, etc.. 

The tables will be created in the default tablespace, but you should always create ad-hoc tablespace.

The MEF_CFT table

This table contains general information about the MEF and the Data Warehouse.Other columns will be added in the future (or on your choice)

prj_cod: Code of the Data Warehouse project


user_cod: name of the Oracle user for ETL.


email_srv_cod: E-mail Server. Every corporate has always a server for managing e-mail, indicate here that server.


mef_root_txt: Path of the folder of MEF scripts.


mef_dir: Oracle directory pointing to mef_root_txt

The MEF_MSG_LOT table

This table stores all log messages that are sent by loading process.

seq_num: Sequential number of the message. It is obtained from an Oracle sequence.


day_cod: Time stamp of the message insert in the format YYYYMMDD.


sched_cod: The identifier for the schedule to which the job belongs.


job_cod: job identifier. It is a logical entity in the sense that we think of it as the launch of a list of processing units.


unit_cod: Identifier of the processing unit within the job. We can think of it as a procedure or a function of the Oracle package.


module_cod: module Identifier. A unit, though complex, can in turn call the sub-routine or subfunctions, ie modules. In this case, it is interesting to know this detail.


rows_num: Number of rows processed. Typically, this field is not set, but if we want to report the number of rows, for example, inserted into a table, we also have this information.


line_txt: Message text


cline_txt: Message text in CLOB type.


ss_num: Number of seconds that have elapsed since the previous message. This information, together with the next two, provides a summable data. If we wanted to know how long it took all the statements of a certain kind, we would be able to calculate.


mi_num: Number of minutes that have elapsed since the previous message


hh_num: Number of hours that occurred since the previous message


elapsed_txt: Time elapsed since the previous message in the format HH24:MI:SS


stamp_dts: Time stamp insertion of the message.


exec_cnt: The identifier for the execution of the job. Every run of a job should be characterized by a number,
in turn, extracted from an Oracle sequence.


user_cod: Oracle user who posted the message. It is setted automatically by a session variable. It can be useful in cases where multiple users contribute to the loading process(not recommended)

The MEF_EMAIL_CFT table

This table configures the email addresses.

email_cod: Code to identify a group of recipients.


from_txt: Sender of the message. This name will appear as the sender of the e-mail message. Do not use special characters, nor the blank between words. Eg. not set "Administrator ETL" but "Amministratore_ETL"  otherwise you get a run-time error message like:


ORA-29279: SMTP permanent error: 501
5.5.4 Invalid arguments 


to_txt: Identifier of the message recipient
cc_txt: This e-mail address of the recipient in knowledge
subj_txt: Subject default message
status_cod: Status (1 = active, 0 = inactive) of the recipient
 

The MEF_MSG_SEQ sequence

The sequence is an Oracle object. In practice it is a universal counter that increments each time you request it.
Each message line must have its sequential number. It is more functional than time stamp to sort the table.
Because sometimes the messages are separated at a fraction of a second of each other, the time stamp might not be sufficiently discriminating.

The MEF_RUN_SEQ sequence

Sequence that indicates unambiguously, a run of a job.

The MEF package

This is the basic package. I suggest to develop all the code inside PL/SQL packages (they are basically libraries), which allow a better management and use of the code.
Now a short description of the units contained in the package.


f_str: Utility function to generate a string after replacing the input variables.


p_ins_msg_lot: This procedure perform the insertion of the message in the MEF_MSG_LOT table  receiving a variable of row type as input parameter. This procedure has the "pragma autonomous_transaction". It is very important and requires a thorough description.
It seems incredible, but if there were not, it would not be possible this messaging system. The pragma autonomous_transaction allows us to commit (i.e. to validate into the database) only and exclusively the DML statements of the unit which contains this compiler directive. This concept is crucial because it allows us to commit (so to insert data into the database) without affecting the logic of the loading process.
Let's clarify with an example.
Typically, before to loading the daily data into a table, you delete, first of all, the data of that day, and then load/reload the new data. (forget for a moment the partition manipulation)
You do the commit at the end, where the loading, (delete/insert), was successfully completed. If i ran a commit after the delete and the insert has a problem, I could have the loss of data of the day.
As the messaging needs to do a commit of the message in the table, the execution of the innocent message "I have done the delete", would validate even the delete itself. And this is a side effect not acceptable.
The autonomous transaction solve the problem: the PL/SQL Oracle engine, produce a "daughter" transaction
who live an autonomous life, which validates the data in the MEF_MSG_LOT table without affecting the logic of the parent transaction.


p_rae: The management of the exception is standardized in the following manner. When any Oracle error happen, in the "when others" instruction there is the call of the p_rae procedure that enriches the content of the error with other useful information,such as, for example, where the exception occurred. The output will always be the standard error pv_error. The "when pv_error" is used to make sure that you keep the original error. 


p_init: Other private procedure. It initializes the line_row variable to perform the insert in the table.


delta_time: Procedure that, based on the input parameters, such as the date of the last message and the current date , calculates all the delta-time information, how many seconds, minutes, hours have passed and a delta time in the 'HH24:MI:SS' format. The ways in which we can calculate the delta time are numerous: one used is just one of many.


f_get_seq_val: Function that extracts generalized, in a dynamic way, the next number of the Oracle sequence whose name is given in the parameter. 


f_get_exec_cnt: Function that extracts the execution number of the job. Before calling the generic function
f_get_seq_val it verify that has not already been set as a global variable: in this case, use the number of the current execution.


f_get_cft: Function that extracts the current configuration from the MEF_CFT table.


p_esend: A function that performs the sending of email via the UTL_MAIL package.

p_send: This is the procedure that sends the message.


p_mail: Procedure for sending the e-mail. Using the email code in inputit looks for all the recipients in the MEF_EMAIL_CFT table and calls the p_esend passing all required parameters.

System configuration

Before you can create all the structures described above, you must perform some environment check. First of all it is necessary to verify that the Oracle RDBMS is designed for sending emails.
So you have to make sure that the Oracle user that sends the messages has all grant necessary for its operations.
Let's see in detail.

SMTP check
We need to verify that the Oracle RDBMS is designed for sending emails. In fact the sending is activated by calling a procedure that is part of a package of the RDBMS. To verify this, connect to SQL * Plus with SYS user and check the package UTL_MAIL (from Oracle 11):


Sqlplus / as sysdba
SQL> descr utl_mail
ERROR:
ORA-04043: object utl_mail does not exist


If you get this error message, you need to install the package system UTL_MAIL and give the execute permission to the user.
Always as user SYS, run the script to install the package from the rdbms/admin folder of the Oracle home, and give the execution permissions to the ETL user.
So you have to check the smtp server in the Oracle initialization parameter file; in the following example it is not setted and I indicate you how to do it. Here is the sequence of instructions for making these checks (remember to replace the file path of the utlmail.sql file with the relative path to your Oracle installation):


D:\>sqlplus / as sysdba

SQL*Plus: Release 11.2.0.1.0 Production on Wed Jun 25 15:35:49 2014

Copyright (c) 1982, 2010, Oracle.  All rights reserved.

Connected to:
Oracle Database 11g Enterprise Edition Release 11.2.0.1.0 - Production
With the Partitioning option

SQL> @...\RDBMS\ADMIN\utlmail.sql
SQL> @...\RDBMS\ADMIN\prvtmail.plb
Package created.
Synonym created.

SQL> grant execute on UTL_MAIL to <ETL user>;
Grant succeeded.
SQL> sho parameters smtp
NAME TYPE VALUE
------------------------------------ ----------- -------
smtp_out_server string

SQL> alter system set smtp_out_server = <email server> scope=both;
System altered.

 
Almost always, in the company there is a mail server, the value should be include the domain, for example. exch.dev.com. The option "scope = both" make the change effective immediately and permanently.
We must also create and configure an ACL (from Oracle11) using the system package dbms_network_acl_admin.
The ACL (Access Control List) is only one way to define external resources to the RDBMS (such as email servers) and allow access to users.
Now open slideshare [http://www.slideshare.net/jackbim/recipe-7-of-data-warehouse-a-messaging-system-for-oracle-dwh-2] where you will find all the instructions on how to download and run the installation script of this messaging system.
The script will execute all the necessary settings for you. All this will take no longer than 5 minutes of work.

Test

In addition to the tests showned in the slides of slideshare, we perform another test now, certainly more exhaustive,  which clearly shows the functionality simulating a piece of ETL load.
We start by creating a test table, initializing it from a system table. At this point we run an anonymous block of code (anonymous block is defined as the set of SQL statements included between a begin and an end) that will simulate a real load of a table with insert and delete of data. The sequence of steps is quite simple: you initialize global variables package to better identify the various steps in the messaging table.

create table SALES as select * from tabs;
begin
   mef.pv_sched_cod := 'Daily';
   mef.pv_job_cod := 'Staging tables';
   mef.pv_unit_cod := 'Load sales table';
   mef.pv_exec_cnt := 10;

   mef.p_send('proc_prova','Load of SALES table');
   mef.p_send('proc_prova','Deleting...');
   delete from sales;
   mef.p_send('proc_prova','Deleted');
   mef.p_send('proc_prova','Loading...');
   insert into sales select * from tabs;
   mef.p_send('proc_prova','Loaded');
   mef.p_mail('MEF','ETL_administrator','Sales table loaded');
   mef.p_send('proc_prova','Load ended');
end;
/


The final result obtained, which can be seen in the table MEF_MSG_LOT, it is very interesting, and gives you the wealth of contextual information that is talked about in the beginning.

Conclusion

We saw in detail the steps required to build a messaging system, simple, but very useful for all the people working in the Data Warehouse projects.
This implementation, which is the basis of my Micro-ETL-Foundation, obviously works on any Oracle-PL/SQL project, is non-invasive, and can be applied at any time inserting simple procedure calls in an existing ETL process.