Saturday, March 5, 2016

Recipes of Data Warehouse and Business Intelligence (Use case 02) - How to load a data file from network with control file

Analysis of the use case 2

This use case shows how to configure a data file that is present in a network folder (obviously it must be visible to the server that contains the Data Warehouse). Furthermore this file is coupled to the corresponding control file that, in order to be managed by MEF, must have the same name as the data file unlike its extension.
This situation typically occurs for two reasons.

1 - the feeding system can not (or do not want to do it) send the data file to the DWH
2 - the feeding system generates a data file that is used by multiple applications.


In this case we have take the data file (moving it or just copying it) and carry it to the
reception folder called rcv .
The control file contains no further information. His mere presence certifies that the file date is OK. We shall see, in a next use case, also in the case where the control file must contain the number of rows in the data file. This setting is preferable, as it allows us greater control on the consistency of the data file .
As in use case 01 follow the indications contained in the read_me file. Here I will describe in detail the sequence of operations.

Changes to MEF repository

In order to properly handle the move (or copy) of a data file from a network folder
to rcv folder, you can access to the Operating System using java functionality.
The original package mef_java was then modified to be able to execute
Operating System commands on both Linux and Windows. Consequently they were made of small interventions even on packages and mef mef_sta to use the new Java function.

Another intervention was necessary to also configure the information
"FIELDS TERMINATED BY" associated the Oracle external table. Ovviously, the data file can be constructed on very different systems between them, so some
settings can be valid on Windows but not on Linux and vice versa. The only thing
what we can do is to make configurable the feature of the external table and try to change the value if there are any problems.
In this use case, for example, the same data file, on Windows 64 had to be set as "0x 0d0a" but on Linux and Windows 32 as "NEWFILE". If the load gives error, try changing the value.

Before to start with configuration and run, do the work indicated in the read_me.txt to prepare the MEF environment.

Configuration of the data file

The data file used for this demonstration is that of the financial regulated markets
,downloadable from ESMA site (http://mifiddatabase.esma.europa.eu). For the purposes of demonstration, the network folder is instead a local folder that is initialized automatically after the installation of the MEF. 

You can safely try the use case with a your network folder, for example '\\ w2886038.bim.local \ shared'.
In this case, you must move the data file and the control file under the
'\\ W2886038.bim.local \ shared' folder and you have to change the pointing of the Oracle directory as follows:


CREATE OR REPLACE DIRECTORY
DWH_ESM_M_REGMAR_SRC AS
'\\w2886038.bim.local\shared';


The io_regmar.txt contains the setting for the data file. Set your correct value according to your Operating System. 

Configuration of the structure of the data file

The configuration is contained in the regmar.csv. There is nothing else to do. You have done all the work neeeded to process the data file. Connect to Oracle and run the configuration.

sqlplus etl/etl
SQL> @sta_conf_io REGMAR



Pre o post processing

Nothing to do.

Load of the data file

To load the data file, you must put it and its control file under the folder pointed from the DWH_ESM_M_REGMAR_SRC directory.
 

Thursday, February 25, 2016

Recipes of Data Warehouse and Business Intelligence (Use case 01) - How to load a data file with fixed length columns, dummy fields and transformation rules

Introduction

In the initial  example, we loaded the global financial market data files (MIC). One of the fields in the data file was the ISO country code in the 2-characters format. The international encoding of nations also provides a  3-characters format and a numeric format.
It 'important for a Data Warehouse, have all the standard codes, above all for information of a general nature as well as the codes of nations. This allows the Data Warehouse to receive and send data files to the outside world using a common coding language.
We can find on the internet the data file that contains these encodings (I used the http://www.chemie.fu-berlin.de/diverse/doc/ISO_3166.html site but you can find others. I used this example because with a simple copy and paste, I created a txt format data file with fields at  fixed  lenght.
This data file, after being loaded into the Staging Area, will be a good candidate to become the dimension table of nations. This dimension must have a unique key field on which to build the dimensional key. In the case where, like this, there are more ways to identify the nation uniquely, we need to decide on a single one and use it as a standard for the whole Data Warehouse. We decide to define as a unique key the ISO code the 3-characters format.
In the sample of financial markets (MIC), there is the 2-characters format code. Suppose that, since the MIC will become a dimension of analysis, we want to include  the ISO 3-characters format code and its related description.
The objective is to show how to use MEF for
•    Load the data file of the nations in format with fixed-length fields.
•    Set up a trasfomazion rule on a field
•    Configure the fields that are not in the MIC data file.
•    Use the post processing
Do all this, it is very simple. It takes a little work.

1 - It starts as usual, by configuring the COUNTRY data file and placing the sample file in the \dat.
2 - Run the load.
3 - Change the configuration file of the MIC structure, adding dummy fields. These fields will not be present in the file date, but will be created in the Staging Area table.
4 - We use the post processing package, which always runs even if it has nothing to do, to write the plsql code that will fill the added fields. 
The data file and the related configuration files, you can download from https://drive.google.com/open?id=0B2dQ0EtjqAOTQzZSaUlyUmxpT1k. Under the folder of Use Case. Follow the read_me instructions.
We see in detail these steps.

The COUNTRY configuration

The configuration  of the data file is performed by opening the io_country.txt file  located in the \cft, and setting the following values.

IO_COD:COUNTRY
IO_DEB:ISO 3166 Codes (Countries)
TYPE_COD:FIN
SEC_COD:ISO
FRQ_COD:M
FILE_LIKE_TXT:countries%.dat
FILE_EXT_TXT:countries_22-02-2016.dat
HOST_NC:.,
HEAD_CNT:3
FOO_CNT:0
SEP_TXT:
START_NUM:11
SIZE_NUM:10
MASK_TXT:DD-MM-YYYY
TRUNC_COD:1

Since the data reference date is inside the file name (countries_22-02-2016.dat), we need to set the size properly MASK_TXT: DD-MM-YYYY. The date starts at the eleventh character  (START_NUM:11) and is long 10 (SIZE_NUM: 10).
I plugged in the data file 3 header lines, so there is HEAD_CNT:3
Since it is a data file with fixed length fields, leave blank the information of the SEP_TXT separator.
The configuration of the structure of the data file, was performed by opening the file country.csv under the folder \cft
We see in the figure only the columns that need to be setted.



The first 4 lines setup the fields according as they are present in the data file. They have a fixed length which is affected by the amplitude of the column headers (HOST_LENGTH_NUM), but in the Staging Area table,  I want to have the real length, ie the one that is obtained with the sum of
HOST_LENGTH_NUM with STA_OFF_NUM.
We set the control of the congruence for all fields that do not undergo the rules (CHK_FLG) and we set the fields that make up the unique index (UI_FLG). After loading the Staging Area table, you will create the unique index. This is to prevent, immediately, that may be present duplicate keys.
At line 5, we insert the field in more that it is not found in the data file. (The COLUMN_COD FILE is null). Also, because we want the nation numerical code is just a number (not a number with leading zeros) we set a rule change. This rule will be used in the creation of the view that the processing module will use for loading the staging table.
Now, open a shell and :

Microsoft Windows [Version 6.0.6002]
Copyright (c) 2006 Microsoft Corporation.  All rights reserved.

E:\>cd projects\dwh\mef

E:\projects\dwh\mef>sqlplus etl/etl

SQL*Plus: Release 12.1.0.1.0 Production on Wed Feb 24 10:41:25 2016

Copyright (c) 1982, 2013, Oracle.  All rights reserved.

Last Successful login time: Tue Feb 23 2016 09:28:53 +01:00

Connected to:
Oracle Database 12c Enterprise Edition Release 12.1.0.1.0 - 64bit Production
With the Partitioning, OLAP, Advanced Analytics and Real Application Testing options

SQL> @sta_conf_io COUNTRY

PL/SQL procedure successfully completed.

Disconnected from Oracle Database 12c Enterprise Edition Release 12.1.0.1.0 - 64bit Production
With the Partitioning, OLAP, Advanced Analytics and Real Application Testing options

E:\projects\dwh\mef>


Load the COUNTRY data file
Reconnect to SQL*Plus and run the load:

E:\projects\dwh\mef>sqlplus etl/etl

SQL*Plus: Release 12.1.0.1.0 Production on Wed Feb 24 11:08:01 2016

Copyright (c) 1982, 2013, Oracle.  All rights reserved.

Last Successful login time: Wed Feb 24 2016 10:41:26 +01:00

Connected to:
Oracle Database 12c Enterprise Edition Release 12.1.0.1.0 - 64bit Production
With the Partitioning, OLAP, Advanced Analytics and Real Application Testing options

SQL> exec mef_job.p_run('sta_iso_country');

PL/SQL procedure successfully completed.

SQL>


The MIC reconfiguration

We must change the structure of the MIC data file by inserting the extra fields, as we have done in the fifth row of the COUNTRY configuration. Doing so is very simple, because you just do a copy and paste of the lines 3 and 1of COUNTRY. (With FILE_COLUMN_COD and UI_FLG changed to null).


After changing the configuration file of MIC, run the configuration script.

SQL> @sta_conf_io MIC

PL/SQL procedure successfully completed.

Disconnected from Oracle Database 12c Enterprise Edition Release 12.1.0.1.0 - 64bit Production
With the Partitioning, OLAP, Advanced Analytics and Real Application Testing options

E:\projects\dwh\mef>


Setting of the new fields PAE_DEB and PAEISO3_COD with the post-processing


Needless to say that in order to update these fields, you must first be loaded the COUNTRY data file. This setting is a typical post-processing operation. I have opened the temp_dwh_sta_iso_mic_ppk.sql file, present under the \sql folder, and modify it as follows.

create or replace package dwh_STA_ISO_MIC_PPK as
procedure p_main;
end;
/
create or replace package body dwh_STA_ISO_MIC_PPK as
pv_pkg varchar2(30) := 'dwh_STA_ISO_MIC_PPK.';
pv_error   exception;
pragma exception_init (pv_error, -20058);

procedure p_upd_iso is
   v_mod varchar2(61) := pv_pkg||'p_upd_iso';
   v_count number;
   v_upd number;
   v_msg varchar2(500); 
begin
   select count(*) into v_count
   from dwh_sta_iso_mic_stt;

   update dwh_sta_iso_mic_stt d
   set (paeiso3_cod,pae_deb) = (
       select paeiso3_cod,pae_deb
       from dwh_sta_iso_country_stt s
       where d.paeiso2_cod = s.paeiso2_cod
       )
   where exists (
      select paeiso3_cod,pae_deb
      from dwh_sta_iso_country_stt s
      where d.paeiso2_cod = s.paeiso2_cod
      );
   v_upd := sql%rowcount;
   v_msg := 'Updated '||v_upd||' countries on '||v_count;  
   if (v_upd <> v_count) then
      v_msg := v_msg||chr(10)||'Attention ! Some contries ('||
      to_char(v_count-v_upd)||') without description and ISO3 code';
   end if;
   mef.p_send(v_mod,v_msg);
   mef.p_mail('INFO','Staging Area Informational ('||v_mod||')',v_msg);
exception
   when pv_error then raise;
   when others then mef.p_rae(sqlerrm,v_mod); 
end;

procedure p_main is
   v_mod varchar2(61) := pv_pkg||'p_main';
begin
        if (mef_job.pv_wr.repeat_cnt > 0) then
      mef.p_send(v_mod,'Start');
      p_upd_iso;
      mef.p_send(v_mod,'End');
   end if;
exception
   when others then mef_unit.p_exc(v_mod,sqlerrm);
end;
end;
/


I have closed the file and I have renamed it, without the initial "temp_".
I have already done for you: you must only download the file, move it under the \sql folder, and compile it.

E:\projects\dwh\mef>sqlplus etl/etl

SQL*Plus: Release 12.1.0.1.0 Production on Wed Feb 24 11:16:35 2016

Copyright (c) 1982, 2013, Oracle.  All rights reserved.

Last Successful login time: Wed Feb 24 2016 11:08:01 +01:00

Connected to:
Oracle Database 12c Enterprise Edition Release 12.1.0.1.0 - 64bit Production
With the Partitioning, OLAP, Advanced Analytics and Real Application Testing options

SQL> @..\sql\dwh_sta_iso_mic_ppk.sql

Package created.


Package body created.

SQL>


As you can see, the code performs a simple update of the Staging Area Table, taking the values from the dwh_sta_iso_country_stt table. Because we want to be notified if there are nations who have not found a description, add a new email code in the configuration table  MEF_EMAIL_CFT, which will be used by  mef.p_mail call.From  SQL * Plus give:

Insert into MEF_EMAIL_CFT (EMAIL_COD, FROM_TXT, TO_TXT, CC_TXT, SUBJ_TXT, STATUS_COD)
Values ('INFO', 'mef@microetlfoundation.com', 'massimocenci@outlook.com', 'massimo_cenci@yahoo.it', null,1);
COMMIT;

Of course, change the email address with yours. We can now reload the MIC file. We enter SQL * Plus and give:

SQL> exec mef_job.p_run('sta_iso_mic');

PL/SQL procedure successfully completed.

SQL>

 
We see (in this image fragment) that the MIC Staging table has new values in the added columns.






 Since there are some cases of country codes not found, we received the informational email:


Wednesday, February 24, 2016

Designing, building, loading and managing a Staging Area for an Oracle Data Warehouse. In 20 minutes.



I briefly summarize what I have described in the presentation, with the same name, on Slideshare.

In the recent years, I have tried to explain and share a vision of the Data Warehouse which I defined as Micro ETL Foundation (MEF). MEF was born as a set of ideas, tips and methods to be able to successfully implement a project of this type.

The main guideline was the simplicity. This was the most difficult point to be maintained, because a project of Data Warehouse and Business Intelligence is extraordinarily complex.I have shown in various articles, that even small details such as the descriptions of the codes or the null values management can weigh heavily on the final outcome of the project.In fact, in my opinion, the real complexity lies in being able to simplify  the complexity.

It is now time to collect and put together organically and usable in real life, everything that I wrote, giving a consistent life to the Micro ETL Foundation transforming it into a an ecosystem to be used by all in a simple and free way.
The implementation was carried out using the internal language of the database (pl / sql) and  it is addressed to an Oracle Data Warehouse builted in Windows (and Linux) environment, but the underlying philosophy can be easily adapted to any RDBMS and any Operating System. 

The use of a well-defined Naming Convention has been decisive for the creation of models useful to the automatic generation of all the objects that constituting the system. We can then define MEF as a "Naming Convention-driven" ecosystem.

The focus will be on the Staging Area, because, for me, is one of the main components of a Data Warehouse, the basis (according to the Kimball approach) on which we can build the next components, ie the Dimensions and the Fact tables. The example described is the load of a simple file in csv format.

The MEF ecosystem is downloadable at https://drive.google.com/open?id=0B2dQ0EtjqAOTQzZSaUlyUmxpT1k
The use case is described on Slideshare:
http://www.slideshare.net/jackbim/recipe-14-of-data-warehouse-and-business-intelligence-build-a-staging-area-for-an-oracle-data-warehouse-1?related=1
http://www.slideshare.net/jackbim/recipe-14-build-a-staging-area-for-an-oracle-data-warehouse-1

Now I will describe how to use it with others practical examples.