Saturday, March 5, 2016

Recipes of Data Warehouse and Business Intelligence (Use case 02) - How to load a data file from network with control file

Analysis of the use case 2

This use case shows how to configure a data file located in a network folder (which obviously must be visible to the server that hosts the Data Warehouse). The data file is coupled with a corresponding control file which, in order to be managed by MEF, must have the same name as the data file, differing only in the extension.
This situation typically occurs for two reasons.

1 - the feeding system cannot (or does not want to) send the data file to the DWH
2 - the feeding system generates a data file that is used by multiple applications.


In this case we have to take the data file (moving it or just copying it) and bring it to the
reception folder called rcv.
The control file contains no further information. Its mere presence certifies that the data file is OK. In a future use case we will also see the situation where the control file must contain the number of rows of the data file. That setting is preferable, as it gives us greater control over the consistency of the data file.
As in use case 01, follow the instructions contained in the read_me file. Here I will describe the sequence of operations in detail.

Changes to MEF repository

In order to properly handle the move (or copy) of a data file from a network folder
to the rcv folder, you can access the Operating System using Java functionality.
The original mef_java package was therefore modified to be able to execute
Operating System commands on both Linux and Windows. Consequently, small changes were also made to the mef and mef_sta packages so that they use the new Java function.

Another intervention was needed to make the "FIELDS TERMINATED BY" information
associated with the Oracle external table configurable as well. Obviously, the data file can be produced on systems that are very different from each other, so some
settings may be valid on Windows but not on Linux and vice versa. The only thing
we can do is to make this feature of the external table configurable and change the value if any problems arise.
In this use case, for example, for the same data file the value had to be set to "0x 0d0a" on Windows 64-bit but to "NEWFILE" on Linux and Windows 32-bit. If the load gives an error, try changing the value.
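
If you need to check which value the generated external table is actually using, you can inspect its access parameters in the data dictionary. A small sketch only: the external table name below is my assumption, derived from the naming convention used in these posts.

SELECT access_parameters
FROM   user_external_tables
WHERE  table_name = 'DWH_STA_ESM_M_REGMAR_EXT';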

Before starting with the configuration and the run, do the work indicated in the read_me.txt to prepare the MEF environment.

Configuration of the data file

The data file used for this demonstration is the one of the regulated financial markets, downloadable from the ESMA site (http://mifiddatabase.esma.europa.eu). For demonstration purposes, the network folder is actually a local folder that is initialized automatically after the installation of MEF.

You can safely try the use case with a network folder of your own, for example '\\w2886038.bim.local\shared'.
In this case, you must move the data file and the control file under the
'\\w2886038.bim.local\shared' folder and change the Oracle directory so that it points there, as follows:


CREATE OR REPLACE DIRECTORY
DWH_ESM_M_REGMAR_SRC AS
'\\w2886038.bim.local\shared';
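
Depending on your environment, the ETL schema may also need explicit privileges on this directory. A minimal sketch, assuming the ETL user is etl as in the examples below:

GRANT READ, WRITE ON DIRECTORY DWH_ESM_M_REGMAR_SRC TO etl;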


The io_regmar.txt file contains the settings for the data file. Set the correct value according to your Operating System.

Configuration of the structure of the data file

The configuration is contained in regmar.csv. There is nothing else to do: you have done all the work needed to process the data file. Connect to Oracle and run the configuration.

sqlplus etl/etl
SQL> @sta_conf_io REGMAR



Pre or post processing

Nothing to do.

Load of the data file

To load the data file, you must put it and its control file under the folder pointed to by the DWH_ESM_M_REGMAR_SRC directory.
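
Then run the configured job from SQL*Plus. This is only a sketch: the exact job name is generated by MEF from its naming convention, and 'sta_esm_m_regmar' is my assumption based on the pattern used in use case 01 (sta_iso_country, sta_iso_mic).

SQL> exec mef_job.p_run('sta_esm_m_regmar');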
 

Thursday, February 25, 2016

Recipes of Data Warehouse and Business Intelligence (Use case 01) - How to load a data file with fixed length columns, dummy fields and transformation rules

Introduction

In the initial example, we loaded the global financial markets data file (MIC). One of the fields in the data file was the ISO country code in the 2-character format. The international encoding of nations also provides a 3-character format and a numeric format.
It is important for a Data Warehouse to have all the standard codes, above all for information of a general nature such as the codes of nations. This allows the Data Warehouse to receive and send data files to the outside world using a common coding language.
We can find on the internet a data file that contains these encodings (I used the http://www.chemie.fu-berlin.de/diverse/doc/ISO_3166.html site, but you can find others). I used this example because, with a simple copy and paste, I created a txt data file with fixed-length fields.
This data file, after being loaded into the Staging Area, will be a good candidate to become the dimension table of nations. This dimension must have a unique key field on which to build the dimensional key. When, as in this case, there are several ways to identify the nation uniquely, we need to choose a single one and use it as a standard for the whole Data Warehouse. We decide to define the ISO code in the 3-character format as the unique key.
In the financial markets sample (MIC), there is the 2-character format code. Suppose that, since the MIC will become an analysis dimension, we want to include the ISO 3-character format code and its related description.
The objective is to show how to use MEF to
•    Load the data file of the nations with fixed-length fields.
•    Set up a transformation rule on a field.
•    Configure the fields that are not in the MIC data file.
•    Use the post-processing.
Doing all this is very simple. It takes only a little work.

1 - Start as usual by configuring the COUNTRY data file and placing the sample file in the \dat folder.
2 - Run the load.
3 - Change the configuration file of the MIC structure, adding dummy fields. These fields will not be present in the data file, but will be created in the Staging Area table.
4 - Use the post-processing package, which always runs even if it has nothing to do, to write the PL/SQL code that will fill the added fields.
You can download the data file and the related configuration files from https://drive.google.com/open?id=0B2dQ0EtjqAOTQzZSaUlyUmxpT1k, under the Use Case folder. Follow the read_me instructions.
Let's see these steps in detail.

The COUNTRY configuration

The configuration of the data file is performed by opening the io_country.txt file located in the \cft folder, and setting the following values.

IO_COD:COUNTRY
IO_DEB:ISO 3166 Codes (Countries)
TYPE_COD:FIN
SEC_COD:ISO
FRQ_COD:M
FILE_LIKE_TXT:countries%.dat
FILE_EXT_TXT:countries_22-02-2016.dat
HOST_NC:.,
HEAD_CNT:3
FOO_CNT:0
SEP_TXT:
START_NUM:11
SIZE_NUM:10
MASK_TXT:DD-MM-YYYY
TRUNC_COD:1

Since the data reference date is inside the file name (countries_22-02-2016.dat), we need to set the mask properly: MASK_TXT:DD-MM-YYYY. The date starts at the eleventh character (START_NUM:11) and is 10 characters long (SIZE_NUM:10).
I put 3 header lines into the data file, so HEAD_CNT:3.
Since it is a data file with fixed-length fields, leave the SEP_TXT separator information blank.
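
As a quick sanity check of these three settings, you can reproduce the extraction of the reference date directly in SQL; this only illustrates the SUBSTR/TO_DATE logic that MEF applies internally.

-- position 11, length 10, mask DD-MM-YYYY
SELECT TO_DATE(SUBSTR('countries_22-02-2016.dat', 11, 10), 'DD-MM-YYYY') AS ref_day
FROM dual;
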
The configuration of the structure of the data file is performed by opening the country.csv file under the \cft folder.
The figure shows only the columns that need to be set.



The first 4 lines set up the fields as they are present in the data file. They have a fixed length which is affected by the width of the column headers (HOST_LENGTH_NUM), but in the Staging Area table I want to have the real length, i.e. the one obtained by summing
HOST_LENGTH_NUM and STA_OFF_NUM.
We set the congruence check for all fields that do not undergo rules (CHK_FLG) and we flag the fields that make up the unique index (UI_FLG). After loading the Staging Area table, the unique index will be created. This immediately prevents duplicate keys from being present.
At line 5, we insert the extra field that is not found in the data file (its FILE_COLUMN_COD is null). Also, because we want the numeric nation code to be just a number (not a number with leading zeros), we set a transformation rule. This rule will be used in the creation of the view that the processing module uses to load the staging table (a sketch of both settings follows).
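
Here is a sketch of what these two settings translate into; the object names below are assumptions based on the naming convention of this example, while the real DDL and view are generated by MEF.

-- unique index created after the load, driven by UI_FLG (index name is hypothetical)
CREATE UNIQUE INDEX dwh_sta_iso_country_ui
   ON dwh_sta_iso_country_stt (paeiso3_cod);

-- the transformation rule on the numeric code simply removes the leading zeros,
-- i.e. a TO_NUMBER expression in the generated load view
SELECT TO_NUMBER('004') AS nation_num FROM dual;   -- returns 4
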
Now, open a shell and:

Microsoft Windows [Version 6.0.6002]
Copyright (c) 2006 Microsoft Corporation.  All rights reserved.

E:\>cd projects\dwh\mef

E:\projects\dwh\mef>sqlplus etl/etl

SQL*Plus: Release 12.1.0.1.0 Production on Wed Feb 24 10:41:25 2016

Copyright (c) 1982, 2013, Oracle.  All rights reserved.

Last Successful login time: Tue Feb 23 2016 09:28:53 +01:00

Connected to:
Oracle Database 12c Enterprise Edition Release 12.1.0.1.0 - 64bit Production
With the Partitioning, OLAP, Advanced Analytics and Real Application Testing options

SQL> @sta_conf_io COUNTRY

PL/SQL procedure successfully completed.

Disconnected from Oracle Database 12c Enterprise Edition Release 12.1.0.1.0 - 64bit Production
With the Partitioning, OLAP, Advanced Analytics and Real Application Testing options

E:\projects\dwh\mef>


Load the COUNTRY data file
Reconnect to SQL*Plus and run the load:

E:\projects\dwh\mef>sqlplus etl/etl

SQL*Plus: Release 12.1.0.1.0 Production on Wed Feb 24 11:08:01 2016

Copyright (c) 1982, 2013, Oracle.  All rights reserved.

Last Successful login time: Wed Feb 24 2016 10:41:26 +01:00

Connected to:
Oracle Database 12c Enterprise Edition Release 12.1.0.1.0 - 64bit Production
With the Partitioning, OLAP, Advanced Analytics and Real Application Testing options

SQL> exec mef_job.p_run('sta_iso_country');

PL/SQL procedure successfully completed.

SQL>


The MIC reconfiguration

We must change the structure of the MIC data file by inserting the extra fields, as we did in the fifth row of the COUNTRY configuration. Doing so is very simple, because you just copy and paste lines 3 and 1 of COUNTRY (with FILE_COLUMN_COD and UI_FLG changed to null).


After changing the configuration file of MIC, run the configuration script.

SQL> @sta_conf_io MIC

PL/SQL procedure successfully completed.

Disconnected from Oracle Database 12c Enterprise Edition Release 12.1.0.1.0 - 64bit Production
With the Partitioning, OLAP, Advanced Analytics and Real Application Testing options

E:\projects\dwh\mef>


Setting of the new fields PAE_DEB and PAEISO3_COD with the post-processing


Needless to say, in order to update these fields, the COUNTRY data file must first be loaded. This is a typical post-processing operation. I opened the temp_dwh_sta_iso_mic_ppk.sql file, present under the \sql folder, and modified it as follows.

create or replace package dwh_STA_ISO_MIC_PPK as
procedure p_main;
end;
/
create or replace package body dwh_STA_ISO_MIC_PPK as
pv_pkg varchar2(30) := 'dwh_STA_ISO_MIC_PPK.';
pv_error   exception;
pragma exception_init (pv_error, -20058);

procedure p_upd_iso is
   v_mod varchar2(61) := pv_pkg||'p_upd_iso';
   v_count number;
   v_upd number;
   v_msg varchar2(500); 
begin
   select count(*) into v_count
   from dwh_sta_iso_mic_stt;

   update dwh_sta_iso_mic_stt d
   set (paeiso3_cod,pae_deb) = (
       select paeiso3_cod,pae_deb
       from dwh_sta_iso_country_stt s
       where d.paeiso2_cod = s.paeiso2_cod
       )
   where exists (
      select paeiso3_cod,pae_deb
      from dwh_sta_iso_country_stt s
      where d.paeiso2_cod = s.paeiso2_cod
      );
   v_upd := sql%rowcount;
   v_msg := 'Updated '||v_upd||' countries on '||v_count;  
   if (v_upd <> v_count) then
      v_msg := v_msg||chr(10)||'Attention ! Some contries ('||
      to_char(v_count-v_upd)||') without description and ISO3 code';
   end if;
   mef.p_send(v_mod,v_msg);
   mef.p_mail('INFO','Staging Area Informational ('||v_mod||')',v_msg);
exception
   when pv_error then raise;
   when others then mef.p_rae(sqlerrm,v_mod); 
end;

procedure p_main is
   v_mod varchar2(61) := pv_pkg||'p_main';
begin
   if (mef_job.pv_wr.repeat_cnt > 0) then
      mef.p_send(v_mod,'Start');
      p_upd_iso;
      mef.p_send(v_mod,'End');
   end if;
exception
   when others then mef_unit.p_exc(v_mod,sqlerrm);
end;
end;
/


I closed the file and renamed it, removing the initial "temp_".
I have already done this for you: you only have to download the file, move it under the \sql folder, and compile it.

E:\projects\dwh\mef>sqlplus etl/etl

SQL*Plus: Release 12.1.0.1.0 Production on Wed Feb 24 11:16:35 2016

Copyright (c) 1982, 2013, Oracle.  All rights reserved.

Last Successful login time: Wed Feb 24 2016 11:08:01 +01:00

Connected to:
Oracle Database 12c Enterprise Edition Release 12.1.0.1.0 - 64bit Production
With the Partitioning, OLAP, Advanced Analytics and Real Application Testing options

SQL> @..\sql\dwh_sta_iso_mic_ppk.sql

Package created.


Package body created.

SQL>


As you can see, the code performs a simple update of the Staging Area table, taking the values from the dwh_sta_iso_country_stt table. Because we want to be notified if there are nations for which no description was found, add a new email code in the MEF_EMAIL_CFT configuration table, which will be used by the mef.p_mail call. From SQL*Plus run:

Insert into MEF_EMAIL_CFT (EMAIL_COD, FROM_TXT, TO_TXT, CC_TXT, SUBJ_TXT, STATUS_COD)
Values ('INFO', 'mef@microetlfoundation.com', 'massimocenci@outlook.com', 'massimo_cenci@yahoo.it', null,1);
COMMIT;

Of course, replace the email addresses with your own. We can now reload the MIC file. Enter SQL*Plus and run:

SQL> exec mef_job.p_run('sta_iso_mic');

PL/SQL procedure successfully completed.

SQL>

 
We see (in this image fragment) that the MIC Staging table has new values in the added columns.






 Since there are some cases of country codes not found, we received the informational email:


Wednesday, February 24, 2016

Designing, building, loading and managing a Staging Area for an Oracle Data Warehouse. In 20 minutes.



I briefly summarize what I have described in the presentation of the same name on Slideshare.

In recent years, I have tried to explain and share a vision of the Data Warehouse that I have called the Micro ETL Foundation (MEF). MEF was born as a set of ideas, tips and methods for successfully implementing a project of this type.

The main guideline was simplicity. This was the most difficult point to maintain, because a Data Warehouse and Business Intelligence project is extraordinarily complex. I have shown in various articles that even small details, such as the descriptions of the codes or the management of null values, can weigh heavily on the final outcome of the project. In fact, in my opinion, the real complexity lies in being able to simplify the complexity.

It is now time to collect everything I have written and put it together organically, so that it can be used in real life, giving a consistent shape to the Micro ETL Foundation and transforming it into an ecosystem that everyone can use in a simple and free way.
The implementation was carried out using the internal language of the database (PL/SQL) and it is addressed to an Oracle Data Warehouse built in a Windows (and Linux) environment, but the underlying philosophy can easily be adapted to any RDBMS and any Operating System.

The use of a well-defined Naming Convention has been decisive for the creation of the models used to automatically generate all the objects that constitute the system. We can therefore define MEF as a "Naming Convention-driven" ecosystem.

The focus will be on the Staging Area because, for me, it is one of the main components of a Data Warehouse, the basis (according to the Kimball approach) on which we can build the next components, i.e. the Dimension and Fact tables. The example described is the load of a simple file in csv format.

The MEF ecosystem is downloadable at https://drive.google.com/open?id=0B2dQ0EtjqAOTQzZSaUlyUmxpT1k
The use case is described on Slideshare:
http://www.slideshare.net/jackbim/recipe-14-of-data-warehouse-and-business-intelligence-build-a-staging-area-for-an-oracle-data-warehouse-1?related=1
http://www.slideshare.net/jackbim/recipe-14-build-a-staging-area-for-an-oracle-data-warehouse-1

Now I will describe how to use it with other practical examples.

Saturday, March 7, 2015

How to identify and control the reference day of a data file

Introduction
In a presentation some time ago [1], I highlighted the importance of the check between the reference day and the expected day of a data file received from the host system. Given the importance of the topic, I would like to go deeper into both the meaning of the check and the identification of the reference day of the data.
This topic may seem strange, or too technical, mainly for those who are not very experienced with Data Warehouses, or for those at their first experience. I'll try to simplify the concept as much as possible, because it is a very useful example for understanding the pitfalls inside the ETL process.

Example
To understand the problem, let's take a similar example from practical life. Suppose that every morning you receive from your bank, via email, the list of the operations of your checking account.
So, you receive daily an Excel spreadsheet with the movements of the previous day. The reference day of the operations is located in the subject of the email, for example: "Operations of the 03/12/2015".
Obviously you do not check the email from the bank every morning: you have created an automatic mechanism that opens the email and loads the daily operations, with the day indicated in the subject of the email, into the database of your household expenses. Only at the start of the month do you analyze the previous month.
The questions are:
What guarantee do you have that the email arrives every day?
What guarantee do you have that the day in the subject of the email always refers to the day before?

The importance of the control
If we are not able to answer the questions above, we have no guarantee about the validity of our analysis.
This is the reason why the check is important. That's why we have to be absolutely sure that the day we are loading is exactly the one we expected. And, before that, we have to verify that what we expected has actually arrived. We should not assume that everything runs smoothly, because it is always possible that an anomaly occurs, both in the source systems and in our systems.
Do not forget that, when a problem arises, we, the Data Warehouse team, are always the first people asked for an explanation. In practice, in the eyes of the end users, we are the cause of the problem: the ones who were supposed to check and did not. Here are two examples that may occur (and in fact have occurred to me in the past).

Problem 1
In the source system there has been some revision or change in the extraction plans (no one is obliged to alert you), so an error arises in the data filter. Consequence: the data file is generated correctly every day, but the reference day inside the flow is wrong. We don't know the cause. The reference day may remain fixed, lag behind by a day, be equal to the system date, or refer to a field different from the one used in the previous data file. Only by checking the reference day against the expected day do we realize the error.

Problem 2
The ETL process, via ftp, copies the file from the host system folder to a local folder in order to process it. Suppose that the host folder, which is also used by other systems, for reasons of maintenance or space, is no longer used starting from a certain day and is replaced by another folder. Nobody warns you, the old folder is still visible, and the Data Warehouse, the day after the switch, keeps feeding itself from the old folder. In this case too, having an immediate check of the reference day is crucial.
Moreover, if the data you are erroneously loading (always for the same day) are mixed with other data that do change daily, discovering this kind of error is not immediate and may take several days.

The solution
The solution proposed to check the congruence between the reference day and the expected one was very minimal, following the rules of an "agile" vision for the construction of a Data Warehouse [2]. A metadata table and a log table were sufficient. We can represent its logic graphically in the following figure. In practice, we can use a configuration table of "calendar" type, in which each calendar day is associated with the expected day of the daily data source and with the reference day actually loaded.
A log table shows, at the end of the daily load, the congruence of the two days. The physical names are constructed according to my naming convention standard.




The basic elements
The elements involved in this check are therefore three. The first is the calendar day. The second is the expected day of the daily flow. We must configure it on the basis of the flow analysis: we know its logic, so we can easily configure the expected day. For example, if the data file is not expected at the weekend, the Saturdays and Sundays of the calendar will not be set as expected days (see the sketch below).
The third element in play is the reference day inside the data file, that is, the day to be compared with the expected one to verify the congruence.
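
As an illustration only (the calendar table and column names here are hypothetical, not the actual MEF ones), the weekend rule could be configured like this:

-- the flow identified by 'MYFLOW' is not expected on Saturdays and Sundays
UPDATE mef_calendar_cft
SET    awaited_flg = 0
WHERE  io_cod = 'MYFLOW'
AND    TO_CHAR(day_cod, 'DY', 'NLS_DATE_LANGUAGE=ENGLISH') IN ('SAT', 'SUN');
COMMIT;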

The reference day of the data
Now let's focus on the third element, the reference day of the data. Where is this information?
It would be nice to have a rule that the day is always present directly as a column of the file. This should be a rule, an underlying condition.
Unfortunately in a Data Warehouse, as usual, nothing is simple, and the reference day, most of the time, is not a column of the data file. The rule we have described can be applied only if we are lucky enough to be able to decide the structure of the source files ourselves. Most often, though, we will have to reuse (to save money, of course) data files that already exist, created with different rules. Let's see the situations we will surely face.

  • The day is a column of the data file
  • The day is in the header of the data file
  • The day is in the tail of the data file
  • The day is in the data file name
  • The day is not there at all.




The day is a column of the data file
This is the simplest and most easily manageable case. The day is a column of the data file: for example, a recording date, the execution date of an order, a fiscal date. Typically you find this situation in the data files that will become, at the end of the ETL process, the fact tables.

The day is in the header/tail of the data file
Typically you find this situation in standardized flows, already in place, generated on the mainframe by programs written in Cobol. These files can have several lines in the header/tail with a lot of information, including the reference day. Actually, what is usually there is the generation date of the data file. However, if the data are of a master-data type (like the attributes of the customers), we can assume it is the same as the reference day.
Think of a customer data file. If the data extraction starts at 22:00 of day 2015/03/31, we may assume that it freezes the daily situation of the customers at 2015/03/31; therefore the extraction day is the same as the reference day. Beware, however, of the cases in which the extraction starts after midnight. In that case, the reference day is equal to the extraction day minus 1 day.

The reference day is inside the file name
This is a fairly common event, typical above all for data files in csv format.

The reference day is missing
We must also consider this case. Sometimes the data file is always generated with the same name (which excludes the case above) and it may have a header/tail, but without the information that interests us. We can only rely on the system date of the Data Warehouse server.

The importance of the metadata
Let us now see how to provide a method to handle these situations. It is easily implemented using only a bit of the database language (for example Oracle PL/SQL, but these techniques can easily be adapted to any database) and a bit of metadata.
Metadata, a very sophisticated and intriguing logical term that hides what we trivially call configuration tables, are essential to manage the cases we have described. The metadata needed to identify and extract the reference day are:



  • ref_day_cod = here we put, if there is one, the name of the column that contains the reference day.
  • row_num = with this number we denote the offset of the line, from the beginning of the data file, in which we find the reference day if it is in the header. The number will be negative if the reference day is in the tail.
  • start_num = with this number we indicate, within the row identified by row_num, the position of the starting character of the reference day.
  • size_num = with this number we indicate the size of the day: if it is in the YYYYMMDD format it will be 8, in the DD-MM-YYYY format it is 10. Avoid months in text format.
  • for_txt = with this string we indicate the date format of the reference day. For example 'yyyy-mm-dd'.
  • off_num = with this number we indicate the offset (in days) of the reference date. This information is useful when the generation of the flow and its processing are performed on different days. Let's take an example. Today is day 14 and the processing of the data file starts. The flow was generated yesterday at 22:00, day 13 is in the header, and we assume this is the reference day of the data. In this case off_num must be equal to zero. Suppose now that the generation of the data file takes place after midnight, so that we find day 14 in the header. In this case off_num must be set to -1, because even if the generation and processing of the flow occurred after midnight, the master data certainly refer to the situation of day 13. This, in brief, is why this metadata is needed.

Now let's complete the previous figure with an example of the corresponding data files.



How to use the metadata
Defining the metadata, placing it in a configuration table of all the data files and filling in its content is not a "passive" activity. That is, it not only serves to document the structure of the data files, but it is also "active" information, because it can be used in the ETL programs.
Let's take a very simple concrete example. A data file is only a sequence of lines structured in data columns. We can represent it as:



The first thing to do in order to load the data file into the Staging Area is to build an external table that "sees" the data file, with the same columns plus the technical field <IO>_row_cnt, which simply numbers each row. <IO> is the input file code.
The Staging Area table must obviously have the same columns as the external table, plus the reference day DAY_COD. If the day is already in the data file, it does not matter: always insert an additional DAY_COD column. At this point, you can create a function that, given the IO_COD as input, uses the metadata table and extracts the value of the reference day to be loaded into the DAY_COD of the Staging Area table (a sketch follows).
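
A minimal sketch of such a function, assuming a hypothetical metadata table mef_day_cft with the columns described above, a hypothetical external table named <IO>_ext, and two hypothetical helpers that return the raw header line and the file name (the real MEF objects may differ):

create or replace function f_get_ref_day(p_io_cod varchar2) return date is
   v_cfg mef_day_cft%rowtype;   -- hypothetical metadata table
   v_raw varchar2(4000);
   v_day date;
begin
   select * into v_cfg from mef_day_cft where io_cod = p_io_cod;
   if (v_cfg.ref_day_cod is not null) then
      -- case 1: the day is a column of the data file, read it from the external table
      execute immediate
         'select to_date(max('||v_cfg.ref_day_cod||'),'''||v_cfg.for_txt||''')'||
         ' from '||p_io_cod||'_ext' into v_day;
   elsif (v_cfg.row_num is not null) then
      -- cases 2/3: the day is in the header (or tail) of the data file
      v_raw := f_get_raw_line(p_io_cod, v_cfg.row_num);   -- hypothetical helper
      v_day := to_date(substr(v_raw, v_cfg.start_num, v_cfg.size_num), v_cfg.for_txt);
   elsif (v_cfg.start_num is not null) then
      -- case 4: the day is inside the data file name
      v_raw := f_get_file_name(p_io_cod);                 -- hypothetical helper
      v_day := to_date(substr(v_raw, v_cfg.start_num, v_cfg.size_num), v_cfg.for_txt);
   else
      -- case 5: the day is missing, fall back to the system date
      v_day := trunc(sysdate);
   end if;
   return v_day + nvl(v_cfg.off_num, 0);
end;
/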



Conclusion
I have already had occasion to emphasize the importance of the reference day of the data. A mistake in its identification may jeopardize the validity of the whole Data Warehouse. Do not forget that:
the versioning policies of the dimension tables (slowly changing dimensions of type 2) are based on the reference day;
the fact tables are partitioned using the reference day. An error in the initial phase can be very costly in the following stages of the project. So, beyond the technical solution, the identification and control of the reference day is a very important issue in a Data Warehouse project.
The management of the reference day, just like the management of NULL values [3] and the description of the codes [4], about which I have already spoken at length, may seem an unimportant topic of little interest.
But ignoring it is very dangerous, and may be the cause of many failures in the worst case, or of heavy delays in the deployment in the best case.

References
[2] http://www.slideshare.net/jackbim/recipe-11-agile-data-warehouse-and-business-intelligence






Wednesday, December 10, 2014

The Descriptions Management

Introduction

The management of descriptions is a subject little discussed within the Data Warehouse and Business Intelligence community. Books and articles say little about it, although it is a crucial issue because it affects the way we see the information.
So, we'll talk about descriptions; in particular, the descriptions of the codes that we can find in the dimension tables of a Data Warehouse.
I think it's important to show the complexity hidden behind what seems trivial. And everyone (end users, project managers, data architects and the other stakeholders) needs to know that the management of descriptions is a particularly tricky topic.

A real life example

The best way to introduce this topic is to start with a real example. Suppose we observe an end user who is using a Business Intelligence interface.
The user needs to extract numerical data, for example the movements of an account, and he wants to filter only those customers who have a particular profession, for example only employees.
In order to meet this requirement, the Data Warehouse team has just prepared a menu with a list of values (that is, descriptions) from which the end user can choose the line "Employee". The following figure shows this case.




This figure shows the visual world. Now we enter the world of descriptions management, with a number of considerations describing the logical and physical world that hides behind the little word "just".

The vision of the end-user

The description "Employee" associated with the code "10" is a description wanted by the end user. It is very important to show a description that is in tune with the daily nomenclature of the end users. This need has a very important consequence.
It is very unlikely that the code description, which comes from the source systems (we call it "external"), it is the same description of the end user, in fact, usually, the description will be different.
This is because the description is present in the OLTP systems, not in the synthesis systems. So you need to always have in the Data Warehouse two descriptions. That of OLTP systems, let's call it "External description" and the final one, let's call it "Business description". To have both is important because it is the only link between two systems that have different purposes.

The extended description

As seen in the example, the list of choices must show not only the description but also the code. Often this does not happen, and only the description is shown. My suggestion is to show them both.
We define this new description as "extended description", and it is obtained by concatenating the value of the code, a "-" (minus sign) and the description of the value. It is useful to do this because end users are not all alike.
Some users think in descriptions, but others think in codes. Often the end users have worked so many years in a certain environment that it is easier for them to find the customers with profession "10" rather than those with profession "Employee". Moreover, descriptions may vary over time, so the presence of the code is undoubtedly safer. The best thing is to show both code and description, to satisfy all types of users. So we have "10 - Employee".

Since we have an external description and a business description, we have to decide which of the two to use to build the extended description.
In general, it is appropriate to use a single description, namely the business description, but having two extended descriptions can also be an acceptable choice.
Let's now look at the physical world behind the list of values displayed to the end user, because, as we know, logical considerations always have physical consequences.

How to see the extended descriptions

When a user requests a list of values on the Business Intelligence interface, the display of the list is almost always the result of a query generated on the Data Warehouse database. It is very likely that a "select distinct" of the values of the extended description is run against the dimension table.
The extended description can be implemented without taking up space, using virtual columns. Virtual columns are available in Oracle, but you can always create views that build those columns. We will show later how to define one.
Here is a very simplified example of the columns of a possible customers dimension table, which will be the base of the list of values:

  • CUST_KEY - Artificial key associated with the customer
  • CUST_COD - Natural key (customer code)
  • PROF_COD - Profession code
  • PROF_HDS - External Description (OLTP description)
  • PROF_BDS - Business description
  • PROF_XDS - Extended description. Obtained by concatenating the code with the business description

Where to find the extended descriptions

We have seen that a simple "click" on a list of choices hides various considerations. But the most surprising thing is that the above is just the tip of the iceberg. In this case, the metaphor is very apt: the tip is what the end user sees. But what happens below the surface? Is it so trivial to get the descriptions that come (maybe) from the source systems and insert them into the dimension tables? As always happens in a Data Warehouse, nothing is trivial and everything is more complicated than it seems. Indeed, in my experience, code descriptions can be located in seven different places.
  1. The code description is already in the staging table (which will become the dimension table)
  2. The code description is present in a table of descriptions, exclusively for that code.
  3. The code description is present in a table of descriptions for all of the codes (or nearly all).
  4. The code description is present in an Excel/csv sheet.
  5. The code description is present in a Word, pdf or xml document.
  6. The code description is in an email.
  7. The code description is hard-coded in the OLTP forms, so, after a phone call to some contact person of the OLTP systems, we fall back to point 4, 5 or 6.

Descriptions as metadata

Regardless of where the descriptions are and how we will acquire them, how can we manage the descriptions in our Data Warehouse?
My advice is to always have a metadata table like that of point 3 and to reduce all the other cases to it. But to do so, we need to introduce the concept of a domain of values. The concept is very simple: just give a name to the set of values associated with a certain code. For example, if we know that the set of possible values of the profession code is:

10 = Employee
20 = Free Lance
30 = Tradesman

Associate the "PROF" name to this domain of values  and insert this data into a metadata table that we call MEF_DOM_CFT (MEF Domain Configuration). The table can have a very simple structure with the following information:

  • DOM_COD  - Domain code
  • COD_TXT  - Value of the code
  • HDS_TXT  - External description
  • BDS_TXT  - Business description

The Naming Convention of the descriptions

We adopt, as always, a naming convention. I have already written a lot about this issue on my blog; in this case we assume that all the descriptive code fields, i.e. codes for which it makes sense to have a description, must be named *_COD, the external descriptions *_HDS, the business descriptions *_BDS, and the extended descriptions *_XDS.

Anyone with Data Warehouse experience knows that, unfortunately, if we wait for the end of the analysis of the domains we lose a lot of time, so we must act preventively. My advice is to fill the domains table automatically with the values of the codes found in the staging tables, giving them a unique name and a default description such as "nd" (not defined). In a second step we can act on the generic domains by changing their default name. The default domain name we will adopt is <source>.<column name>.

Other considerations

  • Descriptions are names that rarely vary. It would be a waste of time to set up a daily load/update of the domains table. Do an initial massive load and manage any changes and additions when necessary.

  • Often, in the source files, you get both short and long descriptions. Obviously, if they are present in the source you should load them, but you should always use the short description as the "external description". The trend nowadays is to shorten information (see the language of SMS messages), so, in my opinion, you can ignore long descriptions, which can create sizing problems on graphical interfaces (PC, tablet and smartphone).
  • The business description is necessarily manual. It will be the end user who decides which description he prefers to see. The initial loading, however, must initialize the business description to the external description if, and only if, the business description is empty. Then, from time to time, we will apply the changes.
  • Another point that should not be overlooked is the following: how reliable are the codes and descriptions that we find in points 2-6? As often happens, the documentation is not always up to date, so we risk not having the full domain of the code. On this point we have to be proactive, in the sense that we cannot generically ask for an update of the documentation of the source systems. We must be the ones to ask precisely for the descriptions of the codes that we do not find in the documentation.

The importance of the domains

Finally, the main reason for the existence of the domains table: the reuse of the descriptions. In the tables of a Data Warehouse the same domain can be applied to many columns.
They have different names, but refer to the same domain of values. For example, a currency code can be present in a table with the names CURTRAT (currency of treatment) and CURTRA (currency of trading), but both must have the same domain, let's call it "CUR" (see the sketch below).
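
Anticipating the column configuration table introduced in stage 2 below (MEF_STA2_CFT), the mapping of both columns to the same domain could look like this (the stream code 'TRADE' and the column names are hypothetical):

INSERT INTO MEF_STA2_CFT VALUES ('TRADE','CURTRAT_COD','CUR');
INSERT INTO MEF_STA2_CFT VALUES ('TRADE','CURTRA_COD' ,'CUR');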

The descriptions management - Stage 1

At this point, we explore the descriptions management with a real simulation, obviously very simplified, after creating the simulation environment. The work plan is as follows.
  1. Create the environment
  2. Load the domains table
  3. Load the staging table
  4. Update the domains table on the basis of the codes that we find in the table of Staging Area
  5. Load the dimension table
  6. Update the dimension table with descriptions

Create the environment

We prepare a simulation environment, creating the necessary tables with only the information we need: the Staging Area table, the dimension table and the domains table. As we can see, the dimension table contains both the external description and the business description. The extended description is built using the virtual column syntax.

DROP TABLE CUST_STA CASCADE CONSTRAINTS;
CREATE TABLE CUST_STA (
  CUST_COD VARCHAR2(9),
  PROF_COD VARCHAR2(30)
);

DROP TABLE MEF_DOM_CFT CASCADE CONSTRAINTS;
CREATE TABLE MEF_DOM_CFT (
  DOM_COD  VARCHAR2(61) NOT NULL
  ,COD_TXT  VARCHAR2(60)
  ,HDS_TXT  VARCHAR2(255)
  ,BDS_TXT  VARCHAR2(255)
);

DROP TABLE CUST_DIT CASCADE CONSTRAINTS;
CREATE TABLE CUST_DIT (
  CUST_KEY NUMBER
  ,CUST_COD VARCHAR2(9)
  ,PROF_COD VARCHAR2(30)
  ,PROF_HDS VARCHAR2(60)
  ,PROF_BDS VARCHAR2(60)
  ,PROF_XDS VARCHAR2(93) GENERATED ALWAYS
     AS (PROF_COD||' - '||PROF_BDS) VIRTUAL
);

Load the domains table

The loading of this table will necessarily be a mix of automatic and manual loading. We refer to the 7 situations seen before. If the domains are already present in a specific table or in a common table, a single SQL statement is enough to load them. If they are in an Excel spreadsheet, convert it to a CSV file, point to it with an Oracle external table and then load them with a SQL statement. Otherwise, you have to load them manually one by one. Suppose we are in the latter case and load the domain of the profession code. Note the difference between external description and business description: if, as often happens, the feeding system is a mainframe, the descriptions will all arrive in uppercase and will not be very user-friendly for management reporting.

INSERT INTO MEF_DOM_CFT VALUES ('PROF','10','EMPL.'    ,'Employee');
INSERT INTO MEF_DOM_CFT VALUES ('PROF','20','FREE L.'  ,'Free Lance');
INSERT INTO MEF_DOM_CFT VALUES ('PROF','30','TRADESMAN','Tradesman');
COMMIT;
 

After the insertion, the contents of the domains table will therefore be:



Obviously, if we have not yet considered the analysis of the domains or we are still waiting for documentation, this step can be ignored. The table remains empty and will be loaded automatically in a second step.

Load the staging table

The staging table loading is a job that will be part of the ETL process. For simplicity we simulate its load with the following SQL statements. The '?' as a code value substitutes NULL. From the methodological point of view, I consider it important to avoid NULL values in the Data Warehouse and to replace them immediately in the Staging Area with a default value; usually I use the question mark (for a detailed discussion refer to http://www.slideshare.net/jackbim/recipes-5-of-data-warehouse-the-null-values-management-in-the-etl-process).

INSERT INTO CUST_STA VALUES ('Larry','10');
INSERT INTO CUST_STA VALUES ('Tom'  ,'15');
INSERT INTO CUST_STA VALUES ('Brown','20');
INSERT INTO CUST_STA VALUES ('Steve','40');
INSERT INTO CUST_STA VALUES ('Jack' ,'?');
COMMIT;


After the insertion, the content of the staging table will be:




Update the domains table

As you can see from the Staging Area contents, I purposely simulated a situation where the domain is not up to date; in fact, not all the values that could arrive are present (the "15", the "40" and the "?" are missing). In order to keep the domains table updated, we run a procedure that can be executed as post-processing of the Staging Area loading. The procedure is the following.

create or replace procedure p_dom_prof(p_dom_cod varchar2) is
begin
   for r in (select distinct p_dom_cod dom_cod
             ,a.prof_cod
             ,b.cod_txt
             ,nvl(b.hds_txt,'nd') hds_txt
             ,nvl(b.bds_txt,'nd') bds_txt
             from cust_sta a
             left outer join mef_dom_cft b
             on (a.prof_cod = b.cod_txt
             and b.dom_cod = p_dom_cod)) loop
      if (r.cod_txt is null) then
         insert into mef_dom_cft (dom_cod,cod_txt,hds_txt,bds_txt)
         values(p_dom_cod,r.prof_cod,r.hds_txt,r.bds_txt);
      end if;
   end loop;
   commit;
end;
/
sho errors
exec p_dom_prof('PROF');


The procedure performs a simple loop on the values of the profession code in the staging table. The link with the domains table is made using the domain name received as input. As we can see, the procedure fills the domains table even if the table is initially empty (thanks to the outer join). It also always acts as an insert, so it does not alter or delete descriptions that have been entered manually.
The procedure also inserts the missing codes with the default description "nd". If we now query the domains table, we will get:




Load the dimension table

The loading of the dimension table is one of the most challenging and delicate components of the entire ETL process. In my experience, almost all dimension tables should take into account the "Slowly Changing Dimension" of type 2, which I will not go into detail about now. Precisely because of this complexity, I preferred not to add complexity to complexity: I prefer to manage the descriptions as post-processing of the loading of the dimension tables. So, first I load the dimension table by taking the data from the staging table, then I update the descriptive components. We load the dimension table in a very simple way with the following SQL statement:

INSERT INTO CUST_DIT(CUST_KEY,CUST_COD,PROF_COD)
SELECT ROWNUM,CUST_COD,PROF_COD
FROM CUST_STA;
COMMIT;


Situation after loading:




Update the dimension table

To update the dimension table, we create and execute the following procedure.

create or replace procedure p_dit_prof(p_dom_cod varchar2) as
begin
     for r in (select distinct prof_cod from cust_dit) loop
      update cust_dit d
      set (prof_hds,prof_bds) = (select hds_txt,bds_txt
         from mef_dom_cft s
         where d.prof_cod = s.cod_txt
         and s.dom_cod = p_dom_cod)
      where prof_cod = r.prof_cod
      and exists (select 1
         from mef_dom_cft s
         where d.prof_cod = s.cod_txt
         and s.dom_cod = p_dom_cod);
     end loop;
   commit;
end;
/
sho errors
exec p_dit_prof('PROF');


The logic of the procedure is very simple. For each value of the profession code, it updates the description taken from the domains table. If we now query the dimension table we will get:





After the nightly loading of the Data Warehouse, we can query the domains table (or the dimension table), and every situation in which the description "nd" appears must be brought to the attention of the people in charge of data quality control. In practice, for each code that is "nd", we must send a request for a description to the feeding systems (see the query below).
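
A simple query on the domains table defined above is enough to produce the list of codes still waiting for a description:

SELECT dom_cod, cod_txt
FROM   mef_dom_cft
WHERE  bds_txt = 'nd'
ORDER  BY dom_cod, cod_txt;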

The descriptions management - Stage 2

Phase 1 is helpful to make us aware of the problems and difficulties associated with the management of descriptions. The solution shown, although functional, is practically unusable in a real Data Warehouse.
The procedures that update the domains table and the dimension table are specific to the profession code, and such procedures would have to be repeated, with appropriate modifications, for all the other codes that require a description. In a Data Warehouse we can have dozens of analytical dimensions, each with dozens of descriptive codes. It is unthinkable to write hundreds of procedures for the management of descriptions.
So let's see how to implement a single procedure that handles everything automatically. To achieve this result, we must necessarily get help from some configuration tables: in particular, a configuration table of the input streams, which tells me the name of the staging table and of the dimension table, and a configuration table of all the columns that need a description.

Environment creation

We enrich the staging table assuming the presence of three codes. The first, already seen in the previous phase, is the profession code. The second is the status code of the customer. The third is the type code of the customer. We also adapt the dimension table with the new codes.

DROP TABLE CUST2_STA CASCADE CONSTRAINTS;
CREATE TABLE CUST2_STA (
  CUST2_COD VARCHAR2(9)
  ,PROF_COD VARCHAR2(30)
  ,STATUS_COD VARCHAR2(1)
  ,TYPE_COD VARCHAR2(1)
);

DROP TABLE CUST2_DIT CASCADE CONSTRAINTS;
CREATE TABLE CUST2_DIT (
  CUST2_KEY NUMBER
  ,CUST2_COD VARCHAR2(9)
  ,PROF_COD VARCHAR2(30)
  ,PROF_BDS VARCHAR2(60)
  ,PROF_XDS VARCHAR2(93) GENERATED ALWAYS
      AS (PROF_COD||' - '||PROF_BDS) VIRTUAL
  ,STATUS_COD VARCHAR2(1)
  ,STATUS_BDS VARCHAR2(15)
  ,STATUS_XDS VARCHAR2(19) GENERATED ALWAYS
     AS (STATUS_COD||' - '||STATUS_BDS) VIRTUAL
  ,TYPE_COD VARCHAR2(1)
  ,TYPE_BDS VARCHAR2(15)
  ,TYPE_XDS VARCHAR2(19) GENERATED ALWAYS
     AS (TYPE_COD||' - '||TYPE_BDS) VIRTUAL
);


We now introduce two new configuration tables: a configuration table of the input streams (MEF_IO2_CFT) and a configuration table of the columns of the staging tables (MEF_STA2_CFT), which will become columns of the dimension table. The MEF_IO2_CFT table contains, for each stream identifier (IO_COD), the name of the staging table (STT_COD) and the name of the dimension table (DIT_COD). We insert one configuration line.
The MEF_STA2_CFT table contains, for each stream identifier (IO_COD), the name of the code column (COLUMN_COD) and the associated domain name (DOM_COD). This configuration table is very important, because it is the only place where we define, for each code, the name of its domain. This name will allow access to the domains table to find the descriptions. For our example, we insert three lines, one for each code, assuming that the customer type has not yet been taken into account in the domain analysis, so its domain code remains NULL.

DROP TABLE MEF_IO2_CFT CASCADE CONSTRAINTS;
CREATE TABLE MEF_IO2_CFT (
  IO_COD        VARCHAR2(9)
  ,STT_COD VARCHAR2(30)
  ,DIT_COD VARCHAR2(30)
);
INSERT INTO MEF_IO2_CFT VALUES ('CUST2','CUST2_STA','CUST2_DIT');

DROP TABLE MEF_STA2_CFT CASCADE CONSTRAINTS;
CREATE TABLE MEF_STA2_CFT (
  IO_COD      VARCHAR2(9)
  ,COLUMN_COD VARCHAR2(30)
  ,DOM_COD    VARCHAR2(30)
);
INSERT INTO MEF_STA2_CFT VALUES ('CUST2','PROF_COD','PROF');
INSERT INTO MEF_STA2_CFT VALUES ('CUST2','STATUS_COD','STATUS');
INSERT INTO MEF_STA2_CFT VALUES ('CUST2','TYPE_COD',null);

Load the domains table

Recreate and load this table, adding to the domain already seen the one of the customer status:

DROP TABLE MEF_DOM2_CFT CASCADE CONSTRAINTS;
CREATE TABLE MEF_DOM2_CFT (
  DOM_COD  VARCHAR2(61) NOT NULL
  ,COD_TXT  VARCHAR2(60)
  ,HDS_TXT  VARCHAR2(255) NOT NULL
  ,BDS_TXT  VARCHAR2(255) NOT NULL
);
INSERT INTO MEF_DOM2_CFT VALUES ('PROF','10','EMPL.'    ,'Employee');
INSERT INTO MEF_DOM2_CFT VALUES ('PROF','20','FREE L.'  ,'Free Lance');
INSERT INTO MEF_DOM2_CFT VALUES ('PROF','30','TRADESMAN','Tradesman');
INSERT INTO MEF_DOM2_CFT VALUES ('STATUS','0','NO','Not active');
INSERT INTO MEF_DOM2_CFT VALUES ('STATUS','1','YES'  ,'Active');
COMMIT;



The table content will be the following:




Load the staging table

The loading of this table will be similar to what has been seen.

INSERT INTO CUST2_STA VALUES ('Larry','10','0','P');
INSERT INTO CUST2_STA VALUES ('Tom'  ,'15','1','P');
INSERT INTO CUST2_STA VALUES ('ACME' ,'20','1','L');
INSERT INTO CUST2_STA VALUES ('Steve','40','?','P');
INSERT INTO CUST2_STA VALUES ('Jack' ,'?','1','P');
COMMIT;

Update the domains table

We will use a single procedure which, using the information in the configuration tables, dynamically creates the SQL statement that does the update. Furthermore, in case some domains are not defined, it automatically associates the default domain code as already described. Because it can sometimes be useful to call a field "*_COD" even though it makes no sense to associate a description with it, we assume that, if we set the domain code to the word "NULL", the field will not be taken into consideration in the treatment of the domains.

create or replace procedure p_dom2(p_io_cod varchar2 default '%'
   ,p_column_cod varchar2 default '%') is
v_sql varchar2(4000);
v_dom_cod varchar2(61);
v_hds_txt varchar2(200);
v_bds_txt varchar2(200);
v_cod_txt varchar2(200);
v_column_cod varchar2(200);
type t_rc is ref cursor;
v_cur t_rc;
cr char(1) := chr(10);
begin
     for r in (select a.STT_COD
               ,b.column_cod
             ,nvl(b.dom_cod,a.io_cod||'.'||b.column_cod) dom_cod
               from mef_io2_cft a
               inner join mef_sta2_cft b
               on (a.io_cod = b.io_cod
               and b.column_cod like p_column_cod
               and b.column_cod like '%_COD'
               and nvl(b.dom_cod,'*') <> 'NULL')
               where a.io_cod like p_io_cod) loop
         v_sql := 'select distinct '||''''||r.dom_cod||''''||cr||
                  '   ,a.'||r.column_cod||cr||
                  '   ,b.cod_txt'||cr||
                  '   ,nvl(b.hds_txt,''nd'') hds_txt '||cr||
                  '   ,nvl(b.bds_txt,''nd'') bds_txt '||cr||
                  'from '||r.STT_COD||' a '||cr||
                  'left outer join MEF_DOM2_CFT b '||cr||
                  'on (a.'||r.column_cod||' = b.cod_txt '||cr||
                  '    and dom_cod='||''''||r.dom_cod||''''||')';
         dbms_output.put_line(v_sql);
         open v_cur for v_sql;
         loop
            fetch v_cur into v_dom_cod,v_column_cod,v_cod_txt,v_hds_txt,v_bds_txt;
            exit when v_cur%notfound;
            if (v_cod_txt is null) then
               insert into MEF_DOM2_CFT (dom_cod,cod_txt,hds_txt,bds_txt)
               values(v_dom_cod,v_column_cod,v_hds_txt,v_bds_txt);
            end if;
         end loop;
         close v_cur;
     end loop;
   commit;
end;
/
sho errors
set serveroutput on
exec p_dom2('CUST2');


Through the command "set serveroutput on" we can see an example of the SQL statement that is generated automatically for each source / column.





After the procedure execution, the content of the table will be the following. Note the power of this procedure: even if we have not yet thought about the domains management, if you call it without parameters it will be able to initialize the domains table for all the code fields (that have been configured) throughout the Data Warehouse.



Load the dimension table

The loading is similar to that of stage 1

INSERT INTO CUST2_DIT(CUST2_KEY,CUST2_COD,PROF_COD,STATUS_COD,TYPE_COD)
SELECT ROWNUM,CUST2_COD,PROF_COD,STATUS_COD,TYPE_COD
FROM CUST2_STA;
COMMIT;


We see now the content of the dimension table before the next update.


Update the dimension table

Also in this case we have to use a dynamic procedure.

create or replace procedure p_dit2(
   p_DIT_COD varchar2 default '%'
   ,p_col varchar2 default '%'
   ,p_force_cod number default 0) as
type t_rc is ref cursor;
v_cur t_rc;
v_sql varchar2(2000);
v_value_cod varchar2(60);
v_where varchar2(80);
v_count number;
cr char(1) := chr(10);
begin
   for s in (select io_cod,DIT_COD
             from MEF_IO2_CFT
             where DIT_COD like p_DIT_COD) loop
      for r in (select b.column_cod
                ,replace(b.column_cod,'_COD','_BDS') column_bds
                ,nvl(dom_cod,s.io_cod||'.'||column_cod) dom_cod
                from cols a
                inner join MEF_STA2_CFT b
                on (a.column_name = b.column_cod
                    and b.io_cod = s.io_cod)
                where a.table_name = s.DIT_COD
                and a.column_name like p_col
                and a.column_name like '%_COD'
                and nvl(b.dom_cod,'*') <> 'NULL') loop
         if (p_force_cod = 0) then
            v_where := ' where '||r.column_bds||' is null';
         else
            v_where := null;
         end if;
         v_sql := 'select distinct '||r.column_cod||' value_cod'||cr||
                  ' from '||s.DIT_COD||v_where;
         dbms_output.put_line(v_sql);
         open v_cur for v_sql;
         loop
            fetch v_cur into v_value_cod;
            exit when v_cur%notfound;
            -- use bind variables (USING) because if the value contains quotes, the SQL would fail
            v_sql :=
            'update '||s.DIT_COD||' d'||cr||
            'set ('||r.column_bds||') = ('||cr||'select bds_txt '||cr||
            'from MEF_DOM2_CFT s where d.'||r.column_cod||' = s.cod_txt '||
            'and s.dom_cod = '||''''||r.dom_cod||''''||')'||cr||
            'where '||r.column_cod||' = :1 '||
            ' and exists ('||cr||
            'select 1 from MEF_DOM2_CFT s where d.'||r.column_cod||' = s.cod_txt '||
            'and s.dom_cod = '||''''||r.dom_cod||''''||
            ')';
            dbms_output.put_line(v_sql);
            execute immediate v_sql using v_value_cod;
         end loop;
         close v_cur;
      end loop;
   end loop;
   commit;
end;
/
sho errors
set serveroutput on
exec p_dit2('CUST2_DIT');


Some notes on this update procedure.

  • The first loop identifies the names of all the dimension tables, unless a specific one is provided as input.
  • The second loop identifies the values and descriptions associated with the domains. They will be the basis for the subsequent updates.
  • As we know, UPDATE operations are, in general, slow operations at the database level. However, we perform the update only for the rows whose description is not yet valued (see the clause ' where '||r.column_bds||' is null'). This greatly reduces the weight of the update operation. In fact, in the case of Slowly Changing Dimensions of type 2, we expect that new keys, or new versions of existing keys, will involve only between 5 and 10 percent of the total rows (this is why they are called slowly changing dimensions). Then, with appropriate indexing, the weight of the update should not be excessive.
  • Note also that the condition on NULL seen above is controlled by an input parameter of the procedure. So, in the face of a massive update of the descriptions in the domains table, it is always possible to force a full update of all dimension tables (off-line, of course).
  • We used the USING clause (a bind variable) in the dynamic execution because, if the value contains quotes, concatenating it into the SQL text would fail.
The dynamic SQL used in the procedure will produce statements like:

select distinct STATUS_COD value_cod
from CUST2_DIT where STATUS_BDS is null;

update CUST2_DIT d
set (STATUS_BDS) = (
select bds_txt
from MEF_DOM2_CFT s where d.STATUS_COD = s.cod_txt and s.dom_cod = 'STATUS')
where STATUS_COD = :1  and exists (
select 1 from MEF_DOM2_CFT s where d.STATUS_COD = s.cod_txt and s.dom_cod = 'STATUS')


At the end of the update, the dimension table will be:   
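
As a quick check (a minimal sketch, using the CUST2_DIT columns that appear in the generated statements above), we can verify that no row is left without a description:

select STATUS_COD, STATUS_BDS
from CUST2_DIT
where STATUS_BDS is null;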





The descriptions management - Stage 3

The previous stages have produced a fairly accurate management of descriptions. Stage 2 has made it dynamic and reusable for all the dimension tables of a Data Warehouse. Unfortunately, we are still quite far from reality: we have not yet taken into account a very important factor, the dependence between codes.
In a medium/large Data Warehouse, and especially in an Enterprise Data Warehouse, where dependencies between codes exist, the stage 2 approach no longer works, because we have to deal with situations like the following.

Example 1

Suppose we are in a bank that has acquired other banks. The Data Warehouse will contain the data of the whole group, so even the customer table must be unique, with a company code that differentiates the customers. Let us now apply this fact conceptually to our CUST_DIT example. Are we really sure that the meaning "employee" (as a description) of the profession code "10" for bank B1 is the same for bank B2?
The question is obviously rhetorical: the answer is NO. We cannot have this guarantee, nor can we hope that the data migration from the acquired bank to the acquiring bank has normalized all tables and meanings. This is the problem of the dependence between codes: a code no longer has a unique description, but a different description depending on the value of another code.
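
A minimal sketch of the situation (the COMPANY_COD and PROFESSION_COD columns, and the "FREELANCE" description, are hypothetical and used only for illustration): the same code has different descriptions depending on the company.

-- hypothetical content of the customer dimension after a merger
-- COMPANY_COD  PROFESSION_COD  PROFESSION_BDS
-- B1           10              EMPLOYEE
-- B2           10              FREELANCE
select COMPANY_COD, PROFESSION_COD, PROFESSION_BDS
from CUST_DIT
where PROFESSION_COD = '10';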

Example 2

It must be clear that this problem does not occur only in the case of company mergers, but also in the tables of a single company. Another example is the reason code ("causale") of a banking transaction. A reason code "01" may have the meaning "Descr XXX" if the transaction takes place at the bank desk, but the same code "01" may have a different description "Descr YYY" if the transaction takes place via the Web. In this case the reason code depends on the transmission channel code.

Example 3

When I speak of dependence between codes, I do not mean only between two codes, but possibly among many codes. An example is the table of the financial instruments. Because the instruments are very numerous, there is almost always a hierarchy that classifies them, associating them to categories, groups and subgroups.
As we can see from the example below, the subgroup with code "1" has three different descriptions depending on the group to which it belongs. In turn, the group code "2" has two different descriptions depending on the category to which it belongs.
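
For reference, the hierarchy used in this stage (the same values that will be loaded into the domains table later in this example) looks like this:

CAT_COD  CAT_BDS        GRP_COD  GRP_BDS           SUBGRP_COD  SUBGRP_BDS
4        COMMON FOUNDS  3        FOR.INV.FOUNDS    1           LUSS. MOB.
7        OPTIONS        2        PUT ITALY         1           EQUITIES
3        WARRANTS       2        FOREIGN WARRANTS  1           ORD. STOCK BUY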


So, if we have dependence between codes, the management of the descriptions shown in stage 2 does not work: it has to be handled in a different way.

Change the configuration table

The first thing to do is to change the configuration table of the columns, adding the information on the dependencies between codes. We assume no more than 5 dependencies. The new example will be the dimension table of the financial instruments.

DROP TABLE MEF_STA3_CFT CASCADE CONSTRAINTS;
CREATE TABLE MEF_STA3_CFT (
  IO_COD      VARCHAR2(9)
  ,COLUMN_COD VARCHAR2(30)
  ,DOM_COD    VARCHAR2(30)
  ,DC1_COD VARCHAR2(30)
  ,DC2_COD VARCHAR2(30)
  ,DC3_COD VARCHAR2(30)
  ,DC4_COD VARCHAR2(30)
  ,DC5_COD VARCHAR2(30)  
);
INSERT INTO MEF_STA3_CFT VALUES ('FIN','CAT_COD','CAT','','','','','');
INSERT INTO MEF_STA3_CFT VALUES ('FIN','GRP_COD','GRP','CAT_COD','','','','');
INSERT INTO MEF_STA3_CFT VALUES ('FIN','SUBGRP_COD','SUBGRP','GRP_COD','CAT_COD','','','');


Looking at its contents, we can clearly see the dependencies between codes described earlier: the sub-group depends on the group and the category, the group depends on the category, and the category has no dependencies.
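
A simple query to display the configuration (a sketch using only the columns defined above):

select IO_COD, COLUMN_COD, DOM_COD, DC1_COD, DC2_COD
from MEF_STA3_CFT
order by COLUMN_COD;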



Also configure the new source/dimension

DELETE MEF_IO2_CFT WHERE IO_COD='FIN';
INSERT INTO MEF_IO2_CFT VALUES ('FIN','FIN_STA','FIN_DIT');

Create and load the domains table

The domains table must contain the values of the dependency codes. These dependencies must be studied and analyzed. As in the previous stage, even if they are not known in advance, all combinations of dependencies will be initialized automatically after the loading of the staging area.
The DCLIST_TXT field is only informative. It is useful for knowing the names of the columns to which the DV* codes refer.

DROP TABLE MEF_DOM3_CFT CASCADE CONSTRAINTS;
CREATE TABLE MEF_DOM3_CFT (
  DOM_COD  VARCHAR2(61) NOT NULL
  ,COD_TXT  VARCHAR2(60) NOT NULL
  ,HDS_TXT  VARCHAR2(255)
  ,BDS_TXT  VARCHAR2(255)
  ,DV1_COD    VARCHAR2(30 BYTE)
  ,DV2_COD    VARCHAR2(30 BYTE)
  ,DV3_COD    VARCHAR2(30 BYTE)
  ,DV4_COD    VARCHAR2(30 BYTE)
  ,DV5_COD    VARCHAR2(30 BYTE)
  ,DCLIST_TXT   VARCHAR2(160 BYTE)
);
INSERT INTO MEF_DOM3_CFT VALUES ('SUBGRP','1','LUSS. MOB.','LUSS. MOB.','3','4','','','','GRP_COD,CAT_COD');
INSERT INTO MEF_DOM3_CFT VALUES ('SUBGRP','1','EQUITIES','EQUITIES','2','7','','','','GRP_COD,CAT_COD');
INSERT INTO MEF_DOM3_CFT VALUES ('SUBGRP','1','ORD. STOCK  BUY','ORD. STOCK  BUY','2','3','','','','GRP_COD,CAT_COD');
INSERT INTO MEF_DOM3_CFT VALUES ('GRP','3','FOR.INV.FOUNDS','FOR.INV.FOUNDS','4','','','','','CAT_COD');
INSERT INTO MEF_DOM3_CFT VALUES ('GRP','2','PUT  ITALY','PUT  ITALY','7','','','','','CAT_COD');
INSERT INTO MEF_DOM3_CFT VALUES ('GRP','2','FOREIGN WARRANTS','FOREIGN WARRANTS','3','','','','','CAT_COD');
INSERT INTO MEF_DOM3_CFT VALUES ('CAT','4','COMMON FOUNDS','COMMON FOUNDS','','','','','','');
INSERT INTO MEF_DOM3_CFT VALUES ('CAT','7','OPTIONS','OPTIONS','','','','','','');
INSERT INTO MEF_DOM3_CFT VALUES ('CAT','3','WARRANTS','WARRANTS','','','','','','');


See the domains table:




Load the staging table

The new staging table of financial instruments will be:

DROP TABLE FIN_STA CASCADE CONSTRAINTS;
CREATE TABLE FIN_STA
(
  FIN_COD VARCHAR2(12)
  ,CAT_COD VARCHAR2(3)
  ,GRP_COD VARCHAR2(3) 
  ,SUBGRP_COD VARCHAR2(3)   
);
INSERT INTO FIN_STA VALUES ('LU0108803940','4','3','1');
INSERT INTO FIN_STA VALUES ('IT0002693197','7','2','1');
INSERT INTO FIN_STA VALUES ('CH0014852120','3','2','1');
INSERT INTO FIN_STA VALUES ('IT0001119848','3','1','1');
INSERT INTO FIN_STA VALUES ('IT0000566056','1','1','1');
COMMIT;


As we can see, there are two combinations (1,1,1 and 3,1,1) that are not present in the domains table.
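
A minimal check query (a sketch built only on the tables defined above) lists the sub-group combinations of the staging table not yet covered by the domains table:

select distinct a.SUBGRP_COD, a.GRP_COD, a.CAT_COD
from FIN_STA a
where not exists (
   select 1
   from MEF_DOM3_CFT b
   where b.DOM_COD = 'SUBGRP'
     and b.COD_TXT = a.SUBGRP_COD
     and b.DV1_COD = a.GRP_COD
     and b.DV2_COD = a.CAT_COD);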

Update the domain table

The update procedure of the domains is certainly more complicated than in the previous case. The code is written only for demonstration purposes and could surely be implemented more efficiently.

create or replace procedure p_dom3(p_io_cod varchar2 default '%'
,p_column_cod varchar2 default '%') is
v_sql varchar2(4000);
v_dom_cod varchar2(61);
v_hds_txt varchar2(200);
v_bds_txt varchar2(200);
v_cod_txt varchar2(200);
v_column_cod varchar2(200);
type t_rc is ref cursor;
v_cur t_rc;
cr char(1) := chr(10);
v_list varchar2(200);
v_list2 varchar2(200);
v_c1 varchar2(200);
v_c2 varchar2(200);
v_c3 varchar2(200);
v_c4 varchar2(200);
v_c5 varchar2(200);
v_count number;
begin
     for r in (select a.STT_COD
               ,b.column_cod
             ,nvl(b.dom_cod,a.io_cod||'.'||b.column_cod) dom_cod
             ,b.dc1_cod,b.dc2_cod,b.dc3_cod,b.dc4_cod,b.dc5_cod
               from mef_io2_cft a
               inner join mef_sta3_cft b
               on (a.io_cod = b.io_cod
               and column_cod like p_column_cod
               and b.column_cod like '%_COD'
               and nvl(b.dom_cod,'*') <> 'NULL')
               where a.io_cod like p_io_cod) loop
         v_list := null;
         v_list2 := null;
         v_count := 0;
         -- reset the dependency values, which may hold stale data from the previous column
         v_c1 := null; v_c2 := null; v_c3 := null; v_c4 := null; v_c5 := null;
         if (r.dc1_cod is not null) then
            v_list := v_list||','||r.dc1_cod;
            v_list2 := v_list2||' and '||r.dc1_cod||' = dv1_cod';
            v_count := v_count+1;
         end if;
         if (r.dc2_cod is not null) then
            v_list := v_list||','||r.dc2_cod;
            v_list2 := v_list2||' and '||r.dc2_cod||' = dv2_cod';
            v_count := v_count+1;
         end if;
         if (r.dc3_cod is not null) then
            v_list := v_list||','||r.dc3_cod;
            v_list2 := v_list2||' and '||r.dc3_cod||' = dv3_cod';
            v_count := v_count+1;
         end if;
         if (r.dc4_cod is not null) then
            v_list := v_list||','||r.dc4_cod;
            v_list2 := v_list2||' and '||r.dc4_cod||' = dv4_cod';
            v_count := v_count+1;
         end if;
         if (r.dc5_cod is not null) then
            v_list := v_list||','||r.dc5_cod;
            v_list2 := v_list2||' and '||r.dc5_cod||' = dv5_cod';
            v_count := v_count+1;
         end if;              
         v_sql := 'select distinct '||''''||r.dom_cod||''''||cr||
                  '   ,a.'||r.column_cod||cr||
                  '   ,b.cod_txt'||cr||
                  '   ,nvl(b.hds_txt,''nd'') hds_txt '||cr||
                  '   ,nvl(b.bds_txt,''nd'') bds_txt '||cr||
                  v_list||cr||
                  'from '||r.STT_COD||' a '||cr||
                  'left outer join mef_dom3_cft b '||cr||
                  'on (a.'||r.column_cod||' = b.cod_txt '||cr||
                  '    and dom_cod='||''''||r.dom_cod||''''||cr||
                  v_list2||cr||
                  ')';
         dbms_output.put_line(v_sql);
         open v_cur for v_sql;
         loop
            if (v_count=0) then
               fetch v_cur into v_dom_cod,v_column_cod,v_cod_txt
               ,v_hds_txt,v_bds_txt;
            elsif (v_count=1) then
               fetch v_cur into v_dom_cod,v_column_cod,v_cod_txt
               ,v_hds_txt,v_bds_txt,v_c1;
            elsif (v_count=2) then
               fetch v_cur into v_dom_cod,v_column_cod,v_cod_txt
               ,v_hds_txt,v_bds_txt,v_c1,v_c2;
            elsif (v_count=3) then
               fetch v_cur into v_dom_cod,v_column_cod,v_cod_txt
               ,v_hds_txt,v_bds_txt,v_c1,v_c2,v_c3;
            elsif (v_count=4) then
               fetch v_cur into v_dom_cod,v_column_cod,v_cod_txt
               ,v_hds_txt,v_bds_txt,v_c1,v_c2,v_c3,v_c4;
            else
               fetch v_cur into v_dom_cod,v_column_cod,v_cod_txt
               ,v_hds_txt,v_bds_txt,v_c1,v_c2,v_c3,v_c4,v_c5;
            end if;        
            exit when v_cur%notfound;
            if (v_cod_txt is null) then
               insert into mef_dom3_cft (dom_cod,cod_txt,hds_txt,bds_txt
               ,dv1_cod, dv2_cod,dv3_cod, dv4_cod, dv5_cod,dclist_txt)
               values(v_dom_cod,v_column_cod,v_hds_txt,v_bds_txt
               ,v_c1,v_c2,v_c3,v_c4,v_c5,substr(v_list,2));
            end if;
         end loop;
         close v_cur;
     end loop;
   commit;
end;
/
sho errors
set serveroutput on
exec p_dom3('FIN');


Compared to the stage 2 version, we also need to acquire the names of the dependency code columns. After the execution of the procedure (which, remember, is to be performed as post-processing of the Staging Area load), the domains table will have the following values.
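
For example, the combinations added automatically by the procedure can be inspected with a query like this (a sketch; 'nd' is the placeholder description used by p_dom3):

select DOM_COD, COD_TXT, BDS_TXT, DV1_COD, DV2_COD, DCLIST_TXT
from MEF_DOM3_CFT
where BDS_TXT = 'nd'
order by DOM_COD, COD_TXT;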




Creation and load of the dimension table

We create the new dimension table and load it  from the staging table.

DROP TABLE FIN_DIT CASCADE CONSTRAINTS;
CREATE TABLE FIN_DIT (
  FIN_KEY NUMBER
  ,FIN_COD VARCHAR2(12)
  ,CAT_COD VARCHAR2(30)
  ,CAT_BDS VARCHAR2(60)
  ,CAT_XDS VARCHAR2(93) GENERATED ALWAYS
     AS (CAT_COD||' - '||CAT_BDS) VIRTUAL
  ,GRP_COD VARCHAR2(30)
  ,GRP_BDS VARCHAR2(60)
  ,GRP_XDS VARCHAR2(93) GENERATED ALWAYS
     AS (GRP_COD||' - '||GRP_BDS) VIRTUAL
  ,SUBGRP_COD VARCHAR2(30)
  ,SUBGRP_BDS VARCHAR2(60)
  ,SUBGRP_XDS VARCHAR2(93) GENERATED ALWAYS
     AS (SUBGRP_COD||' - '||SUBGRP_BDS) VIRTUAL        
);
INSERT INTO FIN_DIT(FIN_KEY,FIN_COD,CAT_COD,GRP_COD,SUBGRP_COD)
SELECT ROWNUM,FIN_COD,CAT_COD,GRP_COD,SUBGRP_COD
FROM FIN_STA;
COMMIT;


Let us look at the dimension table before the update of the descriptions.
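
Before the update, the *_BDS columns are still null, so the virtual *_XDS columns show only the code part. A minimal query to see this:

select FIN_COD, CAT_XDS, GRP_XDS, SUBGRP_XDS
from FIN_DIT
order by FIN_KEY;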



Update the dimension table

The update procedure, too, will be more complex than the previous one, in order to manage the dependencies.

create or replace procedure p_dit3(
   p_DIT_COD varchar2 default '%'
   ,p_col varchar2 default '%'
   ,p_force_cod number default 0) as
type t_rc is ref cursor;
v_cur t_rc;
v_sql varchar2(2000);  
v_sql2 varchar2(2000);    
v_value_cod varchar2(60);
v_column_cod varchar2(30);
v_dom_cod varchar2(61);
v_exec_cnt number;
v_where varchar2(80);
v_count number;
cr char(1) := chr(10);
v_c1 varchar2(200);
v_c2 varchar2(200);
v_c3 varchar2(200);
v_c4 varchar2(200);
v_c5 varchar2(200);
v_ditc1 varchar2(200);
v_ditc2 varchar2(200);
v_ditc3 varchar2(200);
v_ditc4 varchar2(200);
v_ditc5 varchar2(200);
v_list varchar2(2000);
v_list2 varchar2(2000);
begin
   for s in (select io_cod,DIT_COD
             from MEF_IO2_CFT
             where DIT_COD like p_DIT_COD) loop
      for r in (select b.column_cod
                ,replace(b.column_cod,'_COD','_BDS') column_bds
                ,nvl(dom_cod,s.io_cod||'.'||column_cod) dom_cod
                ,dc1_cod
                ,dc2_cod
                ,dc3_cod
                ,dc4_cod
                ,dc5_cod
                from cols a
                inner join MEF_STA3_CFT b
                on (a.column_name = b.column_cod
                    and b.io_cod = s.io_cod
                    and b.column_cod like '%_COD'
                    and nvl(b.dom_cod,'*') <> 'NULL')
                where a.table_name = s.DIT_COD
                and a.column_name like p_col
                ) loop
            v_list := null;
            v_list2 := null;
            v_count := 0;
            dbms_output.put_line(r.column_bds);        
            if (r.dc1_cod is not null) then
               v_list := v_list||','||r.dc1_cod;
               v_count := v_count+1;
            end if;
            if (r.dc2_cod is not null) then
               v_list := v_list||','||r.dc2_cod;
               v_count := v_count+1;
            end if;
            if (r.dc3_cod is not null) then
               v_list := v_list||','||r.dc3_cod;
               v_count := v_count+1;
            end if;
            if (r.dc4_cod is not null) then
               v_list := v_list||','||r.dc4_cod;
               v_count := v_count+1;
            end if;
            if (r.dc5_cod is not null) then
               v_list := v_list||','||r.dc5_cod;
               v_count := v_count+1;
            end if;                 
            if (p_force_cod = 0) then
               v_where := ' where '||r.column_bds||' is null';
            else
               v_where := null;
            end if;
            dbms_output.put_line(v_count);   
            v_sql := 'select distinct '||r.column_cod||' value_cod'||cr||
                     v_list||' from '||s.DIT_COD||v_where;
            dbms_output.put_line(v_sql);
            open v_cur for v_sql;
            loop
               if (v_count=0) then
                  fetch v_cur into v_value_cod;
                  v_list2 := null;
               elsif (v_count=1) then
                  fetch v_cur into v_value_cod,v_ditc1;
                  v_list2 := ' and d.'||r.dc1_cod||' = s.dv1_cod';
               elsif (v_count=2) then
                  fetch v_cur into v_value_cod,v_ditc1,v_ditc2;
                  v_list2 := ' and d.'||r.dc1_cod||' = s.dv1_cod'||
                             ' and d.'||r.dc2_cod||' = s.dv2_cod';
               elsif (v_count=3) then
                  fetch v_cur into v_value_cod,v_ditc1,v_ditc2,v_ditc3;
                  v_list2 := ' and d.'||r.dc1_cod||' = s.dv1_cod'||
                             ' and d.'||r.dc2_cod||' = s.dv2_cod'||
                             ' and d.'||r.dc3_cod||' = s.dv3_cod';
               elsif (v_count=4) then
                  fetch v_cur into v_value_cod,v_ditc1,v_ditc2,v_ditc3,v_ditc4;
                  v_list2 := ' and d.'||r.dc1_cod||' = s.dv1_cod'||
                             ' and d.'||r.dc2_cod||' = s.dv2_cod'||
                             ' and d.'||r.dc3_cod||' = s.dv3_cod'||
                             ' and d.'||r.dc4_cod||' = s.dv4_cod';
               else
                  fetch v_cur into v_value_cod,v_ditc1,v_ditc2,v_ditc3,v_ditc4,v_ditc5;
                  v_list2 := ' and d.'||r.dc1_cod||' = s.dv1_cod'||
                             ' and d.'||r.dc2_cod||' = s.dv2_cod'||
                             ' and d.'||r.dc3_cod||' = s.dv3_cod'||
                             ' and d.'||r.dc4_cod||' = s.dv4_cod'||
                             ' and d.'||r.dc5_cod||' = s.dv5_cod';
               end if;
               exit when v_cur%notfound;   
               v_sql2 :=
               'update '||s.DIT_COD||' d'||cr||
               'set ('||r.column_bds||') = ('||cr||'select bds_txt '||cr||
               'from mef_dom3_cft s where s.dom_cod = '||''''||r.dom_cod||''''||
               ' and s.cod_txt = :1 '||
               v_list2||cr||
               ')'||cr||
               'where '||r.column_cod||' = :2 '||cr||
               ' and exists (select 1 from MEF_DOM3_CFT s '||
               'where s.dom_cod = '||''''||r.dom_cod||''''||' and s.cod_txt = :3 '||
               v_list2||')';
               dbms_output.put_line(v_sql2);
               execute immediate v_sql2 using v_value_cod,v_value_cod,v_value_cod;
            end loop;
            close v_cur;
      end loop;
   end loop;
   commit;
end;
/
sho errors
set serveroutput on
exec p_dit3('FIN_DIT');


After the execution of the procedure, the table of financial instruments will be: 




The first three rows have code combinations that are present in the domains table, so the corresponding descriptions have been updated for all codes.
The fourth row has only the category description updated, because the category has no dependencies (its group and sub-group combinations are unknown).
The fifth row has all descriptions undefined, because it is a combination of codes and dependencies unknown to the domains table.
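
A quick check (a sketch based on the 'nd' placeholder used by the domain procedure) lists the rows that still carry undefined descriptions:

select FIN_COD, CAT_BDS, GRP_BDS, SUBGRP_BDS
from FIN_DIT
where CAT_BDS = 'nd' or GRP_BDS = 'nd' or SUBGRP_BDS = 'nd'
order by FIN_KEY;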

The descriptions management - Stage 4

Many of you will be surprised to see a further stage 4. It seems incredible, but we have not yet said everything about descriptions management. Let us summarize the analysis done so far.
In stage 1 we highlighted the problem, introduced the domains table and the configuration table necessary to operate, and defined a naming convention.
In stage 2 we automated the process and introduced other configuration tables in support of a dynamic management of the descriptions. We showed how to pre-load the domains table before receiving that information from the source systems.
In stage 3 we handled the dependencies between codes.
What is still missing?
The answer depends on the international scope of your Data Warehouse. If you are sure that the Data Warehouse is internal and used only by users of your native language, you can stop at stage 3; otherwise, you have to take multilanguage management into account.
I will not build a complete example as for the other stages. The change consists in adding a language code field to the domains table (plus some technical columns that are always useful to have), and then adding the language code as a key field in the procedures implemented so far (see the sketch after the table below). With multilanguage management, the volume of the domains table grows by as many times as the number of languages you add.

The new domains table

The domains table that we see below can be considered complete, and I advise you to create it immediately with the language code, even if at first it will not be used.

DROP TABLE MEF_DOM4_CFT CASCADE CONSTRAINTS;
CREATE TABLE MEF_DOM4_CFT (
  DOM_COD  VARCHAR2(61) NOT NULL
  ,COD_TXT  VARCHAR2(60) NOT NULL
  ,HDS_TXT  VARCHAR2(255)
  ,BDS_TXT  VARCHAR2(255)
  ,DV1_COD    VARCHAR2(30 BYTE)
  ,DV2_COD    VARCHAR2(30 BYTE)
  ,DV3_COD    VARCHAR2(30 BYTE)
  ,DV4_COD    VARCHAR2(30 BYTE)
  ,DV5_COD    VARCHAR2(30 BYTE)
  ,DCLIST_TXT   VARCHAR2(160 BYTE)
  ,LANG_COD VARCHAR2(30 BYTE)
  ,INS_DTS     DATE
  ,STAMP_DTS   DATE  DEFAULT SYSDATE
  ,CTRL_TXT    VARCHAR2(30)
);
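
As a sketch only (the stage 3 procedures are not rewritten here), the description lookup inside the dynamic UPDATE would simply add the language code to the correlation; something like the following, where the 'EN' literal stands for a hypothetical language code parameter:

update FIN_DIT d
set (CAT_BDS) = (
   select bds_txt
   from MEF_DOM4_CFT s
   where s.dom_cod  = 'CAT'
     and s.cod_txt  = d.CAT_COD
     and s.lang_cod = 'EN')   -- 'EN' is a hypothetical language code
where exists (
   select 1
   from MEF_DOM4_CFT s
   where s.dom_cod  = 'CAT'
     and s.cod_txt  = d.CAT_COD
     and s.lang_cod = 'EN');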

Conclusion

As we have seen, descriptions management is not a simple topic: it has many implications, involves many people, and it is extremely important because it is not a technical detail but shapes the way the end user sees the data. It is a determining factor for the success of the Data Warehouse.
I hope that what I have written helps all the people involved in the design of a Data Warehouse to become aware of the problem. The "descriptions management" activity must become a fixed item in the Gantt chart of the project.