Saturday, May 26, 2012

Strategies with Staging Area on the Target

Strategies with Staging Area on the Target

These strategies are used when the staging area schema is located in the same data server as the target table schema. In this configuration, complex integration strategies can take place
6.1.2.1.1 Append
This strategy simply inserts the incoming data flow into the target datastore, possibly deleting the content of the target beforehand.
This integration strategy includes the following steps:
  1. Delete (or truncate) all records from the target table. This step usually depends on a KM option.
  2. Transform and insert data from sources located on the same server and from loading tables in the staging area. When dealing with remote source data, LKMs will have already prepared loading tables. Sources on the same server can be read directly. The integration operation will be a direct INSERT/SELECT statement leveraging containing all the transformations performed on the staging area in the SELECT clause and on all the transformation on the target in the INSERT clause.
  3. Commit the Transaction. The operations performed on the target should be done within a transaction and committed after they are all complete. Note that committing is typically triggered by a KM option called COMMIT.
The same integration strategy can be obtained by using the Control Append strategy and not choosing to activate flow control.
6.1.2.1.2 Control Append
In the Append strategy, flow data is simply inserted in the target table without any flow control. This approach can be improved by adding extra steps that will store the flow data in an integration table ("I$"), then call the CKM to isolate erroneous records in the error table ("E$").
This integration strategy includes the following steps:
  1. Drop (if it exists) and create the integration table in the staging area. This is created with the same columns as the target table so that it can be passed to the CKM for flow control.
  2. Insert data in the loading table from the sources and loading tables using a single INSERT/SELECT statement similar to the one loading the target in the append strategy.
  3. Call the CKM for flow control. The CKM will evaluate every constraint defined for the target table on the integration table data. It will create an error table and insert the erroneous records into this table. It will also remove erroneous records from the integration table.
    After the CKM completes, the integration table will only contain valid records. Inserting them in the target table can then be done safely.
  4. Remove all records from the target table. This step can be made dependent on an option value set by the designer of the interface
  5. Append the records from the integration table to the target table in a single INSERT/SELECT statement.
  6. Commit the transaction.
  7. Drop the temporary integration table.
Error Recycling
In some cases, it is useful to recycle errors from previous runs so that they are added to the flow and applied again to the target. This method can be useful for example when receiving daily sales transactions that reference product IDs that may not exist. Suppose that a sales record is rejected in the error table because the referenced product ID does not exist in the product table. This happens during the first run of the interface. In the meantime the missing product ID is created by the data administrator. Therefore the rejected record becomes valid and should be re-applied to the target during the next execution of the interface.
This mechanism is implement IKMs by an extra task that inserts all the rejected records of the previous executions of this interface from the error table into integration table. This operation is made prior to calling the CKM to check the data quality, and is conditioned by a KM option usually called RECYCLE_ERRORS.
6.1.2.1.3 Incremental Update
The Incremental Update strategy is used to integrate data in the target table by comparing the records of the flow with existing records in the target according to a set of columns called the "update key". Records that have the same update key are updated when their associated data is not the same. Those that don't yet exist in the target are inserted. This strategy is often used for dimension tables when there is no need to keep track of the records that have changed.
The challenge with such IKMs is to use set-oriented SQL based programming to perform all operations rather than using a row-by-row approach that often leads to performance issues. The most common method to build such strategies often relies on the integration table ("I$") which stores the transformed source sets. This method is described below:
  1. Drop (if it exists) and create the integration table in the staging area. This is created with the same columns as the target table so that it can be passed to the CKM for flow control. It also contains an IND_UPDATE column that is used to flag the records that should be inserted ("I") and those that should be updated ("U").
  2. Transform and insert data in the loading table from the sources and loading tables using a single INSERT/SELECT statement. The IND_UPDATE column is set by default to "I".
  3. Recycle the rejected records from the previous run to the integration table if the RECYCLE_ERROR KM option is selected.
  4. Call the CKM for flow control. The CKM will evaluate every constraint defined for the target table on the integration table data. It will create an error table and insert the erroneous records into this table. It will also remove erroneous records from the integration table.
  5. Update the integration table to set the IND_UPDATE flag to "U" for all the records that have the same update key values as the target ones. Therefore, records that already exist in the target will have a "U" flag. This step is usually an UPDATE/SELECT statement.
  6. Update the integration table again to set the IND_UPDATE column to "N" for all records that are already flagged as "U" and for which the column values are exactly the same as the target ones. As these flow records match exactly the target records, they don't need to be used to update the target data.
    After this step, the integration table is ready for applying the changes to the target as it contains records that are flagged:
    • "I": these records should be inserted into the target.
    • "U": these records should be used to update the target.
    • "N": these records already exist in the target and should be ignored.
  7. Update the target with records from the integration table that are flagged "U". Note that the update statement is typically executed prior to the INSERT statement to minimize the volume of data manipulated.
  8. Insert records in the integration table that are flagged "I" into the target.
  9. Commit the transaction.
  10. Drop the temporary integration table.
Optimization
This approach can be optimized depending on the underlying database. The following examples illustrate such optimizations:
  • With Teradata, it may be more efficient to use a left outer join between the flow data and the target table to populate the integration table with the IND_UPDATE column already set properly.
  • With Oracle, it may be more efficient in some cases to use a MERGE INTO statement on the target table instead of an UPDATE then INSERT.
Update Key
The update key should always be unique. In most cases, the primary key will be used as an update key. The primary key cannot be used, however, when it is automatically calculated using an increment such as an identity column, a rank function, or a sequence. In this case an update key based on columns present in the source must be used.
Comparing Nulls
When comparing data values to determine what should not be updated, the join between the integration table and the target table is expressed on each column as follow:
<target_table>.ColumnN = <loading_table>.ColumnN or (<target_table> is null and <loading_table>.ColumnN is null)
This is done to allow comparison between null values, so that a null value matches another null value. A more elegant way of writing it would be to use the coalesce function. Therefore the WHERE predicate could be written this way:
<%=odiRef.getColList("","coalesce(" + odiRef.getTable("L", "INT_NAME", "A") + ".[COL_NAME], 0) = coalesce(T.[COL_NAME], 0)", " \nand\t", "", "((UPD and !TRG) and !UK) ")%>
Column-Level Insert/Update Behavior
Columns updated by the UPDATE statement are not the same as the ones used in the INSERT statement. The UPDATE statement uses selector "UPD and not UK" to filter only mappings marked as "Update" in the interface and that do not belong to the update key. The INSERT statement uses selector "INS" to retrieve mappings marked as "insert" in the interface.
Transaction
It is important that the UPDATE and the INSERT statements on the target belong to the same transaction. Should any of them fail, no data will be inserted or updated in the target.

No comments:

Post a Comment