Strategies with Staging Area on the Target
These strategies are used when the staging area schema is located in the same data server as the target table schema. In this configuration, complex integration strategies can take place.
6.1.2.1.1 Append
This strategy simply inserts the incoming data flow into the target datastore, possibly deleting the content of the target beforehand. This integration strategy includes the following steps:
- Delete (or truncate) all records from the target table. This step usually depends on a KM option.
- Transform and insert data from sources located on the same server and from loading tables in the staging area. When dealing with remote source data, LKMs will have already prepared loading tables. Sources on the same server can be read directly. The integration operation will be a direct INSERT/SELECT statement containing all the transformations performed on the staging area in the SELECT clause and all the transformations on the target in the INSERT clause.
- Commit the transaction. The operations performed on the target should be done within a transaction and committed after they are all complete. Note that committing is typically triggered by a KM option called COMMIT.
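The three steps above can be sketched in plain SQL. This minimal example runs SQLite from Python; the SRC_SALES and TRG_SALES tables, the ROUND transformation, and the TRUNCATE flag are invented stand-ins for a real source, target, and KM option:

```python
import sqlite3

# Minimal sketch of the Append strategy. Table and column names are
# illustrative, not from any real IKM.
con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE SRC_SALES (prod_id INTEGER, amount REAL)")
cur.execute("CREATE TABLE TRG_SALES (prod_id INTEGER, amount REAL)")
cur.executemany("INSERT INTO SRC_SALES VALUES (?, ?)",
                [(1, 10.0), (2, 5.5)])

TRUNCATE = True  # plays the role of a DELETE_ALL / TRUNCATE KM option
if TRUNCATE:
    cur.execute("DELETE FROM TRG_SALES")  # step 1: empty the target

# Step 2: direct INSERT/SELECT; transformations live in the SELECT clause.
cur.execute("""
    INSERT INTO TRG_SALES (prod_id, amount)
    SELECT prod_id, ROUND(amount, 1) FROM SRC_SALES
""")
con.commit()  # step 3: commit the transaction (COMMIT option)
```
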
6.1.2.1.2 Control Append
In the Append strategy, flow data is simply inserted in the target table without any flow control. This approach can be improved by adding extra steps that will store the flow data in an integration table ("I$"), then call the CKM to isolate erroneous records in the error table ("E$"). This integration strategy includes the following steps:
- Drop (if it exists) and create the integration table in the staging area. This table is created with the same columns as the target table so that it can be passed to the CKM for flow control.
- Insert data into the integration table from the sources and loading tables using a single INSERT/SELECT statement similar to the one loading the target in the append strategy.
- Call the CKM for flow control. The CKM will evaluate every constraint defined for the target table on the integration table data. It will create an error table and insert the erroneous records into it. It will also remove erroneous records from the integration table. After the CKM completes, the integration table will only contain valid records; inserting them into the target table can then be done safely.
- Remove all records from the target table. This step can be made dependent on an option value set by the designer of the interface.
- Append the records from the integration table to the target table in a single INSERT/SELECT statement.
- Commit the transaction.
- Drop the temporary integration table.
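A minimal sketch of the Control Append steps, again with SQLite from Python. The tables (SRC_SALES, I$_SALES, E$_SALES, TRG_SALES) are hypothetical, and the single "amount must be non-negative" check stands in for whatever constraints a real CKM would evaluate:

```python
import sqlite3

# Sketch of Control Append with flow control into an E$ error table.
con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE SRC_SALES (prod_id INTEGER, amount REAL)")
cur.execute("CREATE TABLE TRG_SALES (prod_id INTEGER, amount REAL)")
cur.executemany("INSERT INTO SRC_SALES VALUES (?, ?)",
                [(1, 10.0), (2, -3.0), (3, 7.5)])

# Step 1: drop/create the integration table with the target's columns.
cur.execute('DROP TABLE IF EXISTS "I$_SALES"')
cur.execute('CREATE TABLE "I$_SALES" (prod_id INTEGER, amount REAL)')

# Step 2: single INSERT/SELECT into the integration table.
cur.execute('INSERT INTO "I$_SALES" SELECT prod_id, amount FROM SRC_SALES')

# Step 3: flow control - isolate erroneous records in E$, remove from I$.
cur.execute('CREATE TABLE IF NOT EXISTS "E$_SALES" (prod_id INTEGER, amount REAL)')
cur.execute('INSERT INTO "E$_SALES" SELECT * FROM "I$_SALES" WHERE amount < 0')
cur.execute('DELETE FROM "I$_SALES" WHERE amount < 0')

# Steps 4-7: empty the target (optional), append valid rows, commit, drop I$.
cur.execute("DELETE FROM TRG_SALES")
cur.execute('INSERT INTO TRG_SALES SELECT * FROM "I$_SALES"')
con.commit()
cur.execute('DROP TABLE "I$_SALES"')
```
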
Error Recycling
In some cases, it is useful to recycle errors from previous runs so
that they are added to the flow and applied again to the target. This
method can be useful for example when receiving daily sales transactions
that reference product IDs that may not exist. Suppose that a sales
record is rejected in the error table because the referenced product ID
does not exist in the product table. This happens during the first run
of the interface. In the meantime the missing product ID is created by
the data administrator. Therefore the rejected record becomes valid and
should be re-applied to the target during the next execution of the
interface. This mechanism is implemented in IKMs by an extra task that inserts all the rejected records from the previous executions of this interface from the error table into the integration table. This operation is performed prior to calling the CKM to check the data quality, and is conditioned by a KM option usually called RECYCLE_ERRORS.
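The recycling task can be sketched as a single conditional INSERT/SELECT from the error table into the integration table. Table names are invented, and the final DELETE from the error table is an assumption made here to keep the example self-consistent, not a documented IKM behavior:

```python
import sqlite3

# Sketch of error recycling: before flow control, rejected rows from a
# previous run are copied from the error table back into the integration
# table so the CKM can re-check them.
con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute('CREATE TABLE "I$_SALES" (prod_id INTEGER, amount REAL)')
cur.execute('CREATE TABLE "E$_SALES" (prod_id INTEGER, amount REAL)')
cur.execute('INSERT INTO "E$_SALES" VALUES (99, 4.0)')  # rejected last run

RECYCLE_ERRORS = True  # the KM option guarding this extra task
if RECYCLE_ERRORS:
    cur.execute('INSERT INTO "I$_SALES" SELECT * FROM "E$_SALES"')
    # Assumption: clear recycled rows so they are not recycled twice;
    # the CKM will re-insert them into E$ if they are still invalid.
    cur.execute('DELETE FROM "E$_SALES"')
con.commit()
```
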
6.1.2.1.3 Incremental Update
The Incremental Update strategy is used to integrate data in the target table by comparing the records of the flow with existing records in the target according to a set of columns called the "update key". Records that have the same update key are updated when their associated data is not the same. Those that do not yet exist in the target are inserted. This strategy is often used for dimension tables when there is no need to keep track of the records that have changed. The challenge with such IKMs is to use set-oriented SQL-based programming to perform all operations rather than a row-by-row approach that often leads to performance issues. The most common method to build such strategies relies on the integration table ("I$"), which stores the transformed source sets. This method is described below:
- Drop (if it exists) and create the integration table in the staging area. This table is created with the same columns as the target table so that it can be passed to the CKM for flow control. It also contains an IND_UPDATE column that is used to flag the records that should be inserted ("I") and those that should be updated ("U").
- Transform and insert data into the integration table from the sources and loading tables using a single INSERT/SELECT statement. The IND_UPDATE column is set by default to "I".
- Recycle the rejected records from the previous run into the integration table if the RECYCLE_ERRORS KM option is selected.
- Call the CKM for flow control. The CKM will evaluate every constraint defined for the target table on the integration table data. It will create an error table and insert the erroneous records into it. It will also remove erroneous records from the integration table.
- Update the integration table to set the IND_UPDATE flag to "U" for all the records that have the same update key values as the target ones. Therefore, records that already exist in the target will have a "U" flag. This step is usually an UPDATE/SELECT statement.
- Update the integration table again to set the IND_UPDATE column to "N" for all records that are already flagged "U" and for which the column values are exactly the same as the target ones. As these flow records match the target records exactly, they do not need to be used to update the target data.
After this step, the integration table is ready for applying the changes to the target as it contains records that are flagged:
- "I": these records should be inserted into the target.
- "U": these records should be used to update the target.
- "N": these records already exist in the target and should be ignored.
- Update the target with records from the integration table that are flagged "U". Note that the UPDATE statement is typically executed prior to the INSERT statement to minimize the volume of data manipulated.
- Insert records from the integration table that are flagged "I" into the target.
- Commit the transaction.
- Drop the temporary integration table.
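The flagging and apply steps can be sketched as four set-oriented statements. This SQLite example uses a hypothetical TRG_PRODUCT target whose update key is prod_id; the I$_PRODUCT contents are chosen so that one row ends up in each of the "N", "U", and "I" states:

```python
import sqlite3

# Sketch of the Incremental Update IND_UPDATE flagging logic.
con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE TRG_PRODUCT (prod_id INTEGER, name TEXT)")
cur.executemany("INSERT INTO TRG_PRODUCT VALUES (?, ?)",
                [(1, "apple"), (2, "pear")])
cur.execute('CREATE TABLE "I$_PRODUCT" '
            "(prod_id INTEGER, name TEXT, IND_UPDATE TEXT DEFAULT 'I')")
cur.executemany('INSERT INTO "I$_PRODUCT" (prod_id, name) VALUES (?, ?)',
                [(1, "apple"),    # unchanged -> will become "N"
                 (2, "peach"),    # changed   -> stays "U"
                 (3, "banana")])  # new       -> stays "I"

# Flag "U": same update key as an existing target row.
cur.execute("""UPDATE "I$_PRODUCT" SET IND_UPDATE = 'U'
               WHERE prod_id IN (SELECT prod_id FROM TRG_PRODUCT)""")
# Flag "N": flagged "U" but all column values match the target exactly.
cur.execute("""UPDATE "I$_PRODUCT" SET IND_UPDATE = 'N'
               WHERE IND_UPDATE = 'U' AND EXISTS (
                 SELECT 1 FROM TRG_PRODUCT T
                 WHERE T.prod_id = "I$_PRODUCT".prod_id
                   AND T.name = "I$_PRODUCT".name)""")

# Apply: UPDATE first (smaller data volume), then INSERT, then commit.
cur.execute("""UPDATE TRG_PRODUCT SET name = (
                 SELECT I.name FROM "I$_PRODUCT" I
                 WHERE I.prod_id = TRG_PRODUCT.prod_id AND I.IND_UPDATE = 'U')
               WHERE prod_id IN
                 (SELECT prod_id FROM "I$_PRODUCT" WHERE IND_UPDATE = 'U')""")
cur.execute("""INSERT INTO TRG_PRODUCT
               SELECT prod_id, name FROM "I$_PRODUCT"
               WHERE IND_UPDATE = 'I'""")
con.commit()
```
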
Optimization
This approach can be optimized depending on the underlying database. The following examples illustrate such optimizations:
- With Teradata, it may be more efficient to use a left outer join between the flow data and the target table to populate the integration table with the IND_UPDATE column already set properly.
- With Oracle, it may be more efficient in some cases to use a MERGE INTO statement on the target table instead of an UPDATE followed by an INSERT.
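SQLite has no MERGE statement, but its INSERT ... ON CONFLICT upsert illustrates the same idea of collapsing the UPDATE and the INSERT into one set-oriented statement (this is a swapped-in analogue, not Oracle MERGE syntax; tables are hypothetical):

```python
import sqlite3

# Single-statement merge analogue: upsert from the integration table.
con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE TRG_PRODUCT (prod_id INTEGER PRIMARY KEY, name TEXT)")
cur.execute("INSERT INTO TRG_PRODUCT VALUES (1, 'apple')")
cur.execute('CREATE TABLE "I$_PRODUCT" (prod_id INTEGER, name TEXT)')
cur.executemany('INSERT INTO "I$_PRODUCT" VALUES (?, ?)',
                [(1, "apricot"), (2, "pear")])

# One statement plays the role of both the UPDATE and the INSERT.
# (WHERE true disambiguates the upsert clause after INSERT ... SELECT.)
cur.execute("""INSERT INTO TRG_PRODUCT (prod_id, name)
               SELECT prod_id, name FROM "I$_PRODUCT" WHERE true
               ON CONFLICT(prod_id) DO UPDATE SET name = excluded.name""")
con.commit()
```
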
Update Key
The update key should always be unique. In most cases, the primary
key will be used as an update key. The primary key cannot be used,
however, when it is automatically calculated using an increment such as
an identity column, a rank function, or a sequence. In this case an
update key based on columns present in the source must be used.
Comparing Nulls
When comparing data values to determine what should not be updated, the join between the integration table and the target table is expressed on each column as follows:
<target_table>.ColumnN = <loading_table>.ColumnN or (<target_table>.ColumnN is null and <loading_table>.ColumnN is null)
This is done to allow comparison between null values, so that a null value matches another null value. A more elegant way of writing it would be to use the coalesce function. Therefore the WHERE predicate could be written this way:
<%=odiRef.getColList("","coalesce(" + odiRef.getTable("L", "INT_NAME", "A") + ".[COL_NAME], 0) = coalesce(T.[COL_NAME], 0)", " \nand\t", "", "((UPD and !TRG) and !UK) ")%>
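The effect of the two predicates can be checked with a quick query. The single-column tables below are made up for the demonstration; note that the substitute value passed to coalesce (0 here) must be one that cannot occur in the real column data, otherwise a null would wrongly match it:

```python
import sqlite3

# Demonstrates why plain equality misses NULL = NULL, and how both the
# verbose predicate and the coalesce rewrite make two NULLs match.
con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE T (c INTEGER)")  # stands in for the target
cur.execute("CREATE TABLE I (c INTEGER)")  # stands in for the I$ table
cur.execute("INSERT INTO T VALUES (NULL)")
cur.execute("INSERT INTO I VALUES (NULL)")

plain = cur.execute(
    "SELECT COUNT(*) FROM T JOIN I ON T.c = I.c").fetchone()[0]
verbose = cur.execute("""SELECT COUNT(*) FROM T JOIN I
    ON T.c = I.c OR (T.c IS NULL AND I.c IS NULL)""").fetchone()[0]
coalesced = cur.execute("""SELECT COUNT(*) FROM T JOIN I
    ON coalesce(T.c, 0) = coalesce(I.c, 0)""").fetchone()[0]
print(plain, verbose, coalesced)  # plain equality finds no match
```
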
Column-Level Insert/Update Behavior
Columns updated by the UPDATE statement are not the same as the ones
used in the INSERT statement. The UPDATE statement uses selector "UPD
and not UK" to filter only mappings marked as "Update" in the interface
and that do not belong to the update key. The INSERT statement uses
selector "INS" to retrieve mappings marked as "insert" in the interface.
Transaction
It is important that the UPDATE and the INSERT statements on the
target belong to the same transaction. Should any of them fail, no data
will be inserted or updated in the target.
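This atomicity can be demonstrated with a deliberately failing second statement; the table and values are invented for the example:

```python
import sqlite3

# The UPDATE and INSERT share one transaction: if either fails, a
# rollback leaves the target exactly as it was before.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE TRG (id INTEGER PRIMARY KEY, v TEXT)")
con.execute("INSERT INTO TRG VALUES (1, 'old')")
con.commit()
try:
    con.execute("UPDATE TRG SET v = 'new' WHERE id = 1")
    con.execute("INSERT INTO TRG VALUES (1, 'dup')")  # fails: PK violation
    con.commit()
except sqlite3.IntegrityError:
    con.rollback()  # neither the UPDATE nor the INSERT is kept
```
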