Tag Archives: Integration

Migration: Beyond ETL

2013.05.26

When approaching a migration it is tempting to do the simplest thing that could possibly work, most likely a re-key or a one-off ETL, because any code or process created for the migration is probably going to be discarded once it is complete. Assuming a code solution is proposed, it will probably look like this:

Beyond ETL 1

So who is going to do the work? If it is a single team then maybe this approach will work, but if there are two teams, one focussed on the source system and one focussed on the target system, then the question is who works on what. Most likely the sending system team does the extract and the target system team does the load, but who does the transform?

Beyond ETL 2

Working on the transform is probably going to take both teams. This has some problems:

  • both teams need to understand each other's data model which, of course, can be very complex, especially if one party forgot the most important step of normalisation.
  • codes need to be mapped and this is often not a one-to-one relationship
  • the target system team need to understand where data might be missing in the source system
  • the target system team need to understand what to do about data quality problems in the source system
  • reconciling the migration will be difficult if both data models are complex.

The answer: introduce an intermediate data model

Beyond ETL 3

The main design consideration of the intermediate data model is that it should be clear and simple to understand; there should be no ambiguity. It has the following features (a short sketch of the idea follows the list):

  • The structure is as simple as possible: code tables are denormalised and merged back into the major entities; user management data can be removed along with system log or other management tables. Both teams are likely to need to come together to agree the structure because this is the common representation of the data from both systems.
  • The structure does not need to match either the source or target system (if it does maybe you don’t need this step).
  • All codes are turned into a description, e.g. replace a status code of “1” with “Active”
  • Column names for tables should be as descriptive as possible.
  • Optimisation for performance involving changes to structure should only be considered if absolutely essential.
  • Missing data should be explicitly marked as such.
  • Data quality issues should be resolved by the source system team as part of their ETL.
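
As an illustration, here is a minimal sketch of the kind of source-to-intermediate transform the source system team might own. The table, column and status-code names are entirely hypothetical; the point is that codes become descriptions and missing data is marked explicitly before the target system team ever sees it.

```python
# Minimal sketch of a source-to-intermediate transform (all names are illustrative).
# The source system team owns this step: codes become descriptions, missing data
# is marked explicitly, and data quality fixes happen here, not in the target load.

STATUS_DESCRIPTIONS = {"1": "Active", "2": "Suspended", "3": "Closed"}

MISSING = "** MISSING IN SOURCE **"   # explicit marker, never an empty string or NULL

def to_intermediate(source_row: dict) -> dict:
    """Map one row of the (hypothetical) source CUSTOMER table to the
    intermediate model: denormalised, descriptive column names, no raw codes."""
    return {
        "customer_full_name": source_row.get("CUST_NM") or MISSING,
        "customer_status_description": STATUS_DESCRIPTIONS.get(
            source_row.get("STAT_CD"), MISSING),
        "date_account_opened": source_row.get("OPEN_DT") or MISSING,
    }

if __name__ == "__main__":
    print(to_intermediate({"CUST_NM": "Jane Smith", "STAT_CD": "1", "OPEN_DT": None}))
    # {'customer_full_name': 'Jane Smith',
    #  'customer_status_description': 'Active',
    #  'date_account_opened': '** MISSING IN SOURCE **'}
```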

The advantages of using the intermediate data model are:

  • Each team can focus on their area of expertise
  • There is less scope for ambiguity leading to mistakes and subsequent rework
  • Reconciliation can be split into two, simpler, reconciliations (source to intermediate and intermediate to target)
  • The target system team are not bound to the source system team making data available; they should understand the intermediate model well enough to generate test data
  • The intermediate data model serves as an archive for subsequent audit or enquiries needing to understand how data was created in the target system

It is especially worth considering using an intermediate data model if the migration is split into phases, or there will be multiple source systems over time, as it can be extended and modified to represent any unique requirements at each phase, or source system, rather than having to understand all of these complexities at one time.

These advantages are also applicable to integrations that follow the ETL model.

Authenticating B2B Integrations with Tokens

2012.01.07

I was surprised to discover that a common practice in certain B2B integrations, in part of the financial services sector, was for user names and passwords to be stored in consuming applications. To understand the scenario where this occurs, see the diagram below:

B2B Tokens

Here a user has access to both System A and System D. System D aggregates information from a number of systems, all of which are owned by different organisations. The user puts their password for System A into System D and this is subsequently used by System D to act on behalf of the user when requesting information, or transactions, from System A. There are a number of problems with this:

1. The password must be presented to System A as entered, so it must be stored either in plain text or with reversible encryption, which makes it vulnerable to theft.

2. When the user has to change their password for System A they must also change it in System D; failing to do this can lead to account lock-outs.

3. Organisation A may want to revoke the user's access via Organisation D but still allow direct access, which complicates System A.

Problem 1 can potentially be resolved by hashing the password, but this has its own problems: you must be very careful that the hashing routines are absolutely identical – I have seen cases where .NET and Java implementations differ; hashes can be broken using multi-GPU hardware; and this does not address problems 2 or 3.

The answer is to issue the user with a token instead of requiring them to use their password, much as OAuth does: this can be done either by creating a registration process between Application D and Application A or by adding a UI to Application A to generate a token for the user. Application D then presents the token instead of a password (a minimal sketch of issuing and validating such a token follows the list of advantages below). The advantages are:

1. The password never needs to be stored in a reversible form (Application A can store a salted hash of it).

2. The token can have a validity independent of the password (e.g. 12 months instead of 1 month).

3. Account lock-outs are avoided.

4. The token can be part of a defence-in-depth approach, for example Organisation A can place an IP restriction on access from Organisation D.

5. Organisation A can revoke the token whenever it chooses.

6. The aggregating application (D) simply needs to store the token instead of a password, so it does not need to be modified unless a registration process driven through the aggregating application (D) is desired.
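
To make the idea concrete, here is a minimal sketch of how System A might issue and validate such a token. The function names, in-memory storage and validity period are illustrative assumptions, not a prescription; a real system would persist tokens in a database and combine this with the IP restrictions and revocation controls described above.

```python
# Sketch of token issue/validate for System A (names and storage are illustrative).
import secrets
import hashlib
from datetime import datetime, timedelta

_tokens = {}  # token_hash -> {"user": ..., "expires": ..., "revoked": ...}

def issue_token(user_id: str, validity_days: int = 365) -> str:
    """Generate a token for the user; only a hash of it is stored by System A."""
    token = secrets.token_urlsafe(32)
    _tokens[hashlib.sha256(token.encode()).hexdigest()] = {
        "user": user_id,
        "expires": datetime.utcnow() + timedelta(days=validity_days),
        "revoked": False,
    }
    return token  # handed to the user / System D; never stored in clear by System A

def validate_token(token: str):
    """Return the user id if the token is known, unexpired and not revoked."""
    record = _tokens.get(hashlib.sha256(token.encode()).hexdigest())
    if record and not record["revoked"] and record["expires"] > datetime.utcnow():
        return record["user"]
    return None
```

Note that only a hash of the token is stored, so a compromise of System A's token store does not directly expose usable credentials.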

To summarise: in B2B integration scenarios consider tokens as part of the security solution.

Integration: let’s have a conversation

2011.12.24

Having established who is driving the integration, it is tempting to start designing the message content or to turn to a messaging standard like ORIGO or FIX. However, there is a layer above the message content that needs to be considered: conversation patterns. This also applies when using a standard: standards often concentrate on the content and, whilst conversation patterns are implied or stated, it is important to be certain about which you are using and the consequences of that choice. Don't go mad with the number of conversation patterns for the integration; try to keep it to a small set. The following are taken from a solution I designed in 2005, which has stood the test of time:

  • Inform-Receipt:

Inform-Receipt

A wants to tell B something and wants B to acknowledge it has been told; A does not need any information back from B, other than the acknowledgement.

  • Request-Response:

Request-Response

A wants some information from B. ‘Ah!’ I hear you say, ‘that’s the same as Inform-Receipt!’ Well, you are probably going to implement them the same way, but the conversation is different: in Inform-Receipt B is getting something from A, whereas in Request-Response A is getting something from B. Also note that B is not bothered whether A actually got the response.

  •  Request-Response-Receipt

Request-Response-Receipt

Extends Request-Response so that B needs confirmation that A has received the response before it continues.
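
One way to keep the chosen pattern explicit, rather than implied, is to carry it in the message envelope itself. The sketch below is illustrative only; the field and pattern names are my own assumptions, not taken from any standard.

```python
# Sketch: making the conversation pattern explicit in the message envelope
# (field and pattern names are illustrative, not from any particular standard).
from dataclasses import dataclass
from enum import Enum

class Conversation(Enum):
    INFORM_RECEIPT = "inform-receipt"                        # B must acknowledge
    REQUEST_RESPONSE = "request-response"                    # B must respond; no receipt from A
    REQUEST_RESPONSE_RECEIPT = "request-response-receipt"    # B waits for A's receipt

@dataclass
class Envelope:
    message_id: str            # unique per message (see the scheme below)
    conversation: Conversation
    correlation_id: str        # ties responses and receipts back to the original request
    body: dict

def expects_receipt(envelope: Envelope) -> bool:
    """Does the sender of a response need a receipt before continuing?"""
    return envelope.conversation is Conversation.REQUEST_RESPONSE_RECEIPT
```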

Once you have established the conversation patterns, consider a scheme for giving each message a unique identifier; this is useful for debugging problems but critical for managing idempotency and allowing for re-transmission if the connection or message is lost.

Message IDs
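
A minimal sketch of one such scheme, assuming a short originating-system prefix plus a UUID; the prefix is illustrative, and a date or sequence number would work equally well as part of the identifier.

```python
# Sketch of a simple message identifier scheme: a short system prefix plus a UUID,
# so any party can tell at a glance where a message originated.
import uuid

def new_message_id(system_prefix: str) -> str:
    return f"{system_prefix}-{uuid.uuid4()}"

# e.g. new_message_id("SYSA") -> 'SYSA-0b9d4f3e-...'
```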

 

My final observation about conversation patterns is to be very clear about persisting the message before acknowledging receipt and, if possible, to do this in a transaction; otherwise it becomes much harder to sort out missing and duplicate message problems.
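
As a sketch of what "persist before acknowledging, in a transaction" can look like, the example below uses SQLite purely for brevity; the table and function names are hypothetical. Because the duplicate check and the insert share one transaction, a re-transmitted message is either stored exactly once or recognised as already received, and the receipt is only sent after the data is safely persisted.

```python
# Sketch of persist-then-acknowledge with duplicate detection (illustrative names).
import sqlite3

db = sqlite3.connect("messages.db")
db.execute("CREATE TABLE IF NOT EXISTS inbox (message_id TEXT PRIMARY KEY, body TEXT)")

def receive(message_id: str, body: str) -> str:
    """Persist the message, then return the acknowledgement to send."""
    try:
        with db:  # one transaction: commit on success, rollback on error
            db.execute("INSERT INTO inbox (message_id, body) VALUES (?, ?)",
                       (message_id, body))
    except sqlite3.IntegrityError:
        # Duplicate delivery (e.g. a re-transmission): already persisted, so it is
        # safe to acknowledge again without processing the message twice.
        pass
    return f"RECEIPT {message_id}"
```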

 

Integrations: a common language

2012.01.13

For anyone who wants to start an integration project and has not come across Enterprise Integration Patterns by Hohpe and Woolf – stop! I would highly recommend reading this book first, or at the very least visiting the website:

http://www.enterpriseintegrationpatterns.com/

One of the aspects of this book I really like is the ‘Gregor-grams’: symbols representing the particular patterns described. Here is an example:

gregorgram

and this is what it says: a document message is transformed, enriched and sent to a pub-sub channel. You can get these as a Visio template (see the website above) and I find this really useful, not only as a source of symbols, but as a recipe book for integration.

Don’t get me wrong, I’m not saying this is the only way to approach integration work: the Gregor-grams are focussed on messaging solutions and don’t cover database, file or screen integration; you are also likely to need other views of the integration, e.g. from a user (actor) perspective or a view of the technology stack.

Gregor-grams do make a great part of the toolkit and if we all use them it will be easier for those inheriting our designs to understand them.

 

Integration: who’s driving?

2011.12.25

When considering an integration it is important to be clear about which party is the customer and which is the provider. This is probably fairly obvious in a B2B relationship but may require some thought for internal scenarios. My default position is that the consumer, or customer, should drive the conversations. Consider this high-level scenario:

integration_initiation

Fairly typically, A is asking B for some information or to enact a transaction. As this may take some time, an asynchronous response is required (what is in the response is not important here; it could be some data, the result of a transaction or an error condition). It may be tempting to have A initiate the conversation with B when making the request (1) but have B initiate the conversation when providing the response (2). This can work well, especially in internal situations, but if A and B are in separate organisations, or if B needs to know how to respond to a variety of clients (As), then there is an overhead in that both parties need to build solutions to act as both the initiator and the responder in a conversation. I firmly believe that it is simpler to have one party initiate all conversations and the other to only ever respond to requests. Look at the more detailed solution:

integration_initiation_detail

Here A initiates all conversations. This has advantages (a polling sketch follows the list below):

  • A does not need to open firewall ports to receive responses from B
  • A does not need to build an endpoint for B to respond to
  • A gets to run things at its own pace
  • A does not need to correlate responses from B if using the pattern shown above (it does need to maintain its own state)
  • If A is temporarily down B is not impacted (simplifies SLAs)
  • A might not need to authenticate B (depending on what is being returned)
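
To illustrate, here is a minimal sketch of A driving the whole exchange: submit the request, then poll for the result at its own pace. The endpoints and response fields are hypothetical assumptions; the point is that B only ever answers requests and never needs to call back into A.

```python
# Sketch of a consumer-driven conversation: A submits work to B, then polls for the
# result. All URLs and response fields below are hypothetical.
import json
import time
import urllib.request

BASE = "https://system-b.example.com/api"   # hypothetical endpoint for System B

def post_json(url: str, payload: dict) -> dict:
    req = urllib.request.Request(url, data=json.dumps(payload).encode(),
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def request_and_poll(payload: dict, interval_seconds: int = 30) -> dict:
    """A submits a request to B, then polls until B reports the result is ready."""
    submitted = post_json(f"{BASE}/requests", payload)
    request_id = submitted["request_id"]
    while True:
        with urllib.request.urlopen(f"{BASE}/requests/{request_id}") as resp:
            result = json.load(resp)
        if result["status"] != "pending":
            return result
        time.sleep(interval_seconds)   # A runs things at its own pace
```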

Of course there are always exceptions but start with this pattern and then justify changing it based on specific requirements.