Wednesday, September 27, 2006

Enterprise Information Architecture as a foundation for successful data quality management.


Data quality is a well-known problem and a very expensive one to fix. While it has been plaguing major US corporations for quite some time, lately it is becoming increasingly painful. The higher prominence of various regulatory compliance acts, e.g., SOX, GLB, Basel II, HIPAA, and HOEPA, necessitates an adequate response to the problem.

A common approach to the data quality problem usually starts and ends with activities scoped to the physical data storage layer (frequently relational databases) in the classes of applications that depend heavily on enterprise data quality, e.g., Business Intelligence, Finance reporting, and Market Trend Analysis.
Not surprisingly, according to the trade publications, most of these efforts have minimal success. Given that in business applications data always exists within the context of a business process, all attempts to solve the “data quality problem” at the purely physical data level (a.k.a. databases and ETL tools) are doomed to fail.

Successful business data management begins by taking the focus away from data. The focus should initially be on the creation of an Enterprise Architecture, especially its commonly missing Business and Information Architecture constituents. Information Architecture spans the Business and Technology architectures, brings them together, keeps them together, and provides the rich contextual environment necessary to solve the ubiquitous “data quality problem”.

Thus Enterprise Business, Information and Technology architectures are needed for successful data management.

Data Quality

Data Quality Deficiency Syndrome
Major business initiatives in a broad spectrum of industries, private and public sectors alike, have been delayed and even cancelled, citing poor data quality as the main reason. The problem of poor information quality has become so severe that it has moved to the top tier among the reasons for business customers’ dissatisfaction with their IT counterparts.
While poor data quality is arguably the most noticeable issue, in the vast majority of cases it is accompanied by equally poor quality of systems engineering in general, i.e., requirements elicitation and management, application design, configuration management, change control, and overall project management. The popular belief that “if we just get data under control, the rest (usability, scalability, maintainability, modifiability, etc.) will also follow” has proven to be consistently wrong. I have never seen a company where the data quality level was significantly different from the overall IT environment quality level. If business applications in general are failing to meet (even) realistic expectations of the business users, then data is just one of the reasons cited, albeit the most frequent one. As a corollary, if business users are happy with the level of IT services, usually all the quality parameters of the IT organization’s effort are at a satisfactory level. I challenge readers, from their prior experience and knowledge, to come up with an example where data quality in a company was significantly different from the rest of the information systems quality parameters.
What is commonly called the “poor data quality problem” should more appropriately be called the “data quality deficiency syndrome”. It is indeed just a symptom of a larger and more complex phenomenon that can be called something like “poor quality of systems engineering in general”. Data quality is just the most tangible and obvious problem that our business partners observe 1.

What is data?
Since data plays such a prominent role in our discussion, let’s first agree on what data is. Generally, most agree that “data 2” is a statement accepted at face value. A large class of data is measurements of a variable.
While all the examples in this article assume numeric and alphanumeric values, the assertions should be applicable to image-typed values as well.

Data context
The notion that data is produced by measurements or observations is very significant. It points to a concept that is absolutely critical to the success of any data quality improvement effort: the notion of data context, or metadata. In other words, a number by itself, stripped of its context, is not really meaningful for business users. For example, the number 10 taken without an appropriate context is of little use. However, if one learns that we are talking about 10 cars and not 10 bikes, the number yields a better understanding of the business situation. The more data context is available, the better our ability to understand what this piece of data really means. To continue with the example, so far we have learned that we are talking about 10 cars. If we add to this context that we are talking about 10 cars awaiting detailing and then delivery to a specific party, say “Z Car Shop”, that has already paid for them, we have a much better understanding of the business circumstances surrounding this number. This is the crux of the “poor data quality problem”: a lack of sufficient data context. We typically do not have enough supporting information to understand what a particular number (or a set of numbers) means, and we thus cannot make an accurate judgment about the validity and applicability of the data. 3
As IT consultant Ellen Friedman puts it: “Trying to understand the business domain by understanding individual data elements out of context is like trying to understand a community by reading the phone book.”
The class of data that is the subject of this article always exists within the context of a business process. In order to solve the “poor data quality problem”, data context should always be well-defined, well-understood, and well-managed by data producers and consumers.
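The car example can be made concrete with a small sketch. The code below is only an illustration, not part of any particular tool: the class name `ContextualValue` and its fields are assumptions chosen to mirror the three pieces of context mentioned above (what is counted, where it is in the process, and for whom).

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class ContextualValue:
    """A number together with the business context needed to interpret it.

    All field names are hypothetical and chosen only for this illustration.
    """
    value: int
    unit: Optional[str] = None           # what is being counted, e.g. "cars"
    process_state: Optional[str] = None  # where the items are in the business process
    counterparty: Optional[str] = None   # who the items are destined for

    def missing_context(self) -> List[str]:
        """Return the names of the context fields that have not been supplied."""
        context = {
            "unit": self.unit,
            "process_state": self.process_state,
            "counterparty": self.counterparty,
        }
        return [name for name, supplied in context.items() if supplied is None]


# The same number, first stripped of context and then with context attached.
bare = ContextualValue(10)
rich = ContextualValue(
    10,
    unit="cars",
    process_state="paid, awaiting detailing and delivery",
    counterparty="Z Car Shop",
)
```

Both objects carry the same value, but only `rich` can be judged for validity and applicability; `bare.missing_context()` lists everything a consumer would still have to guess.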

Data quality attributes

Professor Richard Wang of MIT defines 15 dimensions or categories of data quality problems. They are: accuracy, objectivity, believability, reputation, relevancy, value-added, timeliness, completeness, amount of information, interpretability, ease of understanding, consistent representation, concise representation, access, and security.
A serious discussion of the above list would warrant a whole book; however, it is important to make a point that most of these attributes are in fact representing the notion of data context. For the purpose of our discussion on data quality, the most relevant attributes are: interpretability, ease of understanding, completeness, and timeliness.
The timeliness attribute, also known as the temporal aspect of information/data, is arguably the most intricate one from the data quality perspective.
There are at least two interpretations of data timeliness. The first deals with our ability to present required data to a data consumer on time. It is a derivative of good requirements and design, but in the context of this article, it is of little interest to us. The second aspect is the notion of data having a distinctive “time/event stamp” related to the business process, and thus allowing us to interpret data in conjunction with the appropriate business events. It is not hard to see that more than half of the data quality attributes in the list above are at least associated with, if not derived from, this interpretation of timeliness. The importance of the time/event attribute points to a fundamental problem with the conventional data modeling technique, i.e., entity-relationship modeling or entity-relationship diagram (ERD). The ERD method lacks any mechanism similar to UML’s Event and State Transition Diagrams. This gap in turn leads not only to a consistent under-representation of this extremely important aspect of data quality in the conventional data models, but also creates a serious knowledge-management problem for a large group of players in the data quality arena.

According to J.M. Juran, a well-known authority in the quality control area who popularized the Pareto principle, commonly referred to today as the “80-20 principle”, data are of high quality “if they are fit for their intended uses in operations, decision making and planning. Alternatively, data are deemed of high quality if they correctly represent the real-world construct to which they refer.” 4
Again, this definition points to the notion that data quality is dependent on our ability to understand data correctly and use them appropriately.
As an example, consider U.S. postal address data. Postal addresses are one of the very few data areas that have well defined and universally accepted standards. Even though an address can be validated against commercially available data banks to ensure its validity, this is not enough. If a shipping address is used for billing and vice-versa, or borrower correspondence address is used for appraisal, the results obviously will be wrong.

As already discussed above, the temporal aspect of information quality is extremely important for understanding and communicating, but it is often lost. For example, in the mortgage-backed securities arena, there are two very similar processes with almost identical associated data. The first is the Asset Accounting Cycle, which starts at the end of the month for interest accrual due next period. The second is the Cash Flow Distribution Cycle, which starts 15 days after the Asset Accounting Cycle begins. This difference of 15 calendar days, during which many possible changes to the status of a financial asset can take place, can make financial outcomes differ significantly, but from the pure data modeling perspective, the database models in both cases are very similar or even identical. A data modeler who is not intimately familiar with the nuances of a business process will not be able to discern the difference between the data associated with disparate processes just by analyzing the data in the database.
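A minimal sketch of why the time/event stamp matters: the same asset record, read through the same structure, answers differently depending on which cycle is asking. The status history, the dates, and the function name `status_as_of` are all invented for illustration; only the 15-day offset between the two cycles comes from the example above.

```python
from bisect import bisect_right
from datetime import date, timedelta

# Hypothetical status history for one financial asset:
# (effective date, status), kept sorted by effective date.
STATUS_HISTORY = [
    (date(2006, 1, 1), "current"),
    (date(2006, 9, 5), "paid off"),
]


def status_as_of(history, as_of):
    """Return the status in effect on `as_of`, or None if nothing is effective yet."""
    effective_dates = [effective for effective, _ in history]
    index = bisect_right(effective_dates, as_of)
    return history[index - 1][1] if index else None


# The Asset Accounting Cycle starts at month end; the Cash Flow Distribution
# Cycle starts 15 calendar days later. The data structure is identical for both.
accounting_cycle_start = date(2006, 8, 31)
cash_flow_cycle_start = accounting_cycle_start + timedelta(days=15)

accounting_view = status_as_of(STATUS_HISTORY, accounting_cycle_start)
cash_flow_view = status_as_of(STATUS_HISTORY, cash_flow_cycle_start)
```

Here the accounting cycle sees the asset as current while the cash flow cycle sees it as paid off. Without the business event that fixes the "as of" date, neither answer can be judged right or wrong from the stored data alone.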

Architecture as metadata source
As previously discussed, conventional data modeling techniques do not contain a mechanism that can provide sufficiently rich metadata, which is absolutely necessary for any data quality improvement effort to be successful. At the same time, this rich contextual model is a natural byproduct of a successful Enterprise Architecture (EA) development process, so long as this process adheres to a rigorous engineering approach 5.
Architecture definition
Architecture is one of the most used (and abused) terms in the areas of software and systems engineering. In order to get a good feel for the complexity of the systems architecture topics, it suffices to list some of the most commonly used architectural categories, methods and models: Enterprise, Data, Application, Systems, Infrastructure, Zachman, Information, Business, Network, Security, Model Driven Architecture (MDA) and certainly the latest silver bullet: Service-Oriented Architecture (SOA). All of the above architecture types naturally have a whole body of theoretical and practical knowledge associated with them. Any in-depth discussion of the various architectural categories and approaches is clearly outside the scope of this article; however, it is important to concentrate on the concept of Enterprise Architecture, and the following definition by Philippe Kruchten provides the context for this discussion: “Architecture encompasses the set of significant decisions about the system structure” 6.
Similarly, Eberhardt Rechtin states7 “A system is defined ... as a set of different elements so connected or related as to perform a unique function not performable by the elements alone”.
To emphasize the practical side of architecture development, the two definitions above can be further enriched. A long-time colleague of mine, Mike Regan, a systems architect with many successful system implementations under his belt, adds: “Architecture can be captured as a set of abstractions about the system that provide enough essential information to form the basis for communication, analysis, and decision making.”

From the above definitions, it is clear that system architecture is the fundamental organization of a system. System architecture contains definitions of the main system constituents, as well as the relationships among these constituents.
Naturally, the architecture of a complex system is very complex as well. In order to deal with such architectural complexity, some decomposition method is needed. One such method is the Three-Layered Model.


Three-Layered Model
All modern architectural approaches are centered on a concept of model layers -- horizontally-oriented groups defined by a common relationship with other layers, usually their immediate neighbors above and below. A possible layering for EA can consist of a capabilities (or business process) layer at the top, the information technology specifications layer in the middle, and the information technology physical implementation layer at the bottom. This model assumes an information systems-centered approach; in other words, the purpose of this architectural model is to provide an approach to successful information systems implementation.

A simplified Three-Layered Model is shown in Figure 1. Some key concepts are worth mentioning:
First, although business strategy is not a constituent of the Business Architecture layer, it represents a set of guidelines for enterprise actions regarding markets, products, business partners and clients. A more elaborate view of these actions is captured by Business Architecture.
Second, the model demonstrates that Enterprise Information Models reside in both the Conceptual and the Logical layers, and provide the foundation for consistent interaction between these layers.
Third, Enterprise IT Governance Framework is defined in the top conceptual layer, while IT standards and guidelines that support the Governance Framework are implemented in the Specification layer.
Finally, Enterprise Specification Layer defines only Enterprise Integration Model for the departmental systems, but not their internal architectures.

The discussion that follows the diagram expands on the notion of business process architecture and elaborates on the layering details.

Business Architecture
It is important to emphasize the business process layer as the foundation for our Enterprise Architecture model. Carnegie Mellon University (CMU) provides the following definition for Enterprise Architecture: “A means for describing business structures and processes that connect business structures”. Interestingly enough this definition from the CMU Software Architecture Glossary is actually applicable to and definitive of the EA as a whole, and not just for the Business EA.
The EA definition used by US Government agencies and departments emphasizes a strategic set of assets, which defines the business, as well as the information necessary to operate the business, and the technologies necessary to support the business operations.
While this definition maps extremely well to the three-layered view of EA proposed above, a word of caution is appropriate: while the three-layered model provides a good first approximation of Enterprise Architecture, it is by no means complete or rigorous. It is obvious that both the business and technical constituents of the EA model can and should in turn be decomposed into multiple sub-layers.
In the quest to make the Business Enterprise Architecture (BEA) layer a robust practical concept, BEA has morphed from the initial organizational chart-centered and thus brittle view, into a business process-centered orientation, and lately into a business capabilities-centered view, becoming even more resilient to business changes and transformations.
The current consensus around EA accentuates both its business and technical constituents. This business and IT partnership is highlighted even more by the advances of service-oriented architecture (SOA), which views the business process model and the supporting technology model as an assembly of interconnected services.

Architectural model as a foundation for data quality improvement
Top Business layer
In the proposed three-layered view of the EA, the business process (or capabilities) layer includes a business domain class model. Since this model is implemented at the highest possible level of abstraction, it captures only foundational business entities and their relationships. Thus, the top layer domain model is very stable and is not subject to change unless the most essential underlying business structures change. The information (or data) elements that are defined at this level of the domain model are cross-referenced against the business process model residing in the same top layer. In other words, every domain model element has at least one business process definition that references it. The reverse is also true: there is no information element called out in the business process definitions that does not exist in the domain model.
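The two-way cross-reference rule can be checked mechanically once both models are available in some machine-readable form. The sketch below assumes the simplest possible representation, plain sets of element names; the element and process names are invented for illustration and do not come from any real model.

```python
# Hypothetical top-layer models. Names are illustrative only.
DOMAIN_MODEL_ELEMENTS = {"Client", "Property", "Loan", "Servicer"}

# Business process definition -> domain elements it references.
PROCESS_REFERENCES = {
    "Originate Loan": {"Client", "Property", "Loan"},
    "Appraise Property": {"Property", "Appraisal"},
}


def cross_reference(domain_elements, process_references):
    """Check both directions of the rule described above.

    Returns a pair of sets:
      - domain elements that no business process references
      - elements referenced by a process but absent from the domain model
    """
    referenced = set().union(*process_references.values())
    unreferenced = domain_elements - referenced
    undefined = referenced - domain_elements
    return unreferenced, undefined


unreferenced, undefined = cross_reference(DOMAIN_MODEL_ELEMENTS, PROCESS_REFERENCES)
```

In this made-up case `Servicer` is defined but never used by a process, and `Appraisal` is used by a process but never defined; both sets should be empty in a consistent top layer.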

It is also worth pointing out that only common enterprise-level information processes and elements are captured at the top layer of EA. For example, suppose that, for historical reasons, an enterprise consists of multiple lines of business (LOB), each carrying out its own unique business process with related information definitions. At the same time, all the LOBs participate in the common enterprise process. In this case, at the top (enterprise level) business process layer, only the common enterprise-level process will be modeled. In extreme cases, this enterprise process will primarily consist of the interfaces between the LOB-level business processes.

Each of the enterprise’s LOBs will need to have its own three-layered model, where top level business entities and the corresponding information (or data) elements will be unambiguously mapped to the enterprise-level model entities. Needless to say, only the elements that have their counterparts at the enterprise level can possibly be mapped. By relating LOB-level definitions to the common enterprise-level equivalents, we are eliminating one of the main reasons for low enterprise data quality: semantic mismatch (a.k.a. ambiguity) between different business units. And since our data elements are cross-referenced with the business process models, we should have enough contextual information to correlate information elements at the enterprise- and LOB- levels. In the most difficult cases, UML State Transition Diagrams should be created to capture temporal and event aspects of the business processes.

Specification layer
The Specification Layer of the three-layered EA model introduces system-related considerations and defines specifications for the enterprise-level information systems. These are the systems that need to be constructed to support the business processes defined at the top layer of the model. By defining system requirements in terms of the business processes, another major cause of low data quality is eliminated: a disconnect between the business and the technology views of the enterprise system.

For example, it is quite common for more than one system to be operating on a data element defined at the top business layer. In this case, each system specification will define its own unique data attribute, but all these attributes are in turn mapped to the one element at the top layer. This top down decomposition approach helps to alleviate a problem known as “departmental information silo”.
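As a rough sketch of this top-down mapping, the catalogue below ties each system-specific attribute to exactly one enterprise-level element. The system names, attribute names, and helper functions are hypothetical and serve only to show the shape of the mapping.

```python
from collections import defaultdict

# Hypothetical specification-layer catalogue: each (system, attribute) pair
# maps to exactly one element of the top-layer domain model.
ATTRIBUTE_TO_ELEMENT = {
    ("loan_servicing", "BORR_MAIL_ADDR"): "Client Correspondence Address",
    ("crm", "contact_address"): "Client Correspondence Address",
    ("invoicing", "bill_to_address"): "Billing Address",
}


def attributes_by_element(mapping):
    """Group system attributes under the enterprise element they implement."""
    grouped = defaultdict(list)
    for (system, attribute), element in sorted(mapping.items()):
        grouped[element].append(f"{system}.{attribute}")
    return dict(grouped)


def unmapped_attributes(system_attributes, mapping):
    """Return attributes with no enterprise-level counterpart (potential silo data)."""
    return [pair for pair in system_attributes if pair not in mapping]


grouped = attributes_by_element(ATTRIBUTE_TO_ELEMENT)
orphans = unmapped_attributes(
    [("crm", "contact_address"), ("crm", "legacy_addr_2")],
    ATTRIBUTE_TO_ELEMENT,
)
```

Two differently named attributes resolve to the same enterprise element, while an attribute that maps to nothing is flagged for review rather than silently accepted.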

Again, similar to the top layer, in the spirit of correlating data with the process contextual information, Business Use Cases Realizations and System Use Cases (or similar artifacts) are introduced at this level to provide enough grounding for the data definitions. It is important to note that in addition to the enterprise systems, the system interfaces of LOB-level systems (to support business process connection points between the different LOBs) are also specified in this layer.

Implementation layer

In this layer, the platform-specific implementations are defined and implemented. Unlike at the Specification layer, multiple platform-specific implementations may be mapped to the same element defined at the Specification layer. This unambiguous, contextually-based mapping from possibly multiple technology-specific implementations to a data element defined at the technology-independent specification level is the foundation for a robust, high-quality data management approach.

It is impossible to overestimate the importance of the two-dimensional traceability in the discussed architectural model. The first dimension – vertical traceability between the model layers – provides a foundation for rich contextual connection between the business process and the system implementation that supports this process. The second dimension – horizontal traceability within the same model layer – provides a foundation for a rich contextual connection between the hierarchical organizational units, as well as the systems implemented at their respective levels.
A robust traceability mechanism is absolutely necessary for high data quality to become a reality. The architectural model provides a foundation for the information traceability and thus data quality, without which it is not possible to address a cluster of issues introduced by the modern business environment in general and especially by the legal and regulatory compliance concerns.

There Are No Pure Data Problems

Printed in Computerworld; October 17, 2005

While I agree with Ken Karacsony’s assessment that too much ETL is a sign of potential problems (COMPUTERWORLD, SEPTEMBER 05, 2005), I have a very different opinion on what is at the heart of the issue and what kind of solution it deserves. Before I continue with the rest of my response, I would like to emphasize that everything I am asserting is mainly relevant to the on-line transactional processing (OLTP) side of the IT domain. Things look somewhat different on the on-line analytical processing (OLAP) side.
While Ken Karacsony states that what we see is a sign of “poor data management” I tend to think that much more often it is a sign of poor engineering practices in general rather than just poor data management.
Data (or the numeric value of certain business-related attributes) tends to be the most tangible and visible aspect that we, as well as our business partners, can observe. Given that we work for the business community, this visibility to and perception by the business users is much more important than our own (IT professional) perception. Quite often when business users say “we have data problems”, we should interpret their statement as “something is wrong with the system, I do not know what it is exactly, I just know that it gives me a wrong answer, please fix it”.
There is no such thing as a “pure data problem”, because in any business application data always exists within the context of the business process. Whenever data is taken out of the (business process) context, e.g., stored in relational DBMS tables, it loses a considerable portion of its semantic significance. For instance, let’s assume that a typical database for a financial services company has an Address record defined. While it may be sufficient for very simple cases to have just one flavor of address in the database, with an increase in business process complexity, data analysts and system developers will find themselves dealing with numerous variations of the Address structure: Current Client Residence Address, Property Address, Client Correspondence Address, Shipping Address, Billing Address, Third Party Address, etc. While all these Address records may have an identical physical structure, semantically they are very different. For example, using an automated home appraisal method with the wrong address, e.g., the Current Client Residence Address instead of the Property Address, will produce a wrong result, which is impossible to catch outside of the business process context. Giving the Shipping department a Billing address instead of the Shipping one is probably also a bad idea.
One way to ensure that data is not taken out of the business context is to build cohesive systems around a logical unit of the business, and expose these systems to each other only through semantically-rich messages. The advantage that the messaging integration style has over the shared database integration style is this ability to transmit not only the shared data but also the shared business context semantics. While it is not hard to maintain a similar degree of clarity within the shared database design style, in the absence of a very mature development process, a shared database, by its very nature servicing many different owners at the same time, will rapidly lose its initial design crispness due to the inability to keep up with the numerous modification requests. This in turn will lead to data overloading, redundancy, inconsistency and, in the end, to poor “data quality” at the application level. Do not get me wrong: I am not against the shared data store integration approach; I am just recommending being realistic about the complexity of the method within the confines of the modern business environment. I would recommend using shared data integration within the scope of a single business unit, while using message-based integration for inter-departmental development as well as at the enterprise level. It is significantly easier to provide a highly cohesive development environment within the boundaries of a single business unit due to the natural uniformity of the unit’s business priorities.
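The Address example can be sketched in a few lines. This is an illustration under assumed names (`AddressRole`, `Address`, `select_address`), not a recommended schema: the point is only that the role, which is business process context, travels with the record, so a consumer can ask for the address it actually needs.

```python
from dataclasses import dataclass
from enum import Enum


class AddressRole(Enum):
    """Semantic role of an address; the physical structure is the same for all."""
    CLIENT_RESIDENCE = "Current Client Residence Address"
    PROPERTY = "Property Address"
    SHIPPING = "Shipping Address"
    BILLING = "Billing Address"


@dataclass(frozen=True)
class Address:
    street: str
    city: str
    state: str
    postal_code: str
    role: AddressRole  # the business context that a bare table row would lose


def select_address(addresses, required_role):
    """Return the single address with the required role, or raise LookupError."""
    matches = [address for address in addresses if address.role is required_role]
    if len(matches) != 1:
        raise LookupError(
            f"expected exactly one {required_role.value}, found {len(matches)}"
        )
    return matches[0]


# Hypothetical sample data: two structurally identical records with different roles.
CLIENT_ADDRESSES = [
    Address("1 Main St", "Springfield", "IL", "62701", AddressRole.CLIENT_RESIDENCE),
    Address("9 Lake Rd", "Madison", "WI", "53703", AddressRole.PROPERTY),
]

appraisal_address = select_address(CLIENT_ADDRESSES, AddressRole.PROPERTY)
```

An appraisal step that asks for `AddressRole.PROPERTY` cannot accidentally receive the residence address, and a request for a role that is not present fails loudly instead of returning a structurally valid but semantically wrong record.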
Except for the area of ad hoc reporting, our clients do not deal with databases -- they deal with business applications. I also would argue that too much ad hoc reporting signals problems with the business process design, and/or application workflow design, and/or UI design. Too many OLTP applications are poorly designed and thus have very inadequate usability characteristics, forcing users to compensate by requesting a high volume of “canned” reports as well as sophisticated ad hoc reporting capabilities. In the world of carefully designed applications, it is the applications and not the databases that are the centers of customer interactions. As an example, I recently worked on a project where we were able to either completely eliminate, or migrate into an application’s human workflow process more than half of the reports initially requested by the business users.
The solution to the “too much ETL” problem in the OLTP world is thus less centralization and lower coupling of the OLTP systems and not more centralization and tighter application coupling through a common data store. One can argue that it is always possible to introduce a layer of indirection (e.g., XML) between the application logic and the common database physical schema, thus providing a level of flexibility and decoupling. While this may work for some companies, in my personal experience, this type of design proved to be harder to maintain than the more robust asynchronous middleware-based messaging due to the fact that it mixes two different design paradigms.
I would be interested in hearing from COMPUTERWORLD readers about any medium- to large-sized company that was successful in building multi-departmental Operational Data Stores that worked well with the multiple inter-departmental systems through a number of consecutive releases. I predict that it will be hard to find a significant number of cases to discuss at all, and it will be especially difficult to find any examples from companies with a dynamic business process that requires constant introduction of new products and services. The main reason for the lack of success, from my point of view, is not technical in nature. It is relatively easy to build tightly-coupled applications integrated via the common data store, especially if it is done under the umbrella of one single program with a mature system development culture. The problem is in the “Realpolitik” of a modern business environment: we live in and work for businesses in the age of ever-accelerating global competition. It is almost impossible to coordinate the business plans of various departments, and the subsequent deployment schedules of multiple IT projects, each working on its own group of business priorities, in order to keep systems that are built around one shared database current. When one of the interdependent development teams misses a deliverable deadline, political pressure to separate will become hard to resist. And if a commercial off-the-shelf (COTS) software package is acquired, or a corporate merger or an acquisition takes place, the whole idea of all applications working with one common data format is immediately thrown out the window.
So we, in IT, need to learn how to build systems that will not require rigid release synchronization from the multiple OLTP systems belonging to disparate business units. Decoupling can provide us with the required flexibility to modify our systems on a coordinated, but not prohibitively-rigid schedule.
Finally, it is important to emphasize that while loose coupling gives us an opportunity to modify different systems on different schedules without corrupting the coupled systems, loosely-coupled does not mean “loosely-managed.” Loose coupling provides us with a degree of flexibility in implementation and deployment. This additional degree of flexibility gives our business partners the ability to move rapidly when they need to and at the same time provides IT with the ability to contain and manage the challenges caused by the ever-increasing rate of business change. We have to acknowledge that developing loosely coupled applications that work well together across an enterprise with well-delineated responsibilities is a very challenging engineering problem. If not managed well, this type of system development may turn the advantage of loose coupling into the disadvantage of “delayed-action” semantic problems. A mature IT development process is absolutely necessary to overcome this engineering problem and deliver this type of information infrastructure to our business partners. From this perspective, it is worthwhile for any organization that is striving to build a well-integrated enterprise-level IT infrastructure to look into the SEI Capability Maturity Model. Specifically, Maturity Level 3, called the Defined Level, addresses issues of development process consistency across the whole enterprise. From my point of view it is a prerequisite to the physical integration of the enterprise systems into a consistent whole. The CMM manual describes this process level as one where “the standard process for developing and maintaining software across the organization is documented, including both software engineering and management processes, and these processes are integrated into a coherent whole.” Unless an organization is prepared to operate at this level, it should not have high hopes for success in the integration area.
So to summarize: successful data management begins by taking focus away from data. Instead, the focus should be on the general level of system engineering and its main aspects, i.e., Requirements Analysis and Management, Business Domain Modeling, Configuration Management, QA Process, etc.
I would argue that any medium to large company that has not reached CMM level 3 and is trying to get “data under control” would have little chance to succeed in this undertaking, regardless of what integration style it will use.