The Data Abstraction Layer: Software Architecture’s Great Frontier
Abstraction has meaningfully permeated almost every layer of modern software architecture except for the data layer. This lack of abstraction has led to a myriad of data-related problems, perhaps the most important of which is significant data duplication and inconsistency throughout most large enterprises. Companies have generally responded to these problems by building elaborate and expensive enterprise application integration (EAI) infrastructures to try to synchronize data throughout an enterprise and/or cleanse it of inconsistencies. But these infrastructures simply perpetuate the status quo and do nothing to address the root cause of all this confusion: the lack of true abstraction at the data layer. Fortunately, the status quo may soon be changing thanks to a new generation of technologies designed to create a persistent “data abstraction layer” that sits between databases and applications. This Data Abstraction Layer could greatly reduce the need for costly EAI infrastructures while significantly increasing the productivity and flexibility of application development.
Too Many Damn Databases
In an ideal world, companies would have just one master database. However, if you take a look inside any large company’s data center, you will quickly realize one thing: they have way too many damn databases. Why would companies have hundreds of databases when they know that having multiple databases is causing huge integration and consistency problems? Simply put, because they have hundreds of applications, and these applications have been programmed in a way that pre-ordains that each one must have a separate database.
Why would application programmers pre-ordain that their applications must have dedicated databases? Because of the three S’s: speed, security and schemas. Each of these factors drives the need for dedicated databases in their own way:
1. Speed: Performance is a critically important facet of almost every application. Programmers often spend countless hours optimizing their code to ensure proper performance. However, one of the biggest potential performance bottlenecks for many applications is the database. Given this, programmers often insist on their own dedicated database (and often their own dedicated hardware) to ensure that the database can be optimized, in terms of caching, connections, etc., for their particular application.
2. Security: Keeping data secure, even inside the firewall, has always been of paramount importance to data owners. In addition, new privacy regulations, such as HIPAA, have made it critically important for companies to protect data from being used in ways that violate customer privacy. When choosing between creating a new database or risking a potential security or privacy issue, most architects will simply take the safe path and create their own database. Such access control measures have the additional benefit of enhancing performance as they generally limit database load.
3. Schemas: The database schema is essentially the embodiment of an application’s data model. Poorly designed schemas can create major performance problems and can greatly limit the flexibility of an application to add features. As a result, most application architects spend a significant amount of time optimizing schemas for each particular application. With each schema heavily optimized for a particular application, it is often impossible for applications to share schemas, which in turn makes it logical to give each application its own database.
Taken together, the three S’s effectively guarantee that the utopian vision of a single master database for all applications will remain a fantasy for some time. The reality is that the three S’s (not to mention pragmatic realities such as mergers & acquisitions and internal politics) virtually guarantee that large companies will continue to have hundreds if not thousands of separate databases.
This situation appears to leave most companies in a terrible quandary: while they’d like to reduce the number of databases they have in order to reduce their problems with inconsistent and duplicative data, the three S’s basically dictate that this is next to impossible.
Master Database = Major Headache
Unwilling to accept such a fate, in the 1990s companies began to come up with “workarounds” to this problem. One of the most popular involved the establishment of “master databases” or databases “of record”. These uber databases typically contained some of the most commonly duplicated data, such as customer contact information. The idea was that these master databases would contain the sole “live” copy of this data. Every other database that had this information would simply subscribe to the master database. That way, if a record was updated in the master database, the updates would cascade down to all the subordinate databases. While not eliminating the duplication of data, master databases at least kept important data consistent.
The major drawback with this approach is that in order to ensure proper propagation of the updates it is usually necessary to install a complex EAI infrastructure as this infrastructure provides the publish & subscribe “bus” that links all of the master/servant databases together. However, in addition to being expensive and time consuming to install, EAI infrastructures must be constantly maintained because slight changes to schemas or access controls can often disrupt them.
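To make the mechanics concrete, here is a minimal sketch of the master/subscriber pattern described above. Everything in it is hypothetical: the `Bus`, `MasterDatabase` and `SubordinateDatabase` classes are in-memory stand-ins invented for illustration, not any real EAI product’s API. It also shows why such infrastructures are fragile: each subscriber carries its own field mapping, so a renamed field in the master breaks propagation.

```python
from collections import defaultdict
from typing import Callable, Dict, List

Record = Dict[str, str]
Handler = Callable[[str, Record], None]


class Bus:
    """Hypothetical in-memory stand-in for an EAI publish/subscribe bus."""

    def __init__(self) -> None:
        self._subscribers: Dict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Handler) -> None:
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, key: str, record: Record) -> None:
        for handler in self._subscribers[topic]:
            handler(key, record)


class SubordinateDatabase:
    """Keeps a local copy of master data under its own column names."""

    def __init__(self, name: str, field_map: Dict[str, str]) -> None:
        self.name = name
        self.field_map = field_map  # master field -> local column
        self.rows: Dict[str, Record] = {}

    def apply_update(self, key: str, record: Record) -> None:
        # Raises KeyError if the master no longer supplies a mapped field,
        # which is how a small schema change disrupts the whole chain.
        self.rows[key] = {
            local: record[master] for master, local in self.field_map.items()
        }


class MasterDatabase:
    """Holds the sole 'live' copy and publishes every change."""

    def __init__(self, bus: Bus) -> None:
        self.bus = bus
        self.rows: Dict[str, Record] = {}

    def upsert(self, key: str, record: Record) -> None:
        self.rows[key] = dict(record)
        self.bus.publish("customer", key, record)


bus = Bus()
master = MasterDatabase(bus)
billing = SubordinateDatabase("billing", {"name": "cust_name", "email": "contact_email"})
shipping = SubordinateDatabase("shipping", {"name": "recipient"})
bus.subscribe("customer", billing.apply_update)
bus.subscribe("customer", shipping.apply_update)

# One update to the master cascades to both subordinate copies.
master.upsert("c1", {"name": "Ada Lovelace", "email": "ada@example.com"})
```

Note that the data is still duplicated three times; the bus only keeps the copies in step, and only for as long as every mapping stays valid.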
Thus, many companies that turned to EAI to solve their data problems have unwittingly created an additional expensive albatross that they must spend significant amounts of time and money on just to maintain. The combination of these complex EAI infrastructures with the already fragmented database infrastructure has created what amounts to a Rube Goldberg-like IT architecture within many companies, which is incredibly expensive to maintain, troubleshoot, and expand. With so many interconnections and inter-dependencies, companies often find themselves reluctant to innovate as new technologies or applications might threaten the very delicate balance they have established in their existing infrastructure.
So the good news is that by using EAI it is possible to eliminate some data consistency problems, but the bad news is that the use of EAI often results in a complex and expensive infrastructure that can even reduce overall IT innovation. EAI’s fundamental failing is that rather than offering a truly innovative solution to the data problem, it simply “paves the cow path” by trying to incrementally enhance the existing flawed infrastructure.
The Way Out: Abstraction
In recognition of this fundamental failure, a large number of start-ups have been working on new technologies that might better solve these problems. While these start-ups are pursuing a variety of different technologies, a common theme that binds them is their embrace of “abstraction” as the key to solving data consistency and duplication problems.
Abstraction is one of the most basic principles of information technology and it underpins many of the advances in programming languages and technical architectures that have occurred in the past 20 years. One particular area in which abstraction has been applied with great success is in the definition of interfaces between the “layers” of an architecture. For example, by defining a standardized protocol (HTTP) and a standardized language (HTML), it has been possible to abstract much of the presentation layer from the application layer. This abstraction allows programmers working on the presentation layer to be blissfully unaware of and uncoordinated with the programmers working on the application layer. Even within the network layer, technologies such as DNS or NAT rely on simple but highly effective implementations of the principle of abstraction to drive dramatic improvements in the network infrastructure.
Despite all of its benefits, abstraction has not yet seen wide use in the data layer. In what looks like the dark ages compared to the presentation layer, programmers must often “hard code” to specific database schemas, data stores, and even network locations. They must also often use database-specific access control mechanisms and tokens.
This medieval behavior is primarily due to one of the three S’s: speed. Generally speaking, the more abstract an architecture, the more processing cycles it requires. Given the premium that many architects place on database performance, they have been highly reluctant to employ any technologies that might compromise performance.
However, as Moore’s Law continues its steady advance, performance concerns are becoming less pronounced and as a result architects are increasingly willing to consider “expensive” technologies such as abstraction, especially if they can help address data consistency and duplication problems.
The Many Faces of Abstraction
How exactly can abstraction solve these problems? It solves them by applying the principles of abstraction in several key areas including:
1. Security Abstraction: To preserve security and speed, database access has traditionally been carefully regulated. Database administrators typically “hard code” access control provisions by tying them to specific applications and/or users. Using abstraction, access control can be centralized and managed between the data layer and the application layer. This mediated access control frees programmers and database administrators from having to worry about coordinating with each other. It also provides for centralized management of data privacy and security issues.
2. Schema Abstraction: Rather than having programmers hard code to database schemas associated with specific databases, abstraction technologies enable them to code to virtual schemas that sit between the application and database layers. These virtual schemas may map to multiple tables in multiple different databases, but the application programmer remains blissfully unaware of the details. Some virtual schemas also theoretically have the advantage of being infinitely extensible, thereby allowing programmers to easily modify their data model without having to redo their database schemas.
3. Query/Update Abstraction: Once security and schemas have been abstracted it is possible to bring abstraction down to the level of individual queries and updates. Today queries and updates must be directed at specific databases and they must often have knowledge of how that data is stored and indexed within each database. Using abstraction to pre-process queries as they pass from the application layer to the data layer, it is possible for applications to generate federated or composite queries/updates. While applications view these composite queries/updates as a single request, they may in fact require multiple operations in multiple databases. For example, a single query to retrieve a list of a customer’s last 10 purchases may be broken down into 3 separate queries: one to a customer database, one to an orders database and one to a shipping database.
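The customer-purchases example in the last item can be sketched in a few lines. This is only an illustration under stated assumptions: the three separate databases are simulated with independent in-memory SQLite connections, and the table names, column names, sample rows and the `recent_purchases` function are all invented for the sketch. The application calls one function against a virtual schema (`item`, `purchased_on`, `shipping_status`) and never sees that three physical databases are involved.

```python
import sqlite3


def make_db(schema: str, insert_sql: str, rows: list) -> sqlite3.Connection:
    """Create an independent in-memory database (stands in for a separate server)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(schema)
    conn.executemany(insert_sql, rows)
    return conn


# Three physically separate databases, each with its own schema (all hypothetical).
customers = make_db(
    "CREATE TABLE cust (cust_id INTEGER, email TEXT)",
    "INSERT INTO cust VALUES (?, ?)",
    [(1, "ada@example.com")],
)
orders = make_db(
    "CREATE TABLE ord (order_id INTEGER, cust_id INTEGER, item TEXT, placed TEXT)",
    "INSERT INTO ord VALUES (?, ?, ?, ?)",
    [(10, 1, "book", "2004-01-05"), (11, 1, "lamp", "2004-02-09")],
)
shipping = make_db(
    "CREATE TABLE ship (order_id INTEGER, status TEXT)",
    "INSERT INTO ship VALUES (?, ?)",
    [(10, "delivered"), (11, "in transit")],
)


def recent_purchases(email: str, limit: int = 10) -> list:
    """One composite query from the application's view; three queries underneath."""
    # Query 1: resolve the customer in the customer database.
    row = customers.execute(
        "SELECT cust_id FROM cust WHERE email = ?", (email,)
    ).fetchone()
    if row is None:
        return []

    # Query 2: fetch the most recent orders from the orders database.
    order_rows = orders.execute(
        "SELECT order_id, item, placed FROM ord "
        "WHERE cust_id = ? ORDER BY placed DESC LIMIT ?",
        (row[0], limit),
    ).fetchall()

    # Query 3: look up each order's status in the shipping database,
    # then return rows shaped by the virtual schema, not the physical ones.
    purchases = []
    for order_id, item, placed in order_rows:
        ship = shipping.execute(
            "SELECT status FROM ship WHERE order_id = ?", (order_id,)
        ).fetchone()
        purchases.append(
            {
                "item": item,
                "purchased_on": placed,
                "shipping_status": ship[0] if ship else None,
            }
        )
    return purchases
```

A real federation engine would plan and optimize these sub-queries rather than hand-code them, but the division of labor is the same: the mapping from virtual fields to physical tables lives in the middle, not in the application.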
The Data Abstraction Layer
With security, schemas and queries abstracted, what starts to develop is a true data abstraction layer. This layer sits between the data layer and the application layer and decouples them once and for all, freeing programmers from having to worry about the intimate details of databases and freeing database administrators from maintaining hundreds of bilateral relationships with individual applications.
With this layer fully in place, the need for complicated EAI infrastructures starts to decline dramatically. Rather than replicating master databases through elaborate “pumps” and “busses”, these master databases are simply allowed to stand on their own. Programmers creating new data models and schemas select existing data from a library of abstracted elements. Queries/updates are pre-processed at the data abstraction layer which determines access privileges and then federates the request across the appropriate databases.
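The request flow just described (check privileges first, then federate) might look roughly like the following. This is a sketch under assumptions, not a description of any shipping product: the `DataAbstractionLayer` class, its method names, and the dictionary-backed “databases” are all hypothetical.

```python
from typing import Any, Callable, Dict, List, Set


class AccessDenied(Exception):
    """Raised when an application asks for a field it has not been granted."""


class DataAbstractionLayer:
    """Hypothetical mediator between applications and the underlying databases."""

    def __init__(self) -> None:
        # Library of abstracted elements: virtual field -> function that fetches it.
        self._sources: Dict[str, Callable[[str], Any]] = {}
        # Centralized access control: application name -> fields it may read.
        self._grants: Dict[str, Set[str]] = {}

    def register_field(self, field: str, fetch: Callable[[str], Any]) -> None:
        self._sources[field] = fetch

    def grant(self, app: str, fields: Set[str]) -> None:
        self._grants.setdefault(app, set()).update(fields)

    def query(self, app: str, key: str, fields: List[str]) -> Dict[str, Any]:
        # Step 1: determine access privileges in one place.
        allowed = self._grants.get(app, set())
        denied = [f for f in fields if f not in allowed]
        if denied:
            raise AccessDenied(f"{app} may not read: {', '.join(denied)}")
        # Step 2: federate the request across whichever databases hold the fields.
        return {f: self._sources[f](key) for f in fields}


# Two stand-in databases; in practice these would be separate servers.
crm_db = {"c1": {"name": "Ada Lovelace"}}
billing_db = {"c1": {"card_last4": "4242"}}

layer = DataAbstractionLayer()
layer.register_field("name", lambda key: crm_db[key]["name"])
layer.register_field("card_last4", lambda key: billing_db[key]["card_last4"])
layer.grant("newsletter_app", {"name"})
layer.grant("billing_app", {"name", "card_last4"})

profile = layer.query("billing_app", "c1", ["name", "card_last4"])
```

Because grants live in the layer rather than in each database, adding a new application means adding one grant instead of negotiating access with every database administrator.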
Data Servers: Infrastructure’s Next Big Thing?
With so much work to be done at the data abstraction layer, a whole new class of infrastructure software, called Data Servers, seems distinctly possible. Similar to the role application servers play in the application layer, Data Servers manage all of the abstractions and interfaces between the actual resources in the data layer and a set of generic APIs/standards for accessing them. In this way, the data servers virtually create the ever-elusive “single master database”. From the programmer’s perspective this database appears to have a unified access control and schema design, but it still allows the actual data layer to be highly optimized in terms of resource allocation, physical partitioning and maintenance.
The promise is that with data servers in place, there will be little if any rationale for replicating data across an organization as all databases can be accessed from all applications. By reducing the need for replication, data servers will not only reduce the need for expensive EAI infrastructures but they will reduce the actual duplication of data. Reducing the duplication of data will naturally lead to reduced problems with data consistency.
Today however, the promise of data servers remains just that, a promise. There remain a number of very tough challenges to overcome before Data Servers truly can be the “one database to rule them all”. Just a couple of these challenges include:
1. Distributed Two-Phase Commits: Complex transactions are typically consummated via a “two-phase commit” process that ensures the ACID properties of a transaction are not compromised. While simply querying or reading a database does not typically require a two-phase commit, writing to one typically does. Data servers theoretically need to be able to break up a write into several smaller writes; in essence, they need to be able to distribute a transaction across multiple databases while still being able to ensure a two-phase commit. There is general agreement in the computer science world that, right now at least, it is almost impossible to consummate a distributed two-phase commit with absolute certainty. Some start-ups are developing “workarounds” that cheat by only guaranteeing a two-phase commit with one database and letting the others fend for themselves, but this kind of compromise will not be acceptable over the long term.
2. Semantic Schema Mapping: Having truly abstracted schemas that enable programmers to reuse existing data elements and map them into their new schemas sounds great in theory, but it is very difficult to pull off in the real world where two programmers can look at the same data and easily come up with totally different definitions for it. Past attempts at similar programming reuse and standardization, such as object libraries, have had very poor results. To ensure that data is not needlessly replicated, technology that incorporates semantic analysis as well as intelligent pattern recognition will be needed to ensure that programmers do not unwittingly create a second customer database simply because they were unaware that one already existed.
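For readers unfamiliar with the first challenge, the basic shape of a two-phase commit across several databases is sketched below. The `Participant` class and `two_phase_commit` function are hypothetical and deliberately simplified: they model only the happy path and a refusal during the prepare phase. The sketch does not handle a coordinator or participant failing between the two phases, which is exactly where the difficulty described above lies.

```python
from typing import Dict, List, Optional, Tuple


class Participant:
    """Hypothetical database that can stage a write before committing it."""

    def __init__(self, name: str, can_commit: bool = True) -> None:
        self.name = name
        self.can_commit = can_commit  # simulates a database that refuses to prepare
        self.committed: Dict[str, str] = {}
        self._staged: Optional[Dict[str, str]] = None

    def prepare(self, writes: Dict[str, str]) -> bool:
        """Phase 1: stage the writes and vote yes or no."""
        if not self.can_commit:
            return False
        self._staged = dict(writes)
        return True

    def commit(self) -> None:
        """Phase 2: make the staged writes permanent."""
        if self._staged is not None:
            self.committed.update(self._staged)
            self._staged = None

    def rollback(self) -> None:
        """Discard anything staged during phase 1."""
        self._staged = None


def two_phase_commit(plan: List[Tuple[Participant, Dict[str, str]]]) -> bool:
    """Commit every partial write or none of them.

    Simplification: assumes no crashes between phase 1 and phase 2.
    """
    prepared: List[Participant] = []
    for participant, writes in plan:
        if participant.prepare(writes):
            prepared.append(participant)
        else:
            # Any "no" vote aborts the whole transaction.
            for p in prepared:
                p.rollback()
            return False
    for p in prepared:
        p.commit()
    return True


customers_db = Participant("customers")
orders_db = Participant("orders")
ok = two_phase_commit(
    [(customers_db, {"c1": "Ada Lovelace"}), (orders_db, {"o1": "book"})]
)
```

The “cheat” mentioned above corresponds to running this protocol against only one participant and writing to the others without a prepare step, so a failure there leaves the databases out of step.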
Despite these potential problems and many others, the race to build the data abstraction layer is definitely “on”. Led by a fleet of nimble start-ups, companies are moving quickly to develop different pieces of the data abstraction layer. For example, a whole class of companies such as Composite Software, Metamatrix, and Pantero are trying to build query/update federation engines that enable federated reads and writes to databases. On the schema abstraction front, not many companies have made dramatic progress, but companies such as Contivo are trying to create meta-data management systems which ultimately seek to enable the semantic integration of data schemas, while XML database companies such as Ipedo and Mark Logic continue to push forward the concept of infinitely extensible schemas.
Ultimately, the creation of a true Data Server will require a mix of technologies from a variety of companies. The market opportunity for whichever company successfully assembles all of the different pieces is likely to be enormous, though, perhaps equal to or larger than the application server market. This large market opportunity, combined with the continued data management pains of companies around the world, suggests that the vision of the universal Data Server may become a reality sooner than many people think, which will teach us once again never to underestimate the power of abstraction.