The Storage Explosion
I am a big believer in the "scarcity and abundance" theory of IT development. The theory basically postulates that, if you want to understand the near-term future of information technology, the most important thing to consider is the scarcity and/or abundance of the "big 3" foundations of computing: processing power, storage, and bandwidth. Understanding the absolute and relative levels of these three technologies is the closest thing possible to having an IT industry crystal ball, as they have a huge influence over everything from system architectures to capital investment plans to end-user demand.
A 15 Year Rocket Ship Ride
It is with this in mind that I found a recent article on Tom's Hardware to be fascinating. The article details the increases in storage capacity and performance over the last 15 years. Some of the numbers involved are mind-boggling. In the last 15 years, the storage capacity of top-end hard drives has increased 5,907X, from 130MB in 1991 to 750GB in 2006. To put this in perspective, here are the increases in the other core technologies of a "bleeding edge" PC during the same time period:
| End User Technology | 1991 | 2006 | X Increase |
| --- | --- | --- | --- |
| LAN Bandwidth | 10 mbit/s (10BaseT) | 1 gbit/s (1000BaseT) | 102X |
| WAN Bandwidth | 14.4 kbit/s (v.32bis) | 3 mbit/s (Cable) | 213X |
| CPU Performance | 54 MIPS (486DX) | 27,079 MIPS (X6800) | 501X |
The price difference is even more dramatic. In 1991 a megabyte of storage cost about $7.00, now it costs $0.000527. That’s a 13,274X price improvement or a 99.9925% price drop in 15 years. Not too shabby.
It's interesting to note, though, that disk I/O performance improvement has lagged dramatically, with only a 121X improvement from 0.7 MB/s to 85 MB/s. This clearly makes disk I/O one of the biggest, if not the biggest, bottlenecks in a modern PC. Put another way, it would theoretically take 3.1 minutes to write the entire contents of a 130MB disk drive in 1991, while it takes about 2.5 hours to write the contents of a 750GB drive in 2006 (as anyone who has tried to back up a monster like this well knows).
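Those drive-dump times can be sanity-checked in a few lines of Python. The capacities and throughputs are the figures quoted above; binary megabytes per gigabyte are an assumption:

```python
# Back-of-the-envelope check of the full-drive write times quoted above.
# Assumed sustained throughputs: 0.7 MB/s (1991) and 85 MB/s (2006).

def full_write_time(capacity_mb: float, throughput_mb_s: float) -> float:
    """Seconds needed to write the drive's full capacity sequentially."""
    return capacity_mb / throughput_mb_s

t_1991 = full_write_time(130, 0.7)          # 130MB drive
t_2006 = full_write_time(750 * 1024, 85)    # 750GB drive (binary GB assumed)

print(f"1991: {t_1991 / 60:.1f} minutes")   # ~3.1 minutes
print(f"2006: {t_2006 / 3600:.1f} hours")   # ~2.5 hours
```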
Forecast Calls For: More, Cheaper Storage
The near-term prospects for additional gains in storage capacity remain promising. For traditional magnetic hard drives, the introduction of perpendicular recording should continue to drive platter densities higher, but, much like CPUs, the limits of physics are starting to catch up with magnetic hard drives, which suggests that they will not be able to continue their capacity gains indefinitely. Not to worry though: two new storage technologies are finally coming to market this year, Flash memory hard drives and holographic storage. Flash memory hard drives use NAND chips instead of magnetic platters to store data. They have a number of important advantages over traditional magnetic media, including dramatically faster file access times and the ability to process far more I/O operations per second. For example, SanDisk just announced today a 32GB flash drive that accesses files in 0.12 milliseconds vs. 8 milliseconds for the best magnetic drive, a 67X improvement (although its read rate is only 64 MB/s). Flash drives also draw 50-70% less power than magnetic drives, weigh about 50% less, produce no noise or vibration, give off very little heat, and are less fragile than magnetic drives. But all these improvements do not come cheaply: the initial 32GB flash drives will cost more than $20/GB, or about 35-40X more than a magnetic drive. Still, you should expect to see these as an option on high-end ultra-portable laptops in a few months. I also expect that many PC enthusiasts will buy a flash drive for their "C:" drive to dramatically improve Windows Vista boot times.
2007 will also see the first commercially available holographic storage drives, although these drives are likely to compete only with archival media, such as tape drives and Blu-Ray discs, for the foreseeable future. Holographic areal storage densities have doubled in just the past year and, at 515 Gb/inch, are already more than 2X those of the most advanced magnetic drives. To put this in perspective, Sony's Blu-Ray discs, which have just started shipping after many delays, have a 50GB capacity, while InPhase's initial holographic disk (which had its first shipments just a few days ago) has a 300GB capacity and should go to 500GB in the next year or so. Put another way, you could store 106 DVDs on a single (slightly larger) holographic disk (although you can only write to these disks at about 23 MB/s, so be prepared to spend over 6 hours putting your DVD collection on one). Because it uses light rather than magnetic heads to access data, holographic storage should theoretically deliver better access times and faster read throughput than magnetic storage media. Holographic storage is expected to seriously challenge tape backup systems in the short term, but could also displace consumer storage media such as DVD and Blu-Ray as prices fall. Holographic hard disk drives are theoretically possible as well, but significant technical challenges still have to be overcome, especially in terms of rewriting the media and improving write speeds.
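The DVD comparison above works out as follows, assuming standard 4.7GB single-layer DVDs and the quoted 23 MB/s write speed (binary units assumed):

```python
# Checking the holographic-disk figures: how many DVDs fit on a 500GB disk,
# and how long filling it takes at the quoted 23 MB/s write throughput.

disk_gb = 500      # projected holographic disk capacity
dvd_gb = 4.7       # single-layer DVD capacity (assumption)
write_mb_s = 23    # quoted holographic write speed

dvds_per_disk = disk_gb / dvd_gb
fill_hours = (disk_gb * 1024) / write_mb_s / 3600

print(f"{dvds_per_disk:.0f} DVDs per disk")   # ~106
print(f"{fill_hours:.1f} hours to fill")      # ~6.2
```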
What Happens In A World Awash With Storage?
You may be asking yourself: this is all very interesting, but why does someone who is generally focused on software and the internet care so much about storage? I care because storage capacity is clearly the most abundant and fastest growing component of the "big three" (processing power, bandwidth, and storage) today, and it looks as though the relative disparity between the three may increase even further in the short term. This situation should have some very interesting implications for both software and internet related businesses, some of which I will outline in my next post on the topic.
Sweet Revenge: Oracle Takes Over Siebel
There have been a lot of great posts written about Oracle’s purchase of Siebel today, so I won’t belabor the point, however I’d like to offer up that in addition to this acquisition being consistent with Oracle’s somewhat dubious consolidation strategy, it also is sweet revenge for Larry Ellison.
You see, when Tom Siebel left Oracle to start Siebel there was, according to many people I've talked to, no love lost between Larry and Tom. Indeed, today's news brought back memories of a dinner at the Circus Club in the summer of 2001 where Tom Siebel was receiving the "Entrepreneur of the Year" award from Stanford's Center for Entrepreneurial Studies. At the time, Siebel's stock was still around $50/share and Tom Siebel was still running the ship. In his speech, Siebel talked a lot about how he ran the company in a much different manner than "the average Silicon Valley company". It quickly became clear that he was really saying that he ran the company much more ethically and professionally than Larry Ellison ran Oracle. He capped off his comparison with a clear and very personal swipe at Ellison when he said something to the effect of "I did not want to found a company where the CEO walks off the corporate jet with the PR floozy of the week on his arm." The room was packed with VCs and Silicon Valley CEOs, and there was a lot of snickering when Siebel made this statement because everyone knew exactly who he was talking about.
I don’t know if Ellison ever heard about this speech, but if he did it probably only strengthened his resolve to get the last laugh. Fast forward 4 years and it looks like (a now married) Ellison will indeed be getting the last laugh. My guess is that all traces of “Siebel” will be erased within 24 hours after the acquisition closes. A lot of posts are talking about the premium that Oracle paid to take out Siebel however they neglect to figure in that the value of sweet revenge is priceless.
Software's Top 10 2005 Trends: #9 Data Abstraction
While abstraction has helped drive tremendous innovation and flexibility at the presentation and business logic layers, it has yet to be fully implemented in the data layer.
During 2005, the concept of “data abstraction” is likely to get increasing attention thanks to some early successes in the space, mostly from so-called Enterprise Information Integration (EII) players such as Composite Software and Metamatrix.
In the near term, the most promising space for EII appears to be business intelligence, and many of the major BI players have already established partnerships in the space (Cognos with Composite; Business Objects and Hyperion with Metamatrix). Over the long term, however, as the fundamental concepts of data abstraction are "baked into" both application servers and development tools, data abstraction is primed to become a powerful trend that could impact the balance of power in the database industry and fundamentally change data management.
For now though, the trend is only #9 on our Top 10 list because it's really just getting under way at this point.
For a complete list of Software's Top 10 2005 trends click here.
Honey I Bought The Wrong Company!
On Monday December 13th Oracle announced that it had finally reached an agreement to acquire Peoplesoft, thus ending a corporate siege reminiscent of the Roman siege of Masada. While I suspect many at Oracle are feeling quite triumphant right now, they have a big problem: they bought the wrong company.
The Other Deal
Just two days after the Peoplesoft deal was announced, Symantec announced that it was going to buy Veritas for $13.5BN. Veritas is the market leader in backup, recovery, and high availability software and also an emerging player in the database and application management markets thanks to its acquisition of Precise in 2002. As it happens, Veritas built much of its business by selling back-up and recovery software to Oracle’s customers (much in the same way that Business Objects built its BI business).
As I have written before, Oracle’s pursuit of Peoplesoft appears to have been based on two main assumptions: 1) Oracle can generate significant scale economies and marginally increase its overall growth rate by acquiring one of its major ERP competitors. 2) Oracle’s belief that its core market, data management, is a mature, low growth space.
It's hard to dispute assumption #1 (although with the higher price Oracle is paying, the deal is inherently less accretive); however, assumption #2 is likely flat-out wrong. There is a relative explosion of growth and innovation going on in the data management business compared to the ERP business. Whether it's unstructured data management, application management, virtualization, or disaster recovery, new opportunities for growth abound in the data management space.
Just look at Veritas. It has made data management (in a broad sense) its core business and grew revenues 17% between 2001 and 2003, while Peoplesoft grew its revenues only 9.4% during the same period. Granted, Veritas isn't exactly a growth poster child, but it would have grown much faster if it weren't for some sales execution and product transition issues (thus its sellout to Symantec).
Moving the Needle … In the Wrong Direction
After I wrote my last piece on Oracle, a few people who claimed to be in the know told me “Hey, Oracle knows that data management represents an attractive long term opportunity, but Oracle needs to make acquisitions that will move the needle in the short term, and there are no comparable companies in the data management space that are big enough to make a difference”.
Unfortunately, that statement doesn't hold much water when one looks at Veritas. While Peoplesoft does indeed have more revenue, with $699M vs. Veritas' $497M in Q3 04, 77% of that revenue came from low-margin services vs. 42% for Veritas. Given its huge services business, it's not surprising that Peoplesoft is much less profitable than Veritas, with only $40M in operating profits vs. Veritas' $96M in Q3 04. On a trailing 12-month basis the profit differential is even worse, with Peoplesoft earning only $107M in profits vs. Veritas' $500M.
What this means is that Oracle's $10.3BN acquisition of Peoplesoft, excluding cash, equates to about 81X operating income, while Symantec is getting Veritas for the comparative bargain of about 22X operating income. Even if Oracle can get PSFT's operating earnings back up to $250M/year (where they were in 2002 … and 2001), that's still 35X operating income.
When you net it all out, from even an optimistic perspective, Oracle looks to be paying about a 50%+ premium for a business that is growing about ½ the rate of Veritas and producing far less in absolute dollar profits.
It’s hard to look at these numbers as well as the relative long term growth opportunities in ERP vs. data management and not come to the conclusion that Oracle is truly buying the wrong company. The irony of the situation is probably not lost on Oracle’s former product head, Gary Bloom, who just happens to be CEO of Veritas. I can just imagine him sitting back in his chair, shaking his head, and laughing.
The Data Abstraction Layer: Software Architecture’s Great Frontier
Abstraction has meaningfully permeated almost every layer of modern software architecture except for the data layer. This lack of abstraction has led to a myriad of data-related problems, perhaps the most important of which is significant data duplication and inconsistency throughout most large enterprises. Companies have generally responded to these problems by building elaborate and expensive enterprise application integration (EAI) infrastructures to try and synchronize data throughout an enterprise and/or cleanse it of inconsistencies. But these infrastructures simply perpetuate the status quo and do nothing to address the root cause of all this confusion: the lack of true abstraction at the data layer. Fortunately, the status quo may soon be changing thanks to a new generation of technologies designed to create a persistent "data abstraction layer" that sits between databases and applications. This Data Abstraction Layer could greatly reduce the need for costly EAI infrastructures while significantly increasing the productivity and flexibility of application development.
Too Many Damn Databases
In an ideal world, companies would have just one master database. However, if you take a look inside any large company's data center, you will quickly realize one thing: they have way too many damn databases. Why would companies have hundreds of databases when they know that having multiple databases is causing huge integration and consistency problems? Simply put, because they have hundreds of applications, and these applications have been programmed in a way that pre-ordains that each one has to have a separate database.
Why would application programmers pre-ordain that their applications must have dedicated databases? Because of the three S’s: speed, security and schemas. Each of these factors drives the need for dedicated databases in their own way:
1. Speed: Performance is a critically important facet of almost every application. Programmers often spend countless hours optimizing their code to ensure proper performance. However, one of the biggest potential performance bottlenecks for many applications is the database. Given this, programmers often insist on their own dedicated database (and often their own dedicated hardware) to ensure that the database can be optimized, in terms of caching, connections, etc., for their particular application.
2. Security: Keeping data secure, even inside the firewall, has always been of paramount importance to data owners. In addition, new privacy regulations, such as HIPAA, have made it critically important for companies to protect data from being used in ways that violate customer privacy. When choosing between creating a new database or risking a potential security or privacy issue, most architects will simply take the safe path and create their own database. Such access control measures have the additional benefit of enhancing performance as they generally limit database load.
3. Schemas: The database schema is essentially the embodiment of an application's data model. Poorly designed schemas can create major performance problems and can greatly limit the flexibility of an application to add features. As a result, most application architects spend a significant amount of time optimizing schemas for each particular application. With each schema heavily optimized for a particular application, it is often impossible for applications to share schemas, which in turn makes it logical to give each application its own database.
Taken together, the three S’s effectively guarantee that the utopian vision of a single master database for all applications will remain a fantasy for some time. The reality is that the 3 S’s (not to mention pragmatic realities such as mergers & acquisitions and internal politics) virtually guarantee that large companies will continue to have hundreds if not thousands of separate databases.
This situation appears to leave most companies in a terrible quandary: while they'd like to reduce the number of databases they have in order to reduce their problems with inconsistent and duplicative data, the three S's basically dictate that this is next to impossible.
Master Database = Major Headache
Unwilling to accept such a fate, in the 1990s companies began to come up with workarounds to this problem. One of the most popular involved the establishment of "master databases" or databases "of record". These uber-databases typically contained some of the most commonly duplicated data, such as customer contact information. The idea was that these master databases would contain the sole "live" copy of this data. Every other database that had this information would simply subscribe to the master database. That way, if a record was updated in the master database, the update would cascade down to all the subordinate databases. While not eliminating the duplication of data, master databases at least kept important data consistent.
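The master-database pattern described above is essentially publish and subscribe. As a rough illustration (toy classes invented for this sketch, not any actual EAI product's API), the cascade looks something like this:

```python
# Toy sketch of a "database of record": subordinate databases subscribe to
# a master, and every update to the master cascades down to them.

class MasterDatabase:
    def __init__(self):
        self.records = {}
        self.subscribers = []

    def subscribe(self, db):
        self.subscribers.append(db)

    def update(self, key, value):
        self.records[key] = value
        for db in self.subscribers:      # cascade to every subordinate
            db.apply_update(key, value)

class SubordinateDatabase:
    def __init__(self):
        self.records = {}

    def apply_update(self, key, value):
        self.records[key] = value

master = MasterDatabase()
crm, billing = SubordinateDatabase(), SubordinateDatabase()
master.subscribe(crm)
master.subscribe(billing)

master.update("cust_42_address", "1 Main St")
print(crm.records == billing.records == master.records)  # True
```

The fragility the next paragraph describes lives in that `for` loop: in a real deployment it is an EAI "bus" whose mappings break whenever a subordinate's schema changes.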
The major drawback with this approach is that in order to ensure proper propagation of the updates it is usually necessary to install a complex EAI infrastructure as this infrastructure provides the publish & subscribe “bus” that links all of the master/servant databases together. However, in addition to being expensive and time consuming to install, EAI infrastructures must be constantly maintained because slight changes to schemas or access controls can often disrupt them.
Thus, many companies that turned to EAI to solve their data problems have unwittingly created an additional expensive albatross that they must spend significant amounts of time and money just to maintain. The combination of these complex EAI infrastructures with the already fragmented database infrastructure has created what amounts to a Rube Goldberg-like IT architecture within many companies, one that is incredibly expensive to maintain, troubleshoot, and expand. With so many interconnections and inter-dependencies, companies often find themselves reluctant to innovate, as new technologies or applications might threaten the very delicate balance they have established in their existing infrastructure.
So the good news is that by using EAI it is possible to eliminate some data consistency problems, but the bad news is that the use of EAI often results in a complex and expensive infrastructure that can even reduce overall IT innovation. EAI’s fundamental failing is that rather than offering a truly innovative solution to the data problem, it simply “paves the cow path” by trying to incrementally enhance the existing flawed infrastructure.
The Way Out: Abstraction
In recognition of this fundamental failure, a large number of start-ups have been working on new technologies that might better solve these problems. While these start-ups are pursuing a variety of different technologies, a common theme that binds them is their embrace of "abstraction" as the key to solving data consistency and duplication problems.
Abstraction is one of the most basic principles of information technology, and it underpins many of the advances in programming languages and technical architectures that have occurred in the past 20 years. One particular area in which abstraction has been applied with great success is the definition of interfaces between the "layers" of an architecture. For example, by defining a standardized protocol (HTTP) and a standardized language (HTML), it has been possible to abstract much of the presentation layer from the application layer. This abstraction allows programmers working on the presentation layer to be blissfully unaware of, and uncoordinated with, the programmers working on the application layer. Even within the network layer, technologies such as DNS and NAT rely on simple but highly effective implementations of the principle of abstraction to drive dramatic improvements in the network infrastructure.
Despite all of its benefits, abstraction has not yet seen wide use in the data layer. In what looks like the dark ages compared to the presentation layer, programmers must often “hard code” to specific database schemas, data stores, and even network locations. They must also often use database-specific access control mechanisms and tokens.
This medieval behavior is primarily due to one of the three S's: speed. Generally speaking, the more abstract an architecture, the more processing cycles required. Given the premium that many architects place on database performance, they have been highly reluctant to employ any technologies which might compromise performance.
However, as Moore's Law continues its steady advance, performance concerns are becoming less pronounced, and as a result architects are increasingly willing to consider "expensive" technologies such as abstraction, especially if they can help address data consistency and duplication problems.
The Many Faces of Abstraction
How exactly can abstraction solve these problems? It solves them by applying the principles of abstraction in several key areas including:
1. Security Abstraction: To preserve security and speed, database access has traditionally been carefully regulated. Database administrators typically “hard code” access control provisions by tying them to specific applications and/or users. Using abstraction, access control can be centralized and managed in-between the data layer and the application layer. This mediated access control frees programmers and database administrators from having to worry about coordinating with each other. It also provides for centralized management of data privacy and security issues.
2. Schema Abstraction: Rather than having programmers hard code to database schemas associated with specific databases, abstraction technologies enable them to code to virtual schemas that sit between the application and database layers. These virtual schemas may map to multiple tables in multiple different databases, but the application programmer remains blissfully unaware of the details. Some virtual schemas also theoretically have the advantage of being infinitely extensible, thereby allowing programmers to easily modify their data model without having to redo their database schemas.
3. Query/Update Abstraction: Once security and schemas have been abstracted it is possible to bring abstraction down to the level of individual queries and updates. Today queries and updates must be directed at specific databases and they must often have knowledge of how that data is stored and indexed within each database. Using abstraction to pre-process queries as they pass from the application layer to the data layer, it is possible for applications to generate federated or composite queries/updates. While applications view these composite queries/updates as a single request, they may in fact require multiple operations in multiple databases. For example, a single query to retrieve a list of a customer’s last 10 purchases may be broken down into 3 separate queries: one to a customer database, one to an orders database and one to a shipping database.
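The customer-purchases example in point 3 can be sketched concretely. Plain dictionaries stand in for the three back-end databases, and the federation function and all its names are purely illustrative:

```python
# Toy sketch of query federation: the application issues one logical query;
# a federation layer splits it into three back-end queries and stitches the
# results together.

customer_db = {"42": {"name": "Acme Corp"}}
orders_db = {"42": [{"order_id": 1, "item": "widget"},
                    {"order_id": 2, "item": "gadget"}]}
shipping_db = {1: "delivered", 2: "in transit"}

def recent_purchases(customer_id: str, limit: int = 10) -> dict:
    """One composite query, executed as three back-end queries."""
    customer = customer_db[customer_id]                    # query 1: customers
    orders = [dict(o) for o in orders_db[customer_id][-limit:]]  # query 2: orders
    for order in orders:                                   # query 3: shipping
        order["status"] = shipping_db[order["order_id"]]
    return {"customer": customer["name"], "orders": orders}

result = recent_purchases("42")
print(result["customer"], len(result["orders"]))  # Acme Corp 2
```

The application sees a single `recent_purchases` request; the layer underneath decides which databases actually get hit.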
The Data Abstraction Layer
With security, schemas, and queries abstracted, what starts to develop is a true data abstraction layer. This layer sits between the data layer and the application layer and decouples them once and for all, freeing programmers from having to worry about the intimate details of databases and freeing database administrators from maintaining hundreds of bilateral relationships with individual applications.
With this layer fully in place, the need for complicated EAI infrastructures starts to decline dramatically. Rather than replicating master databases through elaborate “pumps” and “busses”, these master databases are simply allowed to stand on their own. Programmers creating new data models and schemas select existing data from a library of abstracted elements. Queries/updates are pre-processed at the data abstraction layer which determines access privileges and then federates the request across the appropriate databases.
Data Servers: Infrastructure’s Next Big Thing?
With so much work to be done at the data abstraction layer, the emergence of a whole new class of infrastructure software, called Data Servers, seems distinctly possible. Similar to the role application servers play in the application layer, Data Servers manage all of the abstractions and interfaces between the actual resources in the data layer and a set of generic APIs/standards for accessing them. In this way, data servers virtually create the ever-elusive "single master database". From the programmer's perspective this database appears to have unified access control and schema design, but it still allows the actual data layer to be highly optimized in terms of resource allocation, physical partitioning, and maintenance.
The promise is that with data servers in place, there will be little if any rationale for replicating data across an organization as all databases can be accessed from all applications. By reducing the need for replication, data servers will not only reduce the need for expensive EAI infrastructures but they will reduce the actual duplication of data. Reducing the duplication of data will naturally lead to reduced problems with data consistency.
Today however, the promise of data servers remains just that, a promise. There remain a number of very tough challenges to overcome before Data Servers truly can be the “one database to rule them all”. Just a couple of these challenges include:
1. Distributed Two-Phase Commits: Complex transactions are typically consummated via a "two-phase commit" process that ensures the ACID properties of a transaction are not compromised. While simply querying or reading a database does not typically require a two-phase commit, writing to one typically does. Data servers theoretically need to be able to break up a write into several smaller writes; in essence, they need to be able to distribute a transaction across multiple databases while still being able to ensure a two-phase commit. There is general agreement in the computer science world that, right now at least, it is almost impossible to consummate a distributed two-phase commit with absolute certainty. Some start-ups are developing workarounds that cheat by only guaranteeing a two-phase commit with one database and letting the others fend for themselves, but this kind of compromise will not be acceptable over the long term.
2. Semantic Schema Mapping: Having truly abstracted schemas that enable programmers to reuse existing data elements and map them into their new schemas sounds great in theory, but it is very difficult to pull off in the real world where two programmers can look at the same data and easily come up with totally different definitions for it. Past attempts at similar programming reuse and standardization, such as object libraries, have had very poor results. To ensure that data is not needlessly replicated, technology that incorporates semantic analysis as well as intelligent pattern recognition will be needed to ensure that programmers do not unwittingly create a second customer database simply because they were unaware that one already existed.
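The two-phase commit protocol behind challenge 1 can be sketched in a few lines. This toy version (no durability, no network, no failure recovery, all classes invented here) shows only the vote-then-commit structure a data server would have to coordinate across databases:

```python
# Toy two-phase commit: a coordinator asks every participant database to
# prepare (phase 1, the vote), then commits only if all vote yes.

class Participant:
    def __init__(self, name, will_prepare=True):
        self.name, self.will_prepare = name, will_prepare
        self.state = "idle"

    def prepare(self):               # phase 1: vote yes/no
        self.state = "prepared" if self.will_prepare else "aborted"
        return self.will_prepare

    def commit(self):                # phase 2a: everyone voted yes
        self.state = "committed"

    def rollback(self):              # phase 2b: someone voted no
        self.state = "aborted"

def two_phase_commit(participants):
    if all(p.prepare() for p in participants):
        for p in participants:
            p.commit()
        return True
    for p in participants:
        p.rollback()
    return False

print(two_phase_commit([Participant("customers"), Participant("orders")]))   # True
print(two_phase_commit([Participant("a"), Participant("b", False)]))         # False
```

The hard part the text describes is exactly what this sketch omits: in a real distributed system, messages can be lost and the coordinator can crash between the two phases, which is why an absolutely certain distributed commit is so elusive.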
Despite these potential problems and many others, the race to build the data abstraction layer is definitely "on". Led by a fleet of nimble start-ups, companies are moving quickly to develop different pieces of the data abstraction layer. For example, a whole class of companies such as Composite Software, Metamatrix, and Pantero are trying to build query/update federation engines that enable federated reads and writes to databases. On the schema abstraction front, not many companies have made dramatic progress, but companies such as Contivo are trying to create meta-data management systems which ultimately seek to enable the semantic integration of data schemas, while XML database companies such as Ipedo and Mark Logic continue to push forward the concept of infinitely extensible schemas.
Ultimately, the creation of a true Data Server will require a mix of technologies from a variety of companies. The market opportunity for whichever company successfully assembles all of the different pieces is likely to be enormous, though, perhaps equal to or larger than the application server market. This large market opportunity, combined with the continued data management pains of companies around the world, suggests that the vision of the universal Data Server may become a reality sooner than many people think, which will teach us once again to never underestimate the power of abstraction.
RSS: A Big Success In Danger of Failure
There's a lot of hoopla out there these days about RSS. RSS is an XML-based standard for summarizing and ultimately syndicating web-site content. Adoption and usage of RSS has taken off in the past few years, leading some to suggest that it will play a central role in transforming the web from a random collection of websites into a paradise of personalized data streams. However, the seeds of RSS's impending failure are being sown by its very success, and only some serious improvements in the standard will save it from a premature death.
The roots of RSS go all the way back to the infamous "push" revolution of late 1996/early 1997. At that point in time, PointCast captured the technology world's imagination with a vision of the web in which relevant, personalized content would be "pushed" to end users, freeing them from the drudgery of actually having to visit individual websites. The revolution reached its apex in February of 1997 when Wired Magazine published a "Push" cover story in which it dramatically declared the web dead and "push" the heir apparent. Soon technology heavyweights such as Microsoft were pushing their own "push" platforms, and for a brief moment in time the "push revolution" actually looked like it might happen. Then, almost as quickly as it took off, the push revolution imploded. There doesn't appear to have been one single cause of the implosion (outside of Wired's endorsement); some say it was the inability to agree on standards, while others finger clumsy and proprietary "push" software, but whatever the reasons, "push" turned out to be a big yawn for most consumers. Like any other fad, they toyed with it for a few months and then moved on to the next big thing. Push was dead.
Or was it? For while push, as conceived by PointCast, Marimba, and Microsoft, had died an ugly (and, most would say, richly deserved) public death, the early seeds of a much different kind of push, one embodied by RSS, had been planted in the minds of its eventual creators. From the outset, RSS was far different from the original "push" platforms. Instead of a complicated proprietary software platform designed to capture revenue from content providers, RSS was just a simple text-based standard. In fact, from a technical perspective RSS was actually much more "pull" than "push" (RSS clients must poll sites to get the latest content updates), but from the end user's perspective the effect was basically the same. As an unfunded, collective effort, RSS lacked huge marketing and development budgets, and so, outside of a few passionate advocates, it remained relatively unknown for many years after its initial creation.
Recently though, RSS has emerged from its relative obscurity, thanks in large part to the growing popularity of RSS “readers” such as FeedDemon, NewsGator, and SharpReader. These readers allow users to subscribe to several RSS “feeds” at once, thereby consolidating information from around the web into one highly efficient, highly personalized, and easy-to-use interface. With its newfound popularity, proponents of RSS have begun hailing it as the foundation for a much more personalized and relevant web experience, one that will ultimately transform the web from an impenetrable clutter of passive websites into a constant, personalized stream of highly relevant data that can reach users no matter where they are or what device they are using.
Back to the Future?
Such rhetoric is reminiscent of the “push” craze, but this time it may have a bit more substance. The creators of RSS clearly learned a lot from push’s failures, and they have incorporated a number of features which suggest that RSS will not suffer the same fate. Unlike “push”, RSS is web friendly. It uses many of the same protocols and standards that power the web today, and it uses them in the classic REST-based “request/response” architecture that underpins the web. RSS is also an open standard that anyone is free to use in whatever way they see fit. This openness is directly responsible for the large crop of diverse RSS readers and the growing base of RSS-friendly web sites and applications. Thus, by embracing the web instead of attempting to replace it, RSS has been able to leverage the web to help spur its own adoption.
One measure of RSS’s success is the number of RSS-compliant feeds, or channels, available on the web. At Syndic8.com, a large aggregator of RSS feeds, the total number of feeds listed has grown over 2,000% in just 2.5 years, from about 2,500 in the middle of 2001 to almost 53,000 in February of 2004. The growth rate also appears to be accelerating, as a record 7,326 feeds were added in January of 2004, 2X the previous monthly record.
A Victim of Its Own Success
The irony of RSS’s success though is that this same success may ultimately contribute to its failure. To understand why this might be the case, it helps to imagine the RSS community as a giant cable TV operator. From this perspective, RSS now has tens of thousands of channels and will probably have hundreds of thousands of channels by the end of the year. While some of the channels are branded, most are little-known blogs and websites. Now imagine that you want to tune into channels about, let’s say, Cricket. Sure, there will probably be a few channels with 100% of their content dedicated to Cricket, but most of the Cricket information will inevitably be spread out in bits and pieces across those hundreds of thousands of channels. Thus, in order to get all of the Cricket information, you will have to tune into hundreds, if not thousands, of channels and then try to filter out all the “noise”, the irrelevant programs that have nothing to do with Cricket. That’s a lot of channel surfing!
The problem is only going to get worse. Each day as the number of RSS channels grows, the “noise” created by these different channels (especially by individual blogs which often have lots of small posts on widely disparate topics) also grows, making it more and more difficult for users to actually realize the “personalized” promise of RSS. After all, what’s the point of sifting through thousands of articles with your reader just to find the ten that interest you? You might as well just go back to visiting individual web sites.
Searching In Vain
What RSS desperately needs are enhancements that will allow users to take advantage of the breadth of RSS feeds without being buried in irrelevant information. One potential solution is to apply search technologies, such as keyword filters, to incoming articles (as pubsub.com is doing). This approach has two main problems: 1) The majority of RSS feeds include just short summaries, not the entire article, which means that 95% of the content can’t even be indexed. 2) While keyword filters can reduce the number of irrelevant articles, they will still become overwhelmed given a sufficiently large number of feeds. This “information overload” problem is not unique to RSS but is one of the primary problems of the search industry, where the dirty secret is that the quality of search results generally declines as the number of documents searched grows.
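Both weaknesses are easy to see in a minimal keyword filter. The sketch below (articles and keywords are invented; this is not pubsub.com’s actual implementation) matches only against the title and summary, because with most feeds that is all the publisher ships, and it matches literal substrings, so an article about the Ashes that never says the word “cricket” slips right past a “cricket” filter.

```python
def keyword_filter(articles, keywords):
    """Return titles of articles whose text mentions any keyword.

    `articles` is a list of (title, summary) pairs; with most RSS
    feeds the short summary is all we have to match against.
    """
    keywords = [k.lower() for k in keywords]
    hits = []
    for title, summary in articles:
        text = (title + " " + summary).lower()
        if any(k in text for k in keywords):
            hits.append(title)
    return hits

articles = [
    ("Ashes preview", "England face Australia in the first Test"),
    ("Patriots win again", "New England rolls to another NFL victory"),
]

# Catches the Ashes article only because we happened to guess
# "ashes" as a keyword; filtering on "cricket" alone finds nothing.
print(keyword_filter(articles, ["cricket", "ashes"]))
```

Scale that brittleness up to a hundred thousand feeds and the filter either misses most of what you want or drowns you in false positives.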
Classification and Taxonomies to the Rescue
While search technology may not solve the “information overload” problem, its closely related cousins, classification and taxonomies, may have just what it takes. Classification technology uses advanced statistical models to automatically assign categories to content. These categories can be stored as metadata with the article. Taxonomy technology creates detailed tree structures that establish the hierarchical relationships between different categories. A venerable example of these two technologies working together is Yahoo!’s website directory. Here Yahoo has created a taxonomy, or hierarchical list of categories, of Internet sites. Yahoo has then used classification technology to assign each web site to one or more categories within the taxonomy. With the help of these two technologies, a user can sort through millions of Internet sites to find just those websites that deal with, say, Cricket, in just a couple of clicks.
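The interplay of the two technologies can be sketched in a few lines. Below, a toy Yahoo-style taxonomy is a tree of categories (each node names its parent), and a stubbed-out set of classifier assignments maps articles to categories; browsing a node returns everything filed at or beneath it. All the categories and article titles here are invented.

```python
# A toy taxonomy: each category points at its parent (None = root).
PARENT = {
    "Cricket": "Sports",
    "Football": "Sports",
    "Sports": "Recreation",
    "Recreation": None,
}

# What a statistical classifier might have produced for each article.
ASSIGNMENTS = {
    "Ashes preview": ["Cricket"],
    "Premiership roundup": ["Football"],
    "Hiking in Wales": ["Recreation"],
}

def in_subtree(category, root):
    """True if `category` equals `root` or sits anywhere beneath it."""
    while category is not None:
        if category == root:
            return True
        category = PARENT.get(category)
    return False

def browse(root):
    """Drill into one node of the taxonomy, directory-style."""
    return sorted(title for title, cats in ASSIGNMENTS.items()
                  if any(in_subtree(c, root) for c in cats))

print(browse("Sports"))  # cricket and football articles, not hiking
```

Browsing “Sports” finds the Cricket and Football articles because the tree relates those categories; a flat keyword match could not make that connection.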
It’s easy to see how RSS could benefit from the same technology. Assigning articles to categories and associating them with taxonomies would allow users to subscribe to “meta-feeds” that are based on categories of interest, not specific sites. With such a system in place, users will be able to have their cake and eat it too: they will effectively be subscribing to all RSS channels at once, but thanks to the use of categories they will only see those pieces of information that are personally relevant. Bye-bye noise!
In fact, the authors of RSS anticipated the importance of categories and taxonomies early on, and the standard actually supports including both category and taxonomy information within an RSS message. The good news, then, is that RSS is already “category and taxonomy ready”.
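Concretely, RSS 2.0 lets a publisher attach any number of `<category>` elements to an item, with an optional `domain` attribute naming the taxonomy the category comes from. The sketch below parses that metadata out of a made-up item (the taxonomy URL is invented for illustration):

```python
import xml.etree.ElementTree as ET

# RSS 2.0 allows multiple <category> elements per item; the optional
# `domain` attribute identifies the taxonomy the value belongs to.
ITEM = """<item>
  <title>Ashes preview</title>
  <category domain="http://example.com/taxonomy">Sports/Cricket</category>
  <category>Commentary</category>
</item>"""

def categories(item_xml):
    """Return (taxonomy, category) pairs for one RSS item."""
    item = ET.fromstring(item_xml)
    return [(c.get("domain"), c.text) for c in item.findall("category")]

print(categories(ITEM))
# [('http://example.com/taxonomy', 'Sports/Cricket'), (None, 'Commentary')]
```

The plumbing is all there; what the standard does not say is who decides which categories and which taxonomy to use, which is exactly the catch discussed next.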
What Do You Really Mean?
But there’s a catch. Even though RSS supports the inclusion of categories and taxonomies, there’s no standard for how to determine what category an article should be in or which taxonomy to use. Thus there’s no guarantee that two sites with very similar articles will categorize them the same way or use the same taxonomy. This raises the very real prospect that, for example, the “Football” category will contain a jumbled group of articles covering both the New England Patriots and Manchester United. Such a situation leads us back to an environment filled with “noise”, leaving us no better off than when we started.
The theoretical solution to this problem is to get everyone in a room and agree on a common way to establish categories and on a universal taxonomy. Unfortunately, despite the best efforts of academics around the world, this has so far proven impossible. Another idea might be to try and figure out a way to map relationships between different concepts and taxonomies and then provide some kind of secret decoder ring that enables computers to infer how everything is interrelated. This is basically what the Semantic Web movement is trying to do. It sounds great, but it will likely be a long time before the Semantic Web is perfected, and everyone will easily lose patience with RSS before then. (There is actually a big debate within the RSS community over how Semantic Web-centric RSS should be.)
Meta-Directories And Meta-Feeds
The practical solution will likely be to create a series of meta-directories that collect RSS feeds and then apply their own classification tools and taxonomies to those feeds. These intermediaries would then either publish new “meta-feeds” based on particular categories or they would return the category and taxonomy meta-data to the original publisher which would then incorporate the metadata into their own feeds.
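A meta-directory’s core loop is straightforward to sketch: pull items from many source feeds, run them through its own classifier, and republish one meta-feed per category. In the toy version below the feeds, titles, and the classifier itself are all stand-ins; a real intermediary would use statistical classification against a maintained taxonomy.

```python
# Invented source feeds: one dedicated cricket site, one mixed blog.
FEEDS = {
    "cricinfo-ish": ["Ashes preview", "Pitch report"],
    "mixed-blog": ["Ashes thoughts", "My cat", "Tax tips"],
}

def classify(title):
    """Stand-in for a real statistical classifier."""
    text = title.lower()
    return "Cricket" if ("ashes" in text or "pitch" in text) else "Other"

def build_meta_feeds(feeds):
    """Group every item from every source feed under its category."""
    meta = {}
    for source, titles in feeds.items():
        for title in titles:
            meta.setdefault(classify(title), []).append((source, title))
    return meta

# The "Cricket" meta-feed now spans both sources, noise stripped out.
print(sorted(build_meta_feeds(FEEDS)["Cricket"]))
```

Because the intermediary applies one classifier and one taxonomy across every source, subscribers to the “Cricket” meta-feed get the cricket posts from the mixed blog without the cat pictures and tax tips.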
There actually is strong precedent for such intermediaries. In the publishing world, major information services like Reuters and Thomson have divisions that aggregate information from disparate sources, classify the information, and then resell those classified news feeds. There are also traditional syndicators, such as United Media, who collect content and then redistribute it to other publications. In addition to these established intermediaries, some RSS-focused start-ups such as Syndic8 and pubsub.com also look poised to fill these roles should they choose to do so.
Even if these meta-directories are created, it’s not clear that the RSS community will embrace them, as they introduce a centralized intermediary into an otherwise highly decentralized and simple system. However, it is clear that without the use of meta-directories and their standardized classifications and taxonomies, the RSS community is in danger of collapsing under the weight of its own success and becoming the “push” of 2004. Let’s hope they learned from the mistakes of their forefathers.
EMC + Documentum = War for Control of Unstructured Data
One of the most interesting recent acquisitions in the software space was EMC’s purchase of Documentum. Not because it was a particularly large acquisition in terms of dollar size or premium paid but because of the strategic implications it has for much of the software industry, especially for companies in the content management, storage software, and database markets.
Documentum isn’t the only acquisition EMC has made recently. It also acquired storage software maker Legato Systems and virtualization leader VMware. However, both of these acquisitions can be seen as incremental expansions of EMC’s existing focus on storage and storage management (and probably a competitive response to some of the moves storage players like Veritas have been making), whereas the Documentum acquisition represents a major leap “up the stack”, right into the midst of the classic enterprise software space.
By boldly jumping into the enterprise software space, EMC appears to be, as a fighter pilot might say, “going vertical”. They are making a bet that customers will want to buy the “whole enchilada” from one vendor including not just platters and storage management software, but high-level content management and work flow software as well. Indeed, one can reasonably expect that the logical extension of this strategy will be a series of vertical solutions targeted at specific applications such as claims processing, image management, content publishing, e-mail management, etc.
By providing the entire solution (no doubt delivered by its services division), EMC should theoretically be able to improve margins by focusing its customers on the value of the entire bundled solution as opposed to simply the cost/gigabyte of its storage products.
Even more important than this solutions focus, though, is the fact that EMC is trying to stake a claim to the entire unstructured data management space. EMC’s drive to do this has no doubt been influenced by its customers, who are increasingly buying additional storage not to supplement existing databases or information warehouses, but to store and manage unstructured data such as e-mails, PowerPoint presentations, and web pages.
That EMC can stake a claim to the unstructured data management space without alienating some of its biggest ISV partners (the database and warehouse vendors) has much to do with the fact that traditional RDBMS vendors have been surprisingly reluctant to make major commitments to the unstructured data management space. Many of these players, led by Oracle, continue to hold on to the outdated belief that all of an enterprise’s information will be managed by RDBMSs and therefore they have made few attempts to expand into unstructured data management. This abdication has in turn opened the door for EMC to make a move without causing massive near term channel problems.
That’s not to say that EMC’s move into the unstructured data management space won’t ruffle more than a few feathers. Those that will feel the brunt of this entry are the remaining content management players such as FileNet, Interwoven, Stellent, and Vignette. These firms must now contend with a very large, aggressive competitor selling hardware/software bundles. In addition, EMC’s traditional storage software competitors must now consider whether or not they will respond by making their own forays into the unstructured data management space; Veritas in particular will have some difficult decisions to make. Finally, by letting such a large, aggressive company as EMC into their back yard, the traditional RDBMS players are going to have to decide whether to continue to hold on to their anti-file-system views or to respond by delivering their own unstructured data management solutions.
Caught in the crossfire between all of these behemoths will be the existing unstructured data players in content management, search, categorization, taxonomy, and workflow. These relatively small players (many of them still start-ups) will have to decide whether it is better to sell out to one of the big players moving into their space or to soldier on and attempt to carve out a defensible niche. In this sense, EMC’s acquisition of Documentum represents just the first shot by a major player in what is likely to be a long conflict for control of the unstructured data management space.