Persistent Search: Search’s Next Big Battleground
What do you get when you marry ping servers and RSS with stored queries? A whole new type of search that is destined to become the search industry’s next big battleground: Persistent Search. While Persistent Search presents search companies with difficult new technical challenges and potentially higher infrastructure costs, it also gives them powerful mechanisms for building much stronger user relationships which may increase the value of their advertising services.
The Search That Never Stops
Simply put, Persistent Search allows users to enter a search query just once and then receive constant, near real-time, automatic updates whenever new content that meets their search criteria is published on the web. For example, let’s say you are a stock trader and you want to know whenever one of the stocks in your portfolio is mentioned on the web. By using a persistent search query, you can be assured of receiving a real-time notification whenever one of your stocks is mentioned. Or perhaps you are a teenager who is a rabid fan of a rock group. Wouldn’t it be nice to have a constant stream of updates on band gossip, upcoming concerts, and new albums flowing to your mobile phone? Or maybe you are just looking to rent the perfect apartment or buy a specific antique. Wouldn’t it be nice to be notified as soon as new items roughly matching your criteria were listed on the web, so that you could respond before someone else beat you to the punch? Persistent Search makes all of this possible for end users with very little incremental effort.
Something Old, Something New
While the technical infrastructure required for Persistent Search services leverages existing search technology, there are several new elements that must be added to existing technology to make Persistent Search a reality. These elements include:
- Ping Servers: Most blogs and an increasing number of other sites now send special “pings” to so-called “Ping Servers” every time they publish new content. The ping servers do things such as cue crawlers at a search engine to re-index a site or provide a summarized list of recently published information to other web sites. Because ping servers are the first to know about newly published content, they are critical to enabling the real-time nature of Persistent Search.
- RSS: RSS feeds can be used both to feed raw information into Persistent Search platforms (in a similar fashion to what GoogleBase does) and to take processed queries out. RSS is a polling-based mechanism, so it does not provide true real-time notification, but it is good enough in most cases.
- Stored Queries: Stored queries are simply search queries that are “saved” for future use. Ideally, the stored query runs constantly in the background and flags any new piece of content that meets the search criteria. While this is a simple concept, it presents some very difficult technical challenges. The easiest way for a search engine to implement a stored query would be to execute it against its existing index at some regular interval, say once an hour. However, executing each unique stored query 24 times a day could become very expensive if Persistent Search starts to take off. One could easily imagine search companies in the near future executing billions of incremental stored queries an hour. Processing these added queries will take lots of extra resources, but will not generate the same amount of revenue as traditional ad-hoc search queries because stored queries will often return no new results. One alternative would be for search companies to use the “stream database” query techniques pioneered by start-ups such as StreamBase. These techniques would allow them to query new content as it flows into their index, not only reducing overall query load but also improving latency. However, changing their query approach is a huge step for existing search companies and therefore one that is unlikely to be undertaken. A more likely approach might be to use special algorithms to combine stored queries into “master queries,” reducing the number of queries that need to be executed, and then use simple post-query filters to “personalize” the results. Given their critical importance to the overall quality of Persistent Search, the design of “stored query architectures” is likely to become one of the key technical battlegrounds for search companies, much the way query result relevancy has been for the past few years.
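To make the stored-query challenge concrete, here is a minimal sketch of the “master query” idea in Python: an inverted index over the queries themselves, so that each newly published document triggers only the stored queries it could possibly satisfy, rather than every query being re-run against the whole index each hour. Everything here (the class, field names, and keyword-only matching) is invented for illustration and does not reflect any search company’s actual architecture:

```python
from collections import defaultdict

class StoredQueryIndex:
    """Toy 'reverse search': index the stored queries, then match each
    newly published document against them as it arrives."""

    def __init__(self):
        self.queries = {}                        # query_id -> required terms
        self.term_to_queries = defaultdict(set)  # term -> candidate query ids

    def add_query(self, query_id, terms):
        required = {t.lower() for t in terms}
        self.queries[query_id] = required
        for t in required:
            self.term_to_queries[t].add(query_id)

    def match(self, document_text):
        doc_terms = set(document_text.lower().split())
        # Only queries sharing at least one term with the document are
        # candidates -- this is the "master query" consolidation step.
        candidates = set()
        for t in doc_terms:
            candidates |= self.term_to_queries.get(t, set())
        # A simple post-query filter then "personalizes" the results:
        # a candidate fires only if all of its terms appear.
        return sorted(q for q in candidates if self.queries[q] <= doc_terms)
```

With a structure like this, a stream of incoming blog posts fires only the handful of relevant stored queries per post, instead of billions of mostly empty incremental queries per hour.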
Once these three pieces are put together, search companies will be in position to provide rich Persistent Search services. The results of those services will be distributed to end users via e-mail, RSS, IM, SMS, or some pub-sub standard, depending on their preferences and priorities.
The Business of Persistent Search
From a business and competitive perspective, Persistent Search has a number of very attractive aspects relative to traditional ad-hoc queries. Traditional ad-hoc search queries tend to result in very tenuous user relationships, with each new query theoretically a competitive “jump ball”. Indeed, the history of search companies, with no fewer than 5 separate search “leaders” in 10 years, suggests that search users are not very loyal.
Persistent Search presents search companies with the opportunity to build rich, persistent relationships with their users. The search engine that captures a user’s persistent searches will not only have regular, automatic exposure to that user, but will also be able to build a much better understanding of that user’s unique needs and interests, which should theoretically enable it to sell more relevant ads and services at higher prices. It will also stand a much better chance of capturing all or most of that user’s ad-hoc queries because it will already be in regular contact with the user.
It is this opportunity to build a long-term, rich relationship directly with a uniquely identifiable consumer that will make Persistent Search such an important battleground between the major search companies. Persistent Search may be especially important to Google and Yahoo as they attempt to fight Microsoft’s efforts to embed MSN Search into Windows Vista.
It should also be noted that enterprise-based Persistent Search offers corporations the opportunity to improve both internal and external communications. For example, it’s not hard to imagine the major media companies offering persistent search services or “channels” to consumers for their favorite actor, author, or singer.
The State of Play
As it stands, Persistent Search is in its infancy. The world leader in persistent search is most likely the US government, specifically the NSA; however, the extent of its capabilities appears to be a closely guarded secret. Some of the commercial players in the space include:
- Pub-Sub is in many ways the commercial pioneer of the space. It operates one of the largest ping servers and has been offering customized keyword-based “real time” persistent searches for some time. Some start-ups are even building vertical persistent search services on top of Pub-Sub’s infrastructure. However, Pub-Sub lacks the infrastructure and resources of a complete search engine, as well as the consumer awareness and multi-channel distribution capabilities needed to reach critical consumer mass.
- Google itself offers Google Alerts, a service that e-mails users the results of stored queries. However, this service relies on Google’s general index, so it can take days for new content to appear. It also does not appear to directly incorporate ping servers and does not offer any distribution mechanisms beyond e-mail. In addition, as a beta service, it’s not clear that it is capable of scaling efficiently should demand take off.
- Real Time Matrix is a start-up focused on aggregating RSS feeds and re-broadcasting them based on user preferences. The net effect of their technology is to deliver real-time Persistent Search-based RSS feeds to consumers.
- Technorati, a popular blog search engine, allows users to create “watchlists” of persistent searches. However Technorati limits its index to blogs and so does not offer a comprehensive search service.
- Yahoo and MSN offer their own alert services, and while these services provide very good distribution options (e-mail, IM, SMS), they do not integrate with their search engines; they just offer canned updates for things such as news stories and the weather.
- WebSite-Watcher and Trackengine offer services that track specific sites and/or RSS feeds for changes, but they do not allow fine-grained free text queries and focus only at the site level.
Despite this activity, no one has yet put together an end-to-end Persistent Search offering that enables consumer-friendly, comprehensive, real-time, automatic updates across multiple distribution channels at a viable cost. That said, the opportunity is clear and the competitive pressures are real, so I expect to see rapid progress towards this goal in the near future. It will be interesting to see how it plays out.
Next Generation Search: Entities, Categories, and Pings
The search industry has been dominated for the last 6 years or so by one significant technology: Google’s Page Rank. Page Rank basically uses popularity as a proxy for relevancy, which, if Google’s $100BN market cap is any guide, has proven to be a valuable approximation. However, like many technologies, Page Rank’s hegemony is gradually wearing thin thanks to the fact that it is easily gamed and inherently limited by its derived and agnostic nature. With Page Rank’s utility declining thanks to SEO companies and “site spam,” the natural question becomes what, if any, new technologies will emerge to either supplement or even replace it.
To that end, a couple of recent entrants into the search market give strong hints as to how search technology will evolve in the near future. The good news is that this evolution will significantly improve both the relevance and utility of search. The bad news is that it is not clear which company will be the main beneficiary of this evolution.
The first new entrant emblematic of this evolution is Vast.com, which officially launched its public beta yesterday. Vast is a multi-category vertical search engine with its first three categories being automobiles, jobs, and personals. Vast is similar in some respects to other category-specific vertical search players, such as Simplyhired and Trulia; however, as its name clearly suggests, it has more expansive ambitions. What makes Vast particularly noteworthy is that one of the key ingredients in its secret sauce is a very advanced entity extraction engine. Entity extraction is the process of “mining” unstructured data for key attributes (such as the model of a car or the salary level of a job), which are then separately indexed. Entity extraction is much more sophisticated than screen scraping because it uses semantic inferences, not pre-determined static “maps”, to determine what entities are present in a given piece of text.
By using the same core entity extraction engine across multiple verticals, Vast has in essence created a “horizontal” vertical search platform that is readily extensible to new verticals and highly efficient in that it only needs to crawl a given site once to extract multiple types of entities.
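For illustration, the core idea of entity extraction — pulling typed attributes out of free text and indexing them separately — can be sketched with a few toy patterns. The patterns and attribute names below are invented for this example; a real engine like Vast’s uses semantic inference rather than static rules like these:

```python
import re

# Toy patterns only -- stand-ins for a real semantic extraction engine.
YEAR = re.compile(r"\b(?:19|20)\d{2}\b")
PRICE = re.compile(r"\$([\d,]+)")
KNOWN_MAKES = {"acura", "honda", "toyota", "ford"}

def extract_entities(listing: str) -> dict:
    """Pull typed attributes out of a free-text classified listing."""
    entities = {}
    if (m := YEAR.search(listing)):
        entities["year"] = int(m.group(0))
    if (m := PRICE.search(listing)):
        entities["price"] = int(m.group(1).replace(",", ""))
    for word in listing.lower().split():
        word = word.strip(",.")
        if word in KNOWN_MAKES:
            entities["make"] = word
            break
    return entities
```

Once every listing passes through a step like this, the attributes can be indexed on their own, which is what enables the structured queries and price statistics described below.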
What does this technology do for search users? Well, for one thing, it makes it a hell of a lot easier to find a used car for sale on the web. Not only can a user instantly narrow their search to a specific make, model, and year, but thanks to its entity indices, Vast can also tell them what they are likely to pay for the new purchase. For example, if you are looking for a 2001-2002 Acura 3.2 TL, according to Vast, chances are you will pay around $18K for the privilege. If you’re looking to get a job as a computer programmer in the Bay Area, you’re probably looking at about an $80K salary, while the same job will only net you $62.5K in Des Moines, Iowa (although you will live like a king compared to the poor coder in Silicon Valley living in some half-million-dollar shack).
How does Vast know this? Because it has extracted all the entity information, indexed it, and done some very basic calculations. But Vast has really only scratched the surface of what it can do with this meta-data. For example, why have people search for a particular car when they can search by a particular budget, as in “Hey, I only have $20-25K to spend and want the most car for my money, so what’s out there that I can afford?” Ask a standard search engine that question and it will likely spew back a lot of random garbage. A site such as Vast, however, could very easily provide a highly organized matrix of all the different make, model, and year combinations one could afford on that budget without breaking a sweat. Even more intriguing, a site like Vast could just as easily start ranking its search results along different dimensions, such as “best value”. After all, with a few basic algorithms in place, Vast will be able to spot the clueless person in Oregon who is listing their new car way below market, because Vast knows what price everyone else with a similar car is asking. While the ultimate success of Vast is still an open question given that it faces a large field of both horizontal and vertical competitors, Vast clearly demonstrates the value and utility of adding robust entity extraction technologies to search and therefore provides a likely glimpse of search’s near-term evolution.
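Continuing the hypothetical, once asking prices and make/model/year attributes sit in an entity index, the budget-first query is little more than an aggregation (the field names here are assumed, not Vast’s actual schema):

```python
from collections import defaultdict

def affordable_matrix(listings, low, high):
    """Group every listing within the buyer's budget by
    (make, model, year) and report the average asking price."""
    buckets = defaultdict(list)
    for listing in listings:
        if low <= listing["price"] <= high:
            key = (listing["make"], listing["model"], listing["year"])
            buckets[key].append(listing["price"])
    return {combo: sum(prices) / len(prices)
            for combo, prices in buckets.items()}
```

A “best value” ranking then falls out almost for free: sort each listing by its discount to the average price of its own bucket, and the underpriced Oregon car floats to the top.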
The Browseable Web
Another new search site that provides a similar glimpse is Kosmix.com. Kosmix is focused on making search results more “browseable” by using categorization technology to determine the topics addressed by a web page. Once the topics are determined, the web page is then associated with specific nodes of a pre-defined taxonomy.
For example, go to Kosmix Health and search on a medical condition, say autism. To the left of the initial search results is a list of categories associated with autism. This taxonomy enables people not only to rapidly filter results for the particular type of information they are looking for, such as medical organizations specifically focused on autism, but also to easily browse through different types of content about the same subject.
Others have tried to do similar category-based search; however, Kosmix is the only company that has figured out how to build categorized search technology that performs acceptably at “internet scale,” and as such it represents a major advance in the field.
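Conceptually, this kind of categorization amounts to scoring a page against the nodes of a fixed taxonomy. A toy keyword version illustrates the output shape — the taxonomy and keywords below are made up, and a real system like Kosmix’s presumably uses statistical classifiers rather than keyword lists:

```python
# A made-up mini-taxonomy for illustration only.
TAXONOMY = {
    "Health/Conditions/Autism": {"autism", "asperger", "autistic"},
    "Health/Organizations": {"foundation", "society", "association"},
    "Health/Treatments": {"therapy", "medication", "intervention"},
}

def categorize(page_text, threshold=1):
    """Return the taxonomy nodes whose keyword sets the page matches."""
    words = set(page_text.lower().split())
    return sorted(node for node, keywords in TAXONOMY.items()
                  if len(words & keywords) >= threshold)
```

The returned node list is exactly what powers the left-hand category sidebar: each node a page lands in becomes a browseable filter for that search.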
We’ll Need A Search Running … It Has Already Begun
One last missing piece of the next generation search infrastructure that will likely enter the picture is ping servers. Ping servers are currently used mostly by blogs to notify search engines that new content is available to index. Ping servers thus greatly speed the rate at which new content is indexed. While traditionally these servers have been used by publishers to notify search engines, these servers are increasingly being used to notify end users that new content is available as well. Ping servers will become particularly powerful though when they combine persistent queries with the entity and categorization technologies discussed above.
Already today an end user can have a ping server, such as Pubsub, run persistent queries for them and then notify them, typically via RSS, of any newly published content that fits their query. Such persistent queries will get even more powerful though when they are processed through search engines with reliable entity extraction and categorization capabilities.
For example, if you are looking to buy a specific kind of used car, wouldn’t you like to know immediately when someone listed a car for sale that met your criteria? If you were a company, wouldn’t you like to know immediately if a current employee listed their resume on a job site? If you are a portfolio manager wouldn’t you like to know immediately that a blog just published a negative article on a stock you own? If you are a lung cancer drug developer, wouldn’t you like to know every time someone published a new academic study on lung cancer?
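A persistent query over extracted entities might then be a set of attribute constraints rather than keywords. Here is a sketch, with an invented convention of `_max`/`_min` suffixes for range bounds (none of this is a real product’s query format):

```python
def entity_query_matches(query, entities):
    """Check an extracted-entity record against a stored entity query.
    Keys ending in '_max'/'_min' are range bounds; others must match exactly."""
    for key, want in query.items():
        if key.endswith("_max"):
            if entities.get(key[:-4], float("inf")) > want:
                return False
        elif key.endswith("_min"):
            if entities.get(key[:-4], float("-inf")) < want:
                return False
        elif entities.get(key) != want:
            return False
    return True
```

A used-car alert like the one above then fires the moment a ping server announces a listing whose extracted entities satisfy, say, `{"make": "acura", "price_max": 20000}`.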
A New Foundation
Thus, ping servers combined with persistent queries filtered through entity and categorization engines will create a rich, multi-dimensional, real-time web search that actively and continually adds value for users while greatly limiting the ability of search engine optimization companies to cloud results with spam and other poor-quality content. This may not be the whole future of search, but given near-term trends it is likely to be at least a significant part of it.
Epilogue: Wither Google?
Whether or not Google can remain at the top of this rapidly evolving technical space is an open question. Indeed, few remember that we have arguably had no fewer than 5 web search leaders in just 13 years (e.g. WWW Virtual Library, EINet Galaxy, Yahoo, Alta Vista, Google). That said, Google has hired many of the world’s leading experts in areas such as entity extraction and categorization, so it will not lack the intellectual firepower to keep pace with start-ups such as Vast and Kosmix, and it clearly has the financial resources to acquire any particularly promising firm before it becomes too meddlesome (as Google did to Yahoo). So while the ultimate winner is as yet unknown, one thing is for certain: it will be an interesting race to watch.
Burnham’s Beat Reports Record Q4 Revenues
Silicon Valley, CA – (BLOGNESS WIRE) – Jan. 11, 2006
Burnham’s Beat today reported record results for its fourth quarter ended December 31, 2005. Revenues for Q4 2005 were $168.64, up 176% compared to $61.08 in Q4 2004 and up 27.3% sequentially vs. Q3 2005. Earnings before expenses, which management believes are the most cynical results we can think of, were also up 176%.
Commenting on the results, Bill Burnham, Chief Blogger of Burnham’s Beat, explains: “This quarter’s results continue to demonstrate that blogging is a complete waste of time. While we did not achieve our previously forecasted results of 100 billion page views and ‘Google-style cash, Baby!’, we remain hopeful that people forgot about those projections. There are several reasons for missing our projections, including an outage at our hosting provider in late Q4 which cost us at least $1.00, the continued poor quality of the writing on the site, high oil prices, several deals that slipped to next quarter, and uncertainty created by the war in Iraq.”
Page views were up 921% in Q4 2005 to 71,772 compared to 7,028 in Q4 2004. However advertising click-through rates declined from 0.78% in Q2 2004 to 0.23% in Q4 2005. In addition, revenue per click fell 56.3% to $0.49/click compared to Q4 2004’s $1.11/click. Commenting on these statistics Burnham added “We continue to believe that page view growth and advertising revenues have been adversely impacted by the switch to “full text” RSS feeds that we implemented in Q3, but we are too lazy to do anything about it. We added an additional search advertising partner in Q4 and were generally disappointed with the results. While revenue per click is higher at this partner, overall click through rates are much lower. In terms of our main advertising ‘partner’ we have seen a clear pattern throughout the year of them reducing the revenue share they pay to their blog-related ‘partners’. Apparently they aren’t making enough money as it is and need to stick it to the little man.”
Revenue per post was $9.92 in Q4 2005 compared to $6.11 in Q4 2004. “Revenue per post indicates that we could pretty much write about paint drying and search engines would still drive enough random traffic to our site to make a few bucks.”
Affiliate fee revenues were $33.91 in Q4 2005 up 963% vs. $3.19 in Q3 2005. Burnham notes “We launched our affiliate fee division in Q1 of 2005. This unit performed poorly until we decided in early Q4 to blatantly pander to affiliate revenues by using sensationalized rhetoric and better placement, a tactic which appears to have worked well.”
Pro-forma expenses were $44.85, up 67% in Q4 2005 vs. Q4 2004, primarily due to switching from the basic $8.95 hosting package on Typepad.com to a $14.95 “advanced user” package. Management has yet to really figure out how to use the new package, but it would be embarrassing to tell a fellow blogger that we were still using the “basic” package, as only newbies do that.
Readers are reminded that Burnham’s Beat’s financial results exclude all labor, connectivity, and capital expenses, all opportunity costs of actually doing something useful, and the considerable goodwill charges that result from constantly antagonizing people by badmouthing their company/industry/personal views. Including these expenses Burnham’s Beat would have reported earnings of approximately -$1,000,000,000 but management does not believe that these actual results fairly reflect the alternate reality in which we currently exist.
Burnham’s Beat is comfortable with its previous guidance of 100 billion page views and ‘Google-Style cash in 2006, Baby!’ and hopes that people remain forgetful.
Burnham’s Beat and Subsidiaries
Pro-Forma, Pro-Forma Preliminary Restated Unaudited Results
| | Q4 2005 | Q4 2004 |
P.S. Yes these are the actual numbers.
RSS and Google Base: Google Feeds Off The Web
There has been a lot of talk about Google Base today on the web and much of the reaction appears to be either muted or negative. The lack of enthusiasm seems to be driven by the fact that the GUI is pretty rudimentary and doesn't provide any real-time positive feedback (as Fred points out). But people that are turned off by the GUI should take into account that Google Base wasn't designed primarily for humans; it was designed for computers. In fact, I think if people were computers their reaction would be closer to jumping for joy than scratching their heads.
What's perhaps most interesting about the Google Base design is that it appears to have been designed from the ground up with RSS and XML at its center. One need look no further than the detailed XML Schema and extensive RSS 2.0 specification to realize that Google intends to build the world's largest RSS "reader," which in turn will become the world's largest XML database.
To facilitate this, I suspect that Google will soon announce a program whereby people can register their "Base compliant" RSS feeds with Google Base. Google will then poll these feeds regularly, just like any other RSS reader. Publishers can either create brand new Base-compliant feeds or, with a bit of XSLT/XML Schema of their own, simply transform their existing content into a Base-compliant feed. Indeed, I wouldn't be surprised if several software programs that do just that are available for download within a couple of months. Soon, every publisher on the planet will be able to have a highly automated, highly structured feed directly into Google Base.
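To make the idea concrete, here is a rough sketch of generating such a feed: a standard RSS 2.0 document whose items carry typed attributes in a dedicated XML namespace. The namespace URI and element names below are placeholders for illustration, not Google's published schema:

```python
import xml.etree.ElementTree as ET

G_NS = "http://example.com/base/ns/1.0"  # illustrative namespace URI

def build_base_feed(title, items):
    """Emit an RSS 2.0 feed whose items carry typed attributes
    (price, location, ...) in a separate namespace."""
    ET.register_namespace("g", G_NS)
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = title
    for it in items:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = it["title"]
        ET.SubElement(item, "link").text = it["link"]
        for attr, value in it.get("attributes", {}).items():
            # e.g. <g:price>18500</g:price> -- the structured payload
            ET.SubElement(item, f"{{{G_NS}}}{attr}").text = str(value)
    return ET.tostring(rss, encoding="unicode")
```

The point is that the feed stays a perfectly ordinary RSS document for any human-facing reader, while a machine consumer can pull the namespaced attributes straight into a structured database.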
Once the feed gets inside Google, the fun is just beginning. Most commentators have been underwhelmed by Google Base because they don't see the big deal of Google Base entries showing up as part of free text search. What these commentators miss is that Google isn't gathering all this structured data just so it can regurgitate it piecemeal via unstructured queries; it is gathering all this data so that it can build the world's largest XML database. With the database assembled, Google will be able to deliver a rich, structured experience that, as Michael Parekh sagely points out, is similar to what directory structures do. However, because Google Base will in fact be a giant XML database, it will be far more powerful than a structured directory. Not only will Google Base users be able to browse similar listings in a structured fashion, but they will also ultimately be able to do highly detailed, highly accurate queries.
In addition, it should not be lost on people that once Google assimilates all of these disparate feeds, it can combine them and then republish them in whatever fashion it wishes. Google Base will thus become the automated engine behind a whole range of other Google extensions (GoogleBay, GoogleJobs, GoogleDate), and it will also enable individual users to subscribe to a wide range of highly specific and highly customized meta-feeds. "Featured listings" will likely replace or complement AdWords in this implementation, but the click-through model will remain.
As for RSS, Google Base represents a kind of Confirmation. With Google's endorsement, RSS has now graduated from a rather obscure content syndication standard to the exalted status of the web's default standard for data integration. Google's endorsement should in turn push other competitors to adopt RSS as their data transport format of choice. This adoption will in turn force many of the infrastructure software vendors to enhance their products so that they can easily consume and produce RSS-based messages, which will further cement the standard. At its highest level, Google's adoption of RSS represents a further triumph of REST-based SOA architectures over the traditional RPC architectures being advanced by many software vendors. Once again, short and simple wins over long and complex.
In my next post I will talk about Google Base's impact on the "walled garden" listings sites. I'll give you a hint: it won't be pretty.
Feed Overload Syndrome: 5 Recommended Ways To Cure It
Looks like there has been a major outbreak of RSS inspired "Feed Overload Syndrome" and it's spreading faster than the Avian flu. First Fred Wilson admitted that he was fully infected and then both Jeff Nolan and Om Malik confessed to similar symptoms. At this rate Feed Overload Syndrome may soon become an Internet-wide pandemic. Perhaps the government will divert some of that $7.1BN to establish the CFC or the Centers for Feed Control.
Some may recall that in the past I have posted about the dangers of Feed Overload Syndrome and I admit to having one of the first confirmed cases (at least confirmed by me). While I am still dealing with Feed Overload Syndrome (you can never cure it, just control it), I have made some progress fighting it and with that in mind I would like to offer my Top 5 Ways To Combat Feed Overload Syndrome:
- Direct Multiple Feeds to the Same Folder.
Many RSS readers or plug-ins allow you to create user-specified folders. While the default option is usually one feed per folder, in most instances you can direct multiple feeds to the same folder. By consolidating multiple feeds into a single folder you can dramatically cut down on the overall clutter of your feed list. For example, I subscribe to about 6 poker-related feeds. All these feeds post to one folder called, creatively enough, "poker". I have done the same thing for a number of other subjects as well. You don't have to do this for all your feeds, but it is especially good for feeds that you read infrequently and/or that post infrequently.
- Subscribe to meta-feeds.
Meta-feeds are RSS feeds composed of posts from a number of individual RSS feeds. For example, I like the business intelligence feed provided by Technology Updates. This feed contains articles about business intelligence from a wide variety of other feeds. I also subscribe to keyword-based meta-feeds from sites such as Pubsub and Technorati. Meta-feeds are rapidly increasing in popularity and can now be found in places such as Yahoo!, where you can, for example, subscribe to all the news on a particular stock or all of the articles on your favorite football team. Once you subscribe to meta-feeds, make sure to eliminate any old feeds that are now covered by the meta-feed.
- Increase your publisher to distributor ratio.
A very coarse way to segregate feeds is as publishers or distributors. Publishers tend to publish relatively few posts that are longer than average and filled with original content. Distributors (also called linkers) tend to generate many posts a day and typically republish short excerpts of other people's posts with a short commentary of their own. Each has its place on the web, but you will find over time that as your feed list grows, the distributors will provide less value to you because you will already be subscribing directly to many of the same feeds that they tend to republish. Selectively eliminating just a few distributors/linkers can dramatically lower the number of posts you have to read each day.
- Regularly purge and organize your feed list.
You should review your feed list once a month with an eye towards removing old feeds you no longer want to read, consolidating existing feeds into shared folders, and substituting meta-feeds for primary feeds that you read infrequently or selectively.
- Support efforts to create true metafeed services.
As I have written about before, RSS is in danger of becoming its own worst enemy thanks in large part to the fundamental factors driving Feed Overload Syndrome. The best hope for controlling this affliction is to support the growth of both meta-feed and meta-tagging services. I personally do not believe that unstructured "democratic" tagging methodologies stand much of a chance, as tags without a standardized and consistent taxonomy are not much better than simple keyword-based meta-feeds. Creating meta-tags and meta-feeds that are logically consistent and easily integrated into a well-formed taxonomy will only be accomplished once the necessary intellectual horsepower and financial resources are focused on the problem. Interestingly enough, Google long ago hired many of the best minds in this area and has more than enough money, so it probably has the best shot of anyone at curing, or at least controlling, Feed Overload Syndrome.
Following these 5 steps is not guaranteed to cure Feed Overload Syndrome, but I guarantee it will start to control it.
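Under the hood, the meta-feeds recommended above are mostly a merge-and-dedup loop over their source feeds. A minimal sketch, with the entry field names (`link`, `published`) assumed rather than taken from any particular reader:

```python
def build_meta_feed(feeds):
    """Merge entries from several feeds into one meta-feed:
    drop duplicate links, then order newest first."""
    seen, merged = set(), []
    for feed in feeds:
        for entry in feed:
            if entry["link"] not in seen:  # same story syndicated twice
                seen.add(entry["link"])
                merged.append(entry)
    merged.sort(key=lambda e: e["published"], reverse=True)
    return merged
```

This is also why substituting a meta-feed for its sources helps with overload: the dedup step means you see each story once instead of once per subscribed feed.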
The Coming Blog Wars: Google vs. Yahoo
For Yahoo and Google, the Internet’s two search titans, blogs are rapidly becoming both an important distribution channel and a growing cost center. The battle to control this distribution channel, while at the same time reducing its costs, will intensify greatly this year and will most likely be characterized by some rapid-fire acquisitions within the “Blogosphere”.
It’s The Channel, Stupid
According to Technorati, the number of blogs on the web has grown from about 100,000 two years ago to over 6,500,000 today, with about 20,000 new blogs being added every day. Over at Pew Research, the latest study indicates that 27% of Internet users in the US, or 32 million people, are now reading blogs, up almost 150% in just one year.
Right now, Google owns the blog channel thanks in part to its acquisition of Blogger, but mostly to its self-serve Adsense platform that allows bloggers to easily add paid placement and search services to their sites. (I set up both services on this site in 30 minutes with no human help or interaction.) While Google doesn’t say just how much of its revenue it generates via blogs, with growth numbers like those above there’s no doubt that Google’s “blog-related” revenues are growing quite quickly.
While Yahoo is rumored to be building a competitive offering to Adsense, for now that offering is limited to serving only large sites, so Yahoo’s blog-related revenues are likely minuscule. However, Yahoo is clearly aware of the growing importance of blogs and knows that it must have a competitive response to Google’s Adsense platform.
If either player were able to control, or at least significantly influence, which paid placement services bloggers chose to incorporate into their sites, it would give them a substantial competitive advantage in their head-to-head competition and control over one of the fastest-growing channels on the web.
For example, not only would control allow Yahoo or Google to push their own paid placement and search services at the expense of the other, but it would also allow them to route other “affiliate” traffic through their own hubs and thereby take a piece of the action. Rather than having bloggers link directly to something like Amazon's Associates program, the bloggers would instead send their traffic to a “master” affiliate account at Google, one in which Google was able to negotiate a larger percentage cut due to its overall volume, or they might just send it to Froogle if that was a better deal. In such a case both the bloggers and Google win: Google gets a cut of affiliate revenues that it previously missed out on, and bloggers get a slightly higher revenue share thanks to Google’s much greater bargaining power.
A Costly Partnership
While integrating blogging more closely into their business models offers Google and Yahoo additional revenue opportunities, it also presents them with significant costs, mostly in the form of revenue-share payments that they must make to blogs. While they must make similar (and often higher) payments to traditional media partners, the payments to blogs are more costly to process (due to the large number of blogs) and much more susceptible to click-through fraud schemes. Controlling the cost of the channel is therefore likely to be almost as big a focus as increasing the revenues it produces.
While controlling fraud will be very important, it likely won’t be a source of competitive advantage, as both firms have similar incentives to control fraud. One can even imagine both firms partnering, directly or indirectly, to jointly fight fraud. That leaves reducing payments to blogs as the most obvious way to control costs; however, reducing payments will be difficult to achieve due to competitive pressures.
Blog Barter: Spend Money To Save Money
The best way to save costs may actually be to spend some money and acquire companies that currently offer services to bloggers. These services can then be bartered to blogs in return for a reduction in payments. For example, a blog that pays its hosting firm $20/month might be willing to barter $20 worth of its click-through revenues for hosting instead of writing a check. At the very least, Google and Yahoo might be able to buy such services for $18/month and resell them for $20 in click-through credits, thus saving themselves 10% cash in the process.
This math, plus burgeoning payments to bloggers, is likely to prompt both Google and Yahoo to make some rapid-fire acquisitions in the space. For Google, the acquisitions will be about protecting its lead and denying Yahoo the chance to become competitive. For Yahoo, they will be about quickly catching up to Google and potentially surpassing it. Of the two firms, Yahoo is perhaps in the best position to initiate this “blog barter” economy given that it has a broad range of existing subscription services that it can barter; however, it is also the further behind in terms of using blogs as a channel for its services.
The Hit List
While it’s tough to say with 100% accuracy which start-ups will be on the blog-inspired M&A “hit list” of Google and Yahoo, it is clear that such firms will do one of two things: they will either enhance control over the distribution channel or they will reduce its costs by enabling “blog-barter” transactions. Google struck the first blow in this M&A battle early on with the acquisition of Pyra Labs (creator of Blogger) in 2003, but more deals are undoubtedly on the horizon due to blogging’s explosive growth in 2004. A few of the most likely candidates for acquisition include:
- Six Apart: Creator of the popular TypePad blog authoring software and service (this blog is hosted on their service), Six Apart has long been rumored to be a potential acquisition target, most recently by Yahoo. With 6.5M users once its recently announced acquisition of LiveJournal is completed, Six Apart is clearly the biggest piece of the blog channel that is potentially up for grabs. While the odds-on bet is that Yahoo will acquire Six Apart, given that it has no blog platform to date, Google may well try to lock up the channel with a preemptive bid of its own, and it’s also possible that long shots such as Amazon and Microsoft may make their own offers.
- Burning Door: Operator of the increasingly popular Feedburner service that provides detailed usage statistics of RSS feeds, Feedburner is an ideal acquisition both as an incremental revenue generator and a “blog barter” service. On the revenue front, Feedburner offers a fantastic platform for “splicing” contextually relevant RSS advertising and affiliate offers into RSS feeds creating yet another advertising platform for the search players. On the blog-barter front, serious bloggers will easily part with a few “barter bucks” each month in return for the detailed usage and reporting statistics that Feedburner is able to provide.
- MySpace: Originally pitched as a social networking site, MySpace has melded social networking with blogging to the point where the site has become some kind of strange youth-centric amalgam of the two worlds. The risk is that MySpace is really just GeoCities 2, but it may turn out to be the best kind of platform for advertisers to reach the youth market in a contextual, non-obtrusive way. In addition, if MySpace can help its members earn a little pocket change through blogging, it may be able to solve the input-output asymmetry problem that I wrote about in an earlier post on social networking.
There are a bunch of blog aggregation sites, such as Bloglines, Technorati, and del.icio.us, that appear on the surface to be interesting acquisition opportunities, but it is unclear what these sites would really add to the existing capabilities of Yahoo and Google: they focus on end users, not bloggers, and Google and Yahoo already have plenty of end users.
As for VCs, this round of blog-inspired M&A will be a nervous game of musical chairs as quickly placed wagers either pay off or are permanently bypassed. Watching how this Great Blog Battle plays out will be entertaining indeed!
Saving RSS: Why Meta-feeds will triumph over Tags
It’s pretty clear that RSS has now become the de facto standard for web content syndication. Just take a look at the numbers. The total number of RSS feeds tracked by Syndic8.com has grown from about 2,500 in the middle of 2001, to 50,000 at the beginning of 2004, to 286,000 as of the middle of this month. That’s total growth of over 11,300% in just the past 3.5 years!
Feed Overload Syndrome
However, as I wrote at the beginning of last year, the very growth of RSS threatens to sow the seeds of its own failure by creating such a wealth of data sources that it becomes increasingly difficult for users to sift through all the “noise” to find the information that they actually need.
Just ask any avid RSS user how their use of RSS has evolved and they will likely tell you the same story: when they first discovered RSS it was great, because it allowed them to subscribe to relevant information “feeds” from all of their favorite sites and have that information automatically aggregated into one place (usually an RSS reader like SharpReader or NewsGator). However, as they began to add more and more feeds (typically newly discovered blogs), the number of posts they had to review started rising quickly, so much so that they often had hundreds, if not thousands, of unread posts sitting in their readers. Worse yet, many of these posts ended up being either irrelevant (especially random postings on personal blogs) or duplicative. Suffering from a serious case of “feed overload”, many of these users ultimately had to cut back on the number of feeds they subscribed to in order to reduce the amount of “noise” in their in-boxes and give themselves at least a fighting chance of skimming all of their unread posts each day.
The First Step: Recognizing That You Have a Problem
Many in the RSS community recognize that “Feed Overload Syndrome” is indeed becoming a big problem and have begun initiatives to address it.
Perhaps the most obvious way to address the problem is to create keyword-based searches that filter posts, with the results themselves syndicated as an RSS feed. This approach has several problems though. First, many sites syndicate only summaries of their posts, not the complete post, making it difficult to index the entire post. Second, keyword-based searches become less and less effective the more data you index, as the average query starts to return more and more results. Third, keywords often have multiple contexts, which in turn produces significant “noise” in the results. For example, a keyword search for “Chicago” would return information about the city, the music group, and the movie, among other things. That said, many “feed aggregation sites” such as Technorati and Bloglines currently offer keyword-based searches/feeds, and for most folks these are better than nothing. However, it’s pre-ordained that as the number of feeds increases, these keyword filtering techniques will prove less and less useful.
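To make the shortcoming concrete, here is a minimal sketch of keyword filtering over an RSS feed, using only the Python standard library. The sample feed and the “Chicago” query are invented for illustration; note how the purely lexical match can’t tell the movie apart from the city.

```python
# A minimal sketch of keyword filtering over an RSS feed (stdlib only).
# The feed XML below is hypothetical sample data.
import xml.etree.ElementTree as ET

SAMPLE_FEED = """<rss version="2.0"><channel>
  <title>Example Feed</title>
  <item><title>Chicago wins the Oscar</title>
        <description>The movie musical takes Best Picture.</description></item>
  <item><title>Windy City weather</title>
        <description>A cold front moves through Chicago today.</description></item>
  <item><title>New albums this week</title>
        <description>Releases from several bands.</description></item>
</channel></rss>"""

def keyword_filter(feed_xml, keyword):
    """Return (title, description) pairs for items mentioning the keyword.

    The core weakness: the match is purely lexical, so posts about the
    city, the band, and the movie would all come back for "Chicago"."""
    matches = []
    for item in ET.fromstring(feed_xml).iter("item"):
        title = item.findtext("title", "")
        desc = item.findtext("description", "")
        if keyword.lower() in (title + " " + desc).lower():
            matches.append((title, desc))
    return matches

for title, _ in keyword_filter(SAMPLE_FEED, "Chicago"):
    print(title)  # both the movie post and the weather post match
```

A service would syndicate `matches` back out as a new RSS feed, but the false positives travel with it.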
Tag, You’re Categorized
Realizing the shortcomings of keyword-based searching, many people are embracing the concept of “tagging”. Tagging is simply adding some basic metadata to an RSS post, usually just a simple keyword “tag”. For example, the RSS feed on this site effectively “tags” my posts by using the Dublin Core dc:subject element supported by the RSS standard. Using such keywords, feed aggregators (such as Technorati, PubSub, and del.icio.us) can sort posts into different categories, and subscribers can then subscribe to these categorized RSS feeds instead of the “raw” feeds from the sites themselves. Alternatively, RSS readers can sort the posts into user-created folders based on tags (although mine doesn’t offer this feature yet).
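As a rough sketch of how an aggregator might bucket posts by their dc:subject tags (the feed below is invented sample data), consider:

```python
# Sketch: grouping post titles by their dc:subject "tag" (stdlib only).
import xml.etree.ElementTree as ET
from collections import defaultdict

DC = {"dc": "http://purl.org/dc/elements/1.1/"}  # Dublin Core namespace

FEED = """<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/">
<channel>
  <item><title>World Cup preview</title><dc:subject>soccer</dc:subject></item>
  <item><title>Premier League recap</title><dc:subject>football</dc:subject></item>
  <item><title>Meta-feeds explained</title><dc:subject>RSS</dc:subject></item>
</channel></rss>"""

def group_by_tag(feed_xml):
    """Bucket post titles under their dc:subject tag."""
    groups = defaultdict(list)
    for item in ET.fromstring(feed_xml).iter("item"):
        tag = item.findtext("dc:subject", default="(untagged)", namespaces=DC)
        groups[tag].append(item.findtext("title"))
    return dict(groups)

# Note: "soccer" and "football" land in separate buckets even though the
# posts cover the same sport -- tagging's semantic inconsistency in action.
print(group_by_tag(FEED))
```

The mechanics are trivial; the hard part, as the next paragraphs argue, is that nothing forces two bloggers to pick the same tag.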
Tagging is a step in the right direction, but it is ultimately a fundamentally flawed approach to the issue. The problem at the core of tagging is the same problem that has bedeviled almost all efforts at collective categorization: semantics. In order to assign a tag to a post, one must make some inherently subjective determinations including: 1) what’s the subject matter of the post and 2) what topics or keywords best represent that subject matter. In the information retrieval world, this process is known as categorization. The problem with tagging is that there is no assurance that two people will assign the same tag to the same content. This is especially true in the diverse “blogsphere” where one person’s “futbol” is undoubtedly another’s “football” or another’s “soccer”.
Beyond a fatal lack of consistency, tagging efforts also suffer from a lack of context. As any information retrieval specialist will tell you, categorized documents are most useful when they are placed into a semantically rich context. In the information retrieval world, such context is provided by formalized taxonomies. Even though the RSS standard provides for taxonomies, tagging as it is currently executed lacks any concept of taxonomies and thus lacks context.
Deprived of consistency and context, tagging threatens to become a colossal waste of time as it merely adds a layer of incoherent and inconsistent metadata on top of an already unmanageable number of feeds.
While tagging may be doomed to confusion, there are some other potential approaches that promise to bring order to RSS’s increasingly chaotic situation. The most promising approach involves something called a Meta-feed. Meta-feeds are RSS feeds composed solely of metadata about other feeds. Combining meta-feeds with the original source feeds enables RSS readers to display consistently categorized posts within rich and logically consistent taxonomies. The process of creating a meta-feed looks a lot like the one used to create a search index. First, crawlers must scour RSS feeds for new posts. Once new posts are located, they are categorized and placed into a taxonomy using advanced statistical processes such as Bayesian analysis and natural language processing. This metadata is then keyed to the URL of the original post and published in its own RSS meta-feed. In addition to the categorization data, the meta-feed can also contain taxonomy information, as well as information about such things as exact/near duplicates and related posts.
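A minimal sketch of the publishing side, assuming the statistical classifier has already run: each meta-feed item carries only a link back to the original post plus its assigned category and taxonomy. The URLs, category names, and taxonomy identifier below are hypothetical.

```python
# Sketch: emit a meta-feed -- an RSS feed whose items contain only
# metadata (category + taxonomy) keyed to the URLs of the original posts.
import xml.etree.ElementTree as ET

# Hypothetical classifier output: original post URL -> assigned metadata.
CLASSIFIED_POSTS = [
    {"link": "http://example.com/posts/1", "category": "Sports/Cricket",
     "taxonomy": "http://example.org/taxonomy"},
    {"link": "http://example.com/posts/2", "category": "Business/Markets",
     "taxonomy": "http://example.org/taxonomy"},
]

def build_meta_feed(post_metadata):
    """Build an RSS 2.0 document holding only metadata about other posts."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = "Meta-feed (metadata only)"
    for meta in post_metadata:
        item = ET.SubElement(channel, "item")
        # The link ties the metadata back to the original post.
        ET.SubElement(item, "link").text = meta["link"]
        # RSS 2.0's <category domain="..."> can carry the taxonomy reference.
        cat = ET.SubElement(item, "category", domain=meta["taxonomy"])
        cat.text = meta["category"]
    return ET.tostring(rss, encoding="unicode")

print(build_meta_feed(CLASSIFIED_POSTS))
```

Duplicate and related-post hints could ride along as additional child elements of each item under the same scheme.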
RSS readers can then request both the original raw feeds and the meta-feeds. They then use the meta-feed to appropriately and consistently categorize and relate each raw post.
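On the reader side, the join is little more than a lookup by URL; a sketch, with the data structures invented for illustration:

```python
# Sketch: a reader joining raw feed items to meta-feed metadata by URL.
def merge_with_meta(raw_items, meta_lookup):
    """Group raw posts under the categories assigned by a meta-feed.

    raw_items:   list of (link, title) pairs from the original feeds.
    meta_lookup: dict mapping a post's link to its meta-feed category.
    Posts the meta-feed hasn't categorized yet fall back to "Uncategorized".
    """
    by_category = {}
    for link, title in raw_items:
        category = meta_lookup.get(link, "Uncategorized")
        by_category.setdefault(category, []).append(title)
    return by_category

raw = [("http://example.com/a", "Test match report"),
       ("http://example.com/b", "Earnings preview"),
       ("http://example.com/c", "Untagged musings")]
meta = {"http://example.com/a": "Sports/Cricket",
        "http://example.com/b": "Business/Markets"}
print(merge_with_meta(raw, meta))
```

Because the categories come from one meta-feed publisher rather than thousands of individual bloggers, every post filed under "Sports/Cricket" really is about cricket.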
For end users, meta-feeds will enable a wealth of features and innovations. Users will be able to easily find related documents and eliminate duplicates of the same information (such as two newspapers reprinting the same wire story). Users will also be able to create their own custom taxonomies and category names (as long as they relate them back to the meta-feed). Users can even combine meta-feeds from two different sources, so long as one of the meta-feed publishers creates an RDF file that relates the two sets of categories and taxonomies (to the extent practical). Of course, the biggest benefit to users will be that information is consistently sorted and grouped into meaningful categories, allowing them to greatly reduce the amount of “noise” created by duplicate and non-relevant posts.
At a higher level, the existence of multiple meta-feeds, each with its own distinct taxonomy and categories, will in essence create multiple “views” of the web that are not predicated on any single person’s semantic orientation (as is the case with tagging). In this way it will be possible to view the web through unique editorial lenses that transcend individual sites and instead present the web for what it is: a rich and varied collective enterprise that can be wildly different depending on your perspective.
The Road Not Yet Traveled
Unfortunately, the road to this nirvana is long and, as of yet, largely untraveled. While it may be possible for services like PubSub and Technorati to put together their own proprietary end-to-end implementations of meta-feeds, for such feeds to become truly accepted, standards will have to be developed that incorporate meta-feeds into readers and allow for interoperability between meta-feeds.
If RSS fails to address “Feed Overload Syndrome”, it will admittedly not be the end of the world. RSS will still provide a useful, albeit highly limited, “alert” service for new content at a limited number of sites. However, for RSS to reach its potential of dramatically expanding the scope, scale, and richness of individuals’ (and computers’) interaction with the web, innovations such as meta-feeds are desperately needed in order to create a truly scalable foundation.
RSS: A Big Success In Danger of Failure
There’s a lot of hoopla out there these days about RSS. RSS is an XML-based standard for summarizing and ultimately syndicating web-site content. Adoption and usage of RSS has taken off in the past few years, leading some to suggest that it will play a central role in transforming the web from a random collection of websites into a paradise of personalized data streams. However, the seeds of RSS’s impending failure are being sown by its very success, and only some serious improvements in the standard will save it from a premature death.
The roots of RSS go all the way back to the infamous “push” revolution of late 1996/early 1997. At that point in time, PointCast captured the technology world’s imagination with a vision of the web in which relevant, personalized content would be “pushed” to end users, freeing them from the drudgery of actually having to visit individual websites. The revolution reached its apex in February of 1997, when Wired Magazine published a “Push” cover story in which it dramatically declared the web dead and “push” the heir apparent. Soon technology heavyweights such as Microsoft were promoting their own “push” platforms, and for a brief moment in time the “push revolution” actually looked like it might happen. Then, almost as quickly as it took off, the push revolution imploded. There doesn’t appear to have been one single cause of the implosion (outside of Wired’s endorsement): some say it was the inability to agree on standards, while others finger clumsy and proprietary “push” software, but whatever the reasons, “push” turned out to be a big yawn for most consumers. Like any other fad, consumers toyed with it for a few months and then moved on to the next big thing. Push was dead.
Or was it? For while push, as conceived of by PointCast, Marimba, and Microsoft, had died an ugly and (most would say richly deserved) public death, the early seeds of a much different kind of push, one embodied by RSS, had been planted in the minds of its eventual creators. From the outset, RSS was far different from the original “push” platforms. Instead of a complicated proprietary software platform designed to capture revenue from content providers, RSS was just a simple text-based standard. In fact, from a technical perspective RSS is actually much more “pull” than “push” (RSS clients must poll sites to get the latest content updates), but from the end user’s perspective the effect is basically the same. As an unfunded, collective effort, RSS lacked huge marketing and development budgets, and so, outside of a few passionate advocates, it remained relatively unknown for many years after its initial creation.
Recently though, RSS has emerged from its relative obscurity, thanks in large part to the growing popularity of RSS “readers” such as FeedDemon, NewsGator, and SharpReader. These readers allow users to subscribe to several RSS “feeds” at once, thereby consolidating information from around the web into one highly efficient, highly personalized, and easy-to-use interface. With its newfound popularity, proponents of RSS have begun hailing it as the foundation for a much more personalized and relevant web experience, one which will ultimately transform the web from an impenetrable clutter of passive websites into a constant, personalized stream of highly relevant data that can reach users no matter where they are or what device they are using.
Back to the Future?
Such rhetoric is reminiscent of the “push” craze, but this time it may have a bit more substance. The creators of RSS clearly learned a lot from push’s failures and have incorporated a number of features which suggest that RSS will not suffer the same fate. Unlike “push”, RSS is web friendly. It uses many of the same protocols and standards that power the web today, and it uses them in the classic REST-based “request/response” architecture that underpins the web. RSS is also an open standard that anyone is free to use in whatever way they see fit. This openness is directly responsible for the large crop of diverse RSS readers and the growing base of RSS-friendly web sites and applications. Thus, by embracing the web instead of attempting to replace it, RSS has been able to leverage the web to help spur its own adoption.
One measure of RSS’s success is the number of RSS-compliant feeds, or channels, available on the web. At Syndic8.com, a large aggregator of RSS feeds, the total number of feeds listed has grown over 2,000% in just 2.5 years, from about 2,500 in the middle of 2001 to almost 53,000 in February of 2004. The growth rate also appears to be accelerating: a record 7,326 feeds were added in January of 2004, 2X the previous monthly record.
A Victim of Its Own Success
The irony of RSS’s success, though, is that this same success may ultimately contribute to its failure. To understand why, it helps to imagine the RSS community as a giant cable TV operator. From this perspective, RSS now has tens of thousands of channels and will probably have hundreds of thousands of channels by the end of the year. While some of the channels are branded, most are little-known blogs and websites. Now imagine that you want to tune into channels about, let’s say, Cricket. Sure, there will probably be a few channels with 100% of their content dedicated to Cricket, but most of the Cricket information will inevitably be spread out in bits and pieces across the hundreds of thousands of channels. Thus, in order to get all of the Cricket information, you will have to tune into hundreds, if not thousands, of channels and then try to filter out all the “noise”, the irrelevant programs that have nothing to do with Cricket. That’s a lot of channel surfing!
The problem is only going to get worse. Each day as the number of RSS channels grows, the “noise” created by these different channels (especially by individual blogs which often have lots of small posts on widely disparate topics) also grows, making it more and more difficult for users to actually realize the “personalized” promise of RSS. After all, what’s the point of sifting through thousands of articles with your reader just to find the ten that interest you? You might as well just go back to visiting individual web sites.
Searching In Vain
What RSS desperately needs are enhancements that will allow users to take advantage of the breadth of RSS feeds without being buried in irrelevant information. One potential solution is to apply search technologies, such as keyword filters, to incoming articles (as pubsub.com is doing). This approach has two main problems: 1) the majority of RSS feeds include just short summaries, not the entire article, which means that 95% of the content can’t even be indexed; 2) while keyword filters can reduce the number of irrelevant articles, they will still become overwhelmed given a sufficiently large number of feeds. This “information overload” problem is not unique to RSS; it is one of the primary problems of the search industry, where the dirty secret is that the quality of search results generally declines the more documents you have to search.
Classification and Taxonomies to the Rescue
While search technology may not solve the “information overload” problem, its closely related cousins, classification and taxonomies, may have just what it takes. Classification technology uses advanced statistical models to automatically assign categories to content; these categories can be stored as metadata with the article. Taxonomy technology creates detailed tree structures that establish the hierarchical relationships between different categories. A venerable example of these two technologies working together is Yahoo!’s website directory. Here Yahoo has created a taxonomy, or hierarchical list of categories, of Internet sites. Yahoo has then used classification technology to assign each web site one or more categories within the taxonomy. With the help of these two technologies, a user can sort through millions of Internet sites and find just those websites that deal with, say, Cricket, in just a couple of clicks.
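As a toy illustration of the classification step, here is a minimal Naive Bayes categorizer over word counts. The training data and category names are invented, and real classification engines would add informed priors, stemming, and far larger vocabularies; this is only a sketch of the statistical idea.

```python
# Toy sketch of statistical categorization: a Naive Bayes classifier over
# word counts, with Laplace smoothing and uniform category priors.
import math
from collections import Counter

# Hypothetical training examples per category.
TRAINING = {
    "Cricket": ["batsman scores century in test match",
                "bowler takes five wickets at the oval"],
    "Finance": ["shares fall as markets react to earnings",
                "stock rises after strong quarterly earnings"],
}

def train(examples):
    """Count word frequencies per category and build the shared vocabulary."""
    word_counts, totals, vocab = {}, {}, set()
    for cat, docs in examples.items():
        counts = Counter(w for d in docs for w in d.split())
        word_counts[cat] = counts
        totals[cat] = sum(counts.values())
        vocab |= set(counts)
    return word_counts, totals, vocab

def classify(text, model):
    """Return the category with the highest (smoothed) log-likelihood."""
    word_counts, totals, vocab = model
    best_cat, best_score = None, float("-inf")
    for cat in word_counts:
        score = 0.0
        for w in text.split():
            # Laplace smoothing so unseen words don't zero out a category.
            score += math.log((word_counts[cat][w] + 1) /
                              (totals[cat] + len(vocab)))
        if score > best_score:
            best_cat, best_score = cat, score
    return best_cat

model = train(TRAINING)
print(classify("the batsman hit a century", model))  # -> Cricket
```

Assigning the winning category a node in a taxonomy tree (e.g. Sports/Cricket) is then a simple lookup layered on top of the classifier’s output.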
It’s easy to see how RSS could benefit from the same technology. Assigning articles to categories and associating them with taxonomies will allow users to subscribe to “meta-feeds” that are based on categories of interest, not specific sites. With such a system in place, users will be able to have their cake and eat it too: they will effectively be subscribing to all RSS channels at once, but thanks to the categories they will only see those pieces of information that are personally relevant. Bye-bye noise!
In fact, the authors of the RSS standard anticipated the importance of categories and taxonomies early on, and the standard actually supports including both category and taxonomy information within an RSS message. The good news, then, is that RSS is already “category and taxonomy ready”.
What Do You Really Mean?
But there’s a catch. Even though RSS supports the inclusion of categories and taxonomies, there’s no standard for how to determine which category an article should be in or which taxonomy to use. Thus there’s no guarantee that two sites with very similar articles will categorize them the same way or use the same taxonomy. This raises the very real prospect that, for example, the “Football” category will contain a jumbled group of articles covering both the New England Patriots and Manchester United. Such a situation leads us back to an environment filled with “noise”, leaving us no better off than when we started.
The theoretical solution to this problem is to get everyone in a room and agree on a common way to establish categories and on a universal taxonomy. Unfortunately, despite the best efforts of academics around the world, this has so far proven impossible. Another idea might be to map the relationships between different concepts and taxonomies and then provide some kind of secret decoder ring that enables computers to infer how everything is interrelated. This is basically what the Semantic Web movement is trying to do. It sounds great, but it will likely be a long time before the Semantic Web is perfected, and everyone will easily lose patience with RSS before then. (There is actually a big debate within the RSS community over how Semantic Web-centric RSS should be.)
Meta-Directories And Meta-Feeds
The practical solution will likely be to create a series of meta-directories that collect RSS feeds and then apply their own classification tools and taxonomies to those feeds. These intermediaries would then either publish new “meta-feeds” based on particular categories, or they would return the category and taxonomy metadata to the original publisher, which would then incorporate the metadata into its own feeds.
There actually is strong precedent for such intermediaries. In the publishing world, major information services like Reuters and Thomson have divisions that aggregate information from disparate sources, classify it, and then resell the classified news feeds. There are also traditional syndicators, such as United Media, who collect content and then redistribute it to other publications. In addition to these established intermediaries, some RSS-focused start-ups such as Syndic8 and pubsub.com also look poised to fulfill these roles should they choose to do so.
Even if these meta-directories are created, it’s not clear that the RSS community will embrace them, as they introduce a centralized intermediary into an otherwise highly decentralized and simple system. However, it is clear that without the use of meta-directories and their standardized classifications and taxonomies, the RSS community is in danger of collapsing under the weight of its own success and becoming the “push” of 2004. Let’s hope they have learned from the mistakes of their forefathers.