Persistent Search: Search’s Next Big Battleground
What do you get when you marry ping servers and RSS with stored queries? A whole new type of search that is destined to become the search industry’s next big battleground: Persistent Search. While Persistent Search presents search companies with difficult new technical challenges and potentially higher infrastructure costs, it also gives them powerful mechanisms for building much stronger user relationships which may increase the value of their advertising services.
The Search That Never Stops
Simply put, Persistent Search allows users to enter a search query just once and then receive constant, near real-time, automatic updates whenever new content that meets their search criteria is published on the web. For example, let’s say you are a stock trader and you want to know whenever one of the stocks in your portfolio is mentioned on the web. By using a persistent search query, you can be assured that you will receive a real-time notification whenever one of your stocks is mentioned. Or perhaps you are a teenager who is a rabid fan of a rock group. Wouldn’t it be nice to have a constant stream of updates on band gossip, upcoming concerts, and new albums flowing to your mobile phone? Or maybe you are just looking to rent the perfect apartment or buy a specific antique. Wouldn’t it be nice to get notified as soon as new items that roughly match your criteria are listed on the web, so that you can respond before someone else beats you to the punch? Persistent Search makes all of this possible for end users with very little incremental effort.
Something Old, Something New
While the technical infrastructure required for Persistent Search services leverages existing search technology, there are several new elements that must be added to existing technology to make Persistent Search a reality. These elements include:
- Ping Servers: Most blogs and an increasing number of other sites now send special “pings” to so-called “Ping Servers” every time they publish new content. The ping servers do things such as cue crawlers at a search engine to re-index a site or provide a summarized list of recently published information to other web sites. Because ping servers are the first to know about newly published content, they are critical to enabling the real-time nature of Persistent Search.
- RSS: RSS feeds can be used both to feed raw information into Persistent Search platforms (in a fashion similar to what Google Base does) and to take processed queries out. RSS is a polling-based mechanism, so it does not provide real-time notification, but it is good enough in most cases.
- Stored Queries: Stored queries are simply search queries that are “saved” for future use. Ideally, the stored query runs constantly in the background and flags any new piece of content that meets the search criteria. While this is a simple concept, it presents some very difficult technical challenges. The easiest way for a search engine to implement a stored query would be to execute it against its existing index at some regular interval, say once an hour. However, executing each unique stored query 24 times a day could become very expensive if Persistent Search starts to take off. One could easily imagine search companies in the near future executing billions of incremental stored queries an hour. Processing these added queries will take lots of extra resources but will not generate the same amount of revenue as traditional ad-hoc search queries, because stored queries will often return no new results. One alternative would be for search companies to use the “stream database” query techniques pioneered by start-ups such as StreamBase. These techniques would allow them to query new content as it flows into the index, not only reducing overall query load but also improving latency. However, changing their query approach is a huge step for existing search companies and therefore one that is unlikely to be undertaken. A more likely approach might be to use special algorithms to combine stored queries into “master queries”, which reduces the number of queries that need to be executed, and then to use simple post-query filters to “personalize” the results. Given their critical importance to the overall quality of Persistent Search, the design of “stored query architectures” is likely to become one of the key technical battlegrounds for search companies, much the way query result relevancy has been for the past few years.
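The stream-style alternative described above can be sketched in a few lines: rather than re-running every stored query against the full index on a schedule, each new document is matched against an inverted index built over the stored queries themselves as it arrives. This is a toy illustration of the general technique, not any vendor's actual architecture; the class name and the simple keyword matching are assumptions for demonstration.

```python
# Toy prospective-search matcher: stored queries are indexed by term, and new
# documents are checked against them as they stream in.
from collections import defaultdict

class PersistentSearchMatcher:
    def __init__(self):
        self.queries = {}                    # query_id -> set of required terms
        self.term_index = defaultdict(set)   # term -> ids of queries using it

    def store_query(self, query_id, query_text):
        terms = set(query_text.lower().split())
        self.queries[query_id] = terms
        for term in terms:
            self.term_index[term].add(query_id)

    def match_document(self, doc_text):
        """Return ids of stored queries whose terms all appear in the doc."""
        doc_terms = set(doc_text.lower().split())
        candidates = set()
        for term in doc_terms:
            candidates |= self.term_index.get(term, set())
        # Cheap per-query filter over the candidate set only.
        return [qid for qid in candidates if self.queries[qid] <= doc_terms]

matcher = PersistentSearchMatcher()
matcher.store_query("q1", "acura tl")
matcher.store_query("q2", "lung cancer study")
print(matcher.match_document("new study finds 2002 acura tl holds value"))
# → ['q1']
```

Note that the term index over queries is, in effect, a crude version of the “master query” idea: one pass over a document's terms identifies every candidate query at once, and an inexpensive per-query filter finishes the job.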
Once these three pieces are put together, search companies will be in position to provide rich Persistent Search services. The results of those services will be distributed to end users via e-mail, RSS, IM, SMS, or some pub-sub standard, depending on their preferences and priorities.
The Business of Persistent Search
From a business and competitive perspective, Persistent Search has a number of very attractive aspects to it relative to traditional ad-hoc queries. Traditional ad-hoc search queries tend to result in very tenuous user relationships with each new query theoretically a competitive “jump ball”. Indeed, the history of search companies, with no less than 5 separate search “leaders” in 10 years, suggests that search users are not very loyal.
Persistent Search presents search companies with the opportunity to build rich, persistent relationships with their users. The search engine that captures a user’s persistent searches will not only have regular, automatic exposure to that user, but will also be able to build a much better understanding of that user’s unique needs and interests, which should theoretically enable it to sell more relevant ads and services at higher prices. It will also stand a much better chance of capturing all or most of that user’s ad-hoc queries because it will already be in regular contact with the user.
It is this opportunity to build a long-term, rich relationship directly with a uniquely identifiable consumer that will make Persistent Search such an important battleground between the major search companies. Persistent Search may be especially important to Google and Yahoo as they attempt to fight Microsoft’s efforts to embed MSN Search into Windows Vista.
It should also be noted that enterprise-based Persistent Search offers corporations the opportunity to improve both internal and external communications. For example, it’s not hard to imagine the major media companies offering persistent search services or “channels” to consumers for their favorite actor, author, or singer.
The State of Play
As it stands, Persistent Search is in its infancy. The world leader in persistent search is most likely the US government, specifically the NSA; however, the extent of its capabilities appears to be a closely guarded secret. Some of the commercial players in the space include:
- Pub-Sub is in many ways the commercial pioneer of the space. It operates one of the largest ping servers and has been offering customized keyword-based “real time” persistent searches for some time. Some start-ups are even building vertical persistent search services on top of Pub-Sub’s infrastructure. However, Pub-Sub lacks the infrastructure and resources of a complete search engine, as well as the consumer awareness and multi-channel distribution capabilities needed to reach critical consumer mass.
- Google itself offers Google Alerts, a service that enables users to be e-mailed the results of stored queries; however, this service relies on Google’s general index, so it can take days for new content to appear. It also does not appear to directly incorporate ping servers and does not offer any distribution mechanisms beyond e-mail. In addition, as a beta service, it’s not clear that it is capable of scaling efficiently should demand take off.
- Real Time Matrix is a start-up focused on aggregating RSS feeds and re-broadcasting them based on user preferences. The net effect of their technology is to deliver real-time Persistent Search-based RSS feeds to consumers.
- Technorati, a popular blog search engine, allows users to create “watchlists” of persistent searches. However Technorati limits its index to blogs and so does not offer a comprehensive search service.
- Yahoo and MSN offer their own alert services, and while these services provide very good distribution options (e-mail, IM, SMS), they do not integrate with their search engines but instead offer canned updates for things such as news stories and the weather.
- WebSite-Watcher and Trackengine offer services that track specific sites and/or RSS feeds for changes, but they do not allow fine-grained free-text queries and operate only at the site level.
Despite this activity, no one has yet put together an end-to-end Persistent Search offering that enables consumer-friendly, comprehensive, real-time, automatic updates across multiple distribution channels at a viable cost. That said, the opportunity is clear and the competitive pressures are real, so I expect to see rapid progress towards this goal in the near future. It will be interesting to see how it plays out.
Next Generation Search: Entities, Categories, and Pings
The search industry has been dominated for the last 6 years or so by one significant technology: Google’s PageRank. PageRank basically uses popularity as a proxy for relevancy, which, if Google’s $100BN market cap is any guide, has proven to be a valuable approximation. However, like many technologies, PageRank’s hegemony is gradually wearing thin thanks to the fact that it is easily gamed and inherently limited by its derived and agnostic nature. With PageRank’s utility declining thanks to SEO companies and “site spam”, the natural question becomes what, if any, new technologies will emerge to either supplement or even replace PageRank.
To that end, a couple of recent entrants into the search market give strong hints as to how search technology will evolve in the near future. The good news is that this evolution will significantly improve both the relevance and utility of search. The bad news is that it is not clear which company will be the main beneficiary of this evolution.
The first new entrant emblematic of this evolution is Vast.com, which officially launched its public beta yesterday. Vast is a multi-category vertical search engine, with its first three categories being automobiles, jobs, and personals. Vast is similar in some respects to other category-specific vertical search players, such as Simply Hired and Trulia; however, as its name clearly suggests, it has more expansive ambitions. What makes Vast particularly noteworthy is that one of the key ingredients in its secret sauce is a very advanced entity extraction engine. Entity extraction is the process of “mining” unstructured data for key attributes (such as the model of a car or the salary level of a job) which are then separately indexed. Entity extraction is much more sophisticated than screen scraping because it uses semantic inferences, not pre-determined static “maps”, to determine what entities are present in a given piece of text.
By using the same core entity extraction engine across multiple verticals, Vast has in essence created a “horizontal” vertical search platform that is readily extensible to new verticals and highly efficient in that it only needs to crawl a given site once to extract multiple types of entities.
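The flavor of entity extraction can be conveyed with a deliberately crude, rule-based sketch. Real engines like the one Vast is described as using rely on semantic inference rather than fixed patterns; every pattern and field name below is an illustrative assumption.

```python
# Toy entity extraction for a used-car listing: mine free text for a year,
# an asking price, and a make, and return them as structured fields.
import re

YEAR_RE  = re.compile(r"\b(19[89]\d|20[0-2]\d)\b")   # plausible model years
PRICE_RE = re.compile(r"\$([\d,]+)")                  # "$17,500" style prices
MAKES    = {"acura", "honda", "toyota", "ford"}       # tiny illustrative gazetteer

def extract_car_entities(text):
    """Mine a free-text listing for year, price, and make."""
    entities = {}
    year = YEAR_RE.search(text)
    if year:
        entities["year"] = int(year.group(1))
    price = PRICE_RE.search(text)
    if price:
        entities["price"] = int(price.group(1).replace(",", ""))
    for word in text.lower().split():
        if word in MAKES:
            entities["make"] = word
            break
    return entities

print(extract_car_entities("For sale: 2002 Acura 3.2 TL, low miles, $17,500"))
# → {'year': 2002, 'price': 17500, 'make': 'acura'}
```

Once fields like these are pulled out and separately indexed, the structured queries and comparisons described in the next paragraphs become straightforward.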
What does this technology do for search users? Well, for one thing, it makes it a hell of a lot easier to find a used car for sale on the web. Not only can a user instantly narrow their search to a specific make, model, and year, but thanks to its entity indices, Vast can also tell them what they are likely to pay for the purchase. For example, if you are looking for a 2001-2002 Acura 3.2 TL, according to Vast, chances are you will pay around $18K for the privilege. If you’re looking to get a job as a computer programmer in the Bay Area, well, you’re probably looking at about an $80K salary, while the same job will only net you $62.5K in Des Moines, Iowa (although you will live like a king compared to the poor coder in Silicon Valley living in some half-million-dollar shack).
How does Vast know this? Because they have extracted all the entity information, indexed it, and done some very basic calculations. But Vast has really only scratched the surface of what it can do with this meta-data. For example, why have people search for a particular car when they can search by a particular budget, such as “Hey, I only have $20-25K to spend and want the most car for my money, so what’s out there that I can afford?” Ask that question of a standard search engine and it will likely spew back a lot of random garbage. However, a site such as Vast could very easily provide a highly organized matrix of all the different make, model, and year combinations that one could afford on that budget without breaking a sweat. Even more intriguing, a site like Vast could just as easily start ranking its search results along different dimensions, such as “best value”. After all, with a few basic algorithms in place, Vast will be able to spot that clueless person in Oregon who is listing their car way below market, because Vast knows what price everyone else with a similar car is asking. While the ultimate success of Vast is still an open question given that it faces a large field of both horizontal and vertical competitors, Vast clearly demonstrates the value and utility of adding robust entity extraction technologies to search and therefore provides a likely glimpse of search’s near-term evolution.
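The “basic calculations” behind such a best-value ranking are easy to sketch once prices have been extracted and indexed: compute the median asking price per comparable group, then flag listings far below it. The data and the 85% bargain threshold below are made-up assumptions for illustration.

```python
# Sketch of "best value" ranking over extracted listing entities.
from statistics import median

listings = [
    {"make": "acura", "model": "tl", "year": 2002, "price": 18500},
    {"make": "acura", "model": "tl", "year": 2002, "price": 17900},
    {"make": "acura", "model": "tl", "year": 2002, "price": 18200},
    {"make": "acura", "model": "tl", "year": 2002, "price": 14000},  # underpriced
]

def market_price(listings, make, model, year):
    """Median asking price among comparable listings."""
    prices = [l["price"] for l in listings
              if (l["make"], l["model"], l["year"]) == (make, model, year)]
    return median(prices)

def bargains(listings, discount=0.85):
    """Flag listings priced below `discount` times the comparable median."""
    out = []
    for l in listings:
        m = market_price(listings, l["make"], l["model"], l["year"])
        if l["price"] < discount * m:
            out.append(l)
    return out

print(bargains(listings))  # flags only the $14,000 listing
```

The same per-group statistics answer the budget question too: group every (make, model, year) combination by its median price and return the groups under the user's ceiling.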
The Browseable Web
Another new search site that provides a similar glimpse is Kosmix.com. Kosmix is focused on making search results more “browseable” by using categorization technology to determine the topics addressed by a web page. Once the topics are determined, the web page is then associated with specific nodes of a pre-defined taxonomy.
For example, go to Kosmix Health and search on a medical condition, say autism. Off to the left-hand side of the initial search results is a list of categories associated with autism. This list, or taxonomy, enables people to rapidly zoom in on the specific type of information they are interested in, such as medical organizations specifically focused on autism. It not only lets people quickly filter results for the particular type of information they are looking for but also lets them easily browse through different types of content on the same subject.
There have been others that have tried to do similar category-based searches, however Kosmix is the only company that has figured out how to build categorized search technology that can perform acceptably at “internet scale” and as such represents a major advance in the field.
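At a toy scale, category assignment can be sketched as keyword scoring against a fixed taxonomy. Kosmix's actual internet-scale technology is certainly far more sophisticated; the taxonomy nodes and topic terms below are purely illustrative assumptions.

```python
# Toy categorizer: associate a page with taxonomy nodes by counting
# topic-term hits, best-scoring node first.
TAXONOMY = {
    "Health/Conditions/Autism": {"autism", "asperger", "spectrum"},
    "Health/Organizations":     {"foundation", "society", "association"},
    "Health/Treatment":         {"therapy", "treatment", "intervention"},
}

def categorize(page_text, min_hits=1):
    """Return taxonomy nodes whose terms appear in the page, best first."""
    words = set(page_text.lower().split())
    scores = {node: len(terms & words) for node, terms in TAXONOMY.items()}
    return [node for node, s in sorted(scores.items(), key=lambda kv: -kv[1])
            if s >= min_hits]

print(categorize("The autism society runs a therapy foundation"))
```

A page can legitimately land in several nodes at once, which is exactly what makes the results “browseable”: the same result set can be sliced along any node the user clicks.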
We’ll Need A Search Running … It Has Already Begun
One last missing piece of the next generation search infrastructure that will likely enter the picture is ping servers. Ping servers are currently used mostly by blogs to notify search engines that new content is available to index. Ping servers thus greatly speed the rate at which new content is indexed. While traditionally these servers have been used by publishers to notify search engines, they are increasingly being used to notify end users that new content is available as well. Ping servers will become particularly powerful, though, when combined with persistent queries and the entity and categorization technologies discussed above.
Already today, an end user can have a ping server, such as Pub-Sub, run persistent queries for them and then notify them, typically via RSS, of any newly published content that fits their query. Such persistent queries will get even more powerful, though, when they are processed through search engines with reliable entity extraction and categorization capabilities.
For example, if you are looking to buy a specific kind of used car, wouldn’t you like to know immediately when someone listed a car for sale that met your criteria? If you were a company, wouldn’t you like to know immediately if a current employee listed their resume on a job site? If you are a portfolio manager wouldn’t you like to know immediately that a blog just published a negative article on a stock you own? If you are a lung cancer drug developer, wouldn’t you like to know every time someone published a new academic study on lung cancer?
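Each of these scenarios boils down to the same mechanism: a persistent query expressed over extracted entities rather than raw keywords, checked whenever a ping announces new content. A minimal sketch, with field names and operators that are assumptions for illustration:

```python
# A persistent query as structured criteria over extracted entities.
def matches(criteria, entities):
    """True if every criterion (field, op, value) holds for the entities."""
    ops = {"==": lambda a, b: a == b,
           "<=": lambda a, b: a <= b,
           ">=": lambda a, b: a >= b}
    for field, op, value in criteria:
        if field not in entities or not ops[op](entities[field], value):
            return False
    return True

# "Notify me when a 2001-2002 Acura is listed under $20,000."
car_query = [("make", "==", "acura"), ("year", ">=", 2001),
             ("year", "<=", 2002), ("price", "<=", 20000)]

# Entities extracted from a newly pinged listing:
new_listing = {"make": "acura", "model": "tl", "year": 2002, "price": 17500}
print(matches(car_query, new_listing))  # True: fire a notification
```

The keyword version of this query would drown the user in noise; the entity version fires only when the structured facts, not just the words, line up.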
A New Foundation
Thus, ping servers combined with persistent queries filtered through entity and categorization engines will create a rich, multi-dimensional, real-time web search that actively and continually adds value for users while greatly limiting the ability of search engine optimization companies to cloud results with spam and other poor-quality content. This may not be the whole future of search, but given near-term trends it is likely to at least be a significant part of it.
Epilogue: Wither Google?
Whether or not Google can remain at the top of this rapidly evolving technical space is an open question. Indeed, few remember that we have arguably had no less than 5 web search leaders in just 13 years (e.g. WWW Virtual Library, EINet Galaxy, Yahoo, Alta Vista, Google). That said, Google has hired many of the world’s leading experts in areas such as entity extraction and categorization, so it will not lack the intellectual firepower to keep pace with start-ups such as Vast and Kosmix, and it clearly has the financial resources to acquire any particularly promising firm before it becomes too meddlesome (as Google did to Yahoo). So while the ultimate winner is as yet unknown, one thing is for certain: it will be an interesting race to watch.
Burnham’s Beat Reports Record Q4 Revenues
Silicon Valley, CA – (BLOGNESS WIRE) – Jan. 11, 2006
Burnham’s Beat today reported record results for its fourth quarter ended December 31, 2005. Revenues for Q4 2005 were $168.64, up 176% compared to $61.08 in Q4 2004, and up 27.3% sequentially vs. Q3 2005. Earnings before expenses, which management believes are the most cynical results we can think of, were also up 176%.
Commenting on the results, Bill Burnham, Chief Blogger of Burnham’s Beat, explains: “This quarter’s results continue to demonstrate that blogging is a complete waste of time. While we did not achieve our previously forecasted results of 100 billion page views and ‘Google-style cash, Baby!’, we remain hopeful that people forgot about those projections. There are several reasons for missing our projections, including an outage at our hosting provider in late Q4 which cost us at least $1.00, the continued poor quality of the writing on the site, high oil prices, several deals that slipped to next quarter, and uncertainty created by the war in Iraq.”
Page views were up 921% in Q4 2005 to 71,772 compared to 7,028 in Q4 2004. However, advertising click-through rates declined from 0.78% in Q2 2004 to 0.23% in Q4 2005. In addition, revenue per click fell 56.3% to $0.49/click compared to Q4 2004’s $1.11/click. Commenting on these statistics, Burnham added: “We continue to believe that page view growth and advertising revenues have been adversely impacted by the switch to ‘full text’ RSS feeds that we implemented in Q3, but we are too lazy to do anything about it. We added an additional search advertising partner in Q4 and were generally disappointed with the results. While revenue per click is higher at this partner, overall click-through rates are much lower. In terms of our main advertising ‘partner’, we have seen a clear pattern throughout the year of them reducing the revenue share they pay to their blog-related ‘partners’. Apparently they aren’t making enough money as it is and need to stick it to the little man.”
Revenue per post was $9.92 in Q4 2005 compared to $6.11 in Q4 2004. “Revenue per post indicates that we could pretty much write about paint drying and search engines would still drive enough random traffic to our site to make a few bucks.”
Affiliate fee revenues were $33.91 in Q4 2005 up 963% vs. $3.19 in Q3 2005. Burnham notes “We launched our affiliate fee division in Q1 of 2005. This unit performed poorly until we decided in early Q4 to blatantly pander to affiliate revenues by using sensationalized rhetoric and better placement, a tactic which appears to have worked well.”
Pro-forma expenses were $44.85, up 67% in Q4 2005 vs. Q4 2004, primarily due to switching from the basic $8.95 hosting package on Typepad.com to a $14.95 “advanced user” package, which management has yet to really figure out how to use, but it would be embarrassing to tell a fellow blogger that we were still using the “basic” package, as only newbies do that.
Readers are reminded that Burnham’s Beat’s financial results exclude all labor, connectivity, and capital expenses, all opportunity costs of actually doing something useful, and the considerable goodwill charges that result from constantly antagonizing people by badmouthing their company/industry/personal views. Including these expenses Burnham’s beat would have reported earnings of approximately -$1,000,000,000 but management does not believe that these actual results fairly reflect the alternate reality in which we currently exist.
Burnham’s Beat is comfortable with its previous guidance of 100 billion page views and “Google-Style cash in 2006, Baby!” and hopes that people remain forgetful.
Burnham’s Beat and Subsidiaries
Pro-Forma, Pro-Forma Preliminary Restated Unaudited Results
| | Q4 2005 | Q4 2004 |
| Revenues | $168.64 | $61.08 |
| Page views | 71,772 | 7,028 |
| Revenue per click | $0.49 | $1.11 |
| Revenue per post | $9.92 | $6.11 |
P.S. Yes these are the actual numbers.
Google Base + Vertical Search + RSS = Death of Walled Gardens
A few weeks ago, I promised to write a follow-up on my post about Google Base that detailed how the launch of Google Base might affect the Internet’s so-called “Walled Gardens” (content sites that charge users and/or suppliers for access to their databases). One month and a long cruise later here it is...
When I think of Walled Gardens on the Internet I am reminded of one of the many priceless scenes from Monty Python and the Holy Grail, in which the King of “Swamp Castle” explains that he has built three successive castles in the same place only to see the swamp subsequently swallow each castle. The king ends his speech by declaring that the 4th castle was the strongest in the land despite all historical experience to the contrary.
Much like the delusional King in the Holy Grail, many of the Internet’s biggest and most profitable “Walled Gardens”, sites such as Monster.com, Realtor.com, Match.com, and even EBay, appear to be in denial about the ultimate destiny of their sites, which is, that they are bound to be subsumed by the larger Internet.
History Repeats Itself
Not that this should come as a big surprise. Just look at the original “Walled Gardens”: CompuServe, Prodigy, and AOL. In their heyday these powerful “online services” controlled every piece of content on their networks and extracted princely sums, often from both content providers and customers alike. With the advent of the Internet all three of these players tried to hold onto their Walled Garden strategy only to see the growth and diversity of the Internet overwhelm their own suddenly meager offerings in just a few short years.
The New Old Thing
Flash forward to today. A set of “new and improved” Walled Gardens have been built, only now they are known as “paid listings” sites. Want to find a date? Better be prepared to pay Match.com a monthly fee. Want to sell a house? Better list it through a realtor with access to Realtor.com. Want to hire someone? Better pay Monster.com some coin. You’re a small business and need a heavily trafficked distribution channel? Be prepared to pay EBay a hefty fee.
The primary value-add that most paid listings sites offer is that they aggregate, structure, and index similar content into one coherent “site”. Early on in the Internet’s evolution these sites were literally the only place on the web that one could go for this kind of information. The sites that successfully built scale, brand, and network effects usually ended up the “winners” in their respective categories. Once they became winners, they were able to charge a healthy premium and thereby, in most cases, become highly profitable businesses.
What A Difference 10 Years Makes
The problem for these Walled Gardens is that, much like the online services before them, they have been built on a quickly shifting and deeply flawed foundation. The Internet of 2005 is a far different environment than the Internet of 1995. In 1995, the average user couldn’t spell “Internet”, let alone figure out how to set up and run their own Internet site. Even if they did set up their own site, the lack of any mechanism for other users to easily find it made operating one’s own site a pointless exercise; it was like building a billboard on a deserted island in the middle of the Pacific.
In 2005, things are a bit different. Not only has the average Internet user become much more sophisticated, but several trends are rapidly coalescing to deliver what could be a “knock out” blow to the Walled Gardens. These trends include:
- Self-publishing: Thanks to dramatically lower hosting costs and greatly improved software, today most businesses and a rapidly increasing number of individuals have their own Internet sites. Businesses in particular have quickly taken to publishing all kinds of information on their sites. For example, most companies have a section of their site where they list job openings and in the real estate industry almost every agent has their own web site with detailed descriptions of their current listings. Of critical importance is that none of this content is behind a “wall”. It is typically out there for anyone who happens to stop by to see. The net effect of all this self-publishing is that there is now a ton of openly accessible “primary content” just sitting out there waiting to be indexed and manipulated by anyone who chooses to.
- Pervasive Search: Sophisticated index search has become the glue that ties the entire Internet together. With search, no site is an island (or a Walled Garden) unless it chooses to be. Search has in essence leveled the field when it comes to distribution and brand. You can spend $1M on an ad in the Super Bowl, but it won’t change your search results.
- RSS: RSS is a content syndication standard which makes it incredibly easy for people to subscribe to content from a particular site. Want to be notified every time someone updates their personal profile on their blog, or every time a company adds a new job opening, or every time a realtor gets a new listing? Subscribe to their RSS feed. What RSS does is that it provides an automated way for the web to “feed” new information directly to interested people and computers.
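The RSS subscription model described above can be sketched with nothing but Python's standard library: parse the feed, remember which item guids have been seen, and report only what is new on each poll. The inline feed below stands in for a real URL (a real poller would fetch the XML with `urllib.request` and honor HTTP caching headers).

```python
# Minimal polling-based RSS "subscription" using only the standard library.
import xml.etree.ElementTree as ET

def new_items(feed_xml, seen_guids):
    """Parse RSS 2.0 XML and return (title, guid) pairs not yet seen."""
    root = ET.fromstring(feed_xml)
    fresh = []
    for item in root.iter("item"):
        guid = item.findtext("guid") or item.findtext("link") or ""
        title = item.findtext("title", default="")
        if guid and guid not in seen_guids:
            seen_guids.add(guid)
            fresh.append((title, guid))
    return fresh

feed = """<rss version="2.0"><channel>
  <item><title>New listing</title><guid>item-1</guid></item>
</channel></rss>"""

seen = set()
print(new_items(feed, seen))  # first poll reports the item
print(new_items(feed, seen))  # second poll reports nothing new
```

This is exactly the "feed new information to interested computers" mechanism: run `new_items` on a timer and act on whatever comes back.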
Collectively these trends are transforming the Internet from a disparate, sparsely populated frontier with a few major cities into a teeming highly integrated metropolis. It’s as if everyone on the Internet went from living on a deserted island to all living together on the same block in Manhattan.
A World Of Hurt
In such a close-knit world the foundation of paid listings sites is starting to look more than a little shaky. About the only thing that is still missing is the ability to relate listings within a taxonomy or consistent hierarchy. However, that barrier is also rapidly eroding. The first visible signs of such erosion came when a set of “vertical” search start-ups put 2 and 2 together and realized that thanks to self-publishing and pervasive search, they could create highly focused “vertical” search engines that trolled self-published Internet sites for specific information and then republished this information as one aggregated site. (See Jeff Clavier's blog for more in-depth coverage of vertical search.) Seemingly overnight these start-ups have been able to build databases that in many ways rival those that the Walled Gardens have taken years to build. For example, Trulia and Home Pages are searching individual realtor sites to build up databases of real estate listings that in many areas are bigger and more detailed than the ones available on the venerable Realtor.com. Simply Hired and Indeed are doing something similar in the job listings space by indexing job listings from company sites and other “open” gardens.
That start-ups have been able to become so big so fast is a very bad omen for the existing Walled Gardens, but it is nothing compared to the world of hurt that is represented by Google Base (and the undoubtedly soon-to-follow copycats from Yahoo and Microsoft). What Google Base represents is an attempt to marry the world’s best search infrastructure and technology with a highly structured and highly automated way of gathering, storing, and presenting listings information.
As I mentioned in my previous post, Google Base is essentially the world’s largest XML database. If you take the time to read through the XML schema, you will see that Google has essentially already built all of the components that it needs to enter the vertical search space in a big way; all it needs to do now is refine a few algorithms and flip a switch. If and when it does so, it will undoubtedly not only have the largest collection of listings in all major categories overnight but will also have arguably the best distribution channel for those listings on the Internet. This is not good news if you are currently charging either to display or to access similar listings (or if you are a vertical search start-up).
It is apparent that Google would prefer to have users flow information to Google Base via automated RSS feeds, and that is still probably the long-term goal (services to help users do this have already been created). However, if they decide to “prime the pump” by doing a little indexing of their own, watch out, because then Google can start automatically feeding millions of listings a day into Google Base. Shortly thereafter, there will be a population explosion of X.google.com domains, as in jobs.google.com, houses.google.com, cars.google.com, etc., as Google launches “listing sites” which will really be nothing more than a thin skin on top of Google Base and its indexing/crawling engine; think of it as Google News on steroids.
In fact, if you haven’t checked out Google Base in the few weeks since the launch announcement, you should probably take another look. In just a few short weeks of beta testing, users have voluntarily submitted well over a million items to the database despite the lack of user-friendly tools for doing so. Once at the site, take each of the major Walled Garden sites you can think of and match them up against each of the major categories that Google Base is currently listing. As a Walled Garden executive might say: “Uh oh!”
With Google Base fully in place (and ultimately similar services from Yahoo, Microsoft, and Amazon), why in the world would anyone pay to have their listings displayed or pay to have access to a database of listings? After all, if you publish a listing on your own site, Google will automatically index it and then list it within Google Base within the next few days, and if you want to make sure they get it immediately, you can just submit it directly to Google Base or register your RSS feed with them (a feature I'll bet they are likely to add). Instead of charging you (or its end users) for the privilege, Google will make money off of the advertising it sells around the listings. Perhaps you may even be able to pay a fee to have your particular listing “advertised” in a preferential position.
Whatever the case, the end result is clear: the Walled Garden sites are in for a world of hurt, and unfortunately for them it will probably come much more quickly and severely than it did to CompuServe, Prodigy, and AOL. That’s because the online service providers at least controlled people’s dial-up access to the Internet and thus were initially protected by very high switching costs. No such protections will be afforded to the listings sites, though. Since most of them sell short-term services with little or no customer lock-in, services such as Google Base and its associated offspring will likely have a much more immediate and painful impact.
That’s not to say that all of the paid listings sites are headed for imminent Armageddon, but it is to say that they shouldn’t renew their lease on Swamp Castle, because it is clearly going to get swallowed up yet one more time.
I will follow up on this with a post ranking the “most vulnerable” Walled Garden businesses and what, if anything, they can do to avoid sinking into oblivion.
RSS and Google Base: Google Feeds Off The Web
There has been a lot of talk about Google Base today on the web, and much of the reaction appears to be either muted or negative. The lack of enthusiasm seems to be driven by the fact that the GUI is pretty rudimentary and doesn't provide any real-time positive feedback (as Fred points out). But people who are turned off by the GUI should take into account that Google Base wasn't designed primarily for humans; it was designed for computers. In fact, I think if people were computers their reaction would be closer to jumping for joy than scratching their heads.
What's perhaps most interesting about the Google Base design is that it appears to have been designed from the ground up with RSS and XML at its center. One need look no further than the detailed XML Schema and extensive RSS 2.0 specification to realize that Google intends to build the world's largest RSS "reader", which in turn will become the world's largest XML database.
To facilitate this, I suspect that Google will soon announce a program whereby people can register their "Base compliant" RSS feeds with Google Base. Google will then poll these feeds regularly, just like any other RSS reader. Publishers can either create brand new Base-compliant feeds or, with a bit of XSLT/XML Schema work of their own, simply transform their existing content into a Base-compliant feed. Indeed, I wouldn't be surprised if there are several software programs available for download in a couple of months that do just that. Soon, every publisher on the planet will be able to have a highly automated, highly structured feed directly into Google Base.
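To make the idea concrete, here is a minimal Python sketch of such a transformation: plain listing records mapped onto RSS 2.0 items carrying structured attributes in an extension namespace. The `g:` namespace URL and the attribute names are my own assumptions for illustration, not Google's actual schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical extension namespace for structured listing attributes.
G = "http://base.google.com/ns/1.0"
ET.register_namespace("g", G)

def to_base_item(listing):
    """Map a plain listing dict onto a Base-style RSS <item>."""
    item = ET.Element("item")
    ET.SubElement(item, "title").text = listing["title"]
    ET.SubElement(item, "link").text = listing["url"]
    ET.SubElement(item, "description").text = listing["description"]
    # Structured attributes go into the g: namespace as child elements.
    for key, value in listing.get("attributes", {}).items():
        ET.SubElement(item, f"{{{G}}}{key}").text = str(value)
    return item

def build_feed(listings):
    """Wrap listing items in a minimal RSS 2.0 channel."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = "My Listings"
    for listing in listings:
        channel.append(to_base_item(listing))
    return ET.tostring(rss, encoding="unicode")

feed = build_feed([{
    "title": "2BR apartment",
    "url": "http://example.com/apt1",
    "description": "Sunny two-bedroom.",
    "attributes": {"price": 1200, "bedrooms": 2},
}])
```

The same shape works in XSLT, of course; the point is just how mechanical the transformation is once a publisher's listings are already structured data.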
Once the feed gets inside Google, the fun is just beginning. Most commentators have been underwhelmed by Google Base because they don't see the big deal of Google Base entries showing up as part of free text search. What these commentators miss is that Google isn't gathering all this structured data just so it can regurgitate it piecemeal via unstructured queries; it is gathering all this data so that it can build the world's largest XML database. With the database assembled, Google will be able to deliver a rich, structured experience that, as Michael Parekh sagely points out, is similar to what directory structures do. However, because Google Base will in fact be a giant XML database, it will be far more powerful than a structured directory. Not only will Google Base users be able to browse similar listings in a structured fashion, but they will also ultimately be able to do highly detailed, highly accurate queries.
In addition, it should not be lost on people that once Google assimilates all of these disparate feeds, it can combine them and then republish them in whatever fashion it wishes. Google Base will thus become the automated engine behind a whole range of other Google extensions (GoogleBay, GoogleJobs, GoogleDate), and it will also enable individual users to subscribe to a wide range of highly specific and highly customized meta-feeds. "Featured listings" will likely replace or complement AdWords in this implementation, but the click-through model will remain.
As for RSS, Google Base represents a kind of confirmation. With Google's endorsement, RSS has now graduated from a rather obscure content syndication standard to the exalted status of the web's default standard for data integration. Google's endorsement should in turn push other competitors to adopt RSS as their data transport format of choice. This adoption will in turn force many of the infrastructure software vendors to enhance their products so that they can easily consume and produce RSS-based messages, which will further cement the standard. At its highest level, Google's adoption of RSS represents a further triumph of REST-based SOA architectures over the traditional RPC architectures being advanced by many software vendors. Once again, short and simple wins over long and complex.
In my next post I will talk about Google Base's impact on the "walled garden" listings sites. I'll give you a hint: it won't be pretty.
Feed Overload Syndrome: 5 Recommended Ways To Cure It
Looks like there has been a major outbreak of RSS inspired "Feed Overload Syndrome" and it's spreading faster than the Avian flu. First Fred Wilson admitted that he was fully infected and then both Jeff Nolan and Om Malik confessed to similar symptoms. At this rate Feed Overload Syndrome may soon become an Internet-wide pandemic. Perhaps the government will divert some of that $7.1BN to establish the CFC or the Centers for Feed Control.
Some may recall that in the past I have posted about the dangers of Feed Overload Syndrome and I admit to having one of the first confirmed cases (at least confirmed by me). While I am still dealing with Feed Overload Syndrome (you can never cure it, just control it), I have made some progress fighting it and with that in mind I would like to offer my Top 5 Ways To Combat Feed Overload Syndrome:
- Direct Multiple Feeds to the Same Folder.
Many RSS readers or plug-ins allow you to create user-specified folders. While the default option is usually one feed per folder, in most instances you can direct multiple feeds to the same folder. By consolidating multiple feeds into a single folder you can dramatically cut down on the overall clutter of your feed list. For example, I subscribe to about 6 poker-related feeds. All these feeds post to one folder called, creatively enough, "poker". I have done the same thing for a number of other subjects as well. You don't have to do this for all your feeds, but it is especially good for feeds that you read infrequently and/or that post infrequently.
- Subscribe to meta-feeds.
Meta-feeds are RSS feeds that are composed of posts from a number of individual RSS feeds. For example, I like the business intelligence feed provided by Technology Updates. This feed contains articles about business intelligence drawn from a wide variety of other feeds. I also subscribe to keyword-based meta-feeds from sites such as Pubsub and Technorati. Meta-feeds are rapidly increasing in popularity and can now be found in places such as Yahoo!, where you can, for example, subscribe to all the news on a particular stock or all of the articles on your favorite football team. Once you subscribe to meta-feeds, make sure to eliminate any old feeds that are now covered by the meta-feed.
- Increase your publisher to distributor ratio.
A very coarse way to segregate feeds is as publishers or distributors. Publishers tend to publish relatively few posts that are longer than average and filled with original content. Distributors (also called linkers) tend to generate many posts a day and typically republish short excerpts of other people's posts with a short commentary of their own. Each has its place on the web, but you will find over time that as your feed list grows, the distributors will provide less value to you because you will already be subscribing directly to many of the same feeds that they tend to republish. Selectively eliminating just a few distributors/linkers can dramatically lower the number of posts you have to read each day.
- Regularly purge and organize your feed list.
You should review your feed list once a month with an eye towards removing old feeds you no longer want to read, consolidating existing feeds into shared folders, and substituting meta-feeds for primary feeds that you read infrequently or selectively.
- Support efforts to create true metafeed services.
As I have written before, RSS is in danger of becoming its own worst enemy thanks in large part to the fundamental factors driving Feed Overload Syndrome. The best hope for controlling this affliction is to support the growth of both metafeed and metatagging services. I personally do not believe that unstructured "democratic" tagging methodologies stand much of a chance, as tags without a standardized and consistent taxonomy are not much better than simple keyword-based meta-feeds. Creating metatags and metafeeds that are logically consistent and easily integrated into a well-formed taxonomy will only be accomplished once the necessary intellectual horsepower and financial resources are focused on the problem. Interestingly enough, Google long ago hired many of the best minds in this area and has more than enough money, so it probably has the best shot of anyone at curing, or at least controlling, Feed Overload Syndrome.
Following these 5 steps is not guaranteed to cure Feed Overload Syndrome, but I guarantee it will start to control it.
RSS vs. E-Mail: It’s No Contest, E-Mail Wins … For Now
As I mentioned in a prior post, there’s a study out that shows only 11% of blog readers use RSS and that 2/3rds of blog readers don’t even know what RSS is. For bloggers trying to build a subscription base of readers, that’s not good news. It means that, on average, only 1 out of every 9 visitors to a blog is going to be able to subscribe to it via RSS. Unless you have another option for regularly reaching such readers, they are going to be left out in the cold.
As it happens, there is another option, and that option is none other than good old reliable e-mail. While e-mail may not be as sexy as RSS, you can bet that close to 100% of blog readers have an e-mail account and know what it’s used for. Indeed, given the almost ubiquitous reach of e-mail and its “push” nature, one might argue that if you are really interested in reaching your users, you should probably make e-mail the preferred means of subscribing to your blog. That may sound like heresy to some in the blogging community, but I’d be willing to wager that the read rates for blog posts sent via e-mail are much higher than those for posts that are simply made available via RSS, not to mention the fact that e-mail subscriptions apparently reach the 90%+ of Internet users who don’t use RSS.
Up until recently, it seems as though one site, Bloglet, had a monopoly on enabling blogs to offer e-mail subscriptions to their posts. By many accounts, Bloglet is a somewhat unreliable service with little or no customer support. But it was the only game in town, so basically everyone used it. Recently a few new RSS-to-e-mail services have emerged, including Feedblitz and RSSFWD. I myself have switched to Feedblitz and, like several others who posted recently, have been very happy with the switch.
All that said, over the long term RSS will triumph: e-mail subscriptions to blogs do not scale well, and RSS will become much more ubiquitous and user friendly (thanks largely to MSFT’s decision to embed RSS into Vista). In the short term, however, I think it’s hard to argue that e-mail isn’t a far more accessible and practical way of allowing the vast majority of readers to subscribe to a blog.
RSS: Geeks Only Please
Jeff just linked to a new Nielsen study that reveals only 11% of blog readers use RSS and that a whopping 66% of blog readers don't even know what RSS is. These figures should be a bit sobering for VCs and the rest of Silicon Valley, because not only do 100% of VCs seem to know what RSS is, but it seems like 66% of them have already invested in an RSS/blog-related start-up. Some guys are even apparently trying to raise an RSS-themed VC fund.
Fact is, if you wander just a little bit outside the geek-centric world of tech-related and VC-related blogs, what you quickly discover is that RSS feeds are few and far between. Take political blogs, for example. A couple of weeks ago I noticed that one of the political blogs I enjoy, which also happens to be in the top half of the Top 500 feeds, didn't appear to have an RSS feed. I contacted the author and asked him if he had an RSS feed, and he asked me, "What's an RSS feed?" I explained, and told him that since he was apparently using Movable Type he should just be able to check a configuration box to generate a feed. Five minutes later, a small link to his feed (Movable Type's strange default "Syndicate this Site (XML)") appeared way at the bottom of his blog. After that we exchanged a couple of e-mails in which I encouraged him to consider moving the link up to the top so that he could capture subscribers and to consider inserting ads into his feeds to generate some more money. I checked back today and he still just has the single link way at the bottom of his site.
Now remember, this guy is a professional blogger doing numerous posts a day and trying to earn a living off of his blog (and judging by his ranking on the Top 500, doing a better job of that than most), but RSS wasn't even on his radar and even after having the supposed benefits described to him, he hasn't been motivated enough to do much more than the bare minimum. Thing is, his lack of action is totally rational. Given that I was apparently the only reader of his highly ranked blog to have ever asked him for an RSS feed I assume he therefore suspects, quite rightly, that spending a lot of time and energy to optimize his RSS feed would be a complete waste of time at this point.
I don't think such behavior indicates that RSS is doomed or that it is a passing fad (in fact it may just indicate that we are in the early stages of something huge and there's still plenty of time left to make RSS related investments), but I do think it indicates that RSS still has a long way to go to mainstream adoption.
Perhaps most importantly, I think it underscores that VCs have to be careful not to overestimate near-term adoption rates. Just because something is "hot" within the incestuous and self-centered world of Silicon Valley doesn't mean that it is hot elsewhere or even destined to be hot elsewhere.
The Coming Blog Wars: Google vs. Yahoo
For Yahoo and Google, the Internet’s two search titans, blogs are rapidly becoming both an important distribution channel and a growing cost center. The battle to control this distribution channel, while at the same time reducing its costs, will intensify greatly this year and will most likely be characterized by some rapid-fire acquisitions within the “blogosphere”.
It’s The Channel Stupid
According to Technorati, the number of blogs on the web has grown from about 100,000 two years ago to over 6,500,000 today, with about 20,000 new blogs being added every day. Over at Pew Research, their latest study indicates that 27% of Internet users in the US, or 32 million people, are now reading blogs, up almost 150% in just one year.
Right now, Google owns the blog-channel thanks in part to its acquisition of Blogger, but mostly to its self-serve Adsense platform that allows bloggers to easily add paid placement and search services to their sites. (I set up both services on this site in 30 minutes with no human help or interaction.) While Google doesn’t say just how much of its revenues it generates via blogs, with growth numbers like those above it’s no doubt that Google’s “blog-related” revenues are growing quite quickly.
While Yahoo is rumored to be building a competitive offering to Adsense, for now that offering is limited to serving only large sites, so Yahoo's blog-related revenues are likely minuscule. However, Yahoo is clearly aware of the growing importance of blogs and knows that it must have a competitive response to Google’s Adsense platform.
If either player were able to control, or at least significantly influence, which paid placement services bloggers chose to incorporate into their sites, it would give them a substantial competitive advantage in their head-to-head competition and control over one of the fastest growing channels on the web.
For example, not only would control allow Yahoo or Google to push their own paid placement and search services at the expense of the other, but it would allow them to route other “affiliate” traffic through their own hubs and thereby take a piece of the action. Rather than having bloggers link directly to something like Amazon's Associates program, the bloggers would instead send their traffic to a “master” affiliate account at Google, one in which Google was able to negotiate a larger percentage cut due to its overall volume, or they might just send it to Froogle if that was a better deal. In such a case both the bloggers and Google win: Google gets a cut of affiliate revenues that it previously missed out on, and bloggers get a slightly higher revenue share thanks to Google’s much greater bargaining power.
A Costly Partnership
While integrating blogging more closely into their business models offers Google and Yahoo additional revenue opportunities, it also presents them with significant costs, mostly in the form of revenue share payments that they must make to blogs. While they must make similar (and often higher) payments to traditional media partners, the payments to blogs are more costly to process (due to the large number of blogs) and much more susceptible to click-through fraud schemes. Controlling the cost of the channel is therefore likely to be almost as big a focus as increasing the revenues it produces.
While controlling fraud will be very important, it likely won’t be a source of competitive advantage as both firms have similar incentives to control fraud. One can even imagine both firms partnering, directly or indirectly, to jointly fight fraud. That leaves reducing payments to blogs as the most obvious way to control costs, however reducing payments will be difficult to achieve due to competitive pressures.
Blog Barter: Spend Money To Save Money
The best way to save costs may actually be to spend some money and acquire companies that currently offer services to bloggers. These services can then be bartered to blogs in return for a reduction in payments. For example, a blog that pays its hosting firm $20/month might be willing to barter $20 worth of its click-through revenues for hosting instead of writing a check. At the very least, Google and Yahoo might be able to buy such services for $18/month and resell them for $20 in click-through credits, thus saving themselves 10% in cash in the process.
This math, plus burgeoning payments to bloggers, is likely to prompt both Google and Yahoo to make some rapid-fire acquisitions in the space. For Google, the acquisitions will be about protecting its lead and denying Yahoo the chance to become competitive. For Yahoo, they will be about quickly catching up to Google and potentially surpassing it. Of the two, Yahoo is perhaps in the best position to initiate this “blog barter” economy, given that it has a broad range of existing subscription services that it can barter; however, it is also further behind in terms of using blogs as a channel for its services.
The Hit List
While it’s tough to say with 100% accuracy which start-ups will be on the blog-inspired M&A “hit list” of Google and Yahoo, it is clear that such firms will do one of two things: they will either enhance control over the distribution channel or they will reduce its costs by enabling “blog-barter” transactions. Google struck the first blow in this M&A battle early on with the acquisition of Pyra Labs (creator of Blogger) in 2003, but more deals are undoubtedly on the horizon due to blogging’s explosive growth in 2004. A few of the most likely candidates for acquisition include:
- Six Apart: Creator of the popular TypePad blog authoring software and service (this blog is hosted on their service), Six Apart has long been rumored to be a potential acquisition candidate, most recently for Yahoo. With 6.5M users once its recently announced acquisition of LiveJournal is completed, Six Apart is clearly the biggest piece of the blog channel that is potentially up for grabs. While the odds-on bet is that Yahoo will acquire Six Apart given that Yahoo has no blog platform to date, Google may well try to lock up the channel by acquiring Six Apart ahead of Yahoo, but it’s also possible that long shots such as Amazon and Microsoft may make their own bids.
- Burning Door: Operator of the increasingly popular Feedburner service that provides detailed usage statistics of RSS feeds, Feedburner is an ideal acquisition both as an incremental revenue generator and a “blog barter” service. On the revenue front, Feedburner offers a fantastic platform for “splicing” contextually relevant RSS advertising and affiliate offers into RSS feeds creating yet another advertising platform for the search players. On the blog-barter front, serious bloggers will easily part with a few “barter bucks” each month in return for the detailed usage and reporting statistics that Feedburner is able to provide.
- MySpace: Originally pitched as a social networking site, MySpace has melded social networking with blogging to the point where the site has become some kind of strange youth-centric amalgam of the two worlds. The risk is that MySpace is really just GeoCities 2, but this may turn out to be the best kind of platform for advertisers to reach the youth market in a contextual, non-obtrusive way. In addition, if MySpace can help its members earn a little pocket change through blogging, they may be able to solve the input-output asymmetry problem that I wrote about in an earlier post on social networking.
There are a bunch of blog aggregation sites, such as Bloglines, Technorati, and del.icio.us, that appear on the surface to be interesting acquisition opportunities, but it is unclear what these sites really add to the existing capabilities of Yahoo and Google, as these sites focus on end-users, not bloggers, and Google and Yahoo already have plenty of end-users.
As for VCs, this round of blog-inspired M&A will be a nervous game of musical chairs as quickly placed wagers either pay off or are permanently bypassed. Watching how this Great Blog Battle plays out will be entertaining indeed!
Saving RSS: Why Meta-feeds will triumph over Tags
It’s pretty clear that RSS has now become the de facto standard for web content syndication. Just take a look at the numbers. The total number of RSS feeds tracked by Syndic8.com has grown from about 2,500 in the middle of 2001, to 50,000 at the beginning of 2004, to 286,000 as of the middle of this month. That’s total growth of over 11,300% in just the past 3.5 years!
Feed Overload Syndrome
However, as I wrote at the beginning of last year, the very growth of RSS threatens to sow the seeds of its own failure by creating such a wealth of data sources that it becomes increasingly difficult for users to sift through all the “noise” to find the information that they actually need.
Just ask any avid RSS user about how their use of RSS has evolved and they will likely tell you the same story: When they first discovered RSS it was great because it allowed them to subscribe to relevant information “feeds” from all of their favorite sites and have that information automatically aggregated into one place (usually an RSS reader like Sharpreader or NewsGator). However, as they began to add more and more feeds (typically newly discovered blogs), the number of posts they had to review started rising quickly, so much so that they often had hundreds, if not thousands of unread posts sitting in their readers. Worse yet, many of these posts ended up either being irrelevant (especially random postings on personal blogs) or duplicative. Suffering from a serious case of “feed overload”, many of these users ultimately had to cut back on the number of feeds that they subscribed to in order to reduce the amount of “noise” in their in-box and give them at least a fighting chance of skimming all of their unread posts each day.
The First Step: Recognizing That You Have a Problem
Many in the RSS community recognize that “Feed Overload Syndrome” is indeed becoming a big problem and have begun initiatives to try and address it.
Perhaps the most obvious way to address the problem is to create keyword-based searches that filter posts. The results from such searches can themselves be syndicated as an RSS feed. This approach has several problems, though. First, many sites syndicate only summaries of their posts, not the complete post, making it difficult to index the entire post. Second, keyword-based searches become less and less effective the more data you index, as the average query starts to return more and more results. Third, keywords often have multiple contexts, which in turn produce significant “noise” in the results. For example, a keyword search for “Chicago” would produce information about the city, the music group, and the movie, among other things. That said, many “feed aggregation sites” such as Technorati and Bloglines currently offer keyword-based searching/feeds, and for most folks these are better than nothing. However, it’s pre-ordained that as the number of feeds increases, these keyword filtering techniques will prove less and less useful.
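To see both the appeal and the limits of keyword filtering, here is a minimal sketch in Python. The posts are invented for illustration; a real service would parse actual RSS items and often have only the summary to work with, which is exactly the indexing limitation noted above.

```python
def keyword_filter(posts, keywords):
    """Return posts whose title or summary mentions any keyword
    (case-insensitive substring match)."""
    keywords = [k.lower() for k in keywords]
    hits = []
    for post in posts:
        text = (post["title"] + " " + post["summary"]).lower()
        if any(k in text for k in keywords):
            hits.append(post)
    return hits

posts = [
    {"title": "Chicago wins the Oscar",
     "summary": "The musical took best picture."},
    {"title": "Flight delays at O'Hare",
     "summary": "Storms hit Chicago's airport."},
    {"title": "RSS adoption grows",
     "summary": "Feeds are everywhere."},
]

# Both "Chicago" posts match even though one is about a movie and one
# about a city: keyword matching has no notion of context.
matches = keyword_filter(posts, ["chicago"])
```

The context problem is baked in: the filter cannot distinguish the movie from the city, so the derived feed inherits all the ambiguity of its keywords.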
Tag, You’re Categorized
Realizing the shortcomings of keyword-based searching, many people are embracing the concept of “tagging”. Tagging is simply adding some basic metadata to an RSS post, usually just a simple keyword “tag”. For example, the RSS feed on this site effectively “tags” my posts by using the dc:subject property from the RSS standard. Using such keywords, feed aggregators (such as Technorati, PubSub, and del.icio.us) can sort posts into different categories, and subscribers can then subscribe to these categorized RSS feeds instead of the “raw” feeds from the sites themselves. Alternatively, RSS readers can sort the posts into user-created folders based on tags (although mine doesn’t offer this feature yet).
Tagging is a step in the right direction, but it is ultimately a fundamentally flawed approach to the issue. At the core of tagging is the same problem that has bedeviled almost all efforts at collective categorization: semantics. In order to assign a tag to a post, one must make some inherently subjective determinations, including: 1) what is the subject matter of the post, and 2) what topics or keywords best represent that subject matter. In the information retrieval world, this process is known as categorization. The problem with tagging is that there is no assurance that two people will assign the same tag to the same content. This is especially true in the diverse “blogosphere”, where one person’s “futbol” is undoubtedly another’s “football” or another’s “soccer”.
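The consistency problem is easy to demonstrate with a toy sketch in Python (the posts and tags below are invented for illustration): group posts by their author-assigned tag, as an aggregator effectively does, and synonymous tags land in unrelated buckets.

```python
from collections import defaultdict

def group_by_tag(posts):
    """Bucket posts by their author-assigned tag (e.g. a dc:subject
    value), exactly as a naive feed aggregator would."""
    buckets = defaultdict(list)
    for post in posts:
        buckets[post["tag"]].append(post["title"])
    return dict(buckets)

posts = [
    {"title": "World Cup qualifiers", "tag": "futbol"},
    {"title": "Premier League recap", "tag": "football"},
    {"title": "MLS expansion news",   "tag": "soccer"},
]

# Three posts about one subject end up in three unrelated buckets,
# because nothing maps the synonymous tags onto a shared category.
buckets = group_by_tag(posts)
```

Nothing in the tagging model itself relates "futbol" to "football" to "soccer"; that mapping is precisely what a shared taxonomy would have to supply.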
Beyond a fatal lack of consistency, tagging efforts also suffer from a lack of context. As any information retrieval specialist will tell you, categorized documents are most useful when they are placed into a semantically rich context. In the information retrieval world, such context is provided by formalized taxonomies. Even though the RSS standard provides for taxonomies, tagging as it is currently executed lacks any concept of taxonomies and thus lacks context.
Deprived of consistency and context, tagging threatens to become a colossal waste of time as it merely adds a layer of incoherent and inconsistent metadata on top of an already unmanageable number of feeds.
While tagging may be doomed to confusion, there are some other potential approaches that promise to bring order to RSS’s increasingly chaotic situation. The most promising involves something called a meta-feed. Meta-feeds are RSS feeds comprised solely of metadata about other feeds. Combining meta-feeds with the original source feeds enables RSS readers to display consistently categorized posts within rich and logically consistent taxonomies. The process of creating a meta-feed looks a lot like that needed to create a search index. First, crawlers must scour RSS feeds for new posts. Once they have located new posts, the posts are categorized and placed into a taxonomy using advanced statistical processes such as Bayesian analysis and natural language processing. This metadata is then keyed to the URL of the original post and published in its own RSS meta-feed. In addition to the categorization data, the meta-feed can also contain taxonomy information, as well as information about such things as exact/near duplicates and related posts.
RSS readers can then request both the original raw feeds and the meta-feeds. They then use the meta-feed to appropriately and consistently categorize and relate each raw post.
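Since the meta-feed carries only metadata keyed by the URL of the original post, the reader-side combination is essentially a dictionary join. A minimal sketch in Python, with hypothetical field names (category, duplicate_of) that are not part of any existing standard:

```python
def apply_metafeed(raw_posts, meta_items):
    """Annotate raw feed posts with categories and duplicate links
    drawn from a meta-feed, joining the two feeds on post URL."""
    meta_by_url = {m["url"]: m for m in meta_items}
    annotated = []
    for post in raw_posts:
        meta = meta_by_url.get(post["url"], {})
        annotated.append({
            **post,
            "category": meta.get("category", "uncategorized"),
            "duplicate_of": meta.get("duplicate_of"),
        })
    return annotated

raw = [
    {"url": "http://a.com/1", "title": "Wire story"},
    {"url": "http://b.com/9", "title": "Same wire story, reprinted"},
]
meta = [
    {"url": "http://a.com/1", "category": "business/intelligence"},
    {"url": "http://b.com/9", "category": "business/intelligence",
     "duplicate_of": "http://a.com/1"},
]

# The reader can now file both posts under one category and collapse
# the reprint into its original.
merged = apply_metafeed(raw, meta)
```

The hard part, of course, is not this join but producing the metadata consistently in the first place; the join is what lets any reader benefit from whoever does that work.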
For end users, meta-feeds will enable a wealth of features and innovations. Users will be able to easily find related documents and eliminate duplicates of the same information (such as two newspapers reprinting the same wire story). Users will also be able to create their own custom taxonomies and category names (as long as they relate them back to the meta-feed). Users can even combine meta-feeds from two different publishers, so long as one of the meta-feed publishers creates an RDF file that relates the two sets of categories and taxonomies (to the extent practical). Of course, the biggest benefit to users will be that information is consistently sorted and grouped into meaningful categories, allowing them to greatly reduce the amount of “noise” created by duplicate and non-relevant posts.
At a higher level, the existence of multiple meta-feeds, each with its own distinct taxonomy and categories, will in essence create multiple “views” of the web that are not predicated on any single person’s semantic orientation (as is the case with tagging). In this way it will be possible to view the web through unique editorial lenses that transcend individual sites and instead present the web for what it is: a rich and varied collective enterprise that can be wildly different depending on your perspective.
The Road Yet Traveled
Unfortunately, the road to this nirvana is long and as yet largely untraveled. While it may be possible for services like Pubsub and Technorati to put together their own proprietary end-to-end implementations of meta-feeds, in order for such feeds to become truly accepted, standards will have to be developed that incorporate meta-feeds into readers and allow for interoperability between meta-feeds.
If RSS fails to address “Feed Overload Syndrome”, it will admittedly not be the end of the world. RSS will still provide a useful, albeit highly limited, “alert” service for new content at a limited number of sites. However, for RSS to reach its potential of dramatically expanding the scope, scale, and richness of individuals’ (and computers’) interaction with the web, innovations such as meta-feeds are desperately needed in order to create a truly scalable foundation.