Next Generation Search: Entities, Categories, and Pings
The search industry has been dominated for the last six years or so by one significant technology, Google’s PageRank. PageRank basically uses popularity as a proxy for relevancy, which, if Google’s $100BN market cap is any guide, has proven to be a valuable approximation. However, like many technologies, PageRank’s hegemony is gradually wearing thin because it is easily gamed and inherently limited by its derived and agnostic nature. With PageRank’s utility declining thanks to SEO companies and “site spam”, the natural question becomes what, if any, new technologies will emerge to either supplement or even replace PageRank.
To that end, a couple of recent entrants into the search market give strong hints as to how search technology will evolve in the near future. The good news is that this evolution will significantly improve both the relevance and utility of search. The bad news is that it is not clear which company will be the main beneficiary of this evolution.
The first new entrant emblematic of this evolution is Vast.com, which officially launched its public beta yesterday. Vast is a multi-category vertical search engine with its first three categories being automobiles, jobs, and personals. Vast is similar in some respects to other category-specific vertical search players, such as Simplyhired and Trulia; however, as its name clearly suggests, it has more expansive ambitions. What makes Vast particularly noteworthy is that one of the key ingredients in its secret sauce is a very advanced entity extraction engine. Entity extraction is the process of “mining” unstructured data for key attributes (such as the model of a car or the salary level of a job), which are then separately indexed. Entity extraction is much more sophisticated than screen scraping because it uses semantic inferences, and not pre-determined static “maps”, to determine what entities are present in a given piece of text.
By using the same core entity extraction engine across multiple verticals, Vast has in essence created a “horizontal” vertical search platform that is readily extensible to new verticals and highly efficient in that it only needs to crawl a given site once to extract multiple types of entities.
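To make the idea of pulling attributes out of unstructured text concrete, here is a deliberately simple sketch in Python. It is not Vast’s engine: it relies on regular expressions and a hard-coded list of makes, which is much closer to the static “maps” described above than to real semantic inference. The function name, the list of makes, and the sample listing are all assumptions made up for illustration.

```python
import re

# Hypothetical list of makes. A real engine would infer these from context
# rather than rely on a hard-coded list.
KNOWN_MAKES = ["Acura", "Honda", "Toyota", "Ford"]


def extract_car_entities(text):
    """Pull year, make, and asking price out of free-form listing text."""
    entities = {}

    # A four-digit year starting with 19 or 20.
    year = re.search(r"\b(19|20)\d{2}\b", text)
    if year:
        entities["year"] = int(year.group(0))

    # The first known make mentioned anywhere in the text.
    for make in KNOWN_MAKES:
        if re.search(rf"\b{make}\b", text, re.IGNORECASE):
            entities["make"] = make
            break

    # A dollar amount, with or without thousands separators.
    price = re.search(r"\$\s?(\d{1,3}(?:,\d{3})+|\d+)", text)
    if price:
        entities["price"] = int(price.group(1).replace(",", ""))

    return entities


listing = "For sale: 2002 Acura 3.2 TL, one owner, asking $18,500 or best offer"
print(extract_car_entities(listing))
# {'year': 2002, 'make': 'Acura', 'price': 18500}
```

Once attributes like these sit in their own index, the same record can be filtered, grouped, or averaged without going back to the original page.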
What does this technology do for search users? Well, for one thing, it makes it a hell of a lot easier to find a used car for sale on the web. Not only can a user instantly narrow their search to a specific make, model, and year, but thanks to its entity indices, Vast can also tell them what they are likely to pay for the new purchase. For example, if you are looking for a 2001-2002 Acura 3.2 TL, according to Vast chances are you will pay around $18K for the privilege. If you’re looking to get a job as a computer programmer in the Bay Area, well, you’re probably looking at about an $80K salary, while the same job will only net you $62.5K in Des Moines, Iowa (although you will live like a king compared to the poor coder in Silicon Valley living in some half-million-dollar shack).
How does Vast know this? Because they have extracted all the entity information, indexed it, and done some very basic calculations. But Vast has really only scratched the surface of what they can do with this meta-data. For example, why have people search for a particular car when they can search by a particular budget, such as “Hey, I only have $20-25K to spend and want the most car for my money, so what’s out there that I can afford?” Put that question to a standard search engine and it will likely spew back a lot of random garbage. However, a site such as Vast could, without breaking a sweat, provide a highly organized matrix of all the different make, model, and year combinations that one could afford on that budget. Even more intriguing, a site like Vast could just as easily start ranking its search results according to different dimensions, such as “best value”. After all, with a few basic algorithms in place, Vast will be able to spot that clueless person in Oregon who is listing their new car way below market, because Vast knows what price everyone else with a similar car is asking.
While the ultimate success of Vast is still an open question given that it faces a large field of both horizontal and vertical competitors, Vast clearly demonstrates the value and utility of adding robust entity extraction technologies to search and therefore gives us a likely glimpse of search’s near-term evolution.
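The “very basic calculations” involved can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the listings are made-up data, the function names are hypothetical, and using the median asking price as the “typical” price is my choice, not something Vast has disclosed.

```python
from collections import defaultdict
from statistics import median

# Made-up listings standing in for indexed entity records.
listings = [
    {"id": 1, "make": "Acura", "model": "3.2 TL", "year": 2002, "price": 18000},
    {"id": 2, "make": "Acura", "model": "3.2 TL", "year": 2002, "price": 19000},
    {"id": 3, "make": "Acura", "model": "3.2 TL", "year": 2002, "price": 14000},
    {"id": 4, "make": "Honda", "model": "Accord", "year": 2003, "price": 21000},
    {"id": 5, "make": "Honda", "model": "Accord", "year": 2003, "price": 23000},
]


def car_key(item):
    """Group listings by make, model, and year."""
    return (item["make"], item["model"], item["year"])


def typical_prices(items):
    """Median asking price for each make/model/year combination."""
    groups = defaultdict(list)
    for item in items:
        groups[car_key(item)].append(item["price"])
    return {key: median(prices) for key, prices in groups.items()}


def within_budget(items, low, high):
    """Search by budget instead of by car."""
    return [item for item in items if low <= item["price"] <= high]


def best_value(items):
    """Rank listings by asking price relative to the typical price for that car."""
    typical = typical_prices(items)
    return sorted(items, key=lambda item: item["price"] / typical[car_key(item)])


print(typical_prices(listings)[("Acura", "3.2 TL", 2002)])        # 18000
print([item["id"] for item in within_budget(listings, 20000, 25000)])  # [4, 5]
print([item["id"] for item in best_value(listings)])              # [3, 4, 1, 5, 2]
```

Listing 3, the Acura priced well under its peers, floats to the top of the “best value” ranking, which is exactly the below-market seller the paragraph above describes.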
The Browseable Web
Another new search site that provides a similar glimpse is Kosmix.com. Kosmix is focused on making search results more “browseable” by using categorization technology to determine the topics addressed by a web page. Once the topics are determined, the web page is then associated with specific nodes of a pre-defined taxonomy.
For example, go to Kosmix Health and search on a medical condition, say autism. Off to the left-hand side of the initial search results is a list of categories associated with autism. This list, or taxonomy, enables people to rapidly zoom in on the specific type of information they are interested in, such as medical organizations that are specifically focused on autism. People can thus not only quickly filter results for a particular type of information but also easily browse through different types of content about the same subject.
Others have tried to do similar category-based searches; however, Kosmix is the only company that has figured out how to build categorized search technology that can perform acceptably at “internet scale”, and as such it represents a major advance in the field.
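As a rough illustration of attaching pages to taxonomy nodes and then browsing by node, here is a toy Python sketch. It is an assumption-heavy stand-in, not Kosmix’s technology: the taxonomy, the keyword sets, and the sample results are invented, and simple keyword overlap is used where a real categorizer would do something far more sophisticated.

```python
from collections import defaultdict

# Hypothetical taxonomy: node name -> keywords that suggest a page belongs there.
TAXONOMY = {
    "Organizations": {"foundation", "society", "association"},
    "Treatment": {"therapy", "treatment", "medication"},
    "Research": {"study", "trial", "researchers"},
}


def categorize(text):
    """Return every taxonomy node whose keywords appear in the text."""
    words = set(text.lower().split())
    return [node for node, keywords in TAXONOMY.items() if words & keywords]


def facet(results):
    """Group search results (title, text) under the nodes they were assigned to."""
    facets = defaultdict(list)
    for title, text in results:
        for node in categorize(text):
            facets[node].append(title)
    return dict(facets)


# Made-up search results for the query "autism".
results = [
    ("Autism Society", "national society supporting families affected by autism"),
    ("Behavioral therapy overview", "how behavioral therapy is used as a treatment for autism"),
    ("New twin study", "researchers publish a study of autism in twins"),
]

for node, titles in facet(results).items():
    print(node, titles)
# Organizations ['Autism Society']
# Treatment ['Behavioral therapy overview']
# Research ['New twin study']
```

The grouped output is what would feed a category list beside the search results, so a user can jump straight to, say, organizations without wading through everything else.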
We’ll Need A Search Running … It Has Already Begun
One last missing piece of the next generation search infrastructure that will likely enter the picture is ping servers. Ping servers are currently used mostly by blogs to notify search engines that new content is available to index, which greatly speeds the rate at which new content is indexed. While these servers have traditionally been used by publishers to notify search engines, they are increasingly being used to notify end users that new content is available as well. Ping servers will become particularly powerful, though, when they combine persistent queries with the entity and categorization technologies discussed above.
Already today an end user can have a ping server, such as Pubsub, run persistent queries for them and then notify them, typically via RSS, of any newly published content that fits their query. Such persistent queries will get even more powerful though when they are processed through search engines with reliable entity extraction and categorization capabilities.
For example, if you are looking to buy a specific kind of used car, wouldn’t you like to know immediately when someone listed a car for sale that met your criteria? If you were a company, wouldn’t you like to know immediately if a current employee listed their resume on a job site? If you are a portfolio manager wouldn’t you like to know immediately that a blog just published a negative article on a stock you own? If you are a lung cancer drug developer, wouldn’t you like to know every time someone published a new academic study on lung cancer?
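A persistent query of this kind can be sketched as a standing predicate that is checked every time new content is announced. The Python below is a minimal illustration, not how Pubsub or any real ping server works: the class, its method names, the subscriber labels, and the ticker “XYZ” are all hypothetical, and the incoming items are assumed to have already been run through entity extraction and categorization.

```python
class PersistentQueryServer:
    """Toy stand-in for a ping server that holds standing queries."""

    def __init__(self):
        # Each subscription is a (subscriber, predicate) pair.
        self.subscriptions = []

    def subscribe(self, subscriber, predicate):
        """Register a query that stays active until removed."""
        self.subscriptions.append((subscriber, predicate))

    def ping(self, item):
        """Called when new content is announced; returns who should be notified."""
        return [who for who, matches in self.subscriptions if matches(item)]


server = PersistentQueryServer()

# A used-car buyer who wants an Acura for $20K or less.
server.subscribe(
    "car_buyer",
    lambda e: e.get("make") == "Acura" and e.get("price", float("inf")) <= 20000,
)

# A portfolio manager watching for negative posts about a (made-up) ticker.
server.subscribe(
    "portfolio_manager",
    lambda e: e.get("ticker") == "XYZ" and e.get("sentiment") == "negative",
)

print(server.ping({"make": "Acura", "year": 2002, "price": 18500}))
# ['car_buyer']
print(server.ping({"ticker": "XYZ", "sentiment": "negative"}))
# ['portfolio_manager']
```

The predicates only work because they test structured fields such as make, price, or ticker, which is why reliable entity extraction and categorization matter so much to this kind of alerting.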
A New Foundation
Thus, ping servers combined with persistent queries filtered through entity and categorization engines will create a rich, multi-dimensional, real-time web search that actively and continually adds value to users while greatly limiting the ability of search engine optimization companies to cloud results with spam and other poor-quality content. This may not be the whole future of search, but given near-term trends it is likely to at least be a significant part of it.
Epilogue: Whither Google?
Whether or not Google can remain at the top of this rapidly evolving technical space is an open question. Indeed, few remember that we have arguably had no fewer than five web search leaders in just 13 years (WWW Virtual Library, EINet Galaxy, Yahoo, Alta Vista, and Google). That said, Google has hired many of the world’s leading experts in areas such as entity extraction and categorization, so it will not lack the intellectual firepower to keep pace with start-ups such as Vast and Kosmix, and it clearly has the financial resources to acquire any particularly promising firm before it becomes too meddlesome (as Google did to Yahoo). So while the ultimate winner is as yet unknown, one thing is for certain: it will be an interesting race to watch.
The thoughts and opinions on this blog are mine and mine alone and not affiliated in any way with Inductive Capital LP, San Andreas Capital LLC, or any other company I am involved with. Nothing written in this blog should be considered investment, tax, legal, financial, or any other kind of advice. These writings, misinformed as they may be, are just my personal opinions.