On the other hand, every web page is built for “humans”, and that’s precisely what Diffbot uses as the foundation of its technology. Instead of parsing the HTML code, Diffbot uses computer vision to determine the nature of the content: a title is often set in larger text, for example, and the author’s name usually appears near the top of the article. Diffbot’s algorithm can of course handle a wide variety of layouts, but you get the point.

Diffbot has two APIs:
1/ On-demand processing of web pages. For example, this can be used to extract the elements of a page that are of interest, like its title, content, and images, while ignoring features like ads or navigation elements (see the sketch after this list).
2/ A Follow API, which is used to detect changes in a web page and extract relevant information that can be used to illustrate the change.
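To make the first one concrete, here is a minimal sketch of calling the on-demand (Article) API from Python. The endpoint path, parameter names, and response fields shown follow Diffbot’s documented REST/JSON conventions, but treat them as assumptions and check the current docs before relying on them; the token and page URL are placeholders.

```python
import json
import urllib.parse
import urllib.request

# Hypothetical values -- replace with your own Diffbot token and target page.
TOKEN = "YOUR_DIFFBOT_TOKEN"
PAGE_URL = "http://example.com/some-article"

# Diffbot's on-demand article extraction is a simple GET request with the
# token and target URL passed as query parameters (assumed endpoint shape).
params = urllib.parse.urlencode({"token": TOKEN, "url": PAGE_URL})
endpoint = "https://api.diffbot.com/v3/article?" + params

with urllib.request.urlopen(endpoint) as resp:
    data = json.loads(resp.read().decode("utf-8"))

# The response is JSON describing what Diffbot extracted visually from the
# page: title, author, text, images -- with ads and navigation stripped out.
article = (data.get("objects") or [{}])[0]
print(article.get("title"))
print(article.get("author"))
for image in article.get("images", []):
    print(image.get("url"))
```

The same request-plus-JSON pattern would presumably apply to the Follow API, with the response describing what changed on the page rather than its full contents.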
It’s really up to developers to use these building blocks to create great applications, but I can tell you that if this works as advertised (I haven’t had time to try it yet), it should add a lot of value, because it’s hard to build. For instance, AOL Editions (site no longer exists) is already using Diffbot’s technology.
The API is free up to a relatively generous number of API calls. Beyond that, developers will have to pay per API call, which means they will have to monetize their applications. Companies handling sensitive information can also get a license that runs on a private server inside their firewall.
Using computer vision to look at web pages is a great idea, and one that bypasses a lot of the tricks designed for “bots”. You can of course expect some glitches here and there, but for developers who need this type of functionality, this looks like a gold mine.
Links: Diffbot SDK/Docs