Where the "Semantic Web" goes wrong: thoughts on "Web 3.0"

I have been thinking about issues of meaning (semantics), context, human understanding and the like for some time now, and particularly about the role the internet plays and could play in the future in human communication and thought. The Pew Internet and American Life Project has a new report, "The Fate of the Semantic Web", in which they sought opinions from hundreds of internet luminaries and long-time experts on prospects for the "semantic web". I find the results illuminating and reassuringly in line with some of my thinking on the subject. I'd been preparing to write an article here on the semantic web (and what's wrong with it) for a few months now, so the release of this report seemed an opportune time to put at least some of what I've been gathering out for public comment.

Everyone agrees that the internet, and particularly the world wide web that began about 20 years ago with Tim Berners-Lee's invention of the URL, HTML, the HTTP protocol and the first web browser, has brought a deluge of information to billions of people, something almost beyond the imagination of earlier generations. But making intelligent use of all that information is difficult. Tools like Google's search engine greatly help in sifting out the best stuff on any given topic. But there is very little to help us make sense of it all. Other than the links themselves, our computers have no understanding of the meaning of what various websites provide us; they can't correlate information from multiple sources to provide a coherent story. We want our computers to give us not just "information", but "meaningful information", "knowledge", perhaps "understanding".

But from the beginning, mankind's quest for knowledge has had a downside:

And out of the ground made the LORD God to grow every tree that is pleasant to the sight, and good for food; the tree of life also in the midst of the garden, and the tree of knowledge of good and evil. ... "Of every tree of the garden thou mayest freely eat: But of the tree of the knowledge of good and evil, thou shalt not eat of it: for in the day that thou eatest thereof thou shalt surely die." ... And the serpent said unto the woman, "Ye shall not surely die: For God doth know that in the day ye eat thereof, then your eyes shall be opened, and ye shall be as gods, knowing good and evil." [from Genesis Ch. 2 and 3, KJV]

We hold to firm beliefs in real truths about the world, and through the sciences in particular we strive to acquire all those truths nature can provide us with, to help us make better decisions, to live better lives, to better preserve the things we love most about this existence. We spend years of our lives educating ourselves, and even after college we continue to acquire knowledge through books and all the other information conveyances of our time. And many of us devote our lives to sharing the knowledge we have acquired: teaching, writing, educating in one way or another. Meaning, understanding, "semantics" are not simple things that can be obtained with just a single bite; they require real intellectual effort and work. Is it likely that our computers will ever be able to significantly help?

In 2001 Berners-Lee proposed a new "semantic web", where more specialized markup would provide a precision missing in the current web of linked documents. The vision articulated was one where documents would provide real meaning that "intelligent agents" would understand and use to accomplish goals for their human owners. Berners-Lee and others at the WWW Consortium had of course been working on this for some time - see in particular this 1998 "roadmap" and related documents linked from there, many added over the intervening years to further explain the concepts.

While much of the talk is of "intelligent agents", the goal of the semantic web is not those agents themselves, i.e. not "artificial intelligence" explicitly, but rather the interlinked web of documents and data that third parties can then build agents to utilize, just as Google built a search engine on top of the existing web. Much of the work focuses on the "RDF" language, a very general way of representing information, and ancillary specifications for representing meaning within RDF. Collections of RDF statements can be represented as a "graph", interlinking properties of real-world entities. Berners-Lee often talks of the "Giant Global Graph" that the entire semantic web would create, with all these interlinked documents working together in a well-defined coherent manner to represent humanity's knowledge about the world.
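
To make the "graph" idea concrete, here is a minimal sketch in Python. It is not real RDF tooling: statements are plain (subject, predicate, object) tuples, and the `ex:` identifiers and the helper names `build_graph` and `reachable` are made up for illustration (real RDF uses full URIs and dedicated libraries).

```python
from collections import defaultdict

# Each statement is a (subject, predicate, object) triple.
# The "ex:" identifiers are invented for this sketch; real RDF uses full URIs.
triples = [
    ("ex:TimBernersLee", "ex:invented", "ex:WorldWideWeb"),
    ("ex:TimBernersLee", "ex:founded", "ex:W3C"),
    ("ex:W3C", "ex:publishes", "ex:RDF"),
]

def build_graph(statements):
    """Index triples by subject so each node lists its outgoing edges."""
    graph = defaultdict(list)
    for subject, predicate, obj in statements:
        graph[subject].append((predicate, obj))
    return graph

def reachable(graph, start):
    """Return every node reachable from `start` by following edges."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for _, target in graph.get(node, []):
            if target not in seen:
                seen.add(target)
                stack.append(target)
    return seen

graph = build_graph(triples)
linked = reachable(graph, "ex:TimBernersLee")
```

Here `linked` holds the three entities that can be reached from `ex:TimBernersLee` by following statements. Note that nothing in this structure says whether any of the statements is true, which is the problem taken up next.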

But there's a serpent hiding in the grass of this paradisaical utopia of human knowledge: not all data posted online can be trusted. Some things may be wrong by accident. Many things will be deliberately wrong, or twisted in some fashion, wrong through omission or abuse of ambiguities so that what seems to be one meaning is really another. There is a reason we have words like deception, corruption, mischief and error. The giant global graph will, inevitably, include things that are wrong. What then?

Berners-Lee is of course aware of this problem. Many of his original articles touch on the issue of wrong or Inconsistent Data:

... There is no a priori reason to believe any document on the web. The reason to believe a document will be found in some information (metadata) about the document. That metadata may be an endorsement of the document - another RDF statement, which in turn was found another document, and so on.
...
Digital signature (see trust) of course adds a notion of security to the whole process. The first step is that a document is not endorsed without giving the checksum it had when believed. The second step is to specify more powerful rules of the form

"whatever any document says so long it is signed with key 57832498437".

In practice, particular authorities are trusted only for specific purposes. The semantic web must support this. You must be able to restrict the information believed ...
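
As a rough illustration of the kind of rule quoted above, the following Python sketch accepts a statement only if its signing key is trusted for that statement's topic. Everything here is an assumption made for the example: the `SignedStatement` type, the `TRUST_POLICY` table, the topics, and the sample statements are invented, and signature verification itself is assumed to have happened already rather than being implemented.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SignedStatement:
    text: str        # the claim itself
    topic: str       # what the claim is about
    signer_key: str  # key whose signature is assumed to be verified already

# Hypothetical policy: which keys are believed, and for which topics only.
TRUST_POLICY = {
    "57832498437": {"weather"},
}

def believed(statements, policy):
    """Keep only statements whose signer is trusted for that statement's topic."""
    return [s for s in statements if s.topic in policy.get(s.signer_key, set())]

statements = [
    SignedStatement("Rain expected Tuesday", "weather", "57832498437"),
    SignedStatement("Buy this stock now", "finance", "57832498437"),
    SignedStatement("Sunny all week", "weather", "00000000000"),
]
accepted = believed(statements, TRUST_POLICY)
```

Only the first statement survives: the second is signed by a trusted key but on a topic the key is not trusted for, and the third comes from an unknown key.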

In my view the semantic web vision is actually backwards: trust, provenance, and context need to be at the foundation of meaning, not tacked on at the top layer for "intelligent agents" to try to figure out. A given statement is meaningless out of context. Recognition of the existence of multiple contexts allows us to handle mutually inconsistent statements without human rationality completely breaking down, so "intelligent agents" will need to do this too. RDF and the "giant global graph" try to define a universal context, but I believe such a project can never succeed (outside of, possibly, mathematics) because the real world is just too messy.
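
One way to picture what "context at the foundation" might mean is to attach a context to every statement, so that a question is always asked inside some context and disagreements between contexts are visible rather than fatal. The Python sketch below is only an illustration of that idea under my own assumptions: the context labels (`textbook_1990`, `iau_2006`) and the functions `ask` and `disagreements` are invented for the example and are not part of RDF or any semantic web specification.

```python
from collections import defaultdict

# Each statement carries its context first: (context, subject, predicate, object).
# Context labels and values are invented for illustration.
statements = [
    ("textbook_1990", "pluto", "is_a", "planet"),
    ("iau_2006", "pluto", "is_a", "dwarf planet"),
    ("iau_2006", "ceres", "is_a", "dwarf planet"),
]

def ask(statements, context, subject, predicate):
    """Answer a question inside one context only."""
    return {o for c, s, p, o in statements
            if (c, s, p) == (context, subject, predicate)}

def disagreements(statements):
    """Find (subject, predicate) pairs on which contexts give different answers."""
    answers = defaultdict(dict)
    for c, s, p, o in statements:
        answers[(s, p)].setdefault(c, set()).add(o)
    return {key: by_ctx for key, by_ctx in answers.items()
            if len({frozenset(v) for v in by_ctx.values()}) > 1}

conflicts = disagreements(statements)
```

Within either context the answer about Pluto is unambiguous, while `conflicts` records that the two contexts disagree. Neither statement has to be thrown away for the collection as a whole to stay usable.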

I plan to come back to this with some further thoughts later - for now I just want to highlight some echoes of this from the Pew survey. After 10+ years of semantic web development, the survey queried 895 experts on "the likely progress toward achieving the goals of the semantic web by the year 2020", a further 10 years away. There was significant disagreement among the experts, with the larger number on the side suggesting the goals will not be achieved:

EXPERTS  TOTAL  Question
38%      41%    By 2020, the semantic web envisioned by Tim Berners-Lee
                and his allies will have been achieved to a significant degree
                and have clearly made a difference to the average internet
                users.
52%      47%    By 2020, the semantic web envisioned by Tim Berners-Lee will
                not be as fully effective as its creators hoped and average
                users will not have noticed much of a difference.

A few salient comments from the respondents:

“Alas, the semantic web is an idea that owes more to the desires of computing scientists
and information theorists for a world of perfected knowledge and processed reason
than to reality. The semantic web is like the Encyclopaedia of the Modern project: an
ideal whose existence enables us to make progress but that can never be achieved
because it fails to account for the cultural malleability of knowledge, the political
economy of information, and – ultimately – the agency of humans, with their machines,
in subverting the ideals of pure reason to the partial ends of personal gain.” –Matthew
Allen, director of the department of Internet Studies at the School of Media, Culture
and Creative Arts, Curtin University of Technology, and critic of social uses and cultural
meanings of the Internet; http://www.netcrit.net/

“I don't like answering this question in the negative, but I understand Berners‐Lee's
concept of the semantic web as being more structured than the various collections of
folksonomies and APIs that we have today, and I don't foresee us progressing far in that
direction in the coming 10 years. A more structured web can be enabled by
enhancements to HTML, for example, but getting people to adopt those enhancements
and use them consistently and regularly is another matter. There are also the issues of
human language to be considered; linkages across languages will remain problematic.
Even if a semantic web emerges for the English‐language web, what about everyone
else?” –Mindy McAdams, Knight Chair in journalism, University of Florida, author,
“Flash Journalism: How to Create Multimedia News Packages,” journalist,
http://mindymcadams.com/index.htm

“The key problem with the semantic web is the problem of false data and trust. I think it
is a great idea in theory, and many of these principles of the semantic web will be more
deeply integrated into the services we use, but an automated web‐for‐machines that
automatically make better decisions for us because of the data they export is a
pipedream.” –David Sifry, founder of Technorati and CEO of Offbeat Guides;
http://www.sifry.com/alerts/about/

“Artificial intelligence will certainly accomplish many if not all of the goals of the
semantic web, but I do not think that the semantic web is the right mechanism for
helping computers truly understand the internet. The idea behind the semantic web is
too artificial and makes too many false assumptions about the inputs.” –Hal Eisen,
senior software engineering manager for Ask.com; http://www.linkedin.com/pub/haleisen/0/95/a24

“The problems facing the successful arrival of a semantic web are not simply
technological, but lie in significant part in the human element itself. The nature of
human‐produced content makes it extremely difficult to categorise without loss of
accuracy or reliability. In libraries there are certain requirements, and agreed formats,
and even then it is necessary to blur lines and endure mistakes. It will take more than
ten or eleven years for the human‐produced content of the Internet to become
compatible with the idea of a semantic Web. In considering semantic web, it is
important to note that the sort of Internet envisioned by Tim Berners‐Lee is quite
different from the one we seem to be developing.” –Francis J.L. Osborn, philosopher,
University of Wales‐Lampeter

“Meaning is elusive, depending on context and perspective and a range of human
intellectual processes that we still only dimly understand. Despite confident
predictions, AI and Expert Systems and other attempts to capture meaning through
machine ‘intelligence’ have fallen far short of their hype. The semantic web is only the
latest new thing that will disappoint its hopeful champions.” –Mark U. Edwards, senior
advisor to the dean, Harvard University Divinity School

“The semantic web might prove useful in certain tightly controlled domains, but true
artificial intelligence remains elusive.” –Dean Thrasher, founder, Infovark;
http://www.linkedin.com/in/deanthrasher

“Already the semantic web folks have backed off on their stated expectations and are
more interested in data portability and interoperability. Frankly this is not only a
realistic move but a smart one. The best place for semantics is in data which structure
and metadata and description are essential.” –Paul Jones, founder and director of
ibiblio.org, University of North Carolina‐Chapel Hill; http://www.ibiblio.org/pjones/

Comments

> We want our computers to give us not just "information", but "meaningful information", "knowledge",
> perhaps "understanding".

Understanding would have to work both ways, you know.

I imagine sitting in front of the 3V screen and hearing the sultry voice telling me, "Oh, yes, darling, I see what you're searching for, and I _do_ understand -- I understand not just what you want, I understand what you _need_, and thanks to this sponsored search, I can get it for you. Just say yes, darling, or just nod your head ... that's right ...."

(runs, screaming ....)

This subject is of fundamental importance, especially to improving science. I think we should be moving forward to making scientific papers fit a machine-readable format. In particular, if this format suits mathematics, that is sufficient for an enormous advance.