November 28th, 2003
Spiders & Webs
The Need for a Semantic Web

Intro

In this age of deep changes for humankind, an age that the coming of the microchip has turned into a post-industrial one, the World Wide Web has penetrated deeply into our lives, and even in poor countries like Senegal there is an Internet cafe on every street corner. It is legitimate to ask ourselves where the WWW is going.

In recent years we have witnessed what I call the Rise of the Robots. I can easily recall the birth of WebCrawler, AltaVista, and the superstar Google. They surely do a great job, as some numbers show. As of October 2003, Google:
  • handles more than 200 million searches a day;
  • Web pages searched: more than 3 billion;
  • Images: 425 million+
  • Usenet messages: 800 million+
  • Global unique users per month: 73.5 million
    (Nielsen//NetRatings 2/03)
[online: http://www.google.com/corporate/facts.html].

These numbers are really impressive. But is the Web really working well? What does the future hold? What will the next years look like?
Since today's Web is filled with zillions of pages of text that are machine-readable but not machine-understandable, it is hard to automate anything, and this volume of information cannot be managed manually either. The proposed solution is to use metadata to describe the content of the Web.

The Importance of Metadata

One of the key points is how we manage metadata. The presentation written by Paolo Ceravolo is very illuminating. I have tried to translate it, and since it is a very long article, I have cut or highlighted passages according to how important they seem to me, adding something of my own and rearranging the material.

What is astounding is that many people ask what the purpose of metadata is. This is odd, because without metadata the Web simply doesn't work. The Web's deliberate choice is to query not the texts but the metadata. This is because the data on the Web has no structure; it is simply scattered across millions of sites, whereas metadata is information built according to a precise schema. It is thanks to this structure that we can manipulate the data, knowing how its pieces are interrelated.
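
To make the difference concrete, here is a minimal Python sketch, not taken from Ceravolo's article. The two pages, the field names (title, creator, year) and the people named in them are all invented for illustration. It compares a search over raw text with a query against metadata that follows a fixed schema.

```python
# Hypothetical pages: free text plus a metadata record with a fixed schema.
PAGES = [
    {
        "text": "An interview in which Jane Doe talks about her favourite spiders.",
        "meta": {"title": "Spiders at home", "creator": "John Smith", "year": 2002},
    },
    {
        "text": "Notes on web crawlers and how they index pages.",
        "meta": {"title": "Crawlers", "creator": "Jane Doe", "year": 2003},
    },
]


def search_text(pages, phrase):
    """Titles of pages whose raw text contains the phrase, whatever its role."""
    return [p["meta"]["title"] for p in pages if phrase.lower() in p["text"].lower()]


def search_meta(pages, field, value):
    """Titles of pages whose metadata field is exactly the given value."""
    return [p["meta"]["title"] for p in pages if p["meta"].get(field) == value]


# The text search only finds a page that mentions the name.
print(search_text(PAGES, "Jane Doe"))
# The metadata query finds the page she actually wrote.
print(search_meta(PAGES, "creator", "Jane Doe"))
```

The text search cannot tell an author from a person who is merely mentioned; the metadata query can, because the schema says what each value means.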

At the same time, metadata is the weak link in the chain, because it is expensive to produce and can sometimes turn out clear as mud. If it is generated by human indexing, it may be vague or inexact, and indexing requires well-trained and motivated people; if it is treated as a mere clerical chore, the result is not reliable. The alternative is automated, or semi-automated, indexing.

Here the biggest problem is that an automated tool cannot distinguish the text from the context. That is, an automated tool can only extract the most significant data; it can never add information that is related to the original document but not included in it, i.e. the context. It also remains sensitive to the ambiguity of some words.

So the current choice is so-called Assisted Extraction: automated extraction corrected and reviewed by trained operators, who can compensate for the limits of the automated tools. Such a tool usually works in these steps:
  • the first step is a Text Zoner, which divides the text into structured parts such as title, body and so on;
  • next comes the Preprocessor, which performs a morphological analysis of the sentences, trying to guess which is the subject, the verb and the object;
  • then a Filter cuts out the sentences not deemed useful;
  • a Named Entity Recognizer identifies minimal lexical structures such as names, dates, numbers, company names and so on;
  • all this information is then organized by a parser, which provides the hierarchy of the relations, arranging them in a tree structure;
  • the last step is Lexical Disambiguation, which must make sure that words with multiple meanings are interpreted in a single way.
[Paolo Ceravolo, online: pro.html.it].
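
The chain of steps above can be sketched in Python as a toy pipeline. This is only an illustration under heavy assumptions, not the tool Ceravolo describes: every function name, the sample document, the "Acme" company and the sense table are invented. The Preprocessor is reduced to sentence splitting and tokenizing with no real morphological analysis, the Named Entity Recognizer only spots 4-digit years and capitalised words, and the review by a human operator is not modelled at all.

```python
import re

# Hypothetical sense table used by the disambiguation step.
SENSES = {"spider": "web crawler"}


def zone(text):
    """Text Zoner: treat the first line as the title, the rest as the body."""
    title, _, body = text.strip().partition("\n")
    return {"title": title.strip(), "body": body.strip()}


def preprocess(body):
    """Preprocessor stand-in: split into sentences, then into word tokens."""
    sentences = re.split(r"(?<=[.!?])\s+", body)
    return [re.findall(r"[A-Za-z0-9]+", s) for s in sentences if s.strip()]


def keep_useful(sentences, min_words=4):
    """Filter: drop sentences too short to carry much information."""
    return [tokens for tokens in sentences if len(tokens) >= min_words]


def named_entities(tokens):
    """Toy recognizer: 4-digit years, and capitalised words after the first token."""
    entities = []
    for i, tok in enumerate(tokens):
        if re.fullmatch(r"\d{4}", tok):
            entities.append(("DATE", tok))
        elif i > 0 and tok[0].isupper():
            entities.append(("NAME", tok))
    return entities


def disambiguate(tokens):
    """Lexical Disambiguation: map each ambiguous word to a single sense."""
    return [SENSES.get(tok.lower(), tok) for tok in tokens]


def extract(text):
    """Parser stand-in: arrange the results in a small tree (nested dicts)."""
    zones = zone(text)
    tree = {"title": zones["title"], "sentences": []}
    for tokens in keep_useful(preprocess(zones["body"])):
        tree["sentences"].append(
            {"tokens": disambiguate(tokens), "entities": named_entities(tokens)}
        )
    return tree


DOC = """Spiders and Webs
In 2003 Acme released its spider. Wow. The spider visits every page of Acme daily."""

tree = extract(DOC)
print(tree["title"])
for sentence in tree["sentences"]:
    print(sentence["entities"], sentence["tokens"])
```

Even this toy shows the limits mentioned above: the recognizer skips any name that opens a sentence, and the sense table resolves "spider" the same way regardless of context, which is exactly where a trained operator would have to step in.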

It is my humble opinion that one of the biggest oddities of today's Internet is the shape in which information is presented. I mean that if I am a student and need a summary about Franklin Delano Roosevelt, I get some thousands of links, which I have to browse to find his date of birth or death, or to discover that he made the "New Deal", what it was, and why it was important in American history.

So something more is needed to guide the user, the student as much as the researcher. Is there anything new in sight?

Let's start with some definitions:
Definition: The Semantic Web is the representation of data on the World Wide Web. It is a collaborative effort led by W3C with participation from a large number of researchers and industrial partners. It is based on the Resource Description Framework (RDF), which integrates a variety of applications using XML for syntax and URIs for naming. [W3C]

"The Semantic Web is an extension of the current web in which information is given well-defined meaning, better enabling computers and people to work in cooperation." -- Tim Berners-Lee, James Hendler, Ora Lassila, The Semantic Web, Scientific American, May 2001.

Resource Description Framework (RDF)

The RDF is "a foundation for processing metadata. It provides interoperability between applications that exchange machine-understandable information on the Web. RDF can be used in a variety of application areas; for example: in resource discovery to provide better search engine capabilities, in cataloging for describing the content and content relationships available at a particular Web site, page, or digital library, by intelligent software agents to facilitate knowledge sharing and exchange, in content rating, in describing collections of pages that represent a single logical "document", for describing intellectual property rights of Web pages, and for expressing the privacy preferences of a user as well as the privacy policies of a Web site. RDF with digital signatures will be key to building the "Web of Trust" for electronic commerce, collaboration, and other applications".
[online: www.w3.org].

The way in which RDF data is represented, i.e. the syntax, the model for encoding and transporting this metadata, and the use of (subject, predicate, object) triples, as in the N-Triples notation, is outside the scope of this article.
Let's say only that: "The syntax uses the Extensible Markup Language [XML]: one of the goals of RDF is to make it possible to specify semantics for data based on XML in a standardized, interoperable manner. RDF and XML are complementary: RDF is a model of metadata and only addresses by reference many of the encoding issues that transportation and file storage require (such as internationalization, character sets, etc.). For these issues, RDF relies on the support of XML. It is also important to understand that this XML syntax is only one possible syntax for RDF and that alternate ways to represent the same RDF data model may emerge" [online: www.w3.org].
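
Just to give a flavour of the data model without going into the syntax, here is a small Python sketch that holds a few statements as (subject, predicate, object) tuples and prints them in an N-Triples-like form. It is a simplification, not a conforming serializer: only backslashes and double quotes are escaped, the page URI under example.org is made up, and the helper names (Literal, term, to_ntriples) are my own. The predicates are Dublin Core element URIs.

```python
from typing import NamedTuple


class Literal(NamedTuple):
    """A plain string value; any bare str in a triple is treated as a URI."""
    value: str


def term(node):
    """Render one node: literals in double quotes, URIs in angle brackets."""
    if isinstance(node, Literal):
        escaped = node.value.replace("\\", "\\\\").replace('"', '\\"')
        return f'"{escaped}"'
    return f"<{node}>"


def to_ntriples(triples):
    """One statement per line, each terminated by ' .'."""
    return "\n".join(f"{term(s)} {term(p)} {term(o)} ." for s, p, o in triples)


DC = "http://purl.org/dc/elements/1.1/"          # Dublin Core elements
PAGE = "http://example.org/spiders-and-webs"     # hypothetical URI for this article

triples = [
    (PAGE, DC + "title", Literal("Spiders & Webs")),
    (PAGE, DC + "creator", Literal("Vincenzo Maggio")),
    (PAGE, DC + "date", Literal("2003-11-28")),
]

print(to_ntriples(triples))
```

Each printed line is one statement about the page: the thing described, the property, and the value.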

Let us add that RDF uses classes to define and model the world, like many object-oriented languages. For this purpose you will probably want to check [RDFSchema].

But RDF has some shortcomings: it is only a frame system and does not include a mechanism for reasoning. A reasoning mechanism could, however, be built on top of this frame system. So what's next?
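
As a rough idea of what "reasoning on top" could mean, the Python sketch below applies two simple rules, modelled on the RDF Schema notions of class membership and subclass, until nothing new can be derived. It is a toy and not a real reasoner; the "ex:" names and the class hierarchy are invented for the example.

```python
TYPE = "rdf:type"
SUBCLASS = "rdfs:subClassOf"


def infer(triples):
    """Apply two rules repeatedly until no new triple appears."""
    facts = set(triples)
    while True:
        new = set()
        for s, p, o in facts:
            if p != SUBCLASS:
                continue
            for s2, p2, o2 in facts:
                # A subClassOf B and B subClassOf C  =>  A subClassOf C
                if p2 == SUBCLASS and s2 == o:
                    new.add((s, SUBCLASS, o2))
                # x type A and A subClassOf B  =>  x type B
                if p2 == TYPE and o2 == s:
                    new.add((s2, TYPE, o))
        if new <= facts:
            return facts
        facts |= new


facts = {
    ("ex:Google", TYPE, "ex:SearchEngine"),
    ("ex:SearchEngine", SUBCLASS, "ex:Robot"),
    ("ex:Robot", SUBCLASS, "ex:Program"),
}

closure = infer(facts)
# Never stated explicitly, but derivable from the three facts above.
print(("ex:Google", TYPE, "ex:Program") in closure)
```

The stored data says only that Google is a search engine; the rules are what let a program conclude that it is also a robot and a program.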

Web Ontology Language (OWL)

Before talking about ontology, let's take a look at the DARPA Agent Markup Language (DAML). As its site clearly states: "The goal of the DAML effort is to develop a language and tools to facilitate the concept of the Semantic Web".

Ontology, definition: the study of existence; a branch of metaphysics concerned with the nature of being; from the ancient Greek root ont-, "being".

Now, just for the love of technical matters, let's take a short look at some DAML code to see how instances of classes and their properties are written:
<rdf:RDF xmlns:a="http://www.daml.org/2001/01/gedcom/gedcom#"
         xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:daml="http://www.daml.org/2000/12/daml+oil#">
  <a:Individual rdf:ID="@I1@"
                a:givenName="Michael Anthony"
                a:surname="Dean"
                a:sex="M">
    <a:spouseIn rdf:resource="#@F1@"/>
    <a:childIn rdf:resource="#@F2@"/>
    <a:birth rdf:resource="#event1820"/>
    <daml:equivalentTo rdf:resource="mailto:mdean@bbn.com"/>
  </a:Individual>
  <a:Birth rdf:ID="event1820"
           a:date="7 Aug 1961"
           a:place="Wyandotte, Wayne Co., Michigan"/>
  ...
</rdf:RDF>
[online: www.daml.org]
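
To show that a program can indeed pick this apart, here is a small Python sketch using the standard xml.etree.ElementTree module. It is not a real RDF/XML parser, only a flattening that happens to suit this one shape of document. The DOC string is a trimmed copy of the sample above (the elided "..." part and a few properties are left out), and the function name to_triples is my own.

```python
import xml.etree.ElementTree as ET

A = "http://www.daml.org/2001/01/gedcom/gedcom#"
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# Trimmed copy of the sample; the elided part of the original is omitted.
DOC = """<rdf:RDF xmlns:a="http://www.daml.org/2001/01/gedcom/gedcom#"
         xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:daml="http://www.daml.org/2000/12/daml+oil#">
  <a:Individual rdf:ID="@I1@" a:givenName="Michael Anthony" a:surname="Dean" a:sex="M">
    <a:birth rdf:resource="#event1820"/>
  </a:Individual>
  <a:Birth rdf:ID="event1820" a:date="7 Aug 1961"
           a:place="Wyandotte, Wayne Co., Michigan"/>
</rdf:RDF>"""


def to_triples(xml_text):
    """Flatten each top-level element into (subject, predicate, object) tuples."""
    triples = []
    for node in ET.fromstring(xml_text):
        subject = "#" + node.attrib[f"{{{RDF}}}ID"]
        triples.append((subject, "rdf:type", node.tag))
        # Attributes other than rdf:ID become properties with literal values.
        for name, value in node.attrib.items():
            if name != f"{{{RDF}}}ID":
                triples.append((subject, name, value))
        # Child elements become links to other resources.
        for child in node:
            triples.append((subject, child.tag, child.attrib[f"{{{RDF}}}resource"]))
    return triples


triples = to_triples(DOC)

# Follow the link from the individual to his birth event, then read its date.
birth = next(o for s, p, o in triples if s == "#@I1@" and p == f"{{{A}}}birth")
date = next(o for s, p, o in triples if s == birth and p == f"{{{A}}}date")
print(date)
```

The last three lines do what a reader does with their eyes: go from the person to the birth event and read off the date, which is the kind of question a search over plain text cannot answer directly.
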
So now we at least have an idea of what DAML looks like. But the smart user could ask: "Does it work? What is out there?".

State of the Art

In the words of Sean B. Palmer (a W3C researcher): "Unfortunately, the Semantic Web is dissimilar in many ways from the World Wide Web, including that you can't just point people to a Web site for them to realise how it's working, and what it is. However, there have been a number of small scale Semantic Web applications written up. One of the best ones is Dan Connolly's Arcs and Nodes diagrams experiment": Circles and arrows diagrams using stylesheet rules, Dan Connolly.
Sean also recommends: "another good example of the Semantic Web at work is Dan Brickley et al.'s": RDFWeb.
[Both mentioned in: http://infomesh.net/2001/swintro/#itWorks].

What Can I Do To Help?

Go here, hit PgDwn on your keyboard and read.

The End

So I hope I have given my readers at least a faint idea of what appears to be the future of one layer of the Internet; if the Semantic Web takes root, it will bring about a social change for humankind.
But hey, the average webmaster is a human being, and so as lazy as anyone else.
Thank you for your time, stay tuned,

2003.11.28

Vincenzo Maggio


References & Acknowledgements:

Not only as a user, but also as a researcher and programmer, my personal thanks go at least to:
Tim Berners-Lee, Ora Lassila, James Hendler, Sean B. Palmer, Aaron Swartz, plus some dozens of others, for the exceptional work done.

The following were consulted for this article:

http://logicerror.com/semanticWeb-long (The Semantic Web In Breadth)
http://pro.html.it/articoli/id_400/idcat_46/pro.html (Metadata)
http://www.w3.org/TR/1999/REC-rdf-syntax-19990222 (RDF)
http://www.w3.org/TR/rdf-schema/ (RDF Schema)
http://rdfweb.org/ (Various links)
http://infomesh.net/2001/swintro/ (Introduction to Semantic Web)
http://www.w3.org/XML/ (XML)
http://www.w3.org/2000/10/swap/Primer (N3 primer)
http://www.w3.org/2001/sw/RDFCore/ntriples/ (N-Triples)
http://www.daml.org/ (DAML)
http://www.daml.org/2001/02/horus-daml/Overview.html (DAML status and tools)
http://www.w3.org/DesignIssues/Semantic.html (Road map)
http://www.w3.org/2000/01/sw/ (Advanced Development)