Wikipedia:Semantic Wikipedia

The Semantic Wikipedia would combine the properties of the Semantic Web with wiki technology. In this enhancement, articles would carry properties (or traits) that could be combined to make articles members of dynamic categories, chosen by user request. Lists would no longer be limited to the numerous pre-formatted list articles; instead, a list could be created dynamically from all articles matching a selected set of properties.

This in turn raises the possibility of computer-generated articles: an article composed of pieces of other articles, such as the birth paragraphs of several selected authors, perhaps saved as a temporary article for a certain duration. Temporary articles could be saved either in an individual user space or in a larger group space (shared by users with a common interest), and would carry a "sunset clause" so that they are deleted automatically later unless the expiration date is reset.

Advantages for Wikipedia

  • Provides rich metadata.
  • Advanced searches: multiple properties could be combined in one query (see the sketch after this list), for example:
      • Find: Italian directors born between 1956 and 1963 who worked on films starring Jim Carrey, on films set in English-speaking countries, ....
      • Find: all articles about German physicists whose images were edited between May 20 and July 19, 2011.
      • Generate an article containing the "References" sections of all American films released during June–August 1939, plus the "Cast" sections of all American films released in December 1939.
  • Data for external applications/sister projects – potential for revenue (this is Freebase's revenue model).
  • Editing efficiency:
      • Reduces duplication of data.
      • Removes the need for so many manually compiled lists.
  • Elegance/comprehension:
      • Solves the awkward category problems (cf. WP:CI):
          • [[Category:African-American Actors from New York]] → [[Ethnicity:African-American]] [[From:New York]]
          • [[Category:Films about WWII|Films about US history|Films about UK history|Films about French history]] → [[Films about:WWII|US history|UK history|French history]]
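
To make the "advanced searches" idea concrete, something similar can already be approximated today against DBpedia's public SPARQL endpoint. The following is a minimal Python sketch (using the SPARQLWrapper library); the DBpedia property names are simplifications, and the query only approximates the first example above, omitting the Jim Carrey and English-speaking-country constraints:

    from SPARQLWrapper import SPARQLWrapper, JSON

    # Sketch only: approximates "Italian directors born 1956-1963 who
    # directed a film" using DBpedia's vocabulary, not a Wikipedia-native
    # semantic query language (which does not yet exist).
    sparql = SPARQLWrapper("https://dbpedia.org/sparql")
    sparql.setQuery("""
        PREFIX dbo: <http://dbpedia.org/ontology/>
        PREFIX dbr: <http://dbpedia.org/resource/>
        PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
        SELECT DISTINCT ?director ?born WHERE {
            ?film     dbo:director  ?director .
            ?director dbo:birthDate ?born ;
                      dbo:birthPlace/dbo:country dbr:Italy .
            FILTER (?born >= "1956-01-01"^^xsd:date &&
                    ?born <= "1963-12-31"^^xsd:date)
        }
        LIMIT 50
    """)
    sparql.setReturnFormat(JSON)
    for row in sparql.query().convert()["results"]["bindings"]:
        print(row["director"]["value"], row["born"]["value"])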

Disadvantages for Wikipedia

  • Semantic MediaWiki markup syntax may be harder for less technically inclined editors to understand, while Wikipedia aims to be open to everyone.

Data mining work in Wikipedia


Though primarily written prose, Wikipedia contains a very large amount of structured data in various forms.

Categories


A semantic 'type' is very similar to a Wikipedia category, in that it groups related things together.

Categories and types often correlate very highly. A category like 'Category:1923 deaths', for example, is extremely strong evidence that an article describes a 'foaf:Person'. The correlation is rarely perfect, however: many of the topics in Category:American Idol, such as "Canadian Idol" or "Malaysian Idol", are television programs, but by no means all of them; one topic is a book written by an "Idol" judge, not a TV program.

Freebase has a learning application that makes inferences based on these relationships. Once a Wikipedia category reaches a high confidence of accuracy based on human votes, the application starts asserting the corresponding type automatically rather than seeking human confirmation. These assertions can also be made manually.
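
Freebase's actual learning application is not public, so the following Python sketch merely illustrates the pattern: hypothetical category-to-type rules, with made-up confidence scores standing in for accumulated human votes, and a threshold separating automatic assertion from human confirmation.

    import re

    # Hypothetical rules mapping category names to types; the confidence
    # values stand in for accumulated human votes on each rule's accuracy.
    CATEGORY_RULES = [
        (re.compile(r"^\d{4} (births|deaths)$"), "foaf:Person", 0.99),
        (re.compile(r"television (series|program)s?$", re.I), "tv_program", 0.70),
    ]

    def infer_types(categories, threshold=0.90):
        """Assert types for high-confidence rules; queue the rest for
        human confirmation instead of asserting them automatically."""
        asserted, needs_confirmation = [], []
        for cat in categories:
            for pattern, inferred_type, confidence in CATEGORY_RULES:
                if pattern.search(cat):
                    bucket = asserted if confidence >= threshold else needs_confirmation
                    bucket.append((cat, inferred_type, confidence))
        return asserted, needs_confirmation

    print(infer_types(["1923 deaths", "American television series"]))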

Tables and lists


Wikipedia has huge numbers of structured lists and tables of well-formatted data. DBpedia's user-contributed mappings are able to parse Wikipedia tables, and some projects are under way to enable easy importing from HTML tables into Freebase.
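
For instance, the HTML tables of any rendered Wikipedia page can be pulled into a tabular data structure with a few lines of Python. This is a sketch using pandas (the page title is just an example, and the lxml or html5lib parser must be installed):

    import pandas as pd

    # pandas parses every <table> on the rendered page into a DataFrame.
    url = "https://en.wikipedia.org/wiki/List_of_tallest_buildings"
    tables = pd.read_html(url)
    print(len(tables), "tables found")
    print(tables[0].head())  # first rows of the first table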

Infoboxes


Several projects have parsed Wikipedia templates and infoboxes in order to allow this information to be processed in different ways.

DBpedia parses many infoboxes and offers a SPARQL query service; it is also preparing a live extraction framework.

Freebase has also parsed some Wikipedia templates and infoboxes, and offers dumps and an API.

Wikipedia³ is a conversion of the English Wikipedia templates into RDF. It is a monthly updated dataset containing around 47 million triples, but it does not yet offer them over an API.
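
As an illustration of consuming this infobox-derived data, DBpedia also publishes per-resource RDF that a generic RDF library can fetch directly. A minimal Python sketch with rdflib follows; the resource and property chosen here are just examples:

    from rdflib import Graph, URIRef

    # Fetch and parse the RDF that DBpedia derives (largely) from the
    # infobox of one article, then read a single property from it.
    g = Graph()
    g.parse("https://dbpedia.org/data/Jim_Carrey.rdf")
    subject = URIRef("http://dbpedia.org/resource/Jim_Carrey")
    birth_date = URIRef("http://dbpedia.org/ontology/birthDate")
    for value in g.objects(subject, birth_date):
        print("birthDate:", value)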

Links

Wikipedia's internal links provide a great deal of unambiguous structured information about co-occurrence and relatedness.

Interlanguage links can provide semantic translation.

Redirects may seem to be a good source of alias information, but they prove very problematic. Wikipedia redirects include misspellings, previous names, character names (which redirect to their films), anglicized or translated names, adjectival forms, and related terms: 'golf course', for example, redirects to 'Golf'. Some data games exist that try to pick out proper aliases from Wikipedia redirects manually.
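
The raw redirect data itself is easy to obtain; the hard part is the classification described above. A Python sketch using the standard MediaWiki API, with 'Golf' as the article from the example:

    import requests

    # List pages that redirect to 'Golf'; the results mix genuine
    # aliases with misspellings and related terms like 'golf course'.
    resp = requests.get("https://en.wikipedia.org/w/api.php", params={
        "action": "query", "list": "backlinks", "bltitle": "Golf",
        "blfilterredir": "redirects", "bllimit": "50", "format": "json",
    }).json()
    for page in resp["query"]["backlinks"]:
        print(page["title"])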

Natural language


A large amount of work has gone into parsing semantic data from the text of Wikipedia articles using natural language processing.

Yahoo! has done a large-scale NLP analysis of Wikipedia, including sentence and token splitting, part-of-speech tagging, named-entity recognition, and dependency parsing.

Other, more modest work includes matching template sentences and extracting dates; for example, if an article describes an event, the first date mentioned is likely the date on which it happened.
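
A naive version of that date heuristic takes only a few lines of Python; the regular expression and the example sentence below are illustrative only:

    import re

    MONTHS = ("January|February|March|April|May|June|July|"
              "August|September|October|November|December")
    # Matches '1 September 1939' as well as 'September 1, 1939'.
    DATE = re.compile(rf"\b(?:\d{{1,2}} (?:{MONTHS})|(?:{MONTHS}) \d{{1,2}},?) \d{{4}}\b")

    def first_date(article_text):
        """Heuristic: in an event article, the first date mentioned is
        likely the date the event happened. Often wrong, hence 'modest'."""
        match = DATE.search(article_text)
        return match.group(0) if match else None

    print(first_date("The invasion of Poland began on 1 September 1939."))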

Ontology


It would be very interesting to define an ontology for storing various future properties of Wikipedia articles, such as:

An article about a literary author contains information on:

  • biography
  • main works
  • style and trends he or she followed
  • review
  • bibliography
  • notes (footnotes in article)
  • references (used in article)

An article about a literary movement is related to:

  • authors that participated
  • historical episodes related to those authors' biographies
  • mentions of main works

And so forth: authors related to towns, towns related to countries, countries to continents... This would help with making inferences, associations, content augmentation, and so on. It would also combine well with bots that create templates, relating existing information into new articles.

It would be a very enriching complement to browsing and content discovery.
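
A tiny sketch of the kind of inference chain this enables, using Python's rdflib and an entirely hypothetical ontology namespace (every term below is made up for illustration):

    from rdflib import Graph, Namespace

    EX = Namespace("http://example.org/wp-ontology/")  # hypothetical
    g = Graph()
    g.add((EX.Cervantes, EX.bornIn, EX.AlcalaDeHenares))
    g.add((EX.AlcalaDeHenares, EX.locatedIn, EX.Spain))
    g.add((EX.Spain, EX.locatedIn, EX.Europe))

    # The property path locatedIn* walks author -> town -> country ->
    # continent, the chain of associations described above.
    query = """
        PREFIX ex: <http://example.org/wp-ontology/>
        SELECT ?place WHERE { ex:Cervantes ex:bornIn/ex:locatedIn* ?place }
    """
    for row in g.query(query):
        print(row.place)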

Adoption/Integration/Scalability


The adoption of semantic tools would leave Wikipedia vulnerable to beginners' mistakes. It therefore seems sensible to limit the rate and extent of adoption by strategically limiting where and how the tools are used, and who is allowed to use them. Articles could also be "pre-compiled" (pre-screened by computer) to detect formatting problems before being saved, or saved with an automatic tag warning other users that the saved text has potential formatting problems.
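
A "pre-compile" check could start as small as the following Python sketch; both checks are hypothetical stand-ins for whatever the real semantic syntax would require:

    def lint_semantic_markup(wikitext):
        """Flag likely formatting problems before the page is saved."""
        problems = []
        if wikitext.count("[[") != wikitext.count("]]"):
            problems.append("unbalanced [[ ]] annotation brackets")
        if wikitext.count("{{") != wikitext.count("}}"):
            problems.append("unbalanced {{ }} template braces")
        return problems

    # An unclosed annotation would be flagged (or auto-tagged on save).
    print(lint_semantic_markup("[[Ethnicity:African-American]] [[From:New York"))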

Ontology for Wikipedia


Please feel free to develop this ontology. The goal is an exhaustive account of all classes and properties that would sensibly be included in an ontology for Wikipedia (WP). A truly exhaustive account is probably not feasible, however: WP already contains over 2.6 million articles (as of November 2008), and it is humanly impossible for any small group of users to understand what those articles really cover. Instead, an ontology generator could be developed to help define property trees to be applied, retroactively, to large collections of existing articles as time permits.

See also
  • Wikidata, a free knowledge base about the world that can be read and edited by humans and machines alike.
  • Semantic MediaWiki
  • Platypus Wiki "Platypus Wiki is a project to develop an enhanced Wiki Wiki Web with ideas borrowed from the Semantic Web. It offers a simple user interface to create wiki pages with metadata based on W3C standards. It uses RDF (Resource Description Framework), RDF Schema and OWL (Web Ontology Language) to create ontologies and manage metadata. Platypus Wiki is an ongoing open source project started on 23 December 2003. The project is currently hosted on SourceForge and licensed under the GNU GPL."
  • Wikipedia:Persondata
  • Wikipedia³ is a monthly-updated conversion of the English Wikipedia into RDF
  • DBpedia is a conversion of Wikipedia into RDF combined with other Linked Data sites to provide extra information
  • Freebase is, according to parent company Metaweb, "a massive, collaboratively-edited database of cross-linked data".

