Archive for the 'Semantic Web' Category

Opening up the social graph at the WebCamp workshop on “social network portability”

A WebCamp “Social Network Portability” workshop has been announced to be co-located with BlogTalk on 2nd March 2008. You can view the wiki page for this event.

“Social network portability” is a term that has been used to describe the ability to reuse one’s own profile and contacts across various social networking sites and social media applications. At this workshop, presentations will be combined with breakout sessions to discuss all aspects of portability for social networking sites (including accounts, friends, activities / content, and applications).

Topics of relevance include, but are not limited to, social network centralisation versus decentralisation, OpenSocial, microformats including XHTML Friends Network (XFN) and hCard, authentication and authorisation, OpenID single sign-on, Bloom filters, categorising friends and personas, FOAF, ownership of your published content, SIOC, the OpenFriend format, the Social Network Aggregation Protocol (SNAP), aggregation and privacy, permissions and context, and the Extensible Messaging and Presence Protocol (XMPP).

You can register for this workshop in conjunction with BlogTalk 2008. If you are interested in speaking or otherwise participating in the workshop, please add your name under the Speakers or Participants headings on the wiki page at http://webcamp.org/SocialNetworkPortability.

Talk by Barney Pell at ISWC 2007, CTO of Powerset

Barney Pell gave the opening talk of the day at ISWC this morning. Barney is the former CEO, and now CTO, of natural language search company Powerset.

He talked about how natural language (NL) helps the Semantic Web (SW), and in particular how it addresses both sides of the chicken-and-egg problem. On one side, annotations can be created from unstructured text, and ontologies can be generated, mapped and linked. On the other side, NL search can consume SW information, and can expose SW services in response to NL queries.

The goal of Powerset is to enable people to interact with information and services as naturally and effectively as possible, by combining NL and scalable search technology. Natural language search interprets the Web, indexes it, interprets queries, searches and matches.

Historically, search has matched query intents with document intents, and a change in the document model has driven the latest innovations. The first is proximity: there’s been a shift from documents being a “bag of keywords” to becoming a “vector of keywords”. The second is in relation to anchor text: adding off-page text to search is next.

Documents are loaded with linguistic structure that is mostly discarded and ignored (due to cost and complexity), but it has immense value. A document’s intent is actually encoded in this linguistic structure. Powerset’s semantic indexer extracts meaning from the linguistic structure, and Barney believes that they are just at the start of exciting times in this area.

Converging trends that are enabling this NL search are language technologies, lexical and ontological knowledge resources, Moore’s law, open-source software, and commodity computing.

Powerset integrates diverse resources, e.g. websites, newsfeeds, blogs, archives, metadata (“MetaSearch”), video, and podcasts. It can also do real-time queries to databases, where an NL query is converted into a database query. Barney maintains that results from databases drive further engagement.

He then gave some demos of Powerset. With the example “Sir Edward Heath died from pneumonia”, Barney showed how Powerset parses each sentence; extracts entities and semantic relationships; identifies and expands these to similar entities, relationships and abstractions; and indexes multiple facts for each sentence. He showed an interesting demonstration where multiple queries on the same topic to Powerset retrieve the same “facts”. The information on the various entities or relationships can come from multiple sources, e.g. information on Edward Heath or Deng Xiaoping is from Freebase and details on pneumonia come from WordNet.

He gave an example of the search query “Who said something about WMDs?”. This is difficult to express using keyword search, which cannot easily capture both that someone “said something” and that it was about weapons of mass destruction. Barney also showed a parse for the famous wrestler / actor Hulk Hogan, with all the relations or “connections” to him (e.g., defeat) and the subjects or “things” that he is related to (e.g., André the Giant).
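To make the idea of indexing “facts” a bit more concrete, here is a toy sketch of my own (it is not Powerset’s implementation, and I have no inside knowledge of how their index works). It assumes that each parsed sentence boils down to one or more (subject, relation, object) triples, and that a query like “Who said something about WMDs?” becomes a pattern with a wildcard in the subject slot. The class name `FactIndex`, the relation names and the sample facts (including the placeholder speaker) are all hypothetical.

```python
from collections import defaultdict


class FactIndex:
    """Toy in-memory index of (subject, relation, object) facts."""

    def __init__(self):
        self._facts = []                       # every fact, in insertion order
        self._by_relation = defaultdict(list)  # relation -> facts using it

    def add(self, subject, relation, obj):
        fact = (subject, relation, obj)
        self._facts.append(fact)
        self._by_relation[relation].append(fact)

    def match(self, subject=None, relation=None, obj=None):
        """Return facts matching the pattern; None acts as a wildcard."""
        if relation is not None:
            candidates = self._by_relation.get(relation, [])
        else:
            candidates = self._facts
        return [
            fact for fact in candidates
            if (subject is None or fact[0] == subject)
            and (obj is None or fact[2] == obj)
        ]


# Hypothetical, hand-written facts; a real system would extract them by parsing text.
index = FactIndex()
index.add("Edward Heath", "die_from", "pneumonia")
index.add("Hulk Hogan", "defeat", "André the Giant")
index.add("Speaker A", "say_about", "WMDs")  # placeholder speaker, not a real quote

# "Who said something about WMDs?" -> wildcard subject, fixed relation and object.
who_said = index.match(relation="say_about", obj="WMDs")

# Everything the index "knows" about Hulk Hogan as a subject.
hogan_facts = index.match(subject="Hulk Hogan")
```

The point of the sketch is only that a relation-aware index can answer a question that a bag of keywords cannot: the query fixes the relation and the object, and leaves the subject open.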

Powerset’s language technologies are the result of commercialising the XLE work from PARC, leveraging their “multidimensional, multilingual architecture produced from long-term research”. Some of their main challenges are in the areas of scalability, systems integration, incorporating various data and knowledge resources, and enriching the user experience.

He next talked about accelerating the SW ecosystem. Barney said that the wisdom of crowds can help to accelerate the Semantic Web. What starts as a broad platform gets deeper faster when it gets deployed at a large scale, realising a Semantic Web faster than expected. This drive comes from four types of people:

  • The first category is publishers, who upload their ontologies to get more traffic, and can get feedback to help with improving their content.
  • Users are the next group, as they will “play games” to create and improve resources, will provide feedback to get better search, and will create (lightweight, simple) ontologies for personalisation and organising their own groups.
  • There are also developers, who can package knowledge for specialised applications (e.g., for vertical search).
  • Finally, advertisers will want to create and upload ontologies to express all the things that should match their commercial offerings.

For the community, Powerset will provide various APIs and will give access to their technologies to build mashups and other applications. Powerset’s other community contributions are in the form of datasets, annotations, and open-source software.

Their commercial model is in relation to advertising (like most search engines) and licensing their technologies to other companies or search engines. Another related company (a friend of Barney’s) is [true Knowledge]™.

I’m still waiting for my Powerset Labs account to be approved; looking forward to getting in there and trying it out myself. Thanks to Barney for the great talk.

Brewster Kahle’s (Internet Archive) ISWC talk on worldwide distributed knowledge

Universal access to all knowledge can be one of our greatest achievements.

The keynote speech at ISWC 2007 was given this morning by Brewster Kahle, co-founder of the Internet Archive and also of Alexa Internet. Brewster’s talk discussed the challenges in putting various types of media online, from books to video:

  • He started by talking about digitising books (1 book = 1 MB; the Library of Congress = 26 million books = 26 TB; with images, somewhat larger). At present, it costs about $30 to scan a book in the US. For 10 cents a page, books or microfilm can now be scanned at various centres around the States and put online. So far, 250,000 books have been scanned and are held in eight online collections. He also talked about making books available to people through the OLPC (One Laptop per Child) project. Still, most people like having printed books, so book mobiles for print-on-demand books are now coming. A book mobile charges just $1 to print and bind a short book.
  • Next up was audio, and Brewster discussed issues related to putting recorded sound works online. At best, there are two to three million discs that have been commercially distributed. The biggest issue with this is in relation to rights. Rock ‘n’ roll concerts are the most popular category of the Internet Archive audio files (with 40,000 concerts so far); the Internet Archive offers bands hosting with “unlimited storage, unlimited bandwidth, forever, for free” if they waive any issues with rights. There are various cultural materials that do not work well in terms of record sales, but there are many people who are very interested in having these published online. Audio costs about $10 per disc (per hour) to digitise. The Internet Archive has 100,000 items in 100 collections.
  • Moving images or video was next. Most people think of Hollywood films in relation to video, but at most there are 150,000 to 200,000 video items that are designed for movie theatres, and half of these are Indian! Many are locked up in copyright, and are problematic. The Internet Archive has 1,000 of these (out of copyright or otherwise permitted). There are other types of materials that people want to see: thousands of archival films, advertisements, training films and government films, being downloaded in the millions. Brewster also put out a call to academics at the conference to put their lectures online in bulk at the Internet Archive. It costs $15 per video hour for digitisation services. Brewster estimates that there are 400 “original” television channels (ignoring duplicate rebroadcasts). If you record a television channel for one year, it requires 10 TB, with a cost of $20,000 for that year. The Television Archive people at the Internet Archive have been recording 20 channels from around the world since 2000 (it’s currently about 1 PB in size) - that’s 1 million hours of TV - but not much has been made available just yet (apart from video from the week of 9/11). The Internet Archive currently has 55,000 videos in 100 collections.
  • Software was next. For example, a good archival source is old software that can be reused / replayed via virtual machines or emulators. Brewster came out against the Digital Millennium Copyright Act, which is “horrible for libraries” and for the publishing industry.
  • The Internet Archive is best known for archiving web pages. It started in 1996, by taking a snapshot of every accessible page on a website. It is now about 2 PB in size, with over 100 billion pages. Most people use this service to find their old materials again, since most people “don’t keep their own materials very well”. (Incidentally, Yahoo! came to the Internet Archive to get a 10-year-old version of their own homepage.)
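As a quick sanity check on the figures quoted in the list above, here is the back-of-the-envelope arithmetic written out. The input numbers are the ones from the talk; the assumptions are mine: decimal units (1 TB = 10^12 bytes), roughly seven years of recording (2000 to 2007), and every channel recorded around the clock.

```python
# Decimal storage units (an assumption; the talk did not say which convention was used).
MB = 10**6
TB = 10**12

# Books: 1 book = 1 MB, Library of Congress = 26 million books.
books = 26_000_000
library_tb = books * 1 * MB // TB           # 26 TB, matching the quoted figure

# Television: 20 channels, 10 TB and $20,000 per channel per year.
channels = 20
years_recorded = 7                          # assumed: 2000 to 2007
tb_per_channel_year = 10
cost_per_channel_year = 20_000

tv_tb = channels * years_recorded * tb_per_channel_year      # 1400 TB, i.e. about 1.4 PB
tv_hours = channels * years_recorded * 365 * 24              # 1,226,400 hours
tv_cost = channels * years_recorded * cost_per_channel_year  # $2,800,000 in total

print(f"Library of Congress as text: {library_tb} TB")
print(f"TV archive: {tv_tb} TB, {tv_hours:,} hours, ${tv_cost:,}")
```

Under those assumptions the text-only Library of Congress comes out at exactly the quoted 26 TB, and the television archive lands in the same ballpark as the “about 1 PB” and “1 million hours” figures, so the numbers hang together reasonably well.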

Brewster then talked about preservation issues, i.e., how to keep the materials available. He referenced the famous library at Alexandria, Egypt, which unfortunately is best known for burning. Libraries also tend to be burned by governments due to changes in policies and interests, so the computer world’s solution to this is backups. The Internet Archive in San Francisco has four employees and 1 PB of storage (including the power bill, bandwidth and people costs, their total costs are about $3,000,000 per year; they use 6 GB of bandwidth per second; their storage hardware costs $700,000 for 1 PB). They have a backup of their book and web materials in Alexandria, and also store audio material at the European Archive in Amsterdam. Also, their Open Content Alliance initiative allows various people and organisations to come together to create joint collections for all to use.

Access was the next topic of his presentation. Search is making in-roads in terms of time-based search. One can see how words and their usage change over time (e.g., “marine life”). Semantic Web applications for access can help people to deal with the onslaught of information. There is a huge need to take large related subsets of the Internet Archive collections and to help them make sense for people. Great work has been done recently on wikis and search, but there is a need to “add something more to the mix” to bring structure to this project. To do this, Brewster reckons we need the ease of access and authoring from the wiki world, but also ways to incorporate the structure that we all know is in there, so that it can be flexible enough for people to add structure one item at a time or to have computers help with this task.

In the recent initiative “OpenLibrary.org”, the idea is to build one webpage for every book ever published (not just ones still for sale) to include content, metadata, reviews, etc. The relevant concepts in this project include: creating Semantic Web concepts for authors, works and entities; having wiki-editable data and templates; using a tuple-based database with history; making it all open source (both the data and the code, in Python). OpenLibrary.org has 10 million book records, with 250k in full text.
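For anyone wondering what a “tuple-based database with history” might look like, here is a minimal sketch of the general idea as I understand it. It is not the Open Library schema or code: the class `TupleStore`, its method names and the sample book record are all hypothetical. The assumption is simply that every edit is appended as a new (revision, subject, property, value) tuple, so that nothing is overwritten and older versions stay readable, much like page history on a wiki.

```python
class TupleStore:
    """Append-only store of (revision, subject, property, value) tuples."""

    def __init__(self):
        self._rows = []      # tuples are only ever appended, never modified
        self._revision = 0   # global revision counter

    def set(self, subject, prop, value):
        """Record a new value and return the revision number it was given."""
        self._revision += 1
        self._rows.append((self._revision, subject, prop, value))
        return self._revision

    def get(self, subject, prop, at_revision=None):
        """Return the latest value, or the value as of a given revision."""
        latest = None
        for revision, row_subject, row_prop, value in self._rows:
            if row_subject != subject or row_prop != prop:
                continue
            if at_revision is not None and revision > at_revision:
                continue
            latest = value   # rows are in revision order, so the last match wins
        return latest

    def history(self, subject, prop):
        """Return every (revision, value) pair recorded for this property."""
        return [
            (revision, value)
            for revision, row_subject, row_prop, value in self._rows
            if row_subject == subject and row_prop == prop
        ]


# Hypothetical book record, edited twice in wiki fashion.
store = TupleStore()
first = store.set("book/1", "title", "Moby Dick")
second = store.set("book/1", "title", "Moby-Dick; or, The Whale")

current_title = store.get("book/1", "title")
original_title = store.get("book/1", "title", at_revision=first)
```

Keeping one row per edit is what makes wiki-style editing safe here: a bad change can be inspected or rolled back because the earlier tuples are still in the store.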

I really enjoyed this talk, and having been a fan of the Wayback Machine for many years, I think there could be an interesting link to the SIOC Project if we think in terms of archiving people’s conversations from the Web, mailing lists and discussion groups for reuse by us and the generations to come.

At the International Semantic Web Conference in Busan

I arrived on Sunday evening in Busan, Korea for the 6th International Semantic Web Conference. Busan is a great big place with three million people; very impressive as you drive into the city from the airport.

Yesterday, I chaired the 2nd International ExpertFinder Workshop (or FEWS, “Finding Experts on the Web with Semantics”), where we had six interesting and varied papers. The workshop had about 35 attendees, and this bodes very well for future events. We also had a meeting about the ExpertFinder initiative for FOAF afterwards. Thanks to the ISWC 2007 Metadata Chairs Tom and Knud, metadata from FEWS is available here.

From DERI, NUI Galway, both Tudor et al. (“SALT: Weaving the Claim Web”) and Andreas et al. (“YARS2: A Federated Repository for Querying Graph Structured Data from the Web”) have been nominated for the best student paper award. Best of luck to you! (With Hak-Lae et al., I also had a submission for the Semantic Web Challenge at the conference.)

Last night, members from DERI Galway and DERI Seoul had dinner at a famous local fish restaurant. Sebastian snapped some great pictures of our meal, and here’s a video of something wriggling that he and Andreas bravely ate…

Some more images from Busan: a lovely sea view from the Paradise Hotel and a city view from the other side, a Japanese-style Captain Kirk toilet, and maybe butter is good for your heart.

New Friend-of-a-Friend diagram from danbri

Saw this on danbri’s Flickr and his corresponding blog entry, a very nice concise diagram showing FOAF classes and properties (and some DOAP, GEO, OWL, SKOS, and SIOC too):

The Future of Social Networks on the Internet: The Need for Semantics

Stefan and I wrote an article entitled “The Future of Social Networks on the Internet: The Need for Semantics” for the IEEE Internet Computing magazine. It was published yesterday (1st November). You can read an extract and see a rendered copy below.

In the article, we describe how Jyri’s idea of object-centered / object-oriented sociality not only provides meaning to social networks, but also defines an application area for the Semantic Web in terms of representation mechanisms for interconnecting people and objects across different social networks.

We also propose a social networking stack that would allow the reuse of one’s personal profile, social network connections and content-creation history (e.g., using FOAF and SIOC) across various sites and applications (there’s some obvious crossover with the OpenSocial People and Activities APIs here).

Anyway, here it is:

The Future of Social Networks on the Internet: The Need for Semantics

“I read somewhere that everybody on this planet is separated by only six other people. Six degrees of separation between us and everyone else on this planet. The President of the United States, a gondolier in Venice, just fill in the names… It’s not just big names — it’s anyone. A native in a rain forest, a Tierra del Fuegan, an Eskimo. I am bound — you are bound — to everyone on this planet by a trail of six people.” — John Guare

Everyone on the Internet knows the buzzword social networking. Sites such as Friendster, Facebook, Orkut, LinkedIn, Bebo, and MySpace, as well as content-sharing sites that also offer social networking functionality (including YouTube, Flickr, Upcoming, del.icio.us, Last.fm, and 43 Things) have captured the attention of millions of users and millions of dollars from venture capitalists. Compete.com states that, as of November 2006, the 10 most popular domains accounted for about 40 percent of all page views on the Web, and nearly half of those views were from the social networking services (SNSs) MySpace and Facebook.

SNSs usually offer the same basic functionalities: network of friends listings (showing a person’s “inner circle”), person surfing, private messaging, discussion forums or communities, events management, blogging, commenting (sometimes as endorsements on people’s profiles), and media uploading. With such features, SNSs demonstrate how the Internet continues to better connect people for various social and professional purposes. Yet, fundamental problems with today’s SNSs block their potential to access the full range of available content and networked people online. A possible solution is to build semantic social networking into the fabric of the next-generation Internet itself — interconnecting both content and people in meaningful ways.

Page 1

Page 2

Page 3

Page 4

Page 5

I think this article is timely given the unveiling of OpenSocial these past few days (we managed to reference the then forthcoming API in time for a section about “Your Social Graph” on page 3). But as Uldis and Daniel Feygin pointed out on the SNP mailing list, while OpenSocial addresses social application portability and widget developers nicely, it seems to miss out on tackling the issues of social graph portability and cross-network identity links.

David Emery highlights this closed social network problem: “OpenSocial doesn’t solve this, but if it had it could be truly revolutionary; if Google had gone after opening up the social graph [...] then Facebook would have become much more of an irrelevance – people could go to whatever site they wanted to use, and still preserve all the interactions with their friends (the bit that really matters).” Marc Canter says: “Me - I’m just sitting here, smiling and wondering about interop and whether all these platforms are really gonna open up their social graphs with unique identifiers. After waiting four years - who’s in a hurry?” And Bob Warfield says: “One of the biggest things will be portability of one’s social graph. Can I carry mine from one participating Social Network to the next? That’s a touchy business. [...] Who will be first to write an app whose sole purpose is to carry your identity and Social Graph from one network to the next?” Of course, not everyone wants their graphs to be portable or linked together - there may be very good reasons for isolation, but if OpenSocial could allow people to choose to link or reuse their profile / connections across sites (or not), I think it would be a leap rather than a step in the right direction.

Lally meetup on Saturday…

Had an interesting evening chatting about Web 2.0, the Semantic Web and Fortune 500 consultancy with Brendan Lally (a Galway-born IT and Web consultant currently based in Colorado) during a night out with a few other web heads including James Cooley and Ina O’Murchu from DERI, and Richard Garsthagen, Technical Marketing Manager EMEA for VMware. We started off in the Kashmir Indian restaurant and gradually made our way to Sheridan’s on the Dock for some organic colas and Erdingers. As Brendan mentioned, Richard helped me to get my Nokia 770 talking to my 6234 (*99# was news to me) so that he could show us Autostitch (a fully-automatic 2D image stitcher). It was good to meet you Brendan; I hope the rest of your round-Ireland trip goes well.

Been a busy B…

…for the past few months, hence the lack of regular blog entries. Most of my summer has been taken up with proposal writing for research funding here at DERI, the first of which finished up around the end of June and the second ran from then until the end of August, so unfortunately I haven’t had time for much else…

Anyway, here are some updates about future social media / social software activities I’m involved in:

int.ere.st - create and share tags across your online communities

As mentioned in a previous blog post, int.ere.st has just launched. The main objective of int.ere.st is to demonstrate how Semantic Web and Web 2.0 technologies can be combined to provide better metadata creation and sharing support across various online communities.

With int.ere.st, you can save, tag and bookmark your own as well as other people’s tag clouds, as represented using the SCOT ontology. The tag meta search also allows you to look for similar patterns of tagging from other people based on their interests (as expressed using tags).

Some functionalities of int.ere.st include:

  • Various options for tag searching, such as and (&), or (space), co-occurrence (+), broader (>), and narrower (<)
  • User searching
  • Resource searching
  • Integrating tagged data across communities
  • Meta tagging
  • Ontology bookmarking
  • Sharing metadata produced using the FOAF, SIOC, and SCOT ontologies
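To illustrate the first two tag search operators in the list above, here is a small sketch of how “and” (&) and “or” (space) could be evaluated over tagged resources. This is not the int.ere.st code: the function `search`, the sample resources and the precedence rule (& binds tighter than space) are my own assumptions. The co-occurrence, broader and narrower operators are left out because they would need tag statistics and tag relations from SCOT data.

```python
# Hypothetical tagged resources: resource name -> set of tags.
RESOURCES = {
    "post-1": {"semanticweb", "foaf"},
    "post-2": {"semanticweb", "sioc"},
    "post-3": {"web2.0"},
}


def search(query, resources):
    """Return the sorted names of resources that match the query.

    A space separates alternatives (or); within an alternative,
    '&' joins tags that must all be present (and).
    """
    hits = set()
    for alternative in query.split():
        required = {tag for tag in alternative.split("&") if tag}
        if not required:
            continue  # ignore stray '&' characters
        for name, tags in resources.items():
            if required <= tags:  # every required tag is on the resource
                hits.add(name)
    return sorted(hits)


both_tags = search("semanticweb&foaf", RESOURCES)           # and
either_tag = search("foaf sioc", RESOURCES)                 # or
mixed = search("semanticweb&foaf web2.0", RESOURCES)        # (and) or
```

With the sample data, the first query only matches the post carrying both tags, the second matches any post carrying either tag, and the third combines the two.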

You can try it out at http://int.ere.st/. Here are some video demos of int.ere.st in action, and some more videos are forthcoming:


Short


Long

State of the SIOC-o-sphere (#5)

It’s that random time of the year again where I summarise what’s been going on in the world of SIOC