Archive for the 'Web 2.0' Category

Prototype for distributed / decentralised microblogging using semantics

Download the paper and get the code.

Try out our anonymous client and server demos for SMOB.

Michael Arrington of TechCrunch wrote an interesting blog post on Monday about a “decentralised Twitter”, which was picked up by Dave Winer, Marc Canter and Chris Saad amongst others.

I’m happy to say that we have recently described and shown how this can work. Alex has been the driving force behind a paper that we (Alexandre Passant, Tuukka Hastrup, Uldis Bojars and I) have written for SFSW 2008, demonstrating a prototype called SMOB for distributed / decentralised microblogging:

Microblogging: A Semantic Web and Distributed Approach

The prototype uses FOAF and SIOC to model microbloggers, their properties, account and service information, and the microblog updates that users create. A multitude of publishing services can ping one or a set of aggregating servers as selected by each user, and it is important to note that users retain control of their own data through self hosting.
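As a rough illustration of the modelling described above, here is a small Python sketch that writes out one microblog update as Turtle. The property choices (sioct:MicroblogPost, sioc:has_creator, sioc:content, foaf:holdsAccount), the example.org URIs and the function names are assumptions made for illustration only; the paper describes the vocabulary that SMOB actually uses.

```python
# Hypothetical sketch: serialise one microblog update as Turtle using
# FOAF and SIOC terms. URIs and helper names are invented for illustration
# and are not taken from the SMOB code.

PREFIXES = """\
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix sioc: <http://rdfs.org/sioc/ns#> .
@prefix sioct: <http://rdfs.org/sioc/types#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
"""


def escape_literal(text):
    """Escape backslashes, double quotes and newlines for a Turtle literal."""
    return (text.replace("\\", "\\\\")
                .replace('"', '\\"')
                .replace("\n", "\\n"))


def microblog_post_turtle(post_uri, person_uri, account_uri, content, created):
    """Return Turtle describing a person, their account and one update."""
    return PREFIXES + f"""
<{person_uri}> a foaf:Person ;
    foaf:holdsAccount <{account_uri}> .

<{account_uri}> a sioc:User .

<{post_uri}> a sioct:MicroblogPost ;
    sioc:has_creator <{account_uri}> ;
    foaf:maker <{person_uri}> ;
    dcterms:created "{created}" ;
    sioc:content "{escape_literal(content)}" .
"""


if __name__ == "__main__":
    print(microblog_post_turtle(
        "http://example.org/post/1",
        "http://example.org/me",
        "http://example.org/account",
        "Trying out distributed microblogging",
        "2008-05-12T10:00:00Z",
    ))
```

Because each user hosts a document like this themselves, an aggregating server only needs the URL of the update in order to fetch and index it.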

The aggregate view of microblogs uses ARC2 for storage / querying and Exhibit for the user interface. Security and privacy are open issues, but can be addressed in some part by requiring OpenID authentication.

The SMOB prototype code (both the semantic microblogging publishing client and server-based web service) is available here. You can install your own client and post to our demo server (set up today by Tuukka) here. There are some pictures below of it in use:

20080505a.jpg
Latest updates rendered in Exhibit

20080505b.jpg
Map view of latest updates with Exhibit

20080505c.png
Global architecture of distributed semantic microblogging

I wish I was going to XTech 2008 in Dublin…

…but unfortunately due to a major review here next week, I have a lot of presentation preparation to do.

Anyway, if I were going to XTech 2008 tomorrow in Dublin, here’s what I’d go to see (thanks to the XTech 2008 personal scheduler):

9:45 Wednesday, 7 May 2008
Opening keynote
David Recordon (Six Apart)

11:00 Wednesday, 7 May 2008
Using socially authored content to provide new routes through existing content archives
Rob Lee (Rattle Research)

11:45 Wednesday, 7 May 2008
Browsers on the move: The year in review, the year ahead
Michael(tm) Smith (W3C)

14:00 Wednesday, 7 May 2008
Here Be Dragons: Knowing Where the World Ends
Leigh Dodds (Ingenta)

14:45 Wednesday, 7 May 2008
Linked Data Deployment
Daniel Lewis (OpenLink Software)

9:00 Thursday, 8 May 2008
OpenSocial, a standard programming model for the Social Web
Matthew Trewhella (Google)

9:45 Thursday, 8 May 2008
Creating portable social networks with microformats
Jeremy Keith (Clearleft)

11:00 Thursday, 8 May 2008
The Programmes Ontology
Tom Scott (BBC Audio and Music Interactive), Yves Raimond (Queen Mary, University of London), Patrick Sinclair (BBC Audio and Music Interactive), Nicholas Humfrey (BBC Audio and Music Interactive)

11:45 Thursday, 8 May 2008
Ni Hao, Monde: Connecting communities across cultural and linguistic boundaries
Simon Batistoni (Flickr)

14:00 Thursday, 8 May 2008
SemWebbing the London Gazette
Jeni Tennison (The Stationery Office), John Sheridan (The Office of Public Sector Information)

14:45 Thursday, 8 May 2008
Data portability for whom? Some psychology behind the tech
Gavin Bell (Nature)

16:00 Thursday, 8 May 2008
Google Data APIs on the move: innovation vs. Standards Compliance
Frank Mantek (Google)

16:45 Thursday, 8 May 2008
The attention economy is only just around the corner
Ian Forrester (BBC)

9:00 Friday, 9 May 2008
Data Portability with SIOC and FOAF
Uldis Bojārs (DERI Galway), John Breslin (DERI, National University of Ireland, Galway), Alexandre Passant (LaLIC institute (at Université Paris Sorbonne) and Electricité de France R&D)

(Here is the full schedule.)

Slides from the SIOC tutorial at WWW2008

Here are the PowerPoint slides from our tutorial on “Interlinking Online Communities and Enriching Social Software with the Semantic Web” at the World Wide Web Conference in Beijing - you can also download them from here:

The tutorial went well. It was hot in the room and we were a bit jetlagged, but we had some good feedback afterwards and about 30 people attended in all.

I had a nice few days in Beijing, participating in the W3C advisory committee meeting on Sunday, Monday and Tuesday, giving our SIOC tutorial with Alex and Uldis on Monday afternoon, popping along to our paper at the Linked Data on the Web workshop on Tuesday, attending some sessions on Wednesday (Kai-Fu Lee’s plenary keynote on Cloud Computing, the discussion panel with Lada Adamic et al. on the Future of Online Social Interactions, the W3C Open Your Data! track, and a packed session on Social Networks: Discovery and Evolution of Communities). On Thursday, I gave a talk about DERI at Tsinghua University to Cemon Yang and his team at the Digital Government / Web and Software Research Centre. Thursday evening we had the banquet in the Great Hall of the People, and I headed back to Ireland on Friday.

Unfortunately I saw little of Beijing outside of travelling between venues in taxis and buses, so I have a good reason to return and see / do more next time…

WWW2008 Beijing: Dr. Kai-Fu Lee (Google) - “Cloud Computing”

Kai-Fu Lee is Vice President of Engineering at Google, and President of Google Greater China. He joined Google in 2005. Earlier in his career, he developed the first speaker-independent continuous speech recognition system, for which he won a Business Week award in 1988.

He started by talking about the “people theme”, saying that this is what the (Chinese) Internet is all about. (For April Fool’s Day, Google China announced that they were going to shut down their servers to save electricity, and that they would have to hire 25 million people to do their searches for them. They got 1,800 resumes for the positions.)

There are 235 million people on the Internet in China. What do these people want? Kai-Fu listed these things: accessibility, shareability, freedom (data wherever they are), simplicity, and security. Google believes that cloud computing solves a lot of these problems. It’s not new, so Google are just a part of it like we all are. But day by day, cloud computing is changing the way we use the Internet.

He then explained a little bit about what the Cloud is. Data is stored in the Cloud, on some server somewhere that is not necessarily known by the user, but it’s just there and accessible. Software and services are also moving to the Cloud, usually accessible via a full-featured web browser on the client device. He also advocated the use of open standards and protocols, which he says are “liked” by Google (e.g. Linux, AJAX, LAMP, etc.) so as to avoid control by one company. Finally, the Cloud should be accessible from any device, especially from phones. He said that when the Apple iPhone hit the market, they found that web usage from that device was 50 times greater than that from other web-capable phones, and that Google’s servers really felt it.

Next up was a history lesson on cloud computing. The PC era was hardware centric. Then, the client-server era was more software centric, which was great for enterprise computing. Cloud computing now abstracts that server and makes it very scalable, by hiding complexities, and with the server being anywhere. This is service centric.

Banks too have become “Clouds”, allowing people to go to any ATM and remove money from their bank wherever they are. Electricity can be thought of similarly, as it can come from various places, and you don’t have to know where it comes from: it just works.

Driving forces behind cloud-based computing include: (i) the falling cost of storage, (ii) ubiquitous broadband, and (iii) the democratisation of the tools of production. This is beginning to make cloud-based computing more like a utility. A lot of this is due to the work of IBM and DEC in the 1990s, who realised that computing should be a utility. It is only now that these three key things are in place that this is becoming a reality.

There are six further properties that make this area exciting, being: (1) user centric, (2) task centric, (3) powerful, (4) accessible, (5) intelligent, (6) programmable.

(1) User centric. The data moves with you, and the application moves with you. People don’t want to reload their address book or applications on new machines, as it is painful to do. For example, how bad do you feel if you drop or break your laptop? How easy is it to switch your cellphone? It’s hard, because synchronising your data is usually hard to do. The IR functionality on a mobile is not easy to use / user centric: how often do people use it to back up stuff to their laptops?

If data is all stored in the Cloud - images, messages, whatever - once you’re connected to the Cloud, any new PC or mobile device that can access your data becomes yours. Not only is the data yours, but you can share it with others (e.g. on Picasa Web, your photos are stored in the Cloud). You don’t have to worry about where it is. We’re not there just yet, but the time is approaching where the way we deal with photographs will change. Another example is GMail, as you can use it on any device (since large storage is not required on the device). Kai-Fu bets that everyone in the room has some kind of cloud computing-based e-mail.

PCs are normally our window to the world, but mobile devices can do more. Since services know who you are and where you are (eek!), they can give you more targeted content. There are 600 million cellphone users in China, three billion worldwide, dwarfing the number of PCs that are Internet-accessible. Intelligent mobile search is useful for cellphones, giving you local listings and results relevant to your context. The most powerful and popular application is maps, especially when people get lost, or if they spontaneously want to go somewhere. Maps are more than the traditional flat piece of paper, allowing you to search nearby, see real-time traffic flows, etc. Such mashups provide even more power - calling these integrations a map is a misnomer - the capabilities are enormous. As there’s a move from e-mail usage towards maps and photos, these new applications have to go into the Cloud as well. And with the shift in this direction, another question is how do you make this economic?

Instant information sharing is also important, e.g. via Google Docs, Page Creator, etc. Recently, Google Sites was released - Google hosts it all for you, so there’s no need for you to buy servers or hosting - 50,000 sites were set up in the first few hours after it began. Not only can you access the data, but you can create it anywhere. The browser is the platform.

(2) Task centric. The applications of the past - spreadsheets, e-mail, calendar - are becoming modules, and can be composed and laid out in a task-specific manner. For example, a task may be teachers creating a departmental curriculum, where you can see the people viewing the curriculum spreadsheet and they can have debates in parallel in real time. Spreadsheet editing allows collaboration and publishing to a selected group of people, with version control.

Google considers communication to be a task, such that in GMail you see pop-up chats and chat histories which provide zero-latency discussions combined in communications tasks. If you want, you can have real-time discussions instead of waiting for e-mail responses if people are online in the contacts list. You can also organise all of your common tasks, e.g. using iGoogle’s widgets portal.

(3) Powerful. Having lots of computers in the Cloud means that it can do things that your PC cannot do. For example, Google Search is faster than searching in Windows or Outlook or Word. Of course, Google Search has to be much faster, even though there are many more documents. In terms of how much storage is required, if there are 100 billion pages at 10 kB per page, that’s about 1000 TB of disk space. Cloud computing should have an infinite amount of disks / computation at its disposal. When you issue a query to the Google web search engine, it queries at least 1000 machines (potentially accessing 1000s of terabytes).
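The storage figure is easy to check with a few lines of Python. This is only a back-of-envelope sketch; it assumes decimal units (1 kB = 1,000 bytes, 1 TB = 10^12 bytes), which the talk did not specify.

```python
# Back-of-envelope check of the storage estimate quoted in the talk.
# Decimal units are an assumption; the talk did not say which were meant.
PAGES = 100 * 10**9           # 100 billion pages
BYTES_PER_PAGE = 10 * 10**3   # 10 kB per page
BYTES_PER_TB = 10**12

total_bytes = PAGES * BYTES_PER_PAGE
total_tb = total_bytes // BYTES_PER_TB

print(f"{total_tb} TB")
```

That works out at 1000 TB, or one petabyte, for the raw pages alone.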

(4) Accessible. Universal search (“searchology”) was announced by Google last year. Traditional web page search does IR / TF-IDF / page rank stuff pretty well on the Web at large, but if you want to do a specific type of search, for restaurants, images, etc., web search isn’t necessarily the best option. It’s difficult for most people to get to the right vertical search page in the first place, since they usually can’t remember where to go. Universal search is basically a single search that will access all of these vertical searches.

This search requires simultaneously querying and searching over all the specific databases: news, images, videos, tens of such sources today, with potentially hundreds and thousands of them in the future. There are lots of these simultaneous searches which then get ranked, so it is even more computing intensive than current web search.
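The fan-out-then-rank pattern described here can be sketched in a few lines of Python. This is a toy stand-in and not how Google does it: the vertical indexes, their contents, the scores and the function names are all made up for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

# Toy in-memory "vertical" indexes; titles and scores are invented.
VERTICALS = {
    "news":   [("Beijing hosts WWW2008", 0.9), ("Cloud computing keynote", 0.7)],
    "images": [("Great Hall of the People photo", 0.8)],
    "videos": [("Cloud computing explained", 0.6)],
}


def search_vertical(name, query):
    """Return (score, vertical, title) for titles containing the query."""
    q = query.lower()
    return [(score, name, title)
            for title, score in VERTICALS[name]
            if q in title.lower()]


def universal_search(query):
    """Query every vertical in parallel, then merge into one ranked list."""
    with ThreadPoolExecutor(max_workers=len(VERTICALS)) as pool:
        per_vertical = pool.map(lambda name: search_vertical(name, query),
                                VERTICALS)
        merged = [hit for hits in per_vertical for hit in hits]
    # Highest score first, regardless of which vertical it came from.
    return sorted(merged, reverse=True)


if __name__ == "__main__":
    for score, vertical, title in universal_search("cloud"):
        print(f"{score:.1f}  [{vertical}]  {title}")
```

Even in this toy version, every query costs one lookup per vertical plus a merge step, which is why the approach gets more computing intensive as the number of sources grows.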

(5) Intelligent. Data mining and massive data analysis are required to give some intelligence to the masses of data available (massive data storage + massive data analysis = Google Intelligence).

In their machine translation work for translate.google.com, a trillion words were collected from bilingual and monolingual text, and they wanted to not only find various orders of words but also the mappings of words. Statistical models of translation were trained, and they saw how an English-Chinese pair could be aligned. Then, they needed to extract phrases and collect statistics (e.g. how often variations of a certain translation were being used, such as variations for latest / last / newest / most recent). As more training data is added, the quality improves. Context is also an important matter for consideration, and it provides an advantage for the phrase analysis part of Google’s translators. There are estimates that their translator is equivalent to a high-school student’s level of translation quality.
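The "collect statistics" step can be pictured as counting how often each target phrase is paired with a source phrase and turning the counts into relative frequencies. The Python below is a minimal sketch of that idea under my own assumptions; the aligned pairs are invented, and a real system works on vastly more data with far more sophisticated models.

```python
from collections import Counter, defaultdict


def phrase_table(aligned_pairs):
    """Map each source phrase to the relative frequency of its translations."""
    counts = defaultdict(Counter)
    for source, target in aligned_pairs:
        counts[source][target] += 1
    return {
        source: {target: n / sum(targets.values())
                 for target, n in targets.items()}
        for source, targets in counts.items()
    }


if __name__ == "__main__":
    # Invented example: one Chinese phrase aligned with English variants.
    pairs = [("最新", "latest")] * 3 + [("最新", "newest")]
    print(phrase_table(pairs))
```

Adding more aligned pairs sharpens these frequencies, which is one simple way to see why quality improves as training data is added.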

Lots of data can be processed by machine analysis to generate intelligence. But this needs to be combined with humans - via their collaboration and contributions - to change a mess / mass of photos or data or whatever into a very powerful combination. People and tools together can create intelligent knowledge. Applications like Google Earth are much more useful when people can contribute to them, e.g. by National Geographic sticking loads of high-res photos into it. Reviews, 3-D buildings, etc. can turn a tool from a bunch of pictures into something special. Creativity adds connections to data-centric applications, enabling intelligent combinations of content.

With all this data comes the issue of server costs. If you are trying to choose between buying $42,000 high-end servers or cheap PC-class servers for $2,500 each, you can get 33 times cost efficiency by going for the PC-class servers. You can get a 1000 CPU PC-class cluster for the same price as a high-end 64 CPU server, with possibly 30 times the performance (figures may be out of date).

Even though there is a lower cost, there still needs to be high reliability. Google search is mainly based on low-cost commodity PCs running Linux. Failures are expected in every system every day. If we assume that there are 20,000 machines, there’s typically a failure rate of 110 per day. Google has built a custom software layer that can tolerate failure. (They have also deployed a new data centre in just three days.)
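To put the failure figures in perspective, here is a quick Python calculation. It assumes, purely for illustration, that failures are spread evenly across machines.

```python
# Rough per-machine view of the failure figures quoted in the talk.
# Assumes failures are spread evenly across machines (my assumption).
machines = 20_000
failures_per_day = 110

daily_failure_fraction = failures_per_day / machines
mean_days_between_failures = machines / failures_per_day

print(f"{daily_failure_fraction:.2%} of machines fail per day")
print(f"about {mean_days_between_failures:.0f} days between failures per machine")
```

So each individual machine fails only about twice a year, but at this scale something is failing all the time, which is why the software layer has to tolerate it.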

(6) Programmable. This follows on from the previous description of data requirements. How does one program for 10,000 “flaky servers” in a Google farm? There needs to be: (i) fault tolerance, (ii) distributed shared memory (if storing every web page in yahoo.com, no one machine can store that, so multiples are required), and (iii) new programming paradigms required for storing stuff.

For (i) fault tolerance, Google uses GFS or distributed disk storage. Every piece of data is replicated three times. If one machine dies, a master redistributes the data to a new server. There are around 200 clusters (some with over 5 PB of disk space on 500 machines).
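The replicate-three-times and re-replicate-on-failure behaviour can be sketched as a toy in Python. This is not GFS's actual protocol: the class, its method names and the random placement policy are hypothetical, and it assumes there are always enough live servers left to hold three copies.

```python
import random

REPLICAS = 3  # replication factor quoted in the talk


class ToyMaster:
    """Hypothetical master that tracks which servers hold each chunk."""

    def __init__(self, servers, seed=0):
        self.servers = set(servers)
        self.locations = {}               # chunk id -> set of server names
        self.rng = random.Random(seed)    # seeded so runs are repeatable

    def store(self, chunk_id):
        """Place a new chunk on REPLICAS distinct servers."""
        chosen = self.rng.sample(sorted(self.servers), REPLICAS)
        self.locations[chunk_id] = set(chosen)

    def server_died(self, dead):
        """Forget a dead server and copy its chunks to another live server."""
        self.servers.discard(dead)
        for holders in self.locations.values():
            if dead in holders:
                holders.discard(dead)
                candidates = sorted(self.servers - holders)
                holders.add(self.rng.choice(candidates))


if __name__ == "__main__":
    master = ToyMaster(["s1", "s2", "s3", "s4", "s5"])
    for chunk in range(4):
        master.store(chunk)
    master.server_died("s1")
    print(master.locations)
```

After a failure, every chunk is back to three copies on live servers, so a second failure shortly afterwards still loses nothing.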

The “Big Table” is used for (ii) distributed memory. The largest cells in the Big Table are 700 TB, spread over 2000 machines.

MapReduce is the solution for (iii) new programming paradigms. It cuts a trillion records into a thousand parts on a thousand machines. Each machine will then load a billion records and will run the same program over these records, and then the results are recombined. While in 2005, there were some 72,000 jobs being run on MapReduce, in 2007, there were two million jobs (use seems to be increasing exponentially). Not everything is suitable for MapReduce, e.g. parallelising SVMs. Matrix operations can’t be split and re-glued together easily. For this, they use Incomplete Cholesky Factorisation.
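The split, run-the-same-program and recombine pattern looks like this in miniature. It is a single-process Python sketch using word counting as the example job; the function names are mine, and in a real deployment each part would be processed on a different machine.

```python
from collections import Counter
from functools import reduce


def split(records, parts):
    """Cut the record list into `parts` roughly equal slices."""
    return [records[i::parts] for i in range(parts)]


def map_part(part):
    """The same program run on every slice: count the words it contains."""
    counts = Counter()
    for line in part:
        counts.update(line.lower().split())
    return counts


def map_reduce(records, parts=4):
    """Map every slice independently, then recombine the partial results."""
    partials = [map_part(p) for p in split(records, parts)]  # parallel in practice
    return reduce(lambda a, b: a + b, partials, Counter())


if __name__ == "__main__":
    lines = ["the cloud", "embrace the cloud", "the browser is the platform"]
    print(map_reduce(lines).most_common(2))
```

The approach works here because the partial counts can simply be added together; jobs whose pieces cannot be split and re-glued so easily are the ones that do not fit the model well.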

Cloud computing needs new skills, especially when working with tens of thousands of machines as opposed to just one. The Academic Cloud Computing Initiative in the US and China (at Tsinghua) was launched by Google and IBM. Cloud computing is not just for web-based problems, but it can help provide solutions for scientific problems that were previously very hard to solve.

In terms of benefits, everything should just work, changing the way we work and play. IT should become “simple and safe”, by outsourcing IT to a “trusted shop” via a browser. Entrepreneurs should have new opportunities with this paradigm shift, being freed from monopoly-dominated markets as more cloud-based companies evolve that are powered by open technologies. Governments should leverage such “innovation-enabling platforms”, where people can effectively program tens of thousands of machines themselves. With $540 million of venture capital infused into China last year, Kai-Fu sees cloud-based computing as being a catalyst of economic growth. He finished up saying that cloud computing has arrived. “Embrace the Cloud!”

There was one question from the audience. The questioner said that Kai-Fu made cloud computing sound simple (i.e., it was well explained, not that the technologies or efforts were trivial). He asked what is the societal change rather than the technological change? Assume we have cloud-based computing, how can we start to encourage “cloud thinking” within society? The questioner works with universities looking at open access, trying to encourage people to share their intellectual outputs, but believes it is difficult to persuade knowledge workers to move their work into the Cloud. His question was, what can we do to encourage cloud thinking and “cloud knowledge”?

Kai-Fu’s answer was firstly that cloud computing is not simple, rather it is incredibly complex, but we can learn from what has happened so far. There have been efforts to categorise world knowledge, e.g. Cycorp, which Kai-Fu said has not resulted in a success yet (however, I’ll note here that they are becoming part of the Linked Data initiative: as Kingsley Idehen said yesterday, “Yoda is awake”!). There has been some success in various question answering systems with pieces of knowledge that can be mined and found. He stated that these were the two extremes, but believes that the answer lies somewhere in the middle: some organisation, but not too much. Wikipedia is a step in this direction, so he suggested bringing the question answering approach and the Wikipedia approach closer together.

He said that two things would be required. Firstly, he saw the need for some kind of translation capability. There is so much knowledge in English, which spoils native English speakers. In China, people are also spoiled. However, for many other countries, there is very little local language content. If auto translation doesn’t work well, some kind of assisted translation is required. Secondly, there should be mobile endeavours to make knowledge available. There may also need to be some economic incentive for people to create and share content via their mobiles.

(More reviews at 1, 2 and 3.)

Really cool SIOC widget from Sindice (for WordPress)

I’ve installed the new Sindice SIOC widget, produced by Adam, Fabio and Giovanni from the Sindice team.

As you can see, if you look at the post author or click into any comments list, each user now has a speech bubble beside the username. Clicking on this bubble will show you posts, comments and topics created by that user across the “SIOC-o-sphere”.

20080411b.png

You can also click on any arrow icon beside a link in a blog post to see where else it has been referenced, like this one.

There is a Sindice SIOC API available which serves as a gateway to SIOC data via the Sindice discovery and search services, enabling the verification of the presence of a user or a link on the SIOC-o-sphere as indexed within Sindice.

Tales from the SIOC-o-sphere #7

20080403a.png It’s been three months since my last round-up of all things SIOC-ed, so here is entry number seven in the series:

Previous SIOC-o-sphere articles:

#6 http://sioc-project.org/node/310
#5 http://sioc-project.org/node/294
#4 http://sioc-project.org/node/272
#3 http://sioc-project.org/node/271
#2 http://sioc-project.org/node/138
#1 http://sioc-project.org/node/79

Danja rocks with his “DataPortability and me” video / some slides I’ve made for DP+SIOC

Wow! Danny Ayers has made the best video I’ve seen for the “DataPortability and me” competition, which ends today:

Travelling on the train to Dublin and back this morning, I gathered and made some slides for future presentations on DataPortability and SIOC:

DataPortability, Microsoft’s Contacts API and OpenSocial.org

20080326a.png (No, the picture I created on the right ISN’T the new DataPortability logo; I totally missed out on the closing date, but it will serve as an image for this blog post. There have been some very cool submissions for the competition however.)

There were two interesting announcements yesterday in the portability space. The first was from Microsoft, announcing that they would be “working with Facebook, Bebo, Hi5, Tagged and LinkedIn to exchange functionally-similar Contacts APIs, allowing us to create a safe, secure two-way street for users to move their relationships between our respective services” (Contacts APIs provide contact data portability). The second was from Google, Yahoo! and MySpace, jointly announcing that an OpenSocial Foundation is to be formed as a non-profit entity (OpenSocial provides social application portability). Unfortunately, there is still some confusion regarding exactly what data portability functionality OpenSocial will offer (if any), and at the moment the consensus seems to be that DataPortability and OpenSocial aren’t as related as previously thought.

DataPortability (including Microsoft’s move in this area) is mainly about users being able to have portable data (profiles, identities, content like photos, videos, discussion posts) that they can move between the services and sites that they trust and choose to use. (See Uno de Waal’s interesting post on how the Microsoft Invite2Messenger service allows you to get your Facebook friends’ e-mail addresses in plain text.)

OpenSocial on the other hand is more about “gadget” portability, where social applications can be deployed across a variety of social networking sites. As summarised by Julian Bond, OpenSocial consists of a gadget API (for gadget programmers) and a standard for site owners to implement these gadgets on their own sites. The part of OpenSocial related to DataPortability is a REST API, details of which are a bit vague right now. Not to be confused with OpenSocial (although the similar names make this difficult), the Social Graph API from Google is more related to DataPortability as it indexes semantic data from many social networking sites like Hi5, MySpace, LiveJournal, Twitter, etc. and allows users to bring their social graph with them when they sign up for a new site that supports the API.

Apart from the lack of intersection between Microsoft (plus affiliate Facebook) and Google, a good few companies are in multiple “camps” (DataPortability, Contacts APIs, OpenSocial), as shown by the Venn diagram I drew below:

20080326b.png

Marc Canter and others have pointed out that although the Contacts APIs from Microsoft are not open in themselves, at least the APIs seem to export as much data as they can import. Marc also says that Microsoft (and other big companies) may not be explicitly following the actions (e.g. the technical recommendations) of the DataPortability initiative, but rather claims that it would hurt them if they didn’t open up and go along with some portable data efforts given the current climate and the tide of users in favour of this.

For users to have true data portability, there needs to be some consensus on both the APIs and the formats needed to transfer / represent this portable data. It may be that a number of APIs and formats are required for different scenarios. The Semantic Web is an ideal means for representing the data to be ported from social websites, in that it is well suited (using vocabularies like SIOC and FOAF) to represent how people and all kinds of objects on these sites are connected together (documents, discussions, meetups, places, interests, media files - whatever). Of course other data formats may be used, but most importantly, it would be a waste of time to come up with a bunch of new formats for representing the data that needs to be portable, because a lot of work has been done on how to best provide interoperable, reusable and linked data through efforts like the Semantic Web, AtomPub and the microformats community.

I’ll be attending the DataPortability Lunch Meetup in London on the 6th April 2008 if anyone there feels like a chat about some of these topics…

Nova Spivack visits DERI, NUI Galway and talks about Twine: Radar Networks’ semantic social software product in beta

In association with the IT Association of Galway, DERI recently invited Radar Networks’ Nova Spivack to speak at our research institute in the National University of Ireland, Galway (Nova also gave a keynote talk at BlogTalk 2008 in Cork).

Nova is CEO of one of the companies that is practically applying Semantic Web technologies to social software applications. Radar have a beta product called Twine which is a “knowledge networking” application that allows users to share, organise, and find information with people they trust. People create and join “twines” (community containers) around certain topics of interest, and items (documents, bookmarks, media files, etc., that can be commented on) are posted to these twines through a variety of methods. The seminar room was full of both “DERIzens” and members of Galway’s IT community for Nova’s talk on the Semantic Web and Twine (see his slides here), and after a lengthy question-and-answer session, this was followed by some presentations to Nova of ongoing research work in DERI.

I personally find Twine very interesting, and as well as using it to gather information about SIOC, I intend to use it to gather and publish personal interests that I think will be of interest to the public (once it leaves beta). As well as producing semantic data (just stick “?rdf” onto the end of any twine.com URL), Twine features some cool functionality that elevates it beyond the social bookmarking sites to which it has been compared, including an extensive choice of twineable item types, twined item customisation (“add detail”) and the “e-mail to a twine” feature, all of which I believe are extremely useful. (I have a few Twine invites left for readers of my blog; drop me an e-mail if you need one.)

There are also the community aspects of twines. I foresee that these twines will act as the “social objects” (see presentation by Jyri) that will draw you back to the service, in a much stronger manner than other social bookmarking sites currently do (due to Twine’s more viral nature, its stronger social networking functionality, better commenting, and a more identifiable “home” for these objects). Of course, having more public users will help, but from experience I know that it is a good idea to build on a core group of regular users (in Twine’s case, mainly techies) before increasing the user base too much.

It’s been an exciting few months in terms of announcements relating to commercial Semantic Web applications. As I mentioned recently in an interview with Rob Cawte for the web2.0japan.com blog, this is becoming obvious with the attention being given to startup companies in this space like