Archive for the 'Web' Category

My new job in Electronic Engineering! Will still collaborate with DERI…

Next month, I will begin a tenured lectureship position at the Department of Electronic Engineering here in the College of Engineering and Informatics at the National University of Ireland, Galway. However, I will still do joint research with the Digital Enterprise Research Institute, continuing (amongst other things) to work with the Social Software Unit (on SIOC, SCOT, etc.) and with the TripPlanr project. In my new role, I will also be researching with the NCBES Bioelectronics Research Cluster in NUI Galway.

For those of you who have just come across me and my blog as a result of my work with DERI, you may not know that my background was in electronic engineering, having studied it at undergraduate and postgraduate level, and I also lectured for four years full time in the Department of Electronic Engineering before joining DERI in 2004. When I joined DERI initially, I imagined that I would be working on some intersection between electronic engineering and the Semantic Web. In fact, I fell into the world of the Semantic Web and social software, after an interesting discussion about semantic social networks with Stefan Decker, who was a senior researcher in the Institute at the time. I realised that my “hobby” interests in creating community websites could be combined with interesting research challenges around the Semantic Web, and although I (and then director Dieter Fensel) was unsure about how I would fare in a new research area, I’m glad to say that it worked out okay! Now I’m back to thinking about the convergence between electronics and semantics again, with some social software thrown in the mix (e.g. wearable communities).

Below is a collage of some memories from the past four-and-a-half years, including the FOAF Galway workshop, a Semantic Web cluster meeting, ESWC and a DERI offsite meeting, Wikimania, DERI Stanford, BlogTalk, meeting timbl, BarCamp, DERI drinks, the ITAG awards, and our Social Software summer / Christmas parties.

I’ve really enjoyed working with all the smart and cool people in DERI, and I shall continue to do so, while strengthening ties between the Institute and NUI Galway’s College of Engineering and Informatics through my new job. (It’s my last day before holidays, so if you’re in Galway this evening, we’re going out for a few drinks in the Westwood Hotel after work at 5:30…)

My week in California

I had a nice productive week in San Jose / San Francisco last week, where I attended the Semantic Technologies Conference 2008 (SemTech 2008) and some other nearby events. SemTech 2008 had a record attendance of over 1000 people, and it was great to meet up with old friends and new (some of whom I had often conversed with online but not in real life).

  • Arriving on Sunday afternoon, Uldis, Stefan and I prepared for our SemTech 2008 tutorial. On Monday, we gave the tutorial entitled “The Future of Social Networks on the Internet: The Need for Semantics”, inspired by our IEEE Internet Computing article from last year. You can get the slides here. We talked about how a combination of FOAF and SIOC could be used to represent and interlink people and social objects within and across social websites. The tutorial was well received and we had some interesting questions afterwards…
  • On Tuesday morning, I chaired a late-breaking DataPortability interest group session, where I quizzed Chris Saad on the initiative and we had a good discussion with Daniela Barbosa, Danny Ayers, Ian Davis, Henry Story, Uldis and others. Afterwards, I attended the keynote talks by Nova Spivack and Eric Miller. You may already have seen my reports here and here respectively.
  • On Tuesday afternoon, I met with Sanjay Sabnani, CEO of CrowdGather and friend Chris. CrowdGather is a big network of medium to large message board sites that includes the huge General Mayhem community. (Disclaimer: I am on the CrowdGather Inc. board of advisors.) That evening, we met Ashely and went along to the SF Beta event (“The San Francisco Web 2.0 Mixer”), where I saw some interesting demos including Hitchsters (share taxi trips to the airport). After dinner, we had drinks with TouristR’s Conor Wade, LeFora co-founder Vinnie Lauria and friend David. Unfortunately, I was pretty much “wiped” with jet lag by then.
  • On Wednesday, I took it easy. From the lovely Hotel Kabuki in Japantown, I wandered up Fillmore to see what old breakfast haunt Galette had become (it’s now La Boulange). I skipped on to another breakfast favourite, Ella’s, and had something of a mammoth breakfast (yes, those three plates of food in the picture!) that kept me going for the day. After a spot in Kinokuniya, where I picked up the latest in the Alita: Last Order manga series, I walked on and drove over the Golden Gate Bridge, and then headed back south again for an evening spent with family in the locality.
  • On Thursday, I attended some more SemTech 2008 talks in the morning including Steven Forth et al. from Monitor presenting about Team Learning on Semantic Mediawiki and also part of the FISHBOWL SemTech Reflections discussion session. In the afternoon, a team of us DERI researchers headed up to Radar Networks in San Francisco where we presented some of our work and brainstormed on things we could do together.

And I flew back on Friday, arriving back in Galway on Saturday. San Francisco is still a very special place to me, and I look forward to a proper family holiday there in the next year or three. Funnily enough, on Sunday I was driving behind a car with a California license plate on a Galway road - it was a long way from home!

Now, it’s catch-up time again. We’ve had a busy few weeks here in DERI what with our major funding review (which was held on-site a fortnight ago), so a lot of stuff went by the wayside (if I haven’t replied to you yet, please accept my apologies as I have a backlog of e-mail to get through and also my phone SIM card died this morning).

So what else is happening? I had an interview with Maryrose Lyons yesterday for the latest Brightspark Consulting newsletter, and today I’m correcting some exam papers that were put on a very long finger. I also got a copy of Jonathan Zittrain’s “The Future of the Internet - And How to Stop It” in the post which I’m looking forward to reading soon…

SemTech 2008: Eric Miller (Zepheira) - “Reuse, Repurpose, Remix”

Eric Miller from Zepheira gave the second keynote talk yesterday, talking about some of their open-source development activities that have reached a level that he thought we might be interested in. The aim of the talk was to show that it is possible to reduce the costs for people who are interested in mixing together data from lots of different sources while hiding a lot of the complexity that makes that happen.

He began with a story about when his dad was in hospital with cancer: it was a comedy of errors (“many errors with not much comedy”). As he went from one department to another, they couldn’t correlate any information because their patient care model had no primary key to aid with the combination of that information. Talking to the doctors and others in this space, Eric realised how alarmingly frequent it is. Common statements were “the systems weren’t designed to do that”, “we can’t do that”, etc., resulting in general frustration. That pattern in the hospital is repeated across various businesses and organisations. Eric said that there are too many important things that we as a community need right now, so we need a useful reusable infrastructure to solve various problems, and one way is to use the Web. We can bring lessons learned from the Web back into these organisations.

He then moved on to talk about some of the things we can do to make the required bridges stronger. There’s a common theme (when talking to different people and groups in health, climate change, etc.) of a requirement for such bridging technologies. A lot of the solutions exist, so we just have to stick the parts of the answer together. If we could figure out how to connect these together, then we can have a serious jump on the problem(s). Lessons from the Web (and the Semantic Web) can be applicable to managing information from these enterprise or organisational spaces.

He talked about a document analogy. A big change on the Web from several years ago was the blog. Before then, the so-called Read/Write Web had a disproportionate amount of the “read” aspect to it. People began adding little bits of structure to the creation of content in blogs. We can take advantage of likeness factors or patterns in communities (of bloggers): it’s a very powerful aspect. This little bit of structure can feed into larger communities, e.g. Technorati leverages the structure from multiple blogs.

He then talked about a music analogy. Sid Vicious did Sinatra’s “My Way”. Apple’s GarageBand reduced the technical barriers for people to reuse lyrics and music, allowing people to get more creative about how they could use each other’s data. Recently, NIN made their multi-track files available for remixing. Just as in the document analogy, this is adding more structure to the content which allows people to take this and do more with it. This also takes advantage of the network effect, by leveraging multiple community contributions across available repurposable data (not just for one song or one individual). As a result, we get services like MusicBrainz where we can also see patterns around music.

In this way, we can stop worrying so much about whether it is a spreadsheet, a database, whatever. [These are all just parts that can be brought together, and you don't have to settle on a particular format or storage mechanism to progress.]

From an action standpoint, Eric said that this corresponds to: create, publish, and analyse. For documents, the corresponding action stream is from creating a blog text to publishing on the Blogger website to mass analyses via Technorati. For music, this could be from creating a song in GarageBand to publishing via iTunes to analysis in MusicBrainz. Finally, for data, Eric showed us this process using Exhibit, Remix and Studio.

He gave a demo of Exhibit from MIT SIMILE. Exhibit is a software service for rendering data. You ship data to it and you get back a facetted navigation system. You don’t need to install a database, and you don’t have to create a business logic tier. You can style it in different ways, and look at it in different “lenses”.
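To make the idea of facetted navigation a little more concrete, here is a minimal Python sketch of what a facet does: count the values of a field across some records, and narrow the records down when a value is picked. This is only an illustration of the concept and not how Exhibit itself is implemented; the records, field names and helper functions (`facet_counts`, `filter_by`) are all made up for the example.

```python
from collections import Counter

# Hypothetical records, standing in for the data you would "ship" to a
# facetted browser. The field names are assumptions for illustration.
records = [
    {"title": "Report A", "clinic": "North", "year": 2007},
    {"title": "Report B", "clinic": "North", "year": 2008},
    {"title": "Report C", "clinic": "South", "year": 2008},
]


def facet_counts(items, field):
    """Count how many records carry each value of the given field."""
    return Counter(item[field] for item in items if field in item)


def filter_by(items, field, value):
    """Keep only the records whose field equals the selected facet value."""
    return [item for item in items if item.get(field) == value]


# The counts a facet panel would display next to each value.
clinic_facet = facet_counts(records, "clinic")

# Selecting two facet values in turn narrows the result set.
north_2008 = filter_by(filter_by(records, "clinic", "North"), "year", 2008)

print(dict(clinic_facet))
print([r["title"] for r in north_2008])
```

No database or business logic tier is involved here either: the "navigation" is just counting and filtering over the data that was handed in.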

Remix is a tool that builds on top of this. Eric is one of the PIs of the project. It ties together best-agreed components - visual interfaces, data transformation interfaces, data storage, etc. - all of this is brought together under the Remix umbrella. Eric also mentioned that Remix leverages persistent identifiers using purlz.org. These can be for people, places, concepts, network objects, anything.

He presented an example of data that an oncology nurse or doctor uses frequently, which is not in an ontology: some of it is in their head and the rest is in a spreadsheet. He showed Remix stitching together two spreadsheets from different clinics for oncology. You can stitch together fields and see if it makes sense from a data perspective. Remix has some tools for “simultaneous editing” which allows editing over patterns of data, so by editing one entry you can edit all of them. This acts like a script which can change “lastname, firstname” to “firstname lastname” without any complicated programming. You can connect anything, but it may not necessarily make sense, so there’s a need for interfaces to show users if it does make sense. Then in Exhibit, you can customise facets, views, apply different themes, etc. Within a matter of minutes, Remix gives tools that a nurse can use to not just create an interface but to publish the information to the Web so that other people can benefit from it.
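For anyone wondering what the “lastname, firstname” to “firstname lastname” transformation looks like when written out by hand, here is a rough Python equivalent. It is a sketch of the kind of script that simultaneous editing saves you from writing, not Remix’s own mechanism; the sample names and the `flip_name` helper are assumptions for illustration.

```python
import re

# Hypothetical sample column; the last entry has no comma on purpose.
names = ["Miller, Eric", "Decker, Stefan", "Cher"]

# One pattern describes the shape "lastname, firstname".
PATTERN = re.compile(r"^\s*([^,]+?)\s*,\s*(.+?)\s*$")


def flip_name(value: str) -> str:
    """Return 'firstname lastname' for 'lastname, firstname' input.

    Values that do not match the pattern are returned unchanged.
    """
    match = PATTERN.match(value)
    if match is None:
        return value
    last, first = match.groups()
    return f"{first} {last}"


# Applying the same edit to every entry in the column.
flipped = [flip_name(n) for n in names]
print(flipped)
```

The point of the demo was that the user edits one entry and the tool infers the pattern, whereas here the pattern has to be spelled out as a regular expression.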

Every bit of the transformation that has occurred here has been identified (with an identifier). Everything has become a web resource, with a framework that enables people to stitch stuff together in a resource-oriented architecture. Then this can be analysed using Studio. If Technorati provides real-time analysis of RSS feeds, Studio provides an analysis of your company or organisational data, e.g. as reports with pattern analysis. Because it’s based on RDF / SPARQL, you can create queries that are relevant to you: “show me all the most popular or least popular reports”, or “show me any reports that used some of my data”.
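The two example questions can be illustrated with a toy triple store in plain Python. This is only a sketch of the subject / predicate / object pattern matching that RDF and SPARQL are built around, and it says nothing about how Studio works internally; the identifiers (`report:1`, `usesData`, `viewCount`, `data:mine`) and the `match` helper are invented for the example.

```python
# Each fact is a (subject, predicate, object) triple, as in RDF.
# All identifiers below are hypothetical.
triples = [
    ("report:1", "usesData", "data:mine"),
    ("report:1", "viewCount", 40),
    ("report:2", "usesData", "data:other"),
    ("report:2", "viewCount", 75),
    ("report:3", "usesData", "data:mine"),
    ("report:3", "viewCount", 5),
]


def match(pattern, store):
    """Yield every triple matching the pattern; None acts as a wildcard."""
    for triple in store:
        if all(p is None or p == t for p, t in zip(pattern, triple)):
            yield triple


# "Show me any reports that used some of my data."
reports_using_my_data = sorted(
    s for s, _, _ in match((None, "usesData", "data:mine"), triples)
)

# "Show me the most popular or least popular reports."
views = {s: o for s, _, o in match((None, "viewCount", None), triples)}
most_popular = max(views, key=views.get)
least_popular = min(views, key=views.get)

print(reports_using_my_data, most_popular, least_popular)
```

A real SPARQL query would express the same patterns declaratively, with variables in place of the `None` wildcards.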

This can bring organisations into a “Linked Enterprise Data” (LED) framework. Some people may not care so much about Linked Open Data (LOD): “expose your data, and something cool is going to happen”. Rather, Eric talked about exposing your enterprise data and showing that something is going to happen right now, so that you can see the benefits in terms of solutions available immediately. LED is a big part of what they’ve been focussing on in Zepheira.

The key subtext is recognising that what we’re dealing with is hospitals, organisations etc., who can leverage lots of the standards and solutions that we’ve been using on the Web but at a larger scale. Tools like this are a critical aspect of what companies can take now and can start to use to link their data together.

Eric said that there are huge advantages for companies to not just be “on” the Web but to be “in” the Web. If employees are a company’s most important aspect, why tie their hands behind their backs and ask them to solve a particular problem without providing them with the means to do it? There’s a need to empower them, to make it easier for them to get at data, to integrate it and to share it. There are just too many problems not to address / attack them aggressively through not just one approach or representation, but by stitching various parts together.

Eric finished by challenging ten companies to try out these tools if they haven’t before, to come back to SemTech 2009 with reports, and to share each other’s knowledge. The standards and tools are robust, so it can be done.

I wish I was going to XTech 2008 in Dublin…

…but unfortunately due to a major review here next week, I have a lot of presentation preparation to do.

Anyway, if I were going to XTech 2008 tomorrow in Dublin, here’s what I’d go to see (thanks to the XTech 2008 personal scheduler):

9:45 Wednesday, 7 May 2008
Opening keynote
David Recordon (Six Apart)

11:00 Wednesday, 7 May 2008
Using socially authored content to provide new routes through existing content archives
Rob Lee (Rattle Research)

11:45 Wednesday, 7 May 2008
Browsers on the move: The year in review, the year ahead
Michael(tm) Smith (W3C)

14:00 Wednesday, 7 May 2008
Here Be Dragons: Knowing Where the World Ends
Leigh Dodds (Ingenta)

14:45 Wednesday, 7 May 2008
Linked Data Deployment
Daniel Lewis (OpenLink Software)

9:00 Thursday, 8 May 2008
OpenSocial, a standard programming model for the Social Web
Matthew Trewhella (Google)

9:45 Thursday, 8 May 2008
Creating portable social networks with microformats
Jeremy Keith (Clearleft)

11:00 Thursday, 8 May 2008
The Programmes Ontology
Tom Scott (BBC Audio and Music Interactive), Yves Raimond (Queen Mary, University of London), Patrick Sinclair (BBC Audio and Music Interactive), Nicholas Humfrey (BBC Audio and Music Interactive)

11:45 Thursday, 8 May 2008
Ni Hao, Monde: Connecting communities across cultural and linguistic boundaries
Simon Batistoni (Flickr)

14:00 Thursday, 8 May 2008
SemWebbing the London Gazette
Jeni Tennison (The Stationery Office), John Sheridan (The Office of Public Sector Information)

14:45 Thursday, 8 May 2008
Data portability for whom? Some psychology behind the tech
Gavin Bell (Nature)

16:00 Thursday, 8 May 2008
Google Data APIs on the move: innovation vs. Standards Compliance
Frank Mantek (Google)

16:45 Thursday, 8 May 2008
The attention economy is only just around the corner
Ian Forrester (BBC)

9:00 Friday, 9 May 2008
Data Portability with SIOC and FOAF
Uldis Bojārs (DERI Galway), John Breslin (DERI, National University of Ireland, Galway), Alexandre Passant (LaLIC institute (at Université Paris Sorbonne) and Electricité de France R&D)

(Here is the full schedule.)

CELT talk / WWW@15 on Morning Ireland / Ulrich Schnauss

A mixed-up blog post, but I haven’t the energy to write three separate posts, so here’s a three-in-one:

  • On Wednesday, I gave a talk at CELT, NUI Galway about “Learning via the Social Web”, which was a slightly-revised version of the one I gave in February. Again, there was an amazing turnout, and there will be a webcast made available via the CELT website at a later date. For now, you can access the PowerPoint slides here.
  • Yesterday, Damien Mulley and I were interviewed by Richard Downes on RTÉ R1 Morning Ireland about the 15th anniversary of CERN releasing the World Wide Web code for free (podcast available here; alternatively there’s an extracted clip here). I talked a little bit about the WWW versus UMn’s Gopher, and how the Web has expanded beyond the initial target audience of academics and researchers. I gave a slightly-tangential answer to a question I was asked about the importance of the Web to Ireland’s future and economy (FYI: CSO 2007 ICT stats), saying how dependent we are on the Web to do many tasks today, and describing how our work at DERI in NUI Galway will help us to deal with the current over-abundance of websites, by adding more structure to web pages so that computers can help us in finding the right information. “Are you telling me that the future of the Web [...] is being designed in Galway?”, Richard asked at one point. Yes!!! Finally, I mentioned how the problems with online video gridlock may have larger consequences as the Web is increasingly moving from the desktop to mobile devices where bandwidth is even more important, so smarter ways are needed to reduce exactly what will be sent to your phone (FYI: Opera Mini is a nice example, a tiny Java browser that works on most phones where the content is pre-filtered server-side before it gets to you).
  • Last night, I went along with friend Conrad to see Ulrich Schnauss at Stress in DeBurgo’s here in Galway. Although I missed the encore (it had been a long day, with a nine-hour session at work), I really enjoyed the night and the support acts: Beatpoet was great playing on his mono-something device, and Airiel were pretty good too :)

Slides from the SIOC tutorial at WWW2008

Here are the PowerPoint slides from our tutorial on “Interlinking Online Communities and Enriching Social Software with the Semantic Web” at the World Wide Web Conference in Beijing - you can also download them from here:

The tutorial went well: it was hot in the room and we were a bit jetlagged, but we had some good feedback afterwards and about 30 people attended in all.

I had a nice few days in Beijing, participating in the W3C advisory committee meeting on Sunday, Monday and Tuesday, giving our SIOC tutorial with Alex and Uldis on Monday afternoon, popping along to our paper at the Linked Data on the Web workshop on Tuesday, attending some sessions on Wednesday (Kai-Fu Lee’s plenary keynote on Cloud Computing, the discussion panel with Lada Adamic et al. on the Future of Online Social Interactions, the W3C Open Your Data! track, and a packed session on Social Networks: Discovery and Evolution of Communities). On Thursday, I gave a talk about DERI at Tsinghua University to Cemon Yang and his team at the Digital Government / Web and Software Research Centre. Thursday evening we had the banquet in the Great Hall of the People, and I headed back to Ireland on Friday.

Unfortunately I saw little of Beijing outside of travelling between venues in taxis and buses, so I have a good reason to return and see / do more next time…

WWW2008 Beijing: Dr. Kai-Fu Lee (Google) - “Cloud Computing”

Kai-Fu Lee is Vice President of Engineering at Google, and President of Google Greater China. He joined Google in 2005. Earlier in his career, he developed the first speaker-independent continuous speech recognition system, for which he won a Business Week award in 1988.

He started by talking about the “people theme”, saying that this is what the (Chinese) Internet is all about. (For April Fool’s Day, Google China announced that they were going to shut down their servers to save electricity, and that they would have to hire 25 million people to do their searches for them. They got 1,800 resumes for the positions.)

There are 235 million people on the Internet in China. What do these people want? Kai-Fu listed these things: accessibility, shareability, freedom (data wherever they are), simplicity, and security. Google believes that cloud computing solves a lot of these problems. It’s not new, so Google are just a part of it like we all are. But day by day, cloud computing is changing the way we use the Internet.

He then explained a little bit about what the Cloud is. Data is stored in the Cloud, on some server somewhere that is not necessarily known by the user, but it’s just there and accessible. Software and services are also moving to the Cloud, usually accessible via a full-featured web browser on the client device. He also advocated the use of open standards and protocols, which he says are “liked” by Google (e.g. Linux, AJAX, LAMP, etc.) so as to avoid control by one company. Finally, the Cloud should be accessible from any device, especially from phones. He said that when the Apple iPhone hit the market, they found that web usage from that device was 50 times greater than that from other web-capable phones, and that Google’s servers really felt it.

Next up was a history lesson on cloud computing. The PC era was hardware centric. Then, the client-server era was more software centric, which was great for enterprise computing. Cloud computing now abstracts that server and makes it very scalable, by hiding complexities, and with the server being anywhere. This is service centric.

Banks too have become “Clouds”, allowing people to go to any ATM and remove money from their bank wherever they are. Electricity can be thought of similarly, as it can come from various places, and you don’t have to know where it comes from: it just works.

Driving forces behind cloud-based computing include: (i) the falling cost of storage, (ii) ubiquitous broadband, and (iii) the democratisation of the tools of production. This is beginning to make cloud-based computing more like a utility. A lot of this is due to work in the 1990s at IBM and DEC, who realised that computing should be a utility. It is only now that these three key things are in place that this is becoming a reality.

There are six further properties that make this area exciting, being: (1) user centric, (2) task centric, (3) powerful, (4) accessible, (5) intelligent, (6) programmable.

(1) User centric. The data moves with you, and the application moves with you. People don’t want to reload their address book or applications on new machines, as it is painful to do. For example, how bad do you feel if you drop or break your laptop? How easy is it to switch your cellphone? It’s hard, because synchronising your data is usually hard to do. The IR functionality on a mobile is not easy to use / user centric: how often do people use it to back up stuff to their laptops?

If data is all stored in the Cloud - images, messages, whatever - once you’re connected to the Cloud, any new PC or mobile device that can access your data becomes yours. Not only is the data yours, but you can share it with others (e.g. on Picasa Web, your photos are stored in the Cloud). You don’t have to worry about where it is. We’re not there just yet, but the time is approaching where the way we deal with photographs will change. Another example is GMail, as you can use it on any device (since large storage is not required on the device). Kai-Fu bets that everyone in the room has some kind of cloud computing-based e-mail.

PCs are normally our window to the world, but mobile devices can do more. Since services know who you are and where you are (eek!), they can give you more targeted content. There are 600 million cellphone users in China, three billion worldwide, dwarfing the number of PCs that are Internet-accessible. Intelligent mobile search is useful for cellphones, giving you local listings and results relevant to your context. The most powerful and popular application is maps, especially when people get lost, or if they spontaneously want to go somewhere. Maps are more than the traditional flat piece of paper, allowing you to search nearby, see real-time traffic flows, etc. Such mashups provide even more power - calling these integrations a map is a misnomer - the capabilities are enormous. As there’s a move from e-mail usage towards maps and photos, these new applications have to go into the Cloud as well. And with the shift in this direction, another question is how do you make this economic?

Instant information sharing is also important, e.g. via Google Docs, Page Creator, etc. Recently, Google Sites was released - Google hosts it all for you, so there’s no need for you to buy servers or hosting - 50,000 sites were set up in the first few hours after it began. Not only can you access the data, but you can create it anywhere. The browser is the platform.

(2) Task centric. The applications of the past - spreadsheets, e-mail, calendar - are becoming modules, and can be composed and laid out in a task-specific manner. For example, a task may be teachers creating a departmental curriculum, where you can see the people viewing the curriculum spreadsheet and they can have debates in parallel in real time. Spreadsheet editing allows collaboration and publishing to a selected group of people, with version control.

Google considers communication to be a task, such that in GMail you see pop-up chats and chat histories which provide zero-latency discussions combined in communications tasks. If you want, you can have real-time discussions instead of waiting for e-mail responses if people are online in the contacts list. You can also organise all of your common tasks, e.g. using iGoogle’s widgets portal.

(3) Powerful. Having lots of computers in the Cloud means that it can do things that your PC cannot do. For example, Google Search is faster than searching in Windows or Outlook or Word. Of course, Google Search has to be much faster, even though there are many more documents. In terms of how much storage is required, if there are 100 billion pages at 10 kB per page, that’s about 1000 TB of disk space. Cloud computing should have an infinite amount of disks / computation at its disposal. When you issue a query to the Google web search engine, it queries at least 1000 machines (potentially accessing 1000s of terabytes).
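The storage figure is easy to check with a few lines of Python, assuming decimal units (1 kB = 1,000 bytes and 1 TB = 10^12 bytes), which is my assumption rather than something stated in the talk:

```python
# Figures quoted in the talk.
PAGES = 100 * 10**9          # 100 billion pages
BYTES_PER_PAGE = 10 * 10**3  # 10 kB per page (decimal kilobytes assumed)

# Decimal terabyte; an assumption, since the talk did not specify units.
TB = 10**12

total_bytes = PAGES * BYTES_PER_PAGE
total_tb = total_bytes // TB

print(total_tb)  # 1000
```

So 100 billion pages at 10 kB each comes to 10^15 bytes, which is 1000 TB (or one petabyte).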

(4) Accessible. Universal search (“searchology”) was announced by Google last year. Traditional web page search does IR / TF-IDF / page rank stuff pretty well on the Web at large, but if you want to do a specific type of search, for restaurants, images, etc., web search isn’t necessarily the best option. It’s difficult for most people to get to the right vertical search page in the first place, since they usually can’t remember where to go. Universal search is basically a single search that will access all of these vertical searches.

This search requires simultaneously querying and searching over all the specific databases: news, images, videos, tens of such sources today, with potentially hundreds and thousands of them in the future. There are lots of these simultaneous searches which then get ranked, so it is even more computing intensive than current web search.
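The fan-out-then-rank pattern described here can be sketched in a few lines of Python. This is a toy illustration under my own assumptions, not Google’s implementation: the three “vertical” indexes, their relevance scores and the function names (`search_vertical`, `universal_search`) are all invented.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical vertical indexes: each maps a document to a made-up
# relevance score. Real verticals would be separate search services.
VERTICALS = {
    "news": {"galway festival opens": 0.7, "stock markets fall": 0.1},
    "images": {"galway bay photo": 0.9},
    "videos": {"galway street music": 0.4},
}


def search_vertical(name, query):
    """Search one vertical and return (score, vertical, document) hits."""
    index = VERTICALS[name]
    return [(score, name, doc) for doc, score in index.items() if query in doc]


def universal_search(query):
    """Query every vertical concurrently, then rank the merged hits."""
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(search_vertical, name, query) for name in VERTICALS]
        hits = [hit for future in futures for hit in future.result()]
    # Highest score first, regardless of which vertical the hit came from.
    return sorted(hits, reverse=True)


results = universal_search("galway")
for score, vertical, doc in results:
    print(score, vertical, doc)
```

Even in this tiny version you can see why the approach is compute intensive: every query touches every vertical, and the merge step has to compare scores that came from different sources.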

(5) Intelligent. Data mining and massive data analysis are required to give some intelligence to the masses of data available (massive data storage + massive data analysis = Google Intelligence).

In their machine translation work for translate.google.com, a trillion words were collected from bilingual and monolingual text, and they wanted to not only find various orders of words but also the mappings of words. Statistical models of translation were trained, and they saw how an English-Chinese pair could be aligned. Then, they needed to extract phrases and collect statistics (e.g. how often variations of a certain translation were being used, such as variations for latest / last / newest / most recent). As more training data is added, the quality improves. Context is also an important matter for consideration, and it provides an advantage for the phrase analysis part of Google’s translators. There are estimates that their translator is equivalent to a high-school student’s level of translation quality.

Lots of data can be processed by machine analysis to generate intelligence. But this needs to be combined with humans - via their collaboration and contributions - to change a mess / mass of photos or data or whatever into a very powerful combination. People and tools together can create intelligent knowledge. Applications like Google Earth are much more useful when people can contribute to them, e.g. by National Geographic sticking loads of high-res photos into it. Reviews, 3-D buildings, etc. can turn a tool from a bunch of pictures into something special. Creativity adds connections to data-centric applications, enabling intelligent combinations of content.

With all this data comes the issue of server costs. If you are trying to choose between buying $42,000 high-end servers or cheap PC-class servers for $2,500 each, you can get 33 times cost efficiency by going for the PC-class servers. You can get a 1000 CPU PC-class cluster for the same price as a high-end 64 CPU server, with possibly 30 times the performance (figures may be out of date).

Even though there is a lower cost, there still needs to be high reliability. Google search is mainly based on low-cost commodity PCs running Linux. Failures are expected in every system every day. If we assume that there are 20,000 machines, there’s typically a failure rate of 110 per day. Google has built a custom software layer that can tolerate failure. (They have also deployed a new data centre in just three days.)
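To put the quoted failure figures in perspective, here is a quick back-of-the-envelope calculation in Python. It assumes failures are spread evenly across machines and across the day, which is my simplification and not something claimed in the talk:

```python
# Figures quoted in the talk.
MACHINES = 20_000
FAILURES_PER_DAY = 110

# Fraction of the fleet failing on a given day.
daily_rate = FAILURES_PER_DAY / MACHINES

# Average number of days before any one machine fails,
# assuming failures are evenly spread (a simplification).
days_between_failures_per_machine = MACHINES / FAILURES_PER_DAY

# Average gap between failures somewhere in the fleet, in minutes.
minutes_between_fleet_failures = 24 * 60 / FAILURES_PER_DAY

print(f"{daily_rate:.2%} of machines fail per day")
print(f"one machine fails roughly every {days_between_failures_per_machine:.0f} days")
print(f"the fleet sees a failure roughly every {minutes_between_fleet_failures:.1f} minutes")
```

In other words, each individual machine is fairly reliable (about one failure every six months), but at this scale something is breaking about every 13 minutes, which is why the software layer has to tolerate it.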

(6) Programmable. This follows on from the previous description of data requirements. How does one program for 10,000 “flaky servers” in a Google farm? There needs to be: (i) fault tolerance, (ii) distributed shared memory (no single machine can store every web page in yahoo.com, so multiple machines are required), and (iii) new programming paradigms for storing stuff.

For (i) fault tolerance, Google uses GFS or distributed disk storage. Every piece of data is replicated three times. If one machine dies, a master redistributes the data to a new server. There are around 200 clusters (some with over 5 PB of disk space on 500 machines).
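Here is a toy Python model of the replicate-three-times-and-repair idea. It is a sketch under my own assumptions and not GFS’s actual placement or recovery algorithm: the `ToyMaster` class, the “least loaded server” placement rule and the server and chunk names are all invented, and it assumes there are always more than three live servers.

```python
REPLICAS = 3  # each piece of data is kept on three servers, as in the talk


class ToyMaster:
    """Tracks which servers hold each chunk and repairs after a failure."""

    def __init__(self, servers):
        self.servers = set(servers)
        self.placement = {}  # chunk name -> set of servers holding a copy

    def _least_loaded(self, exclude):
        """Pick the live server holding the fewest chunks (hypothetical rule)."""
        load = {s: 0 for s in self.servers if s not in exclude}
        for holders in self.placement.values():
            for server in holders:
                if server in load:
                    load[server] += 1
        return min(sorted(load), key=load.get)

    def store(self, chunk):
        """Place REPLICAS copies of a chunk on distinct servers."""
        holders = set()
        while len(holders) < REPLICAS:
            holders.add(self._least_loaded(exclude=holders))
        self.placement[chunk] = holders

    def fail(self, server):
        """Remove a dead server and re-replicate the chunks it held."""
        self.servers.discard(server)
        for holders in self.placement.values():
            if server in holders:
                holders.discard(server)
                holders.add(self._least_loaded(exclude=holders))


master = ToyMaster(["s1", "s2", "s3", "s4", "s5"])
for chunk in ["c1", "c2", "c3"]:
    master.store(chunk)

master.fail("s1")
print(master.placement)
```

After `s1` dies, every chunk is back to three copies on the surviving servers, which is the property the master is there to maintain.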

The “Big Table” is used for (ii) distributed memory. The largest cells in the Big Table are 700 TB, spread over 2000 machines.

MapReduce is the solution for (iii) new programming paradigms. It cuts a trillion records into a thousand parts on a thousand machines. Each machine will then load a billion records and will run the same program over these records, and then the results are recombined. While in 2005, there were some 72,000 jobs being run on MapReduce, in 2007, there were two million jobs (use seems to be increasing exponentially). Not everything is suitable for MapReduce, e.g. parallelising SVMs. Matrix operations can’t be split and re-glued together easily. For this, they use Incomplete Cholesky Factorisation.
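The split, run-the-same-program, recombine pattern can be shown with a small word-count example in Python. This is a single-process sketch of the general idea rather than Google’s MapReduce API: the function names (`split`, `map_phase`, `reduce_phase`) and the sample records are my own, and the “machines” are just list slices processed one after another.

```python
from collections import Counter
from functools import reduce

# The arithmetic from the talk: a trillion records over a thousand
# machines is a billion records per machine.
assert 10**12 // 10**3 == 10**9


def split(records, parts):
    """Cut the records into roughly equal parts, one per 'machine'."""
    return [records[i::parts] for i in range(parts)]


def map_phase(chunk):
    """The same program run on every part: count the words in its records."""
    return Counter(word for record in chunk for word in record.split())


def reduce_phase(partials):
    """Recombine the per-machine results into one total."""
    return reduce(lambda a, b: a + b, partials, Counter())


# Hypothetical input; three records spread over three 'machines'.
records = ["the cloud", "embrace the cloud", "the web"]
partials = [map_phase(chunk) for chunk in split(records, 3)]
totals = reduce_phase(partials)

print(totals.most_common())
```

Word counting splits cleanly because each record can be processed on its own and the partial counts simply add up, which is exactly what is not true of the matrix operations mentioned above.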

Cloud computing needs new skills, especially when working with tens of thousands of machines as opposed to just one. The Academic Cloud Computing Initiative in the US and China (at Tsinghua) was launched by Google and IBM. Cloud computing is not just for web-based problems, but it can help provide solutions for scientific problems that were previously very hard to solve.

In terms of benefits, everything should just work, changing the way we work and play. IT should become “simple and safe”, by outsourcing IT to a “trusted shop” via a browser. Entrepreneurs should have new opportunities with this paradigm shift, being freed from monopoly-dominated markets as more cloud-based companies evolve that are powered by open technologies. Governments should leverage such “innovation-enabling platforms”, where people can effectively program tens of thousands of machines themselves. With $540 million of venture capital infused into China last year, Kai-Fu sees cloud-based computing as being a catalyst of economic growth. He finished up saying that cloud computing has arrived. “Embrace the Cloud!”

There was one question from the audience. The questioner said that Kai-Fu made cloud computing sound simple (i.e., it was well explained, not that the technologies or efforts were trivial). He asked what the societal change is, rather than the technological change. Assuming we have cloud-based computing, how can we start to encourage “cloud thinking” within society? The questioner works with universities looking at open access, trying to encourage people to share their intellectual outputs, but believes it is difficult to persuade knowledge workers to move their work into the Cloud. His question was, what can we do to encourage cloud thinking and “cloud knowledge”?

Kai-Fu’s answer was firstly that cloud computing is not simple, rather it is incredibly complex, but we can learn from what has happened so far. There have been efforts to categorise world knowledge, e.g. Cycorp, which Kai-Fu said has not resulted in a success yet (however, I’ll note here that they are becoming part of the Linked Data initiative: as Kingsley Idehen said yesterday, “Yoda is awake”!). There has been some success in various question answering systems with pieces of knowledge that can be mined and found. He stated that these were the two extremes, but believes that the answer lies somewhere in the middle: some organisation, but not too much. Wikipedia is a step in this direction, so he suggested bringing the question answering approach and the Wikipedia approach closer together.

He said that two things would be required. Firstly, he saw the need for some kind of translation capability. There is so much knowledge in English, which spoils native English speakers. In China, people are also spoiled. However, for many other countries, there is very little local language content. If auto translation doesn’t work well, some kind of assisted translation is required. Secondly, there should be mobile endeavours to make knowledge available. There may also need to be some economic incentive for people to create and share content via their mobiles.

(More reviews at 1, 2 and 3.)

WebCamp SNP and BlogTalk 2008 approacheth…

I’m in Cork with a posse of eight from DERI, and it’s the night before two co-located events: the WebCamp workshop on social network portability (Sunday) and the BlogTalk conference on social software (Monday, Tuesday). Others that have arrived in Cork this evening include Niall Larkin, Ajit Jaokar, Aral Balkan, Ben Ward, Dan Brickley, Ross Duggan and Stephanie Booth.

I’m really looking forward to the talks, the discussions, the networking, the food, and some positive outcomes from the next three days. And with invited speakers of this quality, I know it’s going to be good.

Unfortunately, I’m missing the Irish Blog Awards for the second year running, but boards.ie’s Managing Director Gerry Shanahan is representing us as a sponsor. At least I hope to meet up with many of the bloggers at tomorrow night’s optional blogger’s dinner at Rossini’s here in Cork (43 people have signed up).

More blog posts about the events will be available via the tags webcampsnp and blogtalk2008.

Five days left to register online for BlogTalk 2008!

Please note that online registration for BlogTalk 2008 (and WebCamp Social Network Portability) will close next Wednesday, 26th February 2008.

You can register at Amiando.

There are a few discount codes out there.

(Don’t forget to sign up for the optional blogger’s dinner as well!)

“A funny thing happened on the way to the forum”: Article in Indo about 10 years of boards.ie

Irish Independent > Business > Technology > A funny thing happened on the way to the forum
After 10 years, John Breslin’s online forum on everything from personal relationships to motors and mustard, boards.ie, is still blazing a trail

By Marie Boran
Thursday February 14 2008

Want to know where you can buy the cheapest digital camera, or how to go about claiming rent relief, or maybe if buying cowboy boots would be a fashion disaster?

The world relies on Google but the Irish have boards.ie. On this online bulletin board no question is too trivial or too bizarre and with an average 900,000 visitors to the site every month, there are plenty of answers on offer.

It is hard to believe that a decade ago, on 12 February, 1998, boards.ie founder John Breslin wrote expectantly: “The first of many messages, I hope.”

Read more…

Of course, there are four other people who have made boards.ie possible: Tom Murphy, Dan King, Gerry Shanahan, and Jerry Connolly. Without them and our amazing team of voluntary moderators, I doubt boards.ie would even exist today. Original questions and answers follow.
