13 November 2007
Moved...
I will continue to blog together with my fellow PhD students at the ILPS PhD blog. Enjoy :)
27 July 2007
SIGIR 2007
This week had all the extremes! Organizing an event such as SIGIR takes a lot of preparation and, even then, things don't always work out the way they should: projectors breaking down, electricity not working, student volunteers not showing up, and so on. But it was also a fantastic week, with attendance numbers skyrocketing, excellent papers, beautiful locations, and a lot of happy, very happy participants. Time was something I didn't have enough of over the last few weeks. Besides co-organizing SIGIR, I also submitted a paper and three TREC runs. With little sleep, I'm now happy that it's all over. Almost time for holidays... Can't wait until Singapore...
22 May 2007
GridLucene
Judging by the number of e-mails regarding GridLucene I've received recently, there seems to be a steadily growing interest in doing Information Retrieval (IR) in a grid environment. Other people have already taken steps towards some form(s) of standardization (see e.g. the OGF's Grid Information Retrieval WG).
My personal experience is a mixed one. Current grid conceptions have mainly focused on computationally intensive tasks/problems, but not so much on data-intensive ones. They especially suit tasks which (a) can be easily decomposed into subtasks and (b) whose subtasks have little or no dependency on each other.
The way current-day grids and protocols are designed is such that (sub)tasks should be allowed to die without any result. Indeed, usually a number of initiated jobs do not make it to the end and need to be re-initiated. The number varies per grid, task, and time of day, but it is something to take into account. This is fine when you want robustness with respect to the overall task (nodes chip in and drop out at will, without any harm done to the overall progress), but in real-life IR applications, this is not a healthy situation. And of course there's the communication overhead. If you want to index a set of documents on a grid, you'll first need to get the documents onto a grid node, and later get the index off again.
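To make the "jobs die and need to be re-initiated" point a bit more concrete, here is a minimal Python sketch of the resubmission loop. It is only an illustration under stated assumptions: the names run_with_resubmission and make_flaky_job are hypothetical, a job that returns None stands in for one that died without a result, and nothing here talks to a real grid middleware or to Lucene.

```python
def run_with_resubmission(tasks, run_job, max_attempts=5):
    """Run independent subtasks, resubmitting any that die without a result.

    `run_job(task)` is assumed to return None when the job died, and any
    other value when it finished. Returns (results, attempts per task).
    """
    results = {}
    attempts = {task: 0 for task in tasks}
    pending = list(tasks)
    while pending:
        still_pending = []
        for task in pending:
            attempts[task] += 1
            outcome = run_job(task)
            if outcome is not None:
                results[task] = outcome      # job made it to the end
            elif attempts[task] < max_attempts:
                still_pending.append(task)   # died: re-initiate next round
            else:
                raise RuntimeError(
                    f"task {task!r} died {max_attempts} times, giving up"
                )
        pending = still_pending
    return results, attempts


def make_flaky_job(deaths_before_success):
    """Hypothetical stand-in for a grid job: each task dies a fixed number
    of times before it finally returns a (fake) index name."""
    seen = {}

    def job(task):
        seen[task] = seen.get(task, 0) + 1
        if seen[task] <= deaths_before_success:
            return None                      # job died without any result
        return f"index-for-{task}"

    return job


if __name__ == "__main__":
    batches = ["batch-0", "batch-1", "batch-2"]
    results, attempts = run_with_resubmission(batches, make_flaky_job(2))
    print(results)
    print(attempts)
```

The loop only works this simply because the subtasks are independent, so a batch that died can be resubmitted without touching the others. What the sketch leaves out is exactly the expensive part for IR: every resubmission means shipping the documents to a node again.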
The Lessons Learned from another grid IR project confirm these observations. However, I agree with the authors and still believe there's merit in chasing the original goal. The inherently open, services-based architecture should allow us to go beyond find-what-I-need IR towards a more exploratory mode. Thus, somewhere in the not-too-distant future, we'll start working on making Lucene GT4-compliant. Stay tuned...