Edgar's blog: GridLucene

Judging by the number of e-mails regarding GridLucene I've received recently, there seems to be a steadily growing interest in doing Information Retrieval (IR) in a grid environment. Other people have already taken steps towards some form of standardization (see e.g. the OGF's Grid Information Retrieval WG).

My personal experience is a mixed one. Current grid designs have mainly focused on computationally intensive tasks rather than data-intensive ones, and especially on tasks that (a) can easily be decomposed into subtasks and (b) have subtasks with few or no dependencies on each other.

Current-day grids and protocols are designed so that (sub)tasks are allowed to die without producing any result. Indeed, usually a number of initiated jobs do not make it to the end and need to be re-initiated. The number varies per grid, task, and time of day, but it is something to take into account. This is fine when you want robustness with respect to the overall task (nodes chip in and drop out at will, without any harm done to the overall progress), but in real-life IR applications this is not a healthy situation. And of course there's the communication overhead. If you want to index a set of documents on a grid, you'll first need to get the documents onto a grid node, and later get the index off again.
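To make the re-initiation point a bit more concrete, here is a minimal sketch in Python of what the submitting side ends up doing. It is not GridLucene code and does not use any grid middleware API; `run_with_resubmission`, `flaky_index_job`, and the batch names are all hypothetical, and a job "dying" is simulated by raising an exception.

```python
from typing import Callable, Dict, List


def run_with_resubmission(
    tasks: List[str],
    run_job: Callable[[str, int], str],
    max_attempts: int = 3,
) -> Dict[str, str]:
    """Run independent subtasks, resubmitting any that die without a result."""
    results: Dict[str, str] = {}
    pending = list(tasks)
    for attempt in range(1, max_attempts + 1):
        still_pending = []
        for task in pending:
            try:
                results[task] = run_job(task, attempt)
            except RuntimeError:
                # The job died without a result; queue it for the next round.
                still_pending.append(task)
        pending = still_pending
        if not pending:
            break
    if pending:
        raise RuntimeError(f"gave up on: {pending}")
    return results


def flaky_index_job(batch: str, attempt: int) -> str:
    """Hypothetical indexing job: 'batch-b' dies on its first attempt."""
    if batch == "batch-b" and attempt == 1:
        raise RuntimeError("node dropped out")
    return f"index-of-{batch}"


indexes = run_with_resubmission(["batch-a", "batch-b", "batch-c"], flaky_index_job)
```

Because the subtasks are independent, resubmitting one costs nothing but time. For an index that users are querying, that extra round is exactly the unpredictability I am complaining about.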

The Lessons Learned from another grid IR project confirm these observations. However, I agree with its authors and still believe there's merit in chasing the original goal. The inherently open, services-based architecture should allow us to go beyond find-what-I-need IR towards a more exploratory mode. Thus, somewhere in the not-too-distant future, we'll start working on making Lucene GT4-compliant. Stay tuned...

22 May 2007
