Expected Ratio of Relevant Units:
A Measure for Structured Information Retrieval
Benjamin Piwowarski
Patrick Gallinari
LIP 6, Paris, France
LIP 6, Paris, France
ABSTRACT

Since the 60's, evaluation has been a key problem for Information Retrieval (IR) systems and has been extensively discussed in the IR community. New IR paradigms, like Structured Information Retrieval (SIR), make classical evaluation measures inappropriate. A few tentative extensions to these measures have been proposed but are also inadequate. We propose in this paper a new measure which is a generalisation of recall. This measure takes into account the specificity of SIR, where the elements to be retrieved are linked by structural relationships. We show an instantiation of this measure on the INEX database and present experiments to show how well it is adapted to SIR evaluation.

INTRODUCTION

Information Retrieval systems aim at retrieving documents that are relevant to a given user information need. The notion of relevance is not only ill-defined and ambiguous [13, 9], it is also user specific. The evaluation of IR systems appeared very early as a key problem of IR. Cleverdon's experiments on the Cranfield collection [3] were the first experiments that justified the development of entirely automatic IR systems. Evaluation is useful for comparing different systems and is used to justify theoretical and/or pragmatic developments of IR systems.

Many different parameters can be used in order to measure the performance of an IR system, like for example the time and space taken by the system to answer the query and the user effort needed to find relevant documents. Swets [14] was the first to clearly state how a metric should be designed in order to provide an objective evaluation of IR systems: a measure should only reflect the ability of the system to discriminate relevant documents from irrelevant ones.

A number of hypotheses are also necessary (even if they remain implicit) to develop evaluation measures. We can distinguish two kinds of hypotheses: those which are necessary for the computation of the measure and those which are priors on user behaviour. Examples of typical assumptions are the following: (1) the user follows the ordered list of retrieved elements beginning with the first element; (2) a relevant document is still relevant even if the user has already seen the same information in another document higher in the retrieved list. We will make such hypotheses explicit when describing our measure.

There are many different approaches to IR evaluation [15, 1]. The expected search length [4] measures the number of irrelevant documents a user will consult before finding a certain number of relevant documents. Some measures are based on the definition of a metric over some predefined statistics [2, 15], some derive from rank correlation [10]. But the most famous measures in IR are recall and precision. Recall is defined as the ratio of the number of relevant documents that are retrieved to the total number of relevant documents. Precision is the ratio of the number of relevant documents that are retrieved to the total number of retrieved documents.

Raghavan et al. [12] proposed a probabilistic version of recall-precision, which is not inconsistent as standard precision/recall can be, especially when documents are not fully ordered. We will not define their measure more precisely here. Instead, we will detail an extension of precision and recall to the case of a non-binary relevance scale, as it was used to evaluate Structured Information Retrieval systems in the 2002 INEX workshop. This extension was proposed by Kekäläinen and Järvelin [7]. In that case, the set R of relevant documents is defined in a fuzzy way: a document can be more or less relevant. When the document is highly relevant, it will be in the set of relevant documents with a degree of 1. When the document is not relevant, it will be in this set with a degree of 0. Every value between 0 and 1 is a measure of the relevance of the document. This scale thus generalises the classic binary scale (relevant/not relevant) that is used in IR. Let us denote j(d) the degree with which the document d belongs to the relevant set of documents for a given query. Then, recall and precision are computed as:

recall = ( Σ_{d ∈ L} j(d) ) / ( Σ_{d ∈ E} j(d) )        precision = ( Σ_{d ∈ L} j(d) ) / N

where N is the number of documents in the list, E is the set of documents and L is the set of documents in the list. These two formulas generalise standard recall-precision: when j(d) only takes the values 0 or 1, they give the same results.
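The two formulas above translate directly into code. The following is a minimal sketch, assuming documents are identified by arbitrary ids and j(d) is supplied as a dictionary; the function and variable names are ours, not from the paper.

```python
def generalised_recall_precision(retrieved, relevance):
    """Generalised recall and precision with graded relevance degrees j(d) in [0, 1].

    retrieved: the list L of documents returned by the system (length N).
    relevance: dict mapping every document d of the collection E to its degree j(d).
    """
    total = sum(relevance.values())                                  # sum over d in E of j(d)
    retrieved_mass = sum(relevance.get(d, 0.0) for d in retrieved)   # sum over d in L of j(d)
    n = len(retrieved)                                               # N
    recall = retrieved_mass / total if total else 0.0
    precision = retrieved_mass / n if n else 0.0
    return recall, precision


# Example with graded degrees (with a binary j(d) the result is standard recall/precision):
j = {"d1": 1.0, "d2": 0.5, "d3": 0.0}
print(generalised_recall_precision(["d1", "d3"], j))  # -> (0.666..., 0.5)
```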
In this paper, we propose a measure to evaluate SIR systems. We will first introduce the new problem of SIR. We will show how standard recall/precision have been extended to evaluate such systems and why this extension is not well adapted to SIR evaluation. We will then introduce a new measure which is related to recall. We will compare our measure and the precision/recall extension on stereotypical systems, using the corpus provided by INEX1.

EVALUATION AND STRUCTURED INFORMATION RETRIEVAL
Atomic units are usually documents in classical IR. With the current growth of structured documents2, the atomic unit is no longer the whole document but any logical element in the document. We will call such an element a doxel (for DOCument ELement) in the remainder of this paper. Compared to IR on unstructured collections, Structured Information Retrieval (SIR) should not focus on returning documents but the smallest doxel that contains the answer to the query. While a query can be free text only, as in standard IR (using the INEX terminology, these are Content Only queries, CO for short), a query can also specify constraints on both the structure and the content (these are called Content And Structure queries, CAS for short). We are interested in the evaluation of systems that answer CAS and CO queries, but we will focus here mainly on CO. We will say that a good answer (the smallest doxel) is SIR-relevant, to distinguish this notion from usual relevance.

2Where the textual (or multimedia) content of the document is usually organised in a tree.

Our work was greatly influenced by the recent INEX initiative [6]. In this section, we describe briefly how SIR systems were evaluated in INEX 2002, which was the first initiative where a corpus of assessed XML documents was built. We will show why the current evaluation methodology is not well suited for SIR.

Let us first describe the INEX scale used for the user assessments. This scale is neither binary nor between 0 and 1, but two-dimensional. The first dimension is related to the extent to which the element is relevant. The relevance does not take into account the non-relevant part of the doxel, even if that part is 99% of the doxel. For example, the common ancestor of the whole database will be considered as highly relevant even if only a small paragraph is highly relevant. In INEX'02, four levels of relevance were distinguished: the doxel can be irrelevant (0) if it does not contain any information about the topic of the request; marginally relevant (1) if it mentions the topic of the request, but only in passing; fairly relevant (2) if it contains more information than the topic description, but this information is not exhaustive; highly relevant (3) if it discusses the topic of the request exhaustively.

The second dimension, coverage, is specific to structured documents. Document coverage describes how much of the document component is relevant to the request topic. Again, there are four levels: no coverage (N) when the query topic is not a theme of the document component; too large (L) when the topic is only a minor theme of the document component; too small (S) when the topic or an aspect of the topic is the main or only theme of the document component, but the component is too small to act as a meaningful unit of information; finally, exact coverage (E) when the topic is the main theme of the doxel.

The two dimensions are not fully independent: a non-relevant element (0) must have no coverage (N). There are thus only 10 different values in this scale (and not 16). In the remainder of this paper, JINEX denotes this set of 10 values. Each of these values is a digit (relevance) followed by a letter (coverage); thus, 2E means "fairly relevant with exact coverage". Within this scale, the doxels that should be returned by a perfect SIR system are all the doxels with an exact coverage, beginning with those with high relevance: in the case of the INEX scale, SIR-relevant doxels are those that have an exact coverage. Doxels with too small or too big coverage in this scale are considered not relevant. The motivation is that exact doxels are the doxels a user is searching for, while "too small" doxels are contained in an "exact" doxel and "too big" doxels contain an "exact" doxel.

Table 1: Quantisations are used to convert an assessment from the INEX scale JINEX to a binary or real scale used to compute recall and precision. In INEX, two quantisations were proposed: fs is a "strict" quantisation, fg is a "generalised" quantisation.

fs(j) = 1 if j = 3E, and 0 otherwise

fg(j) = 1     if j = 3E
        0.75  if j ∈ {2E, 3L, 3S}
        0.5   if j ∈ {1E, 2L, 2S}
        0.25  if j ∈ {1S, 1L}
        0     otherwise
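The two quantisations are simple lookups on the JINEX values. The sketch below mirrors the reconstructed Table 1; the 0.5 tier is inferred from the table's ordering rather than stated explicitly in the surrounding text, and the identifiers are ours.

```python
# Dictionary-based sketch of the two quantisations of Table 1.
# Any assessment not listed (in particular the non-relevant value 0N) maps to 0.
F_STRICT = {"3E": 1.0}

F_GENERALISED = {
    "3E": 1.0,
    "2E": 0.75, "3L": 0.75, "3S": 0.75,
    "1E": 0.5, "2L": 0.5, "2S": 0.5,
    "1S": 0.25, "1L": 0.25,
}

def quantise(assessment, table):
    """Map a JINEX assessment such as "2E" to a value on the binary or real scale."""
    return table.get(assessment, 0.0)

# quantise("2E", F_STRICT) -> 0.0   quantise("2E", F_GENERALISED) -> 0.75
```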
LIMITS OF CURRENT MODELS

The first measure proposed in INEX 2002 was standard recall and precision (i.e. using fs, see Table 1). In this case, only doxels with exact coverage and high relevance (in the INEX scale) are the relevant elements (for the binary scale). A system that always returns a near match will have a recall and a precision of 0. This should be avoided since the task complexity is very high. Moreover, when one is assessing the corpus, one can find it difficult to give the exact match to one doxel rather than to a smaller one. For example, the list element in INEX often contains only one paragraph; the textual content of both elements (list and paragraph) is thus the same. It is impossible to make a choice, and if we give an exact coverage to both, a SIR system will have to return both elements in order to have a perfect recall.

In order to cope with that problem, Gövert [5] proposed to add some relevance to neighbouring doxels, using fg to convert an assessment from the INEX assessment scale to a value between 0 and 1. A highly relevant doxel with an exact match will have a relevance of 1 in the [0, 1] scale. Some of the doxel's neighbours will also have a non-null relevance: its ancestors – within the document boundary – will have a relevance of 0.75 (too big); some of its children will have a relevance of 0.25 (too small). Non-relevant doxels will have a relevance of 0. This choice might seem better than the first one, but is still not adequate:

• For every SIR-relevant doxel, there will be a new set of IR-relevant doxels. To give an example of what this implies, consider a system that returns a doxel and two ancestors: this system will have a recall of 2.25, which is better than a system that returns two highly relevant doxels.

• A system that returns all the SIR-relevant doxels will not be considered as having retrieved all the relevant information: this system will not have a recall of 1.

Those problems are more connected to relevance assessments for free-text queries, where there is no constraint on the structure of the retrieved doxels. Nevertheless, the case of structured queries can also be discussed. We will distinguish two different cases:

• The topic formulation does not have any constraint that forbids a doxel and a sub-doxel (a doxel contained in this doxel, like e.g. a paragraph in a section) to be both retrieved, like for example the query "find a paragraph or a section that talks about cats". Recall/precision are clearly not adapted to this case;

• The topic formulation does not allow a doxel and its sub-doxel to be both retrieved ("chapters that talk about photography"). In this case, we can use standard (or generalised) recall and precision without any problem.

Classical measures require the definition of the typical behaviour of a system user. This user consults the list of retrieved doxels one by one, beginning with the first returned doxel and continuing in the returned order. In the next section, we propose a measure based on a specific user behaviour which takes into account the structure of the documents. In particular, we integrate in our measure the fact that a user might explore the doxels which are near the returned doxel in the structure.

In Web-based IR, classical precision/recall can be problematic. Even if the problem is slightly different, some authors have considered using the structural information (hyperlinks) of the corpus. For instance, Quintana, Kamel and McGeachy [11] proposed a measure that takes into account data on the displayed list of documents, on the user's knowledge of the topic and also on the links between the documents. They propose to estimate the mean time a user will spend before finding a relevant document. We follow somewhat the same approach. The main difference is that we rely upon a probabilistic model, which makes our measure sound and easily adaptable to new corpora.

A MEASURE FOR SIR

We will suppose an ideal situation where assessments in the INEX 2002 corpus strictly follow the definition of SIR-relevance (which is not the case). We will thus make the following assumption: a SIR-relevant doxel can only contain SIR-relevant doxels that are less relevant or have a smaller coverage. This constraint states that the same relevant information is assessed with "exact coverage" only one time.

In this section, we describe our measure, beginning with some general hypotheses and its definition. Then we present the probabilistic events and the assumptions we made on them, and finally we show how to calculate our measure.

The definition of a measure is based on a hypothetical user behaviour. Hypotheses used in classical measures are subjective but do reflect a reality. In the SIR framework, we will propose a measure that estimates the number of relevant doxels a user might see. We will now describe how a typical user behaves in the context of SIR retrieval. This behaviour will be defined by three different aspects: the doxel list returned by the SIR system, the structure of the documents and the known relevance of doxels to a query. The following hypotheses are similar to those supposed in classical IR:

Order The user follows the list of doxels, beginning with the first returned. He never discourages himself nor does he jump randomly from one doxel to another;

Absolute relevance A doxel is still relevant even if the user has already seen another doxel that contains the same (or a part of the same) information;

Non-additivity Two non-relevant doxels will never be relevant even if they are merged.

The three last hypotheses are specific to our measure:

Structure browsing The user eventually consults the structural context (parent, children, siblings) of a returned doxel (a minimal sketch of this structural context is given after this list). This hypothesis is related to the inner structure of the documents;

Coverage influence The coverage of a doxel influences the behaviour of the user. If the doxel is "too large", then the user will most probably consult its children. If the doxel is "too small", the user will most probably consult the doxel's ancestors;

No hyperlink The user will not use any hyperlink. More precisely, he will not jump to another document. This hypothesis is valid in the INEX corpus but can easily be removed in order to cope with hyperlinked corpora.
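As a small illustration of the structural context mentioned in the Structure browsing hypothesis, here is a minimal sketch of a doxel and of the neighbours a user can reach from it. The Doxel class and its fields are ours, not the paper's.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Doxel:
    """Illustrative representation of a document element (doxel)."""
    ident: str
    parent: Optional["Doxel"] = None
    children: List["Doxel"] = field(default_factory=list)

    def structural_context(self):
        """Doxels reachable in one step under the Structure browsing hypothesis:
        the parent, the children and the siblings of a returned doxel."""
        siblings = [c for c in self.parent.children if c is not self] if self.parent else []
        return ([self.parent] if self.parent else []) + list(self.children) + siblings
```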
The measure we propose is the expectation of the number of relevant doxels a user sees when he consults the list of the k first returned doxels, divided by the expectation of the number of relevant doxels a user sees if he explores all the doxels of the database. We denote this measure by ERR (for Expected Ratio of Relevant documents):

ERR(k) = E[NR / N = k] / E[NR / N = |E|]

This measure is computed for one query. The measure ERR is normalised (ERR ∈ [0, 1]), as E[NR / N = |E|] represents the maximum number of SIR-relevant doxels a user can see in the whole corpus. The measure can thus be averaged over different queries.

We now have to derive this measure from the behaviour of a typical user and to compute the expectation E[NR / N = k] with the assumptions on the user behaviour we just made. We will introduce some events that are used to formally model the user behaviour, and we will make some hypotheses on the (probabilistic) relationships between these events in order to make the measure computable. The three different probabilities we introduce are respectively related to the relevance assessments, to the returned list of retrieved doxels and to the structure of the documents. The set of events we use in this paper is summarised in table 2.

Table 2: The set of events used in this paper.
N        Number of doxels in the list consulted by the user
NR       Number of SIR-relevant doxels that have been seen by the user
Le       The doxel e is in the list consulted by the user
Se       The user has seen the doxel e (either in the list or by browsing from a doxel in the list)
e′ → e   The user sees the doxel e after he consulted the doxel e′ of the list
Re       The doxel e is SIR-relevant for the query

Let us denote E the set of doxels, e or e′ a doxel from E and q a given query. A doxel e can be more or less relevant with respect to the query. We will denote the probability of SIR-relevance of a given doxel by P(Re/q). The list returned by the SIR system is only partially ordered, so that some rearrangements of the list are possible. Depending on the length N of the list, a doxel is then consulted by the user with a probability P(Le/q, N = k).

When a user consults a doxel e′ from the list, he eventually will use the structure to navigate to another doxel e of the document. As it is difficult to make this process deterministic, we will use P(e′ → e/q) as the probability that the user goes from e′ to e. Note that this probability depends upon the query; this will be illustrated in the next sections.

We will suppose that the IR user sees the doxel e iff:

• e is in the list;
• or a doxel e′ is in the list and the user browses from e′ to e.

This event is denoted Se and we can write:

Se ≡ Le ∨ (∃e′ ∈ E, Le′ ∧ e′ → e)

For simplicity, we will now drop the query q from the formulas, as the measure is computed independently for every query.

The following hypotheses are necessary for the computation of the measure. Note that all these assumptions are made knowing the query q and the length of the list N. The first two hypotheses are intuitive. The first hypothesis states that the relevance of a doxel does not depend on the fact that the user sees it:

(H1)    P(Se ∧ Re) = P(Se) P(Re)

The second states that the behaviour of a user (going from a doxel e′ in the retrieved list to another doxel, e′ → e) does not depend on the fact that the doxel e′ is in the list (Le′):

(H2)    P(Le′ ∧ e′ → e) = P(Le′) P(e′ → e)

The third states that events R or L related to different doxels are independent; in particular, the events

(H3)    Se ∧ Le (or ¬(Se ∧ Le)) and Se′ ∧ Le′ (or ¬(Se′ ∧ Le′))

are independent for two different doxels e and e′. This hypothesis has no intuitive meaning and has been introduced only to allow the computation of the measure. Nevertheless, it can be justified by two statements: first, the relevance is assigned by the user, and thus the probability of SIR-relevance does not depend upon the SIR-relevance of another doxel but on the user assessment (that is denoted by our event q). The second point is that the fact Se that the user sees a doxel e only depends on the fact that a doxel e′ is in the list (which is known when we know the length of the list N, which is the case here) and that the user moves from a doxel e′ in the list to another doxel e.

The third hypothesis is also a simplification of reality, but is as necessary as the first two. It is related to the probability Se that the user sees a doxel e: the more the user can access this doxel from the retrieved doxels by navigating along the document structure, the more chances he has to see that doxel. As it is not possible to evaluate all the interactions between previously seen doxels and this event, we make the hypothesis that corresponds to the "noisy-or". This hypothesis is used to compute the probability of the logical implication A1 ∨ · · · ∨ An ⇒ B as 1 − P(¬A1) · · · P(¬An). We will use it below to compute P(Se).

In this subsection, we describe how to compute the measure. We want to calculate E[NR / N = k], with 1 ≤ k ≤ |E|. We know that, by definition,

E[NR / N = k] = Σ_r r P(NR = r / N = k)

The user has seen r SIR-relevant doxels (NR = r) if these two conditions are both met: (1) there exists a subset {e1, . . . , er} ⊆ E of SIR-relevant doxels that the user has seen, and (2) for every other doxel, either the doxel is not SIR-relevant or the user has not seen it. If one considers the set of all sets A that contain r doxels from E, this condition can be written formally as:

P(NR = r / N = k) = Σ_{A ⊆ E, |A| = r} P( (∧_{e ∈ A} Se ∧ Re) ∧ (∧_{e ∉ A} ¬(Se ∧ Re)) / N = k )

Events for two different sets are exclusive, and using hypothesis (H3) we can state that:

P(NR = r / N = k) = Σ_{A ⊆ E, |A| = r} ∏_{e ∈ A} P(Se ∧ Re / N = k) ∏_{e ∉ A} P(¬(Se ∧ Re) / N = k)

This formula can be reduced, using the hypothesis H1 we made:

P(Se ∧ Re / N = k) = P(Se / N = k) P(Re)

Using the definition of Se and the noisy-or hypothesis, we have:

P(Se / N = k) = 1 − ∏_{e′ ∈ E} P(¬(Le′ ∧ e′ → e) / N = k)

Then, using hypothesis (H2), we finally obtain:

P(Se / N = k) = 1 − ∏_{e′ ∈ E} (1 − P(Le′ / N = k) P(e′ → e))

In this equation, we assumed that the event e → e is certain (identity move), that is P(e → e) = 1, as the logical or is over all doxels in E.

Note that E[NR / N = |E|] can easily be computed, as P(Se / N = |E|) = 1.
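The final formulas can be evaluated directly. The sketch below assumes the three probabilities are given as dictionaries: p_in_list for P(Le′ / N = k), p_browse (keyed by ordered doxel pairs) for P(e′ → e), and p_rel for P(Re); the names are ours. The expectation E[NR / N = k] is computed through linearity of expectation as the sum of P(Se ∧ Re / N = k), which is equivalent to the expansion above.

```python
def p_seen(e, doxels, p_in_list, p_browse):
    """P(Se | N=k) = 1 - prod over e' of (1 - P(Le' | N=k) P(e' -> e))  (noisy-or and H2).
    The identity move e -> e is certain, so a doxel that is in the list is always seen."""
    prod = 1.0
    for e2 in doxels:
        p_move = 1.0 if e2 == e else p_browse.get((e2, e), 0.0)
        prod *= 1.0 - p_in_list.get(e2, 0.0) * p_move
    return 1.0 - prod

def expected_relevant(doxels, p_in_list, p_browse, p_rel):
    """E[NR | N=k] = sum over e of P(Se | N=k) P(Re), using (H1) and linearity of expectation."""
    return sum(p_seen(e, doxels, p_in_list, p_browse) * p_rel.get(e, 0.0) for e in doxels)

def err(doxels, p_in_list_at_k, p_browse, p_rel):
    """ERR(k) = E[NR | N=k] / E[NR | N=|E|]; when the whole corpus is consulted P(Se) = 1,
    so the denominator is simply the sum of the P(Re)."""
    denominator = sum(p_rel.get(e, 0.0) for e in doxels)
    if denominator == 0.0:
        return 0.0
    return expected_relevant(doxels, p_in_list_at_k, p_browse, p_rel) / denominator
```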
In the last section, we derived the computation of the measure ERR, but we did not instantiate it in a practical case. We now propose a way to compute some of the probabilities for the INEX database3, namely, for a query, the probability P(Re) of relevance of a doxel and the probability P(e → e′) that the user browses from one doxel to another.

3Note that one can use the same definitions for any corpus of structured documents.

Computing P(Re)

INEX relevance assessments are given in a two-dimensional scale (coverage and relevance). For a given query, we will compute P(Re) as4 a function of the assessment j(e) of the doxel e for the given query in the scale JINEX. To avoid counting the same relevant information twice, we will furthermore suppose that the probability of SIR-relevance of a doxel is zero whenever the doxel has an ancestor that is relevant with exact match, that is, whenever ∃e′ such that j(e′) ∈ {1E, 2E, 3E} and e′ is an ancestor of e.

4Other functions are of course possible; we chose one that seemed "reasonable" to us.

Computing P(e′ → e)

To compute the probability that the user jumps from one doxel to another, we will distinguish several relationships between those doxels. The formulas below were only justified by our intuition and can easily be replaced by others. We will denote length(e) the length of doxel e; this length will usually be the number of words contained in the doxel. We will denote by d(e, e′) the distance between two doxels; we used the number of words that lie between those two doxels: for example, the distance between the last paragraph of section 1 and the second paragraph in section 2 will be the number of words in the first paragraph of section 2 (plus the number of words of the section title). We can now give the formulas, distinguishing four different cases.

e′ and e are not in the same document. We made the hypothesis that the user does not follow any hyperlink, so P(e′ → e) = 0.

e′ is a descendant of e. We will suppose that the more e′ is an important part of e, the greater the probability that a user goes from e′ to e. The relevance of e′ has an influence on this probability: if the coverage of e′ is S (or better, E), the probability is higher; the corresponding weight a is 7 when the coverage is exact, 3 when the coverage is too small and 1 otherwise.

e is in e′. This is a symmetric case. The only difference is the coverage influence: a is 7 when the coverage is exact, 3 when the coverage is too big and 1 otherwise.

Other cases. If, in the same document, two doxels follow one another (like two sibling paragraphs), we will state that the probability that the user follows the path between the two doxels is proportional to the inverse of the distance between them:

P(e′ → e) = (2 + d(e′, e))^−1
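Only the two cases whose formulas are fully given above are implemented in the following sketch; the ancestor/descendant cases, which involve the coverage factor a ∈ {7, 3, 1} and the doxel lengths, are deliberately left out since their exact formula is not reproduced above.

```python
def p_browse_adjacent(distance_in_words, same_document=True):
    """Sketch of P(e' -> e) for the fully specified cases: 0 across documents
    (no-hyperlink hypothesis), and (2 + d(e', e))^-1 for two doxels of the same
    document that follow one another."""
    if not same_document:
        return 0.0
    return 1.0 / (2.0 + distance_in_words)
```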
EXPERIMENTS

In this section, we show how the measure discriminates between different IR systems. In order to compare the behaviour of generalised precision-recall versus our measure, we considered six different hypothetical "SIR systems" which make use of the known assessments. These systems exhibit "extreme" behaviours which illustrate a whole set of different situations. The systems are named:

perfect A system that returns the SIR-relevant doxels

document A system that returns all the documents in which a SIR-relevant doxel appears

parent A system that always returns the parent of a SIR-relevant doxel

ancestors A system that returns the ancestors of a SIR-relevant doxel with a score

biggest child A system that returns the biggest child (in length) of a SIR-relevant doxel

In all these experiments, the score of the doxel was given by the relevance (first dimension of JINEX) of its SIR-relevant doxel: we scored 1 for a doxel which was highly relevant, 0.5 for a fairly relevant doxel and finally 0 for a marginally relevant doxel.
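For illustration, two of these stereotype runs could be built from the assessments as sketched below. The exact ordering and tie-breaking used in the paper are not specified, so this is only a sketch and the identifiers are ours.

```python
# Relevance digit of a JINEX assessment -> score used in the experiments.
SCORES = {"3": 1.0, "2": 0.5, "1": 0.0}

def perfect_run(assessments):
    """Return the SIR-relevant doxels (exact coverage), scored by their relevance level."""
    run = [(doxel, SCORES.get(j[0], 0.0))
           for doxel, j in assessments.items() if j.endswith("E")]
    return sorted(run, key=lambda pair: -pair[1])

def parent_run(assessments, parent_of):
    """Return, for every SIR-relevant doxel, its parent with the same score."""
    run = [(parent_of[doxel], SCORES.get(j[0], 0.0))
           for doxel, j in assessments.items()
           if j.endswith("E") and doxel in parent_of]
    return sorted(run, key=lambda pair: -pair[1])
```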
In our experiments, we used all the content-only queries for which there were some assessments. We only kept the first 1000 documents returned by the different systems. Given that scores can only take three values, the P/R curve was computed using the Raghavan [12] probabilistic definition of precision and recall (with a step of 0.1). We computed the values at N = 1, . . . , 1000 for our own measure. We averaged our results for P/R and ERR in order to hide the specificities of each assessment. We did not consider the case of standard precision/recall (i.e. using fs), as almost all of the models proposed here would have a near-null precision-recall curve.

In figure 1, we present the curves obtained with our measure, and in figure 2 the generalised recall/precision (GRP). We will comment on those curves in this subsection: we will point out the shortcomings of GRP and see how our measure corrects them. When we analyse those curves, we can identify at least four problems with the GRP:

1. The model perfect is not perfect for GRP. This can be seen as it is not the best model and as precision falls very quickly between recall 0.2 and 0.6. This is because, when using the generalised quantisation fg, we are adding relevant doxels (for precision/recall) that are not SIR-relevant. Thus, even if the system returns all the SIR-relevant doxels, it does not return the other relevant doxels. For our measure ERR, we can see that after almost 400 doxels, model perfect has retrieved all SIR-relevant doxels.

2. The model ancestors has a higher performance than model perfect. This point is related to the previous one: because the model ancestors returns more doxels that are relevant (due to the quantisation), recall is better. Due to the limited size of the list and to the 4 possible values for scores, examination of the retrieved doxels shows another thing: every SIR-relevant doxel in the returned list is preceded by a list of its ancestors. We can see this effect with our measure, as the measure increases slowly with the number of retrieved documents for the model ancestors. Our measure also correctly discriminates those two models, as the performance of model ancestors is far below the performance of model perfect.

3. The model parent is much higher than the model biggest child. This is not what could be expected, as the parent can contain many irrelevant parts. This effect is due to the fact that doxels with coverage "too small" have a lower value in the real scale than those with coverage "too big". With our measure, model performances are much closer.

4. The model document is close to the model biggest child. This is not a good property of GRP, since we want a measure that favours systems that retrieve elements of smaller granularity than documents, and since the biggest child is very often close to the SIR-relevant doxel (maybe as close as the document). With ERR, this is not the case.

Those four observations show that our measure is better suited to SIR evaluation than GRP. If we consider the theoretic foundations of our measure, it gives some guarantees about its validity.

CONCLUSION

In this article, we have described a new measure for SIR systems called the Expected Ratio of Relevant documents (ERR). This measure is a generalisation of recall in classical IR: when the probability of going from a doxel to another is always null, the measure reduces to a form of generalised recall. This measure is consistent with SIR, in the sense that it favours systems that find the smallest relevant doxels. Other proposed measures, like standard or generalised precision and recall, are not good indicators of the performance of a SIR system, as was shown in the last section. Note that the results presented here should however be interpreted with care, as we took very specific systems to underline the strange behaviour of GRP. Our measure has the advantage of a sound theoretical foundation and explicitly integrates the structure of the documents in the modelling of user behaviour5.
The presented measure could also be very easily adapted in order to evaluate the performance of systems in the case of web retrieval. Another interesting property is that it could favour systems that provide Best Entry Points to the document structure [8], from which users can browse to access relevant information. In this case, if from a retrieved doxel there is a high probability that the user goes to some (SIR-)relevant doxels, the measure will increase faster than if the doxel is (SIR-)relevant but provides no (structural) links to other (SIR-)relevant doxels.
The last step would have been to provide an extension of precision as we did for recall. But when we tried to follow the probabilistic approach of Raghavan, a number of problems arose6 and it is still not clear which set of hypotheses could be used to solve the problem. However, the curves we can draw with the proposed measure are informative enough and have good properties, so it could replace or complement the GRP used for the evaluation of SIR systems.
5This behaviour should be empirically validated.

6In particular, we need to calculate the probability of finding NR relevant doxels in the retrieved list if this list has a given length. This probability can only be computed in O(2^MR), where MR is the number of relevant doxels for the query.
Figure 1: Measure ERR (log scale for the abscissa axis). The abscissa axis represents the length of the list of retrieved doxels; the ordinate axis represents the measure ERR (in %). The measures are averaged over the queries.
Figure 2: Generalised precision-recall. The abscissa axis represents recall and the ordinate axis precision. Precision is averaged over the queries.
REFERENCES

[1] Ricardo Baeza-Yates and Berthier Ribeiro-Neto. Modern Information Retrieval. Addison Wesley, New York, USA, 1999.

[2] Peter Bollmann and Vladimir S. Cherniavsky. Measurement-Theoretical Investigation of the MZ-Metric. In Robert N. Oddy, Stephen E. Robertson, C. J. van Rijsbergen, and P. W. Williams, editors, Proc. Joint ACM/BCS Symposium in Information Storage and Retrieval, pages 256–267, 1980.

[3] C. W. Cleverdon. The Cranfield tests on index language devices. In Aslib Proceedings, volume 19, pages 173–192, 1967.

[4] William S. Cooper. Some inconsistencies and misidentified modelling assumptions in probabilistic information retrieval. In Nicholas J. Belkin, Peter Ingwersen, and Annelise Mark Pejtersen, editors, Proceedings of the 14th ACM SIGIR, Copenhagen, Denmark, 1992. ACM Press.

[5] Norbert Gövert. Assessments and evaluation measures for XML document retrieval. In Proceedings of the First Annual Workshop of the Initiative for the Evaluation of XML retrieval (INEX), DELOS workshop, Dagstuhl, Germany, December 2002. ERCIM.

[6] Norbert Gövert and Gabriella Kazai. Overview of the Initiative for the Evaluation of XML retrieval (INEX) 2002. In Proceedings of the First Annual Workshop of the Initiative for the Evaluation of XML retrieval (INEX), DELOS workshop, Dagstuhl, Germany, December 2002. ERCIM.

[7] Jaana Kekäläinen and Kalervo Järvelin. Using graded relevance assessments in IR evaluation. Journal of the American Society for Information Science (JASIS), 53(13):1120–1129, 2002.

[8] Mounia Lalmas and Ekaterini Moutogianni. A Dempster-Shafer indexing for the focussed retrieval of a hierarchically structured document space: Implementation and experiments on a web museum collection. In 6th RIAO Conference, Content-Based Multimedia Information Access, Paris, France, April 2000.

[9] Stefano Mizzaro. How many relevances in information retrieval? Interacting With Computers, 10(3):305–322, 1998.

[10] Stephen M. Pollock. Measures for the Comparison of Information Retrieval Systems. American Documentation, 19(3):387–397, October 1968.

[11] Yuri Quintana, Mohamed Kamel, and Rob McGeachy. Formal methods for evaluating information retrieval in hypertext systems. In Proceedings of the 11th Annual International Conference on Systems Documentation, pages 259–272, Kitchener-Waterloo, Ontario, Canada, October 1993. ACM Press.

[12] Vijay V. Raghavan, Gwang S. Jung, and Peter Bollmann. A critical investigation of recall and precision as measures of retrieval system performance. ACM Transactions on Information Systems, 7(3):205–229, 1989.

[13] Don R. Swanson. Historical Note: Information Retrieval and the Future of an Illusion, pages 555–561. Multimedia Information and Systems. Morgan Kaufmann, July 1997.

[14] John A. Swets. Effectiveness of Information Retrieval Methods. American Documentation, 20(1):72–89, January 1969.

[15] Cornelis J. Van Rijsbergen. Information Retrieval. Butterworths, 1979.