Meta: characterization of medical treatments at different abstraction levels
Politecnico di Torino
Porto Institutional Repository
[Article] MeTA: Characterization of medical treatments at different abstraction levels
Original Citation: Dario Antonelli; Elena Baralis; Giulia Bruno; Luca Cagliero; Tania Cerquitelli; Silvia Chiusano; Paolo Garza; Naeem A. Mahoto (2015). MeTA: Characterization of medical treatments at different abstraction levels. In: vol. 6 n. 4, pp. 1-25. - ISSN 2157-6904
Availability: This version is available at : since: November 2015
Publisher: ACM New York, NY, USA
Published version:
Terms of use: This article is made available under terms and conditions applicable to Open Access Policy Article ("Creative Commons: Attribution-Noncommercial-No Derivative Works 3.0"), as described at
Personalized tag recommendation based on generalized
Elena Baralis, Luca Cagliero∗, Tania Cerquitelli, Silvia Chiusano, Paolo Garza
Dipartimento di Automatica e Informatica, Politecnico di Torino,
Corso Duca degli Abruzzi 24, 10129, Torino, Italy
Dario Antonelli, Giulia Bruno
Dipartimento di Ingegneria Gestionale e della Produzione, Politecnico di Torino,
Corso Duca degli Abruzzi 24, 10129, Torino, Italy
Naeem A. Mahoto
Department of Software Engineering, Mehran University of Engineering and Technology,
Indus Hwy, Jamshoro 76062, Pakistan
∗Corresponding author. Tel.: +39 0110907084 Fax: +39 0110907099.
Abstract
Physicians and healthcare organizations routinely collect large amounts of
data during patient care. These large and high-dimensional datasets are
usually characterized by an inherent sparseness. Hence, analyzing these
datasets to uncover interesting and hidden knowledge is a challenging task.
This paper proposes a new data mining framework based on generalized
association rules to discover multiple-level correlations among patient data.
Specifically, correlations among prescribed examinations, drugs, and patient
profiles are discovered and analyzed at different abstraction levels. The rule
extraction process is driven by a taxonomy to generalize examinations and
drugs into their corresponding categories. To ease the manual inspection
of the result, a worthwhile subset of rules, i.e., the non-redundant
generalized rules, is considered. Furthermore, rules are classified according
to the involved data features (medical treatments or patient profiles) and
then explored in a top-down fashion, i.e., from the small subset of high-level
rules a drill-down is performed to target more specific rules.
The experiments, performed on a real diabetic patient dataset, demonstrate
the effectiveness of the proposed approach in discovering interesting
rule groups at different abstraction levels.
Keywords: Healthcare Informatics, Data Mining, Generalized Association Rules
1. Introduction
Healthcare systems are nowadays integrated platforms that can take
advantage of advanced data management and analysis solutions. Large amounts
of data on patient medical history are commonly stored by healthcare
organizations. Data mining techniques can be used to analyze these large data
collections and to extract knowledge useful for physicians and healthcare
organizations.
This study addresses the problem of analyzing patient historical data to
identify valuable correlations among patient treatments and profiles. The
extracted patterns allow experts to (i) identify the medical treatments com-
monly followed by patients with a given disease, (ii) verify the adherence
of medical treatments to shared medical guidelines, (iii) improve the effec-
tiveness of medical treatments, and (iv) plan resource allocation and reduce
costs incurred by organizations.
Association rule extraction is an established data mining technique to
discover interesting correlations in large datasets [44]. Since patient history
data is relatively sparse, discovering association rules from these datasets is
a challenging task. Discovering correlations among data items that rarely
co-occur may become computationally intractable when coping with large
datasets. On the other hand, focusing only on the most frequent item
co-occurrences may not provide sufficiently useful information. Furthermore,
since a large rule set could be mined, inferring useful and actionable
knowledge from the extracted rules can be a complex task.
This paper presents: (i) MeTA (Medical Treatment Analysis), a new
data analysis framework targeted to the discovery of underlying multiple-level
correlations among patient treatments and profiles; (ii) the classification of
the mined rules into classes according to the represented data features (e.g.,
examinations, drugs); (iii) the exploration of rules in order of descending
level of abstraction of the represented information in the input taxonomy;
and (iv) the application of MeTA to a real-life use case, i.e., the analysis of
diabetic patient data provided by the National Health Center (NHC) of an
Italian province.
Patient datasets consist of log files holding information about patient
treatments and census data. Each row contains a set of pairs (feature, value),
where feature corresponds to a specific data feature (i.e., Examination, Drug,
Age, or Gender), while value is the corresponding feature value. MeTA
discovers interesting and multiple-level correlations among patient data called
generalized rules [39]. These rules are represented in the form X → Y,
where X and Y are disjoint sets of items (called itemsets). The implication
means that (i) itemsets X and Y frequently co-occur in the analyzed
dataset (regardless of the temporal order of occurrence of X and Y in the
source data), (ii) the strength of the implication between X and Y is above
a given threshold, and (iii) X and Y may also include items belonging to
different abstraction levels. Item generalization is driven by a taxonomy,
which consists of a set of is-a hierarchies built over the analyzed data. For
example, drugs can be generalized based on the addressed pathology [7],
while examinations are clustered based on the focused area (e.g., liver or
cardiovascular system). Aggregating items into higher-level concepts (e.g.,
examinations into the corresponding category) prevents the discarding of
potentially useful knowledge and thus counteracts the issue of data sparseness.
In our context of analysis, we disregard the temporal order of prescriptions
and we specifically focus on discovering multiple-level co-occurrences among
examination/drug prescriptions. In Section 3 we will demonstrate that these
patterns are worth considering for targeted analysis (e.g., resource allocation,
healthcare service management). To make the mined result more manageable
by domain experts for manual inspection, MeTA considers a worthwhile rule
subset, i.e., the non-redundant rules [48]. Non-redundant rules are generated
from closed itemsets [33], which are a compact and non-redundant subset
of frequent itemsets. Furthermore, MeTA categorizes the rules into four
groups according to the represented data facet (e.g., examination, drugs).
Within each group, rules are further classified according to the level of
abstraction of the contained items in the input taxonomy. The usefulness of
both non-redundant rule selection and rule categorization for improving the
manageability of the mining result is discussed in Section 3.4.
As an example, let us consider rule {(Examination, Routine), (Examination,
Cardiovascular)} → {(Drug, Blood and blood forming organs Category)}.
It indicates that drugs in category "Blood and blood forming organs"
have been prescribed to a relatively large number of patients to whom routine
and cardiovascular examinations have been prescribed as well (disregarding
the temporal order of prescriptions). This information could be useful,
for example, for shaping drug provision to medical divisions
according to the most commonly performed examinations. The rule is
high-level, because it contains only examination and drug categories.
Conversely, rules containing also or only single examinations/drugs will be
denoted as cross- or low-level rules, respectively. The cross-level rule
{(Examination, Routine), (Examination, Cardiovascular)} → {(Drug,
Acetylsalicylic Acid)} can be extracted as well if Acetylsalicylic Acid is the
predominantly prescribed drug within the "Blood and blood forming organs"
category. Note that the aforesaid high- and cross-level rules are more likely
to be frequent than their low-level descendant rules (e.g., {(Examination,
Complete blood count), (Examination, Cholesterol)} → {(Drug, Acetylsalicylic
Acid)}). High- and cross-level rules are worth considering because (i) they
represent, from a high-level viewpoint, valuable information that may remain
hidden in sparse datasets at lower abstraction levels and (ii) they are
typically more manageable than low-level rules for manual result exploration.
We assessed the usability of MeTA on a real dataset of diabetic pa-
tients provided by the National Health Center (NHC) of an Italian province.
The experiments demonstrate that, starting from a large collection of raw
data on patient history, the framework allows experts to identify several in-
teresting high-, cross-, and low-level correlations among patient treatments
and profiles. The results were validated by clinical domain experts. The
extracted rules appear to be consistent with the guidelines for diabetes dis-
ease [1, 23, 24]. Furthermore, the extraction of high- and cross-level rules
appears to effectively overcome limitations of traditional approaches.
This paper is organized as follows. Section 2 presents the architecture of
the proposed framework and describes its main blocks. Section 3 assesses
the effectiveness of the system in performing knowledge discovery from a
real diabetic patient dataset. Section 4 compares our approach with the most
relevant related work, while Section 5 draws conclusions and presents future
developments of this work.
2. The Medical Data Generalized Rule Miner system
MeTA (Medical Treatment Analysis) is a novel framework for medical
data analysis, which focuses on characterizing medical treatments at different
granularity levels.
The main MeTA architectural blocks are depicted in Figure 1. A brief
description of each block follows.
Figure 1: The Medical Treatment Analysis framework
Data collection and preparation. This block aims at making information
about patient characteristics, examinations, and drugs suitable for the mining
process. Patient datasets are tailored to a transactional data format, where
each transaction corresponds to a different patient and consists of a set
of items, which represent patient census data (e.g., age, gender), prescribed
examinations (e.g., Glucose level), or prescribed drugs (e.g., Acetylsalicylic
Acid). Transactional datasets are enriched with an (analyst-provided)
taxonomy built over the data items.
Generalized association rule mining. This block focuses on discovering
multiple-level correlations among the preprocessed data in the form of
generalized association rules. The extraction process is driven by the input
taxonomy to generalize data items at higher abstraction levels. To extract
only the rules that (i) occur frequently and (ii) represent positively correlated
implications among pairs of itemsets in the source dataset, rules are filtered
according to established quality measures, i.e., support and lift [43].
Furthermore, to filter out less informative rules, only the subset of
non-redundant rules [48] is considered for further analyses.
Rule categorization. To make the mining result manageable by experts
for manual inspection, rules are categorized according to their represented
information. To analyze correlations among patient data regardless of the pa-
tient profile, rule subsets that represent (i) correlations among examinations,
(ii) correlations among drugs, and (iii) correlations between examinations
and drugs are analyzed separately. On the other hand, to gain more in-
sights into specific user profiles (e.g., elderly men, kids) implications between
specific patient characteristics and examinations/drugs are considered. To
easily explore rule categories the corresponding rules are further classified
as high-level, cross-, or low-level according to the level of abstraction of the
contained information in the input taxonomy.
A more thorough description of each block is reported below.
2.1. Data collection and preparation
Healthcare systems usually collect heterogeneous personal information
about patients into log datasets. For example, the list of prescribed exami-
nations and drugs is stored in separate log files to allow doctors to keep track
of diagnosis and therapies and healthcare system managers to plan purchases
and resource allocations. In parallel, to characterize the patient population,
census data about patients, such as gender and age, are usually collected in
separate datasets.
The MeTA framework collects and stores into a unique data repository
these three main patient data types. More specifically, for each patient the
list of (i) prescribed examinations, (ii) drugs, and (iii) the main patient char-
acteristics are stored.
To enable the mining process, the collected patient data is tailored to a
transactional data format. A transactional patient dataset is a set of
transactions, where each transaction corresponds to a patient and consists of a
set of patient features, called items. Items can be related to examinations
(e.g., Glucose level), drugs (e.g., Acetylsalicylic Acid), or patient census
data (e.g., Male). In this work we focus on age and gender as patient
census data. Items are represented in the form (feature, value), where feature
is Examination, Drug, Age, or Gender, while value is the corresponding
feature value. A more formal definition of transactional patient dataset is
given below.
Definition 1. Transactional patient dataset. Let E be the set of all
possible patient examinations, M the set of all possible drugs, and C the
set of census data features. Let Ω(ci) be the domain of an arbitrary census
data feature ci ∈ C (i.e., the set of all possible values assumed by ci). An
item is a pair (feature, vi), where vi ∈ E if feature = Examination, vi ∈ M if
feature = Drug, and vi ∈ Ω(ci) if feature = ci. A transactional patient dataset
D is a set of transactions, where each transaction ti ∈ D is a set of items.
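Definition 1 maps naturally onto sets of (feature, value) pairs. The following is a minimal Python sketch, not the authors' implementation: the names `Item`, `Transaction`, and `make_transaction` are hypothetical, the first patient mirrors the patient with Pid 5 described in the running example, and the second patient is invented for illustration.

```python
from typing import FrozenSet, Iterable, List, Tuple

Item = Tuple[str, str]          # (feature, value), e.g. ("Drug", "Moxifloxacin")
Transaction = FrozenSet[Item]   # one transaction per patient


def make_transaction(age: str, gender: str,
                     examinations: Iterable[str],
                     drugs: Iterable[str]) -> Transaction:
    """Build one patient transaction as a set of (feature, value) items."""
    items = {("Age", age), ("Gender", gender)}
    items |= {("Examination", e) for e in examinations}
    items |= {("Drug", d) for d in drugs}
    return frozenset(items)


dataset: List[Transaction] = [
    # Patient described in the running example (elderly man, Pid 5).
    make_transaction("Elder", "Male",
                     ["Electrocardiogram", "Blood count"],
                     ["Acetylsalicylic Acid"]),
    # Hypothetical patient, invented for illustration only.
    make_transaction("Adult", "Female", ["Glucose level"], []),
]
```

Using sets reflects the fact that the temporal order and the number of repetitions of a prescription are disregarded.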
Let us consider, as running example, the dataset reported in Table 1.
It consists of 5 records, each one related to a different patient. For each
patient the identifier (Pid), age (Age), gender (Gender), and a list of
prescribed examinations and drugs are given. The dataset contains four
different examinations (HDL Cholesterol, Glucose level, Electrocardiogram, and
Blood count) and two different drugs (Acetylsalicylic Acid and Moxifloxacin).
For example, the patient with Pid 5 is an elderly man to whom examinations
Electrocardiogram and Blood count have been prescribed at least once.
Furthermore, he has already taken Acetylsalicylic Acid but not Moxifloxacin.
Table 1: Example of patient transactional dataset.
Figure 2: Example of taxonomy built over the transactional patient dataset.
To enable the process of generalized rule mining from a transactional
patient dataset D, a taxonomy (i.e., a set of generalization hierarchies) is
built over the items in D. The taxonomy aggregates examinations and drugs
into high-level concepts, i.e., examinations are generalized as examination
categories while drugs as drug categories.
Definition 2. Taxonomy. Let D be a transactional patient dataset and I
the set of items in D. A generalization hierarchy $GH_{I_k}$ ($I_k \subseteq I$)
built over D is a predefined hierarchy of aggregations defined over a subset of
items in I, where hierarchy leaves are items in I, while non-leaf nodes in
$GH_{I_k}$ are ancestors of their corresponding children. Each hierarchy has a
root node (denoted as ⊥) which aggregates all its items. A taxonomy T built
over D consists of a set of generalization hierarchies $GH_{I_k}$ for which
$\bigcup_{GH_{I_k} \in T} I_k = I$.
Although taxonomies can potentially contain many generalizations over
the same item (e.g., many categories for the same examination), in this work
we will consider only taxonomies containing at most one generalization per
item.
An example of taxonomy built over the running example dataset is
reported in Figure 2. Examinations Blood count and Glucose level are classified
as Routine examinations, whereas examinations Electrocardiogram and HDL
Cholesterol are generalized as Cardiovascular. Finally, drugs Acetylsalicylic
Acid and Moxifloxacin are classified as Analgesic and Antibiotic, respectively.
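The example taxonomy can be stored as a child-to-parent map, which also makes the "at most one generalization per item" restriction easy to enforce. The sketch below is illustrative only; `CATEGORIES` and `build_parent_map` are hypothetical names, and the root node ⊥ is omitted for brevity.

```python
# Category -> member items, as described for the example taxonomy.
CATEGORIES = {
    ("Examination", "Routine"): ["Blood count", "Glucose level"],
    ("Examination", "Cardiovascular"): ["Electrocardiogram", "HDL Cholesterol"],
    ("Drug", "Analgesic"): ["Acetylsalicylic Acid"],
    ("Drug", "Antibiotic"): ["Moxifloxacin"],
}


def build_parent_map(categories):
    """Invert the category listing into a child -> parent map.

    Raises ValueError if an item would receive two generalizations,
    since only single-parent taxonomies are considered here.
    """
    parent = {}
    for (feature, category), members in categories.items():
        for value in members:
            item = (feature, value)
            if item in parent:
                raise ValueError(f"{item} has more than one generalization")
            parent[item] = (feature, category)
    return parent


PARENT = build_parent_map(CATEGORIES)
```

A single dictionary is enough here because each item has one parent; taxonomies with multiple generalizations per item would need a map from items to sets of parents.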
2.2. Generalized association rule mining
This block focuses on discovering multiple-level associations, in the form
of generalized association rules, from the transactional patient dataset D
with a taxonomy T.
Association rules represent underlying correlations among the analyzed
data items [2]. More specifically, an association rule is an implication
A ⇒ B, where A and B are itemsets, i.e., sets of data items. A k-itemset is a
set of items of size k that occurs in D. For example, {(Examination, Glucose
level), (Examination, Electrocardiogram)} is a 2-itemset that represents the
co-occurrence of two specific examinations in medical treatments, while the
association rule {(Examination, Glucose level)} → {(Examination,
Electrocardiogram)} indicates that the occurrence of examination Glucose level
"implies" that of examination Electrocardiogram in the analyzed data.
Generalized association rules [39] are rules that may also contain items
at higher abstraction levels, i.e., the generalized items. Every item that
is associated with a non-leaf node of the taxonomy
T (see Definition 2) is
considered as a generalized item. Similarly, generalized itemsets are itemsets
including at least one generalized item.
Definition 3. Generalized itemset. Let D be a transactional patient
dataset and $\mathcal{I}$ be the set of distinct items in D. Let T be a taxonomy
built over D and G the set of generalized items (high-level item aggregations)
derived by all the generalization hierarchies in T. A generalized itemset I is a
subset of $\mathcal{I} \cup G$ including at least one generalized item in G.
For example, according to the taxonomy in Figure 2, {(Examination, Routine),
(Examination, Electrocardiogram)} is a generalized itemset.
Generalized itemsets are characterized by two quality indexes, i.e., the
level and the support. The level of a generalized item ik with respect to a
taxonomy indicates the degree of abstraction of the represented information.
Definition 4. Generalized itemset level. Let D be a transactional pa-
tient dataset and I be the set of distinct items in D. Let T be the taxonomy
defined over D and ik an arbitrary item or generalized item in T. The level
of (generalized) item ik is defined as the height of T's subtree rooted in ik.
The level of a generalized itemset is defined as the maximum level among the
levels of its items.
Generalized itemsets whose items all have the same level are called
level-sharing itemsets [22]. The level of a level-sharing itemset with respect
to a taxonomy corresponds to that of its items.
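The level definitions can be sketched in a few lines of Python. The code below assumes that leaves have level 1 and categories level 2, which is the convention the rule examples later in the paper rely on; the helper names are hypothetical and the taxonomy is the example one described in the text.

```python
from collections import defaultdict

EXAM, DRUG = "Examination", "Drug"

# Child -> parent map for the example taxonomy (root node omitted).
PARENT = {
    (EXAM, "Blood count"): (EXAM, "Routine"),
    (EXAM, "Glucose level"): (EXAM, "Routine"),
    (EXAM, "Electrocardiogram"): (EXAM, "Cardiovascular"),
    (EXAM, "HDL Cholesterol"): (EXAM, "Cardiovascular"),
    (DRUG, "Acetylsalicylic Acid"): (DRUG, "Analgesic"),
    (DRUG, "Moxifloxacin"): (DRUG, "Antibiotic"),
}

CHILDREN = defaultdict(list)
for child, parent in PARENT.items():
    CHILDREN[parent].append(child)


def item_level(item):
    """Height of the subtree rooted at `item`; leaves count as level 1."""
    return 1 + max((item_level(c) for c in CHILDREN.get(item, [])), default=0)


def itemset_level(itemset):
    """Maximum level among the items of the itemset."""
    return max(item_level(i) for i in itemset)


def is_level_sharing(itemset):
    """True if all items of the itemset have the same level."""
    return len({item_level(i) for i in itemset}) == 1


mixed = {(EXAM, "Routine"), (EXAM, "Electrocardiogram")}
print(itemset_level(mixed), is_level_sharing(mixed))
```

Items that do not appear in the taxonomy at all (e.g., census items) are treated as leaves by this sketch, which is an assumption rather than something the definitions state.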
The support of a generalized itemset evaluates its observed frequency of
occurrence in the analyzed data. It is defined in terms of the itemset coverage
with respect to the analyzed data.
Definition 5. Generalized itemset coverage. Let D be a transactional
patient dataset and T the corresponding taxonomy. A (generalized) itemset I
covers a given transaction ti ∈ D if all its (possibly generalized) items ik ∈ I
are either included in ti or ancestors (generalizations) of items ik ∈ ti with
respect to T.
The support of a generalized itemset I is given by the ratio between the
number of transactions ti ∈ D covered by I and the cardinality of D.
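Coverage and support can be sketched as follows. The five transactions are hypothetical (the content of Table 1 is not reproduced here) and contain examinations only; `generalizations`, `covers`, and `support` are illustrative names.

```python
EXAM = "Examination"

# Child -> parent map restricted to the examinations of the example taxonomy.
PARENT = {
    (EXAM, "Blood count"): (EXAM, "Routine"),
    (EXAM, "Glucose level"): (EXAM, "Routine"),
    (EXAM, "Electrocardiogram"): (EXAM, "Cardiovascular"),
    (EXAM, "HDL Cholesterol"): (EXAM, "Cardiovascular"),
}

# Hypothetical five-patient dataset, NOT the paper's Table 1.
DATASET = [
    {(EXAM, "Glucose level"), (EXAM, "Electrocardiogram")},
    {(EXAM, "HDL Cholesterol")},
    {(EXAM, "Blood count"), (EXAM, "Electrocardiogram")},
    {(EXAM, "HDL Cholesterol")},
    {(EXAM, "Blood count"), (EXAM, "Electrocardiogram")},
]


def generalizations(item):
    """Yield the item itself followed by all of its taxonomy ancestors."""
    yield item
    while item in PARENT:
        item = PARENT[item]
        yield item


def covers(itemset, transaction):
    """True if each item of `itemset` is in the transaction or is an
    ancestor of one of the transaction's items."""
    expanded = {g for item in transaction for g in generalizations(item)}
    return set(itemset) <= expanded


def support(itemset, dataset):
    """Fraction of transactions covered by `itemset`."""
    return sum(covers(itemset, t) for t in dataset) / len(dataset)
```

On this toy data, `support({(EXAM, "Routine"), (EXAM, "Electrocardiogram")}, DATASET)` counts the three patients who had at least one routine examination together with an electrocardiogram.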
A (generalized) itemset I is said to be a descendant of another generalized
itemset Y if (i) I and Y have the same length (i.e., the same number of items)
and (ii) for each item y ∈ Y there exists an item i ∈ I that is a descendant
of y.
The concept of generalized association rule extends traditional associa-
tion rules to the case in which they may include either generalized or not
generalized itemsets. A more formal definition follows.
Definition 6. Generalized association rule. Let A and B be two
(generalized) itemsets. A generalized association rule is represented in the
form R : A ⇒ B, where A and B are the body and the head of the rule,
respectively.
A and B are also denoted as antecedent and consequent of the generalized
rule A ⇒ B. Generalized association rule extraction is commonly driven by
rule support (s) and confidence (c) quality indexes [39]. While the support
index represents the observed frequency of occurrence of the rule in the source
dataset, the confidence index represents the rule strength.
Definition 7. Generalized association rule support. Let D be a
transactional patient dataset and T a taxonomy. The support of a generalized
rule R : A ⇒ B is defined as the support (i.e., the observed frequency) of
A ∪ B in D.
Definition 8. Generalized association rule confidence. Let D be a
transactional patient dataset and T a taxonomy. The confidence of a rule
R : A ⇒ B is the conditional probability of occurrence in D of the generalized
itemset B given the generalized itemset A.
For example, the generalized association rule {(Examination, Routine)}
→ {(Examination, Electrocardiogram)} (s=60%, c=100%) indicates that
examinations belonging to category Routine co-occur with examination
Electrocardiogram in 3 out of 5 transactions of the analyzed dataset (Pids 1,
3, and 5) and the implication holds in 3 out of 3 (100%) of the cases.
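Definitions 7 and 8 can be checked numerically with a short sketch. The dataset below is hypothetical: it is not Table 1, only a five-patient toy set constructed so that the rule discussed above obtains the quoted support and confidence. For brevity, items are plain strings and each transaction already includes the taxonomy ancestors of its items.

```python
# Hypothetical five-patient dataset, already extended with taxonomy ancestors.
# It is NOT the paper's Table 1; it is only built to match the quoted s and c.
DATASET = [
    {"Glucose level", "Routine", "Electrocardiogram", "Cardiovascular"},
    {"HDL Cholesterol", "Cardiovascular"},
    {"Blood count", "Routine", "Electrocardiogram", "Cardiovascular"},
    {"HDL Cholesterol", "Cardiovascular"},
    {"Blood count", "Routine", "Electrocardiogram", "Cardiovascular"},
]


def support(itemset, dataset=DATASET):
    """Fraction of transactions containing every item of `itemset`."""
    return sum(set(itemset) <= t for t in dataset) / len(dataset)


def rule_support(body, head, dataset=DATASET):
    """Definition 7: support of the union of body and head."""
    return support(set(body) | set(head), dataset)


def rule_confidence(body, head, dataset=DATASET):
    """Definition 8: estimate of P(head | body) = s(body ∪ head) / s(body)."""
    return rule_support(body, head, dataset) / support(body, dataset)


s = rule_support({"Routine"}, {"Electrocardiogram"})
c = rule_confidence({"Routine"}, {"Electrocardiogram"})
print(f"s={s:.0%}, c={c:.0%}")  # prints "s=60%, c=100%"
```

The sketch does not guard against a body with zero support; a mining algorithm never produces such rules because they fail the support threshold.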
In some cases, measuring the strength of a rule in terms of support and
confidence may be misleading [42]. When the rule consequent is characterized
by a relatively high support value, the corresponding rule may be characterized
by a high confidence even if its actual strength is relatively low. To overcome
this issue, the lift (or correlation) index [43] may be used, instead of or in
addition to the confidence index, to measure the (symmetric) correlation
between body and head of the extracted rules.
Definition 9. Generalized association rule lift. Let A → B be an
association rule. Its lift is given by

$$ l(A, B) = \frac{c(A \rightarrow B)}{s(B)} = \frac{s(A \rightarrow B)}{s(A)\, s(B)} $$

where s(A → B) and c(A → B) are, respectively, the rule support and
confidence, and s(A) and s(B) are the supports of the rule antecedent and
consequent.
If l(A,B) is equal to or close to 1, itemsets A and B are not correlated
with each other. Lift values significantly below 1 show negative correlation,
whereas values significantly above 1 indicate a positive correlation between
itemsets A and B, i.e., the implication between A and B holds more than
expected. For example, rule {(Examination, Routine)} → {(Examination,
Electrocardiogram)}
is positively correlated, because its lift value is equal to 5 .
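As a numeric illustration of the lift index, the sketch below computes it directly from three support values and labels the resulting correlation. The function names, the tolerance `tol` used to decide what counts as "close to 1", and the support values in the usage lines are all assumptions made for illustration; the paper does not specify them.

```python
from math import isclose


def lift(s_ab, s_a, s_b):
    """Lift of A -> B from s(A ∪ B), s(A), and s(B)."""
    confidence = s_ab / s_a
    return confidence / s_b


def correlation_type(value, tol=0.05):
    """Label a lift value; `tol` is an arbitrary, illustrative tolerance."""
    if isclose(value, 1.0, abs_tol=tol):
        return "uncorrelated"
    return "positive" if value > 1.0 else "negative"


# Hypothetical support values, chosen only to exercise the three cases.
print(correlation_type(lift(0.6, 0.6, 0.8)))   # lift 1.25
print(correlation_type(lift(0.2, 0.4, 0.5)))   # lift 1.0
print(correlation_type(lift(0.1, 0.5, 0.5)))   # lift 0.4
```

The second case shows why confidence alone can mislead: the rule has 50% confidence, yet the consequent occurs in 50% of all transactions anyway, so the antecedent adds no information.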
Since the interest of uncorrelated or negatively correlated rules is marginal
in our context of analysis, MeTA only considers frequent and positively
correlated generalized association rules. Specifically, given a transactional
patient dataset, a taxonomy, a minimum support threshold (minsup), and a
minimum lift threshold (minlift), MeTA discovers all the generalized
association rules whose:
• support value is above a given minimum support threshold minsup, i.e.,
s(R) > minsup;
• lift value is above a given minimum lift threshold minlift, i.e.,
l(R) > minlift.
The generalized rules that satisfy all the above conditions will be denoted
as strong rules throughout the paper.
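The two conditions amount to a simple filter over candidate rules. A minimal sketch follows, in which the `Rule` record, the threshold values, and the candidate rules with their quality values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    body: frozenset
    head: frozenset
    support: float
    lift: float


def strong_rules(rules, minsup, minlift):
    """Keep the rules whose support and lift both exceed the thresholds."""
    return [r for r in rules if r.support > minsup and r.lift > minlift]


# Hypothetical candidate rules with made-up quality values.
candidates = [
    Rule(frozenset({"Routine"}), frozenset({"Electrocardiogram"}), 0.60, 1.25),
    Rule(frozenset({"Routine"}), frozenset({"Cardiovascular"}), 0.60, 1.00),
    Rule(frozenset({"Glucose level"}), frozenset({"Moxifloxacin"}), 0.01, 3.00),
]
selected = strong_rules(candidates, minsup=0.05, minlift=1.1)
```

Only the first candidate survives: the second is frequent but uncorrelated, and the third is strongly correlated but too rare.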
Since the set of strong rules may still contain less informative rules, a
further pruning step is applied prior to performing further analyses.
Specifically, MeTA discovers non-redundant generalized rules [48], which are a
worthwhile subset of strong generalized rules. Extensions of a strong
generalized rule are classified as redundant if they have the same support and
confidence as their specialized version. A more formal definition is given
below. It extends the concept of non-redundant rule, first proposed in [48], to
the case in which rules may also contain generalized items.
Definition 10. Non-redundant generalized association rule. Let R :
A ⇒ B be a strong generalized rule. R is non-redundant if there exists no
strong rule R∗ : C ⇒ D, with C ⊆ A ∧ D ⊆ B, such that the support and
confidence of R and R∗ are equal.
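A literal reading of Definition 10 can be sketched as a pairwise check. This is not the algorithm of [48], which works on closed itemsets; it is a brute-force illustration. It assumes that R∗ must differ from R (otherwise every rule would make itself redundant), and the rules and their quality values are hypothetical.

```python
from collections import namedtuple
from math import isclose

Rule = namedtuple("Rule", "body head support confidence")


def is_non_redundant(rule, strong_rules):
    """Definition 10: no other strong rule C => D with C ⊆ A and D ⊆ B
    has the same support and confidence as `rule`."""
    for other in strong_rules:
        if other.body == rule.body and other.head == rule.head:
            continue  # assumption: R* must differ from R
        if (other.body <= rule.body and other.head <= rule.head
                and isclose(other.support, rule.support)
                and isclose(other.confidence, rule.confidence)):
            return False
    return True


# Hypothetical strong rules over abstract items a, b, c, d.
r1 = Rule(frozenset("a"), frozenset("c"), 0.4, 0.8)
r2 = Rule(frozenset("ab"), frozenset("c"), 0.4, 0.8)  # same quality, larger body
r3 = Rule(frozenset("ad"), frozenset("c"), 0.2, 0.9)
rules = [r1, r2, r3]
non_redundant = [r for r in rules if is_non_redundant(r, rules)]
```

Here `r2` is pruned because `r1` conveys the same support and confidence with a smaller antecedent, while `r3` is kept because its quality values differ.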
To generate non-redundant generalized rules we used the publicly avail-
able implementation of the algorithm proposed in [48] on an extended dataset
version, in which transactions contain both items and their corresponding
generalizations according to the input taxonomy. This generalized rule min-
ing strategy is similar to the one previously adopted in [39] in the context of
market basket analysis.
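The dataset extension step can be sketched as follows: each transaction is enriched with all the ancestors of its items, so that a standard (non-generalized) miner run on the extended data also finds rules over categories. The function name is hypothetical, the taxonomy is the example one described earlier, and the mining algorithm of [48] itself is not reproduced.

```python
EXAM, DRUG = "Examination", "Drug"

# Child -> parent map for the example taxonomy (root node omitted).
PARENT = {
    (EXAM, "Blood count"): (EXAM, "Routine"),
    (EXAM, "Glucose level"): (EXAM, "Routine"),
    (EXAM, "Electrocardiogram"): (EXAM, "Cardiovascular"),
    (EXAM, "HDL Cholesterol"): (EXAM, "Cardiovascular"),
    (DRUG, "Acetylsalicylic Acid"): (DRUG, "Analgesic"),
    (DRUG, "Moxifloxacin"): (DRUG, "Antibiotic"),
}


def extend_transaction(transaction, parent=PARENT):
    """Return the transaction plus every taxonomy ancestor of its items."""
    extended = set(transaction)
    for item in transaction:
        while item in parent:
            item = parent[item]
            extended.add(item)
    return extended


# Patient with Pid 5 as described in the running example.
patient = {
    ("Age", "Elder"), ("Gender", "Male"),
    (EXAM, "Electrocardiogram"), (EXAM, "Blood count"),
    (DRUG, "Acetylsalicylic Acid"),
}
extended = extend_transaction(patient)
```

Census items have no entry in the parent map, so they pass through unchanged.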
2.3. Rule categorization
The generalized rules extracted during the previous MeTA step are explored
by domain experts to discover valuable information. Unfortunately, when
coping with relatively large or complex transactional patient datasets the
number of mined rules could be so large that a manual inspection becomes
infeasible. To overcome this issue, this block focuses on categorizing the
extracted rules into homogeneous groups, according to their represented
information.
MeTA partitions rules into worthwhile subsets that characterize the un-
derlying data from different viewpoints, because they contain different com-
binations of patient features and/or medical treatments. We highlighted four
representative rule classes, which are thoroughly described below. Table 2
reports the rule template for each class.
Class E-Rules: Correlations between examinations. Rules in this
group represent correlations among examinations regardless of the char-
acteristics of the analyzed patients and prescribed drugs.
Class E-Rules. This class may potentially include more complex rules, such as
In other words, the rule antecedent can, in general, be an itemset of arbitrary size.
Class D-Rules: Correlations between drugs. Rules in this group focus
the experts' attention on correlations among the prescribed drugs,
disregarding examinations and patient characteristics. For example, {(Drug,
Acetylsalicylic Acid)} → {(Drug, Moxifloxacin)} belongs to Class D-Rules. In
this class, too, rules can represent implications where the antecedent is an
itemset of arbitrary size.
Table 2: Rule categories. ∗1 represents an examination or an examination class,
∗2 represents either a drug or a drug class, ∗3 represents either an age or an
age group, while ∗4 is a gender value (male or female).
Class       Description
E-Rules     Correlations between examinations
D-Rules     Correlations between drugs
ED-Rules    Correlations between examinations and drugs
P-Rules     Age Profiles
Class ED-Rules: Correlations between examinations and drugs.
This group of rules represents co-occurrences between drugs and examinations
in the patient dataset, regardless of patient characteristics. More
specifically, all the rules that contain both examinations/examination
categories and drugs/drug categories in their antecedent/consequent are
assigned to class ED-Rules. For example, {(Examination, Routine), (Drug,
Aspirin)} → {(Examination, Electrocardiogram)} is assigned to this class. It
associates the co-occurrence of an examination category and a drug with a
specific examination.
Class P-Rules: Profile-based correlations. The former rule classifica-
tions disregard patient characteristics. Nevertheless, experts can deem such
information to be useful for characterizing specific user profiles (e.g., elderly
men, kids). This class consists of all the rules that contain any item related
to a census feature in their rule antecedent. This rule subset can be further
categorized according to the considered census features, because each com-
bination of patient census features may represent a distinct and potentially
meaningful user profile. Since in our work we target our analysis on age and
gender census features, rules belonging to Class P-Rules can be partitioned
into the three subgroups reported in Table 2.
For example, {(Age, Elder), (Gender, Male)} → {(Examination, HDL
Cholesterol)} indicates that the HDL Cholesterol examination has frequently
been prescribed to elderly men. Similarly, rule {(Age, Elder), (Drug,
Acetylsalicylic Acid)} → {(Examination, HDL Cholesterol)} indicates that the
HDL Cholesterol examination has frequently been prescribed to elderly people
(males or females) who have taken drug Acetylsalicylic Acid (regardless of the
temporal order of drug/examination prescriptions).
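The four rule classes can be assigned by inspecting which features occur in a rule. The sketch below is one possible reading of the class descriptions, not the authors' code: `categorize` is a hypothetical name, and rules with census items only in the consequent are left uncategorized because the text does not say how they are handled.

```python
CENSUS = {"Age", "Gender"}


def categorize(body, head):
    """Assign a rule (body -> head) to class E, D, ED, or P.

    `body` and `head` are sets of (feature, value) items.
    """
    body_features = {feature for feature, _ in body}
    all_features = body_features | {feature for feature, _ in head}
    if body_features & CENSUS:
        return "P-Rules"        # census feature in the antecedent
    if all_features == {"Examination"}:
        return "E-Rules"
    if all_features == {"Drug"}:
        return "D-Rules"
    if all_features == {"Examination", "Drug"}:
        return "ED-Rules"
    return None                 # not covered by the four classes


# Rules taken from the examples in the text.
print(categorize({("Drug", "Acetylsalicylic Acid")},
                 {("Drug", "Moxifloxacin")}))
print(categorize({("Examination", "Routine"), ("Drug", "Aspirin")},
                 {("Examination", "Electrocardiogram")}))
print(categorize({("Age", "Elder"), ("Drug", "Acetylsalicylic Acid")},
                 {("Examination", "HDL Cholesterol")}))
```

Because the features are the same for items and for their generalizations, the same function works for high-, cross-, and low-level rules.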
2.3.1. Level-wise exploration of rule categories
Given a worthy set of rule categories, experts are asked to go into de-
tail about the contained rules. However, since generalized rules potentially
represent information at different levels of granularity, rule class exploration
could be challenging unless considering taxonomy abstraction levels as refer-
ence information.
To easily explore rule categories, the corresponding rules are further clas-
sified as high-, cross-, or low-level according to the level of abstraction of the
contained information in the input taxonomy.
High-level rules are generalized rules A → B, where A and B are
level-sharing itemsets with the same level l > 1. They typically represent
general knowledge and thus they should be considered first during manual
result exploration. For example, {(Examination, Routine)} → {(Examination,
Cardiovascular examination)} is a high-level rule, because both rule
antecedent and consequent are level-2 itemsets.
Cross-level rules are generalized rules A → B, where A and B are either
not level-sharing itemsets or level-sharing itemsets with different levels.
They combine detailed and general information by climbing up and down the
taxonomy for different data features. Given a subset of high-level rules,
cross-level rules can be considered as an intermediate step to perform
drill-down (i.e., moving from general to detailed information).
For example, {(Examination, Blood count)} → {(Examination, Cardiovascular
examination)} is a cross-level rule, because the rule antecedent is a level-1
itemset, whereas the rule consequent is a level-2 itemset. If the corresponding
high-level rule is deemed to be useful for advanced analysis, then considering
the former cross-level rule can be relevant to analyze the underlying
correlations between a specific routine examination and the Cardiovascular
examination category.
Low-level rules are non-generalized rules A → B, i.e., both A and B are
non-generalized (level-1) itemsets. They typically represent very detailed
information. When coping with relatively sparse datasets, many of these
rules could be discarded during the mining process by enforcing the minimum
support threshold. However, their peculiar information is likely to be
covered, to a certain extent, by cross- and high-level rules. For example,
{(Examination, Urine test)} → {(Examination, Electrocardiogram)} is a
low-level rule. These rules can be analyzed to gain more insights
on a specific subset of cross-level or high-level rules.
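The three level-based classes follow directly from the set of item levels found in a rule. The sketch below assumes a two-level taxonomy in which categories have level 2 and every other item has level 1; the set `GENERALIZED` and the function names are hypothetical.

```python
EXAM, DRUG = "Examination", "Drug"

# Generalized (level-2) items of a two-level example taxonomy.
GENERALIZED = {
    (EXAM, "Routine"), (EXAM, "Cardiovascular"),
    (DRUG, "Analgesic"), (DRUG, "Antibiotic"),
}


def level(item):
    """Level 2 for categories, level 1 for everything else (assumption)."""
    return 2 if item in GENERALIZED else 1


def level_class(body, head):
    """Classify a rule as 'high', 'cross', or 'low' level."""
    levels = {level(item) for item in set(body) | set(head)}
    if levels == {1}:
        return "low"      # only non-generalized items
    if len(levels) == 1:
        return "high"     # body and head share the same level l > 1
    return "cross"        # items at different levels


print(level_class({(EXAM, "Routine")}, {(EXAM, "Cardiovascular")}))
print(level_class({(EXAM, "Blood count")}, {(EXAM, "Cardiovascular")}))
print(level_class({(EXAM, "Urine test")}, {(EXAM, "Electrocardiogram")}))
```

Sorting the rules of a category by this label ("high", then "cross", then "low") gives the top-down exploration order described in this section.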
3. Experimental results
We performed various experiments on a real-life dataset collected by
an Italian Health Center to demonstrate effectiveness and efficiency of the
MeTA framework.
The experimental section is organized as follows. Section 3.1 describes
the main characteristics of the real-life dataset analyzed in this study, while
Section 3.2 summarizes the most relevant results achieved during the mining
session and highlights the significance and usability of the discovered
high-level correlations among data. Section 3.3 analyzes the distribution of
the discovered rules across categories and levels of abstraction of the ex-
tracted knowledge. A quantitative analysis of the complexity of the rule
exploration process is given in Section 3.4. Finally, Section 3.5 analyzes the
efficiency of MeTA in terms of execution time.
All the experiments were performed on a quad-core 3.30 GHz Intel Xeon
workstation with 16 GB of RAM, running Ubuntu Linux 12.04 LTS. The
software used to perform rule extraction and post-processing is available
online at [30].
3.1. Diabetic patient dataset and taxonomy
The dataset considered in this study was collected by an Italian Local
Health Center (LHC). Specifically, in 2007 the LHC gathered all the accesses
to the medical center throughout the year into a single dataset. Then, the
examination log data of all the patients with overt diabetes were extracted
from this dataset. The raw data consist of 95,788 records and include the
examinations and drugs prescribed to 3,565 patients. The dataset contains
information about male and female patients in a wide age range (i.e.,
between 4 and 95 years). To analyze diabetes complications at various
degrees of severity, both routine and more specific examinations were recorded
jointly with prescribed drugs. The diagnostic and therapeutic procedures
were defined using the ICD 9-CM (International Classification of Diseases,
9th revision, Clinical Modification) [23]. Drugs were identified by the phar-
maceutical coding system adopted by the Anatomical Therapeutic Chemical
(ATC) Classification System [7].
The generalization hierarchy over examinations is shown in Table 3. It
contains 26 examinations clustered into 7 examination categories. The se-
lected examination categories are based on the expert-driven classification
reported in [5].
The drug generalization hierarchy contains as leaves the drugs encoded
by using the fifth level of the ATC classification system defined in [7]. Drugs
are aggregated into the corresponding drug category, according to the first
level of the standard ATC classification system. For instance, drug acetyl-
salicylic acid (i.e., code: B01AC06) is a leaf node of the drug generalization
hierarchy and Category B (i.e., Category Blood and blood forming organs)
is its generalization. Our dataset contains 200 distinct drugs and 14 distinct
categories. Table 4 reports the hierarchy defined over drugs.
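Because the first character of an ATC code identifies its first-level group, the drug generalization step described above amounts to a prefix lookup. The sketch below is illustrative only: the function name `generalize_drug` and the validation pattern are assumptions introduced here, while the category labels are the ones listed in Table 4.

```python
import re
from typing import Tuple

# First-level ATC groups, labelled as in Table 4.
ATC_LEVEL1 = {
    "A": "Alimentary tract and metabolism",
    "B": "Blood and blood forming organs",
    "C": "Cardiovascular system",
    "D": "Dermatologicals",
    "G": "Genito-urinary system and sex hormones",
    "H": "Systemic hormonal preparations, excluding sex hormones and insulins",
    "J": "Antiinfectives for systemic use",
    "L": "Antineoplastic and immunomodulating agents",
    "M": "Musculo-skeletal system",
    "N": "Nervous system",
    "P": "Antiparasitic products, insecticides and repellents",
    "R": "Respiratory system",
    "S": "Sensory organs",
    "V": "Various",
}

# Fifth-level ATC codes have the shape letter, 2 digits, 2 letters, 2 digits.
_ATC_LEVEL5 = re.compile(r"^[A-Z]\d{2}[A-Z]{2}\d{2}$")


def generalize_drug(atc_code: str) -> Tuple[str, str]:
    """Map a fifth-level ATC code to its (category letter, category name)."""
    code = atc_code.strip().upper()
    if not _ATC_LEVEL5.match(code):
        raise ValueError(f"not a fifth-level ATC code: {atc_code!r}")
    letter = code[0]
    if letter not in ATC_LEVEL1:
        raise ValueError(f"unknown first-level ATC group: {letter!r}")
    return letter, ATC_LEVEL1[letter]


aspirin_category = generalize_drug("B01AC06")
```

Here `aspirin_category` is the pair ("B", "Blood and blood forming organs"), matching the acetylsalicylic acid example in the text.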
Human life is often divided into various age ranges (e.g., infancy,
middle adulthood, old age). Age feature values have been discretized into the
following 8 age groups, which represent established ranges of the human
lifespan [46]: [0-6], [7-12], [13-22], [23-39], [40-59], [60-75], [76-90],
and [91-101].
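The age discretization can be expressed as a lookup over the lower bounds of the eight groups. This is a minimal sketch under the assumption that ages are integers; the names `AGE_GROUPS` and `age_group` are introduced here for illustration and are not part of the MeTA software.

```python
from bisect import bisect_right

# The eight age groups listed in the text, as inclusive (low, high) bounds.
AGE_GROUPS = [(0, 6), (7, 12), (13, 22), (23, 39),
              (40, 59), (60, 75), (76, 90), (91, 101)]
_LOWER_BOUNDS = [low for low, _ in AGE_GROUPS]


def age_group(age: int) -> str:
    """Return the label of the age group containing `age`."""
    if not AGE_GROUPS[0][0] <= age <= AGE_GROUPS[-1][1]:
        raise ValueError(f"age out of range: {age}")
    # bisect_right counts the lower bounds that are <= age.
    low, high = AGE_GROUPS[bisect_right(_LOWER_BOUNDS, age) - 1]
    return f"[{low}-{high}]"


middle_aged = age_group(45)
```

For a 45-year-old patient the function returns "[40-59]", the middle-aged segment.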
Table 3: Generalization hierarchy over examinations.
Checkup visit
Glucose level
Urine test
Venous blood
Complete blood count
HDL Cholesterol
Fundus oculi
Eye examinations
Complete eye examination
Microscopic urine analysis
Culture urine
ECO Doppler carotid
Limb examinations
ECO Doppler limb

Table 4: Generalization hierarchy over drugs.
Category A: Alimentary tract and metabolism
    A01AA01: Sodium fluoride
Category B: Blood and blood forming organs
    B01AC06: Acetylsalicylic acid
    B03AA03: Ferrous gluconate
Category C: Cardiovascular system
    C09AA05: Ramipril
Category D: Dermatologicals
Category G: Genito-urinary system and sex hormones
Category H: Systemic hormonal preparations, excluding sex hormones and insulins
Category J: Antiinfectives for systemic use
Category L: Antineoplastic and immunomodulating agents
    L01AB01: Busulfan
Category M: Musculo-skeletal system
    M03AC10: Mivacurium chloride
Category N: Nervous system
Category P: Antiparasitic products, insecticides and repellents
Category R: Respiratory system
Category S: Sensory organs
    S02AA10: Acetic acid
Category V: Various
    V10XX01: Sodium phosphate
    V10XA01: Sodium iodide

3.2. Analysis of the mined rules
We performed several generalized rule extractions from the patient dataset
by enforcing different minimum support (minsup) and lift (minlift)
thresholds. To perform knowledge discovery from the mined rules, we selected a
representative configuration setting, i.e., we set minsup to 1% and minlift
to 1.1. The reasons behind the choice of the support threshold are twofold.
First, support thresholds that are too low or too high yield rule sets that
are too detailed or too general, respectively, and thus may not produce
manageable yet interesting knowledge. Second, it is well known that rules
with fairly low support commonly represent potentially interesting knowledge
if they represent positive correlations among items (i.e., their lift is
above 1) [15]. Nevertheless, by generalizing items at higher abstraction
levels, some of the low-support correlations among data are still represented
by higher-level rules. Hence, we selected minsup=1% as a good trade-off
between rule set specialization and generality. We also enforced a minimum
lift threshold equal to 1.1 to prune both negatively correlated and
uncorrelated item combinations. On the one hand, the interest of negatively
correlated rules (i.e., rules with lift below 1) is marginal in our context
of analysis. On the other hand, rules with lift close to 1 are misleading
because the occurrences of their antecedent and consequent are not actually
correlated with each other. Hence, among the positively correlated rules we
further pruned approximately 10% of them, whose lift value is between 1 and
1.1. Finally, we focused on the rules with length below or equal to 3, i.e.,
the rules consisting of pairs or triples of (generalized) items, because they
represent the most actionable correlations among drugs/examinations. However,
to specialize the rules that provide peculiar information, experts could
perform further extractions and explore longer rules complying with the given
category.
We first analyzed the rules that hold for all patients, i.e., we considered
rules belonging to Classes E-Rules, D-Rules, and ED-Rules (Section 2.3).
Then, we focused on rules that concern patient profiles (i.e., Class P-Rules).
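To make the role of the minsup and minlift thresholds used in this section concrete, the sketch below computes support, confidence, and lift for a generalized rule on a toy dataset. It is an illustration under stated assumptions, not the MeTA code: the five patient transactions are invented, the `PARENT` mapping and all function names are hypothetical, generalized items are counted by adding each item's taxonomy ancestor to the transaction (the strategy commonly used in generalized rule mining), and thresholds are assumed to be inclusive.

```python
from typing import Dict, Iterable, List, Set, Tuple

# Hypothetical one-level taxonomy: leaf item -> category.
PARENT: Dict[str, str] = {
    "Uric acid": "Kidney examinations",
    "B01AC06": "Category B",
    "C09AA05": "Category C",
}

# Invented toy transactions (one set of prescriptions per patient).
PATIENTS: List[Set[str]] = [
    {"Uric acid", "B01AC06"},
    {"Uric acid", "B01AC06"},
    {"C09AA05"},
    {"Uric acid"},
    {"C09AA05", "B01AC06"},
]


def extend(transaction: Set[str]) -> Set[str]:
    """Add the taxonomy ancestor of every item to the transaction."""
    return set(transaction) | {PARENT[i] for i in transaction if i in PARENT}


def support(itemset: Iterable[str], transactions: List[Set[str]]) -> float:
    """Fraction of transactions containing every item of `itemset`."""
    items = set(itemset)
    return sum(items <= t for t in transactions) / len(transactions)


def rule_metrics(antecedent: Set[str], consequent: Set[str],
                 transactions: List[Set[str]]) -> Tuple[float, float, float]:
    """Return (support, confidence, lift) of the rule antecedent -> consequent."""
    sup_a = support(antecedent, transactions)
    sup_b = support(consequent, transactions)
    if sup_a == 0 or sup_b == 0:
        raise ValueError("antecedent or consequent never occurs")
    sup_ab = support(antecedent | consequent, transactions)
    return sup_ab, sup_ab / sup_a, sup_ab / (sup_a * sup_b)


def passes_thresholds(antecedent: Set[str], consequent: Set[str],
                      transactions: List[Set[str]],
                      minsup: float = 0.01, minlift: float = 1.1) -> bool:
    """Keep a rule only if it meets both thresholds (inclusive, by assumption)."""
    sup, _, lift = rule_metrics(antecedent, consequent, transactions)
    return sup >= minsup and lift >= minlift


EXTENDED = [extend(t) for t in PATIENTS]
sup, conf, lift = rule_metrics({"Kidney examinations"}, {"Category B"}, EXTENDED)
_, _, reverse_lift = rule_metrics({"Category B"}, {"Kidney examinations"}, EXTENDED)
```

On this toy data the rule {Kidney examinations} → {Category B} has support 0.4, confidence 2/3, and lift 10/9, so it survives minlift=1.1. Swapping antecedent and consequent changes the confidence but not the lift, since lift depends only on the joint and marginal supports.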
3.2.1. Correlations between examinations and drugs
Tables 5 and 6 report worthwhile subsets of correlations among examinations
(Class E-Rules) and among drugs (Class D-Rules), respectively. A worthwhile
subset of correlations between examinations and drugs (Class ED-Rules) is
summarized in Table 7. For each rule, we report support (percentage),
confidence (percentage), lift, and the corresponding type, according to the
level-dependent classification reported in Section 2.3.1 (low-level,
cross-level, high-level).
Analysis of correlations between examinations (Class E-Rules).
This section addresses the analysis of a subset of interesting correlations
between examinations. First, we are particularly interested in analyzing the
co-occurrences among examination categories, while disregarding the temporal
order of prescriptions. High-level rules, such as rules (1)-(7) in Table 5,
represent positive correlations between examination categories. They can be
used to target the analysis towards specific issues. For example, rules (1)
and (2) highlight a pairwise association between liver and kidney
examinations, which holds 2.52 times more often than expected according to
the corresponding lift value (the lift value of the two rules is the same
because of the symmetry of the lift measure). In other words, the expected
frequency of co-occurrence of the two examination categories (assuming
independence between the occurrences of the single examination categories) is
significantly lower than the observed one. The high-level rules (1) and (2)
can be used to efficiently schedule medical examination timetables according
to their corresponding prescriptions. For example, since liver and kidney
examinations are frequently prescribed to the same patient, scheduling both
examinations on the same day could reduce patient recovery time. A deeper
insight into liver and kidney examinations may be focused on (i) assessing
the adherence of medical treatments to the medical guidelines suggested by
the Italian Ministry of Health about liver and kidney diseases in diabetic
patients or (ii) proposing new guidelines according to the observed
correlations between specific liver and kidney examinations. Similar analyses
can be performed starting from the pairwise correlations between the
examination categories represented by rules (3)-(6).
Since association rules can also represent higher-order associations among
data, we should not restrict our analyses to pairwise associations among
items. For example, rule (7) shows a positive correlation between liver, car-
diovascular, and kidney examination categories. Longer rules can be used
either to specialize known lower-order associations or to figure out new and
more complex medical treatments.
To delve deeper into the analysis of the most specific correlations between
examinations, high-level rules are often not enough. In fact, they provide a
high-level view of the underlying correlations among data, which could be
insufficient to perform targeted analysis. On the other hand, as discussed in
the following, high-level rules are very important because they also represent
those patterns that have not been separately extracted at lower abstraction
levels because they are infrequent according to the support threshold.
A step forward is to consider also cross-level rules, which contain both
low- and high-level information, i.e., examinations and examination
categories, at the same time. To take advantage of the preliminary analysis
of high-level rules, only the subset of cross-level rules that are related to
some interesting high-level rule is considered. For example, based on rule
(1), we can deepen our analysis by searching for underlying correlations
between specific examinations and examination categories. For instance, given
the subset of patients to whom liver examinations are frequently prescribed,
what specific kidney examination is most likely to be prescribed as well?
From the comparison between the confidences of rules (8)-(13), uric acid
appears to be the most likely kidney examination, because the uric acid
examination has also been prescribed to 74.8% of the patients associated with
a liver examination. This information is valuable because it gives more
insights into a subset of medical treatments. Similarly, other combinations
of examinations and examination categories (which have been omitted for the
sake of brevity) have been mined.
The last step is the analysis of low-level rules, which represent
significant correlations among single examinations (disregarding the
examination categories). The exploration of low-level rules is often a
challenging task, because their cardinality is commonly so large that their
manual inspection becomes practically infeasible. To overcome this issue, we
pruned redundant rules early (see Section 2.2) and exploited the knowledge
extracted from higher-level patterns (i.e., high- and cross-level rules) to
prevent experts from exploring the whole rule set. For example, given rules
(1) and (8), experts may wonder what the probability is of prescribing the
uric acid examination to patients who have also received a prescription for a
specific liver examination. To answer this question, we can consider
low-level rules (14)-(17). Specifically, their confidence values indicate the
conditional probability of prescription of the uric acid examination given
the occurrence of specific liver examinations in the patient dataset.
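The top-down exploration described in the last three paragraphs (start from a high-level rule, then inspect its cross- and low-level descendants ranked by confidence) can be sketched as follows. This is an illustrative sketch rather than the MeTA implementation: the `Rule` class, the `PARENT` mapping, the item names "Kidney exam X" and "Liver exam Y", and all confidence values except the 74.8% reported above for uric acid are made-up placeholders.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List


@dataclass(frozen=True)
class Rule:
    antecedent: FrozenSet[str]
    consequent: FrozenSet[str]
    confidence: float


# Hypothetical one-level taxonomy: examination -> examination category.
PARENT: Dict[str, str] = {
    "Uric acid": "Kidney examinations",
    "Kidney exam X": "Kidney examinations",  # placeholder item
    "Liver exam Y": "Liver examinations",    # placeholder item
}


def generalize(itemset: FrozenSet[str]) -> FrozenSet[str]:
    """Replace every leaf item by its category; categories stay unchanged."""
    return frozenset(PARENT.get(item, item) for item in itemset)


def drill_down(high: Rule, candidates: List[Rule]) -> List[Rule]:
    """Return the rules that generalize to `high`, by decreasing confidence."""
    hits = [
        r for r in candidates
        if (r.antecedent, r.consequent) != (high.antecedent, high.consequent)
        and generalize(r.antecedent) == high.antecedent
        and generalize(r.consequent) == high.consequent
    ]
    return sorted(hits, key=lambda r: r.confidence, reverse=True)


def rule(a: str, b: str, confidence: float) -> Rule:
    """Build a rule with a single item on each side."""
    return Rule(frozenset({a}), frozenset({b}), confidence)


high_level = rule("Liver examinations", "Kidney examinations", 0.60)  # placeholder
mined = [
    high_level,
    rule("Liver examinations", "Uric acid", 0.748),      # value from the text
    rule("Liver examinations", "Kidney exam X", 0.30),   # placeholder
    rule("Liver exam Y", "Uric acid", 0.50),             # placeholder
    rule("Liver examinations", "Category B", 0.20),      # unrelated rule
]
ranked = drill_down(high_level, mined)
```

On these placeholder rules, `ranked` lists the uric acid cross-level rule first, followed by the low-level rule and the other cross-level rule; the unrelated rule and the high-level rule itself are excluded, so an expert only sees the descendants of the rule being explored.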
Since patient data typically contain not only examination prescriptions
but also drug prescriptions, it could also be interesting to analyze the
correlations between drugs (i.e., Class D-Rules) and the correlations between
examinations and drugs (i.e., Class ED-Rules) at different abstraction levels.
Analysis of correlations between drugs (Class D-Rules). Table 6
reports a selection of correlations between drugs. They concern the pairwise
correlation between the drugs belonging to the respiratory system category
(category R) and those belonging to the anti-infectives for systemic use
category (category J). The concurrent use of drugs belonging to the above
categories could prompt a detailed analysis of the corresponding guidelines
[1]. Specifically, rule (3) highlights the association between the drugs
belonging to the respiratory system category and drug Levofloxacin, which is
commonly prescribed for infections of the respiratory system.
Analysis of correlations between examinations and drugs (Class
ED-Rules). Guidelines commonly indicate established associations between
examinations and drugs [1]. Adherence to these guidelines could be verified
against correlations between examinations and drugs mined from real patient
log data. Representative rules of this type are reported in Table 7. For
example, the high-level rule (1) in Table 7 indicates a positive correlation
between the examinations of the carotid and category B drugs (Blood and blood
forming organs). This rule confirms the common knowledge that vascular
diseases, such as carotid problems, are usually kept under control with drugs
related to blood diseases. More specifically, the cross-level rule (2) indicates
that carotid examinations are frequently associated with the category B drug
with code B01AC06, which corresponds to the active principle Acetylsalicylic
Acid. Acetylsalicylic Acid [7] is widely used to treat blood and vascular dis-
eases in general (including carotid issues). Hence, the drug use appears to
be coherent with guidelines. Finally, the low-level rule (3) in Table 7 shows
another interesting correlation between examinations and drugs. Unlike the
former ones, it associates a specific examination (the HDL Cholesterol car-
diovascular examination) with a specific drug (active principle: rosuvastatin,
code: C10AA07). Rosuvastatin is indicated for cardiovascular diseases and,
in particular, it is used to treat patients affected by primary hypercholes-
terolemia [24].
3.2.2. Profile-based correlations (Class P-Rules)
In this section we analyze the rules representing correlations between
patient profiles (i.e., demographic features) and treatments. These rules
represent recurrences among treatments that hold for specific patient
segments identified by census features (i.e., age, gender). To facilitate the
analysis of different data facets, profile-based rules are further
specialized into three subcategories: Age profile, Gender profile, and
Age-Gender profile-based rules. A worthwhile subset of representative rules
is reported in Table 8. Considering these rules allows us to characterize
patients with different profiles (i.e., age and/or gender) based on their
prescribed examinations and drugs.
Rules (1)-(5) in Table 8 have been classified as "Age profiles" rules (see
Section 2.3), because patients are clustered into segments according to their
age. For example, rule (1) indicates that diabetic patients in the age range
[40-59] (i.e., middle-aged patients) commonly undergo cardiovascular
examinations. The implication holds for most of the patients belonging to
the segment (rule confidence 70.1%). Guidelines confirm that middle-aged
diabetic patients are expected to undergo examinations in order to prevent
cardiovascular diseases [1]. Furthermore, rules (2) and (3) in Table 8
indicate a positive correlation between middle-aged patients and drugs
Rosuvastatin and Ramipril, respectively. Both drugs are likely to be
prescribed to patients with cardiovascular diseases in conjunction with
specific examinations. Drug Rosuvastatin is mainly used to treat patients
with primary hypercholesterolemia, whereas Ramipril (code: C09AA05) is
commonly prescribed to reduce blood pressure [7]. The confidence values of
rules (2) and (3) indicate that approximately 11% of middle-aged patients
actually take the specific drugs. Based on the achieved results, drug
provision across medical centers and pharmacies could be shaped according to
the patient age distribution. For example, medical centers that mainly treat
middle-aged or elderly patients would purchase large amounts of these drugs.
It is worth noticing that, to perform such analyses, discarding
low-confidence rules would be harmful because they still provide information
valuable for medical resource management. If we focus on middle-aged diabetic
patients (age group [40-59]) to whom the HDL cholesterol examination has been
prescribed at least once (see Rule (3)), then the percentage of patients who
have also taken drug Rosuvastatin significantly increases with respect to all
middle-aged patients (rule confidence 13.5% against 11.0%) and even the rule
correlation increases (rule lift 1.82 against 1.48). Rule (6) represents a
correlation between male patients and drug Finasteride. The rule appears to
be reliable, because drug Finasteride is used for treatment and control of
benign prostatic hyperplasia, which commonly arises in males.
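The confidence and lift values quoted above for the two Rosuvastatin rules can be cross-checked against each other. Under the usual definition lift(A → B) = confidence(A → B) / support(B), both rules share the same consequent, so dividing confidence by lift should give roughly the same overall share of patients taking Rosuvastatin. The snippet below is only a sanity check under that assumed definition; the function name is hypothetical and the implied support is derived here, not reported in the paper.

```python
def implied_consequent_support(confidence_pct: float, lift: float) -> float:
    """Support of the consequent (in %) implied by lift = confidence / support."""
    return confidence_pct / lift


# Values quoted in the text for the two rules with consequent Rosuvastatin.
all_middle_aged = implied_consequent_support(11.0, 1.48)
middle_aged_with_hdl = implied_consequent_support(13.5, 1.82)
difference = abs(all_middle_aged - middle_aged_with_hdl)
```

Both quotients come out at roughly 7.4%, differing by less than 0.02 percentage points, which suggests that the reported figures are mutually consistent.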
3.3. Analysis of the rule distribution
The MeTA framework categorizes rules according to the covered data features
and the level of abstraction of the corresponding items (see Section 2.3).
We analyzed the characteristics of the rules mined using the standard
configuration (i.e., minsup=1%, minlift=1.1, and maxlength=3).
Figure 3: Percentage of rules per class (minsup=1%, minlift=1.1, and
maxlength=3).
Figure 3 reports the rule distribution across classes E-, D-, ED-, and P-
Rules. Since profile-based correlations (class P-Rules) are further classified as
"Age profiles", "Gender profiles", and "Age-Gender profiles" (see Table 2 in
Section 2.3) we also reported the percentage of generalized rules per subcat-
egory. Class D-Rules and ED-Rules appear to be the largest rule sets. Both
D-Rules and ED-Rules sets have a high cardinality because the number of
frequently prescribed drugs in the dataset is relatively large.
Figure 4: Percentage of rules per level (minsup=1%, minlift=1.1, and
maxlength=3).
We also analyzed the rule distribution across the abstraction levels of
the input taxonomy. Figure 4 reports for each class the percentage of high-,
cross-, and low-level rules (see Section 2.3.1). The results show that the
percentage of high-level rules with respect to the total number of mined
rules is always less than 2% and, for most categories, it is less than 1%.
More specifically, Class E-Rules contains 55 high-level rules, Class D-Rules
238, and Class ED-Rules 276. These results confirm that high-level rules
provide a compact representation of the underlying correlations among data,
which is particularly suitable for manual result inspection. Since the number
of cross- and low-level rules is one order of magnitude higher than that of
high-level rules, analyzing high-level rules first prevents experts from
exploring hundreds of (more specific) rules. For example, the experts could
focus on a subset of high-level rules and then drill down to cross- and
low-level rules, which represent similar information at finer granularity
levels, only if high-level patterns are not informative enough to support
knowledge discovery. Finally, it is worth noticing that "Age profiles",
"Gender profiles", and "Age-Gender profiles" do not produce any high-level
rule because no generalization hierarchy has been defined over the census
features.
3.4. Quantitative analysis of the rule inspection process
MeTA addresses the issue of making the mined rule set manageable by
domain experts for manual inspection by (i) selecting non-redundant rules
(see Section 2.2) and (ii) performing rule exploration in a top-down fashion
(see Section 2.3.1).
To analyze the pruning rate achieved by Step (i), we evaluated the ratio
between the number of (traditional) frequent generalized rules and the number
of (selected) non-redundant generalized rules mined from the patient dataset
using the standard configuration (minsup=1%, minlift=1.1). The ratio is equal
to 13.3 (i.e., on average a non-redundant generalized rule corresponds to
13.3 traditional rules) with maxlength=3, whereas it reaches 41.6 with
maxlength=4, because longer rules are more likely to be redundant [48].
Since the non-redundant rule set cardinality is at least one order of
magnitude smaller than the traditional one for all the tested configurations,
the usefulness of the non-redundant rule selection step is confirmed.
As discussed in Section 2.3.1, rule exploration is performed in a top-
down fashion starting from the most general (high-level) rules. To analyze
the effectiveness of Step (ii), we analyzed both the total number of mined
rules and the average number of cross- and low-level rules that experts would
need to explore for each high-level rule they select during rule exploration.
Table 9 reports for each category the total number of high-, cross-, and
low-level rules mined using the standard configuration (i.e., minsup=1%,
minlift=1.1, and maxlength=3). The number of high-level rules remains
manageable for almost all rule categories. Among the extracted high-level
rules, experts may focus on a subset according to their specific goal. For
example, based on the experts' suggestion, to plan the allocation of medical
divisions according to the offered services, healthcare system managers would
consider high-level rules (1)-(7) in Table 5. Column (4) of Table 9 reports
the ratio between the number of cross- and low-level rules and the number
of high-level ones. It indicates the average number of lower-level rules per
category that experts would need to explore for each high-level rule they
select. The achieved results indicate that, on average, domain experts
have to explore approximately 30 cross- and low-level rules per high-level
rule. Therefore, level-wise rule exploration allows the expert to perform a
simplified and effective rule browsing.
We also analyzed the impact of traditional rule quality measures (i.e.,
support and lift) on the cardinality of the mined rule set. We performed a
large body of experiments by varying minsup and minlift in turn and by
setting maxlength to 3. A separate discussion on each quality measure is
given below.
Support. As the support threshold decreases, the number of generated item
combinations increases combinatorially. Hence, the support threshold
significantly affects the number of extracted rules. Enforcing medium or high
values may degrade the quality of the result, because some specific yet
interesting rules could be discarded. Hence, we recommend that users set low
support threshold values (e.g., 1%), even though this setting may generate,
on average, a larger number of rules. For example, even though rule (14) in
Table 5 has a relatively small support value (1.4%), it was deemed to be
valuable by domain experts for advanced analysis (e.g., to verify the
adherence of physicians' prescriptions to standard guidelines).
Lift. Enforcing lift threshold values above 1 affects the number of mined
rules. As standard configuration, we enforced a minimum lift threshold equal
to 1.1 on real data to prune both negatively correlated and uncorrelated item
combinations. On the one hand, the interest of negatively correlated rules
(i.e., rules with lift below 1) is marginal in our context of analysis. On the
other hand, rules with lift close to 1 are misleading because their occurrences
are not actually correlated with each other. Hence, among the positively
correlated rules we further pruned approximately 10% of them because their
lift value is between 1 and 1.1.
Confidence. The confidence is a commonly used rule quality measure.
Unfortunately, enforcing a minimum confidence threshold could bias the
quality of the mining result, because, in some cases, confidence values could
be misleading [15]. Therefore, we decided against enforcing any confidence
constraint.
3.5. Execution time
When coping with large or complex patient datasets, rule mining becomes
the most time-consuming step of the MeTA framework. Specifically, the
generalized itemset mining step, driven by the support threshold, is known to
be the most computationally intensive task of the rule mining process [39].
Hence, we analyzed the execution time spent by MeTA with different support
thresholds. The non-redundant generalized rule mining step took less than
5 minutes with the minimum support threshold set to 1%. When higher support
thresholds are enforced, the execution time decreases super-linearly, e.g.,
less than 1 minute while setting the support threshold value
The scalability of the proposed approach with the number of transactions
and the transaction length is the same as that of traditional generalized
rule mining algorithms (e.g., [39, 22, 10]), i.e., it scales linearly with
the dataset cardinality, whereas it scales more than linearly with the number
of distinct data items. In our context, the dataset cardinality corresponds
to the number of considered patients, while the average transaction length
strictly depends on the average number of prescribed examinations and drugs
per patient.
4. Related works
Data mining techniques have been widely used to perform medical data
analysis targeted to different diseases and treatments. Previous works
addressed the problems of clustering (e.g., [5, 36, 4]), classifying (e.g.,
[25, 28]), and mining frequent patterns from healthcare data (e.g., [21]).
This work addresses the analysis of a specific type of frequent patterns,
i.e., the generalized association rules, mined from patient log data.
A significant research effort has been devoted to mining association rules
from healthcare data. For example, sick and healthy factors for heart dis-
eases have been investigated in [31] by exploiting three different association
rule extraction algorithms, namely Apriori [3], Predictive Apriori [37], and
Tertius [21]. Similarly, in [34] two of the above algorithms have been ex-
ploited to generate accurate rule-based models for type-2 diabetic patient
classification. In [38], association rules have been exploited to determine two
important diseases in patients diagnosed with essential hypertension, i.e.,
non-insulin dependent diabetes mellitus and cerebral infarction.
Parallel efforts have been devoted to taking temporal information into ac-
count during pattern mining from healthcare data. For example, the authors
in [47] coupled association rule mining techniques with temporal abstraction
methods to reduce hospitalization in dialysis patients, while a temporal pat-
tern mining approach has been presented in [13] to predict the risk of devel-
oping heparin-induced thrombocytopenia. Association rules have also been
applied to discover complex temporal relationships in interval-based tem-
poral clinical data [19]. The works presented in [6, 20] addressed sequential
pattern mining from healthcare data. Specifically, the authors in [6] analyzed
the diagnostic pathways for colon cancer actually followed by patients, and
compared them with standard medical guidelines, whereas in [20] a sequential
pattern mining algorithm has been customized to manage multi-dimensional
healthcare data. Unlike [13, 6, 20, 19], this work considers neither sequences
nor temporal patterns. Instead, it exploits generalized association
rules to analyze the co-occurrences among examination/drug prescriptions at
different abstraction levels regardless of the temporal order of prescriptions.
To some extent, the integration of taxonomy information is complementary
to both temporal and non-temporal pattern mining.
A parallel research effort has been devoted to extracting generalized
association rules from potentially large datasets. Generalized association
rules were first introduced and used in [39] in the context of market basket
analysis as an extension of the traditional association rule mining task [2].
The key idea was to aggregate market data items into higher-level item
categories according to a user-provided taxonomy with the aim of discovering
associations among data at different granularity levels. The first generalized
association rule mining algorithm [39] follows the traditional Apriori-based [3]
two-step process for generalized rule mining: (i) frequent generalized item-
set mining, driven by a minimum support threshold, and (ii) generalized
rule generation, from the previously mined frequent itemsets, driven by a
minimum confidence threshold. Candidate frequent generalized itemsets are
generated by exhaustively evaluating the taxonomy. To reduce the com-
plexity and improve the efficiency of the mining process, several algorithmic
optimization strategies have been proposed (e.g., [29, 35, 39, 22]). Prelimi-
nary attempts to discover generalized patterns from medical data have been
presented in [26, 14]. Specifically, in [26] the authors analyzed multiple-
level co-occurrences among diseases in a public health dataset, while in [14]
generalized rules are used to represent biomedical relationships between con-
cepts occurring in Medline. With regard to the medical context, this work
targets a completely different area with respect to [14, 26], i.e., it analyzes
multiple-level associations among medical treatments (examinations, drugs)
and patient profiles rather than among diseases or textual content of med-
ical libraries. Concerning the performed analysis, this paper significantly
improves state-of-the-art approaches, because (i) it addresses the problem
of making the rule set manageable by domain experts for manual result ex-
ploration and (ii) it also considers associations among items of length above
When dealing with large collections of electronic health records, a huge
set of patterns could potentially be generated. Hence, the readability and
manageability of the mining result may be significantly reduced. To overcome
this issue, significant efforts have been devoted to applying pruning
strategies (e.g., [12, 27, 45, 15]) on top of/in conjunction with itemset or
association rule mining. For example, several approaches propose to push ad
hoc constraints to reduce the number of mined frequent generalized itemsets
[40, 22, 41, 10, 18, 16]. In the context of medical data mining, association
rules have been used in [8] to compactly represent correlations among
examinations undergone by patients. The work in [32] proposed a new
graph-based approach to reducing the cardinality of the candidate itemsets
and thus discovering useful and manageable rules about medical images. In
this study, we counteract the exponential growth in the number of mined
generalized rules by applying an established rule pruning strategy [12],
which first generates association rules on top of closed itemsets and then
prunes less informative rule extensions. Furthermore, we propose to explore
rules in a top-down fashion by performing a selective drill-down from the
most interesting high-level rules to their most specific descendant rules.
5. Conclusions and future work
This paper presents a novel approach to analyzing multiple-level correla-
tions among medical datasets equipped with taxonomies. Since patient log
datasets are often relatively sparse, discovering valuable correlations among
multiple patient data features could be a challenging task. To overcome this
issue, we propose to discover, categorize, and analyze non-redundant gener-
alized association rules, which represent worthy multiple-level associations
among data items.
The experiments, performed on a real diabetic patient dataset, highlight
correlations among treatments and patient profiles that are consistent with
the guidelines for diabetes [23, 1]. Furthermore, the extracted high-level
rules represent useful information that is commonly discarded by traditional
rule mining approaches.
As future work, we plan to study the temporal order of examination/drug
prescriptions at different abstraction levels. More specifically, we would like
to extend state-of-the-art sequence and temporal pattern mining approaches
(e.g., [13, 19]) by also considering taxonomy information during medical data
analysis. This approach could be applied to any patient dataset in which
the temporal order of prescriptions is indicated.
An interesting development of our framework will be the application of
the proposed approach to enriched patient datasets containing examination
outcomes and drug posologies, because such information is strongly
correlated with examination/drug prescriptions. To perform this kind of
analysis, a slight modification of the data representation used in the MeTA
framework is needed. Furthermore, a few straightforward preprocessing steps
(e.g., data discretization) are needed prior to association rule mining.
Finally, we would like to explore the applicability of the proposed approach
to other contexts (e.g., genetic data [9], sports data [11], mobile data
[17]).
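The discretization step mentioned above can be sketched as follows. This is a minimal illustration, assuming numeric attributes are mapped to interval labels and then represented as (attribute, value) items in the style of the rules reported in the tables. The function names and the cut points are hypothetical placeholders; they are neither clinical thresholds nor values taken from the analyzed dataset.

```python
from bisect import bisect_right


def discretize(value, cut_points):
    """Map a number to an interval label, e.g. 45 -> '[40,60)' for cuts [40, 60]."""
    idx = bisect_right(cut_points, value)
    low = "-inf" if idx == 0 else str(cut_points[idx - 1])
    high = "+inf" if idx == len(cut_points) else str(cut_points[idx])
    bracket = "(" if idx == 0 else "["
    return f"{bracket}{low},{high})"


def to_items(record, cut_points_by_attribute):
    """Turn a patient record into a set of (attribute, value) items."""
    items = set()
    for attribute, value in record.items():
        cuts = cut_points_by_attribute.get(attribute)
        if cuts is not None:
            # Numeric attribute with configured cut points: use the interval label.
            items.add((attribute, discretize(value, cuts)))
        else:
            # Categorical attribute: keep the value unchanged.
            items.add((attribute, value))
    return items


# Hypothetical cut points, chosen only for illustration.
CUTS = {"Age": [40, 60], "Uric acid": [3.5, 7.0]}
print(to_items({"Age": 45, "Gender": "F", "Uric acid": 7.4}, CUTS))
```

Sorted cut points are assumed; each value falls into exactly one interval, so every discretized attribute contributes a single item per record.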
Acknowledgments
The authors wish to thank Dr. Baudolino Mussa and Dr. Dario Bellomo
for their advice and fruitful discussions.
This work was partially supported by the GenData2020 project grant,
which is funded by the Italian Ministry of Research (MIUR).
References
[1] ADA, American Diabetes Association Standards of Medical Care in Di-
abetes 2013, Diabetes Care 36 (2013) S11–S66.
[2] R. Agrawal, T. Imielinski, A. Swami, Mining association rules between
sets of items in large databases, SIGMOD Rec. 22 (1993) 207–216.
[3] R. Agrawal, R. Srikant, Fast algorithms for mining association rules in
large databases, in: Proceedings of the 20th International Conference
on Very Large Data Bases, VLDB '94, Morgan Kaufmann Publishers
Inc., San Francisco, CA, USA, 1994, pp. 487–499.
[4] M.U. Ahmed, P. Funk, Mining rare cases in post-operative pain by
means of outlier detection, in: 2011 IEEE International Symposium on
Signal Processing and Information Technology, ISSPIT 2011, pp. 35–41.
[5] D. Antonelli, E. Baralis, G. Bruno, T. Cerquitelli, S. Chiusano, N.A.
Mahoto, Analysis of diabetic patients through their examination history,
Expert Syst. Appl. 40 (2013) 4672–4678.
[6] D. Antonelli, E. Baralis, G. Bruno, S. Chiusano, N. Mahoto, C. Petrigni,
Analysis of diagnostic pathways for colon cancer, Flexible Services and
Manufacturing Journal 24 (2012) 379–399.
[7] ATC, Norwegian-institute-of-public-health: Atc/ddd index 2013, 2013.
[8] E. Baralis, G. Bruno, S. Chiusano, V.C. Domenici, N.A. Mahoto,
C. Petrigni, Analysis of medical pathways by means of frequent closed
sequences, in: Knowledge-Based and Intelligent Information and Engi-
neering Systems - 14th International Conference, KES 2010, Proceed-
ings, Part III, pp. 418–425.
[9] E. Baralis, L. Cagliero, T. Cerquitelli, S. Chiusano, P. Garza, Frequent
weighted itemset mining from gene expression data, in: 13th IEEE Inter-
national Conference on BioInformatics and BioEngineering, BIBE 2013,
[10] E. Baralis, L. Cagliero, T. Cerquitelli, V. D'Elia, P. Garza, Support
driven opportunistic aggregation for generalized itemset extraction, in:
5th IEEE International Conference on Intelligent Systems, IS 2010, pp.
[11] E. Baralis, T. Cerquitelli, S. Chiusano, V. D'Elia, R. Molinari, D. Susta,
Early prediction of the highest workload in incremental cardiopulmonary
tests, ACM TIST 4 (2013) 70.
[12] I. Batal, G.F. Cooper, M. Hauskrecht, A Bayesian scoring technique
for mining predictive and non-spurious rules, in: Machine Learning
and Knowledge Discovery in Databases - European Conference, ECML
PKDD 2012. Proceedings, Part II, pp. 260–276.
[13] I. Batal, H. Valizadegan, G.F. Cooper, M. Hauskrecht, A temporal pat-
tern mining approach for classifying electronic health record data, ACM
TIST 4 (2013) 63.
[14] M. Berardi, M. Lapi, P. Leo, C. Loglisci, Mining generalized association
rules on biomedical literature, in: Innovations in Applied Artificial In-
telligence, 18th International Conference on Industrial and Engineering
Applications of Artificial Intelligence and Expert Systems, IEA/AIE,
pp. 500–509.
[15] S. Brin, R. Motwani, C. Silverstein, Beyond market baskets: Generaliz-
ing association rules to correlations, SIGMOD Rec. 26 (1997) 265–276.
[16] L. Cagliero, Discovering temporal change patterns in the presence of
taxonomies, IEEE Trans. Knowl. Data Eng. 25 (2013) 541–555.
[17] L. Cagliero, T. Cerquitelli, P. Garza, L. Grimaudo, Misleading general-
ized itemset discovery, Expert Syst. Appl. 41 (2014) 1400–1410.
[18] L. Cagliero, P. Garza, Itemset generalization with cardinality-based con-
straints, Inf. Sci. 244 (2013) 161–174.
[19] C. Combi, A. Sabaini, Extraction, analysis, and visualization of tem-
poral association rules from interval-based clinical data, in: Artificial
Intelligence in Medicine - 14th Conference on Artificial Intelligence in
Medicine, AIME 2013, pp. 238–247.
[20] E. Egho, C. Raïssi, D. Ienco, N. Jay, A. Napoli, P. Poncelet, C. Quantin,
M. Teisseire, Healthcare trajectory mining by combining multidimen-
sional component and itemsets, in: New Frontiers in Mining Complex
Patterns - First International Workshop, NFMCP 2012, Held in Con-
junction with ECML/PKDD 2012, pp. 109–123.
[21] P. Flach, V. Maraldi, F. Riguzzi, Algorithms for efficiently and effec-
tively using background knowledge in tertius, 2006.
[22] J. Han, Y. Fu, Mining multiple-level association rules in large databases,
IEEE Trans. on Knowl. and Data Eng. 11 (1999) 798–805.
[23] ICD-9-CM, International classification of diseases, 9th revision, clinical
modification, 2011.
[24] IDF, International Diabetes Federation, 2013.
[25] A.G. Karegowda, M.A. Jayaram, A.S. Manjunath, Cascading k-means
clustering and k-nearest neighbor classifier for categorization of diabetic
patients, International Journal of Engineering and Advanced Technology
(IJEAT) 1 (2012) 147–151.
[26] R. Kost, B. Littenberg, E.S. Chen, Exploring generalized association
rule mining for disease co-occurrences, in: Proceedings of the AMIA
2012 Annual Symposium, AMIA, Chicago, Illinois, USA, 2012, pp. 1284–
[27] M. Mampaey, N. Tatti, J. Vreeken, Tell me what I need to know: Suc-
cinctly summarizing data with itemsets, in: Proceedings of the 17th
ACM SIGKDD International Conference on Knowledge Discovery and
Data Mining, KDD '11, ACM, New York, NY, USA, 2011, pp. 573–581.
[28] X.H. Meng, Y.X. Huang, D.P. Rao, Q. Zhang, Q. Liu, Comparison of
three data mining models for predicting diabetes or prediabetes by risk
factors, The Kaohsiung Journal of Medical Sciences 29 (2013) 93–99.
[29] J. Mennis, J.W. Liu, Mining association rules in spatio-temporal data:
An analysis of urban socioeconomic and land cover change, Transactions
in GIS 9 (2005) 5–17.
[30] MeTA, MeTA source code, 2014.
[31] J. Nahar, T. Imam, K.S. Tickle, Y.P.P. Chen, Association rule mining
to detect factors which contribute to heart disease in males and females,
Expert Systems with Applications 40 (2013) 1086–1093.
[32] H. Pan, X. Tan, Q. Han, X. Feng, G. Yin, Gma: An approach for associ-
ation rules mining on medical images, in: Proceedings of the 8th Interna-
tional Conference on Intelligent Computing Theories and Applications,
ICIC'12, Springer-Verlag, Berlin, Heidelberg, 2012, pp. 425–432.
[33] N. Pasquier, Y. Bastide, R. Taouil, L. Lakhal, Discovering frequent
closed itemsets for association rules, in: Proceedings of the 7th Inter-
national Conference on Database Theory, ICDT '99, Springer-Verlag,
London, UK, 1999, pp. 398–416.
[34] B.M. Patil, R.C. Joshi, D. Toshniwal, Classification of type-2 diabetic
patients by using apriori and predictive apriori, Int. J. Comput. Vision
Robot. 2 (2011) 254–265.
[35] I. Pramudiono, M. Kitsuregawa, Fp-tax: Tree structure based general-
ized association rule mining, in: Proceedings of the 9th ACM SIGMOD
Workshop on Research Issues in Data Mining and Knowledge Discovery,
DMKD '04, ACM, New York, NY, USA, 2004, pp. 60–63.
[36] S.M. van Rooden, W.J. Heiser, J.N. Kok, D. Verbaan, J.J. van Hilten,
J. Marinus, The identification of Parkinson's disease subtypes using
cluster analysis: A systematic review, Movement Disorders 25 (2010) 969–
[37] T. Scheffer, Finding association rules that trade support optimally
against confidence, Intell. Data Anal. 9 (2005) 381–395.
[38] A.M. Shin, I.H. Lee, G.H. Lee, H.J. Park, H.S. Park, K.I. Yoon, J.J. Lee,
Y.N. Kim, Diagnostic analysis of patients with essential hypertension
using association rule mining, Healthc Inform Res 16 (2010) 77–81.
[39] R. Srikant, R. Agrawal, Mining generalized association rules, in: Pro-
ceedings of the 21th International Conference on Very Large Data Bases,
VLDB '95, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA,
1995, pp. 407–419.
[40] R. Srikant, Q. Vu, R. Agrawal, Mining association rules with item con-
straints, in: D. Heckerman, H. Mannila, D. Pregibon (Eds.), Proceed-
ings of the Third International Conference on Knowledge Discovery and
Data Mining (KDD-97), AAAI Press, Newport Beach, California, USA,
1997, pp. 67–73.
[41] K. Sriphaew, T. Theeramunkong, A new method for finding generalized
frequent itemsets in generalized association rule mining, in: Proceedings
of the Seventh IEEE Symposium on Computers and Communications
(ISCC 2002), IEEE Computer Society, Taormina, Italy, 2002, pp. 1040–
[42] P.N. Tan, V. Kumar, Interestingness measures for association patterns:
A perspective, in: KDD 2000 Workshop on Post-Processing in Machine
Learning and Data Mining: Interpretation, Visualization, Integration,
and Related Topics, Boston, MA, USA.
[43] P.N. Tan, V. Kumar, J. Srivastava, Selecting the right interestingness
measure for association patterns, in: Proceedings of the Eighth ACM
SIGKDD International Conference on Knowledge Discovery and Data
Mining, KDD '02, ACM, New York, NY, USA, 2002, pp. 32–41.
[44] P.N. Tan, M. Steinbach, V. Kumar, Introduction to Data Mining,
Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 2006.
[45] N. Tatti, Probably the best itemsets, in: Proceedings of the 16th ACM
SIGKDD International Conference on Knowledge Discovery and Data
Mining, KDD '10, ACM, New York, NY, USA, 2010, pp. 293–302.
[46] P.S. Timiras, Physiological Basis of Aging and Geriatrics, Taylor & Fran-
cis, United Kingdom, 2013.
[47] J.Y. Yeh, T.H. Wu, C.W. Tsao, Using data mining techniques to predict
hospitalization of hemodialysis patients, Decis. Support Syst. 50 (2011)
[48] M.J. Zaki, Mining non-redundant association rules, Data Min. Knowl.
Discov. 9 (2004) 223–248.
Table 5: Examples of correlations between examinations (Class E-Rules).
{(Examination, AST)} → {(Examination, Uric acid)}
{(Examination, ALT)} → {(Examination, Uric acid)}
{(Examination, Gamma GT)} → {(Examination, Uric acid)}
Table 6: Examples of correlations between drugs (Class D-Rules).
{(Drug, Category R)} → {(Drug, Category J)}
{(Drug, Category J)} → {(Drug, Category R)}
R = Respiratory system
J = Anti-infectives for systemic use
Table 7: Examples of correlations between drugs and examinations (Class ED-Rules).
B = Blood and blood forming organs
Table 8: Examples of correlations between user profiles, drugs, and examinations (Class
{(Age, [40-59]), (Examination, HDL Cholesterol)} → {(Drug, Rosuvastatin)} 1.68%
Number of non-redundant rules per template (minsup=1%, minlift=1.1). Reported quantities: number of non-redundant rules; average cross- and low-level rules per high-level rule. Profiles: age profiles; gender profiles.
Source: http://porto.polito.it/2570938/1/AntonellietAl.pdf