CrowdTruth Measures for Language Ambiguity
The Case of Medical Relation Extraction
Anca Dumitrache1,2, Lora Aroyo1, and Chris Welty3
1 VU University Amsterdam, Netherlands
2 IBM CAS, Amsterdam, Netherlands
3 Google Research, New York, USA
Abstract. A widespread use of linked data for information extraction is distant supervision, in which relation tuples from a data source are found in sentences in a text corpus, and those sentences are treated as training data for relation extraction systems. Distant supervision is a cheap way to acquire training data, but that data can be quite noisy, which limits the performance of a system trained with it. Human annotators can be used to clean the data, but in some domains, such as medical NLP, it is widely believed that only medical experts can do this reliably. We have been investigating the use of crowdsourcing as an affordable alternative to using experts to clean noisy data, and have found that with the proper analysis, crowds can rival and even out-perform the precision and recall of experts, at a much lower cost. We have further found that the crowd, by virtue of its diversity, can help us find evidence of ambiguous sentences that are difficult to classify, and we have hypothesized that such sentences are likely just as difficult for machines to classify. In this paper we outline CrowdTruth, a previously presented method for scoring ambiguous sentences that suggests that existing models of truth are inadequate, and we present for the first time a set of weighted metrics for evaluating the performance of experts, the crowd, and a trained classifier in light of ambiguity. We show that our theory of truth and our metrics are a more powerful way to evaluate NLP performance than traditional unweighted metrics like precision and recall, because they allow us to account for the rather obvious fact that some sentences express the target relations more clearly than others.
Introduction
NLP often relies on the development of a set of gold standard annotations, or ground truth, for the purpose of training, testing and evaluation. Distant supervision [17] is a helpful solution that has brought linked data sets a lot of attention in NLP; however, the resulting data can be noisy. Human annotators can help to clean up this noise, but for clinical NLP annotators are usually believed to require domain knowledge, making the process of acquiring ground truth more difficult. The lack of annotated datasets for training and benchmarking is considered one of the big challenges of Clinical NLP [6].
Furthermore, the assumption that the gold standard represents a universal and reliable model for language is flawed [4]. Disagreement between annotators is usually eliminated through overly prescriptive guidelines, resulting in data that is neither general nor reflective of language's inherent ambiguity. The process of acquiring ground truth by working exclusively with domain experts is also costly and does not scale.
Crowdsourcing can be a much faster and cheaper procedure than expert annotation, and it allows for collecting enough annotations per task to represent the diversity inherent in language. Crowd workers, however, generally lack medical expertise, which might impact the quality and reliability of their work on more knowledge-intensive tasks.
Our approach can overcome the limitations of gathering expert ground truth by using disagreement analysis on crowd annotations to model the ambiguity inherent in medical text. We have previously shown that our approach can improve relation extraction classifier performance over annotated data provided by experts, can effectively identify low-quality workers, and can identify issues with the annotation tasks themselves. In this paper we explore the hypothesis that our sentence-level metrics provide useful information about sentence clarity, and present initial results on the value of different approaches to scoring beyond the traditional precision, recall, and accuracy.
Related Work
Crowdsourcing ground truth has shown promising results in a variety of domains.
[12] compared the crowd versus experts for the task of part-of-speech tagging.
The authors also show that models trained on crowdsourced annotations can perform just as well as expert-trained models. [14] studied crowdsourcing for relation extraction in the general domain, comparing its efficiency to that of fully automated information extraction approaches. Their results showed the crowd was especially suited to identifying subtle formulations of relations that do not appear frequently enough to be picked up by statistical methods.
Other research for crowdsourcing ground truth includes: entity clustering
and disambiguation [15], Twitter entity extraction [11], multilingual entity extraction and paraphrasing [8], and taxonomy creation [9]. However, all of these approaches rely on the assumption that one black-and-white gold standard must exist for every task. Disagreement between annotators is discarded by picking one answer that reflects some consensus, usually through majority vote. The number of annotators per task is also kept low, between two and five workers, also in the interest of eliminating disagreement. The novelty in our approach is to consider language ambiguity, and consequently inter-annotator disagreement, as an inherent feature of language. The metrics we employ for determining the quality of crowd answers are specifically tailored to quantify disagreement between annotators, rather than eliminate it.
The role of inter-annotator disagreement when building a gold standard
has previously been discussed by [19]. After empirically studying part-of-speech datasets, the authors found that inter-annotator disagreement is consistent across domains, and even across languages. Furthermore, most disagreement is indicative of debatable cases in linguistic theory, rather than of faulty annotation. We believe these findings manifest even more strongly for NLP tasks involving semantic ambiguity, such as relation extraction. In assessing the Ontology Alignment Evaluation Initiative (OAEI) benchmark, [7] found that disagreement between annotators (both crowd and expert) is an indicator of inherent uncertainty in the domain knowledge, and that current benchmarks in ontology alignment and evaluation are not designed to model this uncertainty.
Human annotation is a process of semantic interpretation. It can be described
using the triangle of reference [13], which links together three aspects: sign (input text), interpreter (worker), and referent (annotation). Ambiguity in one aspect of the triangle will propagate and affect the others; e.g., an unclear sentence will cause more disagreement between workers. Therefore, in our work, we use metrics that harness disagreement for each of the three aspects of the triangle, measuring the quality of the worker, as well as the ambiguity of the text and the task.
Experimental Setup
We set up an experiment to train and evaluate a relation extraction model for a sentence-level relation classifier. The classifier takes as input a sentence and two terms from that sentence, and returns a score reflecting the likelihood that a specific relation, in our case the cause relation between disorders and symptoms, is expressed in the sentence between the terms. Starting from a set of 902 sentences that are likely to contain medical relations, we constructed a workflow for collecting annotations through crowdsourcing. This output was analyzed with our metrics for capturing disagreement, and then used to train a model for relation extraction. In parallel, we also constructed a model based on data from a traditional gold standard created by domain experts, which we then compare to the crowd model.
The dataset used in our experiments contains 902 medical sentences extracted from PubMed article abstracts. The MetaMap parser [1] was run over the corpus to identify medical terms from the UMLS vocabulary [5]. Distant supervision [17] was used to select sentences with pairs of terms that are linked in UMLS by one of our chosen seed medical relations. The intuition behind distant supervision is that since we know the terms are related, and they appear in the same sentence, it is more likely that the sentence expresses a relation between them. The seed relations were restricted to a set of eleven UMLS relations important for clinical decision making [21] (see Tab. 1). All of the data that we have used is available online at: http://data.crowdtruth.org/medical-relex.
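As an illustration of this selection step, the following sketch (in Python) shows distant supervision over sentences whose term mentions have already been mapped to UMLS concept identifiers; the data structures, the umls_relations lookup, and the relation names are simplifying assumptions rather than the actual MetaMap/UMLS pipeline.

SEED_RELATIONS = {"treat", "prevent", "diagnose", "cause", "location", "symptom",
                  "manifestation", "contraindicate", "side effect", "is a", "part of"}

def distant_supervision(sentences, umls_relations):
    """sentences: list of dicts like {"text": str, "term_pairs": [(cui1, cui2), ...]};
    umls_relations: dict mapping a (cui1, cui2) pair to a UMLS relation name."""
    candidates = []
    for sentence in sentences:
        for cui1, cui2 in sentence["term_pairs"]:
            relation = umls_relations.get((cui1, cui2))
            if relation in SEED_RELATIONS:
                # The terms are related in UMLS and co-occur in the sentence,
                # so the sentence is kept as a (noisy) positive example.
                candidates.append({"text": sentence["text"],
                                   "terms": (cui1, cui2),
                                   "seed_relation": relation})
    return candidates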
Table 1: Set of medical relations.
treat: therapeutic use of a drug (example: penicillin treats infection)
prevent: preventative use of a drug (example: vitamin C prevents influenza)
diagnose: diagnostic use of an ingredient, test or a drug (example: RINNE test is used to diagnose hearing loss)
cause (has causative agent): the underlying reason for a symptom or a disease (example: fever induces dizziness)
location (disease has primary anatomic site): body part in which disease or disorder is observed (example: leukemia is found in the circulatory system)
symptom (disease has finding): deviation from normal function indicating the presence of disease or abnormality (example: pain is a symptom of a broken arm)
manifestation (has manifestation): links disorders to the observations that are closely associated with them (example: abdominal distention is a manifestation of liver failure)
contraindicate (contraindicated drug): a condition for which a drug or treatment should not be used (example: patients with obesity should avoid ...)
associated with: signs, symptoms or findings that often appear together (example: patients who smoke often have yellow ...)
side effect: a secondary condition or symptom that results from a drug (example: use of antidepressants causes dryness ...)
is a: a relation that indicates that one of the terms is a more specific variation of the other (example: migraine is a kind of headache)
part of: an anatomical or structural sub-component (example: the left ventricle is part of the heart)
For collecting annotations from medical experts, we employed medical students in their third year at American universities, who had just taken the United States Medical Licensing Examination (USMLE) and were waiting for their results. Each sentence was annotated by exactly one person. The annotation task consisted of deciding whether or not the UMLS seed relation discovered by distant supervision is present in the sentence for the two selected terms.
Crowdsourcing setup
The crowdsourced annotation is performed in a workflow of three tasks (Fig.1).
The sentences were pre-processed to determine whether the terms found with distant supervision are complete or not. Identifying complete medical terms is difficult, and the automated method left a number of terms incomplete, which was a significant source of error for the crowd in subsequent stages; the incomplete terms were therefore sent through a crowdsourcing task (FactSpan) in order to get the full word span of the medical terms. Next, the sentences with the corrected term spans were sent to a relation extraction task (RelEx), where the crowd was asked to decide which relation holds between the two extracted terms. We also added four new relations (e.g. associated with) to account for weaker, more general links between the terms (see Tab. 1). The workers were able to read the definition of each relation, and could choose any number of relations per sentence. There were options for the cases when the terms were related, but not by one of the relations we provided (other), and for no relation between the terms (none). Finally, the results from RelEx were passed to another crowdsourcing task (RelDir) to determine the direction of the relation with regard to the two extracted terms. The two additional tasks (FactSpan and RelDir) were added to the basic RelEx task to correct the most common sources of error from the crowd.
All three crowdsourcing tasks were run on the CrowdFlower platform (http://crowdflower.com) with 10-15 workers per sentence, to allow for a distribution of perspectives. Even with three tasks and 10-15 workers per sentence, compared to a single expert judgment per sentence, the total cost of the crowd amounted to 2/3 of the sum paid for the experts. In our case, cost was not the limiting factor for the experts, but their time and availability.
Fig. 1: CrowdTruth Workflow for Medical Relation Extraction on CrowdFlower [10].
For each crowdsourcing task in the workflow, the crowd output was processed with our metrics, a set of general-purpose crowdsourcing metrics [3]. These metrics attempt to model the crowdsourcing process based on the triangle of reference [18], with the vertices being the input sentence, the worker, and the target relations. Our theory is that ambiguity and disagreement at any of the vertices (e.g. a sentence with unclear meaning, a poor quality worker, or an unclear relation) will propagate through the system, influencing the other components. For example, a worker who annotates an unclear sentence is more likely to disagree with the other workers, and this can impact that worker's quality. A low quality worker is more likely to disagree with the other workers, and this can impact the apparent quality of the sentence. If one of the target relations is itself ambiguous, it will be difficult to identify and will generate disagreement that may have nothing to do with the quality of sentences or workers. Our metrics account for this by isolating the signals from the workers, the sentences, and the target relations, and more accurately evaluating each. In previous work we have validated this premise in several empirical studies [3].
In this paper we focus specifically on sentence quality, to evaluate our claim that low quality sentences are difficult to annotate, and likewise difficult for machines to process.
Sent.1: Renal osteodystrophy is a general complication of chronic renal failure and end stage renal disease.
  relation score: 0.09 0.96 0.09 0.19 0 0 0.09 0 0 0
  training score: -0.97 0.99 -0.97 -0.94 -1 -1 -0.97 -1 -1 -1
Sent.2: If TB is a concern, a PPD is performed.
  relation score: 0.36 0.12 0.84 0 0 0.36 0 0 0 0.12 0
  training score: -0.89 -0.96 0.95 -1 -1 -0.89 -1 -1 -1 -0.96 -1
Table 2: Example sentences with scores from the crowd dataset; training score calculated for a negative/positive sentence-relation threshold equal to 0.5, with linear rescaling into the [−1, −0.85] interval for negative and [0.85, 1] for positive.
To measure this effect, we begin with a simple representation of the crowd output from the RelEx task:
– annotation vector: the annotations of one worker for one sentence. For each worker i, their solution to a task on a sentence s is the vector Ws,i. If the worker selects a relation, its corresponding component is marked with '1', and '0' otherwise. For instance, in the case of RelEx, the vector has fourteen components, one for each relation, plus none and other.
– sentence vector: for every sentence s, we sum the annotation vectors of all workers on the given task: Vs = Σi Ws,i.
The sentence vector is a simple representation of the annotations on a sentence, and leads to the sentence-relation score, which measures, for each relation, how close the sentence vector is to perfect agreement on that relation. It is simply the cosine similarity between the sentence vector and the unit vector for the relation: srs(s, r) = cos(Vs, r̂). The higher the value of this metric, the more clearly the relation is expressed in the sentence. The purpose of the experiments is to provide evidence that the srs is measuring the clarity, or inversely the ambiguity, of a sentence with respect to a particular relation, and that sentences with low scores present difficulty for the crowd, experts, and machines alike.
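A minimal sketch of these two representations and of the sentence-relation score, assuming annotation vectors with the fourteen components described above (the relation ordering and the example votes are made up for illustration):

import numpy as np

# 12 relations plus "none" and "other" give the 14 vector components.
RELATIONS = ["treat", "prevent", "diagnose", "cause", "location", "symptom",
             "manifestation", "contraindicate", "associated with",
             "side effect", "is a", "part of", "none", "other"]

def sentence_vector(annotation_vectors):
    """Sum the binary annotation vectors of all workers on one sentence."""
    return np.sum(np.asarray(annotation_vectors), axis=0)

def sentence_relation_score(sent_vec, relation):
    """Cosine similarity between the sentence vector and the relation's unit vector."""
    unit = np.zeros(len(RELATIONS))
    unit[RELATIONS.index(relation)] = 1.0
    norm = np.linalg.norm(sent_vec)
    return float(np.dot(sent_vec, unit) / norm) if norm else 0.0

# Example: two workers pick "cause", one picks "symptom".
votes = [[0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
         [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
         [0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0]]
print(sentence_relation_score(sentence_vector(votes), "cause"))  # ~0.89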
We use a two-step process to eliminate low-quality worker annotations. We first run the sentence metrics and filter out sentences whose quality score is one standard deviation below the mean; then we run our worker metrics [2] on the remaining sentences and filter out all workers below a trained threshold. The purpose of the first step is to ensure the worker quality scores are not adversely impacted by confusing sentences. We then remove all low quality worker annotations and re-evaluate the sentence metrics on all sentences.
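The sketch below outlines this two-step filter; the sentence_quality and worker_quality functions stand in for the CrowdTruth sentence and worker metrics [2,3], whose definitions are not reproduced here, and the annotation record layout is an assumption.

import numpy as np

def filter_low_quality(annotations, sentence_quality, worker_quality, worker_threshold):
    """annotations: list of dicts like {"sentence": id, "worker": id, "vector": [...]}."""
    # Step 1: score sentences and drop those more than one standard deviation
    # below the mean, so confusing sentences do not distort the worker scores.
    sentences = {a["sentence"] for a in annotations}
    sq = {s: sentence_quality(s, annotations) for s in sentences}
    cutoff = np.mean(list(sq.values())) - np.std(list(sq.values()))
    kept = [a for a in annotations if sq[a["sentence"]] >= cutoff]

    # Step 2: score workers on the remaining sentences and drop those below
    # the trained threshold.
    workers = {a["worker"] for a in kept}
    wq = {w: worker_quality(w, kept) for w in workers}
    good_workers = {w for w, q in wq.items() if q >= worker_threshold}

    # Keep only annotations from good workers; the sentence metrics are then
    # re-evaluated on this cleaned set (over all sentences).
    return [a for a in annotations if a["worker"] in good_workers]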
Training the model
At the highest level, our research goal is to investigate crowdsourcing as a way to gather human annotated data for training and evaluating cognitive systems. In these experiments we were specifically gathering annotated data for a sentence-level relation extraction classifier [21]. This classifier is trained per individual relation, by feeding it both positive and negative examples. It offers support for both discrete labels and real values for weighting the confidence of the training data entries, with positive values in (0, 1] and negative values in [−1, 0).
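The classifier of [21] is not publicly available; purely as an illustration of how such weighted labels can be consumed, the sketch below trains a scikit-learn classifier using the sign of each training score as the label and its magnitude as a per-example weight (the feature matrix is assumed to be computed elsewhere).

from sklearn.linear_model import LogisticRegression

def fit_weighted(features, training_scores):
    """features: 2-D array of sentence features (assumed precomputed);
    training_scores: values in [-1, 0) for negatives and (0, 1] for positives."""
    labels = [1 if score > 0 else 0 for score in training_scores]
    weights = [abs(score) for score in training_scores]
    model = LogisticRegression(max_iter=1000)
    # Low-confidence examples contribute less to the fit via sample_weight.
    model.fit(features, labels, sample_weight=weights)
    return model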
To test our approach, we gathered four annotated data sets and trained
classifier models for the cause relation using five-fold cross-validation over the 902 sentences:
1. baseline: Discrete (positive or negative) labels are given for each sentence by the distant supervision method: for the cause relation, a positive example is a sentence containing two terms related by cause in UMLS. Distant supervision does not extract negative examples, so in order to generate a negative set for one relation, we use positive examples from the other (non-overlapping) relations shown in Tab. 1.
2. expert: Discrete labels based on an expert's judgment as to whether the baseline label is correct. The experts do not generate judgments for all combinations of sentences and relations: for each sentence, the annotator decides on the seed relation extracted with distant supervision. We reuse positive examples from the other relations to extend the number of negative examples.
3. single: Discrete labels for every sentence are taken from one randomly selected crowd worker who annotated the sentence. This data simulates the traditional single annotator setting.
4. crowd: Weighted labels for every sentence are based on the CrowdTruth sentence-relation score. The classifier expects positive scores for positive examples and negative scores for negative examples, so the sentence-relation scores must be re-scaled. An important variable in the re-scaling is a threshold to select positive and negative examples. The Results section compares the performance of the crowd at different threshold values. Given a threshold, the sentence-relation score is then linearly re-scaled into the [0.85, 1] interval for the positive label weight, and the [−1, −0.85] interval for the negative (see below; a sketch of this re-scaling follows the list). An example of how the scores were processed is given in Tab. 2.
In order to directly compare the expert to the crowd annotations, it was necessary to annotate precisely the same sentences using each method, and to train the classifier on each set. The limitation on batch size came from the availability of our experts; we were only able to use them for 902 sentences. In a batch this small, we found that the sentence-relation score, which ranged between [0, 1] and rarely assigned a weight of 1, diluted the positive signal too much in comparison to the expert scores, which were simply 0 or 1. We experimented, on a different data set, with rescaling the scores and selected the range that yielded the highest quality score, as specified above.
In order to meaningfully compare the crowd and expert models, we verified the sentences to provide a ground truth: a discrete positive or negative label on each sentence used in evaluation (for training, only the scores from the respective data set were used). While the main purpose of this work is to move beyond discrete labels for truth, we needed a reference standard to establish that our approach is at least as good as the accepted practice. To produce this reference standard, we first selected the positive/negative threshold for the sentence-relation score in the crowd dataset that yielded the highest agreement between the crowd and the experts, and then accepted all 755 sentences where the experts and crowd agreed as true positives. The remaining sentences were manually evaluated and assigned either a positive, negative, or ambiguous value. The ambiguous cases were subsequently removed, resulting in 902 sentences. In this way we created reliable, unbiased test scores to be used in the evaluation of the models. In some sense, removing the ambiguous cases penalizes our approach, which is designed specifically to help deal with them, but again we want to first establish that our approach is at least as good as accepted practice.
Preliminary experiments
As reported in [10] and summarized here, we compared each of the four datasets to our vetted reference standard, to determine the quality of the cause relation annotations, as shown in Fig. 2. As expected, the baseline data was the lowest quality, followed closely by the single crowd worker. The expert annotations achieved an F1 score of 0.844. Since the baseline, expert, and single sets are binary decisions, they appear as horizontal lines. For the crowd annotations, we plotted the F1 against different sentence-relation score thresholds for determining positive and negative sentences. Between the thresholds of 0.6 and 0.8, the crowd out-performs the expert, reaching a maximum F1 score of 0.907 at a threshold of 0.7.
Fig. 2: Annotation quality F1 score per negative/positive threshold for cause.
Fig. 3: Evaluation of the classifier trained with each dataset.
This difference is significant with p = 0.007, measured with McNemar's test [16].
We next wanted to verify that this improvement in annotation quality has a positive impact on the model that is trained with this data. In a cross-validation experiment, we trained the model with each of the four datasets for identifying the cause relation (discussed in more detail in [10]). The results of the evaluation (Fig. 3) show the best performance for the crowd model when the sentence-relation threshold for deciding between negative/positive equals 0.5. Trained with this data, the classifier model achieves an F1 score of 0.642, compared to the expert-trained model, which reaches 0.638. McNemar's test shows statistical significance with p = 0.016. This result demonstrates that the crowd provides training data that is at least as good as, if not better than, that of the experts.
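For reference, a small sketch of how such a McNemar comparison can be computed with statsmodels; the crowd_correct and expert_correct boolean arrays (one entry per test sentence) are hypothetical inputs, not our evaluation script.

import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

def mcnemar_p(crowd_correct, expert_correct):
    """Paired comparison of two models' correctness on the same test sentences."""
    a = np.asarray(crowd_correct, dtype=bool)
    b = np.asarray(expert_correct, dtype=bool)
    # 2x2 contingency table of (crowd right/wrong) x (expert right/wrong).
    table = [[int(np.sum(a & b)), int(np.sum(a & ~b))],
             [int(np.sum(~a & b)), int(np.sum(~a & ~b))]]
    return mcnemar(table, exact=True).pvalue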
Results and Discussion
We believe the discrete notion of truth is obsolete and should be replaced by something more flexible. For the purposes of semantic interpretation tasks for which crowdsourcing is appropriate, we propose our annotation-level metrics as a suitable replacement. In this case, the sentence-relation score gives a real-valued score that measures the degree to which a particular sentence expresses a particular relation between two terms. We believe the preliminary experiments demonstrate the approach is sound. Our primary results evaluate the sentence-relation score as a measure of the clarity with which a sentence expresses the relation. To this end, we define the following metrics:
– sentence weight: for a given positive/negative threshold τ, if srs(s) ≥ τ for sentence s then the sentence weight ws = srs(s), otherwise ws = 1 − srs(s).
– weighted precision: we collect true and false positives and negatives in the standard way based on the vetted reference standard, such that tp(s) = 1 iff s is a true positive, and 0 otherwise, and similarly for fp(s), tn(s), fn(s). Where normally p = tp/(tp + fp), the weighted precision is p′ = Σs ws tp(s) / Σs ws (tp(s) + fp(s)).
– weighted recall: where normally r = tp/(tp + fn), the weighted recall is r′ = Σs ws tp(s) / Σs ws (tp(s) + fn(s)).
– weighted f-measure: the harmonic mean of weighted precision and recall: F1′ = 2p′r′/(p′ + r′).
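A direct transcription of these definitions into code, assuming per-sentence weights and 0/1 indicators for true/false positives and negatives (a sketch, not the original evaluation script):

def sentence_weight(srs, threshold):
    """Clarity weight of a sentence relative to the positive/negative threshold."""
    return srs if srs >= threshold else 1.0 - srs

def weighted_prf(items):
    """items: list of (w, tp, fp, fn) tuples, with tp/fp/fn in {0, 1}."""
    wtp = sum(w * tp for w, tp, fp, fn in items)
    wfp = sum(w * fp for w, tp, fp, fn in items)
    wfn = sum(w * fn for w, tp, fp, fn in items)
    p = wtp / (wtp + wfp) if (wtp + wfp) else 0.0
    r = wtp / (wtp + wfn) if (wtp + wfn) else 0.0
    f1 = 2 * p * r / (p + r) if (p + r) else 0.0
    return p, r, f1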
If the srs metric is a true measure of clarity, then we would expect low clarity sentences to be more likely to be labeled incorrectly, and high clarity sentences less likely, and this should be revealed in an overall increase of the weighted scores over the unweighted ones. In Tab. 3, we show a comparison of five data sets. In the first two columns, the annotation quality of each data set is shown, comparing the F1 to the weighted F1′. The F1′ scores are higher in all cases, revealing that human annotators are indeed having trouble correctly annotating these sentences. The baseline scores are the least affected by the weighting, which also fits with our intuition, since the baseline does not use human judgment at all.
Table 3: Model evaluation results for each dataset. Classifier performance columns: F1, weighted F1′, precision P, weighted P′, recall R, weighted R′.
crowd:    0.642 0.687 0.565 0.632 0.743 0.754
          0.613 0.646 0.620 0.678 0.611 0.622
baseline: 0.575 0.606 0.436 0.474 0.845 0.842
          0.483 0.507 0.496 0.545 0.473 0.478
expert:   0.638 0.658 0.672 0.711 0.605 0.616
Fig. 4: Comparison of weighted to non-weighted F1 scores for the crowd-trained classifier at different thresholds.
Fig. 5: Density of cause sent.-rel. score over the expert data.
Fig. 6: Crowd & expert agreement per neg./pos. threshold for cause.
The next six columns in each row show classifier performance when trained with that dataset. The first pair of columns compares F1 to F1′, and for interest the final four columns show the precision and recall. In all cases the classifier F1′ is greater than F1, indicating that, as with humans, machines have trouble correctly interpreting sentences with a low srs. The only weighted metric that does not increase is the baseline recall; again, this is justified as the baseline does not actually require any interpretation.
In Fig. 4 we show how the classifier performs throughout the possible thresholds; the weighted scores are consistently higher.
We also analyzed the data to understand the overlap between the crowd scores and the experts. In Fig. 5 we compare the frequency of sentences with cause annotations at different sentence-relation scores (measured with kernel density estimation [20]) to the expert annotations of the same sentences. The result shows high agreement between the crowd and the expert: a low sentence-relation score is highly correlated with a negative expert decision, and a high score is highly correlated with a positive expert decision. In Fig. 6 we show the number of sentences in which the crowd agrees with the expert (on both positive and negative decisions), plotted against different positive/negative thresholds for the sentence-relation score of cause. The maximum agreement with the expert set is at the 0.7 threshold, the same as for the annotation quality F1 score (Fig. 2), with 755 sentences in common between crowd and expert. The remaining 147 sentences were manually evaluated to build the test partition.
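The agreement analysis behind Fig. 6 can be sketched as follows, assuming an array of cause sentence-relation scores and the corresponding binary expert decisions (the variable names are illustrative):

import numpy as np

def agreement_per_threshold(srs_scores, expert_labels, thresholds):
    """Count sentences where the thresholded crowd score matches the expert label."""
    scores = np.asarray(srs_scores)
    labels = np.asarray(expert_labels)
    return {round(float(t), 2): int(np.sum((scores >= t).astype(int) == labels))
            for t in thresholds}

# e.g. agreement_per_threshold(scores, labels, np.arange(0.1, 1.0, 0.1))
# peaks at 0.7 on our data, with 755 sentences in agreement.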
Conclusion
A widespread use of linked data for information extraction is distant supervision, in which relation tuples from a data source are found in sentences in a text corpus, and those sentences are treated as training data for relation extraction systems. Distant supervision is a cheap way to acquire training data, but that data can be quite noisy, which limits the performance of a system trained with it. Human annotators can be used to clean the data, but in some domains, such as medical NLP, it is widely believed that only medical experts can do this reliably. Current methods for collecting this human annotation attempt to minimize disagreement between annotators, but end up failing to capture the ambiguity inherent in language. We believe this is a vestige of an antiquated notion of truth as a discrete property, and have developed a powerful new method for representing truth.
In this paper we have presented results showing that using a larger number of workers per example (up to 15) can form a more accurate model of truth at the sentence level, and significantly improve the quality of the annotations. It also benefits systems that use this annotated data, such as machine learning systems, significantly improving their performance thanks to the higher quality data. Our primary result is to show that our scoring metric for sentence quality in relation extraction supports our hypothesis that higher quality sentences are easier to classify for crowd workers, experts, and machines alike, and that our model of truth allows us to more faithfully capture the ambiguity that is inherent in language and human interpretation.
Acknowledgments
The authors would like to thank Chang Wang for support with using the medical relation extraction classifier, and Anthony Levas for help with collecting the expert annotations.
References
1. Aronson, A.R.: Effective mapping of biomedical text to the UMLS Metathesaurus: the MetaMap program. In: Proceedings of the AMIA Symposium, p. 17. American Medical Informatics Association (2001)
2. Aroyo, L., Welty, C.: Crowd Truth: harnessing disagreement in crowdsourcing a relation extraction gold standard. In: Web Science 2013. ACM (2013)
3. Aroyo, L., Welty, C.: The Three Sides of CrowdTruth. Journal of Human Computation 1, 31–34 (2014)
4. Aroyo, L., Welty, C.: Truth is a lie: Crowd truth and the seven myths of human annotation. AI Magazine 36(1), 15–24 (2015)
5. Bodenreider, O.: The Unified Medical Language System (UMLS): integrating biomedical terminology. Nucleic Acids Research 32(suppl 1), D267–D270 (2004)
6. Chapman, W.W., Nadkarni, P.M., Hirschman, L., D'Avolio, L.W., Savova, G.K., Uzuner, O.: Overcoming barriers to NLP for clinical text: the role of shared tasks and the need for additional creative solutions. Journal of the American Medical Informatics Association 18(5), 540–543 (2011)
7. Cheatham, M., Hitzler, P.: Conference v2.0: An uncertain version of the OAEI Conference benchmark. In: The Semantic Web–ISWC 2014, pp. 33–48. Springer (2014)
8. Chen, D.L., Dolan, W.B.: Building a persistent workforce on Mechanical Turk for multilingual data collection. In: Proceedings of The 3rd Human Computation Workshop (HCOMP 2011) (2011)
9. Chilton, L.B., Little, G., Edge, D., Weld, D.S., Landay, J.A.: Cascade: crowdsourcing taxonomy creation. In: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pp. 1999–2008. CHI '13, ACM, New York, NY, USA (2013)
10. Dumitrache, A., Aroyo, L., Welty, C.: Achieving expert-level annotation quality with CrowdTruth: the case of medical relation extraction. In: Proceedings of the 2015 International Workshop on Biomedical Data Mining, Modeling, and Semantic Integration (BDM2I-2015), 14th International Semantic Web Conference (2015)
11. Finin, T., Murnane, W., Karandikar, A., Keller, N., Martineau, J., Dredze, M.: Annotating named entities in Twitter data with crowdsourcing. In: Proc. NAACL HLT, pp. 80–88. CSLDAMT '10, Association for Computational Linguistics (2010)
12. Hovy, D., Plank, B., Søgaard, A.: Experiments with crowdsourced re-annotation of a POS tagging data set. In: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pp. 377–382. Association for Computational Linguistics, Baltimore, Maryland (June 2014)
13. Knowlton, J.Q.: On the definition of "picture". AV Communication Review 14(2)
14. Kondreddi, S.K., Triantafillou, P., Weikum, G.: Combining information extraction and human computing for crowdsourced knowledge acquisition. In: 30th International Conference on Data Engineering, pp. 988–999. IEEE (2014)
15. Lee, J., Cho, H., Park, J.W., Cha, Y.r., Hwang, S.w., Nie, Z., Wen, J.R.: Hybrid entity clustering using crowds and data. The VLDB Journal 22(5), 711–726 (2013)
16. McNemar, Q.: Note on the sampling error of the difference between correlated proportions or percentages. Psychometrika 12(2), 153–157 (1947)
17. Mintz, M., Bills, S., Snow, R., Jurafsky, D.: Distant supervision for relation extraction without labeled data. In: Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 2, pp. 1003–1011. Association for Computational Linguistics (2009)
18. Ogden, C.K., Richards, I.: The meaning of meaning. Trubner & Co, London (1923)
19. Plank, B., Hovy, D., Søgaard, A.: Linguistically debatable or just plain wrong? In: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pp. 507–511. Association for Computational Linguistics, Baltimore, Maryland (June 2014)
20. Silverman, B.W.: Density estimation for statistics and data analysis, vol. 26. CRC Press (1986)
21. Wang, C., Fan, J.: Medical relation extraction with manifold models. In: 52nd Annual Meeting of the ACL, vol. 1, pp. 828–838. Association for Computational Linguistics (2014)