Crowd-based MT Evaluation for non-English Target Languages
Michael Paul and Eiichiro Sumita
NICT
Hikaridai 3-5
619-0289 Kyoto, Japan
Luisa Bentivogli and Marcello Federico
FBK-irst
Via Sommarive, 18
38123 Povo-Trento, Italy
.@nict.go.jp
{bentivo,federico}@fbk.eu
Abstract
This paper investigates the feasibility of using crowd-sourcing services for the human assessment of machine translation quality of translations into non-English target languages. Non-expert graders are hired through the CrowdFlower interface to Amazon’s Mechanical Turk in order to carry out a ranking-based MT evaluation of utterances taken from the travel conversation domain for 10 Indo-European and
Asian languages. The collected human assessments are analyzed with respect to worker characteristics, evaluation costs, and the quality of the evaluations in terms of the agreement between non-expert graders and expert/oracle judgments. Moreover, data quality control mechanisms including “locale qualification”, “qualification testing”, and “on-the-fly verification” are investigated in order to increase the reliability of the crowd-based evaluation results.
1 Introduction
This paper focuses on the evaluation of machine translation (MT) quality for target languages other than English. Although human evaluation of MT output provides the most direct and reliable assessment, it is time-consuming, costly, and subjective. Various automatic evaluation measures have been proposed to make the evaluation of MT outputs cheaper and faster (Przybocki et al., 2008), but automatic metrics have not yet proved able to consistently predict the usefulness of MT technologies. To counter the high costs of human assessment of MT outputs, the usage of crowdsourcing services such as Amazon’s Mechanical Turk
(MTurk) and CrowdFlower (CF) was proposed recently (Callison-Burch, 2009; Callison-Burch et al., 2010; Denkowski and