On the Evaluation of Machine Translation n-best Lists

Abstract

The standard machine translation evaluation framework measures the single-best output of machine translation systems. There are, however, many situations where n-best lists are needed, yet there is no established way of evaluating them. This paper establishes a framework for n-best evaluation by outlining three questions one might ask when defining a 'good' n-best list and proposing an evaluation measure for each. The first and principal contribution is a measure that characterizes the translation quality of an entire n-best list by asking whether many of the valid translations are placed near the top of the list. The second uses gold translations with preference annotations to ask to what degree systems can produce ranked lists in preference order. The third rewards partial matches, evaluating how close the items in an n-best list come to a set of many valid references. Together, these three perspectives make clear that access to many references is particularly useful when n-best evaluation is the goal.
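To make the three questions concrete, the sketch below gives simplified, illustrative stand-ins for each kind of measure. These are assumptions for exposition only, not the measures defined in the paper: the function names, the exact-match and word-overlap scoring, and the Kendall-tau-style agreement are all hypothetical choices.

```python
# Illustrative sketches only; simplified stand-ins, not the paper's measures.
# Assumes each n-best list is a ranked list of hypothesis strings and that a
# set of valid reference translations is available for the source sentence.

def valid_near_top(nbest, valid_refs, k=5):
    """Fraction of the top-k hypotheses that exactly match some valid reference
    (a crude proxy for 'are many valid translations placed near the top?')."""
    top = nbest[:k]
    return sum(hyp in valid_refs for hyp in top) / max(len(top), 1)

def preference_order_agreement(nbest, preference_ranked_refs):
    """Kendall-tau-style agreement between the order in which preference-annotated
    references appear in the n-best list and their gold preference order."""
    positions = [nbest.index(ref) for ref in preference_ranked_refs if ref in nbest]
    if len(positions) < 2:
        return 0.0
    concordant = discordant = 0
    for i in range(len(positions)):
        for j in range(i + 1, len(positions)):
            if positions[i] < positions[j]:
                concordant += 1
            else:
                discordant += 1
    return (concordant - discordant) / (concordant + discordant)

def partial_match_score(nbest, valid_refs):
    """Average over the n-best list of each hypothesis's best word-overlap F1
    against any reference, rewarding near-misses rather than exact matches."""
    def f1(hyp, ref):
        hyp_tokens, ref_tokens = hyp.split(), ref.split()
        overlap = len(set(hyp_tokens) & set(ref_tokens))
        if overlap == 0:
            return 0.0
        p, r = overlap / len(hyp_tokens), overlap / len(ref_tokens)
        return 2 * p * r / (p + r)
    if not nbest:
        return 0.0
    return sum(max(f1(hyp, ref) for ref in valid_refs) for hyp in nbest) / len(nbest)
```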

Publication
Proceedings of the First Workshop on Evaluation and Comparison of NLP Systems, Association for Computational Linguistics
Jacob Bremerman