SemEval-2022 Task 7: Identifying Plausible Clarifications of Implicit and Underspecified Phrases in Instructional Texts
The goal of this shared task is to evaluate the ability of NLP systems to distinguish between plausible and implausible clarifications of an instruction. Such clarifications can be critical to ensure that instructions state clearly what steps must be followed to achieve a specific goal. We set up this task as a cloze task, in which clarifications are presented as possible fillers and systems have to score how plausibly each filler fits in a given context.
Cloze tasks have become a standard framework for evaluating various discourse-level phenomena in NLP. Some prominent examples include the narrative cloze test (Chambers and Jurafsky, 2008), the story cloze test (Mostafazadeh et al., 2016), and the LAMBADA word prediction task (Paperno et al., 2016). In these tasks, NLP systems are required to predict the cloze filler that is most likely to continue the discourse. However, it is not always clear whether exactly one likely filler exists or how plausible different fillers would be.
This task revolves around judging the plausibility of human-inserted and machine-generated fillers in naturally occurring contexts. Specifically, the contexts are instructional texts on everyday scenarios in which clarifications may have been necessary to eliminate possible misunderstandings. Clarifications were identified using revision histories in which it is possible to observe disambiguations of various semantic and pragmatic phenomena, including implicit, underspecified, and metonymic references, as well as implicit discourse relations and implicit quantifying modifiers.
There is no formal registration for the task yet. Anyone interested can join our Google group here.
The basis of our task is wikiHowToImprove (Anthonio et al., 2020), a collection of revisions of instructional texts from the how-to website wikiHow. For this task, we extract revisions that we believe are likely to represent specific instances of clarifications. As such, each revision in this task represents an opportunity to disambiguate between multiple possible meanings. To assess the plausibility of different clarification options, we automatically generate alternatives and ask annotators to rate for each clarification option whether it "makes sense in the given how-to guide" (on a scale from 1 to 5).
The data created with the described approach is now available at the following link. Examples from the trial data are shown in the panels below.
For a given how-to guide (panel), systems participating in the task have to predict the plausibility (1-5) of each filler (listed in each panel's footer).
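To illustrate the expected input/output format, the sketch below scores a set of candidate fillers for a context containing a cloze blank. The function names (`insert_filler`, `score_filler`), the blank marker, and the word-overlap heuristic are assumptions for demonstration only; a real participating system would score fillers with a pretrained language model rather than this toy heuristic.

```python
def insert_filler(context: str, filler: str, blank: str = "______") -> str:
    """Replace the cloze blank in the context with a candidate filler."""
    return context.replace(blank, filler, 1)

def score_filler(context: str, filler: str) -> float:
    """Toy plausibility score on the task's 1-5 scale: the fraction of
    filler words that already appear in the surrounding context,
    rescaled to [1, 5]. Purely illustrative, not a serious baseline."""
    context_words = set(insert_filler(context, "").lower().split())
    filler_words = filler.lower().split()
    if not filler_words:
        return 1.0
    overlap = sum(w in context_words for w in filler_words) / len(filler_words)
    return 1.0 + 4.0 * overlap

# Hypothetical example in the style of a wikiHow instruction:
context = "Remove the cake from the oven and let ______ cool before frosting."
fillers = ["it", "the cake", "the oven mitts"]
scores = {f: score_filler(context, f) for f in fillers}
```

A system's output is simply one such score per filler; note that even a clearly plausible filler like "it" gets a low score under this heuristic, which is exactly the kind of gap a real model would need to close.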