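1. Work & Conclusions
The paper runs controlled studies on models from the DROP leaderboard.
The results show that the models do not necessarily arrive at correct answers through reasoning, so standard metrics cannot measure progress on numerical reasoning tasks.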
2. Dataset and Models
DROP
Only numerical reasoning questions are kept: 5,850 questions of the number type and 998 of the comparison type.
models & difference
| Method | Encoder | Notes |
|---|---|---|
| NAQANet | GloVe embeddings | |
| MTMSN | BERT-uncased | |
| NeRd | BERT-uncased | Solves the problem by generating a program from a domain-specific language |
| Gen-BERT | BERT-uncased | Augments language model pre-training with two more stages: pre-training with numerical data and pre-training with numeric textual data |
| TASE | RoBERTa_LARGE | |
| NumNet+ | RoBERTa_LARGE | |
3. Study
Evaluating question understanding
Question permutation experiment
The word order of each question is shuffled at the n-gram level.
Conclusion: this experiment indicates that the models are not sensitive to word order, which can potentially impact their utility.
original How many years did it take for the population to decrease to about 1100 from 10000?
3-gram 10000 about 1100 from did it take to decrease to How many years for the population?
2-gram population to it take How many about 1100 from 10000 for the years did decrease to?
1-gram 10000 for from years to decrease it population about 1100 did many to take How the?
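The permutation above can be sketched as chunking the question into consecutive n-grams and shuffling the chunks. This is only a minimal illustration: the function name `shuffle_ngrams`, the whitespace tokenization, and the handling of the trailing question mark are assumptions, not the paper's actual implementation.

```python
import random


def shuffle_ngrams(question, n, seed=0):
    """Shuffle a question at n-gram granularity (hypothetical helper).

    Words inside each chunk keep their order; only the chunks move.
    """
    stripped = question.rstrip()
    has_qmark = stripped.endswith("?")
    # Assumption: simple whitespace tokenization, question mark set aside.
    words = stripped.rstrip("?").split()
    # Consecutive, non-overlapping chunks of n words (last one may be shorter).
    chunks = [words[i:i + n] for i in range(0, len(words), n)]
    rng = random.Random(seed)
    rng.shuffle(chunks)
    shuffled = " ".join(word for chunk in chunks for word in chunk)
    return shuffled + "?" if has_qmark else shuffled


if __name__ == "__main__":
    q = ("How many years did it take for the population "
         "to decrease to about 1100 from 10000?")
    for n in (3, 2, 1):
        print(f"{n}-gram", shuffle_ngrams(q, n, seed=n))
```

With `n=1` this reduces to a plain word shuffle; larger `n` preserves more local word order, which matches the 3-gram, 2-gram, and 1-gram settings listed above.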

Affinity to the class of questions
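In the DROP dataset, the question type can be determined from the first few words of the question, so each question is truncated to its first few words.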
2 Words How many?
3 Words How many years?
5 Words How many years did it?
Conclusions:
- The results further show that the models can answer questions by exploiting the mere presence of a few words in the question.
- An answer is more likely to be correct if the space of possible arithmetic expressions is smaller.
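The truncated questions used in this experiment can be reproduced with a few lines of Python. The helper name `truncate_question` and the whitespace tokenization are assumptions made for illustration; the paper's own preprocessing may differ.

```python
def truncate_question(question, k):
    """Keep only the first k words of a question (hypothetical helper)."""
    # Assumption: whitespace tokenization, with the question mark re-appended.
    words = question.rstrip().rstrip("?").split()
    return " ".join(words[:k]) + "?"


if __name__ == "__main__":
    q = ("How many years did it take for the population "
         "to decrease to about 1100 from 10000?")
    for k in (2, 3, 5):
        print(f"{k} Words", truncate_question(q, k))
```

Feeding these truncated questions to a model together with the original passage tests whether the first few words alone are enough to produce the correct answer.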

Evaluating passage comprehension
Random Passage: We pair each question with a randomly assigned passage from numset.
Dummy Passage: We create an uninformative passage that contains no numbers, as a proxy for a blank passage, which the models are unable to process. It is the sequence: ‘This is a sentence.’
Fixed Passage: We pair all questions with the same unseen passage from the hidden test set. This passage has similar properties to the passages in train and dev, but is irrelevant to the corresponding question.
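The three passage settings can be sketched as simple transformations over a list of question/passage pairs. The dict layout (`"question"`, `"passage"` keys) and the function names are assumptions for illustration, and drawing the random passage from the other examples in the same list is one possible reading of "randomly assigned".

```python
import random

# The dummy passage quoted in the notes above.
DUMMY_PASSAGE = "This is a sentence."


def with_random_passage(examples, seed=0):
    """Pair each question with a passage drawn from a different example."""
    rng = random.Random(seed)
    passages = [ex["passage"] for ex in examples]
    result = []
    for ex in examples:
        # Assumption: exclude the question's own passage when possible.
        candidates = [p for p in passages if p != ex["passage"]] or passages
        result.append({**ex, "passage": rng.choice(candidates)})
    return result


def with_fixed_passage(examples, passage):
    """Pair every question with the same passage (also covers the dummy case)."""
    return [{**ex, "passage": passage} for ex in examples]


if __name__ == "__main__":
    # Toy data, invented for this sketch only.
    data = [
        {"question": "How many yards was the longest field goal?", "passage": "Passage A."},
        {"question": "How many years did the war last?", "passage": "Passage B."},
        {"question": "How many more people lived there in 1990?", "passage": "Passage C."},
    ]
    print(with_random_passage(data, seed=1))
    print(with_fixed_passage(data, DUMMY_PASSAGE))
```

If a model's score stays well above zero under any of these settings, it is answering without using the relevant passage.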

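Conclusion
This paper essentially sets up controlled quantitative experiments to verify whether models answer correctly because they truly understand the passage or the question. The results show that models do not necessarily reach correct answers through reasoning, and some can answer correctly without any contextual information at all. Most leaderboard models therefore only score close to human level: they do not truly understand the text and cannot derive answers through discrete reasoning.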
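Summary: the study analyzes DROP leaderboard models through controlled experiments covering question word-order permutation, question-type identification, and passage comprehension. It finds that models may use only part of the input, or no context at all, to answer, which points to limits in what they actually understand.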
