Numerical reasoning in machine reading comprehension tasks: are we there yet?

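This study runs controlled experiments on models from the DROP leaderboard and finds that the models do not always rely on reasoning to reach the correct answer, so standard metrics cannot accurately measure progress on numerical reasoning tasks. The experiments cover question word-order permutation, question-type identification, and passage comprehension. They show that models may use only part of the available information, or answer without the context at all, which points to limits in what the models actually understand.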

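1. Work & Conclusions

A controlled study of the models on the DROP leaderboard.

The results show that the models do not necessarily reach correct answers through reasoning. Standard metrics therefore cannot measure progress on numerical reasoning tasks.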

2. Dataset and Models

DROP

Only the numerical reasoning questions are kept: 5,850 questions of type number and 998 questions of type comparison.

models & difference

model    | methods | encoder
NAQANet  | - | based on GloVe embeddings
MTMSN    | - | BERT-uncased
NeRd     | solves the problem by generating a program from a domain-specific language | BERT-uncased
Gen-BERT | augments the language model pre-training procedure by adding two more stages, pre-training with numerical data and pre-training with numeric textual data | BERT-uncased
TASE     | - | RoBERTa_LARGE
NumNet+  | - | RoBERTa_LARGE

3. Study

Evaluating question understanding

Question permutation experiment

The word order of each question is changed by shuffling it in n-gram chunks.

Conclusion: the models are not sensitive to word order, which can potentially impact their utility.

original: How many years did it take for the population to decrease to about 1100 from 10000?

3-gram: 10000 about 1100 from did it take to decrease to How many years for the population?

2-gram: population to it take How many about 1100 from 10000 for the years did decrease to?

1-gram: 10000 for from years to decrease it population about 1100 did many to take How the?
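The permutation above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function name `permute_question`, the `seed` parameter, and the choice to keep the final punctuation mark in place are all assumptions made here, inferred from the examples.

```python
import random


def permute_question(question, n, seed=0):
    """Shuffle a question in chunks of n consecutive words (hypothetical helper)."""
    # Set the final punctuation mark aside so it stays at the end,
    # as in the examples above (an assumption about the procedure).
    body, end = question, ""
    if question and question[-1] in "?.!":
        body, end = question[:-1], question[-1]
    words = body.split()
    # Consecutive chunks of n words; the last chunk may be shorter.
    chunks = [words[i:i + n] for i in range(0, len(words), n)]
    rng = random.Random(seed)
    rng.shuffle(chunks)
    return " ".join(w for chunk in chunks for w in chunk) + end


question = ("How many years did it take for the population "
            "to decrease to about 1100 from 10000?")
for n in (3, 2, 1):
    print(f"{n}-gram:", permute_question(question, n))
```

Every permuted question contains exactly the same words as the original, so any drop (or lack of drop) in accuracy can be attributed to word order alone.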

Affinity to the class of questions

In DROP, the type of a question can be determined from its first few words.

2 words: How many?

3 words: How many years?

5 words: How many years did it?
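The truncated questions above can be reproduced with a short helper. This is a sketch under assumptions: the name `truncate_question` is made up here, and re-attaching the question mark after truncation is inferred from the examples rather than stated in the notes.

```python
def truncate_question(question, k):
    """Keep only the first k words of a question (hypothetical helper)."""
    # Drop the trailing question mark before splitting into words.
    words = question.rstrip("?").split()
    # Re-attach the question mark, as in the examples above.
    return " ".join(words[:k]) + "?"


question = ("How many years did it take for the population "
            "to decrease to about 1100 from 10000?")
for k in (2, 3, 5):
    print(f"{k} words:", truncate_question(question, k))
```

Feeding these truncated questions to a model, together with the original passage, tests how much of its accuracy survives when only the question type is still visible.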

Conclusions:

  • The models can answer questions by exploiting the mere presence of a few words in the question.

  • An answer is more likely to be correct when the space of possible arithmetic expressions is smaller.

Evaluating passage comprehension

Random Passage: We pair each question with a randomly assigned passage from numset.

Dummy Passage: We create an uninformative passage that contains no numbers. This is a proxy for a blank passage, because the models cannot process an empty one. It is the sequence: 'This is a sentence.'

Fixed Passage: We pair all questions with the same unseen passage from the hidden test set. This passage has similar properties to the passages in train and dev, but is irrelevant to the corresponding question.
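The three passage settings can be sketched as one function. This is an illustration only: the dict format with `question` and `passage` keys, the function name `perturb_passages`, and the decision to exclude a question's own passage from the random draw are assumptions, not details taken from the paper.

```python
import random

# The dummy passage quoted in the notes above.
DUMMY_PASSAGE = "This is a sentence."


def perturb_passages(examples, mode, fixed_passage=None, seed=0):
    """Return copies of the examples with their passages replaced.

    examples: list of dicts with 'question' and 'passage' keys (assumed format).
    mode: 'random', 'dummy', or 'fixed'.
    """
    rng = random.Random(seed)
    passages = [ex["passage"] for ex in examples]
    perturbed = []
    for ex in examples:
        if mode == "random":
            # Draw from the other passages; fall back to all passages
            # if every passage is identical (assumption of this sketch).
            others = [p for p in passages if p != ex["passage"]] or passages
            new_passage = rng.choice(others)
        elif mode == "dummy":
            new_passage = DUMMY_PASSAGE
        elif mode == "fixed":
            if fixed_passage is None:
                raise ValueError("mode='fixed' requires fixed_passage")
            new_passage = fixed_passage
        else:
            raise ValueError(f"unknown mode: {mode}")
        # Copy the example so the original data is left untouched.
        perturbed.append({**ex, "passage": new_passage})
    return perturbed
```

In all three settings the passage no longer contains the information needed to answer the question, so any answer that is still correct cannot have come from reading the passage.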

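Conclusion

This paper essentially sets up quantitative experiments to verify whether models answer correctly because they truly understand the passage or the question. The results show that the models do not necessarily reach correct answers through reasoning, and some models can get the right answer without any context at all. So most leaderboard models merely have scores close to human level; they do not truly understand the text and cannot derive answers through discrete reasoning.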
