(Done) EMNLP 2020
-
AmbigQA: Answering Ambiguous Open-domain Questions studies question ambiguity in open-domain QA: it finds the multiple answers an ambiguous question admits, and rewrites the question for each answer so that the rewritten question is unambiguous.
Proposes the corresponding task and dataset, along with baselines.
-
【Sentence Representation Learning】An Unsupervised Sentence Embedding Method by Mutual Information Maximization
A sentence is encoded by BERT, and its output token embeddings are fed through several one-dimensional CNNs with different kernel sizes to produce multiple n-gram features. Each n-gram feature is treated as a local representation, and the average-pooled local representations form the global representation. Finally, a mutual-information-based loss is used to learn the sentence embedding: it maximizes the average mutual information between a sentence's global representation (the sentence vector) and its local representations, since a good global sentence vector should have high MI with its own local representations and low MI with the local representations of other sentences.
This objective is similar to contrastive learning: it encourages the encoder to capture local sentence features better and to better separate the representations of different sentences.
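A minimal sketch of this local/global mutual-information setup (the kernel sizes, channel count, and the Jensen-Shannon-style MI estimator below are illustrative assumptions, not the paper's released configuration):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import AutoModel

class MISentenceEncoder(nn.Module):
    def __init__(self, model_name="bert-base-uncased", kernel_sizes=(1, 3, 5), channels=256):
        super().__init__()
        self.bert = AutoModel.from_pretrained(model_name)
        hidden = self.bert.config.hidden_size
        # One 1-D CNN per kernel size; each output position is an n-gram ("local") feature.
        self.convs = nn.ModuleList(
            [nn.Conv1d(hidden, channels, k, padding=k // 2) for k in kernel_sizes]
        )

    def forward(self, input_ids, attention_mask):
        tok = self.bert(input_ids, attention_mask=attention_mask).last_hidden_state  # (B, T, H)
        x = tok.transpose(1, 2)                                              # (B, H, T)
        local = torch.cat([F.relu(conv(x)) for conv in self.convs], dim=1)   # (B, D, T)
        mask = attention_mask.unsqueeze(1).float()                           # (B, 1, T)
        # Mean-pool local features over valid tokens -> global sentence representation.
        glob = (local * mask).sum(-1) / mask.sum(-1).clamp(min=1.0)          # (B, D)
        return local, glob, mask

def jsd_mi_loss(local, glob, mask):
    """JSD-style MI objective: a sentence's global vector should score high
    against its own local features (positives) and low against the local
    features of other sentences in the batch (negatives)."""
    scores = torch.einsum("bd,cdt->bct", glob, local)          # (B, B, T) pairwise scores
    m = mask.squeeze(1)                                        # (B, T)
    pos = torch.diagonal(scores, dim1=0, dim2=1).t()           # (B, T) own-sentence scores
    loss_pos = (F.softplus(-pos) * m).sum() / m.sum()
    B = scores.size(0)
    off_diag = (1.0 - torch.eye(B, device=scores.device)).unsqueeze(-1)
    neg_mask = off_diag * mask                                 # other sentences' tokens only
    loss_neg = (F.softplus(scores) * neg_mask).sum() / neg_mask.sum()
    return loss_pos + loss_neg
```

During training the loss is minimized over mini-batches; at inference time `glob` serves as the sentence embedding.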
-
【Model Interpretability】Analyzing Individual Neurons in Pre-trained Language Models
Models analyzed: ELMo, T-ELMo, BERT, XLNet
Analysis approach: probing at the level of individual neurons rather than layers.
Some findings: 1) It is possible to extract a very small subset of neurons that predicts a linguistic task with the same accuracy as the entire network; low-level tasks such as predicting morphology require fewer neurons than high-level tasks such as predicting syntax. 2) ELMo generally needed fewer neurons, while T-ELMo required more neurons than the other models to reach oracle performance, suggesting that knowledge of lexical semantics in T-ELMo is distributed across more neurons. 3) Lexical tasks such as morphology (POS tagging) and word semantics (SEM tagging) are dominantly captured by neurons at the lower layers, whereas the more complicated task of modeling syntax (CCG supertagging) is handled at the final layer; the exception is BERT, whose top neurons spread across all layers, unlike the other models where the top neurons for a given task come from fewer layers. 4) BERT is the most distributed model with respect to all properties, while XLNet is the most focused, with largely disjoint sets of neurons and layers designated for different linguistic properties. 5) Some phenomena (e.g. verbs) are distributed across many neurons, while others (e.g. interjections) are localized in fewer neurons.
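A rough illustration of neuron-level probing (the generic elastic-net-probe recipe; the hyperparameters and data handling here are assumptions, not the authors' exact pipeline):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def rank_neurons(X_train, y_train):
    """X_train: (n_tokens, n_neurons) activations collected from the model;
    y_train: token-level labels for a linguistic task (e.g. POS tags)."""
    probe = LogisticRegression(penalty="elasticnet", solver="saga",
                               l1_ratio=0.5, C=0.1, max_iter=2000)
    probe.fit(X_train, y_train)
    importance = np.abs(probe.coef_).sum(axis=0)   # aggregate |weight| over classes
    return np.argsort(importance)[::-1]            # neuron indices, most important first

def top_k_accuracy(X_train, y_train, X_test, y_test, ranking, k):
    """Retrain a probe on only the k highest-ranked neurons."""
    idx = ranking[:k]
    probe = LogisticRegression(max_iter=2000).fit(X_train[:, idx], y_train)
    return probe.score(X_test[:, idx], y_test)
```

Comparing `top_k_accuracy` for growing k against the full-network probe is what supports claims like finding 1) above.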
-
Analyzing Redundancy in Pretrained Transformer Models
Conclusions: 1) Adjacent layers are highly redundant, except for the last two. 2) 85% of the neurons can be removed without any loss of performance; most redundant neurons come from the same layer or neighboring layers. 3) Layer-level redundancy is high for sequence labeling tasks and substantially lower for sequence classification tasks; all sequence classification tasks are learned at higher layers, and none of the lower layers were found to be redundant for them. 4) XLNet is more redundant than BERT. 5) Complex core language tasks require more neurons, and there is less task-specific redundancy for core linguistic tasks than for higher-level tasks.
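A tiny sketch of the correlation-based view of neuron redundancy behind these numbers (my own simplification with an assumed threshold, not the paper's exact clustering procedure):

```python
import numpy as np

def non_redundant_neurons(acts, threshold=0.9):
    """acts: (n_tokens, n_neurons) activations over a corpus. Greedily keeps
    one representative per group of highly correlated (redundant) neurons."""
    corr = np.corrcoef(acts, rowvar=False)         # (n_neurons, n_neurons)
    keep = []
    for i in range(corr.shape[0]):
        # Drop neuron i if it correlates strongly with an already-kept neuron.
        if all(abs(corr[i, j]) < threshold for j in keep):
            keep.append(i)
    return np.array(keep)
```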
-
Context-Aware Answer Extraction in Question Answering. (not found)
-
Dense Passage Retrieval for Open-Domain Question Answering.
-
Don’t Read Too Much Into It: Adaptive Computation for Open-Domain Question Answering. (not found)
-
Masking as an Efficient Alternative to Finetuning for Pretrained Language Models.
Method: when transferring the pretrained model to a downstream task, instead of fine-tuning all parameters, select a subset of weights and mask out the rest; this achieves roughly the same performance as fine-tuning. Results: 1) intrinsic evaluations show that masked language models extract valid representations for downstream tasks; 2) the representations are generalizable; 3) the minima obtained by fine-tuning and by masking can be easily connected by a line segment.
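A minimal sketch of learning a binary mask over frozen pretrained weights with a straight-through estimator (the initialization, threshold, and choice of layers to mask are assumptions, not the paper's exact setup):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedLinear(nn.Module):
    """Wraps a pretrained nn.Linear: the weights stay frozen and only the
    real-valued mask scores are trained."""
    def __init__(self, linear: nn.Linear, init_score=0.01):
        super().__init__()
        self.weight = nn.Parameter(linear.weight.detach().clone(), requires_grad=False)
        self.bias = nn.Parameter(linear.bias.detach().clone(), requires_grad=False)
        self.scores = nn.Parameter(torch.full_like(self.weight, init_score))

    def forward(self, x):
        hard = (self.scores > 0).float()
        # Straight-through estimator: the forward pass uses the hard 0/1 mask,
        # while gradients flow to the continuous scores.
        mask = hard + self.scores - self.scores.detach()
        return F.linear(x, self.weight * mask, self.bias)
```

In use, one would replace (some of) the model's linear layers with `MaskedLinear` and optimize only the `scores` parameters on the downstream task.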
-
Modularized Transformer-based Ranking Framework. (not found)
-
Multi-Step Inference for Reasoning Over Paragraphs
Proposes a reasoning model consisting of three modules: select, chain, and predict.
-
MultiCQA: Zero-Shot Transfer of Self-Supervised Text Matching Models on a Massive Scale.
Conclusion: neither domain similarity nor training data size is suitable for predicting the best models.
-
Multilevel Text Alignment with Cross-Document Attention
Adds cross-attention to a two-tower model with hierarchical attention (see the sketch after the next line).
Proposes a benchmark for document-to-document and sentence-to-document alignment.
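A rough sketch of what adding cross-attention between two hierarchical towers can look like (shapes, pooling, and the scoring head are assumptions; this is not the paper's exact architecture):

```python
import torch
import torch.nn as nn

class CrossDocMatcher(nn.Module):
    def __init__(self, dim=256, heads=4):
        super().__init__()
        # Shared sentence-level encoder for each tower (the hierarchical step).
        self.sent_rnn = nn.GRU(dim, dim // 2, batch_first=True, bidirectional=True)
        self.cross = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.score = nn.Linear(2 * dim, 1)

    def encode(self, sent_embs):
        # sent_embs: (B, n_sents, dim) pre-computed sentence embeddings of one document.
        out, _ = self.sent_rnn(sent_embs)
        return out

    def forward(self, doc_a, doc_b):
        a, b = self.encode(doc_a), self.encode(doc_b)
        # Cross-document attention: each document's sentences attend over the other's.
        a_ctx, _ = self.cross(a, b, b)
        b_ctx, _ = self.cross(b, a, a)
        vec_a = torch.cat([a, a_ctx], dim=-1).mean(dim=1)   # (B, 2*dim)
        vec_b = torch.cat([b, b_ctx], dim=-1).mean(dim=1)
        return self.score(vec_a * vec_b).squeeze(-1)        # doc-doc alignment score
```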
-
On the Sentence Embeddings from BERT for Semantic Textual Similarity. (not found)
-
Probing Pretrained Language Models for Lexical Semantics (not found)
-
Recall and Learn: Fine-tuning Deep Pretrained Language Models with Less Forgetting
During fine-tuning it applies a multi-task learning idea and jointly learns the pretraining task; a Pretraining Simulation mechanism works around the pretraining task/data being unavailable, and during multi-task learning an Objective Shifting mechanism gradually shifts the emphasis toward the downstream task.
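A minimal sketch of the Objective Shifting idea: anneal a mixing coefficient so the joint objective moves from the recalled (pretraining-simulation) loss toward the downstream loss (the sigmoid schedule and its hyperparameters here are my assumptions):

```python
import math

def shifting_lambda(step, t0, k=0.05):
    """Grows from ~0 to ~1 as training passes step t0."""
    return 1.0 / (1.0 + math.exp(-k * (step - t0)))

def joint_loss(task_loss, recall_loss, step, total_steps):
    lam = shifting_lambda(step, t0=total_steps / 2)
    # Early in training the recall (pretraining-simulation) term dominates;
    # later the downstream-task term takes over.
    return lam * task_loss + (1.0 - lam) * recall_loss
```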
-
Ad-hoc Document Retrieval using Weak-Supervision with BERT and GPT2. (not found)
-
Towards Better Context-aware Lexical Semantics: Adjusting Contextualized Representations through Static Anchors (not found)
-
FastFormers: Highly Efficient Transformer Models for Natural Language Understanding
-
A Little Bit Is Worse Than None: Ranking with Limited Training Data
-
Early Exiting BERT for Efficient Document Ranking
-
Don’t Read Too Much Into It: Adaptive Computation for Open-Domain Question Answering
-
Exploring the Boundaries of Low-Resource BERT Distillation
-