Personal Musings: EMNLP 2020

Posted by cyq on November 29, 2020

(Done) EMNLP 2020

  • AmbigQA: Answering Ambiguous Open-domain Questions

    Studies ambiguity in open-domain QA: find the multiple answers an ambiguous question admits, then rewrite the question for each answer so that it is disambiguated.

    Proposes the corresponding task and dataset, along with baselines.

  • Sentence representation learning: An Unsupervised Sentence Embedding Method by Mutual Information Maximization

    A sentence is encoded by BERT, and the output token embeddings are passed through several 1-D convolutions with different kernel sizes to produce n-gram features. Each n-gram feature is treated as a local representation, and the mean-pooled local representations form the global representation. The final sentence vector is then learned with a mutual-information-based loss: maximize the average MI between a sentence's global representation (the sentence vector) and its own local representations, since a good global sentence vector should have high MI with its own local representations and low MI with the local representations of other sentences.

    This objective is similar to contrastive learning: it encourages the encoder to capture a sentence's local representations better and to distinguish the representations of different sentences more clearly. A rough sketch of the setup is given below.
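
    A minimal PyTorch sketch of this setup as I understand it; the module name, loss name, and hyper-parameters below are my own assumptions, not the authors' code.

    ```python
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class LocalGlobalHead(nn.Module):
        """Turn BERT token embeddings into local (n-gram) and global (sentence) features."""
        def __init__(self, hidden=768, out=300, kernel_sizes=(1, 3, 5)):
            super().__init__()
            # One 1-D convolution per kernel size, applied over the token dimension.
            self.convs = nn.ModuleList(
                [nn.Conv1d(hidden, out, k, padding=k // 2) for k in kernel_sizes]
            )

        def forward(self, token_emb, mask):
            # token_emb: (B, T, H) BERT outputs; mask: (B, T) float, 1.0 for real tokens.
            x = token_emb.transpose(1, 2)                                  # (B, H, T)
            local = torch.cat([F.relu(c(x)) for c in self.convs], dim=1)   # (B, D, T)
            local = local * mask.unsqueeze(1)                              # zero out padding
            global_ = local.sum(-1) / mask.sum(-1, keepdim=True).clamp(min=1.0)
            return local, global_                 # local features, mean-pooled sentence vector

    def js_mi_loss(local, global_, mask):
        """Jensen-Shannon-style MI objective: each sentence vector should score high
        against its own local features and low against other sentences' local features."""
        B = global_.size(0)
        pos = torch.einsum("bd,bdt->bt", global_, local)         # scores vs. own local features
        all_pairs = torch.einsum("bd,cdt->bct", global_, local)  # scores vs. every sentence's features
        neg_mask = (1 - torch.eye(B, device=local.device)).unsqueeze(-1) * mask.unsqueeze(0)
        e_pos = (F.softplus(-pos) * mask).sum() / mask.sum().clamp(min=1.0)
        e_neg = (F.softplus(all_pairs) * neg_mask).sum() / neg_mask.sum().clamp(min=1.0)
        return e_pos + e_neg    # minimizing this maximizes the JS lower bound on MI
    ```

    In-batch sentences act as negatives for each other, which is what gives it the contrastive flavor mentioned above.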

  • Model interpretability: Analyzing Individual Neurons in Pre-trained Language Models

    Models analyzed: ELMo, T-ELMo, BERT, XLNet

    Approach: probing at the level of individual neurons rather than whole layers.

    Selected findings (a toy probing sketch follows the list):

    1) It is possible to extract a very small subset of neurons that predicts a linguistic task with the same accuracy as using the entire network. Low-level tasks such as predicting morphology require fewer neurons than high-level tasks such as predicting syntax.

    2) ELMo generally needed fewer neurons, while T-ELMo required more neurons than the other models to reach oracle performance, suggesting that knowledge of lexical semantics in T-ELMo is distributed across more neurons.

    3) Lexical tasks such as morphology (POS tagging) and word semantics (SEM tagging) are dominantly captured by neurons at the lower layers, whereas the more complicated task of modeling syntax (CCG supertagging) is handled at the final layer. The exception is BERT: its top neurons are spread across all layers, unlike the other models, where the top neurons for a given task come from fewer layers.

    4) BERT is the most distributed model with respect to all properties, while XLNet exhibits the most focus, with the most disjoint sets of neurons and layers designated for different linguistic properties.

    5) Some phenomena (e.g. verbs) are distributed across many neurons, while others (e.g. interjections) are localized in fewer neurons.
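
    As an illustration of what neuron-level probing means operationally (not the paper's exact procedure; the regularizer, selection rule, and names are my assumptions): train a sparse linear probe on all neuron activations, rank neurons by weight magnitude, and check how well a small subset alone predicts the linguistic property.

    ```python
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def probe_top_neurons(acts, labels, top_k=50):
        """acts: (num_tokens, num_neurons) activations; labels: per-token linguistic tags.
        Returns the indices of the top-k neurons and the accuracy of a probe that only
        sees those neurons (in practice, evaluate on a held-out split, not on acts itself)."""
        sparse_probe = LogisticRegression(penalty="l1", solver="saga", C=0.1, max_iter=2000)
        sparse_probe.fit(acts, labels)
        importance = np.abs(sparse_probe.coef_).sum(axis=0)   # aggregate weights over classes
        top = np.argsort(importance)[::-1][:top_k]            # most important neurons
        small_probe = LogisticRegression(max_iter=2000).fit(acts[:, top], labels)
        return top, small_probe.score(acts[:, top], labels)
    ```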

  • Analyzing Redundancy in Pretrained Transformer Models

    Findings (a toy illustration of neuron redundancy follows below):

    1) Adjacent layers are highly redundant, except for the last two.

    2) About 85% of the neurons can be removed with no loss in performance; most redundant neurons come from the same layer or adjacent layers.

    3) Layer-level redundancy is high for sequence labeling tasks and substantially lower for sequence classification tasks; all of the sequence classification tasks are learned at the higher layers, and none of the lower layers were found to be redundant for them.

    4) XLNet is more redundant than BERT.

    5) Complex core language tasks require more neurons, and core linguistic tasks show less task-specific redundancy than higher-level tasks.
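
    The paper's exact analysis is not reproduced here, but as a toy illustration of what "redundant neurons" can mean, one can group neurons whose activations are almost perfectly correlated and keep a single representative per group:

    ```python
    import numpy as np

    def correlated_neuron_groups(acts, threshold=0.9):
        """acts: (num_tokens, num_neurons). Greedily group neurons whose activations
        correlate above the threshold; within a group, all but one neuron carry
        nearly the same signal and are redundant in that sense."""
        corr = np.corrcoef(acts, rowvar=False)      # (N, N) pairwise Pearson correlations
        n, seen, groups = corr.shape[0], set(), []
        for i in range(n):
            if i in seen:
                continue
            group = [i] + [j for j in range(i + 1, n)
                           if j not in seen and abs(corr[i, j]) >= threshold]
            seen.update(group)
            groups.append(group)
        return groups
    ```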

  • Context-Aware Answer Extraction in Question Answering. (not found)

  • Dense Passage Retrieval for Open-Domain Question Answering.

  • Don’t Read Too Much Into It: Adaptive Computation for Open-Domain Question Answering. (not found)

  • Masking as an Efficient Alternative to Finetuning for Pretrained Language Models.

    Method: when transferring a pretrained model to a downstream task, instead of fine-tuning all parameters, select a subset of the pretrained weights and mask out the rest; this achieves performance on par with fine-tuning. Results: 1) intrinsic evaluations show that masked language models extract valid representations for downstream tasks; 2) the representations are generalizable; 3) the minima obtained by fine-tuning and by masking can easily be connected by a line segment. A rough sketch of weight masking is given below.
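
    A minimal PyTorch sketch of the masking idea; the threshold-plus-straight-through trick below is one common way to learn binary masks and may differ in detail from the paper's scheme:

    ```python
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class MaskedLinear(nn.Module):
        """Wrap a pretrained linear layer: keep its weights frozen and instead learn
        a binary mask that switches individual weights on or off."""
        def __init__(self, pretrained: nn.Linear, init_score=0.01):
            super().__init__()
            self.weight = nn.Parameter(pretrained.weight.detach().clone(), requires_grad=False)
            self.bias = nn.Parameter(pretrained.bias.detach().clone(), requires_grad=False)
            # Real-valued scores; thresholding them at 0 yields the binary mask.
            self.scores = nn.Parameter(torch.full_like(self.weight, init_score))

        def forward(self, x):
            hard = (self.scores > 0).float()
            # Straight-through trick: forward uses the hard 0/1 mask,
            # gradients flow to `scores` as if the threshold were the identity.
            mask = hard.detach() + self.scores - self.scores.detach()
            return F.linear(x, self.weight * mask, self.bias)
    ```

    Only `scores` receives gradients; the pretrained weights never change, so the task-specific parameters reduce to one bit per weight at inference time.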

  • Modularized Transfomer-based Ranking Framework. (not found)

  • Multi-Step Inference for Reasoning Over Paragraphs

    Proposes a multi-step reasoning model with three modules: select, chain, and predict.

  • MultiCQA: Zero-Shot Transfer of Self-Supervised Text Matching Models on a Massive Scale.

    Finding: neither domain similarity nor training data size is suitable for predicting the best models.

  • Multilevel Text Alignment with Cross-Document Attention

    Adds cross-attention to a two-tower hierarchical-attention model (see the sketch below).

    Also proposes a benchmark for document-to-document and sentence-to-document alignment.
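
    A rough sketch of adding cross-attention on top of the two towers' sentence vectors; the dimensions, pooling, and scoring head are my assumptions, not the paper's exact architecture:

    ```python
    import torch
    import torch.nn as nn

    class CrossDocAttention(nn.Module):
        def __init__(self, dim=256, heads=4):
            super().__init__()
            self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.score = nn.Linear(2 * dim, 1)

        def forward(self, sents_a, sents_b):
            # sents_a: (B, Sa, D), sents_b: (B, Sb, D) sentence vectors from each tower.
            a2b, _ = self.attn(sents_a, sents_b, sents_b)   # each sentence of A attends over B
            b2a, _ = self.attn(sents_b, sents_a, sents_a)   # each sentence of B attends over A
            doc_a = (sents_a + a2b).mean(dim=1)             # pool to document vectors
            doc_b = (sents_b + b2a).mean(dim=1)
            return self.score(torch.cat([doc_a, doc_b], dim=-1)).squeeze(-1)  # alignment score
    ```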

  • On the Sentence Embeddings from BERT for Semantic Textual Similarity. (not found)

  • Probing Pretrained Language Models for Lexical Semantics (not found)

  • Recall and Learn: Fine-tuning Deep Pretrained Language Models with Less Forgetting

    During fine-tuning, takes a multi-task learning view and jointly learns the pretraining objective alongside the downstream task. Since the pretraining task and data are not available, a Pretraining Simulation mechanism stands in for them, and an Objective Shifting mechanism gradually shifts the weight of the joint objective toward the downstream task. A sketch of the combined objective is given below.
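
    A minimal sketch of the combined objective as I read it: a quadratic penalty toward the pretrained weights stands in for the pretraining objective (Pretraining Simulation), and a sigmoid-annealed coefficient shifts the mix toward the task loss over time (Objective Shifting). The hyper-parameter names and values are assumptions:

    ```python
    import math
    import torch

    def recall_and_learn_loss(task_loss, model, pretrained_params, step,
                              gamma=0.1, k=0.05, t0=1000):
        """task_loss: downstream loss tensor; pretrained_params: detached copies of
        the pretrained weights, aligned with model.parameters(); step: training step."""
        # Pretraining Simulation: penalize drift away from the pretrained weights.
        drift = sum(((p - p0) ** 2).sum()
                    for p, p0 in zip(model.parameters(), pretrained_params))
        # Objective Shifting: lambda anneals from ~0 to ~1 as training proceeds.
        lam = 1.0 / (1.0 + math.exp(-k * (step - t0)))
        return lam * task_loss + (1.0 - lam) * gamma * drift
    ```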

  • Ad-hoc Document Retrieval using Weak-Supervision with BERT and GPT2. (not found)

  • Towards Better Context-aware Lexical Semantics: Adjusting Contextualized Representations through Static Anchors (not found)

  • FastFormers: Highly Efficient Transformer Models for Natural Language Understanding

  • A Little Bit Is Worse Than None: Ranking with Limited Training Data

  • Early Exiting BERT for Efficient Document Ranking

  • Exploring the Boundaries of Low-Resource BERT Distillation
