Personal Notes on ICLR 2021

Posted by cyq on November 29, 2020
  • Universal Sentence Representations Learning with Conditional Masked Language Model

    A novel pre-training technique, CMLM, for unsupervised sentence representation learning on unlabeled corpora (either monolingual or multilingual). CMLM integrates sentence representation learning into MLM training by conditioning on the encoded vectors of adjacent sentences.
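
    A rough sketch of the conditioning step as I read it; the projection of the neighbor vector into a few pseudo-token embeddings is my own simplification, not necessarily the paper's exact mechanism.

    ```python
    import torch
    import torch.nn as nn

    def condition_on_neighbor(masked_token_embeds, neighbor_sentence_vec, proj):
        """Sketch of CMLM-style conditioning: project the adjacent sentence's
        encoded vector into a few pseudo-token embeddings and prepend them, so the
        MLM must rely on that sentence-level vector to recover the masked tokens."""
        B, d = neighbor_sentence_vec.shape
        cond = proj(neighbor_sentence_vec).view(B, -1, d)   # (B, n_cond, d) pseudo tokens
        return torch.cat([cond, masked_token_embeds], dim=1)

    d_model, n_cond = 64, 3
    proj = nn.Linear(d_model, n_cond * d_model)
    tokens = torch.randn(2, 10, d_model)       # embeddings of the masked sentence
    neighbor = torch.randn(2, d_model)         # encoded vector of the adjacent sentence
    print(condition_on_neighbor(tokens, neighbor, proj).shape)  # torch.Size([2, 13, 64])
    ```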

  • Synthesizer: Rethinking Self-Attention for Transformer Models

    This paper seeks to develop a deeper understanding of the role that the dot product self-attention mechanism plays in Transformer models.

    This paper proposes SYNTHESIZER, a new model that learns to synthesize the self-alignment matrix instead of manually computing pairwise dot products.
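
    A minimal sketch of the Dense Synthesizer variant, assuming a per-token MLP that emits one attention logit per position; the module and parameter names are mine, not the paper's.

    ```python
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class DenseSynthesizerHead(nn.Module):
        """Rough sketch of one Dense Synthesizer head: attention logits are
        synthesized from each token alone, with no query-key dot products."""
        def __init__(self, d_model, max_len):
            super().__init__()
            self.to_logits = nn.Sequential(      # per-token MLP producing logits
                nn.Linear(d_model, d_model),
                nn.ReLU(),
                nn.Linear(d_model, max_len),     # one logit per (possible) position
            )
            self.value = nn.Linear(d_model, d_model)

        def forward(self, x):                    # x: (batch, seq_len, d_model)
            seq_len = x.size(1)
            logits = self.to_logits(x)[..., :seq_len]   # (batch, seq_len, seq_len)
            attn = F.softmax(logits, dim=-1)
            return attn @ self.value(x)          # weighted sum of values, as usual

    x = torch.randn(2, 16, 64)
    out = DenseSynthesizerHead(d_model=64, max_len=128)(x)
    print(out.shape)  # torch.Size([2, 16, 64])
    ```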

  • Multi-Head Attention: Collaborate Instead of Concatenate

    As the multiple heads inherently solve similar tasks, they can collaborate instead of being independent. Collaborative MHA introduces weight sharing across the key/query projections and decreases the number of parameters and FLOPs.
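
    A rough sketch of the shared-projection idea, assuming each head re-weights a single shared query/key projection with a small per-head mixing vector; dimension names and the mixing scheme are my reading of the idea, not the authors' code.

    ```python
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class CollaborativeAttention(nn.Module):
        """Sketch of collaborative heads: one shared query/key projection plus a
        small per-head mixing vector, instead of separate projections per head."""
        def __init__(self, d_model, n_heads, d_shared):
            super().__init__()
            self.q = nn.Linear(d_model, d_shared)           # shared across heads
            self.k = nn.Linear(d_model, d_shared)           # shared across heads
            self.mix = nn.Parameter(torch.randn(n_heads, d_shared))  # per-head reweighting
            self.v = nn.Linear(d_model, d_model)
            self.n_heads = n_heads

        def forward(self, x):                               # x: (B, L, d_model)
            B, L, _ = x.shape
            q, k = self.q(x), self.k(x)                     # (B, L, d_shared)
            # per-head attention logits: (q * mix_h) @ k^T for each head h
            q_h = q.unsqueeze(1) * self.mix[None, :, None, :]    # (B, H, L, d_shared)
            scores = q_h @ k.unsqueeze(1).transpose(-1, -2)      # (B, H, L, L)
            attn = F.softmax(scores / q.size(-1) ** 0.5, dim=-1)
            v = self.v(x).view(B, L, self.n_heads, -1).transpose(1, 2)  # (B, H, L, d_v)
            out = attn @ v                                   # (B, H, L, d_v)
            return out.transpose(1, 2).reshape(B, L, -1)

    print(CollaborativeAttention(64, 4, 32)(torch.randn(2, 10, 64)).shape)  # (2, 10, 64)
    ```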

  • Deepening Hidden Representations from Pre-trained Language Models

    We argue that taking only a single layer’s output restricts the power of the pre-trained representation. Thus we deepen the representation learned by the model by fusing hidden representations with an explicit HIdden Representation Extractor (HIRE), which automatically absorbs the complementary representation with respect to the output of the final layer.
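
    A loose sketch of the fusion step under my own simplifying assumptions (softmax layer weights plus a linear fusion layer); the paper's HIRE module is more involved than this.

    ```python
    import torch
    import torch.nn as nn

    class HiddenRepresentationFusion(nn.Module):
        """Loose sketch of the HIRE idea: pool every layer's hidden states with
        learned importance weights and fuse the result with the final layer."""
        def __init__(self, n_layers, d_model):
            super().__init__()
            self.layer_scores = nn.Parameter(torch.zeros(n_layers))  # one weight per layer
            self.fuse = nn.Linear(2 * d_model, d_model)

        def forward(self, all_hidden):           # list of (B, L, d_model), one per layer
            stacked = torch.stack(all_hidden, dim=0)          # (n_layers, B, L, d)
            w = torch.softmax(self.layer_scores, dim=0)       # normalized layer weights
            complementary = (w[:, None, None, None] * stacked).sum(0)   # (B, L, d)
            final = all_hidden[-1]
            return self.fuse(torch.cat([final, complementary], dim=-1))

    hidden = [torch.randn(2, 8, 768) for _ in range(12)]
    print(HiddenRepresentationFusion(12, 768)(hidden).shape)  # torch.Size([2, 8, 768])
    ```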

  • Contextual Knowledge Distillation for Transformer Compression

    We present a new knowledge distillation strategy for language representation learning that transfers contextual knowledge via two types of relationships across representations: Word Relation (WR) and Layer Transforming Relation (LTR). WR captures the relationships between word representations, and LTR defines how each word representation changes as it passes through the network layers.
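
    A minimal sketch of a WR-style term, assuming the word relation is a pairwise cosine-similarity matrix matched with an MSE loss; the paper's exact relation function and loss may differ.

    ```python
    import torch
    import torch.nn.functional as F

    def word_relation_loss(student_h, teacher_h):
        """Sketch of a Word-Relation style distillation term: match the pairwise
        cosine-similarity matrices of student and teacher token representations.
        Relation matrices are (B, L, L), so no cross-dimension projection is needed."""
        def pairwise_cos(h):                       # h: (B, L, d)
            h = F.normalize(h, dim=-1)
            return h @ h.transpose(-1, -2)         # (B, L, L) similarity matrix
        return F.mse_loss(pairwise_cos(student_h), pairwise_cos(teacher_h))

    student = torch.randn(2, 16, 384)   # student hidden states (smaller model)
    teacher = torch.randn(2, 16, 768)   # teacher hidden states
    print(word_relation_loss(student, teacher).item())
    ```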

  • A Mathematical Exploration of Why Language Models Help Solve Downstream Tasks

  • JAKET: Joint Pre-training of Knowledge Graph and Language Understanding

    Pre-trained language models struggle to grasp world knowledge about entities and relations. We propose JAKET, a Joint pre-trAining framework for KnowledgE graph and Text. Under this framework, the knowledge module and the language module each provide essential information to the other.

  • Optimizing Transformers with Approximate Computing for Faster, Smaller and more Accurate NLP Models

    We observe that for a given downstream task, some parts of the pre-trained Transformer are more significant for obtaining good accuracy, while other parts are less important or unimportant. To exploit this observation in a principled manner, we introduce a framework that applies approximations while fine-tuning a pre-trained Transformer network, optimizing for either the size, latency, or accuracy of the final network.

  • Do Wide and Deep Networks Learn the Same Things? Uncovering How Neural Network Representations Vary with Width and Depth

    In this paper, we study these core questions, through detailed analysis of a family of ResNet models with varying depths and widths trained on CIFAR-10, CIFAR-100 and ImageNet.

  • 【MDR】Answering Complex Open-Domain Questions with Multi-Hop Dense Retrieval

    We propose a simple and efficient multi-hop dense retrieval approach for answering complex open-domain questions, which achieves state-of-the-art performance on two multi-hop datasets, HotpotQA and multi-evidence FEVER.

    Our method iteratively encodes the question and previously retrieved documents as a query vector and retrieves the next relevant documents using efficient MIPS methods.

    The method improves on DPR by sampling negatives from a memory bank (Wu et al., 2018), in which the representations of negative candidates are frozen so that more candidates can be stored.
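
    A toy sketch of the iterative retrieval loop; `encode` is a hypothetical stand-in for the shared encoder, and brute-force inner products stand in for an ANN/MIPS index such as FAISS.

    ```python
    import numpy as np

    def encode(text):
        """Stand-in for the shared question/passage encoder (hypothetical)."""
        rng = np.random.default_rng(abs(hash(text)) % (2**32))
        v = rng.normal(size=128)
        return v / np.linalg.norm(v)

    def multi_hop_retrieve(question, corpus, hops=2):
        """Sketch of MDR-style iterative retrieval: re-encode the query together
        with what was retrieved so far, then run MIPS again."""
        doc_matrix = np.stack([encode(d) for d in corpus])       # precomputed doc vectors
        query, chain = question, []
        for _ in range(hops):
            q_vec = encode(query)
            scores = doc_matrix @ q_vec                          # inner-product search
            best = corpus[int(np.argmax(scores))]
            chain.append(best)
            query = question + " [SEP] " + best                  # condition next hop on evidence
        return chain

    corpus = ["doc about bridges", "doc about rivers", "doc about ICLR"]
    print(multi_hop_retrieve("which river does the bridge cross?", corpus))
    ```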

  • Cluster-Former: Clustering-based Sparse Transformer for Question Answering

    Self-attention struggles when handling inputs with extremely long-range dependencies, as its complexity grows quadratically with sequence length.

    We propose Cluster-Former, a novel clustering-based sparse Transformer that performs attention across chunked sequences. Cluster-Former consists of two types of encoding layers. The first (denoted the Sliding-Window Layer) focuses on extracting local information within a sliding window. The other (denoted the Cluster-Former Layer) learns to encode global information beyond the initial chunked sequences. We stack these two types of layers alternately to capture both local and global context efficiently.
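
    A toy sketch of the Cluster-Former layer's core step, assuming hard nearest-centroid assignment and single-head attention restricted to each cluster; centroid updates, multi-head projections, and chunking are omitted.

    ```python
    import torch
    import torch.nn.functional as F

    def cluster_attention(h, centroids):
        """Assign every token's hidden state to its nearest centroid and run
        self-attention only among tokens that fall in the same cluster."""
        L, d = h.shape
        assign = torch.cdist(h, centroids).argmin(dim=-1)        # (L,) cluster id per token
        out = torch.zeros_like(h)
        for c in assign.unique():
            idx = (assign == c).nonzero(as_tuple=True)[0]
            hc = h[idx]                                          # tokens in this cluster
            attn = F.softmax(hc @ hc.T / d ** 0.5, dim=-1)
            out[idx] = attn @ hc                                 # attention restricted to the cluster
        return out

    h = torch.randn(100, 64)            # a long (chunk-concatenated) sequence
    centroids = torch.randn(8, 64)      # cluster centroids
    print(cluster_attention(h, centroids).shape)   # torch.Size([100, 64])
    ```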

  • Block Skim Transformer for Efficient Question Answering

    We then propose Block Skim Transformer (BST), a plug-and-play module for transformer-based models that accelerates them on QA tasks.

    By treating the attention weight matrices as feature maps, the CNN-based Block Skim module extracts information from the attention mechanism to make a skim decision. With the predicted block mask, BST skips irrelevant context blocks, which do not enter subsequent layers’ computation.
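
    A rough sketch of the skim decision, assuming diagonal attention blocks, a small CNN, and a binary classifier; the kernel sizes, pooling, and classifier head are my guesses, not the paper's.

    ```python
    import torch
    import torch.nn as nn

    class BlockSkimModule(nn.Module):
        """Sketch of a Block-Skim style gate: treat a layer's attention weights as a
        feature map, pool them per block, and predict keep/skip for each block."""
        def __init__(self, n_heads, block_size):
            super().__init__()
            self.conv = nn.Sequential(
                nn.Conv2d(n_heads, 8, kernel_size=3, padding=1),
                nn.ReLU(),
                nn.AdaptiveAvgPool2d(1),        # one feature vector per block
            )
            self.classifier = nn.Linear(8, 2)   # logits for {skip, keep}
            self.block_size = block_size

        def forward(self, attn):                # attn: (B, heads, L, L)
            B, H, L, _ = attn.shape
            nb = L // self.block_size
            # carve the L x L attention map into nb diagonal blocks
            blocks = torch.stack(
                [attn[:, :, i*self.block_size:(i+1)*self.block_size,
                            i*self.block_size:(i+1)*self.block_size] for i in range(nb)], dim=1)
            feats = self.conv(blocks.flatten(0, 1)).flatten(1)   # (B*nb, 8)
            return self.classifier(feats).view(B, nb, 2)         # per-block keep/skip logits

    attn = torch.softmax(torch.randn(2, 12, 128, 128), dim=-1)
    print(BlockSkimModule(n_heads=12, block_size=32)(attn).shape)  # torch.Size([2, 4, 2])
    ```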

  • Distilling Knowledge from Reader to Retriever for Question Answering

    In this paper, we propose a technique, inspired by knowledge distillation, to learn retriever models for downstream tasks that does not require annotated pairs of queries and documents. Our approach leverages the attention scores of a reader model, used to solve the task based on retrieved documents, to obtain synthetic labels for the retriever.

    See also: Relevance-guided Supervision for OpenQA with ColBERT.
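
    A minimal sketch of the distillation objective, assuming the reader's cross-attention has already been aggregated into one score per passage and a KL divergence aligns the two distributions; the paper may use other alignment losses as well.

    ```python
    import torch
    import torch.nn.functional as F

    def retriever_distillation_loss(retriever_scores, reader_attention_scores, temperature=1.0):
        """Push the retriever's distribution over the retrieved passages towards the
        (softmaxed) aggregated reader attention scores, which act as synthetic labels."""
        target = F.softmax(reader_attention_scores / temperature, dim=-1)      # teacher signal
        log_pred = F.log_softmax(retriever_scores / temperature, dim=-1)       # retriever
        return F.kl_div(log_pred, target, reduction="batchmean")

    retriever_scores = torch.randn(4, 20, requires_grad=True)  # 4 queries x 20 retrieved passages
    reader_scores = torch.randn(4, 20)                         # aggregated attention per passage
    loss = retriever_distillation_loss(retriever_scores, reader_scores)
    loss.backward()
    print(loss.item())
    ```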

  • When Do Curricula Work?

  • Rethinking Embedding Coupling in Pre-trained Language Models

  • SEED: Self-supervised Distillation For Visual Representation

    We find that existing techniques like contrastive learning do not work well on small networks.

    Instead of directly learning from unlabeled data, we propose SEED, a novel self-supervised learning paradigm that learns representations by self-supervised distillation from a larger SSL pre-trained model.
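
    A minimal sketch of the distillation loss, assuming both networks score each instance against a maintained queue of embeddings and the student matches the teacher's softmax distribution; queue maintenance and projection heads are omitted.

    ```python
    import torch
    import torch.nn.functional as F

    def seed_distillation_loss(student_emb, teacher_emb, queue, temperature=0.07):
        """Sketch of SEED-style self-supervised distillation: for each image, both
        student and teacher score it against a queue of instance embeddings, and the
        student's softmax distribution is pushed towards the teacher's."""
        s = F.normalize(student_emb, dim=-1) @ queue.T / temperature   # (B, K) student logits
        t = F.normalize(teacher_emb, dim=-1) @ queue.T / temperature   # (B, K) teacher logits
        return F.kl_div(F.log_softmax(s, dim=-1), F.softmax(t, dim=-1),
                        reduction="batchmean")

    queue = F.normalize(torch.randn(4096, 128), dim=-1)   # maintained embedding queue
    loss = seed_distillation_loss(torch.randn(8, 128), torch.randn(8, 128), queue)
    print(loss.item())
    ```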

  • Approximate Nearest Neighbor Negative Contrastive Learning for Dense Text Retrieval

    In this paper, we first theoretically show the learning bottleneck of dense retrieval is the domination of uninformative negatives sampled locally in batch, which yield diminishing gradient norms, large stochastic gradient variances, and slow learning convergence.

    We then propose Approximate nearest neighbor Negative Contrastive Learning (ANCE), a learning mechanism that selects hard training negatives globally from the entire corpus, using an asynchronously updated ANN index.
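
    A toy sketch of the negative construction step; brute-force search stands in for the asynchronously rebuilt ANN index, and the function and variable names here are hypothetical.

    ```python
    import numpy as np

    def refresh_hard_negatives(query_vecs, doc_vecs, positives, n_neg=5):
        """Sketch of ANCE's negative construction: with a (periodically refreshed)
        snapshot of the document encoder, retrieve each query's top-scoring documents
        and keep the non-positives as hard negatives."""
        scores = query_vecs @ doc_vecs.T                      # (n_queries, n_docs) MIPS scores
        negatives = []
        for qi, pos in enumerate(positives):
            ranked = np.argsort(-scores[qi])                  # best-scoring docs first
            hard = [d for d in ranked if d != pos][:n_neg]    # top results that are not the gold doc
            negatives.append(hard)
        return negatives

    rng = np.random.default_rng(0)
    q = rng.normal(size=(3, 64))          # query embeddings from the current checkpoint
    d = rng.normal(size=(100, 64))        # document embeddings from a stale snapshot
    print(refresh_hard_negatives(q, d, positives=[5, 17, 42]))
    ```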

  • PMI-Masking: Principled masking of correlated spans

    Rather than masking a single subtoken at a time, mask an entire correlated span.
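
    A small illustration of the PMI scoring behind the idea, on bigrams only; the paper builds a corpus-level vocabulary of longer correlated n-grams and masks those spans jointly.

    ```python
    import math
    from collections import Counter

    def high_pmi_bigrams(corpus_tokens, top_k=5):
        """Score bigrams by pointwise mutual information; high-PMI bigrams are the
        kind of correlated spans that PMI-Masking would mask as a unit."""
        unigrams = Counter(corpus_tokens)
        bigrams = Counter(zip(corpus_tokens, corpus_tokens[1:]))
        n_uni, n_bi = sum(unigrams.values()), sum(bigrams.values())
        def pmi(w1, w2):
            p_xy = bigrams[(w1, w2)] / n_bi
            p_x, p_y = unigrams[w1] / n_uni, unigrams[w2] / n_uni
            return math.log(p_xy / (p_x * p_y))
        return sorted(bigrams, key=lambda b: pmi(*b), reverse=True)[:top_k]

    tokens = "new york is a city , new york is big , the city is big".split()
    print(high_pmi_bigrams(tokens))   # ("new", "york") should rank near the top
    ```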

  • Decomposing Mutual Information for Representation Learning

  • Semantic Hashing with Locality Sensitive Embeddings

    We propose a method for learning continuous representations in which the optimized similarity is the angular similarity.
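
    A quick illustration of angular similarity and why it pairs naturally with sign-random-projection (SimHash) hashing: the per-bit collision probability of two vectors equals their angular similarity.

    ```python
    import numpy as np

    def angular_similarity(u, v):
        """Angular similarity: 1 - theta/pi, where theta is the angle between u and v.
        For random-hyperplane LSH this is the probability that a random hyperplane
        puts u and v on the same side, i.e. that one hash bit agrees."""
        cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
        theta = np.arccos(np.clip(cos, -1.0, 1.0))
        return 1.0 - theta / np.pi

    def simhash(x, hyperplanes):
        """Binary code from random hyperplanes: one bit per sign."""
        return (x @ hyperplanes.T > 0).astype(np.uint8)

    rng = np.random.default_rng(0)
    u, v = rng.normal(size=64), rng.normal(size=64)
    planes = rng.normal(size=(32, 64))
    print(angular_similarity(u, v))
    print((simhash(u, planes) == simhash(v, planes)).mean())  # ≈ angular similarity for many planes
    ```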

  • Memory Representation in Transformer

    Adding trainable memory to selectively store local as well as global representations of a sequence is a promising direction to improve the Transformer model.

    In this work, we propose and study a few extensions of the Transformer baseline: (1) adding memory tokens to store non-local representations, (2) creating a memory bottleneck for global information, and (3) controlling memory updates with a dedicated layer.
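
    A minimal sketch of extension (1), prepending trainable memory tokens to the input of a vanilla Transformer encoder; the bottleneck and dedicated-update-layer variants are not shown.

    ```python
    import torch
    import torch.nn as nn

    class MemoryAugmentedEncoder(nn.Module):
        """Prepend a few trainable [mem] tokens to the sequence so self-attention
        can read from and write to them as non-local storage."""
        def __init__(self, d_model=64, n_mem=4, n_layers=2):
            super().__init__()
            self.memory = nn.Parameter(torch.randn(n_mem, d_model))
            layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
            self.n_mem = n_mem

        def forward(self, token_embeddings):                 # (B, L, d_model)
            B = token_embeddings.size(0)
            mem = self.memory.unsqueeze(0).expand(B, -1, -1) # same initial memory for every example
            h = self.encoder(torch.cat([mem, token_embeddings], dim=1))
            return h[:, self.n_mem:], h[:, :self.n_mem]      # token states, updated memory states

    tokens = torch.randn(2, 10, 64)
    out, mem = MemoryAugmentedEncoder()(tokens)
    print(out.shape, mem.shape)   # torch.Size([2, 10, 64]) torch.Size([2, 4, 64])
    ```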

  • Improving Self-supervised Pre-training via a Fully-Explored Masked Language Model

    In this paper, we argue that randomly sampled masks in MLM can lead to undesirably large gradient variance. To reduce the variance due to the sampling of masks, we propose a fully-explored masking strategy, in which a text sequence is divided into multiple non-overlapping segments. During training, all tokens in one segment are masked out, and the model is asked to predict them with the other segments as context. We demonstrate theoretically that the gradients obtained with this masking strategy have smaller variance, enabling more efficient pre-training.

    Existing models such as BERT rely on a masked language model for self-supervised learning, but the masked words or spans are usually chosen at random. This paper argues that poorly chosen masks lead to large gradient variance and hurt model quality, attributing this to similarity among the words masked at the same time. It therefore proposes a masking scheme that increases the diversity among the masked words, weakening the impact of the large gradient variance.
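
    A tiny sketch of the segment-based masking scheme described above; segment boundaries here are drawn at random, and the paper's exact segmentation may differ.

    ```python
    import random

    def fully_explored_masks(seq_len, n_segments):
        """Split the positions into non-overlapping segments; each training pass
        masks exactly one segment and predicts it from the remaining segments."""
        boundaries = sorted(random.sample(range(1, seq_len), n_segments - 1))
        segments, start = [], 0
        for b in boundaries + [seq_len]:
            segments.append(list(range(start, b)))
            start = b
        # one mask per segment: positions in that segment are masked, all others are context
        return [set(seg) for seg in segments]

    random.seed(0)
    for mask in fully_explored_masks(seq_len=12, n_segments=3):
        print(sorted(mask))
    ```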

  • Deep Retrieval: An End-to-End Structure Model for Large-Scale Recommendations

    We propose Deep Retrieval (DR), an end-to-end learnable structure model for large-scale recommender systems. DR uses an EM-style algorithm to jointly learn the model parameters and the paths of items. Experiments show that DR performs well compared with brute-force baselines on two public recommendation datasets.