arXiv:1904.09482

Improving Multi-Task Deep Neural Networks via Knowledge Distillation for Natural Language Understanding

Published on Apr 20, 2019
Authors: Xiaodong Liu, Pengcheng He, Weizhu Chen, Jianfeng Gao
AI-generated summary

Knowledge distillation applied to a Multi-Task Deep Neural Network (MT-DNN) improves performance across multiple natural language understanding tasks, yielding a significant boost on the GLUE benchmark.

Abstract

This paper explores the use of knowledge distillation to improve a Multi-Task Deep Neural Network (MT-DNN) (Liu et al., 2019) for learning text representations across multiple natural language understanding tasks. Although ensemble learning can improve model performance, serving an ensemble of large DNNs such as MT-DNN can be prohibitively expensive. Here we apply the knowledge distillation method (Hinton et al., 2015) in the multi-task learning setting. For each task, we train an ensemble of different MT-DNNs (teacher) that outperforms any single model, and then train a single MT-DNN (student) via multi-task learning to distill knowledge from these ensemble teachers. We show that the distilled MT-DNN significantly outperforms the original MT-DNN on 7 out of 9 GLUE tasks, pushing the GLUE benchmark (single model) to 83.7% (a 1.5% absolute improvement, based on the GLUE leaderboard at https://gluebenchmark.com/leaderboard as of April 1, 2019). The code and pre-trained models will be made publicly available at https://github.com/namisan/mt-dnn.
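To make the distillation objective concrete, here is a minimal PyTorch sketch of a per-task loss in the spirit of Hinton et al. (2015): the soft targets are the averaged class probabilities of the ensemble teachers, mixed with the usual hard-label cross entropy. The function name `distillation_loss` and the hyperparameters `alpha` and `T` are illustrative assumptions, not values or APIs taken from the paper or the mt-dnn repository.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits_list, labels,
                      alpha=0.5, T=1.0):
    """Hypothetical per-task KD objective.

    Soft targets are the average of the ensemble teachers' (optionally
    temperature-softened) class probabilities; the total loss mixes the
    soft-target cross entropy with the standard hard-label loss.
    """
    # Average the teachers' probability distributions to form soft targets.
    with torch.no_grad():
        soft_targets = torch.stack(
            [F.softmax(t / T, dim=-1) for t in teacher_logits_list]
        ).mean(dim=0)

    # Soft cross entropy between the student and the ensemble soft targets.
    log_probs = F.log_softmax(student_logits / T, dim=-1)
    soft_loss = -(soft_targets * log_probs).sum(dim=-1).mean()

    # Standard cross entropy against the gold labels.
    hard_loss = F.cross_entropy(student_logits, labels)

    return alpha * soft_loss + (1.0 - alpha) * hard_loss
```

In the multi-task setting described above, a loss of this form would be computed for each task that has an ensemble teacher, while tasks without teachers fall back to the plain hard-label loss; the student is then trained on all tasks jointly.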
