Deep Transformer-based Representation for Text Chunking
Subject Area: Natural Language Processing
Authors: Parsa Kavehzadeh 1, Mohammad Mahdi Abdollah Pour 2, Saeedeh Momtazi 3
1 - Computer Engineering Department, Amirkabir University of Technology, Tehran, Iran
2 - Computer Engineering Department, Amirkabir University of Technology, Tehran, Iran
3 - Computer Engineering Department, Amirkabir University of Technology, Tehran, Iran
Keywords: Text Chunking, Sequence Labeling, Contextualized Word Representation, Deep Learning, Transformers
Abstract:
Text chunking is one of the basic tasks in natural language processing. Most models proposed in recent years were applied to chunking and other sequence labeling tasks simultaneously, and they were mostly based on Recurrent Neural Networks (RNN) and Conditional Random Fields (CRF). In this article, we use state-of-the-art transformer-based models in combination with a CRF layer, a Long Short-Term Memory (LSTM)-CRF layer, or a simple dense layer to study the impact of different pre-trained models on overall text chunking performance. To this end, we evaluate BERT, RoBERTa, Funnel Transformer, XLM, XLM-RoBERTa, BART, and GPT2 as candidate contextualized models. Our experiments show that all transformer-based models except GPT2 achieve close, high scores on text chunking. Due to its unidirectional architecture, GPT2 performs relatively poorly on text chunking compared to the other, bidirectional transformer-based architectures. Our experiments also reveal that adding an LSTM layer to transformer-based models does not significantly improve the results, since the LSTM does not contribute additional features that help the model extract more information from the input beyond what the deep contextualized representations already provide.
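To make the described setup concrete, below is a minimal sketch, assuming a HuggingFace Transformers-style pre-trained encoder, of the simplest configuration mentioned in the abstract: a contextualized transformer encoder followed by a dense classification layer that assigns a chunk tag to every token. The model name, tag set, and example sentence are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch (not the authors' code): pre-trained transformer encoder + dense tagging head.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class TransformerChunker(nn.Module):
    def __init__(self, model_name: str, num_tags: int):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)   # e.g. BERT, RoBERTa, ...
        hidden = self.encoder.config.hidden_size
        self.classifier = nn.Linear(hidden, num_tags)          # the "simple dense layer" head

    def forward(self, input_ids, attention_mask):
        # Contextualized sub-word representations from the transformer encoder.
        hidden_states = self.encoder(
            input_ids=input_ids, attention_mask=attention_mask
        ).last_hidden_state
        # One chunk-tag score vector per sub-word token.
        return self.classifier(hidden_states)

# Illustrative BIO-style chunk tags (a real chunking tag set such as CoNLL-2000's is larger).
tags = ["O", "B-NP", "I-NP", "B-VP", "I-VP", "B-PP", "I-PP"]

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = TransformerChunker("bert-base-cased", num_tags=len(tags))

batch = tokenizer("He reckons the current account deficit will narrow .",
                  return_tensors="pt")
with torch.no_grad():
    logits = model(batch["input_ids"], batch["attention_mask"])
predictions = logits.argmax(dim=-1)   # one predicted tag index per sub-word token
```

The CRF and LSTM-CRF variants discussed in the abstract would replace this dense head with a CRF (or BiLSTM-CRF) decoding layer over the same encoder outputs.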