Recent advances are still dominated by large pre-trained models.
by Subhadarshi Panda, an NLP researcher
Natural language processing (NLP) is a field that applies computation to text data to gain insights and build predictive systems. Vast amounts of text data are available in written manuscripts and, even more so, online on the web. These data sources have been used by the research community and industry to solve meaningful problems such as predicting the sentiment of a user comment, question answering, and fact-checking. Recent machine learning algorithms have enabled superhuman performance on a wide range of NLP tasks[1]. In this article, we summarize the recent machine learning research trends in NLP, which have not only led to a plethora of breakthroughs but also fueled growing interest in the field.
A big chunk of the breakthroughs can be attributed to large language models, which are built from neural network layers and trained using backpropagation. These language models use a concept called self-attention to encode words into representations that a machine learning system can readily utilize. Instead of a single layer, multiple stacked layers are used in the model architecture, encouraging the model to learn various nuances of language. These language models can then be adapted to many downstream tasks according to the use case. However, training such models requires huge amounts of data and computational power. Although large amounts of text data are relatively straightforward to scrape, obtaining computational resources is costly. As a result, most such models have been trained by groups with large clusters of GPUs. These models are often released for use by the research community, so one does not need to re-train them from scratch. However, deploying such systems on a small device such as a mobile phone or a smartwatch is still a challenge. To solve this problem, recent efforts have tried to reduce the model size by distilling a pre-trained model.
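To make the self-attention idea concrete, here is a minimal sketch of a single scaled dot-product attention head in NumPy. Real language models add learned projection matrices for the queries, keys, and values, use many heads in parallel, and stack dozens of such layers; this toy version only shows the core computation.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """One self-attention head: every position attends to every position.

    Q, K, V are (sequence_length, dimension) arrays. Returns the new
    representations and the attention weight matrix.
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # pairwise similarities
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over positions
    return weights @ V, weights

# Toy example: 3 "words", each represented by a 4-dimensional vector.
rng = np.random.default_rng(0)
X = rng.normal(size=(3, 4))
out, w = scaled_dot_product_attention(X, X, X)
# Each output row is a weighted mix of all input rows, so every word's
# new representation is informed by its context.
```

Each row of the weight matrix sums to one, so the output for a word is a convex combination of all the words in the sentence, weighted by relevance.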
The knowledge encoded in a large language model is usually diverse and covers a wide range of the language's spectrum. This knowledge, however, is only as good as the data it is learned from. It has been shown that most real-world datasets are biased. For instance, gender bias is prevalent in a large number of written documents, where certain names are associated with specific occupations. When the model learns from these datasets, it can pick up these associations and eventually encode a biased notion. Moreover, the data on online platforms are not fully clean; the content may be toxic and abusive. Using a model trained on such data for generating text can result in unacceptable machine-generated text. To address these problems, one approach is to check the quality of the training data so that the model does not learn biases in the first place. Another recent approach is to take a pre-trained model and debias it by identifying the biases it has. Neither of these two approaches is guaranteed to remove bias and toxicity, and this remains an open problem for the research community.
Moving on, one notable downside of text data is that not all languages have equal amounts of it readily available. For example, if we consider English, Hindi, and Odia, the largest amount of data would be in English, and between Hindi and Odia, we would find less data for the latter. This language disparity has two causes. First, the number of speakers differs across languages. Second, not all languages are equally represented on online platforms; for instance, a huge number of Hindi and Odia speakers may still use English as their primary language of communication online. There are ongoing efforts to address language disparity by developing machine learning methods that support all languages reasonably well, even when some languages have very little training data. One approach is to encourage the creation of training data in under-resourced languages. Data creation can be done not just by professional annotators but also by the community[2]. Another research trend for addressing language disparity is to build multilingual models instead of monolingual ones. The key idea here is to share the model parameters across languages. It has been shown that a machine learning model can benefit from this through transfer learning, where knowledge from a high-resource language flows to low-resource languages.
Most real-world NLP applications are nuanced, and even after obtaining high-quality training data, the problem of data imbalance remains. Data imbalance is the problem of having unequal amounts of training data for the various categories. Although traditional methods such as undersampling and oversampling are useful, they do not add new information to the training. An approach that does is data augmentation, whose key idea is to generate artificial sentences for the under-represented categories and add them to the original training data. More recent methods have used paraphrase generation to create new sentences that are similar in meaning to the original ones. Training a paraphrase generation model is a research problem in its own right, which can be addressed using an annotated paraphrase corpus. Notably, compared to images, generating diverse yet semantically invariant perturbations of text is much more challenging.
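As a rough sketch of the augmentation idea, the toy helpers below (hypothetical names, not from any particular library) balance an imbalanced sentiment dataset by adding perturbed copies of the minority-class sentences. Here the perturbation is simple word dropout; a paraphrase generation model would produce far richer variants, but the bookkeeping is the same.

```python
import random

def word_dropout(sentence, p=0.2, rng=None):
    """Return a perturbed copy of a sentence by randomly dropping words.

    A deliberately simple stand-in for a real augmentation method such
    as paraphrase generation.
    """
    rng = rng or random.Random(0)
    tokens = sentence.split()
    kept = [t for t in tokens if rng.random() > p]
    return " ".join(kept) if kept else sentence

def augment_minority(data, labels, target_label, factor=1):
    """Append `factor` perturbed copies of each minority-class example."""
    minority = [s for s, y in zip(data, labels) if y == target_label]
    rng = random.Random(42)
    new_data, new_labels = list(data), list(labels)
    for _ in range(factor):
        for s in minority:
            new_data.append(word_dropout(s, rng=rng))
            new_labels.append(target_label)
    return new_data, new_labels

# Toy imbalanced dataset: 4 positive examples, only 2 negative ones.
data = ["good movie", "nice film", "great acting", "fun plot",
        "bad movie", "dull plot"]
labels = ["pos", "pos", "pos", "pos", "neg", "neg"]
aug_data, aug_labels = augment_minority(data, labels, "neg", factor=1)
# The negative class now has as many examples as the positive class.
```

Unlike plain oversampling, the added sentences are not exact duplicates, so the model sees slightly new surface forms of the minority class.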
Finally, it is worth highlighting the importance of automatic evaluation in NLP, especially for systems that generate text. When new sentences are generated, judging them is highly subjective, and multiple expressions of the same sentence can all be correct. This is also a major challenge for open-domain chatbots that generate replies to user comments. Although this is still an open problem, recent approaches to automatic evaluation include measuring the novelty and diversity of the generated text. In most situations, striking the right balance of novelty and diversity is what matters most when generating text.
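One widely used diversity measure is the distinct-n metric: the fraction of n-grams in the generated text that are unique. A chatbot that keeps producing the same safe reply scores low. A minimal implementation:

```python
def distinct_n(sentences, n=2):
    """Fraction of n-grams that are unique across the generated sentences.

    Returns a value in [0, 1]; higher means more diverse output.
    """
    ngrams = []
    for s in sentences:
        tokens = s.split()
        ngrams.extend(tuple(tokens[i:i + n])
                      for i in range(len(tokens) - n + 1))
    if not ngrams:
        return 0.0
    return len(set(ngrams)) / len(ngrams)

# A repetitive chatbot: two of the three replies are identical,
# so many unigrams are repeated and the score drops below 1.
replies = ["i do not know", "i do not know", "that is a great question"]
score = distinct_n(replies, n=1)
```

Metrics like this capture diversity but not correctness or fluency, which is why automatic evaluation of generated text remains an open problem.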
References
[2] One such example is https://twitter.com/mte2o.
About the Author
Subhadarshi Panda is a research scholar working on deep learning for natural language processing. He mainly works on cross-lingual and cross-domain applications involving text data.