A brief history of NLP
In this note, I will give an overview of the development history of NLP, focusing on how recent neural approaches have revolutionised the field.
Natural language processing can be roughly divided into three stages based on the dominant methods used: [1]
- Symbolic NLP
- Statistical NLP
- Neural NLP
Symbolic NLP (1950s - early 1990s)
Research related to natural language processing originated roughly in the 1950s. For the following 40 years, limited by corpus size and computing power, early natural language processing relied mainly on rule-based methods, handling general natural language phenomena through symbolic and logical knowledge summarized by experts. Such rule systems are difficult to apply to real-world problems because of the complexity of natural language.
Statistical NLP (1990s - 2010s)
The rapid advances in computational power and storage capacity, as well as the increasing maturity of statistical learning approaches, have led to the large-scale application of corpus-based statistical learning methods in the field of natural language processing.
The advantages of this approach include fast training, modest requirements for labelled data, and good performance on simple problems.
Meanwhile, the method has obvious limitations. The statistical approach requires transforming the raw natural language input into a vector form that the machine can process, guided by empirical rules. This expertise-dependent, manual transformation is known as feature engineering (feature extraction); it is time-consuming and does not transfer across tasks.
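As a minimal sketch of what such hand-crafted features look like, the snippet below builds bag-of-words count vectors with scikit-learn; the toy corpus is hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus (hypothetical sentences, for illustration only)
corpus = [
    "the cat sat on the mat",
    "the dog chased the cat",
]

# Bag-of-words: each document becomes a sparse count vector over the vocabulary
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)

print(vectorizer.get_feature_names_out())  # the learned vocabulary
print(X.toarray())                         # one count vector per sentence
```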
Neural NLP (2010s - present)
To cope with the disadvantages of feature engineering, representation learning and deep neural network-based machine learning methods became widely applied in the NLP field, offering an end-to-end solution. Representation learning allows the machine to automatically recognise patterns in the input, which can then be used for tasks such as classification.
In 2013, Tomas Mikolov and colleagues proposed the word2vec method [2], which trains a shallow neural network on a large-scale corpus and uses the context surrounding each word to embed the semantics of tokens into a dense vector. The output of such a method is called a word embedding.
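A minimal sketch of training such embeddings with the gensim library (the toy corpus and hyperparameters are hypothetical; real training needs a large corpus):

```python
from gensim.models import Word2Vec

# Tiny toy corpus of pre-tokenised sentences (hypothetical)
sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "chased", "the", "cat"],
]

# Skip-gram word2vec: vector_size is the embedding dimension,
# window is the context size on each side of the target word
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)

print(model.wv["cat"])               # the dense vector for "cat"
print(model.wv.most_similar("cat"))  # nearest neighbours in embedding space
```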
This kind of encoding avoids elaborate feature engineering, and it also breaks down the barriers between different tasks, since representation learning maps inputs into a shared, easily transferable vector space.
This trend of representation learning has spread to knowledge graphs (using graph embedding techniques) and recommender systems (using user/item embedding techniques).
Shortly after, a drawback of word2vec was discovered: the same word has different meanings in different contexts, but the word vector given by this encoding is unique and static. Accordingly, ELMo, a model that introduces contextual word embeddings, was born.
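To make the contrast concrete: a contextual model assigns different vectors to the same word in different sentences. ELMo itself is awkward to run today, so this sketch uses a BERT-style model from the transformers library purely to illustrate the idea; the sentences are hypothetical.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# BERT stands in here only to illustrate contextual embeddings (not ELMo itself)
tok = AutoTokenizer.from_pretrained("bert-base-uncased")
enc = AutoModel.from_pretrained("bert-base-uncased")

def vector_for(sentence: str, word: str) -> torch.Tensor:
    """Return the contextual vector of `word`'s first occurrence in `sentence`."""
    inputs = tok(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = enc(**inputs).last_hidden_state[0]  # (seq_len, hidden_dim)
    idx = inputs.input_ids[0].tolist().index(tok.convert_tokens_to_ids(word))
    return hidden[idx]

v_river = vector_for("he sat by the river bank", "bank")
v_money = vector_for("she deposited cash at the bank", "bank")
# Unlike a static embedding, the two vectors differ with context
print(torch.cosine_similarity(v_river, v_money, dim=0))
```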
In 2017, the Transformer model was released [3]. Compared with ELMo-like models, the Transformer's biggest breakthrough is that it does not use an LSTM but instead relies on an attention mechanism. This mechanism is a function that maps a query and a set of key-value pairs to an output: the output is a weighted sum of the values, where the weight of each value is computed from the query and that value's corresponding key.
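This is the scaled dot-product attention of the Transformer paper; a minimal NumPy sketch (the shapes and random inputs are hypothetical):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V (Vaswani et al., 2017)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                        # query-key compatibility
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)         # row-wise softmax
    return weights @ V                                     # weighted sum of the values

# Toy shapes: 3 queries, 4 key-value pairs, dimension 8 (hypothetical)
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, 8)), rng.normal(size=(4, 8)), rng.normal(size=(4, 8))
print(scaled_dot_product_attention(Q, K, V).shape)  # (3, 8)
```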
Some NLP researchers believe that the attention mechanism used by the Transformer is a better alternative to the LSTM, arguing that it handles long-range dependencies better and holds great promise. The Transformer adopts an encoder-decoder architecture. The encoder and decoder are highly similar in structure but not in function: the encoder consists of N identical encoder layers, the decoder consists of N identical decoder layers, and both kinds of layer use the attention mechanism as a core component.
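A minimal sketch of the "N identical layers" idea using PyTorch's built-in Transformer modules (the sizes below are hypothetical):

```python
import torch
import torch.nn as nn

# One encoder layer: self-attention plus a feed-forward block, as in the Transformer
layer = nn.TransformerEncoderLayer(d_model=512, nhead=8)

# The encoder is simply N identical copies of that layer stacked
encoder = nn.TransformerEncoder(layer, num_layers=6)

x = torch.randn(10, 2, 512)  # (sequence length, batch, d_model)
print(encoder(x).shape)      # torch.Size([10, 2, 512])
```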
The great success of the Transformer in machine translation attracted the interest of many NLP scientists. As deep learning algorithms evolved, their disadvantages began to emerge: the algorithms require massive amounts of labelled data. Given the subjective nature of cognitive tasks in NLP, and the large number of tasks and domains it deals with, acquiring high-quality annotated corpora is time-consuming and labour-intensive. Large-scale pre-trained language models compensate precisely for this shortage and have helped NLP achieve a series of breakthroughs. In this regard, two famous models were born: Bidirectional Encoder Representations from Transformers (BERT) [4] and Generative Pre-Training of language models (GPT) [5].
GPT is built entirely from Transformer decoder layers, while BERT is built entirely from Transformer encoder layers. The goal of GPT is to generate human-like text; the goal of BERT, on the other hand, is to provide better language representations that help achieve better results on a wide range of downstream tasks. The BERT model reached an advanced level on a variety of NLP tasks and greatly improved the state of the art (SOTA) on many of them. BERT has since spawned a large family of models, among which the famous ones are XLNet, RoBERTa, ALBERT [6], ELECTRA, ERNIE, BERT-wwm, DistilBERT, etc.
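A minimal sketch of this division of labour using the Hugging Face transformers pipelines (the model names and prompt are just illustrative choices):

```python
from transformers import pipeline

# GPT-style decoder stack: autoregressive text generation
generator = pipeline("text-generation", model="gpt2")
print(generator("Natural language processing is", max_new_tokens=20))

# BERT-style encoder stack: contextual representations for downstream tasks
extractor = pipeline("feature-extraction", model="bert-base-uncased")
features = extractor("Natural language processing is fun.")
print(len(features[0]), "token vectors of size", len(features[0][0]))
```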
1. Natural language processing#History - Wikipedia
2. Word2vec - Wikipedia
3. [1706.03762] Attention Is All You Need (arxiv.org)
4. [1810.04805] BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (arxiv.org)
5. Improving Language Understanding by Generative Pre-Training
6. [1909.11942] ALBERT: A Lite BERT for Self-supervised Learning of Language Representations (arxiv.org)