Why there was a need for transformers
While recurrent architectures such as the LSTM or GRU attempted to solve the problem of keeping important information (context) around, their reference window was still rather limited. The reference window is, simply put, how many words/tokens back in the history the model can look from the current word. Here is the description from the original paper:
Attention mechanisms have become an integral part of compelling sequence modeling and transduction models in various tasks, allowing modeling of dependencies without regard to their distance in the input or output sequences [2, 19]. In all but a few cases [27], however, such attention mechanisms are used in conjunction with a recurrent network. In this work we propose the Transformer, a model architecture eschewing recurrence and instead relying entirely on an attention mechanism to draw global dependencies between input and output…
Moreover, the inherently sequential nature of recurrent networks makes their computation hard to parallelize, which makes them a poor fit for systems that require high throughput and low latency. The innovation that transformers brought to the NLP community was therefore a state-of-the-art architecture (great model performance on many test data sets) that also scales well, because it is parallelizable.
High level architecture of transformers
The original paper divides the architecture of the model into two major parts: the encoder (left half) and the decoder (right half). The intuition is that the encoder should extract the needed features (i.e. a useful representation of the tokens), and the decoder should then use these features to make a prediction for the task at hand (e.g. translation). Later in this chapter we will talk about BERT, a model that was inspired by the transformers paper. "Inspired" here means that the original transformer architecture was adjusted so that only the encoder part of the original model is used. This is why you might hear that BERT is an encoder-based model. I will talk more about the distinctions between encoder, decoder, and encoder-decoder architectures later; for now, let's focus on the initially proposed architecture.
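To make the encoder-decoder split a bit more concrete, here is a minimal sketch (not the paper's original code) that instantiates such a stack with PyTorch's built-in torch.nn.Transformer module, using the base hyperparameters reported in the paper; the toy input tensors are just placeholders for embedded token sequences:

```python
import torch
import torch.nn as nn

# A minimal sketch of the original encoder-decoder setup, assuming PyTorch.
# Hyperparameters follow the base model from the paper:
# 6 encoder layers, 6 decoder layers, d_model=512, 8 attention heads.
model = nn.Transformer(
    d_model=512,
    nhead=8,
    num_encoder_layers=6,
    num_decoder_layers=6,
    dim_feedforward=2048,
    batch_first=True,
)

# Dummy source and target sequences of shape (batch, sequence length, d_model).
# In practice these would come from token embeddings plus positional encodings.
src = torch.rand(2, 10, 512)  # e.g. the sentence to translate
tgt = torch.rand(2, 7, 512)   # e.g. the translation produced so far

out = model(src, tgt)
print(out.shape)  # torch.Size([2, 7, 512])
```

The encoder consumes `src` and produces token representations; the decoder attends to those representations while processing `tgt` to produce the output used for prediction.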
If we further examine the encoder-decoder architecture, we can identify the following building blocks:
Finally, thanks to these subword-based tokenizers, the only problem we can run into is not having a character in our vocabulary. Such a character would then be mapped to the special '[UNK]' token. The reason we cannot run into the problem of a given subword missing from the vocabulary is that every subword can always be decomposed into characters. Last but not least, there are two types of subwords:
- subwords that start a word, which are written as-is, and
- subwords that continue a word, which are prefixed with '##'.
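As an illustration, here is a small sketch, assuming the Hugging Face transformers library and the pre-trained 'bert-base-uncased' vocabulary, of how a WordPiece tokenizer splits a word it has never seen as a whole (the exact subwords depend on the vocabulary used):

```python
from transformers import BertTokenizer

# Load the WordPiece tokenizer that ships with BERT (assumes the Hugging Face
# `transformers` library and access to the 'bert-base-uncased' vocabulary).
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# A word that is not in the vocabulary as a whole gets split into subwords.
print(tokenizer.tokenize("unaffable"))
# With this vocabulary the output looks something like: ['una', '##ffa', '##ble']

# A character the vocabulary does not contain at all falls back to '[UNK]'.
print(tokenizer.tokenize("☃"))
# Likely ['[UNK]'], since this character is not in the vocabulary.
```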
As can be seen above, all subwords start with '##' except for the first one in the word.