Large Transformer language models continue to benefit from larger architectures and growing amounts of data. Since 2018, large language models such as BERT and its variants, GPT-2 and GPT-3 have shown that a wide range of tasks can be performed with few-shot learning. Models such as Microsoft and NVIDIA’s Megatron-Turing Natural Language Generation, with 530 billion parameters; the full version of the Generalist Language Model (GLaM), with 1.2 trillion parameters; LaMDA, or Language Model for Dialogue Applications, with 137 billion parameters; and Gopher, with 280 billion parameters, have stood out in recent years for their sheer size. Has the drive to build ever bigger models become a blind race?
A new paper published by Google AI disagrees with this assumption. The study’s results reiterate that larger models are more sample-efficient than smaller ones because they apply transfer learning better. With that, the team announced PaLM, the Pathways Language Model, a decoder-only Transformer model with 540 billion parameters.
In October last year, Google’s research team presented Pathways, a new AI architecture intended to work more like a human brain. Traditionally, an AI model is trained to specialize in a single task. With Pathways, a single AI model can be generalized across a million different tasks. Pathways also allows the model to learn new tasks more quickly. Most models handle only one modality: they process images, text or speech. Pathways would work in such a way that a single AI model could perform tasks across all modalities.
Unlike “dense” models, which normally use their entire neural network to accomplish a task, the Pathways architecture learns to route each task only to the parts of the network that are relevant to it. This makes the model more energy efficient and leaves it more capacity to learn new tasks.
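The routing idea can be illustrated with a toy example. The sketch below is only an assumption-laden illustration of top-1 gating over a handful of sub-networks, not Google's Pathways implementation; the names `route`, `gate_weights` and `experts` are hypothetical.

```python
def route(x, gate_weights, experts):
    """Send input x to the single best-scoring expert (top-1 gating).

    Hypothetical illustration only: gate_weights holds one weight row per
    expert, and experts is a list of plain Python callables.
    """
    # Score every expert with a dot product between its gate row and x.
    scores = [sum(w * xi for w, xi in zip(row, x)) for row in gate_weights]
    # Pick the index of the highest-scoring expert.
    best = max(range(len(scores)), key=scores.__getitem__)
    # Only the chosen expert runs; the others are skipped entirely.
    return best, experts[best](x)


# Two toy "experts" standing in for separate parts of a network.
experts = [
    lambda x: [2 * v for v in x],   # expert 0 doubles its input
    lambda x: [v + 1 for v in x],   # expert 1 adds one to its input
]
gate_weights = [[1, 0], [0, 1]]

chosen, output = route([1, 3], gate_weights, experts)
```

Because only the selected expert is evaluated, the cost of a forward pass depends on the size of one expert rather than on the size of the whole network, which is the intuition behind the efficiency claim above.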
PaLM was trained on hundreds of language understanding and generation tasks using the Pathways system. It is also the first time the Pathways system has been used to train a large-scale model, with training scaled to 6,144 chips. This is the largest TPU-based configuration used for training to date. Whereas previous large language models such as GLaM and LaMDA were trained on a single TPU v3 Pod, PaLM used data parallelism to train across two Cloud TPU v4 Pods.
The model was trained on English and multilingual datasets that included web documents, books, Wikipedia, GitHub code and conversations. In addition, the team used a “lossless” vocabulary that preserves all whitespace, splits out-of-vocabulary Unicode characters into bytes, and splits numbers into individual digits.
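The three properties of such a vocabulary can be sketched at the character level. This is a simplified illustration, not PaLM's actual tokenizer; the function names, the `<0x..>` byte notation and the `known_chars` set are all assumptions made for the example.

```python
def lossless_pieces(text, known_chars):
    """Split text into pieces without discarding any information.

    Hypothetical sketch: known_chars stands in for the characters a
    vocabulary covers; a real tokenizer works on subwords, not characters.
    """
    pieces = []
    for ch in text:
        if ch.isdigit() or ch.isspace() or ch in known_chars:
            # Digits are emitted one at a time; whitespace is kept verbatim.
            pieces.append(ch)
        else:
            # Out-of-vocabulary character: fall back to its UTF-8 bytes.
            pieces.extend(f"<0x{b:02X}>" for b in ch.encode("utf-8"))
    return pieces


def restore(pieces):
    """Rebuild the exact original text from the pieces."""
    out = bytearray()
    for p in pieces:
        if len(p) == 6 and p.startswith("<0x") and p.endswith(">"):
            out.append(int(p[3:5], 16))      # byte-fallback piece
        else:
            out.extend(p.encode("utf-8"))    # ordinary single character
    return out.decode("utf-8")


known = set("abcdefghijklmnopqrstuvwxyz=")
sample = "x = 42\n\té"
pieces = lossless_pieces(sample, known)
```

The round trip `restore(lossless_pieces(text, known))` returns the input unchanged, which is what "lossless" means here: indentation in code survives, and no character is replaced by an unknown-token placeholder.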
Language understanding and generation: PaLM was evaluated on 29 of the most widely used English NLP tasks and outperformed its predecessors on 28 of them. These tasks included sentence completion, Winograd-style tasks, natural language reasoning, reading comprehension and inference tasks. PaLM also performed well on multilingual NLP benchmarks even though only 22% of its training corpus is non-English.
The study found that the model’s performance as a function of scale follows log-linear behavior similar to that of previous models, suggesting that the gains from scale have not yet plateaued. The model was also compared with Gopher and Chinchilla. PaLM demonstrated impressive contextual understanding; it was even able to guess the name of a movie from emojis.
Reasoning: The model used chain-of-thought prompting to solve reasoning problems involving common sense and multi-step arithmetic. PaLM was evaluated on three arithmetic datasets and two commonsense reasoning datasets. In arithmetic, it solved 58% of the problems in GSM8K, a challenging grade-school-level math dataset, using 8-shot prompting, surpassing the previous best score of 55%, which was achieved with GPT-3.
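In few-shot chain-of-thought prompting, the model is shown a handful of worked examples whose answers spell out the intermediate steps, followed by the new question. A minimal sketch of how such a prompt might be assembled follows; the `Q:`/`A:` layout, the function name and the exemplar text are illustrative assumptions, not the exact prompts used in the study.

```python
def build_cot_prompt(exemplars, question):
    """Assemble a few-shot chain-of-thought prompt as plain text.

    Each exemplar is a dict with 'question', 'reasoning' and 'answer'
    keys (a hypothetical structure chosen for this sketch).
    """
    blocks = []
    for ex in exemplars:
        # The worked reasoning comes before the final answer, so the model
        # is nudged to reason step by step on the new question as well.
        blocks.append(
            f"Q: {ex['question']}\n"
            f"A: {ex['reasoning']} The answer is {ex['answer']}."
        )
    # The new question ends with an open "A:" for the model to complete.
    blocks.append(f"Q: {question}\nA:")
    return "\n\n".join(blocks)


exemplars = [
    {
        "question": "Tom has 3 boxes with 4 pens each. How many pens does he have?",
        "reasoning": "There are 3 boxes and each holds 4 pens, so 3 * 4 = 12.",
        "answer": "12",
    },
]
prompt = build_cot_prompt(exemplars, "A shelf holds 5 rows of 6 books. How many books?")
```

An 8-shot prompt, as in the GSM8K result above, would simply pass eight such exemplars instead of one.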
PaLM could also explain an entirely original joke, a task that requires complex multi-step logical inference and deep language understanding.
Code generation: PaLM, whose pre-training dataset contained only 5% code, generalized well to writing code with few-shot learning. Its performance was comparable to OpenAI’s Codex even though it used 50 times less Python code in its training dataset.
PaLM was also fine-tuned on a Python-only dataset; the resulting model is known as PaLM-Coder. On a code repair task called DeepFix, PaLM-Coder was able to fix initially broken C programs with a success rate of 82.1%, surpassing the previous benchmark of 71.7%. This suggests that PaLM-Coder may eventually be able to solve more complex coding problems.
PaLM combined its parallelism strategy with a reformulation of the Transformer block that allows the attention and feedforward layers to be computed in parallel. This enabled speedups from TPU compiler optimizations, and PaLM achieved a training efficiency of 57.8% hardware FLOPs utilization, the highest yet achieved by a large language model at this scale.
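The difference between the standard and the reformulated block is easiest to see side by side. The sketch below shows only the data flow, under the assumption that `attn`, `mlp` and `norm` are stand-in callables for the attention layer, the feedforward layer and layer normalization; it is not PaLM's actual code.

```python
def serial_block(x, attn, mlp, norm):
    """Standard formulation: the feedforward layer waits for attention."""
    h = x + attn(norm(x))
    return h + mlp(norm(h))


def parallel_block(x, attn, mlp, norm):
    """Parallel formulation: both layers read the same normalized input,
    so neither depends on the other's output."""
    n = norm(x)
    return x + attn(n) + mlp(n)


# Toy scalar stand-ins, chosen only to make the data flow visible.
norm = lambda v: v
attn = lambda v: 2 * v
mlp = lambda v: v + 1

serial_out = serial_block(1, attn, mlp, norm)      # h = 3, then 3 + 4
parallel_out = parallel_block(1, attn, mlp, norm)  # 1 + 2 + 2
```

The two formulations are not numerically identical, as the toy values show. The design choice is that in the parallel form the attention and feedforward computations share one input and have no dependency on each other, which gives the compiler room to fuse or overlap them.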
PaLM’s breakthrough performance suggests that, with ethical considerations kept in mind, it could be a first step toward building more capable models with greater scaling capabilities using the Pathways system.