The document provides an overview of recent advances in natural language processing (NLP) and large language models (LLMs). It traces the key milestones and technological developments that have driven progress in the field over the past two decades: neural networks for language modeling in 2001, word embeddings in 2013, the attention mechanism in 2015, and the transformer architecture in 2017. In recent years, massive LLMs such as GPT-3 have achieved strong performance across a wide range of NLP tasks through techniques like self-supervised pre-training and ever-larger model sizes. These advances have given rise to new tooling ecosystems and commercial applications for generative NLP, although discriminative tasks still tend to rely on smaller, more efficient models when labeled data is available.