LLM GenAI: the new processing unit
Jacques de Vos in collaboration with GenAI
This talk is my (subjective) take
on the question:
What does the LLM GenAI
revolution mean for software
application developers, like
myself?
(as opposed to data scientists or the general public)
Agenda
1. Computing: past & future
2. The LLM Breakthrough
3. LLMs: the new processing unit
4. The language of LLMs
5. Programming with LLMs
6. Additional ideas (by LLM)
Computing eras:
past & future
● Each breakthrough adds new development tools
● Technologies build upon each other:
○ Previous tools remain valuable
○ New capabilities enhance, don't replace
Mainstream Computing Eras
1. Transistor-Mainframe Era - mid 1960s
● Large-scale batch processing of instructions on data
● Killer App: The Database
2. CPU/Microprocessor Era - early 1970s
Early - Embedded CPU Era - early 1970s
● Small CPUs executing instructions & I/O
● Killer apps: Specialised Hardware Controllers
Late - PC Era - late 1970s
● Innovation: Personal computing with friendly UI/UX
● Killer App: The Spreadsheet (1979)
3. Internet Era - mid 1990s
Early - WWW Era - mid 1990s
● Breakthrough: Global interconnection of computers
● Killer App: Search engines
Late - Mobile Era - late 2000s
● Expansion: Always-on connectivity via smartphones
● Killer App: Social media
4. GenAI LLM Era - 2023
● Multi-task processor capable of zero or few-shot learning
● Killer App: Personal Assistants
New LLM Era Dev Opportunities
Like other eras, this tech creates opportunities for developers
at many levels, e.g.:
● Industry specific applications
○ Developer IDE AI like Cursor or GitHub Copilot
○ CoCounsel and other legal tools
○ Midjourney for visual designers
○ Many industries will get one - many
opportunities
● General assistant applications
○ ChatGPT & others - plugin development
● Textual/voice interfaces can become a mainstream
application UI (finally, Siri and Alexa can become proper
UIs)
● Use capabilities in any application - a new normal
● Yes, robots and all that hyped jazz as well
Dev Tools & Ecosystems
Many new tools:
● Dev environments (IDEs) like Cursor or GitHub Copilot
● Local Operating System SDKs (hopefully in next few years!)
○ O/S Apple Intelligence: Writing, Image, Siri, etc
○ Becomes just a simple capability to use in app
○ O/S will provide access to raw local LLM soon??
● Open Source Models & SDKs
○ Can run locally or in cloud
○ Big or small, ability to compose, fine tune etc
○ Watch this space!
● Industry specific local SDKs and cloud APIs:
○ A wide ecosystem of speciality APIs & SDKs that
require accuracy or speed in a domain. Use and build!
● Powerful Cloud Models via API (paid)
○ Terabyte-scale, with massive knowledge and very smart
○ Ability to prompt, fine-tune, or build pipelines on top, etc.
○ !Incentivised to be overhyped by big players!
● Many frameworks that help you compose systems
The Transformer
Breakthrough
What just happened?
What made deep learning
suddenly work?
Great ideas!
Early 1980s & before
● The idea of machine learning
with a neural network was
developed and refined.
● Big expectations set!
● The 2024 Nobel Prize in Physics
was awarded to John J.
Hopfield and Geoffrey E.
Hinton for their 1980s contributions to
neural nets!
https://www.nobelprize.org/prizes/physics/2024/press-release/
The specialists
1980s until 2017
● It worked, but only for
specialised neural nets -
trained by specialists with lots
of data.
● 2010s - Google showed what is
possible - eg with photo
tagging. But who else has that
amount of specialised data and
processing?
● Disappointingly narrow
success! Couldn't even pass the
Turing test.
The breakthrough!
2017: Transformer Generalist LLM
● A paper by 8 Google engineers,
"Attention Is All You Need",
introduced the Transformer: a
neural net architecture
● State of the art translations with
minimal training
● Generalized well to other
tasks
● It was a big simplification and
took much less time to train
● A “Fleming-Penicillin” moment!
● Architecture matters less after
the transformer
“We propose a new simple network architecture,
the Transformer,
based solely on attention mechanisms, dispensing
with recurrence (RNN) and convolutions entirely.”
GenAI = quick learner
2020: OpenAI: "Language Models are
Few-Shot Learners"
● With GPT-3 (pre-ChatGPT)
● Few-shot learning means a few
examples rather than millions of
specialised samples. Zero-shot
= just a question. (See the prompt
sketch after this slide.)
● Showed that an LLM can be used for
many language tasks, without
needing specialised skills,
training, or data science.
● I.e. it can be used by anyone to
perform many tasks: GenAI
Understated conclusion:
“…large language models may be an
important ingredient in the development of
adaptable, general language systems.”
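For illustration (my own sketch, not from the paper): the same classification task as a zero-shot and as a few-shot prompt. The few-shot version just adds a couple of worked examples inside the prompt; the review texts are made up.

```typescript
// Zero-shot vs few-shot: the only difference is a handful of examples in the prompt.

const zeroShotPrompt =
  'Classify the sentiment of this review as positive or negative:\n' +
  '"The update broke my favourite feature."';

const fewShotPrompt = `
Classify the sentiment of each review as positive or negative.

Review: "Setup took two minutes and it just worked." -> positive
Review: "Support never replied to my ticket." -> negative
Review: "The update broke my favourite feature." ->`;
```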
The scaling law mystery
2020: the same OpenAI paper
● The same paper hinted at
something unexpected: the
more data it was trained on,
the smarter (more accurate) it
got.
● It just kept on going (roughly
linearly on a log-log scale) and
didn't flatten out as expected
● This was a complete surprise!
● This law is still holding - that is
why we are seeing more and more
data and processing. Current
models cost more than $100M to
train.
The generalist (not just text)
2020 and beyond
● Top new LLMs (GPT-4o / Llama 3.2)
are multimodal and can handle text,
images, and audio at once.
● Transformers even excel in other
fields, like chemistry.
● The 2024 Nobel Prize in Chemistry went to
Demis Hassabis and John Jumper
(Google DeepMind / AlphaFold) for
predicting proteins' structures
https://www.nobelprize.org/prizes/chemistry/2024/press-release/
(my final-year project in early 2000 was on
protein structure prediction - it sucked
then)
Evolving fast
2025 and beyond
● The potential and the underlying
workings seem to have shown themselves
now.
● Models are becoming smarter, smaller,
faster, and locally available on devices.
This will probably continue for years.
● Ecosystems are becoming more and more
developer-friendly, e.g. Ollama,
Transformers.js
● Prompt engineering and AI
applications are still a "dark art" rather
than structured engineering, but there
are signs that it is maturing, e.g. DSPy
https://github.com/stanfordnlp/dspy
● Great time to start learning!
GenAI LLMs:
a new processing unit
Traditional CPUs and GenAI LLMs both
have:
● A processor that follows instructions
● A temporary memory with
instructions+data
● Output data feeds back into memory
and alters the flow
[Diagram: PROCESSOR ↔ MEMORY with instructions and data]
[Diagram: the LLM inference loop]
● CONTEXT (MEMORY): instructions + data, just one long text sequence - THAT'S ALL. Context = prompt + generated response so far + last generated token (max = context window). *Will go into detail next slide
● PROMPT (INPUT): 1. the string is first parsed into tokens, then 2. each token & its position becomes an embedding vector (to capture meaning) - the "compiled" embeddings. **Will go into more detail
● LLM INFERENCE LOOP: predict the most likely next token based on the token sequence, by running it through the model (the Transformer neural network).
● GENERATED TOKEN → RESPONSE (OUTPUT): the generated text, also called the "completion".
● END INFERENCE if the end token is generated.
● CHAT HISTORY: the response is added to saved history on file/cloud - more like a file than part of processing memory.
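A minimal code sketch of the loop above. The tokenizer and model calls (encode, decode, predictNextToken) are hypothetical stand-ins for a real implementation (e.g. Transformers.js or an API):

```typescript
// Hypothetical sketch of the LLM inference loop described above.
declare function encode(text: string): number[];              // string -> token ids
declare function decode(tokens: number[]): string;            // token ids -> string
declare function predictNextToken(context: number[]): number; // one forward pass through the model

const END_TOKEN = 0;          // assumption: id of the end-of-sequence token
const CONTEXT_WINDOW = 4096;  // max tokens the model can attend to

function runInference(prompt: string): string {
  let context = encode(prompt);             // CONTEXT (MEMORY): one long token sequence
  const generated: number[] = [];

  while (context.length < CONTEXT_WINDOW) {
    const next = predictNextToken(context); // predict the most likely next token
    if (next === END_TOKEN) break;          // END INFERENCE if the end token is generated
    generated.push(next);
    context = [...context, next];           // feed the generated token back into the context
  }
  return decode(generated);                 // RESPONSE (OUTPUT), a.k.a. the "completion"
}
```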
Context-aware Prompting: the trick to steering an LLM
(the real "Prompt Engineering")
[Diagram: the PROMPT (INITIAL CONTEXT) fed into the LLM inference loop is just 1 long string - just "gooi" it in the prompt. It is assembled from:]
● PREFERENCES: personality, conciseness, style, summary of past interactions, etc.
● "CHAT" HISTORY: previous prompts + responses in the current chat.
● SEARCHED DOCS/SUMMARY: instructions + examples/data; can be machine generated or anything. Retrieval Augmented Generation (RAG) just means "add some search results" - you can build your own RAG system to make an LLM specialised. Fed by e.g. a customer semantic embedding vector database, Google search top results, or a tool adapter that can add/change content.
● "USER" PROMPT: the final part is the highest priority.
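A minimal sketch of assembling that one long string. The searchDocs helper is hypothetical - it stands in for whatever retrieval you use (semantic vector DB, Google search, a tool adapter):

```typescript
// Context-aware prompting: the initial context is just one long string assembled
// from preferences, chat history, retrieved docs (RAG) and the user prompt.

interface ChatTurn { role: 'user' | 'assistant'; text: string; }

// Hypothetical retrieval helper - swap in your own vector DB, search API, etc.
declare function searchDocs(query: string): Promise<string[]>;

async function buildPrompt(
  preferences: string,        // personality, conciseness, style, past-interaction summary
  chatHistory: ChatTurn[],    // previous prompts + responses in the current chat
  userPrompt: string,
): Promise<string> {
  const docs = await searchDocs(userPrompt);   // RAG = "add some search results"
  return [
    `PREFERENCES:\n${preferences}`,
    `CHAT HISTORY:\n${chatHistory.map(t => `${t.role}: ${t.text}`).join('\n')}`,
    `RELEVANT DOCS:\n${docs.join('\n---\n')}`,
    `USER PROMPT (answer this):\n${userPrompt}`, // final part is the highest priority
  ].join('\n\n');
}
```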
Trad CPU vs LLM as processor

Trad CPU:
● Memory: RAM with programs and data (text, vids..) - e.g. 16GB RAM
● Instructions: programming-language programs
● Machine language: assembly with a fixed instruction set (e.g. x86/ARM)
● Process: executes the program - computes each operation from the program sequentially
● Completes when: it reaches the end of the program; an operation can jump/loop to avoid completion

LLM:
● Memory: context with instructions, examples and other context (text, vids..) - e.g. a 4,096 token context window
● Instructions: human-language instructions
● Machine language: token vector embedding sequences with limitless potential instructions
● Process: runs inference - generates the next token based on the input context, appends the generated token to the next input context, and generates the next token
● Completes when: inference ends after it predicts the task/answer has been completed (i.e. after creating an end token); it can't jump/loop (although you can loop by piping inferences, like o1)
Performance Benchmarking
Speed:
● CPU is instructions/second (MIPS)
● LLM is tokens/second
● Also, LLM stream start (seconds) -
it takes a while to "warm up" (see the sketch below)
Accuracy:
● Key for LLM, N/A for CPU
● Measured in % like a school report
● Aspects: Knowledge, Reasoning,
Math, Coding, Vision
https://klu.ai/llm-leaderboard
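A rough tokens/second measurement sketch against a local Ollama server (an assumption on my part: Ollama running on its default port 11434 with the llama3.2 model pulled; eval_count/eval_duration are fields in Ollama's /api/generate response):

```typescript
// Rough tokens/second measurement against a local Ollama server.
// Assumes `ollama serve` is running and `ollama pull llama3.2` was done.

async function benchmark(prompt: string): Promise<void> {
  const started = Date.now();
  const res = await fetch('http://localhost:11434/api/generate', {
    method: 'POST',
    body: JSON.stringify({ model: 'llama3.2', prompt, stream: false }),
  });
  const data = await res.json();
  // eval_count = generated tokens, eval_duration = generation time in nanoseconds
  const tokensPerSecond = data.eval_count / (data.eval_duration / 1e9);
  console.log(`Total time: ${(Date.now() - started) / 1000}s`);
  console.log(`Tokens/second: ${tokensPerSecond.toFixed(1)}`);
}

benchmark('Explain what a token is in one sentence.').catch(console.error);
```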
The 2 Ways To Program an LLM

1. Prompt Engineering
● Mechanism: simply pass data and instructions
● Benefits: general-purpose tasks; stays up to date

2. LLM Fine Tuning
● Mechanism: change some of the weights of the neural network, given a few hundred examples (not millions!)
● Benefits: runs cheaper and faster at scale; can potentially embed more info (not necessarily)
The language of a language model
https://platform.openai.com/tokenizer
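The same thing in code, for Node, using Transformers.js (a minimal sketch; the @huggingface/transformers package name and the Xenova/gpt-4 tokenizer files on the Hugging Face Hub are assumptions):

```typescript
// Minimal tokenization sketch with Transformers.js in Node.
// npm install @huggingface/transformers
import { AutoTokenizer } from '@huggingface/transformers';

const tokenizer = await AutoTokenizer.from_pretrained('Xenova/gpt-4');

const text = 'The language of a language model';
const ids = tokenizer.encode(text);                // an array of token ids
console.log(`${ids.length} tokens:`, ids);
console.log('round trip:', tokenizer.decode(ids)); // back to the original string
```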
What is a Token Vector Embedding?
1. Every possible token makes up the token vocabulary, and each token has an index, e.g.
“cat” could have index 12345.
2. There is a table for the vocabulary, where each index maps to a vector, e.g. 12345: [0.99,
0.01, 0.75, 0.50, 0.35].
3. The vectors are called “embeddings” since the “meaning” and context of the token are
captured in them.
4. E.g. the first dimension in the vector, 0.99, could stand for “animalness”, the second
dimension, 0.01, for “verbness” - we could define it this way.
5. But actual vector embeddings are created through training (separate from the LLM), so
the components don’t have nice direct meanings like that.
6. E.g. OpenAI’s embeddings have 1536 dimensions! We don’t know exactly what they mean.
7. Very useful in your own RAG - or querying for meaning (see the sketch below).
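A minimal sketch of point 7 - using embeddings yourself for RAG-style "querying for meaning". It assumes Transformers.js and the small Xenova/all-MiniLM-L6-v2 sentence-embedding model; the example sentences are made up.

```typescript
// Embedding + cosine-similarity sketch with Transformers.js.
// npm install @huggingface/transformers
import { pipeline } from '@huggingface/transformers';

// Small sentence-embedding model (384 dimensions), runs locally.
const embedder = await pipeline('feature-extraction', 'Xenova/all-MiniLM-L6-v2');

async function embedding(text: string): Promise<number[]> {
  const output = await embedder(text, { pooling: 'mean', normalize: true });
  return Array.from(output.data as Float32Array);
}

// Cosine similarity; with normalized vectors this is just the dot product.
const similarity = (a: number[], b: number[]) =>
  a.reduce((sum, v, i) => sum + v * b[i], 0);

const cat = await embedding('The cat sat on the mat');
const kitten = await embedding('A kitten rests on a rug');
const invoice = await embedding('Quarterly invoice for cloud hosting');

console.log('cat vs kitten :', similarity(cat, kitten).toFixed(3));  // similar meaning -> higher
console.log('cat vs invoice:', similarity(cat, invoice).toFixed(3)); // unrelated -> lower
```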
Developing with LLMs
Examples:
● RAG - show WordPress.
● Show a few Ollama examples
● Show Hugging Face Transformers.js
Practical
Get going!
● Install Ollama to run inference in the terminal
https://ollama.com/download/
● Build your own template command like “sed” (see the sketch below)
Dig deeper:
● Use Hugging Face (GitHub for models) and
Transformers.js to run models and play around (use
NodeJS instead of the Python stuff)
https://huggingface.co/docs/transformers.js/index
● Specifically, run the llama-3.2-webgpu example
● https://github.com/huggingface/transformers.js-examples/tree/main/llama-3.2-node
● https://github.com/huggingface/transformers.js-examples/tree/main/llama-3.2-webgpu
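As a starting point for the “template command like sed” idea above, a sketch of my own (assuming a local Ollama server with the llama3.2 model) that pipes stdin through an instruction and prints the result:

```typescript
// transform.mts - a sed-like template command built on a local Ollama server.
// Usage: cat notes.txt | npx tsx transform.mts "rewrite this as bullet points"
// Assumes Ollama is running on localhost:11434 and `ollama pull llama3.2` was done.

const instruction = process.argv[2] ?? 'Fix spelling and grammar';

// Read everything piped in on stdin.
let input = '';
for await (const chunk of process.stdin) input += chunk;

const res = await fetch('http://localhost:11434/api/generate', {
  method: 'POST',
  body: JSON.stringify({
    model: 'llama3.2',
    prompt: `${instruction}\n\nINPUT:\n${input}\n\nOUTPUT:`,
    stream: false,
  }),
});
const data = await res.json();          // Ollama returns the completion in `response`
process.stdout.write(data.response + '\n');
```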
