LLM GenAI: the new processing unit
Jacques de Vos in collaboration with GenAI
This talk is my (subjective) take
on the question:
What does the LLM GenAI
revolution mean for software
application developers, like
myself?
(as opposed to data scientists or the general public)
Agenda
1. Computing: past & future
2. The LLM Breakthrough
3. LLMs: the new processing unit
4. The language of LLMs
5. Programming with LLMs
6. Additional ideas (by LLM)
Computing eras:
past & future
● Each breakthrough adds new development tools
● Technologies build upon each other:
○ Previous tools remain valuable
○ New capabilities enhance, don't replace
Mainstream Computing Eras
1. Transistor-Mainframe Era - mid 1960s
● Large-scale batch processing of instructions on data
● Killer App: The Database
2. CPU/Microprocessor Era - early 1970s
Early - Embedded CPU Era - early 1970s
● Small CPUs executing instructions & I/O
● Killer apps: Specialised Hardware Controllers
Late - PC Era - late 1970s
● Innovation: Personal computing with friendly UI/UX
● Killer App: The Spreadsheet (1979)
3. Internet Era - mid 1990s
Early - WWW Era - mid 1990s
● Breakthrough: Global interconnection of computers
● Killer App: Search engines
Late - Mobile Era - late 2000s
● Expansion: Always-on connectivity via smartphones
● Killer App: Social media
4. GenAI LLM Era - 2023
● Multi-task processor capable of zero or few-shot learning
● Killer App: Personal Assistants
New LLM Era Dev Opportunities
Like other eras, this tech creates opportunities for developers
at many levels, e.g.:
● Industry specific applications
○ Developer IDE AI like Cursor or GitHub Copilot
○ CoCounsel and other legal tools
○ Midjourney for visual designers
○ Many industries will get one - many
opportunities
● General assistant applications
○ ChatGPT & others - plugin development
● Textual/voice interfaces can become a mainstream
application UI (finally, Siri and Alexa can become proper
UIs)
● Use capabilities in any application - a new normal
● Yes, robots and all that hyped jazz as well
Dev Tools & Ecosystems
Many new tools:
● Dev environments (IDEs) like Cursor or GitHub Copilot
● Local Operating System SDKs (hopefully in next few years!)
○ O/S Apple Intelligence: Writing, Image, Siri, etc
○ Becomes just a simple capability to use in app
○ O/S will provide access to raw local LLM soon??
● Open Source Models & SDKs
○ Can run locally or in cloud
○ Big or small, ability to compose, fine tune etc
○ Watch this space!
● Industry specific local SDKs and cloud APIs:
○ A wide ecosystem of speciality APIs & SDKs that
require accuracy or speed in a domain. Use and build!
● Powerful Cloud Models via API (paid)
○ Terabyte-scale, with massive knowledge and very smart
○ Ability to prompt, fine-tune, or build pipelines on top, etc.
○ !Incentivised to be overhyped by big players!
● Many frameworks that help you compose systems
The Transformer
Breakthrough
What just happened?
What made deep learning
suddenly work?
Great ideas!
Early 1980s & before
● The idea of machine learning
with a neural network was
developed and refined.
● Big expectations set!
● The 2024 Nobel Prize in Physics
was awarded to John J.
Hopfield and Geoffrey E.
Hinton for their 1980s contributions to
neural nets!
https://www.nobelprize.org/prizes/physics/2024/press-release/
The specialists
1980s until 2017
● It worked, but only for
specialised neural nets -
trained by specialists with lots
of data.
● 2010s - Google showed what is
possible - eg with photo
tagging. But who else has that
amount of specialised data and
processing?
● Disappointingly narrow
success! Couldn't even pass the
Turing test.
The breakthrough!
2017: Transformer Generalist LLM
● A paper by 8 Google engineers,
"Attention Is All You Need",
introduced the Transformer: a
neural net architecture
● State of the art translations with
minimal training
● Generalized well to other
tasks
● It was a big simplification and
took much less time to train
● A “Fleming-Penicillin” moment!
● Architecture matters less after
the transformer
“We propose a new simple network architecture,
the Transformer,
based solely on attention mechanisms, dispensing
with recurrence (RNN) and convolutions entirely.”
GenAI = quick learner
2020: OpenAI: "Language Models are
Few-Shot Learners"
● With GPT-3 (pre-ChatGPT)
● Few-shot learning means a few
examples rather than millions of
specialised samples. Zero-shot
= just a question. (See the prompt
sketch after this slide.)
● Showed that an LLM can be used for
many language tasks, without
needing specialised skills,
training, or data science.
● I.e. it can be used by anyone to
perform many tasks: GenAI
Understated conclusion:
“…large language models may be an
important ingredient in the development of
adaptable, general language systems.”
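For illustration (my own sketch, not from the paper): the same classification task as a zero-shot and as a few-shot prompt. The few-shot version just adds a couple of worked examples inside the prompt; the review texts are made up.

```typescript
// Zero-shot vs few-shot: the only difference is a handful of examples in the prompt.

const zeroShotPrompt =
  'Classify the sentiment of this review as positive or negative:\n' +
  '"The update broke my favourite feature."';

const fewShotPrompt = `
Classify the sentiment of each review as positive or negative.

Review: "Setup took two minutes and it just worked." -> positive
Review: "Support never replied to my ticket." -> negative
Review: "The update broke my favourite feature." ->`;
```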
The scaling law mystery
2020: the same OpenAI paper
● The same paper hinted at
something unexpected: the
more data it was trained on,
the smarter (more accurate) it
got.
● It just kept on going (roughly
linearly on a log-log scale) and
didn't flatten out as expected
● This was a complete surprise!
● This law is still holding - that is
why we are seeing more and more
data and processing. Current
models cost more than $100M to
train.
The generalist (not just text)
2020 and beyond
● Top new LLMs (GPT-4o / Llama 3.2)
are multimodal and can handle text,
images, and audio at once.
● Transformers even excel in other
fields, like chemistry.
● The 2024 Nobel Prize in Chemistry went to
Demis Hassabis and John Jumper
(Google DeepMind / AlphaFold) for
predicting proteins' structures
https://www.nobelprize.org/prizes/chemistry/2024/press-release/
(my final-year project in early 2000 was on
protein structure prediction - it sucked
then)
Evolving fast
2025 and beyond
● The potential and the underlying
workings seem to have shown themselves
now.
● Models are becoming smarter, smaller,
faster, and locally available on devices.
This will probably continue for years.
● Ecosystems are becoming more and more
developer-friendly, e.g. Ollama,
Transformers.js
● Prompt engineering and AI
applications are still a "dark art" rather
than structured engineering, but there
are signs that it is maturing, e.g. DSPy
https://github.com/stanfordnlp/dspy
● Great time to start learning!
GenAI LLMs:
a new processing unit
Traditional CPUs and GenAI LLMs both
have:
● A processor that follows instructions
● A temporary memory with
instructions+data
● Output data feeds back into memory
and alters the flow
[Diagram: PROCESSOR ↔ MEMORY with instructions and data]
[Diagram: the LLM inference loop]
● CONTEXT (MEMORY): instructions + data, just one long text sequence - THAT'S ALL. Context = prompt + generated response so far + last generated token (max = context window). *Will go into detail next slide
● PROMPT (INPUT): 1. the string is first parsed into tokens, then 2. each token & its position becomes an embedding vector (to capture meaning) - the "compiled" embeddings. **Will go into more detail
● LLM INFERENCE LOOP: predict the most likely next token based on the token sequence, by running it through the model (the Transformer neural network).
● GENERATED TOKEN → RESPONSE (OUTPUT): the generated text, also called the "completion".
● END INFERENCE if the end token is generated.
● CHAT HISTORY: the response is added to saved history on file/cloud - more like a file than part of processing memory.
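A minimal code sketch of the loop above. The tokenizer and model calls (encode, decode, predictNextToken) are hypothetical stand-ins for a real implementation (e.g. Transformers.js or an API):

```typescript
// Hypothetical sketch of the LLM inference loop described above.
declare function encode(text: string): number[];              // string -> token ids
declare function decode(tokens: number[]): string;            // token ids -> string
declare function predictNextToken(context: number[]): number; // one forward pass through the model

const END_TOKEN = 0;          // assumption: id of the end-of-sequence token
const CONTEXT_WINDOW = 4096;  // max tokens the model can attend to

function runInference(prompt: string): string {
  let context = encode(prompt);             // CONTEXT (MEMORY): one long token sequence
  const generated: number[] = [];

  while (context.length < CONTEXT_WINDOW) {
    const next = predictNextToken(context); // predict the most likely next token
    if (next === END_TOKEN) break;          // END INFERENCE if the end token is generated
    generated.push(next);
    context = [...context, next];           // feed the generated token back into the context
  }
  return decode(generated);                 // RESPONSE (OUTPUT), a.k.a. the "completion"
}
```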
Context-aware Prompting: the trick to steering an LLM
(the real "Prompt Engineering")
[Diagram: the PROMPT (INITIAL CONTEXT) fed into the LLM inference loop is just 1 long string - just "gooi" it in the prompt. It is assembled from:]
● PREFERENCES: personality, conciseness, style, summary of past interactions, etc.
● "CHAT" HISTORY: previous prompts + responses in the current chat.
● SEARCHED DOCS/SUMMARY: instructions + examples/data; can be machine generated or anything. Retrieval Augmented Generation (RAG) just means "add some search results" - you can build your own RAG system to make an LLM specialised. Fed by e.g. a customer semantic embedding vector database, Google search top results, or a tool adapter that can add/change content.
● "USER" PROMPT: the final part is the highest priority.
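A minimal sketch of assembling that one long string. The searchDocs helper is hypothetical - it stands in for whatever retrieval you use (semantic vector DB, Google search, a tool adapter):

```typescript
// Context-aware prompting: the initial context is just one long string assembled
// from preferences, chat history, retrieved docs (RAG) and the user prompt.

interface ChatTurn { role: 'user' | 'assistant'; text: string; }

// Hypothetical retrieval helper - swap in your own vector DB, search API, etc.
declare function searchDocs(query: string): Promise<string[]>;

async function buildPrompt(
  preferences: string,        // personality, conciseness, style, past-interaction summary
  chatHistory: ChatTurn[],    // previous prompts + responses in the current chat
  userPrompt: string,
): Promise<string> {
  const docs = await searchDocs(userPrompt);   // RAG = "add some search results"
  return [
    `PREFERENCES:\n${preferences}`,
    `CHAT HISTORY:\n${chatHistory.map(t => `${t.role}: ${t.text}`).join('\n')}`,
    `RELEVANT DOCS:\n${docs.join('\n---\n')}`,
    `USER PROMPT (answer this):\n${userPrompt}`, // final part is the highest priority
  ].join('\n\n');
}
```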
Trad CPU vs LLM as processor

Trad CPU:
● Memory: RAM with programs and data (text, vids..) - e.g. 16GB RAM
● Instructions: programming-language programs
● Machine language: assembly with a fixed instruction set (e.g. x86/ARM)
● Process: executes the program - computes each operation from the program sequentially
● Completes when: it reaches the end of the program; an operation can jump/loop to avoid completion

LLM:
● Memory: context with instructions, examples and other context (text, vids..) - e.g. a 4,096 token context window
● Instructions: human-language instructions
● Machine language: token vector embedding sequences with limitless potential instructions
● Process: runs inference - generates the next token based on the input context, appends the generated token to the next input context, and generates the next token
● Completes when: inference ends after it predicts the task/answer has been completed (i.e. after creating an end token); it can't jump/loop (although you can loop by piping inferences, like o1)
Performance Benchmarking
Speed:
● CPU is instructions/second (MIPS)
● LLM is tokens/second
● Also, LLM stream start (seconds) -
it takes a while to "warm up" (see the sketch below)
Accuracy:
● Key for LLM, N/A for CPU
● Measured in % like a school report
● Aspects: Knowledge, Reasoning,
Math, Coding, Vision
https://klu.ai/llm-leaderboard
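A rough tokens/second measurement sketch against a local Ollama server (an assumption on my part: Ollama running on its default port 11434 with the llama3.2 model pulled; eval_count/eval_duration are fields in Ollama's /api/generate response):

```typescript
// Rough tokens/second measurement against a local Ollama server.
// Assumes `ollama serve` is running and `ollama pull llama3.2` was done.

async function benchmark(prompt: string): Promise<void> {
  const started = Date.now();
  const res = await fetch('http://localhost:11434/api/generate', {
    method: 'POST',
    body: JSON.stringify({ model: 'llama3.2', prompt, stream: false }),
  });
  const data = await res.json();
  // eval_count = generated tokens, eval_duration = generation time in nanoseconds
  const tokensPerSecond = data.eval_count / (data.eval_duration / 1e9);
  console.log(`Total time: ${(Date.now() - started) / 1000}s`);
  console.log(`Tokens/second: ${tokensPerSecond.toFixed(1)}`);
}

benchmark('Explain what a token is in one sentence.').catch(console.error);
```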
The 2 Ways To Program an LLM

1. Prompt Engineering
● Mechanism: simply pass data and instructions
● Benefits: general-purpose tasks; stays up to date

2. LLM Fine Tuning
● Mechanism: change some of the weights of the neural network, given a few hundred examples (not millions!)
● Benefits: runs cheaper and faster at scale; can potentially embed more info (not necessarily)
The language of a language model
https://platform.openai.com/tokenizer
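The same thing in code, for Node, using Transformers.js (a minimal sketch; the @huggingface/transformers package name and the Xenova/gpt-4 tokenizer files on the Hugging Face Hub are assumptions):

```typescript
// Minimal tokenization sketch with Transformers.js in Node.
// npm install @huggingface/transformers
import { AutoTokenizer } from '@huggingface/transformers';

const tokenizer = await AutoTokenizer.from_pretrained('Xenova/gpt-4');

const text = 'The language of a language model';
const ids = tokenizer.encode(text);                // an array of token ids
console.log(`${ids.length} tokens:`, ids);
console.log('round trip:', tokenizer.decode(ids)); // back to the original string
```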
What is a Token Vector Embedding?
1. Every possible token makes up the token vocabulary, and each token has an index, e.g.
“cat” could have index 12345.
2. There is a table for the vocabulary, where each index maps to a vector, e.g. 12345: [0.99,
0.01, 0.75, 0.50, 0.35].
3. The vectors are called “embeddings” since the “meaning” and context of the token are
captured in them.
4. E.g. the first dimension in the vector, 0.99, could stand for “animalness”, the second
dimension, 0.01, for “verbness” - we could define it this way.
5. But actual vector embeddings are created through training (separate from the LLM), so
the components don’t have nice direct meanings like that.
6. E.g. OpenAI’s embeddings have 1536 dimensions! We don’t know exactly what they mean.
7. Very useful in your own RAG - or querying for meaning (see the sketch below).
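A minimal sketch of point 7 - using embeddings yourself for RAG-style "querying for meaning". It assumes Transformers.js and the small Xenova/all-MiniLM-L6-v2 sentence-embedding model; the example sentences are made up.

```typescript
// Embedding + cosine-similarity sketch with Transformers.js.
// npm install @huggingface/transformers
import { pipeline } from '@huggingface/transformers';

// Small sentence-embedding model (384 dimensions), runs locally.
const embedder = await pipeline('feature-extraction', 'Xenova/all-MiniLM-L6-v2');

async function embedding(text: string): Promise<number[]> {
  const output = await embedder(text, { pooling: 'mean', normalize: true });
  return Array.from(output.data as Float32Array);
}

// Cosine similarity; with normalized vectors this is just the dot product.
const similarity = (a: number[], b: number[]) =>
  a.reduce((sum, v, i) => sum + v * b[i], 0);

const cat = await embedding('The cat sat on the mat');
const kitten = await embedding('A kitten rests on a rug');
const invoice = await embedding('Quarterly invoice for cloud hosting');

console.log('cat vs kitten :', similarity(cat, kitten).toFixed(3));  // similar meaning -> higher
console.log('cat vs invoice:', similarity(cat, invoice).toFixed(3)); // unrelated -> lower
```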
Developing with LLMs
Examples:
● RAG - show WordPress.
● Show a few Ollama examples
● Show Hugging Face Transformers.js
Practical
Get going!
● Install Ollama to run inference in the terminal
https://ollama.com/download/
● Build your own template command like “sed” (see the sketch below)
Dig deeper:
● Use Hugging Face (GitHub for models) and
Transformers.js to run models and play around (use
NodeJS instead of the Python stuff)
https://huggingface.co/docs/transformers.js/index
● Specifically, run the llama-3.2-webgpu example
● https://github.com/huggingface/transformers.js-examples/tree/main/llama-3.2-node
● https://github.com/huggingface/transformers.js-examples/tree/main/llama-3.2-webgpu
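As a starting point for the “template command like sed” idea above, a sketch of my own (assuming a local Ollama server with the llama3.2 model) that pipes stdin through an instruction and prints the result:

```typescript
// transform.mts - a sed-like template command built on a local Ollama server.
// Usage: cat notes.txt | npx tsx transform.mts "rewrite this as bullet points"
// Assumes Ollama is running on localhost:11434 and `ollama pull llama3.2` was done.

const instruction = process.argv[2] ?? 'Fix spelling and grammar';

// Read everything piped in on stdin.
let input = '';
for await (const chunk of process.stdin) input += chunk;

const res = await fetch('http://localhost:11434/api/generate', {
  method: 'POST',
  body: JSON.stringify({
    model: 'llama3.2',
    prompt: `${instruction}\n\nINPUT:\n${input}\n\nOUTPUT:`,
    stream: false,
  }),
});
const data = await res.json();          // Ollama returns the completion in `response`
process.stdout.write(data.response + '\n');
```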
