Download as pdf or txt
Download as pdf or txt
You are on page 1of 107

Embedded Systems with ARM Cortex-M Microcontrollers in Assembly Language and C

Chapter 1
Computer and Assembly Language

CE-475
Real-Time Embedded Systems

Spring 2023

1
Special thanks to Dr. Yifeng Zhu
Embedded Systems

2
Amazon Warehouse

Kiva Robot

3
Why ARM processor
 ARM: Advanced RISC Machine

 In 2010 alone, 6.1 billion ARM-based processor, representing


95% of smartphones, 35% of digital televisions and set-top
boxes and 10% of mobile computers

 As of 2019, 150 billion ARM processors have been produced

4
iPhone 11

A13 Bionic:
• 64-bit system on chip (SoC)
• 6 ARMv8.3-A cores
• Introduced on Sept. 10, 2019

ifixit.com 5
Amazon Echo (Alexa)
 Texas Instruments
DM3725 Digital
Media Processor
 ARM Cortex-A8

6
Kindle HD Fire

Texas Instruments
OMAP 4460 dual-core
ARM processor

http://www.ifixit.com 7
Fitbit Flex

ARM Cortex M3
Microcontroller

8
www.ifixit.com
Samsung Galaxy Gear

 STMicroelectronics STM32F401B
ARM-Cortex M4 MCU with
128KB Flash
source: ifixit.com

9
Pebble Smartwatch

source: ifixit.com

 STMicroelectronics STM32F205RE ARM Cortex-M3


MCU, with a maximum speed of 120 MHz

10
Oculus VR

 Facebook’s $2 Billion Acquisition Of Oculus in 2014 source: ifixit.com

 ST Microelectronics STM32F072VB ARM Cortex-M0 32-bit RISC Core


Microcontroller

11
HTC Vive

STMicroelectronics 32F072R8
ARM Cortex-M0
Microcontroller

source: ifixit.com 12
Nest Learning Thermostat

source: ifixit.com

 ST Microelectronics STM32L151VB ultra-low-power 32 MHz


ARM Cortex-M3 MCU
13
Samsung Gear Fit Fitness Tracker

source: ifixit.com

 STMicroelectronics STM32F439ZI 180


MHz, 32 bit ARM Cortex-M4 CPU

14
ML Model Evolution Problem

Our board (in your kit for Course 3)


● MobileNet (2015)
○ MobileNetv1 only has 256KB of RAM (memory) yet
■ 70.6% accuracy
MobileNetv1 needs 16.9MB!
■ 16.9MB in size

Source: S. Bianco, R. Cadene, L. Celona, and P. Napoletano, “Benchmark analysis of representative


deep neural network architectures,” IEEE Access, vol. 6, pp. 64 270–64 277, 2018
How do we engineer a TinyML Think:
vision network?
● Compute operations
How do we engineer a TinyML Think:
vision network?
● Compute operations
● Operator numerics
How do we engineer a TinyML Think:
vision network?
● Compute operations
● Operator numerics
● Compression methods
(e.g., pruning,
quantization)
sense
(input)
process
actuate
(output)
Your Embedded System
Your Embedded System

MediaTek 7658CSN: Wi-Fi +ARM® Cortex-R4


Real Embedded System Developer Embedded Systems
Processor
+ Bluetooth
Processor
+ Bluetooth

Microphone
IMU

Processor
+ Bluetooth

Microphone
IMU

Processor
+ Bluetooth

Temperature Microphone
+ Humidity
IMU

Processor
I/O (USB)
+ Bluetooth

Temperature Microphone
+ Humidity
Tiny Applications?

Embedded
Systems

TinyML
Machine
Learning
Tiny Applications?

Embedded
Systems

TinyML
Machine
Learning
Tiny Applications?

Embedded
Systems

TinyML
Machine
Learning
Let’s Take an Example

Google Assistant
Let’s Take an Example

Okay, Google!

Hi, how can I help?


The Three Basic Steps
The Three Basic Steps

Step 1
Audio input
from microphone (sensor)

input
complete
The Three Basic Steps

Step 1 Step 2
Audio input Process input
from microphone (sensor) translation, then
execute command

input
complete
The Three Basic Steps

Step 1 Step 2 Step 3


Audio input from Process input Generate Output
microphone (sensor) translation, then play response through
execute command embedded speaker

input
complete
Input

Step 1 Step 2 Step 3


Audio input Process input Generate Output
from microphone (sensor) translation, then play response through
execute command embedded speaker

input
complete
Endpoints Have Sensors, Tons of Sensors

Motion Sensors Acoustic Sensors Environmental Sensors


Gyroscope, radar, Ultrasonic, Microphones, Temperature, Humidity,
magnetometer, accelerator Geophones, Vibrometers Pressure, IR, etc.

Touchscreen Sensors Image Sensors Biometric Sensors


Capacitive, IR Thermal, Image Fingerprint, Heart rate, etc.

Force Sensors Rotation Sensors


Pressure, Strain Encoders
Endpoints Have Sensors, Tons of Sensors

Motion Sensors Acoustic Sensors Environmental Sensors


Gyroscope, radar, Ultrasonic, Microphones, Temperature, Humidity,
magnetometer, accelerator Geophones, Vibrometers Pressure, IR, etc.

Touchscreen Sensors Image Sensors Biometric Sensors


Capacitive, IR Thermal, Image Fingerprint, Heart rate, etc.

Force Sensors Rotation Sensors


Pressure, Strain Encoders
Biometric Sensors

Non-invasive Glucose
Monitoring
Fingerprint + Photoplethysmography
(PPG)
Endpoints Have Sensors, Tons of Sensors

Motion Sensors Acoustic Sensors Environmental Sensors


Gyroscope, radar, Ultrasonic, Microphones, Temperature, Humidity,
magnetometer, accelerator Geophones, Vibrometers Pressure, IR, etc.

Touchscreen Sensors Image Sensors Biometric Sensors


Capacitive, IR Thermal, Image Fingerprint, Heart rate, etc.

Force Sensors Rotation Sensors


Pressure, Strain Encoders
Endpoints Have Sensors, Tons of Sensors
Processing

Step 1 Step 2 Step 3


Audio input from Process input Generate Output
microphone (sensor) translation, then play response through
execute command embedded speaker

input
complete
Thinking Big
Thinking Big
Thinking Big

BIG
GPU / CPU
561mm2
Thinking Small

BIG
GPU / CPU
561mm2
Thinking Small

BIG
GPU / CPU
561mm2
Thinking Small

BIG
GPU / CPU SMALL
561mm2

Mobile SoC
83mm2
Thinking Tiny

BIG
GPU / CPU SMALL
561mm2

Mobile SoC
83mm2
Thinking Tiny

BIG
GPU / CPU SMALL
561mm2

Mobile SoC
83mm2
Thinking Tiny

BIG
GPU / CPU SMALL
561mm2

Mobile SoC
83mm2
Thinking Tiny

BIG
GPU / CPU SMALL TINY

561mm2
Apple 0778
Mobile SoC 30mm2
83mm2
We’re just getting started.
Thinking Record-breaking

BIG
GPU / CPU SMALL TINY

561mm2
Apple 0778 Kinetis KL03
Mobile SoC 30mm2 3.2mm2
83mm2
Thinking Record-breaking

world’s smallest ARM-


Powered MCU
48MHz, 32KB flash, 20-pin

BIG
GPU / CPU SMALL TINY

561mm2
Apple 0778 Kinetis KL03
Mobile SoC 30mm2 3.2mm2
83mm2
250 Billion
MCUs today
MCU Demand Forecast

Millions of Units

forecasted

Source: IC Insights
MCU Demand Forecast

Millions of Units

forecasted

Source: IC Insights
MCU Pricing Forecast

Average Selling Price

forecasted

Source: IC Insights
Comparing Power

BIG SMALL
GPU / CPU
140 μW
Syntiant NDP100
3.64W
Apple A12

300W
NVIDIA Tesla K80
Comparing Power

BIG SMALL
GPU / CPU
140 μW
Syntiant NDP100
3.64W
Apple A12

300W
NVIDIA Tesla K80
Comparing Power

Neural Decision Processor


Always-on deep learning
speech/audio recognition
Ultra low power, 128KB SRAM,
12-pin, 2.52mm2

BIG SMALL
GPU / CPU
140 μW
Syntiant NDP100
3.64W
Apple A12

300W
NVIDIA Tesla K80
Comparing Power

Neural Decision Processor


Always-on deep learning
speech/audio recognition
Ultra low power, 128KB SRAM,
12-pin, 2.52mm2

140 μW
Syntiant NDP100

Use case: button cell battery


Output

Step 1 Step 2 Step 3


Audio input from Process input Generate Output
microphone (sensor) translation, then play response through
execute command embedded speaker

input
complete
Output
Output
Output
MCUs enable TinyML

LOW LOW
SIZE
POWER COST
MCUs enable TinyML

LOW LOW
SIZE
POWER COST
MCUs enable TinyML

LOW LOW
SIZE
POWER COST

< 140 μW
Syntiant NDP100
MCUs enable TinyML

LOW LOW
SIZE
POWER COST

Average Selling Price


MCUs enable TinyML

LOW LOW
SIZE
POWER COST

HIGH DEMAND
What Makes TinyML?

Embedded
Systems

TinyML
Machine
Learning
What Makes TinyML?

Embedded
Systems

TinyML
Machine
Learning
What Makes TinyML?

Embedded
Systems

TinyML
Machine
Learning
Data Address
Memory 8 bits 32 bits

0xFFFFFFFF
 Memory is arranged as a series of “locations”
 Each location has a unique “address”
 Each location holds a byte (byte-addressable)
 e.g. the memory location at address 0x080001B0
contains the byte value 0x70, i.e., 112
 The number of locations in memory is limited 70 0x080001B0
BC 0x080001AF
 e.g. 4 GB of RAM 18 0x080001AE
 1 Gigabyte (GB) = 230 bytes 01 0x080001AD
 232 locations ➔ 4,294,967,296 locations! A0 0x080001AC
 Values stored at each location can represent
either program data or program instructions
 e.g. the value 0x70 might be the code used to tell
the processor to add two values together
0x00000000
Memory 83
Computer Architecture
Von-Neumann Harvard
Instructions and data are stored Data and instructions are stored
in the same memory. into separate memories.

84
Computer Architecture
Von-Neumann Harvard
Instructions and data are stored Data and instructions are stored
in the same memory. into separate memories.

85
ARM Cortex-M Series Family
Von-Neumann Harvard
Instructions and data are stored Data and instructions are stored
in the same memory. into separate memories.

ARM ARM ARM ARM


Cortex-M0 Cortex-M0+ Cortex-M3 Cortex-M4

ARMv6-M ARMv6-M ARMv7-M ARMv7E-M

ARM ARM ARM ARM


Cortex-M1 Cortex-M23 Cortex-M7 Cortex-M33

ARMv6-M ARMv8-M ARMv7E-M ARMv8-M

86
Levels of Program Code
C Program Assembly Program Machine Program
0010000100000000
int main(void){ 0010000000000000
int i; 1110000000000001
int total = 0; Compile Assemble 0100010000000001
for (i = 0; i < 10; i++) {
0001110001000000
total += i;
}
0010100000001010
while(1); // Dead loop 1101110011111011
} 1011111100000000
1110011111111110

 High-level language  Assembly language  Hardware


 Level of abstraction closer to  Textual representation representation
problem domain of instructions  Binary digits (bits)
 Provides for productivity and  Human-readable  Encoded instructions
portability format instructions and data
 Computer-readable
format instructions

87
Why should we learn Assembly?
 Assembly isn’t “just another language”.
 Help you understand how does the processor work

 Assembly program runs faster than high-level language.


 Latency-sensitive applications

 Hardware/processor specific code,


 Device drivers
 Compiler, assembler, linker

 Cost-sensitive applications
 Embedded devices, where the size of code is limited, wash machine controller, automobile controllers

 Better understand high-level programming languages

88
See a Program Runs
C Code
Assembly Code
int main(void){
int a = 0; MOVS r1, #0x00 ; int a = 0
int b = 1; compiler MOVS r2, #0x01 ; int b = 1
int c; ADDS r3, r1, r2 ; c = a + b
c = a + b; MOVS r0, 0x00 ; set return value
return 0; BX lr ; return
}

Machine Code
0010000100000000 2100 ; MOVS r1, #0x00
0010001000000001 2201 ; MOVS r2, #0x01
0001100010001011 188B ; ADDS r3, r1, r2
0010000000000000 2000 ; MOVS r0, #0x00
0100011101110000 4770 ; BX lr
In Binary In Hex

89
Processor Registers
32 bits
 Fastest way to read and write
R0  Registers are within the processor chip
R1
 A register stores 32-bit value
R2
Low R3  ARM Cortex-M has
Registers
R4  R0-R12: 13 general-purpose registers
R5  R13: Stack pointer (Shadow of MSP or PSP)
General
R6 Purpose
Register
 R14: Link register (LR)
R7
 R15: Program counter (PC)
R8
 Special registers (xPSR, BASEPRI, PRIMASK, etc)
R9
High
32 bits
Registers R10
R11 xPSR
R12 BASEPRI
Special
R13 (SP) R13 (MSP) R13 (PSP) PRIMASK Purpose
Register
R14 (LR) FAULTMASK
R15 (PC) CONTROL

90
Program Execution
 Program Counter (PC) is a register that holds the memory address of the next instruction
to be fetched from the memory.

Memory Address
1. Fetch
instruction at
PC address 4770 0x080001B4
2000 0x080001B2
PC 188B 0x080001B0
2201 0x080001AE
3. Execute 2. Decode 2100 0x080001AC
the the
instruction instruction PC = 0x080001B0
Instruction = 188B or
2000188B or 8B180020

91
Three-state pipeline:
Fetch, Decode, Execution
 Pipelining allows hardware resources to be fully utilized
 One 32-bit instruction or two 16-bit instructions can be fetched.

Pipeline of 32-bit instructions

92
Three-state pipeline:
Fetch, Decode, Execution
 Pipelining allows hardware resources to be fully utilized
 One 32-bit instruction or two 16-bit instructions can be fetched.

Clock

Instruction Instruction Instruction


Instruction i
Fetch Decode Execution

Instruction Instruction Instruction


Instruction i + 1
Fetch Decode Execution

Instruction Instruction Instruction


Instruction i + 2
Fetch Decode Execution

Instruction Instruction Instruction


Instruction i + 2
Fetch Decode Execution

Pipeline of 16-bit instructions

93
Machine codes are stored in memory
Data Address
r15 pc
0xFFFFFFFF
r14 lr
r13 sp
r12
r11
r10
r9 4770 0x080001B4
r8 ALU 2000 0x080001B2
r7 188B 0x080001B0
r6 2201 0x080001AE
r5 2100 0x080001AC
r4
r3
r2
r1
r0
0x00000000
Registers CPU
Memory 94
Fetch Instruction: pc = 0x08001AC
Decode Instruction: 2100 = MOVS r1, #0x00
Data Address
r15 0x080001AC pc 0xFFFFFFFF
r14 lr
r13 sp
r12
r11
r10
r9 4770 0x080001B4
r8 ALU 2000 0x080001B2
r7 188B 0x080001B0
r6 2201 0x080001AE
r5 2100 0x080001AC
r4
r3
r2
r1
r0
0x00000000
Registers CPU
Memory 95
Execute Instruction:
MOVS r1, #0x00
Data Address
r15 0x080001AC pc 0xFFFFFFFF
r14 lr
r13 sp
r12
r11
r10
r9 4770 0x080001B4
r8 ALU 2000 0x080001B2
r7 188B 0x080001B0
r6 2201 0x080001AE
r5 2100 0x080001AC
r4
r3
r2
r1 0x00000000
r0
0x00000000
Registers CPU
Memory 96
Fetch Next Instruction: pc = pc + 2
Data Address
r15 0x080001AE pc
0xFFFFFFFF
r14 lr
• Thumb-2 consists of a
mix of 16- & 32-bit r13 sp
instructions r12
• In reality, we always
fetch 4 bytes from the
r11
instruction memory r10
(either one 32-bit
r9 4770 0x080001B4
instruction or two 16-
bit instructions) r8 ALU 2000 0x080001B2
• To simplify the demo, r7 188B 0x080001B0
we assume we only 2201
r6 0x080001AE
fetch 2 bytes from the
instruction memory in r5 2100 0x080001AC
this example. r4
r3
r2
r1 0x00000000
r0
0x00000000
Registers CPU
Memory 97
Fetch Next Instruction: pc = pc + 2
Decode & Execute: 2201 = MOVS r2, #0x01
Data Address
r15 0x080001AE pc
0xFFFFFFFF
r14 lr
r13 sp
r12
r11
r10
r9 4770 0x080001B4
r8 ALU 2000 0x080001B2
r7 188B 0x080001B0
r6 2201 0x080001AE
r5 2100 0x080001AC
r4
r3
r2 0x00000001
r1 0x00000000
r0
0x00000000
Registers CPU
Memory 98
Fetch Next Instruction: pc = pc + 2
Decode & Execute: 188B = ADDS r3, r1, r2
Data Address
r15 0x080001B0 pc
0xFFFFFFFF
r14 lr
r13 sp
r12
r11
r10
r9 4770 0x080001B4
r8 ALU 2000 0x080001B2
r7 188B 0x080001B0
r6 2201 0x080001AE
r5 2100 0x080001AC
r4
r3 0x00000001
r2 0x00000001
r1 0x00000000
r0
0x00000000
Registers CPU
Memory 99
Fetch Next Instruction: pc = pc + 2
Decode & Execute: 2000 = MOVS r0, #0x00
Data Address
r15 0x080001B2 pc
0xFFFFFFFF
r14 lr
r13 sp
r12
r11
r10
r9 4770 0x080001B4
r8 ALU 2000 0x080001B2
r7 188B 0x080001B0
r6 2201 0x080001AE
r5 2100 0x080001AC
r4
r3
r2 0x00000001
r1 0x00000000
r0 0x00000000
0x00000000
Registers CPU
Memory 100
Fetch Next Instruction: pc = pc + 2
Decode & Decode: 4770 = BX lr
Data Address
r15 0x080001B4 pc
0xFFFFFFFF
r14 lr
r13 sp
r12
r11
r10
r9 4770 0x080001B4
r8 ALU 2000 0x080001B2
r7 188B 0x080001B0
r6 2201 0x080001AE
r5 2100 0x080001AC
r4
r3
r2 0x00000001
r1 0x00000000
r0 0x00000000
0x00000000
Registers CPU
Memory 101
Realities
 In the previous example,
 PC is incremented by 2

Well, In ARM Cortex-M4, its different.

102
Realities
 PC is always incremented by 4.
 Each time, 4 bytes are fetched from the instruction memory
 It is either two 16-bit instructions or one 32-bit instruction

If bit [15-11] = 11101, 11110, or 11111, then, it is the first half-word of a 32-bit instruction.
Otherwise, it is a 16-bit instruction.

103
Example:
Calculate the Sum of an Array

int a[10] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};


int total;

int main(void){
int i;
total = 0;
for (i = 0; i < 10; i++) {
total += a[i];
}
while(1);
}

104
Example:
Calculate the Sum of an Array

Instruction Data
Memory (Flash) Memory (RAM)

int main(void){ int a[10] = {1, 2, 3, 4, 5, 6, 7,


int i; 8, 9, 10};
total = 0; int total;
for (i = 0; i < 10; i++) {
total += a[i];
CPU } I/O Devices
while(1);
}
Starting memory address Starting memory address
0x08000000 0x20000000

105
Loading Code and Data into Memory

108
Loading Code and Data into Memory

109

You might also like