Measuring the True Cost of DIV (32‑bit vs 64‑bit with Rust and Inline Asm)
Modern CPUs are fast—but some instructions still hide surprising costs. One of the most misunderstood is DIV. Is 32‑bit division faster than 64‑bit? Does instruction width matter anymore on x86‑64?
To answer this properly, we need more than wall‑clock timers. We need cycle counters, instruction retirement statistics, serialization barriers, and tight control over CPU affinity.
In this post, we’ll build a microbenchmark in Rust using inline assembly that measures the real cost of integer division using hardware performance counters—and the results may challenge your assumptions.
Why division benchmarks are tricky
Benchmarking arithmetic instructions on modern out‑of‑order CPUs is deceptively hard:
- CPUs reorder instructions aggressively
- Timers can be skewed by frequency scaling or migration across cores
- Modern CPUs overlap instruction execution, hiding latency
- The OS scheduler can move your thread mid‑measurement
If you want instruction‑level truth, you must:
- Pin execution to a specific core
- Serialize timestamp reads
- Measure cycles and retired instructions, not time
- Run enough repetitions to eliminate noise
This benchmark does exactly that.
Pinning execution to a single core
We start by fixing the thread to a specific logical CPU using the Windows API:
fn pin_to_core(core: usize) {
    let mask: usize = 1usize << core;
    unsafe {
        let prev = SetThreadAffinityMask(GetCurrentThread(), mask);
        if prev == 0 {
            panic!("SetThreadAffinityMask failed for core {}", core);
        }
    }
}
Why this matters:
- Prevents scheduler migration
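- Keeps both TSC reads of a measurement on the same core
- Keeps both performance counter reads on the same core, since the counters are per logical CPU
Without this, a migration between two rdpmc reads subtracts values taken from two different cores' counters, and the measurement is garbage. On a hybrid CPU, pinning also decides whether you are measuring a P‑core or an E‑core.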
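One edge case in pin_to_core is worth noting: `1usize << core` overflows once `core` reaches the bit width of `usize` (a panic in debug builds). Below is a minimal sketch of a checked mask builder. `affinity_mask` is a hypothetical helper name and is not part of the original code.

```rust
/// Builds a single-bit affinity mask for `core`, or `None` when the index
/// does not fit in a `usize` mask. Hypothetical helper, not in the original.
fn affinity_mask(core: usize) -> Option<usize> {
    // checked_shl returns None when the shift amount is >= the bit width.
    let shift = u32::try_from(core).ok()?;
    1usize.checked_shl(shift)
}

fn main() {
    for core in [0usize, 3, 200] {
        match affinity_mask(core) {
            Some(mask) => println!("core {core}: mask {mask:#x}"),
            None => println!("core {core}: index out of range for a usize mask"),
        }
    }
}
```

The resulting mask would be passed to SetThreadAffinityMask exactly as before. The mask is interpreted within the thread's processor group, so this approach only reaches the logical CPUs of that group.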
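High‑precision timing with RDTSC/RDTSCP
We use RDTSC and RDTSCP with LFENCE fencing to get accurate tick deltas. Keep in mind that the TSC ticks at a constant rate tied to the base clock, not at the core's current frequency, so these deltas measure elapsed time rather than actual core cycles.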
Using hardware performance counters (RDPMC)
TSC ticks alone aren’t enough. We also want:
- Unhalted core cycles
- Instructions retired
#[inline(always)]
fn rdpmc(counter: u32) -> u64 {
    let low: u32;
    let high: u32;
    unsafe {
        asm!(
            "lfence",
            "rdpmc",
            "lfence",
            out("eax") low,
            out("edx") high,
            in("ecx") counter,
            options(nomem, nostack, preserves_flags),
        );
    }
    ((high as u64) << 32) | (low as u64)
}
We use:
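const PMC_INSTRUCTIONS: u32 = 0x40000000; // fixed-function counter 0: instructions retired
const PMC_CYCLES: u32 = 0x40000001; // fixed-function counter 1: unhalted core cycles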
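Both constants have bit 30 set, which on Intel CPUs tells RDPMC to read a fixed-function counter instead of a programmable one. The low bits pick the counter index. Here is a small sketch of that encoding; `RDPMC_FIXED` and `fixed_counter` are illustrative names, not part of the original code.

```rust
/// Bit 30 of ECX selects the fixed-function counter bank for RDPMC (Intel).
const RDPMC_FIXED: u32 = 1 << 30;

/// Selector for fixed-function counter `index` (illustrative helper).
const fn fixed_counter(index: u32) -> u32 {
    RDPMC_FIXED | index
}

// Fixed counter 0 counts instructions retired,
// fixed counter 1 counts unhalted core cycles.
const PMC_INSTRUCTIONS: u32 = fixed_counter(0);
const PMC_CYCLES: u32 = fixed_counter(1);

fn main() {
    println!("instructions selector: {:#010x}", PMC_INSTRUCTIONS);
    println!("cycles selector:       {:#010x}", PMC_CYCLES);
}
```

RDPMC only works from user mode when the operating system allows it (the CR4.PCE flag), and the counters must actually be enabled. Otherwise the instruction faults or the values never advance.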
The actual workload: forcing real division latency
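To keep the compiler from eliminating the work and to make the measurement overhead negligible, we repeat a real div instruction 1,000 times inside a single inline assembly block, using the assembler's .rept directive. Each repetition reloads the divisor and dividend first, because div overwrites rax and rdx with the quotient and remainder.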
64‑bit division
#[inline(always)]
fn measured_div_64() {
    unsafe {
        asm!(
            ".rept 1000",
            "mov rcx, 0x2034",
            "mov rdx, 0x0008",
            "mov rax, 0x2B7C",
            "div rcx",
            ".endr",
            lateout("rax") _,
            lateout("rdx") _,
            lateout("rcx") _,
            options(nostack, nomem),
        );
    }
}
32‑bit division
#[inline(always)]
fn measured_div_32() {
    unsafe {
        asm!(
            ".rept 1000",
            "mov ecx, 0x2034",
            "mov edx, 0x0008",
            "mov eax, 0x2B7C",
            "div ecx",
            ".endr",
            lateout("rax") _,
            lateout("rdx") _,
            lateout("rcx") _,
            options(nostack, nomem),
        );
    }
}
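The operands are not arbitrary. div raises a divide error when the quotient does not fit in the destination register, which happens whenever the high half of the dividend (rdx or edx) is not smaller than the divisor. The sketch below checks the constants used above with wider integer arithmetic. The helper names are illustrative and do not appear in the original code.

```rust
/// Quotient of the 128-bit dividend hi:lo by `divisor`, as `div r64` would
/// compute it. Returns None where the hardware would raise a divide error
/// (zero divisor, or a quotient that does not fit in 64 bits).
fn div64_quotient(hi: u64, lo: u64, divisor: u64) -> Option<u64> {
    if divisor == 0 {
        return None;
    }
    let dividend = ((hi as u128) << 64) | (lo as u128);
    u64::try_from(dividend / divisor as u128).ok()
}

/// Same check for the 64-bit dividend edx:eax used by `div r32`.
fn div32_quotient(hi: u32, lo: u32, divisor: u32) -> Option<u32> {
    if divisor == 0 {
        return None;
    }
    let dividend = ((hi as u64) << 32) | (lo as u64);
    u32::try_from(dividend / divisor as u64).ok()
}

fn main() {
    // Constants taken from the two asm blocks above.
    println!("64-bit: {:?}", div64_quotient(0x0008, 0x2B7C, 0x2034));
    println!("32-bit: {:?}", div32_quotient(0x0008, 0x2B7C, 0x2034));
    // Counter-example: a high half equal to the divisor overflows the quotient.
    println!("overflow case: {:?}", div64_quotient(0x2034, 0, 0x2034));
}
```

Because every repetition reloads rax, rdx, and rcx with immediates, consecutive divisions do not depend on each other's results.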
Full Code:
use std::arch::asm;
use windows::Win32::System::Threading::{GetCurrentThread, SetThreadAffinityMask};

fn pin_to_core(core: usize) {
    // Logical CPU → bitmask
    let mask: usize = 1usize << core;
    unsafe {
        let prev = SetThreadAffinityMask(GetCurrentThread(), mask);
        if prev == 0 {
            panic!("SetThreadAffinityMask failed for core {}", core);
        }
    }
}

#[inline(always)]
fn rdtsc_start() -> u64 {
    let low: u32;
    let high: u32;
    unsafe {
        asm!(
            "lfence",
            "rdtsc",
            out("eax") low,
            out("edx") high,
            options(nomem, nostack, preserves_flags),
        );
    }
    ((high as u64) << 32) | (low as u64)
}

#[inline(always)]
fn rdtscp_end() -> u64 {
    let low: u32;
    let high: u32;
    unsafe {
        asm!(
            "rdtscp",
            "lfence",
            out("eax") low,
            out("edx") high,
            out("ecx") _, // IA32_TSC_AUX (must be declared)
            options(nomem, nostack, preserves_flags),
        );
    }
    ((high as u64) << 32) | (low as u64)
}

#[inline(always)]
fn rdpmc(counter: u32) -> u64 {
    let low: u32;
    let high: u32;
    unsafe {
        asm!(
            "lfence",
            "rdpmc",
            "lfence",
            out("eax") low,
            out("edx") high,
            in("ecx") counter,
            options(nomem, nostack, preserves_flags),
        );
    }
    ((high as u64) << 32) | (low as u64)
}

#[inline(always)]
fn measured_div_64() {
    unsafe {
        asm!(
            ".rept 1000",
            "mov rcx, 0x2034",
            "mov rdx, 0x0008",
            "mov rax, 0x2B7C",
            "div rcx",
            ".endr",
            lateout("rax") _, // div writes quotient
            lateout("rdx") _, // div writes remainder
            lateout("rcx") _, // rcx is modified by us
            options(nostack, nomem),
        );
    }
}

#[inline(always)]
fn measured_div_32() {
    unsafe {
        asm!(
            ".rept 1000",
            "mov ecx, 0x2034",
            "mov edx, 0x0008",
            "mov eax, 0x2B7C",
            "div ecx",
            ".endr",
            lateout("rax") _, // div writes quotient
            lateout("rdx") _, // div writes remainder
            lateout("rcx") _, // rcx is modified by us
            options(nostack, nomem),
        );
    }
}

fn get_core_from_args() -> usize {
    std::env::args()
        .nth(1) // first user argument
        .and_then(|s| s.parse::<usize>().ok())
        .unwrap_or(0)
}

fn main() {
    const ITER: usize = 1_000_000;
    // Bit 30 selects the fixed-function counters, not programmable PMC0/PMC1.
    const PMC_CYCLES: u32 = 0x40000001; // fixed counter 1: Unhalted Core Cycles
    const PMC_INSTRUCTIONS: u32 = 0x40000000; // fixed counter 0: Instructions Retired

    let mut min_ticks = u64::MAX;
    let mut min_cycles = u64::MAX;
    let mut min_instructions = u64::MAX;

    let core: usize = get_core_from_args();
    pin_to_core(core);
    println!("Running {} iterations on core {}", ITER, core);

    // 64-bit division: TSC ticks
    for _ in 0..ITER {
        let start = rdtsc_start();
        measured_div_64();
        let end = rdtscp_end();
        let delta = end.wrapping_sub(start);
        if delta < min_ticks {
            min_ticks = delta;
        }
    }
    // 64-bit division: core cycles
    for _ in 0..ITER {
        let start = rdpmc(PMC_CYCLES);
        measured_div_64();
        let end = rdpmc(PMC_CYCLES);
        let delta = end.wrapping_sub(start);
        if delta < min_cycles {
            min_cycles = delta;
        }
    }
    // 64-bit division: instructions retired
    for _ in 0..ITER {
        let start = rdpmc(PMC_INSTRUCTIONS);
        measured_div_64();
        let end = rdpmc(PMC_INSTRUCTIONS);
        let delta = end.wrapping_sub(start);
        if delta < min_instructions {
            min_instructions = delta;
        }
    }
    let ipc = min_instructions as f64 / min_cycles as f64;
    println!(
        "64-bit registers: Ticks: {}, Cycles: {}, Instructions: {}, IPC: {:.2}",
        min_ticks, min_cycles, min_instructions, ipc
    );

    min_ticks = u64::MAX;
    min_cycles = u64::MAX;
    min_instructions = u64::MAX;

    // 32-bit division: TSC ticks
    for _ in 0..ITER {
        let start = rdtsc_start();
        measured_div_32();
        let end = rdtscp_end();
        let delta = end.wrapping_sub(start);
        if delta < min_ticks {
            min_ticks = delta;
        }
    }
    // 32-bit division: core cycles
    for _ in 0..ITER {
        let start = rdpmc(PMC_CYCLES);
        measured_div_32();
        let end = rdpmc(PMC_CYCLES);
        let delta = end.wrapping_sub(start);
        if delta < min_cycles {
            min_cycles = delta;
        }
    }
    // 32-bit division: instructions retired
    for _ in 0..ITER {
        let start = rdpmc(PMC_INSTRUCTIONS);
        measured_div_32();
        let end = rdpmc(PMC_INSTRUCTIONS);
        let delta = end.wrapping_sub(start);
        if delta < min_instructions {
            min_instructions = delta;
        }
    }
    let ipc = min_instructions as f64 / min_cycles as f64;
    println!(
        "32-bit registers: Ticks: {}, Cycles: {}, Instructions: {}, IPC: {:.2}",
        min_ticks, min_cycles, min_instructions, ipc
    );
}
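The six measurement loops in main share one shape: read a counter, run the workload, read the counter again, keep the minimum. As a possible refactoring (not the author's code), a generic helper could capture that pattern. `min_delta` is a hypothetical name, and the stand-in counter in `main` exists only so the sketch runs on any machine.

```rust
use std::cell::Cell;

/// Runs `work` `iters` times, reading `read_counter` before and after each
/// run, and returns the smallest observed delta (u64::MAX if `iters` is 0).
fn min_delta(
    iters: usize,
    mut read_counter: impl FnMut() -> u64,
    mut work: impl FnMut(),
) -> u64 {
    let mut min = u64::MAX;
    for _ in 0..iters {
        let start = read_counter();
        work();
        let end = read_counter();
        // wrapping_sub keeps the delta well-defined if the counter wraps.
        min = min.min(end.wrapping_sub(start));
    }
    min
}

fn main() {
    // Stand-in counter: a plain cell advanced by a fake workload.
    let fake_counter = Cell::new(0u64);
    let runs = Cell::new(0u64);
    let best = min_delta(
        5,
        || fake_counter.get(),
        || {
            runs.set(runs.get() + 1);
            // Each run "costs" 10 units plus a growing amount of noise.
            fake_counter.set(fake_counter.get() + 10 + runs.get());
        },
    );
    println!("minimum delta: {best}");
}
```

In the real benchmark, a call might look like `min_delta(ITER, || rdpmc(PMC_CYCLES), measured_div_64)`. The TSC variant reads with different start and end functions, so it would need a helper that takes two readers. Whether the closures inline as tightly as the hand-written loops would have to be checked in the generated assembly.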
Results
On an Intel Core i7-13850HX:
| Metric | P-Core 64-bit | P-Core 32-bit | E-Core 64-bit | E-Core 32-bit |
|---|---|---|---|---|
| Ticks (rdtsc) | 4534 | 2728 | 11538-11542 | 6688 |
| Instructions Retired | 4007 | 4007 | 4008 | 4008 |
| CPU Cycles (rdpmc) | 10057 | 6057 | 19073 | 11073 |
| IPC (Instructions Per Cycle) | 0.40 | 0.66 | 0.21 | 0.36 |
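Since each run executes 1,000 divisions, dividing the cycle count by 1,000 gives a rough per-div cost, and the ratio of core cycles to TSC ticks shows how much faster the core clock ran than the TSC during the run. Here is a small sketch of that arithmetic using the numbers from the table. The function names are illustrative.

```rust
/// Number of div instructions per measured block (`.rept 1000`).
const REPS: f64 = 1000.0;

/// Average core cycles per div, ignoring the small fixed measurement overhead.
fn cycles_per_div(total_cycles: u64) -> f64 {
    total_cycles as f64 / REPS
}

/// How many core cycles elapsed per TSC tick during the run.
fn core_to_tsc_ratio(cycles: u64, ticks: u64) -> f64 {
    cycles as f64 / ticks as f64
}

fn main() {
    // (label, TSC ticks, core cycles) copied from the results table.
    // For the E-core 64-bit tick range, the lower bound is used.
    let rows = [
        ("P-core 64-bit", 4534u64, 10057u64),
        ("P-core 32-bit", 2728, 6057),
        ("E-core 64-bit", 11538, 19073),
        ("E-core 32-bit", 6688, 11073),
    ];
    for (label, ticks, cycles) in rows {
        println!(
            "{label}: {:.2} cycles per div, {:.2} core cycles per TSC tick",
            cycles_per_div(cycles),
            core_to_tsc_ratio(cycles, ticks)
        );
    }
}
```

On these numbers, a 64-bit div costs roughly 10 cycles on the P-core against about 6 for 32-bit, and roughly 19 against 11 on the E-core. Because every repetition reloads the operands, these figures are probably closer to a throughput cost than to the latency of a dependent chain of divisions.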
With div commented out, only the three mov instructions remain in each repetition:
"mov ecx, 0x2034",
"mov edx, 0x0008",
"mov eax, 0x2B7C",
// "div ecx",
Results without div:
| Metric | P-Core 64-bit | P-Core 32-bit | E-Core 64-bit | E-Core 32-bit |
|---|---|---|---|---|
| Ticks (rdtsc) | 264 | 288 | 480 | 470 |
| Instructions | 3007 | 3007 | 3008 | 3008 |
| CPU Cycles | 694-695 | 648 | 843-844 | 820 |
| IPC | 4.33 | 4.64 | 3.56-3.57 | 3.67 |
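One way to read the two tables together is to subtract the mov-only baseline from the full run, which approximates the cycles attributable to div itself. Treat this as a rough estimate only, since an out-of-order core overlaps the mov instructions with the divisions. Here is a sketch using the cycle counts from both tables; `div_only_cycles` is an illustrative name.

```rust
/// Cycles per repetition left after subtracting the mov-only baseline.
/// Only an approximation: the movs partly execute in the shadow of div.
fn div_only_cycles(with_div: u64, baseline: u64, reps: u64) -> f64 {
    with_div.saturating_sub(baseline) as f64 / reps as f64
}

fn main() {
    // (label, cycles with div, cycles without div) from the two tables.
    // Where the baseline is a range, the lower bound is used.
    let rows = [
        ("P-core 64-bit", 10057u64, 694u64),
        ("P-core 32-bit", 6057, 648),
        ("E-core 64-bit", 19073, 843),
        ("E-core 32-bit", 11073, 820),
    ];
    for (label, with_div, baseline) in rows {
        println!(
            "{label}: about {:.2} cycles per div after baseline",
            div_only_cycles(with_div, baseline, 1000)
        );
    }
}
```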