Using llvm-mca to analyze assembly code
llvm-mca is a performance analysis tool that statically estimates how machine code performs on a specific CPU.
Performance is measured in terms of throughput and processor resource consumption.
The goal of the tool is not only to predict how the code performs on the target, but also to help diagnose potential performance issues.
Given an assembly code sequence, llvm-mca estimates the Instructions Per Cycle (IPC) as well as hardware resource pressure. The analysis and reporting style were inspired by Intel's IACA tool.
llvm-mca - LLVM Machine Code Analyzer.
https://llvm.org/docs/CommandGuide/llvm-mca.html
Releases:
https://github.com/llvm/llvm-project/releases
By default, llvm-mca.exe is installed in
C:\Program Files\LLVM\bin
Prepare the assembly code. Select Intel syntax with a directive at the top of the file:
.intel_syntax noprefix
.loop:
IMUL R10, R10
DEC R8
JNZ .loop
Command Line
llvm-mca -mcpu=haswell -timeline -iterations=4 test.asm
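The same invocation can also be scripted. The following is a minimal Python sketch and not part of the original workflow: the helper names find_llvm_mca, build_command and run_llvm_mca are hypothetical, and the fallback directory is simply the default install location mentioned above. Only the -mcpu, -timeline and -iterations flags from the command line above are used.

```python
import shutil
import subprocess
from pathlib import Path

# Assumption: default Windows install location quoted in the text above.
DEFAULT_DIR = Path(r"C:\Program Files\LLVM\bin")


def find_llvm_mca():
    """Return a path to llvm-mca, or None if it cannot be located."""
    found = shutil.which("llvm-mca")
    if found:
        return found
    candidate = DEFAULT_DIR / "llvm-mca.exe"
    return str(candidate) if candidate.exists() else None


def build_command(asm_file, cpu="haswell", iterations=4, timeline=True,
                  exe="llvm-mca"):
    """Build the argument list for the command line shown above."""
    cmd = [exe, f"-mcpu={cpu}"]
    if timeline:
        cmd.append("-timeline")
    cmd.append(f"-iterations={iterations}")
    cmd.append(asm_file)
    return cmd


def run_llvm_mca(asm_file, **options):
    """Run llvm-mca on asm_file and return its report as text."""
    exe = find_llvm_mca()
    if exe is None:
        raise FileNotFoundError("llvm-mca was not found on PATH or in DEFAULT_DIR")
    cmd = build_command(asm_file, exe=exe, **options)
    completed = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return completed.stdout
```

build_command("test.asm") yields the argument list for the command shown above, and run_llvm_mca("test.asm") returns the report as text, provided llvm-mca is installed. Running the command directly in a terminal gives the report that follows.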
Result:
Iterations: 4
Instructions: 12
Total Cycles: 15
Total uOps: 12
Dispatch Width: 4
uOps Per Cycle: 0.80
IPC: 0.80
Block RThroughput: 1.0
Instruction Info:
[1]: #uOps
[2]: Latency
[3]: RThroughput
[4]: MayLoad
[5]: MayStore
[6]: HasSideEffects (U)
[1]    [2]    [3]    [4]    [5]    [6]    Instructions:
 1      3     1.00                        imul r10, r10
 1      1     0.25                        dec r8
 1      1     0.50                        jne .loop
Resources:
[0] - HWDivider
[1] - HWFPDivider
[2] - HWPort0
[3] - HWPort1
[4] - HWPort2
[5] - HWPort3
[6] - HWPort4
[7] - HWPort5
[8] - HWPort6
[9] - HWPort7
Resource pressure per iteration:
[0]    [1]    [2]    [3]    [4]    [5]    [6]    [7]    [8]    [9]
 -      -     0.75   1.00    -      -      -     0.50   0.75    -
Resource pressure by instruction:
[0]    [1]    [2]    [3]    [4]    [5]    [6]    [7]    [8]    [9]    Instructions:
 -      -      -     1.00    -      -      -      -      -      -     imul r10, r10
 -      -      -      -      -      -      -     0.50   0.50    -     dec r8
 -      -     0.75    -      -      -      -      -     0.25    -     jne .loop
Timeline view:
                    01234
Index     0123456789
[0,0]     DeeeER    .   .   imul r10, r10
[0,1]     DeE--R    .   .   dec r8
[0,2]     D=eE-R    .   .   jne .loop
[1,0]     D===eeeER .   .   imul r10, r10
[1,1]     .DeE----R .   .   dec r8
[1,2]     .D=eE---R .   .   jne .loop
[2,0]     .D=====eeeER  .   imul r10, r10
[2,1]     .D=eE------R  .   dec r8
[2,2]     . D=eE-----R  .   jne .loop
[3,0]     . D=======eeeER   imul r10, r10
[3,1]     . D=eE--------R   dec r8
[3,2]     . D==eE-------R   jne .loop
Average Wait times (based on the timeline view):
[0]: Executions
[1]: Average time spent waiting in a scheduler's queue
[2]: Average time spent waiting in a scheduler's queue while ready
[3]: Average time elapsed from WB until retire stage
      [0]    [1]    [2]    [3]
0.     4     4.8    0.3    0.0       imul r10, r10
1.     4     1.5    0.3    5.0       dec r8
2.     4     2.3    0.0    4.0       jne .loop
       4     2.8    0.2    3.0       <total>
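Columns [1] and [3] can be cross-checked against the timeline view. In this report the figures are reproduced by counting, per row, the '=' cells plus one for column [1] and the '-' cells for column [3]. The plus one is an observation from this output, not a documented rule. A small Python sketch, with the timeline cells copied from the report above (average_waits is a hypothetical helper):

```python
from collections import defaultdict

# (instruction, timeline cells) pairs copied from the timeline view above.
# Leading '.' and padding cells are omitted; they do not affect the counts.
TIMELINE = [
    ("imul r10, r10", "DeeeER"),
    ("dec r8",        "DeE--R"),
    ("jne .loop",     "D=eE-R"),
    ("imul r10, r10", "D===eeeER"),
    ("dec r8",        "DeE----R"),
    ("jne .loop",     "D=eE---R"),
    ("imul r10, r10", "D=====eeeER"),
    ("dec r8",        "D=eE------R"),
    ("jne .loop",     "D=eE-----R"),
    ("imul r10, r10", "D=======eeeER"),
    ("dec r8",        "D=eE--------R"),
    ("jne .loop",     "D==eE-------R"),
]


def average_waits(rows):
    """Return {instruction: (avg column [1], avg column [3])}."""
    queue = defaultdict(list)   # cycles spent in the scheduler's queue
    retire = defaultdict(list)  # cycles between write-back and retire
    for instruction, cells in rows:
        # Assumption: column [1] equals the number of '=' cells plus one.
        queue[instruction].append(cells.count("=") + 1)
        retire[instruction].append(cells.count("-"))
    return {
        instruction: (
            sum(queue[instruction]) / len(queue[instruction]),
            sum(retire[instruction]) / len(retire[instruction]),
        )
        for instruction in queue
    }


for instruction, (in_queue, to_retire) in average_waits(TIMELINE).items():
    print(f"{instruction:<15} {in_queue:.2f} {to_retire:.2f}")
```

The unrounded averages are 4.75 and 0.0 for imul, 1.5 and 5.0 for dec, and 2.25 and 4.0 for jne, which the report shows rounded to one decimal place.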
Three IMUL Instructions
.intel_syntax noprefix
.loop:
IMUL R10, R10
IMUL R11, R11
IMUL R12, R12
DEC R8
JNZ .loop
Result:
Iterations: 4
Instructions: 20
Total Cycles: 17
Total uOps: 20
Dispatch Width: 4
uOps Per Cycle: 1.18
IPC: 1.18
Block RThroughput: 3.0
Instruction Info:
[1]: #uOps
[2]: Latency
[3]: RThroughput
[4]: MayLoad
[5]: MayStore
[6]: HasSideEffects (U)
[1]    [2]    [3]    [4]    [5]    [6]    Instructions:
 1      3     1.00                        imul r10, r10
 1      3     1.00                        imul r11, r11
 1      3     1.00                        imul r12, r12
 1      1     0.25                        dec r8
 1      1     0.50                        jne .loop
Resources:
[0] - HWDivider
[1] - HWFPDivider
[2] - HWPort0
[3] - HWPort1
[4] - HWPort2
[5] - HWPort3
[6] - HWPort4
[7] - HWPort5
[8] - HWPort6
[9] - HWPort7
Resource pressure per iteration:
[0]    [1]    [2]    [3]    [4]    [5]    [6]    [7]    [8]    [9]
 -      -     0.75   3.00    -      -      -     0.50   0.75    -
Resource pressure by instruction:
[0]    [1]    [2]    [3]    [4]    [5]    [6]    [7]    [8]    [9]    Instructions:
 -      -      -     1.00    -      -      -      -      -      -     imul r10, r10
 -      -      -     1.00    -      -      -      -      -      -     imul r11, r11
 -      -      -     1.00    -      -      -      -      -      -     imul r12, r12
 -      -      -      -      -      -      -     0.50   0.50    -     dec r8
 -      -     0.75    -      -      -      -      -     0.25    -     jne .loop
Timeline view:
                    0123456
Index     0123456789
[0,0]     DeeeER    .    ..   imul r10, r10
[0,1]     D=eeeER   .    ..   imul r11, r11
[0,2]     D==eeeER  .    ..   imul r12, r12
[0,3]     DeE----R  .    ..   dec r8
[0,4]     .DeE---R  .    ..   jne .loop
[1,0]     .D==eeeER .    ..   imul r10, r10
[1,1]     .D===eeeER.    ..   imul r11, r11
[1,2]     .D====eeeER    ..   imul r12, r12
[1,3]     . DeE-----R    ..   dec r8
[1,4]     . D=eE----R    ..   jne .loop
[2,0]     . D====eeeER   ..   imul r10, r10
[2,1]     . D=====eeeER  ..   imul r11, r11
[2,2]     .  D=====eeeER ..   imul r12, r12
[2,3]     .  DeE-------R ..   dec r8
[2,4]     .  D=eE------R ..   jne .loop
[3,0]     .  D======eeeER..   imul r10, r10
[3,1]     .   D======eeeER.   imul r11, r11
[3,2]     .   D=======eeeER   imul r12, r12
[3,3]     .   DeE---------R   dec r8
[3,4]     .   D=eE--------R   jne .loop
Average Wait times (based on the timeline view):
[0]: Executions
[1]: Average time spent waiting in a scheduler's queue
[2]: Average time spent waiting in a scheduler's queue while ready
[3]: Average time elapsed from WB until retire stage
      [0]    [1]    [2]    [3]
0.     4     4.0    0.3    0.0       imul r10, r10
1.     4     4.8    0.5    0.0       imul r11, r11
2.     4     5.5    0.8    0.0       imul r12, r12
3.     4     1.0    0.5    6.3       dec r8
4.     4     1.8    0.0    5.3       jne .loop
       4     3.4    0.4    2.3       <total>
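The summary figures of both runs can be re-derived from the reported counts: IPC is Instructions divided by Total Cycles, and uOps Per Cycle is Total uOps divided by Total Cycles. Block RThroughput is approximated below as the larger of the per-iteration uOps divided by the dispatch width and the highest per-port pressure. That approximation is an assumption that happens to match both reports; llvm-mca itself derives the value from the target's scheduling model. The function name summarize is hypothetical.

```python
def summarize(instructions, uops, cycles, iterations, dispatch_width, pressure):
    """Re-derive llvm-mca summary figures from the reported counts."""
    ipc = instructions / cycles
    uops_per_cycle = uops / cycles
    # Bound imposed by how many uOps can be dispatched per cycle.
    dispatch_bound = (uops / iterations) / dispatch_width
    # Bound imposed by the most heavily used port (per-iteration pressure).
    busiest_port = max(pressure, key=pressure.get)
    # Assumption: approximate Block RThroughput as the larger of the two bounds.
    block_rthroughput = max(dispatch_bound, pressure[busiest_port])
    return {
        "ipc": ipc,
        "uops_per_cycle": uops_per_cycle,
        "block_rthroughput": block_rthroughput,
        "busiest_port": busiest_port,
    }


# Values copied from the two reports above.
ONE_IMUL = summarize(
    instructions=12, uops=12, cycles=15, iterations=4, dispatch_width=4,
    pressure={"HWPort0": 0.75, "HWPort1": 1.00, "HWPort5": 0.50, "HWPort6": 0.75},
)
THREE_IMUL = summarize(
    instructions=20, uops=20, cycles=17, iterations=4, dispatch_width=4,
    pressure={"HWPort0": 0.75, "HWPort1": 3.00, "HWPort5": 0.50, "HWPort6": 0.75},
)

for name, result in (("one imul", ONE_IMUL), ("three imul", THREE_IMUL)):
    print(f"{name}: IPC {result['ipc']:.2f}, "
          f"Block RThroughput {result['block_rthroughput']:.1f}, "
          f"busiest port {result['busiest_port']}")
```

Both runs reproduce the reported IPC (0.80 and 1.18) and Block RThroughput (1.0 and 3.0), with HWPort1 as the busiest port in each case. The three-IMUL loop executes 20 instructions in 17 cycles, against 12 instructions in 15 cycles for the single-IMUL loop.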