Using llvm-mca to analyze assembly code

llvm-mca is a performance analysis tool that statically measures the performance of machine code on a specific CPU.

Performance is measured in terms of throughput as well as processor resource consumption.

The main goal of this tool is not just to predict the performance of the code when run on the target, but also to help diagnose potential performance issues.

Given an assembly code sequence, llvm-mca estimates the Instructions Per Cycle (IPC), as well as hardware resource pressure. The analysis and reporting style were inspired by the IACA tool from Intel.

Documentation (llvm-mca - LLVM Machine Code Analyzer):

https://llvm.org/docs/CommandGuide/llvm-mca.html

Releases:

https://github.com/llvm/llvm-project/releases

By default, llvm-mca.exe is installed in

C:\Program Files\LLVM\bin

Prepare the assembly code and add the Intel syntax directive at the top:

.intel_syntax noprefix

.loop:
	IMUL R10, R10
	DEC R8
	JNZ .loop

Command line:

llvm-mca -mcpu=haswell -timeline -iterations=4 test.asm
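
The same invocation can be scripted. The following is a minimal Python sketch, assuming llvm-mca is on PATH. The helper names build_llvm_mca_command and run_llvm_mca are hypothetical, and only the flags from the command line above are used.

```python
import subprocess


def build_llvm_mca_command(asm_path, cpu="haswell", iterations=4, timeline=True):
    """Build the argument list for the llvm-mca command line shown above."""
    cmd = ["llvm-mca", f"-mcpu={cpu}"]
    if timeline:
        cmd.append("-timeline")
    cmd.append(f"-iterations={iterations}")
    cmd.append(asm_path)
    return cmd


def run_llvm_mca(asm_path, **options):
    """Run llvm-mca and return its report as text (raises if the tool fails)."""
    result = subprocess.run(
        build_llvm_mca_command(asm_path, **options),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


# Example (requires llvm-mca to be installed):
# print(run_llvm_mca("test.asm"))
```

Building the argument list separately from running it makes it easy to vary -mcpu or -iterations and compare reports.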

Result:

Iterations:        4
Instructions:      12
Total Cycles:      15
Total uOps:        12

Dispatch Width:    4
uOps Per Cycle:    0.80
IPC:               0.80
Block RThroughput: 1.0
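
The ratios in this summary follow directly from the counters above it. A small Python sketch of the arithmetic, using the numbers from this report (the function name summary_ratios is made up for illustration):

```python
def summary_ratios(iterations, instructions, uops, cycles):
    """Recompute the derived values of the llvm-mca summary view."""
    return {
        "ipc": round(instructions / cycles, 2),
        "uops_per_cycle": round(uops / cycles, 2),
        "cycles_per_iteration": round(cycles / iterations, 2),
    }


ratios = summary_ratios(iterations=4, instructions=12, uops=12, cycles=15)
print(ratios)
# {'ipc': 0.8, 'uops_per_cycle': 0.8, 'cycles_per_iteration': 3.75}
```

The simulated 3.75 cycles per iteration is well above the Block RThroughput of 1.0. Block RThroughput is computed as if there were no loop-carried dependencies, so a gap this large suggests the loop is limited by a dependency chain rather than by execution resources. With only 4 iterations, pipeline fill and drain also inflate the per-iteration figure somewhat.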


Instruction Info:
[1]: #uOps
[2]: Latency
[3]: RThroughput
[4]: MayLoad
[5]: MayStore
[6]: HasSideEffects (U)

[1]    [2]    [3]    [4]    [5]    [6]    Instructions:
 1      3     1.00                        imul	r10, r10
 1      1     0.25                        dec	r8
 1      1     0.50                        jne	.loop
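
RThroughput is a reciprocal throughput: the average number of cycles per instruction when instructions of that kind are issued back to back and nothing else competes for the same resources. Inverting it gives instructions per cycle. A short sketch with the values from the table above:

```python
# RThroughput column, copied from the Instruction Info view above.
rthroughput = {"imul r10, r10": 1.00, "dec r8": 0.25, "jne .loop": 0.50}

# Reciprocal: how many such instructions could issue per cycle in isolation.
per_cycle = {insn: 1.0 / rt for insn, rt in rthroughput.items()}

for insn, rate in per_cycle.items():
    print(f"{insn:15s} up to {rate:.0f} per cycle")
```

These are best-case rates for each instruction on its own. They say nothing about data dependencies between instructions, which is what the Latency column and the timeline view are for.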


Resources:
[0]   - HWDivider
[1]   - HWFPDivider
[2]   - HWPort0
[3]   - HWPort1
[4]   - HWPort2
[5]   - HWPort3
[6]   - HWPort4
[7]   - HWPort5
[8]   - HWPort6
[9]   - HWPort7


Resource pressure per iteration:
[0]    [1]    [2]    [3]    [4]    [5]    [6]    [7]    [8]    [9]    
 -      -     0.75   1.00    -      -      -     0.50   0.75    -     

Resource pressure by instruction:
[0]    [1]    [2]    [3]    [4]    [5]    [6]    [7]    [8]    [9]    Instructions:
 -      -      -     1.00    -      -      -      -      -      -     imul	r10, r10
 -      -      -      -      -      -      -     0.50   0.50    -     dec	r8
 -      -     0.75    -      -      -      -      -     0.25    -     jne	.loop
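
Each entry of the per-iteration row is the column sum of the per-instruction table. A sketch that redoes the sum for the ports used here (values copied from the tables above, the dictionary layout is only an illustration):

```python
# Non-empty cells of "Resource pressure by instruction".
by_instruction = {
    "imul r10, r10": {"HWPort1": 1.00},
    "dec r8": {"HWPort5": 0.50, "HWPort6": 0.50},
    "jne .loop": {"HWPort0": 0.75, "HWPort6": 0.25},
}

# Sum each port's column to get "Resource pressure per iteration".
per_iteration = {}
for pressure in by_instruction.values():
    for port, cycles in pressure.items():
        per_iteration[port] = per_iteration.get(port, 0.0) + cycles

busiest = max(per_iteration, key=per_iteration.get)
print(per_iteration)
print("busiest port:", busiest)
```

The busiest port is HWPort1, which is the only port the multiply uses in this report.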


Timeline view:
                    01234
Index     0123456789     

[0,0]     DeeeER    .   .   imul	r10, r10
[0,1]     DeE--R    .   .   dec	r8
[0,2]     D=eE-R    .   .   jne	.loop
[1,0]     D===eeeER .   .   imul	r10, r10
[1,1]     .DeE----R .   .   dec	r8
[1,2]     .D=eE---R .   .   jne	.loop
[2,0]     .D=====eeeER  .   imul	r10, r10
[2,1]     .D=eE------R  .   dec	r8
[2,2]     . D=eE-----R  .   jne	.loop
[3,0]     . D=======eeeER   imul	r10, r10
[3,1]     . D=eE--------R   dec	r8
[3,2]     . D==eE-------R   jne	.loop
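
imul r10, r10 both reads and writes r10, so the multiply of each iteration has to wait for the multiply of the previous one. A rough estimate of that serial chain, assuming the latency of 3 reported in the Instruction Info view (chain_cycles is a hypothetical helper):

```python
def chain_cycles(iterations, latency):
    """Cycles needed to execute a serial chain of dependent instructions."""
    return iterations * latency


imul_chain = chain_cycles(iterations=4, latency=3)
total_cycles = 15  # "Total Cycles" from the summary
print(imul_chain, total_cycles - imul_chain)
# 12 3
```

The chain accounts for 12 of the 15 simulated cycles. In the timeline the remaining three are the dispatch cycle before the first multiply and the final E and R cycles of the last one.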


Average Wait times (based on the timeline view):
[0]: Executions
[1]: Average time spent waiting in a scheduler's queue
[2]: Average time spent waiting in a scheduler's queue while ready
[3]: Average time elapsed from WB until retire stage

      [0]    [1]    [2]    [3]
0.     4     4.8    0.3    0.0       imul	r10, r10
1.     4     1.5    0.3    5.0       dec	r8
2.     4     2.3    0.0    4.0       jne	.loop
       4     2.8    0.2    3.0       <total>
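
Columns [1] and [3] can be recomputed from the timeline rows. The sketch below assumes, based on the numbers in this report, that [1] is the distance from D to the first e and that [3] is the number of - characters between E and R. The function name wait_stats is made up for illustration.

```python
from collections import defaultdict

# Timeline rows copied from the report above.
TIMELINE = """\
[0,0]     DeeeER    .   .   imul r10, r10
[0,1]     DeE--R    .   .   dec r8
[0,2]     D=eE-R    .   .   jne .loop
[1,0]     D===eeeER .   .   imul r10, r10
[1,1]     .DeE----R .   .   dec r8
[1,2]     .D=eE---R .   .   jne .loop
[2,0]     .D=====eeeER  .   imul r10, r10
[2,1]     .D=eE------R  .   dec r8
[2,2]     . D=eE-----R  .   jne .loop
[3,0]     . D=======eeeER   imul r10, r10
[3,1]     . D=eE--------R   dec r8
[3,2]     . D==eE-------R   jne .loop
"""


def wait_stats(timeline):
    """Return {instruction index: (avg queue time, avg WB-to-retire time)}."""
    queue = defaultdict(list)
    retire = defaultdict(list)
    for line in timeline.splitlines():
        # "[iteration,index]" prefix: take the instruction index.
        index = int(line[line.index(",") + 1 : line.index("]")])
        d = line.index("D")     # dispatched
        e = line.index("e", d)  # first executing cycle
        E = line.index("E", e)  # executed (write-back)
        r = line.index("R", E)  # retired
        queue[index].append(e - d)
        retire[index].append(line.count("-", E, r))
    return {
        i: (sum(queue[i]) / len(queue[i]), sum(retire[i]) / len(retire[i]))
        for i in queue
    }


stats = wait_stats(TIMELINE)
print(stats)
# {0: (4.75, 0.0), 1: (1.5, 5.0), 2: (2.25, 4.0)}
```

Rounded to one decimal these match the table: 4.8 and 0.0 for imul, 1.5 and 5.0 for dec, 2.3 and 4.0 for jne. The long retire waits of dec and jne come from in-order retirement behind the slow multiply.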

Three independent IMUL instructions:

.intel_syntax noprefix

.loop:
	IMUL R10, R10
	IMUL R11, R11
	IMUL R12, R12
	DEC R8
	JNZ .loop

Result:

Iterations:        4
Instructions:      20
Total Cycles:      17
Total uOps:        20

Dispatch Width:    4
uOps Per Cycle:    1.18
IPC:               1.18
Block RThroughput: 3.0
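
Comparing this summary with the single-IMUL run shows what the two extra multiplies cost. A small sketch using only the counters from the two reports:

```python
# Summary counters from the two reports (4 iterations each).
runs = {
    "one imul": {"instructions": 12, "cycles": 15},
    "three imul": {"instructions": 20, "cycles": 17},
}

ipc = {name: round(r["instructions"] / r["cycles"], 2) for name, r in runs.items()}
extra_instructions = runs["three imul"]["instructions"] - runs["one imul"]["instructions"]
extra_cycles = runs["three imul"]["cycles"] - runs["one imul"]["cycles"]

print(ipc)
print(extra_instructions, "more instructions for", extra_cycles, "more cycles")
```

Eight additional instructions cost only two additional cycles. A likely reading is that the three multiplies use different registers, so their latencies overlap instead of adding up.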


Instruction Info:
[1]: #uOps
[2]: Latency
[3]: RThroughput
[4]: MayLoad
[5]: MayStore
[6]: HasSideEffects (U)

[1]    [2]    [3]    [4]    [5]    [6]    Instructions:
 1      3     1.00                        imul	r10, r10
 1      3     1.00                        imul	r11, r11
 1      3     1.00                        imul	r12, r12
 1      1     0.25                        dec	r8
 1      1     0.50                        jne	.loop


Resources:
[0]   - HWDivider
[1]   - HWFPDivider
[2]   - HWPort0
[3]   - HWPort1
[4]   - HWPort2
[5]   - HWPort3
[6]   - HWPort4
[7]   - HWPort5
[8]   - HWPort6
[9]   - HWPort7


Resource pressure per iteration:
[0]    [1]    [2]    [3]    [4]    [5]    [6]    [7]    [8]    [9]    
 -      -     0.75   3.00    -      -      -     0.50   0.75    -     

Resource pressure by instruction:
[0]    [1]    [2]    [3]    [4]    [5]    [6]    [7]    [8]    [9]    Instructions:
 -      -      -     1.00    -      -      -      -      -      -     imul	r10, r10
 -      -      -     1.00    -      -      -      -      -      -     imul	r11, r11
 -      -      -     1.00    -      -      -      -      -      -     imul	r12, r12
 -      -      -      -      -      -      -     0.50   0.50    -     dec	r8
 -      -     0.75    -      -      -      -      -     0.25    -     jne	.loop
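
Block RThroughput appears to be the larger of two bounds: uOps per iteration divided by the dispatch width, and the pressure on the busiest resource. That formula is an assumption on my part rather than something this report states, but it reproduces the value of both runs. The function name block_rthroughput is hypothetical.

```python
def block_rthroughput(uops_per_iteration, dispatch_width, port_pressure):
    """Estimate Block RThroughput from dispatch and resource bounds."""
    dispatch_bound = uops_per_iteration / dispatch_width
    resource_bound = max(port_pressure.values())
    return max(dispatch_bound, resource_bound)


# Per-iteration pressure rows from the two reports.
one_imul = block_rthroughput(
    3, 4, {"HWPort0": 0.75, "HWPort1": 1.00, "HWPort5": 0.50, "HWPort6": 0.75}
)
three_imul = block_rthroughput(
    5, 4, {"HWPort0": 0.75, "HWPort1": 3.00, "HWPort5": 0.50, "HWPort6": 0.75}
)
print(one_imul, three_imul)
# 1.0 3.0
```

In both runs HWPort1 sets the bound, since every multiply is assigned to that port.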


Timeline view:
                    0123456
Index     0123456789       

[0,0]     DeeeER    .    ..   imul	r10, r10
[0,1]     D=eeeER   .    ..   imul	r11, r11
[0,2]     D==eeeER  .    ..   imul	r12, r12
[0,3]     DeE----R  .    ..   dec	r8
[0,4]     .DeE---R  .    ..   jne	.loop
[1,0]     .D==eeeER .    ..   imul	r10, r10
[1,1]     .D===eeeER.    ..   imul	r11, r11
[1,2]     .D====eeeER    ..   imul	r12, r12
[1,3]     . DeE-----R    ..   dec	r8
[1,4]     . D=eE----R    ..   jne	.loop
[2,0]     . D====eeeER   ..   imul	r10, r10
[2,1]     . D=====eeeER  ..   imul	r11, r11
[2,2]     .  D=====eeeER ..   imul	r12, r12
[2,3]     .  DeE-------R ..   dec	r8
[2,4]     .  D=eE------R ..   jne	.loop
[3,0]     .  D======eeeER..   imul	r10, r10
[3,1]     .   D======eeeER.   imul	r11, r11
[3,2]     .   D=======eeeER   imul	r12, r12
[3,3]     .   DeE---------R   dec	r8
[3,4]     .   D=eE--------R   jne	.loop


Average Wait times (based on the timeline view):
[0]: Executions
[1]: Average time spent waiting in a scheduler's queue
[2]: Average time spent waiting in a scheduler's queue while ready
[3]: Average time elapsed from WB until retire stage

      [0]    [1]    [2]    [3]
0.     4     4.0    0.3    0.0       imul	r10, r10
1.     4     4.8    0.5    0.0       imul	r11, r11
2.     4     5.5    0.8    0.0       imul	r12, r12
3.     4     1.0    0.5    6.3       dec	r8
4.     4     1.8    0.0    5.3       jne	.loop
       4     3.4    0.4    2.3       <total>