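In July 2023, Intel announced two new instruction set extensions: APX (Advanced Performance Extensions) and AVX10.
Until now, x86 extensions have centered on SIMD instructions and mainly targeted efficient handling of parallel workloads. APX, by contrast, extends the scalar integer instructions, that is, the instructions we use all the time, with two major changes:
- The number of general-purpose registers doubles (16 → 32)
- 3-operand forms are adopted
This post is my write-up of how they work.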
APX
A new prefix, the REX2 prefix, has been added (yet another prefix!).
The REX2 prefix is assigned to 0xD5.
Up through 32-bit mode, 0xD5 was assigned to an instruction called AAD.
Anyone who uses x86 surely knows this already, so it hardly needs explaining, but x86 up through 32-bit included several dedicated instructions for handling BCD, and they took up opcode map space for nothing. These instructions are not available in 64-bit mode on x86-64 CPUs; executing one raises a #UD (undefined opcode) exception.
Since the REX2 prefix only needs to be supported on x86-64, it reuses 0xD5 as a prefix.
The existing REX prefix is a single byte of the form 0x4?: the upper 4 bits are fixed (0x40), and the lower 4 bits encode four bits called W, R, X, and B. (I wrote about this a long time ago: https://w0.hatenablog.com/entry/20130126/1359183872)
R, X, and B each correspond to a register index. They extend the register index fields of an instruction by 1 bit each, which is how the register count was raised to 16. W indicates whether the operation is 64-bit or 32-bit.
The existing REX prefix
add r13, r15 : 4d 01 fd
r/m = r13
reg = r15
4d : REX prefix, 0x40 | W=1, R=1, X=0, B=1 (X is unused)
01 : add rm, reg
fd : mod=11, reg=111, r/m=101
r13 = B (1) | r/m (101) gives 1101 = r13
r15 = R (1) | reg (111) gives 1111 = r15
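The byte-by-byte breakdown above can be reproduced with a few shifts. The following is only a minimal sketch: the function name `encode_add_rex` is hypothetical, and it assumes the register-direct form (mod=11) of `add rm, reg` with 64-bit operands and no index register.

```c
#include <assert.h>
#include <stdint.h>

/* Encode "add rm, reg" (opcode 01 /r) for 64-bit registers r0..r15
   in register-direct form (mod=11). Writes 3 bytes to out.
   Hypothetical helper for illustration only. */
void encode_add_rex(unsigned rm, unsigned reg, uint8_t out[3])
{
    assert(rm < 16 && reg < 16);
    unsigned w = 1;                /* 64-bit operand size */
    unsigned r = (reg >> 3) & 1;   /* bit 3 of the reg index */
    unsigned x = 0;                /* no index register in use */
    unsigned b = (rm >> 3) & 1;    /* bit 3 of the r/m index */

    out[0] = (uint8_t)(0x40 | (w << 3) | (r << 2) | (x << 1) | b);
    out[1] = 0x01;                                   /* add rm, reg */
    out[2] = (uint8_t)(0xC0 | ((reg & 7) << 3) | (rm & 7)); /* ModRM */
}
```

Calling it with rm=13 and reg=15 should give the same three bytes as the listing.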
REX2 starts with the byte 0xD5, and the byte that follows encodes 8 bits: M0, R4, X4, B4, W, R3, X3, B3. W, R3, X3, and B3 correspond to W, R, X, and B of the REX prefix. REX2 adds R4, X4, and B4 on top of these so that a register index can be expressed in 5 bits.
REX2 prefix (I have no way to verify this, so it may be wrong)
add r29, r31 : d5 5d 01 fd
r/m = r29
reg = r31
d5 : REX2 prefix
5d : M0=0, R4=1, X4=0, B4=1, W=1, R3=1, X3=0, B3=1 (X4 and X3 are unused)
01 : add rm, reg
fd : mod=11, reg=111, r/m=101
r29 = B4 (1) | B3 (1) | r/m (101) gives 11101 = r29
r31 = R4 (1) | R3 (1) | reg (111) gives 11111 = r31
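The same calculation for REX2, again as a sketch only. `encode_add_rex2` is a hypothetical name, and the bit order of the payload byte is taken from the description in this post (M0 in bit 7 down to B3 in bit 0), so it inherits the same "unverified" caveat.

```c
#include <assert.h>
#include <stdint.h>

/* Encode "add rm, reg" (opcode 01 /r, no 0x0f escape) for 64-bit
   registers r0..r31 in register-direct form, using a REX2 prefix.
   Writes 4 bytes to out. Hypothetical helper for illustration only. */
void encode_add_rex2(unsigned rm, unsigned reg, uint8_t out[4])
{
    assert(rm < 32 && reg < 32);
    unsigned m0 = 0;                                    /* no 0x0f escape */
    unsigned r4 = (reg >> 4) & 1, r3 = (reg >> 3) & 1;  /* reg bits 4, 3 */
    unsigned b4 = (rm >> 4) & 1,  b3 = (rm >> 3) & 1;   /* r/m bits 4, 3 */
    unsigned x4 = 0, x3 = 0;                            /* no index register */
    unsigned w = 1;                                     /* 64-bit operand size */

    out[0] = 0xD5;                                      /* REX2 prefix */
    out[1] = (uint8_t)((m0 << 7) | (r4 << 6) | (x4 << 5) | (b4 << 4) |
                       (w << 3) | (r3 << 2) | (x3 << 1) | b3);
    out[2] = 0x01;                                      /* add rm, reg */
    out[3] = (uint8_t)(0xC0 | ((reg & 7) << 3) | (rm & 7)); /* ModRM */
}
```

Only the prefix changes compared with the REX version; the opcode and ModRM bytes are built exactly the same way.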
The remaining M0 bit indicates the opcode map.
In x86, instructions that were rarely used in the old days, and instructions that were added later, carry a 0x0f prefix. The cmov instructions, for example, have the 0x0f prefix:
cmove eax, eax # 0f 44 c0
The M0 bit of REX2 indicates whether this 0x0f prefix is present. For instructions without the 0x0f prefix, REX2 is 1 byte longer than REX was. For instructions with it, however, REX2 folds the 0x0f prefix into a single bit, so the length does not change: the byte count stays the same while the number of registers that can be specified doubles.
cmove r15, r15 # 4d 0f 44 ff # with REX prefix
cmove r31, r31 # d5 dd 44 ff # with REX2 prefix (M0=1, R4=1, X4=0, B4=1, W=1, R3=1, X3=0, B3=1)
This REX2 prefix is what doubles the register count.
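Going the other way, the REX2 payload byte can be split back into its fields. This is a sketch under the same assumption about bit order as the listings above; `rex2_fields`, `decode_rex2_payload`, and `reg_index` are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

/* The eight one-bit fields of the byte that follows 0xD5. */
struct rex2_fields {
    unsigned m0, r4, x4, b4, w, r3, x3, b3;
};

/* Split the REX2 payload byte into its fields (bit 7 = M0 ... bit 0 = B3). */
struct rex2_fields decode_rex2_payload(uint8_t p)
{
    struct rex2_fields f;
    f.m0 = (p >> 7) & 1;   /* 1 = the opcode comes from the 0x0f map */
    f.r4 = (p >> 6) & 1;
    f.x4 = (p >> 5) & 1;
    f.b4 = (p >> 4) & 1;
    f.w  = (p >> 3) & 1;   /* 1 = 64-bit operand size */
    f.r3 = (p >> 2) & 1;
    f.x3 = (p >> 1) & 1;
    f.b3 = p & 1;
    return f;
}

/* Join a 3-bit ModRM field with its two extension bits into a 5-bit index. */
unsigned reg_index(unsigned bit4, unsigned bit3, unsigned low3)
{
    assert(bit4 < 2 && bit3 < 2 && low3 < 8);
    return (bit4 << 4) | (bit3 << 3) | low3;
}
```

For the `add r29, r31` example, `decode_rex2_payload(0x5d)` followed by `reg_index(f.b4, f.b3, 5)` recovers register 29 from the r/m field.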
In addition, APX allows the EVEX prefix to be attached to scalar integer instructions. Because the EVEX prefix can express 3-operand instructions, scalar integer instructions also take an EVEX prefix when they are used in 3-operand form.
In AVX-512, the EVEX prefix is followed by 3 bytes that encode its information.
(EVEX for vectors (the one AVX-512 uses))
To make it usable for scalar integer instructions as well, it is extended as follows:
(EVEX for scalar integers (the one APX uses))
- The upper 2 bits of the mmmm field used to be reserved as 0. One of them is now used as B4, and when the other one is 1 the instruction is a scalar integer instruction
- The top bit is assigned to B4. AVX-512 had only 16 address registers, so 4 bits were enough for X and B, which identify the integer registers used for addressing. APX raises the integer register count to 32, so X and B are extended to be able to express them
- The aaa bits (mask register selection) are not used by scalar integer instructions; an NF bit takes their place
- The b bit (broadcast / rounding control) is not used by scalar integer instructions; an ND bit takes its place
- The pp bits (prefix selection) encode, for SIMD, a prefix that is effectively part of the opcode. For scalar integer instructions the 0x66 prefix has a meaning of its own, so pp is interpreted differently there (see below)
The spec says this EVEX is to be called the "Extended EVEX prefix = Extended Enhanced Vector Extension prefix!! = an extended, enhanced vector extension prefix!!!!"
When the ND bit is set, 3-operand form is enabled and the register specified by VVVVV becomes the third operand.
When the NF bit is set, the flags (EFLAGS) are not updated and keep their previous values.
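To make the effect of the two bits concrete, here is a toy model of an add instruction in C. It is only a sketch of the behaviour described above, not of the real encoding: the `cpu` struct and `model_add` are hypothetical, only ZF is modelled, and `ndd` stands for the register selected by VVVVV.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct cpu {
    uint64_t gpr[32];
    bool zf;              /* only ZF is modelled here */
};

/* Toy model of "add" with the ND and NF bits.
   nd = false: legacy two-operand form, src1 = src1 + src2 (ndd is ignored).
   nd = true : three-operand form, ndd = src1 + src2, src1 is preserved.
   nf = true : the flags keep their previous value. */
void model_add(struct cpu *c, bool nd, bool nf,
               unsigned ndd, unsigned src1, unsigned src2)
{
    assert(ndd < 32 && src1 < 32 && src2 < 32);
    uint64_t result = c->gpr[src1] + c->gpr[src2];
    if (!nf)
        c->zf = (result == 0);       /* flag update is skipped when NF is set */
    c->gpr[nd ? ndd : src1] = result;
}
```

With both bits set, the sum lands in a third register and neither source nor ZF changes, which is what makes these forms attractive to a compiler's register allocator and scheduler.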
Extended instructions used to be identified by a 0x0f, 0x0f38, or 0x0f3a placed before the opcode. APX tidies up this view: 0x0f, 0x0f38, and 0x0f3a are each understood as selecting their own opcode map.
The mmmm field of the Extended EVEX means almost the same thing as in the original EVEX, but it is reinterpreted as a map index and expressed as follows:
- mmmm = 100 : legacy map 0
- mmmm = 101 : legacy map 1 # instructions that had 0x0f
- mmmm = 110 : legacy map 2 # instructions that had 0x0f38
- mmmm = 111 : legacy map 3 # instructions that had 0x0f3a
The pp bits resemble pp in AVX-512. But unlike SIMD instructions, where 0x66, 0xf2, and 0xf3 are effectively part of the opcode, for scalar integer instructions the 0x66 prefix is strictly defined as the prefix that changes the operand size.
In SIMD instructions, 0x66, 0xf2, and 0xf3 were only ever used mutually exclusively, so a pick-one field was fine.
However, the instructions APX has to cover include ones like CRC32, which use the 0xf2 prefix as part of the opcode while also changing the operand size, so 0xf2 and 0x66 can appear at the same time.
crc32 eax, ax # 66 f2 0f 38 f1 c0 # contains both the 66 prefix and the f2 prefix
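What happens in a case like this... I don't quite understand...
My guess is that opcodes have been reassigned so that the encoding stays unambiguous and consistent.
In the case of CRC32, dropping the 0xf2 prefix does not collide with any other instruction, so the 0xf2 prefix has been removed. A value in parentheses before "PP" means that the assigned bytes differ from the original instruction; the value inside the parentheses is the prefix used in original x86.
Take lzcnt (https://www.felixcloutier.com/x86/lzcnt): in the original x86 encoding the 0xf3 prefix is part of the opcode, and a 0x66 prefix can be added on top to change the operand size. Under APX the 0xf3 prefix is gone and the opcode changes from 0xBD to 0xF5. (At least that is what I think it says... my understanding is incomplete.)
0xf2 and 0xf3 are originally the rep prefixes attached to string instructions such as movs, but the extended EVEX is not applied to string instructions, so there seems to be no way to encode a rep prefix inside the EVEX prefix.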
Enhancements to eflags-related instructions
The cmp and test instructions used for conditional branches, and the conditional instructions cmov and setcc, get substantial enhancements.
- SCC (Source Condition Code)
The cmp and test instructions can now carry four bits called SCC. With SCC attached, the instructions are named ccmp and ctest.
The 4-bit SCC value corresponds to the 16 condition codes of x86 instructions. https://www.sandpile.org/x86/cc.htm
Using the SCC bits, ccmp and ctest can specify whether the comparison should actually be performed.
if ((x != 4) && (y != 4)) {
    aaa();
}
Consider the code above.
On existing x86 there is no way to AND two sets of flags together, so a branch is needed after every cmp.
cmp edi, 4
je skip      # x == 4: the condition fails
cmp esi, 4
je skip      # y == 4: the condition fails
call aaa
skip:
...
With ccmp, these branches can be written as a single branch.
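cmp edi, 4
# ccmpne looks at "ne" in eflags to decide whether to perform the cmp.
# If ne does not hold (Z is set), eflags instead becomes the value given by dfv.
# Here dfv has to set ZF so that the je below is taken (the exact assembler syntax may differ).
ccmpne {dfv=zf} esi, 4
je skip
call aaa
skip:
...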
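I think this will cut the number of branches by a fair amount. (Possibly by enough to have a real impact...)
It obviously helps when branch prediction misses, but every executed branch instruction also consumes branch-prediction resources, so fewer branch instructions is better even when the prediction hits.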
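The behaviour of ccmp as described above can be modelled in a few lines of C. This is a sketch, not the architectural definition: only ZF is modelled, and `flags`, `do_cmp`, `do_ccmpne`, and `both_not_4` are hypothetical names. The default flag value is assumed to set ZF, so that "x was 4" ends up looking like "equal" to whatever tests the flags afterwards.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct flags { bool zf; };   /* only ZF is modelled */

/* cmp a, b: ZF is set when the operands are equal. */
struct flags do_cmp(int32_t a, int32_t b)
{
    struct flags f = { a == b };
    return f;
}

/* ccmpne a, b with a default flag value: if the incoming flags satisfy
   "ne" (ZF clear), do the compare; otherwise load the flags from dfv. */
struct flags do_ccmpne(struct flags in, int32_t a, int32_t b, struct flags dfv)
{
    return !in.zf ? do_cmp(a, b) : dfv;
}

/* (x != 4) && (y != 4) evaluated in the cmp + ccmpne style:
   one flag test at the end instead of one per comparison. */
bool both_not_4(int32_t x, int32_t y)
{
    struct flags dfv_zf = { true };        /* default flag value with ZF set */
    struct flags f = do_cmp(x, 4);
    f = do_ccmpne(f, y, 4, dfv_zf);
    return !f.zf;                          /* "ne" here means aaa() is called */
}
```

The point of the model is that the second comparison is chained through the flags, so only the final result needs a branch.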
- zu
One pain point of x86 is that setcc can only take an 8-bit register as its operand.
int x = 0;
if (flag)
    x = 1;
For code like this, setcc only rewrites the low 8 bits of the register, so to produce a 32-bit 1 the upper bits had to be cleared beforehand.
xor eax, eax # setcc only rewrites al, so the upper bits have to be cleared
test edi, edi
setne al
ret
Not only is this a wasted instruction, it is also a partial register write, which adds an unnecessary dependency. It's the worst.
With ZU attached, the upper bits are properly cleared to zero.
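The difference between the two forms comes down to which bits of the destination get written. A small model, with hypothetical function names and the full 64-bit register represented as a `uint64_t`:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Plain setcc: only the low 8 bits of the destination are written,
   so the result still depends on the old register contents. */
uint64_t setcc_legacy(uint64_t old, bool cond)
{
    return (old & ~(uint64_t)0xFF) | (cond ? 1u : 0u);
}

/* setcc with ZU: the whole register becomes 0 or 1. */
uint64_t setcc_zu(uint64_t old, bool cond)
{
    (void)old;                 /* the previous value does not matter */
    return cond ? 1u : 0u;
}
```

In the legacy form the old value leaks through unless something like the `xor eax, eax` above clears it first; in the ZU form the result depends only on the condition.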
- cfcmov (Conditionally Faulting cmov)
A new instruction, cfcmov, has been added. It behaves like cmov, except that it does not raise a page fault for the memory access when the condition does not hold.
int x = 4;
if (p) {
    x = *p;
}
Consider the code above. It cannot be replaced with a cmov:
mov rax, 4
mov rcx, p
test rcx, rcx # p != 0
cmovnz rax, [rcx] # load if rcx is not zero
This does not preserve the behaviour of the original C program. Even though the value at [rcx] is not used, the load is still performed, so address zero is accessed and a page fault occurs.
mov rax, 4
mov rcx, p
test rcx, rcx # p != 0
jz 1f
mov rax, [rcx] # load if rcx is not zero
1:
..
So it has to be written properly, like this.
With cfcmov, no page fault occurs when the condition does not hold, so
mov rax, 4
mov rcx, p
test rcx, rcx # p != 0
cfcmovnz rax, [rcx] # load if rcx is not zero
this is all that is needed.
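The only thing that differs between the two instructions in this discussion is when the memory access can fault. A deliberately tiny model of just that point follows; the function names are hypothetical, and "faults" is simplified to "the address is NULL".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* In this model, an access faults exactly when the address is NULL. */
bool access_faults(const void *addr)
{
    return addr == NULL;
}

/* cmov reg, [mem]: the load is performed regardless of the condition,
   so a bad address faults even when the value would be thrown away. */
bool cmov_load_faults(bool cond, const void *addr)
{
    (void)cond;
    return access_faults(addr);
}

/* cfcmov reg, [mem]: the fault is suppressed when the condition is false. */
bool cfcmov_load_faults(bool cond, const void *addr)
{
    return cond && access_faults(addr);
}
```

For the `if (p) x = *p;` example, the interesting case is a false condition with a NULL pointer: the cmov model reports a fault and the cfcmov model does not.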
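Enhancements to PUSH/POP
Something like aarch64's ldp and stp has been added: there are now push2 and pop2 instructions, which push or pop two registers at once.
In addition, a hint can now be attached for when push and pop are used as a pair.
When a push and a pop are matched as a pair, the transfer between registers can be done directly by register renaming alone, without going through the store buffer (or so the document says, more or less).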
AVX10
AVX10 is... (I haven't looked at it closely yet (sorry))
With AVX-512, the supported instructions differed from one microarchitecture to the next, and it was hard to keep track of which CPU supported which instructions. (Every time I pass -mavx512 to gcc I nearly lose it, because there is no plain -mavx512 option.) Tidying this up, or rather, declaring that each new extension includes the previous ones, is the important part... I think. I suspect the name "AVX10" expresses the intent that "from now on we will add AVX11, AVX12, and so on, in order." (No more memorizing the differences between AVX512ER, AVX512DQ, AVX512CD, ...)
From https://fuse.wikichip.org/news/3099/centaur-unveils-its-new-server-class-x86-core-cns-adds-avx-512/2/
I was wrong. It looks like AVX10 will grow as AVX10.1, AVX10.2, ..., AVX10.X. CPUID lets you read the "N" part of "AVX10 Version N".
From https://cdrdv2.intel.com/v1/dl/getContent/784267
In AVX-512, each feature had its own bit. In AVX10 there is just an 8-bit integer (EAX=24H, ECX=00H: EBX[bit 7:0]), so the only kind of extension allowed is AVX10.N (Intel's vow that it is done with feature bits).
Well, not quite. There is a "Reserved for discrete feature bits" field, so... which means... what does that mean exactly... (I give up)
Only the vector length is optional: the supported SIMD width varies by CPU among 128-bit, 256-bit, and 512-bit.
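As a sketch of how software might read this, the function below decodes an EBX value as returned by CPUID with EAX=24H, ECX=00H. The version field in EBX[7:0] is the one quoted above. The bit positions used for the vector lengths (16, 17, 18 for 128, 256, 512 bits) are my assumption based on the AVX10 document available at the time and should be checked against the current spec; `avx10_info` and `decode_avx10_ebx` are hypothetical names. It takes the register value as a parameter rather than executing CPUID, so it can be tried on any machine.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct avx10_info {
    unsigned version;          /* the N in "AVX10 Version N" */
    bool v128, v256, v512;     /* supported vector lengths */
};

/* Decode EBX from CPUID.(EAX=24H, ECX=00H). */
struct avx10_info decode_avx10_ebx(uint32_t ebx)
{
    struct avx10_info info;
    info.version = ebx & 0xFF;        /* EBX[7:0] */
    info.v128 = (ebx >> 16) & 1;      /* assumed position: bit 16 */
    info.v256 = (ebx >> 17) & 1;      /* assumed position: bit 17 */
    info.v512 = (ebx >> 18) & 1;      /* assumed position: bit 18 */
    return info;
}
```

A real program would first check that AVX10 and leaf 24H are available at all before trusting this value.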
In early Alder Lake, the P-cores supported 512-bit instructions and the E-cores supported 256-bit instructions. But today's software environments do not cope well with mixing two kinds of CPU cores that have different instruction sets, so the parts were sold with the P-cores' 512-bit instructions disabled (and the circuitry was reportedly removed later).
This leaves the P-cores' execution units going to waste.
Also, AVX-512 instructions include features that are useful even for 256-bit operations, such as 3-operand forms, dedicated mask registers, rounding-mode control, and scale extensions, so the EVEX encoding adopted for AVX-512 was worthwhile in some situations even without 512-bit support.
AVX10 tidies this up by making 512-bit support optional, which allows a split along these lines:
- Xeon for servers supports 512-bit instructions
- i7 for desktops supports only 256-bit instructions, for better power efficiency. Unlike AVX and AVX2, the register count rises to 32 and mask registers become available, alongside the 3-operand forms AVX already has
- (Celeron(?) for low power supports only 128-bit instructions?)