Nil Satis Nisi Optimum - IT café forum posts

#47 P.H. (senior member)
2011-04-08 19:06:27

In reply to P.H.'s post #46

With some modifications the code now performs at the level I expected: it sustains 2.4 IPC throughout.
PerfMonitor Record file
Counter 0 : Non-halted clock cycles
Counter 1 : Retired instructions
Counter 2 : Instructions per cycle (IPC)
Counter 3 : L1 Data cache refill from RAM

1000  2602.3  6239.3  2.4  0.0
1050  2625.0  6279.9  2.4  0.0
1100  2622.7  6217.2  2.4  0.0
1150  2624.9  6212.2  2.4  0.0
1200  2607.5  6200.2  2.4  0.1
1250  2626.4  6246.8  2.4  0.0
1300  2621.9  6223.3  2.4  0.0
1350  2625.4  6207.1  2.4  0.0
1400  2607.3  6176.2  2.4  0.1
1450  2624.8  6223.7  2.4  0.1
1500  2593.0  6160.8  2.4  0.1
1550  2625.6  6257.4  2.4  0.1
1600  2610.2  6214.2  2.4  0.0
1650  2621.0  6224.5  2.4  0.1
1700  2621.6  6254.4  2.4  0.0
1750  2624.4  6201.1  2.4  0.0
1800  2610.4  6156.3  2.4  0.0
1850  2625.9  6208.6  2.4  0.0
1900  2622.2  6224.9  2.4  0.0
1950  2626.3  6263.7  2.4  0.0
2000  2610.2  6255.1  2.4  0.1
2050  2625.3  6284.7  2.4  0.0
2100  2618.1  6281.4  2.4  0.0
2150  2601.0  6225.6  2.4  0.1
2200  2620.6  6289.6  2.4  0.1
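The IPC column is simply the ratio of the two counters before it. A minimal Python sketch that recomputes it from three rows copied from the log above (the tuple layout and the names `samples` and `ipc` are my own, chosen for illustration; they are not part of the PerfMonitor file format):

```python
# Rows copied from the PerfMonitor log:
# (sample, non-halted clock cycles, retired instructions).
# The tuple layout is an illustrative assumption, not the tool's format.
samples = [
    (1000, 2602.3, 6239.3),
    (1500, 2593.0, 6160.8),
    (2200, 2620.6, 6289.6),
]

def ipc(cycles, instructions):
    """Instructions per cycle = retired instructions / non-halted cycles."""
    return instructions / cycles

# Round to one decimal, the same precision the log shows.
ipc_values = [round(ipc(c, i), 1) for _, c, i in samples]

for (t, _, _), v in zip(samples, ipc_values):
    print(f"{t}: {v} IPC")
```

Even the slowest of the three samples (row 1500) still rounds to 2.4.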
Earlier I had not taken into account that although the LEA instruction takes 1 clock cycle on K10 and on Core 2/Nehalem/Sandy Bridge, it takes 2 cycles on K7/K8, 4 on Atom and 2.5 on Prescott, so on those cores it is far from equivalent to a plain add. I have therefore removed the unnecessary LEA instructions; on Prescott the code became 10% faster.
A further benefit: Sandy Bridge can now fuse ADD/SUB + Jcc pairs as well (macro-fusion), and Atom can also pair them. Previously I had put a LEA between every such pair, because I do not like dependent integer instructions directly following each other. With those LEAs gone, almost every loop profits on both cores.
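To see which loops end in such a pair, the listing can be scanned mechanically. The sketch below is only an adjacency check on the @findrowmin loop body from this post; the helper names are mine, and real macro-fusion has further conditions (operand forms, supported condition codes, alignment) that it does not model:

```python
# Loop body copied from the @findrowmin loop in this post.
findrowmin = [
    "cmp esi,[eax]",
    "cmovz edx,ecx",
    "cmova esi,[eax]",
    "add eax,04h",
    "add ecx,04h",
    "jnz @findrowmin",
]

ARITH = {"add", "sub"}
# Jumps that start with 'j' but do not test the arithmetic flags.
NOT_FLAG_JUMPS = {"jmp", "jcxz", "jecxz"}

def is_jcc(mnemonic):
    """True for flag-based conditional jumps such as jz, jnz, jg, js."""
    return mnemonic.startswith("j") and mnemonic not in NOT_FLAG_JUMPS

def fusible_pairs(instructions):
    """Return (first, second) pairs where ADD/SUB is directly followed by Jcc."""
    pairs = []
    for first, second in zip(instructions, instructions[1:]):
        m1 = first.split()[0].lower()
        m2 = second.split()[0].lower()
        if m1 in ARITH and is_jcc(m2):
            pairs.append((first, second))
    return pairs

pairs = fusible_pairs(findrowmin)
print(pairs)
```

A LEA placed between the ADD and the Jcc breaks the adjacency, so the same function reports no pair for the old layout.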
Bulldozer will be interesting, because for this code to run above 2.4 IPC the following are needed:
1. 3 ALUs
2. at least 2 loads/cycle/thread (e.g. the 9-instruction loop in @@5TH_STEP contains 3 loads, 1 load+store, 1 jump and 4 register instructions)
3. CMOVcc, ADC and SBB instructions executing in 1 clock cycle
4. fusion of the loop-closing ADD+Jcc pairs, which speeds things up considerably (almost every loop ends this way)
5. the whole code fitting into a uop cache of a few hundred entries
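The load requirement in item 2 can be checked with back-of-the-envelope arithmetic on that 9-instruction loop. The sketch below is a deliberately crude bound: it assumes throughput is limited only by the number of loads per cycle and by a sustained width of 3 instructions per cycle. The width, the function name and the model itself are my assumptions for illustration; store ports, latencies and dependencies are ignored.

```python
import math

def ipc_upper_bound(instructions, loads, loads_per_cycle, issue_width=3):
    """Crude per-iteration bound: the loop needs at least as many cycles
    as its busiest resource. Only loads and issue width are modelled."""
    cycles_for_loads = math.ceil(loads / loads_per_cycle)
    cycles_for_issue = math.ceil(instructions / issue_width)
    return instructions / max(cycles_for_loads, cycles_for_issue)

# 9 instructions; 3 loads plus the load half of the load+store = 4 reads.
one_load = ipc_upper_bound(9, 4, loads_per_cycle=1)
two_loads = ipc_upper_bound(9, 4, loads_per_cycle=2)
print(one_load, two_loads)
```

With one load per cycle the bound is 2.25 IPC, below the measured 2.4; with two loads per cycle it rises to 3.0 and, in this model, loads are no longer the limit.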
As far as is known at the moment, Sandy Bridge and the K7-K10 series have the first two, only K7-K10 has the third, and the last two are unique to Sandy Bridge.
(In principle the code is even faster on the first Socket 754/939 K8 generations than on K10, since L1 latency was only 2 clock cycles back then.)
In the first generation of Bulldozer the first two are ruled out, the third is almost certain, and the last two are possible, but the information available so far does not mention them. Of course, if the maximum of 2.0 IPC/thread is paired with a suitably high clock speed, this need not be a problem.
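The trade-off in that last sentence is plain arithmetic: instructions per second are IPC times clock frequency. A minimal sketch, using only the two IPC figures from this post (the function name is my own):

```python
def required_clock_ratio(ipc_reference, ipc_other):
    """How many times higher the other core's clock must be to match the
    reference core's instructions per second (IPC * frequency)."""
    return ipc_reference / ipc_other

ratio = required_clock_ratio(2.4, 2.0)
print(f"{ratio:.2f}x")
```

So a core limited to 2.0 IPC per thread would need roughly a 20% higher clock to match 2.4 IPC on this code.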
    mov eax,edi
    pushad
    shl ebp,02h
    xor ecx,ecx
    lea edx,[ebp+ebp*02h]
    lea edi,[ebx+ebp]
    neg ebp
@mark0:
    sub edx,04h
    mov [ebx+edx],ecx
    jg @mark0
    mov byte ptr [edi+00h],01h
@@REDUCE_ROWS:
    mov ebx,ebp
@rowmin:
    mov esi,02000000h
    mov ecx,ebp
    xor edx,edx
@findrowmin:
    cmp esi,[eax]
    cmovz edx,ecx
    cmova esi,[eax]
    add eax,04h
    add ecx,04h
    jnz @findrowmin
    sub ecx,ebp
    cmp esi,02000000h
    jz @specific
    add eax,ebp
@subrow:
    xor edx,edx
    cmp byte ptr [eax+03h],00h
    cmovz edx,esi
    sub [eax],edx
    add eax,04h
    sub ecx,04h
    jnz @subrow
    add ebx,04h
    jnz @rowmin
    jmp @columns
@specific:
    cmp byte ptr [edi+edx],00h
    mov byte ptr [edi+edx],01h
    jnz @@ABNORMAL_EXIT
    add ecx,ebx
    sub dword ptr [esp+__SYS0],01h
    mov byte ptr [edi+ebx+02h],01h
    mov [edi+ecx*02h+__0STAR],edx
    jz @count_result_STACK
    add ebx,04h
    jnz @rowmin
@columns:
    mov [edi+00h],bl
@@RECUDE_COLUMNS:
    sub ebx,04h
    sub eax,04h
    cmp ebx,ebp
    jl @@2ND_STEP
    test byte ptr [edi+ebx],01h
    jnz @@RECUDE_COLUMNS
    mov esi,02000000h
    mov ecx,ebp
@findcolmin:
    cmp esi,[eax]
    cmova esi,[eax]
    add eax,ebp
    add ecx,04h
    jnz @findcolmin
    cmp esi,02000000h
    lea ecx,[ebp-04h]
    jz @@ABNORMAL_EXIT
@subcol:
    xor edx,edx
    add ecx,04h
    jz @@RECUDE_COLUMNS
    sub eax,ebp
    cmp [eax+03h],dl
    cmovz edx,esi
    sub [eax],edx
    jnz @subcol
    mov dl,[edi+ecx+02h]
    or dl,[edi+ebx]
    mov edx,ecx
    jnz @subcol
    mov byte ptr [edi+ebx],01h
    sub edx,ebp
    mov byte ptr [edi+ecx+02h],01h
    sub dword ptr [esp+__SYS0],01h
    mov [edi+edx*02h+__0STAR],ebx
    jnz @subcol
    jmp @count_result_STACK
@@ABNORMAL_EXIT:
    add esp,20h
    xor eax,eax
    mov edx,7FFFFFFFh
    stc
    ret
{ CODE PADDING }
@@3RD_STEP:
    mov byte ptr [edi+ebx+03h],0FFh
    mov byte ptr [edi+edx],00h
    mov [edi+eax*02h+__COLON],ecx
@@2ND_STEP:
    {0} lea ecx,[ebp-04h]
    {1} mov edx,00FFFFFFh
    {2} jmp @c2col
@zeroincol:
    {0} cmp edx,[esi]
    {1} mov bl,[edi+eax+03h]
    {2} sbb bl,00h
    {0} jz @@DECIDE_NEXT_STEP
@nx2mtx:
    {1} sub esi,ebp
    {2} add eax,04h
    {0} jnz @zeroincol
@c2col:
    {0} mov esi,ecx
    {1} add esi,[esp+__MTX]
    {2} sub esi,ebp
@check2col:
    {0} add esi,04h
    {1} add ecx,04h
    {2} jz @@5TH_STEP
    {0} cmp byte ptr [edi+ecx],00h
    {1} mov eax,ebp
    {2} jnz @check2col
    {0} jmp @zeroincol
@@5TH_STEP:
    lea ebx,[ebp+03h]
    mov esi,[esp+__MTX]
@nx5row:
    mov eax,[edi+ebx-03h]
    sub ecx,edx
    xor eax,edx
    cmovs edx,ecx
    mov ecx,ebp
@decrease_row_free:
    {0} bt dword ptr [edi+ecx],00h
    {1} mov al,[esi+03h]
    {2} adc al,[edi+ebx]
    {0} mov eax,00000000h
    {1} cmovz eax,edx
    {2} sub [esi],eax
    {0} add esi,04h
    {1} add ecx,04h
    {2} jnz @decrease_row_free
    add ebx,04h
    js @nx5row
    mov eax,[esp+__FREE0]
    xor edx,edx
    mov esi,eax
    sub eax,[esp+__MTX]
    idiv ebp
    neg eax
    lea ecx,[ebp+edx]
    lea eax,[ebp+eax*04h]
@@DECIDE_NEXT_STEP:
    xor edx,edx
    mov [esp+__FREE0],esi
    add edx,[esi]
    jnz @nx2mtx
    mov ebx,eax
    sub eax,ebp
    add edx,[edi+eax*02h+__0STAR]
    jnz @@3RD_STEP
@@4TH_STEP:
    sub edx,ebp
    jmp @newstar
@0_star:
    mov [edi+ebx*02h+__0STAR],ecx
    mov ecx,[edi+eax*02h+__COLON]
@newstar:
    mov ebx,eax
    lea eax,[edx-04h]
@starincol:
    cmp [edi+eax*02h+__0STAR],ecx
    jz @0_star
    sub eax,04h
    jns @starincol
    mov [edi+ebx*02h+__0STAR],ecx
@@1ST_STEP:
    sub dword ptr [esp+__SYS0],01h
    mov ebx,edi
    mov ecx,ebp
    jz @count_result_STACK
    mov edx,[edi]
@restructure:
    {0} mov esi,[ebx+__0STAR]
    {1} mov byte ptr [edi+ecx+03h],00h
    {2} add ebx,08h
    {0} mov byte ptr [edi+esi],01h
    {1} add ecx,04h
    {2} jnz @restructure
    mov [edi],edx
    jmp @@2ND_STEP
@count_result_STACK:
    xor ecx,ecx
    neg ebp
    xor eax,eax
    mov esi,[esp+__SAVE]
    mov ebx,[esp+__MARKS]
    add esp,20h
@results:
    {0} mov edx,[edi+ecx*02h+__0STAR]
    {1} add ecx,04h
    {2} add edx,ebp
    {0} add eax,[esi+edx]
    {1} shr edx,02h
    {2} add esi,ebp
    {0} cmp ecx,ebp
    {1} mov [ebx],dl
    {2} lea ebx,[ebx+01h]
    {0} jnz @results

[ Edited ]

Arguing on the Internet is like running in the Special Olympics. Even if you win, you are still ... ˙˙˙ Real Eyes Realize Real Lies ˙˙˙
