
  • stratova

    veteran

    reply to gbors's message #18936

    The explanation starts at paragraph [0031]:

    By providing a set of execution resources within each GPU compute unit tailored to a range of execution profiles, the GPU can handle irregular workloads more efficiently. Current GPUs (for example, as shown in FIG. 3) only support a single uniform wavefront size (for example, logically supporting 64 thread wide vectors by piping threads through 16 thread wide vector units over four cycles). Vector units of varying widths (for example, as shown in FIG. 4) may be provided to service smaller wavefronts, such as by providing a four thread wide vector unit piped over four cycles to support a wavefront of 16 element vectors.
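    The width/cycle arithmetic in the quoted paragraph (a 64-thread logical wavefront piped through a 16-wide unit over four cycles, or a 16-element wavefront through a 4-wide unit over four cycles) can be sketched as below. The unit widths and the helper name are illustrative, not from the patent:

    ```python
    import math

    # Hypothetical vector-unit widths available in a compute unit (threads issued per cycle).
    UNIT_WIDTHS = [1, 2, 4, 8, 16]

    def pick_unit(wavefront_size: int) -> tuple[int, int]:
        """Pick the narrowest unit that can retire `wavefront_size` threads
        in four cycles, mirroring the patent's examples: a 16-wide unit
        covers 64 threads in 4 cycles, a 4-wide unit covers 16 threads in 4.
        Falls back to the widest unit if the wavefront is larger still."""
        for width in UNIT_WIDTHS:
            if width * 4 >= wavefront_size:
                return width, math.ceil(wavefront_size / width)
        width = UNIT_WIDTHS[-1]
        return width, math.ceil(wavefront_size / width)
    ```

    With this policy, `pick_unit(64)` selects the 16-wide unit over 4 cycles and `pick_unit(16)` the 4-wide unit over 4 cycles, matching the two examples in the text.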

    On the extra scalar unit:
    In addition, a high-performance scalar unit may be used to execute critical threads within kernels faster than possible in existing vector pipelines, by executing the same opcodes as the vector units. Such a high performance scalar unit may, in certain instances, allow for a laggard thread (as described above) to be accelerated. By dynamically issuing wavefronts to the execution unit best suited for their size and performance needs, better performance and/or energy efficiency than existing GPU architectures may be obtained.

    Example:
    In another example, assume that a wavefront includes 16 threads to be scheduled to a 16 thread wide SIMD unit, and only four of the threads are executing (the remaining 12 threads are not executing). So there are 12 threads doing “useless” work, but they are also filling up pipeline cycles, thereby wasting energy and power. By using smaller (and different) sizes of SIMD units (or vector pipelines), the wavefronts can be dynamically scheduled to the appropriate width SIMD units, and the other SIMD units may be powered off (e.g., power gated off or otherwise deactivated). By doing so, the saved power may be diverted to the active SIMD units to clock them at higher frequencies to boost their performance.
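    A minimal sketch of the scheduling idea above: assign each wavefront (counted by its active threads) to the narrowest free SIMD unit that fits, and treat whatever stays idle as a power-gating candidate. The greedy policy and names are my own illustration, not the patent's mechanism:

    ```python
    def schedule(active_counts, unit_widths):
        """Assign each wavefront's active-thread count to the narrowest
        free SIMD unit wide enough for it. Returns the assignments plus
        the leftover units, which a power manager could gate off."""
        free = sorted(unit_widths)
        assignments = []
        for active in sorted(active_counts, reverse=True):
            unit = next((w for w in free if w >= active), None)
            if unit is not None:
                free.remove(unit)
                assignments.append((active, unit))
        return assignments, free  # `free` = units eligible for power gating
    ```

    For the text's example, a wavefront with only 4 of 16 threads executing lands on a 4-wide unit, leaving the 16-wide unit idle and gateable: `schedule([4], [4, 16])` yields `([(4, 4)], [16])`.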

    Info:
    The smaller width SIMD units may be, for example, one, two, four, or eight threads wide. When a larger thread width SIMD unit is needed, any available smaller thread width SIMD units may be combined to achieve the same result. There is no performance penalty if, for example, the wavefront needs a 16 thread wide SIMD unit, but the hardware only includes smaller thread width SIMD units.
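    The "combine smaller units" claim can be illustrated by ganging available narrow units until their widths cover the requested width. This greedy selection is purely a sketch of the idea, not the hardware's actual combining logic:

    ```python
    def combine_units(needed_width, available_widths):
        """Gang smaller SIMD units so they jointly act as one wider unit:
        greedily take the widest available units until their summed width
        reaches `needed_width`. Returns None if the pool cannot cover it."""
        chosen, total = [], 0
        for width in sorted(available_widths, reverse=True):
            if total >= needed_width:
                break
            chosen.append(width)
            total += width
        return chosen if total >= needed_width else None
    ```

    E.g. a wavefront needing a 16-wide unit on hardware that only has 8-, 4-, 2- and 2-wide units is served by all four together: `combine_units(16, [8, 4, 2, 2])` returns `[8, 4, 2, 2]`, consistent with the "no performance penalty" claim.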

    In some embodiments, the collection of heterogeneous execution resources is shared among multiple dispatch engines within a compute unit. For instance, rather than having four 16 thread wide vector units, four dispatchers could feed three 16 thread wide vector units, four four thread wide vector units, and four high-performance scalar units. The dispatchers could arbitrate for these units based on their required issue demands.
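    The shared-pool arbitration described above can be sketched as a toy arbiter: each dispatcher asks for a unit width per issue slot, and grants come from the common pool (three 16-wide vector units, four 4-wide vector units, four scalar units in the quoted configuration). The arbitration order and data shapes are assumptions for illustration:

    ```python
    def arbitrate(requests, pool):
        """Grant each (dispatcher, requested_width) pair the narrowest
        unit still free in the shared pool; requests are served in
        arbitration order. Scalar units are modeled as width 1."""
        free = sorted(pool)
        grants = {}
        for dispatcher, width in requests:
            unit = next((w for w in free if w >= width), None)
            if unit is not None:
                free.remove(unit)
                grants[dispatcher] = unit
        return grants

    # The pool from the quoted paragraph: 3x 16-wide, 4x 4-wide, 4x scalar.
    SHARED_POOL = [16] * 3 + [4] * 4 + [1] * 4
    ```

    Four dispatchers with mixed demands, e.g. `arbitrate([("d0", 16), ("d1", 4), ("d2", 1), ("d3", 16)], SHARED_POOL)`, each get a suitably sized unit without each needing its own full-width vector pipeline.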
