This pass combines intersecting, adjacent and identical loads/stores into
potentially larger ones and will be used by ACO to greatly reduce the
number of memory operations.
v2: handle nir_deref_type_ptr_as_array
v3: assume explicitly laid out types for derefs
v4: create less deref casts
v4: fix shared boolean vectorization
v4: fix copy+paste error in resources_different
v4: fix extract_subvector() to pass
nir_load_store_vectorize_test.ssbo_load_intersecting_32_32_64
v4: rebase
v5: subtract from deref/offset instead of scheduling offset calculations
v5: various non-functional changes/cleanups
v5: require less metadata and preserve more
v5: rebase
v6: cleanup and improve dependency handling
v6: emit less deref casts
v6: pass undef to components not set in the write_mask for new stores
v7: fix 8-bit extract_vector() with 64-bit input
v7: cleanup creation of store write data
v7: update align correctly for when the bit size of load/store increases
v7: rename extract_vector to extract_component and update comment
v8: prevent combining of row-major matrix column acceses
v9: rework process_block() to be able to vectorize more
v9: rework the callback function
v9: update alignment on all loads/stores, even if they're not vectorized
v9: remove entry::store_value, since it will not be updated if it's was
from a vectorized load
v9: fix bug in subtract_deref(), causing artifacts in Dishonored 2
v9: handle nir_intrinsic_scoped_memory_barrier
v10: use nir_ssa_scalar
v10: handle non-32-bit offsets
v10: use signed offsets for comparison
v10: improve create_entry_key_from_offset()
v10: support load_shared/store_shared
v10: remove strip_deref_casts()
v10: don't ever pass NULL to memcmp
v10: remove recursion in gcd()
v10: fix outdated comment
v11: use the new nir_extract_bits()
v12: remove use of nir_src_as_const_value in resources_different
v13: make entry key hash function deterministic
v13: simplify mask_sign_extend()
v14: add comment in hash_entry_key() about hashing pointers
Signed-off-by: Rhys Perry <pendingchaos02@gmail.com>
Reviewed-by: Connor Abbott <cwabbott0@gmail.com> (v9)
radv: set alignment for load_ssbo/store_ssbo in meta shaders
Otherwise, nir_intrinsic_align() will assert when called on the intrinsics
Signed-off-by: Rhys Perry <pendingchaos02@gmail.com>
Reviewed-by: Samuel Pitoiset <samuel.pitoiset@gmail.com>
nir: add nir_num_variable_modes and nir_var_mem_push_const
These will be useful in the upcoming load/store vectorizer.
v11: rebase
Signed-off-by: Rhys Perry <pendingchaos02@gmail.com>
Reviewed-by: Connor Abbott <cwabbott0@gmail.com>
Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
It shouldn't matter, but the 1 was leftover from when it was handled
together with workgroup_size and num_work_groups.
Reviewed-by: Daniel Schürmann <daniel@schuermann.dev>
radv: Replace supports_spill with explict_scratch_args
The former was always true and hence dead code. We will want to
explicitly declare the ring offset register with ACO, but we also want
to declare the scratch offset too, and we can't try to disable it since
ACO also supports spilling and the determination of whether spilling has
to happen occurs well after setting up registers. So replace
supports_spill with something that will actually be used for ACO.
Reviewed-by: Samuel Pitoiset <samuel.pitoiset@gmail.com>
Due to how LLVM works we have to make some of the FS inputs become
vectors, and therefore have to split them early so that they don't take
up extra register pressure due to how RA currently works.
Reviewed-by: Daniel Schürmann <daniel@schuermann.dev>
ac: Add a shared interface between radv, radeonsi, LLVM and ACO
ac_shader_args will be similar to ac_shader_abi, except for being free
from LLVM-specific concepts and therefore capable of being shared
between LLVM and ACO. This will help us accomplish a few different
things:
- Decouple setting up SGPR and VGPR arguments from translating to LLVM,
so that we can reference these arguments in NIR lowering passes, which
will let us lower e.g. descriptor sets in NIR.
- Stop using radv-specific structures for things like determining the
chip generation in ACO.
In the end, we should replace ac_shader_abi with this structure +
driver-specific lowering passes.
Reviewed-by: Samuel Pitoiset <samuel.pitoiset@gmail.com>
We'll duplicate this in a header file in the next commit, and then
remove the original enum. Just rename it temporarily so that things
keep building.
Reviewed-by: Samuel Pitoiset <samuel.pitoiset@gmail.com>
GiMark benchmark from GpuTest has such code in VS:
out vec4 lightDir0;
out vec4 lightDir1;
...
lightDir0.xyz = lp0 - vVertex.xyz;
lightDir1.xyz = lp1 - vVertex.xyz;
In FS:
float distSqr = dot(lightDir0, lightDir0);
So due to the usage of uninitialized .w channel in the dot product,
distSqr may become undefined which results in many black dots
in the test on Iris.
In https://www.geeks3d.com/forums/index.php/topic,6242.0.html
developer stated that this benchmark most likely won't be updated.
Closes: https://gitlab.freedesktop.org/mesa/mesa/issues/1919
Signed-off-by: Danylo Piliaiev <danylo.piliaiev@globallogic.com>
Reviewed-by: Marek Olšák <marek.olsak@amd.com>
Only required for Intel tools or the Vulkan overlay layer.
Signed-off-by: Samuel Pitoiset <samuel.pitoiset@gmail.com>
Reviewed-by: Eric Engestrom <eric.engestrom@intel.com>
ac/llvm: fix the local invocation index for wave32
Fixes dEQP-VK.compute.builtin_var.local_invocation_index with
RADV_PERFTEST=cswave32.
My initial fix was to lower it but Rhys suggested the shift-right
and it's much better like this.
Signed-off-by: Samuel Pitoiset <samuel.pitoiset@gmail.com>
Reviewed-by: Marek Olšák <marek.olsak@amd.com>
Reviewed-by: Bas Nieuwenhuizen <bas@basnieuwenhuizen.nl>
radv: disable subgroup shuffle operations on GFX10
They are broken like on GFX6-GFX7. It seems better to disable them
instead of enabling a broken feature.
Signed-off-by: Samuel Pitoiset <samuel.pitoiset@gmail.com>
Reviewed-by: Bas Nieuwenhuizen <bas@basnieuwenhuizen.nl>
This fails a couple of piglits due to other bugs in llvmpipe,
but it adds support for the feature properly.
v2: don't reset pipestats, just recalc, fix CI expectation
radv: create a fresh fork for each pipeline compile
In order to prevent a potential malicious pipeline tainting our
secure compile process and interfering with successive pipelines
we want to create a fresh fork for each pipeline compile.
Benchmarking has shown that simply forking on each pipeline
creation doubles the total time it takes to compile a fossilize db
collection. So instead here we fork the process at device creation
so that we have a slim copy of the device and then fork this
otherwise idle and untainted process each time we compile a
pipeline. Forking this slim copy of the device results in only a
20% increase in compile time vs a 100% increase.
Fixes: cff53da3 ("radv: enable secure compile support")
This will be used to create a communication pipe between the user
facing device and a freshly forked (per pipeline compile) slim copy
of that device.
We can't use pipe() here because the fork will not be a direct fork
of the user facing process. Instead we use a previously forked
copy of the process that was forked at device creation in order to
reduce the resources required for the fork and avoid performance
issues.
Fixes: cff53da374 ("radv: enable secure compile support")
radv: add some infrastructure for fresh forks for each secure compile
In the following commits we want to be able to fork an existing lightweight
fork created at device creation time. In order for the user facing process
to communicate with this new fresh fork we create some members here to hold
FIFO file descriptors and a unique id.
Here we also add a new fork enum that we use to tell the lightweight
process to create a fresh fork.
For more information on why we create a fresh fork see the following
commits.
This fixes a build failure on MSVC.
BTW, it looks like clang supports _Pragma() but I don't know if it
understands the "gcc unroll N" directive.
Signed-off-by: Brian Paul <brianp@vmware.com>
Reviewed-by: Ian Romanick <ian.d.romanick@intel.com>
Fixes build with MinGW, with shared LLVM and lto
/tmp/opengl32.dll.BxiIYm.ltrans59.ltrans.o:<artificial>:(.text+0x1674): undefined reference to `LLVMAddInstructionCombiningPass'
See also scons/llvm.py
Acked-by: Dylan Baker <dylan@pnwbakers.com>
nir/serialize: support any num_components for remaining instructions
Only NPOT vectors greater than vec4 use the extra uint32.
This is for instructions that share the dest code.
load_const and undef already support 1-16 in the header.
Reviewed-by: Connor Abbott <cwabbott0@gmail.com>
nir/serialize: remove up to 3 consecutive equal ALU instruction headers
vec4 scalarized ALUs typically have 4 equal instruction headers, so remove
the last 3.
There are no bits left in the ALU header for more flags, so future
extensions of NIR will have to use something like instr_type == 15
to describe more complex ALU instructions.
Reviewed-by: Connor Abbott <cwabbott0@gmail.com>
nir/serialize: don't serialize mode for deref non-cast instructions
It can be derived from src and var. This frees 10 bits in the header
that will be used later.
"mode" is moved in the structure, because those bits will be used for
something else later.
Reviewed-by: Connor Abbott <cwabbott0@gmail.com>
nir/serialize: don't store deref types if not needed
- type_cast: deduplicate types if the last one is the same
- derive the type from the parent for other derefs
Reviewed-by: Connor Abbott <cwabbott0@gmail.com>
nir/serialize: pack 1-component constants into 20 bits if possible
The majority of constants can be packed like this.
v2: - use enum for the packing encoding,
- trim packed_value to 20 bits add 1 bit to last_component,
which simplifies a later commit
Reviewed-by: Connor Abbott <cwabbott0@gmail.com>
If the repo continues development, we don't want to accidentally pick
up potentially breaking changes on our next container rebuild.
Reviewed-by: Eric Engestrom <eric.engestrom@intel.com>