* metal : fix GGML_OP_SET kernel threads
* tests : extend test_cpy to support different src/dst shapes
Extend test_cpy to support different source and destination tensor shapes
for CPY operations (reshaping), where the total number of elements must match.
- Renamed ne -> ne_src, added ne_dst parameter (default: use src shape)
- Added 50 new reshaping test cases covering 1D<->2D<->3D<->4D conversions
- Tests exercise 1024 boundary, small shapes, and large dimensionality changes
- Fixed dangling reference bug (storing & to temporary std::array)
- Updated all existing test calls with permute/transpose args for compatibility
Assisted-by: llama.cpp:local pi
* metal : optimize concat kernel with row batching for small widths
When ne0 < 256, batch multiple rows into a single threadgroup to improve
occupancy. This avoids underutilizing the GPU when processing narrow tensors.
- Dispatch nth = min(256, ne0) threads per group
- Calculate nrptg (rows per threadgroup) to fill up to 256 threads
- Update kernel index calculation to handle the row batching
- Add boundary check for i1 >= ne1
Assisted-by: llama.cpp:local pi
* tests : clean-up
* tests : refactor CPY shape tests to use dimension permutations
Replace 75 hardcoded test cases with a loop over permutations of
{3, 5, 7, 32} (total elements: 3360). Each src permutation is tested
against canonical sorted and reverse dst, skipping identical shapes.
Covers F32, F16, and Q4_0 (when both src and dst ne0 == 32).
Assisted-by: llama.cpp:local pi