* process --exportconfig and --exporttemplate after --config
This allows using `--config oldfile.kcpps --exportconfig newfile.kcpps`
to update old config items, copy a config file with changed parameters,
download and save a remote config, etc.
* filter out command flags from the saved config files
Also ident files saved by command-line.
* Update build command to build llama-* tools not just ggml-hip
* Update rocWMMA headers to 7.2
* Add GFX1150 target
* Correct library paths for AMD libraries in 26.Q1
* server: fix query params lost when proxying requests in multi-model router mode
* server: re-encode query params using httplib::encode_query_component in proxy
* vulkan: allow using fp16 in scalar flash attention shader
* split rows inside of subgroups for faster synchronization
* use row_split when Br >= 4, change reductions to use shared memory if row_split == 1
* use f32 scalar FA if f16 is not supported by device
* fix amd workgroup size issue
* optimize masksh use
* add medium rows FA shader Br size
* fixes
* add padding to mask shmem buffer
* cache q values into registers for KQ
* fuse lf accumulation, pf and v accumulation into a loop
* stage K loads through shmem
* stage V loads through shmem
* only stage through shmem on Nvidia
* default to Bc 32
* also stage V through shmem when this is done for K
* dynamic subgroups for intel
* use vectorized stores
* use float_type for dequantize4 functions
* use smaller scalar rows size for smaller rows count
* relax flash attention split_k condition to allow non-gqa use
* use minimal subgroup size on Intel
* fix shmem support function
* fix rebase issues
* fixes
* Bc 4 for scalar FA is not a valid configuration
* Use wave32 on AMD RDNA for scalar FA
* add Intel shader core count lookup-table
* fix regressions
* device tuning
* tmpsh size fix
* fix editorconfig
* refactor fa tuning logic into a single place
* fix gqa opt logic
* fix block_rows with small n_rows
* amd tuning
* fix hsk=72/80 issue
* tuning
* allow condition skipping for column check
* use float16 for Of if available
* address feedback
* fix bad RDNA performance on head size <= 128 by limiting occupancy
* allow printing pipeline stats
* cleanup and fixes
* limit occupancy for GCN for small batch FA with large HSK
* disable f16 FA for GCN AMD GPUs on the proprietary driver
* hexagon: refactor set/get/sum-rows ops to use local context
* hexagon: refactor ROPE and Softmax Ops to use local context
Improves performance a bit by precomputing things and saving in the context.
* hexagon: refactor activation ops to use local context struct
* hexagon: refactor unary ops to use local context struct and DMA/VTCM
* hexagon: use aligned hvx_scale function
* hexagon: remove unused fields from op_context
* hexagon: rewrite ROPE to use DMA and VTCM scratchpad
* hex-rope: keep N rows in scratchpad (instead of just two)
* hex-rope: introduce rowidx cache
* hex-rope: remove unused fields
* hex-rope: rewrite dma prefetch logic to allow for multi-row fetch/compute
also removes the need for fastdiv.
* hex-rope: minor formatting
* hex-rope: use indices and unroll the loops
* hex-rope: more updates to cleanup rope-block handling
* hexagon: cleanup supported type/dims checks
* hexagon: all reduce funcs replicated across lanes
There is no need to explicitly replicate the first value.
* snapdragon: update adb and windows scripts to use ubatch-size 256
Updated Ops support handles larger ubatches.