whisper.cpp

whisper.cpp / ggml-cuda.cu

Commit History

CUDA: add FP32 FlashAttention vector kernel (llama/7188)

03d4b22
unverified

JohannesGaessler committed on May 12, 2024

ggml : full ALiBi support (llama/7192)

192bda4

ggerganov HF Staff committed on May 11, 2024

Introduction of CUDA Graphs to LLama.cpp (llama/6766)

08fc76d

agray3 slaren committed on May 8, 2024

Add an option to build without CUDA VMM (llama/7067)

38b1143

wtambellini committed on May 6, 2024

ggml : add Flash Attention (llama/5021)

34d3b03

ggerganov HF Staff

JohannesGaessler

phymbert committed on Apr 30, 2024

ggml : group all experts in a single ggml_mul_mat_id (llama/6505)

f0b5c67

slaren

ggerganov HF Staff committed on Apr 18, 2024

CUDA: fix matrix multiplication logic for tests (llama/6667)

6ccb5a5

JohannesGaessler committed on Apr 13, 2024

feat: implemented sigmoid function (ggml/806)

cd0c122

Justina Cho committed on May 1, 2024

llama : add Command R Plus support (llama/6491)

8cf7097
unverified

Carolinabanana S S slaren

ggerganov HF Staff committed on Apr 9, 2024

ggml : mul_mat_id use the same tensor for all the experts (llama/6387)

26fdc9f
unverified

slaren

ggerganov HF Staff committed on Apr 3, 2024

ggml: bypass code incompatible with CUDA < 11.1 (#2020)

32f4e35
unverified

primenko committed on Apr 4, 2024

sync : ggml (#2001)

cbbfa9e
unverified

ggerganov HF Staff committed on Mar 27, 2024

llama : add pipeline parallelism support (llama/6017)

b5bb3f3
unverified

slaren

compilade

ggerganov HF Staff committed on Mar 13, 2024

ggml : reuse quantum structs across backends (llama/5943)

bb0625f
unverified

ggerganov HF Staff committed on Mar 12, 2024

1.5 bit: we can do even better (llama/5999)

36cc71e
unverified

Kawrakow

ikawrakow committed on Mar 11, 2024

Better 1.5 bit quantization (llama/5971)

f3a62cc
unverified

Kawrakow

ikawrakow committed on Mar 11, 2024

ggml : add ggml-common.h to deduplicate shared code (llama/5940)

0a37735
unverified

ggerganov HF Staff committed on Mar 9, 2024

ggml : introduce ggml_status (ggml/750)

151c676
unverified

Michael Podvitskiy slaren

ggerganov HF Staff committed on Mar 4, 2024

cuda : fix data race in soft max (llama/5853)

d1b60e4
unverified

slaren committed on Mar 3, 2024

ggml : IQ3_S improvements (llama/5829)

06a8e30
unverified

Kawrakow

ikawrakow committed on Mar 2, 2024

add some new ops, fix some operators and add batch operations to certain operators. (ggml/747)

dd8e3f9
unverified

leejet

ggerganov HF Staff slaren committed on Mar 3, 2024

ggml : make i-quants work with super-blocks of 64 (CPU,Metal) (llama/5760)

9a07f42
unverified

Kawrakow

ikawrakow committed on Feb 28, 2024

IQ4_XS: a 4.25 bpw quantization (llama/5747)

0ee1bfb
unverified

Kawrakow

ikawrakow committed on Feb 27, 2024

cuda : replace remaining shfl_xor with calls to warp_reduce functions (llama/5744)

753b30d
unverified

Engininja2 committed on Feb 27, 2024

Adding IQ2_S and IQ2_M to complete coverage of the 2-3 bit quantization range (llama/5721)

2b9bb9e
unverified

Kawrakow

ikawrakow

ggerganov HF Staff committed on Feb 26, 2024

CUDA: fix DEBUG_CUDA_MALLOC (llama/5729)

f18f386
unverified

JohannesGaessler committed on Feb 26, 2024

code : normalize enum names (llama/5697)

93e0830
unverified

ggerganov HF Staff committed on Feb 25, 2024

IQ3_S: a much better alternative to Q3_K (llama/5676)

32589c9
unverified

Kawrakow

ikawrakow committed on Feb 24, 2024

Introduce backend GUIDs (ggml/743)

a7eb9f6
unverified

UEXTM.com slaren committed on Feb 24, 2024

ggml : always define ggml_fp16_t as uint16_t (llama/5666)

bc567d3
unverified

ggerganov HF Staff committed on Feb 22, 2024

sync : llama.cpp (ggml/0)

f8e8d34
unverified

ggerganov HF Staff committed on Feb 21, 2024

cuda : ignore peer access already enabled errors (llama/5597)

a817d85
unverified

slaren committed on Feb 19, 2024

ci : enable -Werror for CUDA builds (llama/5579)

df03a10
unverified

ggerganov HF Staff committed on Feb 19, 2024

cuda, metal : fix nans in soft_max (llama/5574)

44164ac
unverified

slaren

ggerganov HF Staff committed on Feb 19, 2024

1.5 bit quantization (llama/5453)

9c3aa6a
unverified

Kawrakow

ikawrakow committed on Feb 18, 2024

ggml : add ALiBi support for ggml_soft_max_ext (llama/5488)

26c019a
unverified

ggerganov HF Staff committed on Feb 19, 2024

cuda : print message when initialization fails (llama/5512)

1f047ca
unverified

slaren committed on Feb 15, 2024

CUDA: mul_mat_vec_q tiling, refactor mul mat logic (llama/5434)

c0cfa9b
unverified

JohannesGaessler slaren committed on Feb 11, 2024

CUDA: more warps for mmvq on NVIDIA (llama/5394)

7ab774c
unverified

JohannesGaessler committed on Feb 8, 2024

CUDA: fixed mmvq kernel for bs 2,3,4 and -sm row (llama/5386)

3ff7660
unverified

JohannesGaessler committed on Feb 7, 2024

CUDA: mul_mat_vec_q max. batch size 8 -> 4 (llama/5370)

7aa3216
unverified

JohannesGaessler committed on Feb 6, 2024

CUDA: mul_mat_vec_q for batch sizes > 1 (llama/5351)

ae45b38
unverified

JohannesGaessler committed on Feb 6, 2024

cuda : fix LLAMA_CUDA_F16 (llama/5262)

5fd8fb7
unverified

slaren committed on Feb 1, 2024

llava : add MobileVLM support (llama/5132)

f17a416
unverified

JidongZhang-THU slaren committed on Jan 31, 2024

sync : ggml (llama/0)

cdb7964
unverified

ggerganov HF Staff committed on Jan 30, 2024

SOTA 3-bit quants (llama/5196)

4649943
unverified

Kawrakow

ikawrakow committed on Jan 30, 2024

`ggml_cuda_cpy` support for 4d tensors and float16->float32 upcasting (ggml/686)

75d438c
unverified

John Balis slaren committed on Jan 29, 2024

ggml : add Vulkan backend (llama/2059)

5a97aba
unverified

OccamRazor

SlyEcho Concedo slaren

ggerganov HF Staff committed on Jan 28, 2024

cuda : fix tensor size calculation for non-split buffer (llama/5145)

8f3eb65
unverified

slaren committed on Jan 26, 2024

cuda : fix 2-bit quants on amd hip (llama/5105)

aadbd67
unverified

Engininja2 committed on Jan 24, 2024
