Need a 5% and 15% REAP

#10
by nawoalanor - opened

By my math a 15% reap should allow this to run on stacked DGX Sparks in FP8 with full 200K context.

  • 14-15% reap should give enough room for FP16 kv-cache
  • 4-5% reap for 8-bit
  • If 4.5-bit turboquant ever materializes in useful form it should just barely fit with no reap
  • If you hate yourself 4-bit will fit as well, needless to say
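The math behind the bullets above can be sketched as a quick script. All the numbers here are rough assumptions for illustration: 230 GB FP8 weights (the figure mentioned later in this thread), two stacked 128 GB Sparks, and guessed KV-cache and runtime-overhead sizes at 200K context, not measured values.

```python
# Back-of-envelope memory check: FP8 weights after a given reap
# percentage, plus KV cache and overhead, vs. two stacked Sparks.

SPARK_GB = 128           # unified memory per Spark (assumed)
NODES = 2                # stacked pair
MODEL_FP8_GB = 230       # approximate FP8 weight size (assumed)

def headroom_gb(reap_pct: float, kv_cache_gb: float,
                overhead_gb: float = 10) -> float:
    """GB left over after weights, KV cache, and runtime overhead."""
    weights = MODEL_FP8_GB * (1 - reap_pct / 100)
    return SPARK_GB * NODES - (weights + kv_cache_gb + overhead_gb)

# Hypothetical KV-cache sizes at 200K context:
print(round(headroom_gb(15, kv_cache_gb=30), 1))  # 15% reap, FP16 KV: 20.5
print(round(headroom_gb(5, kv_cache_gb=15), 1))   # 5% reap, 8-bit KV: 12.5
```

Swap in real weight and KV-cache sizes for the actual model to see which reap/quant combinations clear zero headroom.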
nawoalanor changed discussion title from Need a 15% REAP to Need a 5% and 15% REAP

I can try running REAP on m2.7. Nobody's done it yet, but the Cerebras code already supports m2.5, so it should be close. I'll start with 15% and see how it goes.

No REAPs, ever! Don't lobotomize the model unless you know its inner capabilities and activations. Cerebras quants have forever been hit-or-miss!

A 5% reap is inconsequential to intelligence but will make it possible to run the model on a fairly common platform with native hardware acceleration and acceptable kv-cache quantization. The alternative is taking a FP8 model and quantizing it a second time to INT4 (which also means no native acceleration), or - worse - using Q4 kv-cache. These are surely far more destructive than the loss of things like "how to translate from German to Estonian", the name of HP Lovecraft's cat, the number of guardhouses on the Great Wall of China, etc.

A 15% reap has also been shown to have very little effect, especially if you choose a targeted dataset to retain, such as programming, tool-calling, math, etc.

You start running into problems when you try doing something like a 30%, 40%, or 50% reap.
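To illustrate why a targeted reap isn't a shot in the dark, here is a toy sketch of the selection step: rank experts by how strongly the router activates them on a calibration set, then drop the least-used tail. This is not the actual Cerebras REAP algorithm; every name and number below is invented for illustration.

```python
import numpy as np

def rank_experts(router_probs: np.ndarray, keep_fraction: float) -> np.ndarray:
    """Toy expert pruning: score each expert by its mean router weight
    over a calibration set and keep the top fraction; the least-activated
    experts are the ones a reap would drop."""
    scores = router_probs.mean(axis=0)             # saliency per expert
    num_keep = max(1, int(router_probs.shape[1] * keep_fraction))
    keep = np.argsort(scores)[::-1][:num_keep]     # highest-scoring experts
    return np.sort(keep)

# Fake calibration routing over 64 experts; a ~5% reap keeps 60 of them.
rng = np.random.default_rng(0)
probs = rng.dirichlet(np.ones(64), size=1000)
kept = rank_experts(probs, keep_fraction=0.95)
print(len(kept))  # 60
```

The choice of calibration data is the whole game: run it on programming and tool-calling traffic and the dropped experts are the ones those workloads never touch.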

> A 5% reap is inconsequential to intelligence but will make it possible to run the model on a fairly common platform with native hardware acceleration. The alternative is taking a FP8 model and quantizing it a second time to INT4, which is surely far more destructive than the loss of things like "how to translate from German to Estonian", the name of HP Lovecraft's cat, etc.

I'd rather not turn this into a flame war, but logic is logic: stabbing someone (an attacker) in the dark is inconsequential to life because... reasons!

> The alternative is taking a FP8 model and quantizing it a second time to INT4, which is surely far more destructive than...

I am pretty sure there is experience and logic behind what you're saying, but I am a mere mortal...

I'm not sure what your first sentence is meant to imply but I'll try my best to interpret and respond: REAP is a targeted process, not a shot in the dark.

Unlike most models, this one is distributed in pre-quantized FP8 rather than FP16. Quantizing a model twice is (arguably) worse than a 5% reap on experts irrelevant to the intended use-case, which for most of us is programming. The net effect isn't that much different from a fine-tune. I want to use this for programming, I do not care about random trivia, how to translate between languages, the smell of a beaver's anal glands, etc. Precision of data is more important to my use-case than volume of data. I do not want it to hallucinate libraries or what happens when you combine data types.

Think of it like compressing a JPG twice versus just cropping off the bottom 5% of the image. We lose the feet of that fat greasy guy leering in the distant background instead of degrading the quality of the image as a whole.

I would quite happily perform the REAP myself but I have only Sparks (128GB) and my PC at my disposal. My PC technically has enough RAM & VRAM but I seriously doubt it would work. Detailed documentation on REAP is sparse and I'm not sure how well my PC would run with just 240 GB to reap a 230 GB model with... edit: Claude seems to think it might be possible, I'll give it a shot.

I have a feeling you're treating PRUNING as if it were the same in scale and result as building an imatrix from a model's expert activations... which, at least for the REAP quants created by Cerebras and others, is not the case, unless you ask them to use a custom dataset suited to your use case. Quantizing from FP8 to INT4 has obvious issues when done bluntly, with no respect for (read: calculation of) expert activations, BUT that is not ablation (as in REAP); it may be detrimental overall, but it is not use-case specific. PRUNING is permanent ablation of the model's IQ based on misinterpreted expert-activation frequency, with unforeseen consequences.

I'm going to make a 75% reap now just for spite
