can't run inference on multi GPU
#8
by daryl149 - opened
Works on a single A6000:
from transformers import LlamaTokenizer, LlamaForCausalLM, TextStreamer

tokenizer = LlamaTokenizer.from_pretrained("oasst-rlhf-2-llama-30b")
# Single GPU: "sequential" fills devices in order; 8-bit weights, overflow offloaded to disk.
model = LlamaForCausalLM.from_pretrained(
    "oasst-rlhf-2-llama-30b",
    device_map="sequential",
    offload_folder="offload",
    load_in_8bit=True,
)
streamer = TextStreamer(tokenizer, skip_prompt=True)
message = "<|prompter|>This is a demo of a text streamer. What's a cool fact about ducks?<|assistant|>"
inputs = tokenizer(message, return_tensors="pt").to(model.device)
tokens = model.generate(**inputs, max_new_tokens=500, do_sample=True, temperature=0.9, streamer=streamer)
Throws an error on 2 V100S cards (hosting 17GB of model weights each):
from transformers import LlamaTokenizer, LlamaForCausalLM, TextStreamer

tokenizer = LlamaTokenizer.from_pretrained("oasst-rlhf-2-llama-30b")
# Multi GPU: device_map="auto" shards the layers across both cards.
model = LlamaForCausalLM.from_pretrained(
    "oasst-rlhf-2-llama-30b",
    device_map="auto",
    offload_folder="offload",
    load_in_8bit=True,
)
streamer = TextStreamer(tokenizer, skip_prompt=True)
message = "<|prompter|>This is a demo of a text streamer. What's a cool fact about ducks?<|assistant|>"
inputs = tokenizer(message, return_tensors="pt").to(model.device)
tokens = model.generate(**inputs, max_new_tokens=500, do_sample=True, temperature=0.9, streamer=streamer)
throws:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "myvenv/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "myvenv/lib/python3.10/site-packages/transformers/generation/utils.py", line 1558, in generate
return self.sample(
File "myvenv/lib/python3.10/site-packages/transformers/generation/utils.py", line 2641, in sample
next_tokens = torch.multinomial(probs, num_samples=1).squeeze(1)
RuntimeError: probability tensor contains either `inf`, `nan` or element < 0
The only difference is that I'm using device_map="auto" to spread the model across both GPUs. (The same error also occurs with .to('cuda'), .to(0), or .to(1) instead of .to(model.device).)
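For context on the traceback above: torch.multinomial refuses to sample from a distribution containing inf, nan, or negative entries, and a single bad logit poisons the entire softmax that generate() samples from. A minimal, GPU-free sketch of that failure mode (pure Python softmax; not the actual transformers code path):

```python
import math

def softmax(logits):
    # Standard max-subtracted softmax. If any logit is nan, the
    # normalizing sum becomes nan, so EVERY probability comes out nan --
    # which is exactly what torch.multinomial then rejects.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

clean = softmax([1.0, 2.0, 3.0])
print(abs(sum(clean) - 1.0) < 1e-9)  # True: a valid distribution

poisoned = softmax([1.0, float("nan"), 3.0])
print(all(math.isnan(p) for p in poisoned))  # True: whole row is nan
```

So the RuntimeError is a symptom: the nan is already present in the logits coming out of the sharded 8-bit forward pass, and sampling is merely where it gets caught.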
ah, there's an open bug in transformers for it:
https://github.com/huggingface/transformers/issues/22914
daryl149 changed discussion status to closed
Update:
The inf/nan error is caused by the combination of CUDA 11.8 and bitsandbytes==0.38.1. It's solved by downgrading to CUDA 11.6 and bitsandbytes 0.31.8.
However, multi-GPU inference is still broken: with load_in_8bit=True the model returns gibberish. See the issue I opened in transformers: https://github.com/huggingface/transformers/issues/23989
daryl149 changed discussion status to open