fromthesky committed on Sep 19, 2025

Commit

637fb33

1 Parent(s): 2aed2e6

Updated readme

Updated transformers version in generation_config.json

Files changed (2)

README.md +10 -8
generation_config.json +1 -1

README.md CHANGED

@@ -52,9 +52,11 @@ pipeline = pipeline(
     trust_remote_code=True
 )
-prompt=('One time they had a drumming contest, and I didn’t do very well: '
-        'They said my drumming was "too intellectual"; theirs was much more pulsing.')
-output=pipeline(prompt, top_p=0.6, top_k=0, temperature=1, do_sample=True, use_cache=True, max_new_tokens=100)
+prompt="The quick brown fox jumps over the lazy dog."
+output=pipeline(prompt, top_p=0.6, top_k=0, temperature=1, do_sample=True,
+                tokenizer_encode_kwargs={"add_special_tokens":False},
+                use_cache=True, max_new_tokens=100)
 print(output[0]["generated_text"])
 ```
@@ -71,9 +73,9 @@ tokenizer=AutoTokenizer.from_pretrained(pretrained_model_name_or_path="fromthesk
                                         legacy=False,
                                         trust_remote_code=True
                                        )
-prompt=('One time they had a drumming contest, and I didn’t do very well: '
-        'They said my drumming was "too intellectual"; theirs was much more pulsing.')
+prompt="The quick brown fox jumps over the lazy dog."
 inputs = tokenizer([prompt], return_tensors="pt").to(device=device)
 generated_ids = model.generate(**inputs,
                                max_new_tokens=100,
@@ -85,7 +87,6 @@ generated_ids = model.generate(**inputs,
                               )
 print(tokenizer.decode(generated_ids[0], skip_special_tokens=True))
 ```
-<sup>\*</sup> `prompt` string is a quote from Richard Feynman in Surely You're Joking, Mr. Feynman! Adventures of a Curious Character.
 #### PLDR-LLM specific configurations:
 - `custom_G_type`: `None` for learned G values during pretraining, `'identity'` for LLM with SDPA equivalent, `'random'` for G values from a random normal distribution, `'external'` for custom G values that can be assigned after model initialization. This setting is more important for training purposes, for inference it is set in the model config.json file.
@@ -109,7 +110,8 @@ See config.json for other model configuration details.
       pip install -e ".[dev]"
 ```
 - Static cache is not supported for models with `custom_G_type=None`.
-- When `add_bos_token=False` and `add_eos_token=False` are set for the tokenizer model, prompt `""` is an invalid input for single batch inference as it doesn't contain any tokens. When padding is enabled, batched inference with prompt `""` as one of the samples causes its `input_ids` to be pad tokens and `attention_mask` to be all zeros. This edge case is handled differently for `_attn_implementation='eager'` and `'sdpa'`, resulting in different generation outputs for this prompt. Setting `add_bos_token=True`, `add_eos_token=True` or explicitly providing prompt as `"[PAD]"`, `"[START]"`, or `"[END]"` gives same output for either implementation. This issue does not affect KV-cache and G-cache.
+- PLDR-LLM uses EOS token `"[END]"` during pretraining to indicate end of a sequence. For text generation, we do not need to add the EOS token to the prompt. To achieve this, `add_eos_token=False` can be set in `tokenizer_config.json` file or while initializing the tokenizer model. For text generation `pipeline` call method, `tokenizer_encode_kwargs={"add_special_tokens":False}` can be used.
+- When `add_bos_token=False` and `add_eos_token=False` are set for the tokenizer model, prompt `""` is an invalid input for single batch inference as it doesn't contain any tokens. When padding is enabled, batched inference with prompt `""` as one of the samples causes its `input_ids` to be pad tokens and `attention_mask` to be all zeros. This edge case is handled differently for `_attn_implementation='eager'` and `'sdpa'`, resulting in different generation outputs for this prompt. Setting `add_bos_token=True`, `add_eos_token=True` or explicitly providing prompt as `"[PAD]"`, `"[START]"`, or `"[END]"` gives same output for either implementation. This issue does not affect KV-cache and G-cache.
 ### Via Original Implementation
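
The README note added in this commit about not appending the EOS token to a generation prompt can be illustrated with a minimal sketch. It does not call the real tokenizer: `wrap_prompt` is a hypothetical helper, the prompt ids are made up, and the special-token ids are the ones listed in `generation_config.json` (that id 2 corresponds to `"[START]"` is an assumption).

```python
# Special-token ids as listed in generation_config.json in this commit.
BOS_ID = 2  # assumed to correspond to "[START]"
EOS_ID = 3  # EOS token, "[END]" per the README note


def wrap_prompt(token_ids, add_bos_token=False, add_eos_token=False):
    """Hypothetical stand-in for how a tokenizer wraps a prompt
    with special tokens; not the model's actual tokenizer."""
    ids = list(token_ids)
    if add_bos_token:
        ids = [BOS_ID] + ids
    if add_eos_token:
        ids = ids + [EOS_ID]
    return ids


# Made-up ids standing in for a tokenized prompt.
prompt_ids = [101, 57, 988]

with_eos = wrap_prompt(prompt_ids, add_eos_token=True)
without_eos = wrap_prompt(prompt_ids, add_eos_token=False)

# With add_eos_token=True the prompt already ends in the end-of-sequence
# marker, which is what the README note advises avoiding for generation.
print("with EOS:   ", with_eos)
print("without EOS:", without_eos)
```

In this sketch only the `without_eos` form leaves the sequence open for the model to continue, which is the effect the README attributes to `add_eos_token=False` or `tokenizer_encode_kwargs={"add_special_tokens":False}`.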

generation_config.json CHANGED

@@ -3,5 +3,5 @@
   "bos_token_id": 2,
   "eos_token_id": 3,
   "pad_token_id": 0,
-  "transformers_version": "4.55.2"
+  "transformers_version": "4.56.1"
 }
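
The empty-prompt edge case described in the README change can be reproduced without loading the model. The sketch below pads a batch using the `pad_token_id` of 0 shown in `generation_config.json`. `pad_batch` is a hypothetical helper, the token ids are made up, and left padding is an assumption, since the README does not state the padding side.

```python
PAD_ID = 0  # pad_token_id from generation_config.json


def pad_batch(batch, pad_id=PAD_ID, padding_side="left"):
    """Hypothetical helper: pad token-id lists to a common length and
    build the matching attention masks (1 = real token, 0 = padding)."""
    max_len = max(len(seq) for seq in batch)
    input_ids, attention_mask = [], []
    for seq in batch:
        n_pad = max_len - len(seq)
        pads, zeros, ones = [pad_id] * n_pad, [0] * n_pad, [1] * len(seq)
        if padding_side == "left":
            input_ids.append(pads + list(seq))
            attention_mask.append(zeros + ones)
        else:
            input_ids.append(list(seq) + pads)
            attention_mask.append(ones + zeros)
    return input_ids, attention_mask


# Second sample is the prompt "" tokenized with add_bos_token=False and
# add_eos_token=False, so it contains no tokens at all.
batch = [[101, 57, 988], []]
input_ids, attention_mask = pad_batch(batch)

for ids, mask in zip(input_ids, attention_mask):
    print(ids, mask)
```

The second row consists only of pad tokens with an all-zero mask, so no position in it can be attended to. That is the input the README says the `'eager'` and `'sdpa'` attention implementations handle differently.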