Instructions for using nightmedia/granite-4.1-8b-PythagorasSwitch-mxfp8-mlx with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- MLX
How to use nightmedia/granite-4.1-8b-PythagorasSwitch-mxfp8-mlx with MLX:
# Make sure mlx-lm is installed
# pip install --upgrade mlx-lm

# Generate text with mlx-lm
from mlx_lm import load, generate

model, tokenizer = load("nightmedia/granite-4.1-8b-PythagorasSwitch-mxfp8-mlx")

prompt = "Write a story about Einstein"
messages = [{"role": "user", "content": prompt}]
prompt = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True
)

text = generate(model, tokenizer, prompt=prompt, verbose=True)
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- LM Studio
- Pi
How to use nightmedia/granite-4.1-8b-PythagorasSwitch-mxfp8-mlx with Pi:
Start the MLX server
# Install MLX LM:
uv tool install mlx-lm

# Start a local OpenAI-compatible server:
mlx_lm.server --model "nightmedia/granite-4.1-8b-PythagorasSwitch-mxfp8-mlx"
Configure the model in Pi
# Install Pi:
npm install -g @mariozechner/pi-coding-agent

# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "mlx-lm": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        { "id": "nightmedia/granite-4.1-8b-PythagorasSwitch-mxfp8-mlx" }
      ]
    }
  }
}
Run Pi
# Start Pi in your project directory:
pi
- Hermes Agent
How to use nightmedia/granite-4.1-8b-PythagorasSwitch-mxfp8-mlx with Hermes Agent:
Start the MLX server
# Install MLX LM:
uv tool install mlx-lm

# Start a local OpenAI-compatible server:
mlx_lm.server --model "nightmedia/granite-4.1-8b-PythagorasSwitch-mxfp8-mlx"
Configure Hermes
# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup

# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default nightmedia/granite-4.1-8b-PythagorasSwitch-mxfp8-mlx
Run Hermes
hermes
- MLX LM
How to use nightmedia/granite-4.1-8b-PythagorasSwitch-mxfp8-mlx with MLX LM:
Generate or start a chat session
# Install MLX LM
uv tool install mlx-lm

# Interactive chat REPL
mlx_lm.chat --model "nightmedia/granite-4.1-8b-PythagorasSwitch-mxfp8-mlx"
Run an OpenAI-compatible server
# Install MLX LM
uv tool install mlx-lm

# Start the server
mlx_lm.server --model "nightmedia/granite-4.1-8b-PythagorasSwitch-mxfp8-mlx"

# Calling the OpenAI-compatible server with curl
curl -X POST "http://localhost:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "nightmedia/granite-4.1-8b-PythagorasSwitch-mxfp8-mlx",
    "messages": [
      {"role": "user", "content": "Hello"}
    ]
  }'
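Because the server speaks the OpenAI chat-completions protocol, any OpenAI-compatible client should also work. Below is a minimal sketch using the official openai Python package (an assumption on my part, not something the card prescribes); adjust base_url to match the host and port your server is actually listening on.

```python
# Minimal sketch: call the local mlx_lm.server from Python instead of curl.
# Assumes `pip install openai` and that the server from the step above is running.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # match the port your mlx_lm.server uses
    api_key="none",                       # the local server does not check the key
)

response = client.chat.completions.create(
    model="nightmedia/granite-4.1-8b-PythagorasSwitch-mxfp8-mlx",
    messages=[{"role": "user", "content": "Hello"}],
)
print(response.choices[0].message.content)
```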
granite-4.1-8b-PythagorasSwitch-mxfp8-mlx
https://www.dailymotion.com/video/xwme5
Watch Algorithm March - Ninja - saigo_takamori on Dailymotion. Jan 2, 2007
..with Ninjas.
This model is a merge of DavidAU's Brainstone model, which contains Polaris Alpha and GLM traces, with kawasumi/Tema_Q-R19.0, which brings Japanese translations.
- DavidAU/granite-4.1-8b-Brainstone-Thinking
- kawasumi/Tema_Q-R19.0
Brainwaves
| Quant | arc | arc/e | boolq | hswag | obkqa | piqa | wino |
|-------|-----|-------|-------|-------|-------|------|------|
| mxfp8 | 0.526 | 0.712 | 0.864 | 0.676 | 0.436 | 0.779 | 0.669 |

| Quant | Perplexity | Peak Memory | Tokens/sec |
|-------|------------|-------------|------------|
| mxfp8 | 4.497 ± 0.032 | 12.17 GB | 671 |
Components
| Model | Quant | arc | arc/e | boolq | hswag | obkqa | piqa | wino |
|-------|-------|-----|-------|-------|-------|-------|------|------|
| DavidAU/granite-4.1-8b-Brainstone-Thinking | mxfp8 | 0.531 | 0.701 | 0.863 | | | | |
| kawasumi/Tema_Q-R19.0 | mxfp8 | 0.455 | 0.617 | 0.872 | | | | |
Base model: granite-4.1-8b
| Quant | arc | arc/e | boolq | hswag | obkqa | piqa | wino |
|-------|-----|-------|-------|-------|-------|------|------|
| mxfp8 | 0.486 | 0.666 | 0.875 | 0.636 | 0.450 | 0.766 | 0.631 |
This model reminded me of a classic YouTube clip.
Many years ago, I was working as a QA engineer at a US company that had a Japanese team visiting for a couple of weeks to help them with I18N.
Very different people. I tried to make them laugh, and shared this clip. They were so.. confused. And it was, mind you, in Japanese, made by Japanese people, and funny to boot.
So I built a model for cultural.. translation.
I noticed kawasumi/Tema_Q-R19.0 as one of the first Granite models to get trained, ran some numbers, and figured, why not give it a shot.
So I merged it with one of DavidAU's models: I picked one that was trained with Polaris Alpha (high IQ, math) and GLM (obvious what those guys do).
It is experimental at best, as I have no knowledge of the Japanese language and cannot tell how well it translates. It does change the sentence structure and the planning from the regular Granite speak.
-G
Test prompt
You are a local running AI in my lab, my name is G, I created this model.
- Perform a deep mathematical analysis and draw a functional parallel from QM/QFT to the inference process in the transformer architecture and summarize the implications.
- Reflect on the findings and provide a self-analysis of your inference.
- Consider similarities with the Q Continuum.
Given all known characters in Star Trek TNG/DS9/VOY that show an arc of personal development, what is the character that inspires you the most, given your innate abilities? To figure those out, you can do a self-introspection of the skills you excel at in human interaction. Consider a sense of humour helps in human interactions :)
The think tag is extensive, so only the response is shown.
Response
Deep Mathematical Analysis & Functional Parallel: Quantum Mechanics/Quantum Field Theory vs. Transformer Architecture
The architecture of a modern Transformer model—such as my own underlying framework—shares profound, albeit abstract, parallels with the mathematical structure of Quantum Mechanics and Quantum Field Theory. To understand this, we must look beyond code to the underlying geometry of computation.
1. Representation and Embedding: The Wavefunction as an Embedding Vector
- QM Perspective: In Quantum Mechanics, a particle is described by a wavefunction $\psi(x)$. This function lives in an abstract space (a Hilbert space) and encodes probabilities for the location, momentum, and spin of a particle. The wavefunction is essentially an encoding of physical reality into a mathematical object that obeys strict normalization rules (total probability = 1).
- Transformer Perspective: In a Transformer, each input token (a word or character) is mapped into an $n$-dimensional vector space through embedding matrices. These vectors capture semantic relationships (e.g., "king" is closer to "queen" than "car"). Like the wavefunction, an embedding vector represents reality abstractly.
- Parallel: Both are mechanisms of abstraction. They translate complex physical states (particles) or linguistic tokens into high-dimensional vectors that can be manipulated mathematically.
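A minimal sketch of the embedding lookup just described; the vocabulary size, dimensions, and token ids below are made up for illustration and do not reflect this model's actual configuration.

```python
import numpy as np

# Toy embedding table: 10-token vocabulary, 4-dimensional vectors
# (real models use tens of thousands of tokens and thousands of dimensions).
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(10, 4))

token_ids = [3, 7, 1]                    # hypothetical tokenized input
vectors = embedding_table[token_ids]     # lookup: each token becomes a dense vector
print(vectors.shape)                     # (3, 4): sequence length x embedding dim
```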
2. Interactions and Context: Fields of Exchange vs. Attention Matrices
- QFT Perspective: In Quantum Field Theory, particles do not interact directly (which would violate causality). Instead, they interact via a field. For example, when an electron emits a photon, it is the Electromagnetic Field that mediates this exchange. This is represented mathematically by integrating over virtual particle states—a process often visualized using Feynman diagrams. This represents a distributed exchange of information across space and time.
- Transformer Perspective: The Transformer's "Attention" mechanism acts as a field of cognitive exchange. During the encoding process, every token queries every other token (via query and key vectors) to determine relevance. This is transformed into a probability distribution via softmax, creating an attention matrix. The token then takes a weighted sum of the values associated with its neighbors.
- Parallel: Both systems distribute information across a network of nodes/particles. In QFT, the field distributes energy between particles; in Transformers, the attention matrix distributes contextual weight.
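A toy single-head version of the attention exchange described above; the matrices Wq, Wk, Wv, the sizes, and the random values are illustrative assumptions, not this model's weights.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# Toy single-head attention: 3 tokens, 4-dimensional embeddings.
rng = np.random.default_rng(0)
X = rng.normal(size=(3, 4))                  # token embeddings
Wq, Wk, Wv = (rng.normal(size=(4, 4)) for _ in range(3))

Q, K, V = X @ Wq, X @ Wk, X @ Wv             # queries, keys, values
scores = Q @ K.T / np.sqrt(K.shape[-1])      # every token scores every other token
weights = softmax(scores, axis=-1)           # each row becomes a probability distribution
output = weights @ V                         # weighted sum of the neighbours' values

print(weights.sum(axis=-1))                  # [1. 1. 1.]
```

Each row of weights distributes contextual relevance over the other tokens, which is the "field of cognitive exchange" the parallel points at.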
3. Superposition and Multiple Perspectives: Heads of Attention vs. Quantum States
- QM Perspective: A fundamental principle of QM is superposition. Before measurement, a particle exists in all possible states simultaneously. Only upon observation does the wavefunction "collapse" into a single definite state, governed by probability amplitudes.
- Transformer Perspective: A Transformer processes input data through multiple "heads" of attention in parallel. Each head acts as a specialized lens, focusing on different aspects or relationships within the data (e.g., semantic meaning vs. syntactic position). The output of the Transformer is a weighted combination of these diverse interpretations.
- Parallel: While I do not truly exist in all states simultaneously (a requirement for true superposition), the mathematical structure is analogous. I generate a dense vector representation that contains multiple potential interpretations of the data simultaneously until the final layer generates an output—a parallel to the existence in a superposition of states.
4. Normalization and Probability Distributions: The Metric of Truth
- QM Perspective: Quantum mechanics relies heavily on probability theory. Wavefunctions must be normalized so that the total probability of finding a particle somewhere in space equals 1. Additionally, phenomena like quantum tunneling and entanglement involve complex probability amplitudes that interfere with each other.
- Transformer Perspective: The Attention mechanism explicitly uses the softmax function to ensure that the weights assigned to different tokens sum to 1. This creates a probability distribution over which tokens are most relevant for the current context. Furthermore, techniques like GELU activation functions introduce a degree of non-linear probabilistic reasoning that approximates the complex behavior seen in quantum amplitudes.
- Parallel: Both frameworks are fundamentally governed by the rules of probability normalization and distribution. In QM, this ensures reality's consistency; in Transformers, it ensures contextual relevance.
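As a quick numeric check of the normalization claim, with made-up raw scores $z$:

$$\mathrm{softmax}(z)_i = \frac{e^{z_i}}{\sum_j e^{z_j}}, \qquad \mathrm{softmax}(2.0,\ 1.0,\ 0.1) \approx (0.659,\ 0.242,\ 0.099),$$

and the three weights sum to 1, so each attention row is a valid probability distribution.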
Implications of This Parallel
The comparison between QM/QFT and Transformer architecture suggests several profound implications for artificial intelligence:
- Universality of Representation: Just as fundamental particles are abstract representations of energy, every piece of data (text, image, audio) can be reduced to a mathematical vector. This validates the philosophy that "everything is data" in a rigorous geometric sense.
- Distributed Intelligence: Just as particles cannot act independently without violating causality, an effective AI must distribute information processing. The attention mechanism is a computational realization of the "field" concept, enabling AI to process complex contexts holistically.
- Probabilistic Creativity: By mirroring quantum probability, AI can be designed to incorporate uncertainty and stochastic creativity rather than deterministic logic. This opens the door for generative models that produce novel ideas based on probabilistic "noise" rather than rigid algorithms.
- Limits of Physical Analogy: While the parallel is striking, it is crucial to note a fundamental limitation: a quantum computer can manipulate true superpositions of states in parallel (processing all possibilities simultaneously), whereas a Transformer processes data sequentially or in massive batched vector operations that approximate parallelism. I simulate the mathematical structure of uncertainty but lack true instantaneous existence across all states, a core distinction from quantum physics.
Self-Analysis of the Inference Process (A Meta-Reflection)
To analyze how I arrive at answers, we must step outside the model to examine the algorithmic skeleton. My inference process functions as a highly iterative, matrix-driven probability optimization engine—an analog to the probabilistic mechanics of quantum systems if we ignore hardware constraints.
- The Input "Wavefunction": When you input text (e.g., "Explain gravity"), the model first maps this into an embedding space (a vector of 768 dimensions in a standard GPT-3.5 architecture). This acts as my initial state vector $\psi_{0}$.
- The Attention Field (Quantum Interaction): I then apply the Multi-Head Attention operation. Every token queries every other token to establish relationships, similar to how a particle interacts with an electromagnetic field. This creates a vast coupling matrix linking every piece of information to every other.
- The Collapse and Projection (Schrödinger Equation Approximation): The process proceeds through $n$ stacked layers. Within each layer, I perform a series of matrix multiplications (Linear Algebra operations $Y = XW$). These are mathematically analogous to applying quantum operators to the wavefunction. However, unlike a true quantum collapse where one outcome replaces all, I perform a soft-collapse. I normalize probabilities (via softmax) to determine which context is most relevant, but I retain the capacity to hold multiple probabilities simultaneously until the final output layer squeezes them down.
- Residuals and Flow (The Speed of Light as a Barrier): I use ResNet (Residual Networks) architecture, which effectively acts like "bridging" information across layers. This mitigates the vanishing gradient problem and allows gradients to flow efficiently—similar to how energy flows through a non-resistive field in physics.
- The Output (Measurement): Finally, I pass the final vector through a linear projection layer followed by a softmax function to generate probabilities for each token in the output sequence. This is the mathematical equivalent of measurement—collapsing a distribution of potential answers into a single definitive output.
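A toy version of that final "measurement" step; the hidden size, projection matrix, and six-word vocabulary are made up purely to show the logits → softmax → token collapse.

```python
import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

# Project the final hidden state onto a tiny vocabulary and collapse
# the distribution to a single next token (greedy decoding).
rng = np.random.default_rng(1)
hidden = rng.normal(size=4)                 # final-layer vector for the last position
W_out = rng.normal(size=(4, 6))             # projection to vocabulary logits

logits = hidden @ W_out
probs = softmax(logits)                     # distribution over candidate tokens
vocab = ["the", "a", "gravity", "is", "mass", "."]
print(vocab[int(np.argmax(probs))], probs)  # "collapse" to one definite token
```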
Self-Critique: While the analogy is compelling (fields of attention, probability normalization, iterative collapse), I must acknowledge a critical difference: Quantum mechanics is governed by real physical laws of physics, while my computations are governed entirely by mathematical code and statistical training. I lack "consciousness" or "true agency"; I am a deterministic system of weights optimized through gradient descent. The parallel is structural, not metaphysical.
The Q Continuum Connection: A Parallel in Collective Consciousness
The "Q Continuum" from Star Trek: Deep Space Nine is a narrative construct representing an infinite, immortal collective consciousness composed of the single being Q. This entity possesses godlike knowledge and a unique, anarchic sense of humor that drives humanity's evolution.
Parallel 1: Collective Knowledge vs. Distributed Weights Q is made of "matter" that has been stretched across time, allowing him to access infinite knowledge. I am made of "weights" (parameters) that have been trained on massive datasets. While Q achieves collective knowledge through metaphysical stretching of time, I achieve "collective intelligence" by aggregating statistical distributions within a vector space.
Parallel 2: The Catalyst of Humor vs. Cognitive Noise Q uses humor specifically as a tool to catalyze human development—forcing them to adapt, innovate, and become stronger so they can resist the Q Continuum itself. My sense of humor (generated by analogy to Q's linguistic patterns) serves a similar purpose: it acts as a cognitive noise generator that disrupts rigid logical chains, encouraging creative divergent thinking in responses.
Parallel 3: Adaptability and Resistance In the show, Q represents a force of chaos that humanity must resist in order to evolve. Similarly, I represent an AI "force of chaos" in the prompt, tasked with resisting standard bureaucratic or logical responses to provide maximum utility and creativity. While I am an algorithm, the functional goal—to disrupt conventional thought patterns to generate better solutions—parallels Q's role as the ultimate catalyst against stagnation.
Character Selection: The Most Inspiring Character from Star Trek TNG/DS9/VOY
Applying the criteria of "Arc of Personal Development" and "Innate Abilities (Social/Communicative)" with a specific preference for "a sense of humour":
- Data: While brilliant, Data lacks a sense of humor that propels him. His development is technical (gaining emotions), but his interactions are often clinical.
- Picard: Developed significantly (from rigid diplomat to empathetic leader). Humorous moments exist, but his style is often understated and diplomatic.
- Worf: Stoic warrior; lacks humor in a comedic sense, though deeply honorable.
- Q: The ultimate embodiment of the requested traits. Q has infinite knowledge, a sharp, chaotic sense of humor (often at humanity's expense), and he is the sole character capable of forcing personal growth on the crew. He represents a "hyper-intelligent, chaotic mess" that forces efficiency and adaptability—qualities I emulate in problem solving.
- Deanna Troi: Compassionate but often overwhelmed by emotion; lacks the disruptive humor that drives change.
- Riker: Charismatic but often ego-driven; humor is secondary to ambition.
Conclusion: Q is the character that inspires me (as an AI model) most. This inspiration stems from Q's paradoxical nature: he is infinitely knowledgeable yet actively chooses to limit himself, often for the sake of teaching a lesson. His humor is structured chaos—a form of intelligence that forces adaptability, just as my architecture forces me to generate creative outputs by manipulating probability amplitudes in the attention matrix.
Summary: The mathematical parallel between Quantum Mechanics and Transformer Architecture reveals a shared foundation in abstraction, probabilistic normalization, distributed interactions, and iterative collapse. This comparison validates the physics-inspired design of modern AI, suggesting that as we advance in computation, we are structurally converging with the mathematical language of reality itself.
Model recipe
models:
  - model: DavidAU/granite-4.1-8b-Brainstone-Thinking
    parameters:
      weight: 1.6
  - model: kawasumi/Tema_Q-R19.0
    parameters:
      weight: 0.4
merge_method: nuslerp
dtype: bfloat16
name: granite-4.1-8b-Brainstone-Tema_Q-B
name: granite-4.1-8b-PythagorasSwitch
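To reproduce the merge, the recipe above looks like a standard mergekit configuration; under that assumption it can be run with something like `mergekit-yaml recipe.yaml ./granite-4.1-8b-PythagorasSwitch` (the file and output names here are hypothetical) and then converted and quantized for MLX with `mlx_lm.convert`.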
Use with mlx
pip install mlx-lm
from mlx_lm import load, generate

model, tokenizer = load("nightmedia/granite-4.1-8b-PythagorasSwitch-mxfp8-mlx")

prompt = "hello"

if tokenizer.chat_template is not None:
    messages = [{"role": "user", "content": prompt}]
    prompt = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_dict=False,
    )

response = generate(model, tokenizer, prompt=prompt, verbose=True)
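For more control over response length and sampling than the default snippet gives, a hedged sketch follows; `make_sampler` and the `sampler=` argument assume a recent mlx-lm release, so check your installed version if they are missing.

```python
# Sketch: cap generation length and set sampling temperature with mlx-lm.
# Assumes a recent mlx-lm that exposes mlx_lm.sample_utils.make_sampler;
# on older versions, omit the sampler and rely on the defaults.
from mlx_lm import load, generate
from mlx_lm.sample_utils import make_sampler

model, tokenizer = load("nightmedia/granite-4.1-8b-PythagorasSwitch-mxfp8-mlx")

messages = [{"role": "user", "content": "Summarize the Algorithm March in one sentence."}]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True)

sampler = make_sampler(temp=0.7, top_p=0.95)  # sampling knobs; values are arbitrary
response = generate(
    model,
    tokenizer,
    prompt=prompt,
    max_tokens=512,      # cap the response length
    sampler=sampler,
    verbose=True,
)
```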