
Artist style becomes inconsistent with multiple artists / longer prompts — detailed analysis & suggestions

#112
by Henry-Yan - opened


TL;DR

I've been experimenting with Anima 2B and noticed that artist tags produce somewhat inconsistent style results depending on prompt context — especially with multiple artists or longer prompts. I ran some controlled experiments to understand why, capturing and replacing the adapter's output vectors at artist token positions. The main observations: (1) multi-artist prompts appear to cause the most significant drift, likely through Qwen3's causal attention allowing preceding tokens to permeate the artist representation; (2) the adapter's use of RoPE seems to introduce additional position-dependent variation even with identical prompt content; (3) locking the artist vector to a stable reference partially recovers the original style, which suggests that a fixed-embedding approach for artist tokens might be worth exploring. Details, comparison images, and some thoughts on potential directions below.


Origin

While generating images with Anima 2B (the current preview2 release), I noticed a phenomenon: the same artist tag would produce inconsistent style results under different prompt combinations. Swapping out a few quality tags, adding another artist, or moving the artist tag to a different position in the prompt would cause subtle -- or sometimes dramatic -- shifts in the resulting art style.

Many people in the community say "it works fine for me," and they may not be wrong -- under a fixed prompt template with a single artist and commonly used quality tags, things do stay stable. But what I wanted to understand was: is that stability guaranteed by the architecture, or does it just happen that common usage patterns avoid the problem?


Reasoning

Why are artist tags relatively stable in SDXL/Illustrious?

The SDXL family uses dual CLIP text encoders (CLIP-L with 12 layers + OpenCLIP-G with 32 layers). I want to clear up a common misconception: CLIP's text encoder actually uses causal attention too, and what gets sent to the UNet is the contextual hidden states from the second-to-last layer, not the initial fixed embeddings. So strictly speaking, CLIP's artist token representations are not entirely context-free either.

However, in my long-term experience using the SDXL/Illustrious family, artist tag style stability and multi-artist blending capability have been somewhat better than what I observed with Anima preview2 in these experiments. I suspect the most likely reasons are:

Visual grounding. CLIP was trained with contrastive learning on hundreds of millions of image-text pairs. The subword tokens for wlop were repeatedly paired with images in wlop's art style during training, and the contrastive loss pushed these token representations toward a vector region "aligned with wlop's visual features." This likely creates an anchoring effect -- even when causal attention introduces contextual perturbation, the token representations are not easily pulled out of this region. By contrast, Qwen3 was pretrained on pure text. @wlop is just a few subword fragments with no visual meaning to it, lacking this kind of visual anchoring.

Complementary dual-path architecture. SDXL takes the second-to-last layer hidden states from both CLIP-L and OpenCLIP-G, concatenates them into a 2048-dimensional vector for cross-attention. The two CLIP paths encode the same text independently -- if one path's contextual drift leans in a particular direction, the other may drift in a different direction or remain stable, potentially reducing the variance of the combined signal received by the UNet. This is speculation, though; I have not run controlled experiments to verify it.

What did Anima change?

Anima replaced CLIP with Qwen3-0.6B (a causal LLM), with a 6-layer LLM Adapter bridging to the DiT. My concerns were:

  1. Qwen3 was pretrained on pure text and has never seen any images. @wlop is just a concatenation of subword tokens to it, with no visual semantics
  2. Causal attention means that artist token representations are permeated by information from all preceding tokens -- change the quality tags or add another artist, and the downstream artist embedding changes
  3. 0.6B parameters is relatively compact for simultaneously handling natural language understanding and semantic mapping for 60,000 artists

That said, Anima does have a 6-layer LLM Adapter, and tdrussell noted on the model page that it "contains a surprising amount of knowledge." This adapter's self-attention is bidirectional, which in theory could partially correct the contextual bias in Qwen3's causal output. So the question becomes: how much can the adapter correct? And what happens when it is not enough?

Architecture specifics (confirmed from source code)

By reading the ComfyUI source code (comfy/ldm/anima/model.py), I confirmed the complete pipeline:

  • The same prompt is tokenized separately by the T5 tokenizer and the Qwen3 tokenizer, producing different numbers of tokens
  • Qwen3's hidden states serve as the KV for the adapter's cross-attention
  • T5 token IDs go through the adapter's own embedding layer to generate Q
  • After 6 layers of self-attn + cross-attn + MLP, the output is a 1024-dimensional vector, padded to length 512
  • DiT's cross-attention does not apply positional encoding to text tokens -- text is an unordered set as far as the DiT is concerned

Experiments

Experiment 1: Vector Stability Sampling

I independently loaded Qwen3 + Adapter and constructed 5,000 prompts with different contexts for each artist (covering pure tags, natural language, tag+NL hybrids, complex compositions, etc.), ran forward passes to collect the adapter output vectors at artist token positions, and computed the cosine similarity of all vectors against the mean anchor.

Results (8 artists x 5,000 variants):

Artist                        T5 tokens   mean cos   p1 (worst 1%)   min
@tianliang duohe fangdongye   13          0.978      0.950           0.927
@gsusart                      5           0.965      0.929           0.919
@zhibuji loom                 9           0.964      0.919           0.899
@vanripper                    4           0.962      0.909           0.873
@guweiz                       5           0.951      0.878           0.857
@solipsist                    4           0.944      0.898           0.875
@mento                        3           0.943      0.883           0.866
@wlop                         3           0.939      0.883           0.859

Longer names (more T5 tokens) are more stable -- averaging over more subword tokens acts as a smoothing effect.
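For concreteness, the stability metric used here (cosine similarity of each captured vector against the mean anchor, plus the p1 and min statistics in the table) can be sketched as follows. This is a minimal illustration; stability() and cosine() are illustrative names, not part of my actual tooling:

```python
import math

def cosine(u, v):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(y * y for y in v))
    return dot / (nu * nv)

def stability(vectors):
    """vectors: one adapter output vector per prompt variant.
    Returns (mean cos, p1 worst-1% boundary, min) against the mean anchor."""
    dim = len(vectors[0])
    anchor = [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]
    sims = sorted(cosine(v, anchor) for v in vectors)
    p1 = sims[max(0, int(0.01 * len(sims)) - 1)]  # worst-1% boundary
    return sum(sims) / len(sims), p1, sims[0]

# Toy sanity check: identical vectors are perfectly stable.
mean_cos, p1, mn = stability([[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]])
print(mean_cos, p1, mn)  # 1.0 1.0 1.0
```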

Experiment 2: Visual Verification -- Vector Replacement Image Generation

Using custom ComfyUI tooling, I captured the actual adapter output vectors at artist token positions during ComfyUI's real pipeline, then replaced them in a target prompt and generated images at a fixed seed for comparison.
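The capture-and-replace mechanism can be illustrated framework-agnostically. The real tooling hooks ComfyUI's pipeline; the names and list-based "vectors" below are illustrative only:

```python
captured = {}

def artist_vector_hook(vectors, artist_positions, mode, key="@wlop"):
    """vectors: per-token adapter output vectors (as lists); artist_positions:
    sequence indices of the artist's tokens. Illustrative, not ComfyUI's API."""
    if mode == "capture":
        captured[key] = [vectors[i][:] for i in artist_positions]
    elif mode == "replace":
        for i, vec in zip(artist_positions, captured[key]):
            vectors[i] = vec[:]
    return vectors

# Capture from a source pass, inject into a target pass at the same positions:
source = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]   # e.g. a multi-artist context
artist_vector_hook(source, artist_positions=[2], mode="capture")
target = [[9.0, 9.0], [9.0, 9.0], [9.0, 9.0]]   # e.g. the solo-artist prompt
artist_vector_hook(target, artist_positions=[2], mode="replace")
print(target[2])  # [0.5, 0.6] -- the captured vector replaces the original
```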

Controlled Variable Test

The prompt was fixed as 1girl, school uniform, smile, @wlop , the seed was fixed, and only the source of the injected @wlop vector was changed. "Pixel diff" below refers to the mean absolute difference per pixel across all RGB channels (scale 0–255); 0 means identical images:

Vector source                 pixel diff
Own vector (sanity check)     0.00 (100% identical)
Different pose/scene          1 - 6
+1 artist                     5 - 21
+2 artists                    11 - 17
3 artists with target first   9 - 69

Sensitivity varies greatly between artists -- @wlop (3 T5 tokens) drifts far more in multi-artist scenarios than @zhibuji loom (9 T5 tokens).
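A minimal sketch of the pixel-diff metric as defined above (mean absolute per-channel difference on a 0-255 scale); in practice this runs on decoded image arrays rather than nested lists:

```python
def pixel_diff(img_a, img_b):
    """Mean absolute difference per channel, 0-255 scale; 0 = identical.
    img_a, img_b: same-sized nested lists [H][W][3]."""
    total = count = 0
    for row_a, row_b in zip(img_a, img_b):
        for px_a, px_b in zip(row_a, row_b):
            for ch_a, ch_b in zip(px_a, px_b):
                total += abs(ch_a - ch_b)
                count += 1
    return total / count

a = [[[10, 20, 30], [40, 50, 60]]]
b = [[[10, 20, 30], [46, 50, 60]]]  # one channel differs by 6
print(pixel_diff(a, a))  # 0.0 (the sanity-check row in the table)
print(pixel_diff(a, b))  # 1.0 (total diff of 6 over 6 channel values)
```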

Extreme Drift Test

To make the visual impact of drift more intuitive, I designed a high-discriminability experiment: capture @wlop 's vector in the company of 12 other artists, then inject it into a complex scene prompt containing only @wlop . I used two groups of artists with very different styles:

  • Group A (realistic/impasto style, 12 artists): throtem, cogecha, nekobell, cutesexyrobutts, oda non, sam yang, kim hyung tae, hiramedousa, unfairr, soejima shigenori, solipsist, hu dako
  • Group B (anime style, 11 artists): fajyobore, quasarcake, tianliang duohe fangdongye, vanripper, zhibuji loom, machuuu68, kyou 039, meadow (morphinecaca), mikozin, ebora, ixy

Office scene (mature OL drinking water in a break room):

baseline (solo @wlop ) | 12 realistic artists contamination | 12 anime artists contamination
[images: scene_office_baseline, scene_office_from_groupA, scene_office_from_groupB]

Rain scene (young man standing in the rain outside a convenience store):

baseline (solo @wlop ) | 12 realistic artists contamination | 12 anime artists contamination
[images: scene_rain_baseline, scene_rain_from_groupA, scene_rain_from_groupB]

The baseline shows typical wlop style: semi-realistic CG texture, cinematic rim lighting, thick painterly strokes, atmospheric light-and-shadow with warm-cool contrast. After contamination, regardless of whether it was the realistic group or the anime group, the results shifted toward a "flat illustration" direction -- thinner brushwork, flattened lighting, more anime-like facial features. The convergence of both groups suggests that the causal attention stacking from multiple artists is "diluting" wlop's features in general, rather than pulling them toward any specific artist's direction.

It is important to emphasize: the generation prompt always contained only @wlop as the sole artist. The other 12 artists never appeared in the generation prompt. The differences come entirely from the vectors at @wlop 's 3 token positions being replaced.

Experiment 3: Artist Count Gradient + Vector Locking Fix

If the drift is real, then conversely -- locking @wlop 's vector to its "stable solo vector" in a multi-artist prompt -- should that preserve the art style?

        baseline | 12 artists, no fix | 12 artists + wlop vector lock
Rain:   [images: grad_rain_baseline, grad_rain_nofix_12art, grad_rain_fixed_12art]
Office: [images: grad_office_baseline, grad_office_nofix_12art, grad_office_fixed_12art]

Rain scene: In the unfixed version (middle), the tie changes from red to black, the convenience store sign changes from red to green, the rain streaks shift from rough and realistic to clean line art, and the overall feel goes from "movie still" to "light novel illustration." After locking the fix (right): the cinematic lighting and rain streak texture noticeably return, the atmospheric perspective depth is restored, and the brushwork texture thickens.

Office scene: In the unfixed version, the expression shifts from "eyes closed, head lowered, gentle smile" to "eyes open, head tilted back, drinking water," and the skin texture becomes harder and flatter. After locking the fix: it returns to the eyes-closed gentle smile pose, the delicacy of the backlit rim lighting is restored, and the misty city background reappears.

The locking fix is not perfect -- residual drift remains because we only locked @wlop 's 3 tokens. The vectors for other tokens in the prompt are still contaminated by the 12 artists through Qwen3's causal attention and the adapter's cross-attention. But the direction is right: locking just the artist vectors is enough to pull the art style back from "completely off" to "roughly correct."

Experiment 4: Isolating the RoPE Position Effect

While tracing the sources of drift, I discovered an additional factor. Anima's adapter internally uses RoPE (Rotary Position Embedding), which means the same token at different positions in the sequence will produce a different vector.

To isolate this effect, I designed a controlled experiment: padding different numbers of highres tags before @wlop . The semantic context is identical -- all prompts are highres, ..., masterpiece, best quality, 1girl, school uniform, @wlop , with only @wlop 's absolute position in the sequence varying.
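A hypothetical reconstruction of how these position-sweep prompts can be built: pad with repeated highres tags so only @wlop's absolute position changes while the semantic content stays fixed (padded_prompt is an illustrative name):

```python
# Base prompt from the experiment; only the amount of leading padding varies.
BASE = "masterpiece, best quality, 1girl, school uniform, @wlop"

def padded_prompt(n_pad: int) -> str:
    # n_pad copies of "highres" pushed in front of the fixed content.
    return ", ".join(["highres"] * n_pad + [BASE])

print(padded_prompt(2))
# highres, highres, masterpiece, best quality, 1girl, school uniform, @wlop
```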

@wlop 's T5 position   pixel diff
pos 20 (baseline)      0
pos 60                 2.6
pos 100                4.5
pos 180                6.5
pos 340                7.2

pos 20 (baseline) vs pos 340 (pure positional shift):
[images: pos_wlop_t020, pos_wlop_t340]

The semantics are identical, yet visible differences emerge purely from the position change -- color palette shifts, uniform detail variations.

This happens because the adapter uses RoPE instead of the relative position bias found in the original Cosmos/T5 architecture. RoPE encodes absolute position as rotation angles (theta * pos); the angular difference between pos 20 and pos 340 is large, directly altering the attention scores. T5's relative position bias, by contrast, only encodes the relative distance between tokens (bias[i-j]); as long as the adjacency relationships remain unchanged, the output is the same regardless of absolute position.

This means the longer the user's prompt, and the further back the artist tag sits in the sequence, the more purely positional drift affects the art style -- even if the actual content of the prompt has not changed. This effect stacks independently on top of the causal attention context contamination.


The Complete Causal Chain of Drift

After all experiments, artist vector drift has four cascading sources:

  1. Qwen3's causal attention -- later tokens are contaminated by information from preceding context; earlier tokens cannot see what comes after
  2. Qwen3's RoPE -- absolute position affects the hidden state
  3. Adapter's RoPE -- both self-attn and cross-attn introduce additional absolute position sensitivity (the original T5 uses relative position bias, which does not have this issue)
  4. Adapter's cross-attn KV comes from Qwen3 output already contaminated by sources 1 and 2

The DiT side does not contribute to drift -- the code confirms it does not apply positional encoding to text tokens (if self.is_selfattn and rope_emb is not None); text is a purely content-based unordered set of vectors to the DiT. All sources of drift occur before the text reaches the DiT.


Conclusions and Suggestions

The drift is measurable and reproducible

Artist vector drift has clear architectural causes and a significant impact in multi-artist and long-prompt scenarios. However, it should also be acknowledged that:

  • In the common use case of a single artist with a fixed template, the drift is small and most users will not notice
  • The adapter's bidirectional self-attention is indeed performing useful correction
  • The fact that tdrussell has brought this architecture to preview-stage image quality deserves respect from an engineering standpoint

A direction I believe is worth exploring: artists as independent tokens

Starting from CLIP's experience

Going back to the CLIP situation. Although CLIP also uses causal attention, artist performance in SDXL/Illustrious has been relatively stable in my experience. If we speculate that CLIP's stability mainly comes from visual grounding -- artist tokens being repeatedly paired with images in the corresponding art style during contrastive learning, forming vector anchors that are not easily pulled away by context -- then a natural follow-up is: could a similar anchoring mechanism be introduced into Anima's architecture?

What CLIP's contrastive training essentially does is: make each artist token's representation occupy a stable position in vector space that is aligned with visual features. If we do not obtain this stability through contrastive training, the most direct alternative is to explicitly assign each artist a fixed vector -- bypassing Qwen3 and the adapter's contextualization entirely, injecting via direct table lookup.

Specific proposal

The approach I lean toward is: have each artist name occupy a single independent token at the adapter output, with its vector coming from a fixed embedding table.

Why "single token" instead of maintaining the current multi-subword approach:

  • Eliminates positional dependence between subwords: @wlop gets split by T5 into _@, w, lop -- three tokens, each with a different RoPE angle in the adapter's self-attention. In the experiments, lop (the last subword) consistently showed the largest drift. A single token eliminates this layer of complexity entirely
  • Avoids artist confusion from subword overlap: @abc123 and @abc456 share the subwords @, a, b, c, and their adapter output space representations may end up very close. Independent tokens in high-dimensional space are naturally near-orthogonal
  • Naturally compatible with how DiT reads tokens: The DiT does not apply positional encoding to text tokens -- each token is an independent semantic signal. One artist per token is a natural match for this design
  • Cleaner multi-artist blending: 5 artists = 5 independent, fixed vectors in c_crossattn. The DiT's cross-attention can read each artist signal separately -- similar to how CLIP provides the UNet with relatively stable per-token signals in SDXL, rather than the current situation where signals have been mutually permeated through causal attention

Specific workflow:

  • Maintain an [N_artists, 1024] embedding table (60,000 artists x 1024 dimensions ~ 240MB)
  • When @artist is detected, look up the table to get a fixed vector and inject it at one token position in c_crossattn
  • Remove @artist from Qwen3's prompt -- to prevent an identifier that the language model does not understand from contaminating other tags' representations through causal attention. When such identifiers are processed by Qwen3, they may not receive meaningful representations and could introduce unintended perturbations into other tokens' hidden states (for example, an artist name like horn/wood might cause Qwen3 to interpret the literal semantics of "horn" and "wood," potentially interfering with the intended prompt meaning). Removing artist tags also frees up Qwen3's limited context capacity, letting it focus on the natural language descriptions that genuinely require language understanding
  • Train the embedding table end-to-end with the DiT
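The workflow above can be sketched in a few lines. This is purely a shape-level illustration of the proposal, not Anima's actual code; artist_table, split_prompt, and build_crossattn are hypothetical names, and the random vectors stand in for a table that would be trained end-to-end with the DiT:

```python
import random

DIM = 1024
random.seed(0)
# Stand-in for the proposed [N_artists, 1024] fixed embedding table.
artist_table = {name: [random.gauss(0, 1) for _ in range(DIM)]
                for name in ["@wlop", "@guweiz"]}

def split_prompt(prompt):
    """Strip @artist tags from the prompt (so they never reach Qwen3)."""
    parts = [p.strip() for p in prompt.split(",")]
    artists = [p for p in parts if p.startswith("@")]
    clean = ", ".join(p for p in parts if not p.startswith("@"))
    return clean, artists

def build_crossattn(text_vectors, artists, max_len=512):
    """One fixed lookup vector per artist, prepended to the adapter output
    for the artist-free prompt, zero-padded to max_len (c_crossattn shape)."""
    seq = [artist_table[a] for a in artists] + text_vectors
    return seq + [[0.0] * DIM] * (max_len - len(seq))

clean, artists = split_prompt("1girl, school uniform, smile, @wlop")
print(clean)    # 1girl, school uniform, smile
print(artists)  # ['@wlop']
```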

Cost considerations

This approach may have cost advantages on both the training and inference sides:

Training side: Under the current architecture, artist knowledge must be learned through the full Qwen3 -> adapter -> DiT chain, with the adapter's 6-layer transformer (self-attn + cross-attn + MLP) participating in computing artist representations on every forward pass. If artists go through independent embeddings, this computation can be saved -- the artist embedding is just a table lookup, and gradients only backpropagate to the embedding table and the DiT's cross-attention weights, without needing to pass through the adapter and Qwen3.

Training a separate CLIP-like model to generate artist vectors is also a viable path; CLIP itself has modest parameter counts (CLIP-L is around 400M), and it would only need contrastive training or similar alignment training on Danbooru's image-artist pairs.

For reference, Illustrious XL fine-tuned both CLIP-L and OpenCLIP-G during training (text encoder lr around 4-6e-6, roughly 1/5 to 1/10 of the UNet lr, trained on 7.5 to 12 million Danbooru images; see their arXiv paper, Table 5). NoobAI-XL also introduced text encoder training in later versions (from v0.75 onward) to improve understanding of less common tags. This shows that knowledge transfer from existing CLIP weights on Danbooru data is a proven path in practice, and that at the scale of tens of millions of images, the text encoder does not suffer severe catastrophic forgetting.

Inference side: Currently, @tianliang duohe fangdongye occupies 13 tokens under the T5 tokenizer, each of which must pass through the adapter's 6 layers of attention computation, ultimately occupying 13 of the 512 positions in c_crossattn. If an artist occupies only 1 token, the adapter does not need to process these artist tokens (or rather, artists bypass the adapter entirely), and 12 positions are freed up in c_crossattn for other content. For multi-artist scenarios, 3 artists currently might consume 30+ tokens (3-13 subwords each); under the independent token approach, they would occupy just 3. In the DiT's cross-attention, this means shorter KV length and correspondingly reduced computation, while also leaving more token space for natural language descriptions and other tags.

Limitations to acknowledge

This approach has clear trade-offs:

  • It loses context-dependent artist representations -- the adapter's current artist vectors, while unstable, do carry information about "how this artist should perform in the current scene" (we saw in Experiment 3 that locking vectors restored the style but not perfectly, suggesting contextual adaptation does have value)
  • It requires retraining together with the DiT -- it cannot be used as a drop-in replacement
  • The table of 60,000 artists needs sufficient training data to learn each embedding well; tail-end artists may lack data

Lower-cost alternative improvements

If artist separation is not pursued, replacing the adapter's RoPE with relative position bias is a point worth considering -- it would at least eliminate the purely positional drift observed in Experiment 4 without requiring changes to the overall architecture.

These are just observations and ideas from an outside perspective — I'm sure there are constraints and considerations I'm not aware of. Curious to hear thoughts from anyone who's looked at this problem, or if the current behavior is a known tradeoff that's been considered. Happy to share more details on the experimental setup if helpful.

Yeah, CLIPs are great and the architectural disaster of llm, llmadapter, t5 is very apparent.
This architecture is deeply flawed.
Your solution is a stop gap for what a CLIP would've done.

Not a tech guy, so I can't say much other than it was an interesting read. But from an image creator point of view I was in particular interested in the result of experiment 3. Is there a way to turn whatever you used to lock the vector of the artist tag into a custom node for comfyui? That would probably be very useful!

CircleStone Labs org

I'm sorry if this comes across as instantly combative, but this post reads like completely (or nearly completely) AI written. I don't doubt you did these experiments (you have the images to prove it), but I don't know how much of all this is your own genuine human thought process and analysis, as opposed to your Claude agent sycophantically agreeing with you and generating what it thinks you want to hear. Some of the things are outright wrong, more are just misleading.

Let me first ask this: in real-world prompts where you are following the correct prompt formatting, are you actually seeing significant artist knowledge degradation when you make the prompt longer? Because I haven't seen this.

And now I will lay out a few high level points to maybe help ground any discussion.

  • Anima doesn't really behave much different than any other modern DiT model. You can say "LLM is flawed, DiT is flawed", but then that equally applies to basically all new models people are creating these days.
  • "LLM adapter is flawed". To my knowledge, people saying this are basing it entirely off of bad conclusions from one experiment someone did. I don't think the LLM adapter architecture modification is flawed; I strongly believe it to be the exact opposite (it is an improvement over base Cosmos) and I have data showing this.
  • A lot of the data presented in the post is from running the model on out-of-distribution inputs. I'll elaborate more on this as I go, and it is kind of a cop-out answer, but just keep this in mind.
  • Artist mixing doesn't work the same way as SDXL, but this seems to be true for all models with LLM text encoders (it is true for NovelAI for instance).

There is way too much written here to address every line but I'll start going through major points.

The subword tokens for wlop were repeatedly paired with images in wlop's art style during training

I don't think this is true. I doubt the pretraining for the original CLIP models contained anime booru images with their cleanly extracted tags. And if it did, they are many years out of date at this point and wouldn't know newer artists. The CLIP models are finetuned as part of the UNET training (for the anime models we are talking about at least), and it's during this process that CLIP can learn to align whatever "wlop" embeds to, to something the UNET can more easily respond to. Anima's LLM adapter being trainable has a similar effect, btw, that's why I have referred to it as a "mini trainable text encoder".

What did Anima change?

Everything in this section is true, and I don't believe any of it is a problem. But it's written in a way that it is implying something is a problem.

Experiment 1: Vector Stability Sampling

Anima is not trained with artist tokens at arbitrary positions. There are only 2 formats: 1) very beginning in the case of a pure NL prompt, 2) after the series name in the tag list. Anything else is out-of-distribution. Nonetheless, those mean cosine similarities are quite high. Early on in Anima training I compared LLM adapter output embeddings against the original T5 embeddings, and based on my experience, above 0.9 cossim, and certainly above 0.95, you will have a very hard time visually distinguishing any difference in generated images. You don't have images for this experiment, so it's speculation though.

Experiment 2: Visual Verification -- Vector Replacement Image Generation

I don't think the "pixel diff" data is meaningful. There is always a butterfly effect: you change one part of the text embeddings just a bit, and you can get an entirely different (pixelwise) image.

Extreme Drift Test

Now we get to images. This experiment shows large differences, but I'm not sure how much that matters for real-world use, or why you would ever expect differently. As mentioned before, every token is influencing every other token. This is true with CLIP also, what would happen if you ran this same experiment with SDXL? Somehow I doubt that taking an artist CLIP embedding from a 12-artist prompt and injecting it back to a single-artist prompt gives you exactly the same style as the single artist prompt.

Experiment 3: Artist Count Gradient + Vector Locking Fix

Maybe I'm blind but the overall style for the "no fix" vs "fix" looks basically the same. This feels like reaching for straws.

locking just the artist vectors is enough to pull the art style back from "completely off" to "roughly correct."

It is definitely not "completely off" without the "fix". And you're using 12 (!!) artists here.

Experiment 4: Isolating the RoPE Position Effect

Anima's adapter internally uses RoPE (Rotary Position Embedding), which means the same token at different positions in the sequence will produce a different vector.

Yeah that's what everything uses now and it works really well.

To isolate this effect, I designed a controlled experiment: padding different numbers of highres tags before @wlop

Wait, so you are prompting the model like "highres, highres, highres, highres, highres, 1girl, etc"? This is just weird, it is out-of-distribution, it "dilutes" the attention scores of all other tokens when the DiT does cross-attention because one token is repeated so many times. And despite all that, there is nearly no difference in these 2 images.

RoPE encodes absolute position as rotation angles (theta * pos); the angular difference between pos 20 and pos 340 is large, directly altering the attention scores. T5's relative position bias, by contrast, only encodes the relative distance between tokens (bias[i-j]); as long as the adjacency relationships remain unchanged, the output is the same regardless of absolute position.

This makes no sense, your Claude just made up a nonsensical conclusion. RoPE, and T5's relative position bias, are both forms of relative positional encoding. The value of the dot product between 2 vectors with RoPE injected, is dependent only on the relative difference between their positions. Exact same high-level effect as the relative position bias of T5.

as long as the adjacency relationships remain unchanged, the output is the same regardless of absolute position.

True for both forms of position encoding, but in this experiment the adjacency relationships are changed (that's the point of the experiment).
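This property is easy to verify numerically. A minimal pure-Python sketch of RoPE (each pair of dimensions treated as a complex number and rotated by pos x freq; function names are illustrative) shows the attention score depends only on the positional offset, not on absolute position:

```python
import cmath
import random

def rope(vec, pos, theta_base=10000.0):
    """Apply RoPE at absolute position pos: rotate each complex 'pair'
    by an angle pos * freq_i, with frequencies theta_base**(-i/half)."""
    half = len(vec)
    return [z * cmath.exp(1j * pos * theta_base ** (-i / half))
            for i, z in enumerate(vec)]

def dot(a, b):
    # Real inner product of the underlying 2D pairs: sum Re(conj(a_i) * b_i).
    return sum((x.conjugate() * y).real for x, y in zip(a, b))

random.seed(0)
q = [complex(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(32)]
k = [complex(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(32)]

s1 = dot(rope(q, 20), rope(k, 0))     # offset 20, near sequence start
s2 = dot(rope(q, 340), rope(k, 320))  # same offset, far into the sequence
print(abs(s1 - s2) < 1e-9)  # True: the score depends only on the offset
```

The rotations cancel except for the difference term e^{i(pos_k - pos_q) * freq}, which is why shifting both tokens by the same amount leaves the score unchanged -- exactly the point made above.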

The Complete Causal Chain of Drift

A huge reaching for straws here to try to make it seem like the Anima architecture is somehow flawed.

Qwen3's causal attention -- later tokens are contaminated by information from preceding context; earlier tokens cannot see what comes after

All the mentioned text encoders "contaminate" tokens from information elsewhere in the context. Not all are causal; T5 isn't, and (I think) CLIP also isn't. Why does this matter? Nearly all modern DiT diffusion models are using causal LLMs / VLMs as text encoders.

Qwen3's RoPE -- absolute position affects the hidden state

Same for CLIP and T5.

Adapter's RoPE -- both self-attn and cross-attn introduce additional absolute position sensitivity (the original T5 uses relative position bias, which does not have this issue)

As I mentioned, this is just wrong. T5 has the same "issue".

Conclusions and Suggestions

A direction I believe is worth exploring: artists as independent tokens

I actually thought of doing this from the very beginning, with the goal of possibly making it easier to do artist mixing. But 1) it complicates implementation a lot, and 2) a fixed-size artist lookup table makes it hard to train on more artists in the future.

If we speculate that CLIP's stability mainly comes from visual grounding

Personally I doubt "visual grounding" is why CLIP is supposedly "stable" or the reason why artist mixing works different there. At any rate, you didn't run these experiments with SDXL so why are you claiming CLIP is stable where Anima's architecture is not?

Cleaner multi-artist blending: 5 artists = 5 independent, fixed vectors in c_crossattn. The DiT's cross-attention can read each artist signal separately

I actually have evidence this doesn't matter or help. I made some custom nodes (nonpublic) that can compute Anima text embeddings for two separate prompts, then concatenate them before sending to the DiT. The goal was to do better artist mixing: have the same prompt twice, each with a different artist, so each artist's embeddings are computed in isolation. Then the concat is effectively averaging the artist vectors (due to how the DiT does cross attn with no additional position embeddings). This made nearly no difference in the generated images compared to having just "artist1, artist2" in a single prompt.

TLDR

This whole thing reads like you prompted your Claude agent to prove why Anima's architecture is flawed. I don't agree with it. I do agree that artist blending is different (and worse) than SDXL, but I think this was always a happy accident of how CLIP worked and that the downsides of CLIP are not worth it.

  • "LLM adapter is flawed". To my knowledge, people saying this are basing it entirely off of bad conclusions from one experiment someone did. I don't think the LLM adapter architecture modification is flawed; I strongly believe it to be the exact opposite (it is an improvement over base Cosmos) and I have data showing this.

You can namedrop me for that test, no one will be mad

[image]

Now onto the relevant stuff

Training the self- and cross-attn layers in a LoRA causes slight forgetting in the model -- character features, poses, "from side, from front", etc. If something is not in the training data, the model does become very stiff and 'forgets' its latent knowledge in favor of the data used to train the LoRA.

Training the AdaLN layers only increases this issue, all of this with the LLMAdapter frozen.

Base model prompt

masterpiece, best quality, year 2025, fleurdelys (wuthering waves), beach, bikini, full body, smile, open mouth, sitting, face close up, blue eyes, gigantic breasts

[image]

With a LoRA trained on Musaigen no Phantom World screencaps with the AdaLN layers trained, fleurdelys' features basically go away -- her crown and her horn -- and breast size is severely nerfed due to its absence in the data.

[image]

Same but without AdaLN, recovers some features and again more breast size.

[image]

Same but without AdaLN, self-attn, or cross-attn: recovers even more features and, again, more breast size.

imagen

I apologize if my words regarding the LLMAdapter came off as harsh, and I'm sure /ldg only fueled the flames. But hey, perhaps, just perhaps, freeze the LLMAdapter? Since the similarity between the one in Preview1 and Preview2 shows that Preview2's was trained even more. It's doubtful the model will collapse if you freeze it, right?

CircleStone Labs org
edited 11 days ago

I avoided namedropping you because I've only heard everything indirectly. I'm not in whatever discord servers you're using to discuss this stuff.

The LLM adapter is actually currently frozen, as of a few epochs ago. And it has been "soft-frozen" (10% of the DiT LR) since shortly after preview2. Not that it matters: there is basically no change to training loss curves if I freeze the adapter (which I did, as an experiment, a long while ago before resuming training it).
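For readers wondering what "soft-frozen at 10% of the DiT LR" means mechanically, the usual way to do this in PyTorch is per-module optimizer parameter groups. This is a minimal illustrative sketch with dummy stand-in modules, not the actual training code:

```python
import torch

dit = torch.nn.Linear(8, 8)      # stand-in for the DiT
adapter = torch.nn.Linear(8, 8)  # stand-in for the LLM adapter

base_lr = 2e-5
opt = torch.optim.AdamW([
    {"params": dit.parameters(), "lr": base_lr},
    # "soft-frozen": the adapter still trains, but 10x slower
    {"params": adapter.parameters(), "lr": 0.1 * base_lr},
])
# A full freeze would instead call adapter.requires_grad_(False)
# and omit the adapter's parameter group entirely.
assert abs(opt.param_groups[1]["lr"] - 2e-6) < 1e-12
```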

I have seen a lot of people say stuff like "the model can't be trained, it just forgets everything". This is not at all my experience training loras for this model, which I've done a decent amount of for testing purposes. If we're sharing images, allow me to share mine:

no_catestrophic_forgetting
"normal quality, safe, 1girl, neuro-sama, rune (dualhart), solo, standing, cowboy shot, cardigan, blue skirt, blue sailor collar, brown hair, red ribbon, shirt, skirt, long sleeves, heart hair ornament, long hair, skirt lift, covering own mouth, naughty face, blush, open mouth, smile, medium breasts, aged up"
(removed @ from artist tag to not ping whoever that is on HF)
edit: I don't know how to markdown, the actual prompt escapes the parens also

Left: preview2
Right: preview2 with realism lora (yes, really), but prompted for anime.

A realism lora trained on 1800 photos. No anime images at all. Pretty good diversity in the training data, and captioned with both NL and tags. Trained for 120 (!) epochs, well into the overfitting regime as measured by validation loss. Global batch size 16, rank 32 lora, AdamW optimizer with 2e-5 LR, no LLM adapter training (basically my recommended defaults from the model card).

This lora almost fully converts Anima into a proper realism model, nearly on par with the likes of Chroma. It is the most catastrophic-forgetting-inducing lora you could possibly train, I imagine. And there is basically zero loss of both character and artist knowledge. What are your training settings? Because I just never have any issues at all when using my standard settings.

I have it on my TODO list to train some loras using sd-scripts. Maybe there is some problem with that trainer. I use diffusion-pipe (obviously) for everything.

Hey there, thanks for answering

imagen

Interesting to see that the LLMAdapter, when mostly frozen, barely moves the L2 loss. I wanted to ask the following:

  1. Are you using the main branch of your diffusion-pipe? Because the sd-scripts code is leveraged from that one.
  2. If not, and you are instead using a private branch, would you be willing to make it public, even if stripped of "secret sauces"? The forgetting has been observed by finetuners and other people, so if in your case it doesn't forget regardless of the dataset, then it would greatly benefit a lot of trainers.
  3. I've used the main branch of your diffusion-pipe, and the results I showed were with that, until I modified it to not train other layers of the LoRA.

All the loras I trained were rank 32 with AdamW at 5e-5, micro batch size 48 at 512px, and accumulation 2.

CircleStone Labs org

It is the main branch of diffusion-pipe.

Those settings seem mostly reasonable. Possibly 5e-5 is a bit too high even at that batch size but it doesn't look dramatically too high. Is it a large dataset? Because 96 global batch size is either going to duplicate a lot of images to make a bucket, or else drop images (depending on which trainer you're using).
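The duplicate-or-drop concern above comes from aspect-ratio bucketing: each bucket has to be filled in multiples of the global batch size. A hypothetical illustration (the resolutions and counts are made up, and real trainers differ in the details):

```python
def bucket_plan(bucket_sizes, global_batch=96, mode="drop"):
    """Images actually trained per aspect-ratio bucket, when each bucket
    must be a multiple of the global batch size."""
    plan = {}
    for res, n in bucket_sizes.items():
        rem = n % global_batch
        if mode == "drop":
            plan[res] = n - rem  # remainder images are discarded
        else:  # "duplicate": pad the bucket up with repeated images
            plan[res] = n + (global_batch - rem) % global_batch
    return plan

# A 400-image dataset split across three hypothetical resolution buckets:
print(bucket_plan({"832x1216": 250, "1216x832": 90, "1024x1024": 60}))
# → {'832x1216': 192, '1216x832': 0, '1024x1024': 0}
# With drop mode, the 90- and 60-image buckets vanish entirely.
```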

It is the main branch of diffusion-pipe.

Those settings seem mostly reasonable. Possibly 5e-5 is a bit too high even at that batch size but it doesn't look dramatically too high. Is it a large dataset? Because 96 global batch size is either going to duplicate a lot of images to make a bucket, or else drop images (depending on which trainer you're using).

Around 400 images, and others with up to 1600 had the same situation happen to them, especially with concepts that the model doesn't have the slightest idea about.

Also, I use your diffusion-pipe main branch. I'm aware of how it drops buckets if they're smaller than the batch size, and I made sure my dataset wasn't dropped at all. As I showed in my previous response, stuff did happen. I guess it now means my approach is bad somehow; it's not like I'll find gigantic badonkas in an anime screencap dataset. I'll drop to 2e-5 but I'm not hopeful, since others have had this happen to them as well.

imagen

CircleStone Labs org

Everything is dataset dependent, so it is hard to make direct comparisons without knowing every single detail. I have trained some realism loras, and a couple of anime loras. I have never had any problems with forgetting. The only time forgetting or unexpected degradation is a problem, is if I train LLM adapter with relatively high LR. As long as I don't do that, there is no issue.

When I get time I want to train at least one "official" style or character lora, and release it, with full dataset and training config file and everything. Just to show it works fine.

Everything is dataset dependent, so it is hard to make direct comparisons without knowing every single detail. I have trained some realism loras, and a couple of anime loras. I have never had any problems with forgetting. The only time forgetting or unexpected degradation is a problem, is if I train LLM adapter with relatively high LR. As long as I don't do that, there is no issue.

When I get time I want to train at least one "official" style or character lora, and release it, with full dataset and training config file and everything. Just to show it works fine.

Thank you for your willingness to do this
imagen

@tdrussell Hi, I've been training style loras for Anima Preview 2 and really enjoying using the model. But I've also frankly been noticing the "knowledge forgetting" problem frequently when training on the model, on multiple datasets.
Epoch 35
35c
Epoch 50
50c lol it forgetting hard
Original look of the character:
image
Prompt: indigo (arknights), 1girl, holding staff, thigh strap, vial, test tube, black footwear, white dress, clothing cutout, shrug (clothing), material growth, looking at viewer, infection monitor (arknights), very long hair, purple eyes, black gloves, hand on own hip, sitting, knee to chest, parted lips, multiple straps, ankle boots, pouch, black socks, id card, tail, blonde hair, short dress, thigh belt, small breasts, pointy ears, animal ears, oripathy lesion (arknights), torn clothes, headpiecer, outside border, white background, full body, film grain, masterpiece, best quality

As you can see, not only do specific character details not resemble what they looked like before, even basic traits of the character are being mixed up.

Another example.
Epoch 5
5
Epoch 35
35
Epoch 50
50
Original look of the character:
image
Prompt: lady avalon (second ascension) (fate), 1girl, ahoge, bare shoulders, bikini, hair, blue sky, breasts, cleavage, day, flower, frilled bikini, frills, grin, hair flower, white umbrella, holding umbrella, long hair, looking at viewer, navel, ocean, outdoors, parasol, pointy ears, red eyes, side-tie bikini bottom, sky, smile, swimsuit, thigh strap, white thighhighs, single thighhigh, thighs, twintails, very long hair, wading, water, white bikini, white hair, from below, wet, hip bones, groin, from side, wind, floating hair, waves, splashing, swimsuit cover-up, frills, masterpiece, best quality

Config for both loras: https://huggingface.co/RicemanT/Loras_Collection/resolve/main/anima%20preview%202%20experiment%20with%20block%20lr-dim.toml
The config is exactly that, just minus the network args I added in later as an experiment to improve this specific issue. The training program is https://github.com/67372a/LoRA_Easy_Training_Scripts , which adopted Anima training code from duongve's implementation in sd-scripts, which in turn originated from Bluvoll's code.

Point is, I'm not in any way knowledgeable or experienced with training and finetuning, but even just as a general user of the model who trains loras sometimes, these kinds of forgetting are frequent enough that they've become quite a balancing concern: "how much should I let it learn the style before knowledge of every character becomes fucked?" Even knowledge of things like windows and scenery gets degraded the more the lora is cooked. I've also gone around talking to about 5-6 other creators on civitai and read through a few others' opinions on other platforms about their own encounters with these kinds of issues. Circumstances and training settings varied, but the issue is often present. Though these are only small-scale trainings for style and character loras in most cases; your and Bluvoll's examples are much larger data-wise.

@RicemanT Steps matter, not epochs. Epochs are meaningless without knowing the dataset. 50 epochs of 20 images, and 50 epochs of 200 images are massively different. 12 epochs of 200 images, 24 epochs of 100 images and 48 epochs of 50 images are much less different.
If you train even SDXL on a very limited dataset for long enough (e.g. 50 epochs with 200 images) without any other data, it too will forget. All loras degrade any model, just go back to SDXL and add some unrelated validation images and see for yourself. Anima just used to forget way faster with preview 1, though at least in my experience this is less of an issue with preview 2 now.
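The steps-vs-epochs point above is just arithmetic, but it's worth making concrete (batch size 1 assumed for simplicity):

```python
def total_steps(epochs, images, global_batch=1):
    """Optimizer steps over a run (ignoring bucket remainders)."""
    return epochs * images // global_batch

# The pairings from the comment, at batch size 1:
assert total_steps(50, 20) == 1000
assert total_steps(50, 200) == 10000  # 10x the gradient updates
assert total_steps(12, 200) == 2400
assert total_steps(24, 100) == 2400
assert total_steps(48, 50) == 2400    # "much less different"
```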

As I mentioned in a few other threads here, you can just train with more images. Initially I had assumed 1:1, but 22% (~3.5:1) of unrelated random gel images also worked. I don't have the hardware to train and test various mixes, I have just an A770 16GB. You do not need a lot of images, you do not need a lot of training and you do not need every single concept the model previously knew represented to combat the forgetting.

Scrolling through all these images is getting kinda hard. Grid or spoiler your images, people.

I do what I preach. Here are Bluvoll's examples with my method:

Grid, validation and more comments

Loras for Remedios Custodio, trained on lots of screenshots from the Overlord movie, which has very strong color grading.

____, Original,
2100 steps without any other data, 2400 steps with 22% random gel images,
4000 steps without any other data, 5000 steps with 22% random gel images

bluvoll_grid

Note that the original does NOT have a crown.

Validation on one image at 0.4, 0.0 sigmas in Comfy:

ranni_0.4_plot

My examples above are from the 16 dim loras. The image tested against is the only image of ranni by cutesexyrobutts on gel, averaged over 8 seeds for more stable results. The extra dataset I used for training is 7525 images, used as regular training data, of which exactly 1 is by cutesexyrobutts and exactly 0 are ranni. Do you think the lora even saw the cutesexyrobutts image during training? ~14.6% chance it did at all when finished.

The dataset also contained 2 images of fleurdelys (27% for at least 1) and 22 of "gigantic breasts" (96% for at least 1).
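The quoted percentages are consistent with a simple "at least one of k targets in n uniform draws" calculation. The number of draws here (1185) is my own back-solved assumption, roughly matching ~22% of the run; the dataset size is from the post:

```python
def p_at_least_one(k, dataset=7525, draws=1185):
    """P(sampling at least one of k target images in `draws` uniform
    draws with replacement from a dataset of `dataset` images)."""
    return 1 - (1 - k / dataset) ** draws

print(p_at_least_one(1))   # ~0.146 -- the single cutesexyrobutts image
print(p_at_least_one(2))   # ~0.270 -- the two fleurdelys images
print(p_at_least_one(22))  # ~0.969 -- the 22 "gigantic breasts" images
```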

Validation on 0.6, an image of Remedios outside the dataset:

remedios_0.6_plot

The difference in character details when using the lora normally looks within placebo to me (though this loss says it's mildly better). I don't want to bother testing 500 images for this like last time.

Also, I think Bluvoll's prompt is very bad. You can't complain about a missing crown if you didn't prompt for it. That is how the model works by default, and to be honest, I am increasingly finding that desirable.

@Espamholding very interesting insights! though please allow me to ask for some more details for maybe other onlookers:

Steps matter, ...

for this, batch_size and gradient_accumulation_steps affect how many effective images 1 step contains; e.g. with both at 1, each step makes the model see 1 image, but batch_size=3, gradient_accumulation_steps=2 for example makes the model see 6 images per step. Can you disclose these 2 numbers that you used? (so we can get an idea of how many images the model saw)
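The relationship described above, as a one-liner:

```python
def images_per_step(batch_size, grad_accum):
    # one optimizer step consumes batch_size * grad_accum images
    return batch_size * grad_accum

assert images_per_step(1, 1) == 1
assert images_per_step(3, 2) == 6
```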

Validation on one image at 0.4, 0.0 sigmas in Comfy: (line charts)

can you elaborate on what you mean by validation exactly? What are the x and y axes supposed to represent and how are the values calculated?

You can't complain about a missing crown if you didn't prompt for it.

I think the point isn't that this is a high-quality prompt that one should use when actually genning, but a non-specific demonstration where a trait that should be highly associated with a character's design (in this case, crown/horn + fleurdelys) is "forgotten," which may have bad implications for other knowledge that should be associated together. If you have deeper insights on this, though, please share them.

Just pointing out I have forgetting issues too with Anima, and several other people I know in several circles pointed out that it's an issue.
I also like to mix loras, and it's even more obvious doing that, compared to SDXL just working.
Either I don't seem to get the style actually well learned (I'm training mostly styles), or I get the style but it starts to forget character details. I tried several datasets, with around 200 images each and one dataset of 2k. I tried 5e-5 LR and 3e-4 LR. It's the same; I just need more steps to reach the issue, but the style is washed out until I get there.
I'll try with even lower LR, but I'm not hopeful.

If you train even SDXL on a very limited dataset for long enough (e.g. 50 epochs with 200 images) without any other data, it too will forget.

It will overfit to the images and burn. But with Anima I see the samples resemble the style gradually while losing the other knowledge, rather than the frying SDXL did after some time. This is why I assume there is an issue with forgetting: I actually didn't get Anima to fry, just forget.

As I mentioned in a few other threads here, you can just train with more images.

The model can't retain knowledge, so we're going to add a lot of regularization. We went back in time to SD 2.0...
While it kinda works, though not perfectly, I don't have that much compute to use a lot of regularization each time. Why is the architecture so flawed that I have to double or more my training time? But yes, this is how I got the best results.

The issue with LLMs giving importance to the first tokens is very real too: changing the order of the tokens will give you different results, and putting tags at the end of a long prompt is like adding noise rather than new data. It is how LLMs work currently, unfortunately. Rigidly following the prompt structure tdrussell suggested is actually important, as it gives you the best results.

Also, I think Bluvoll's prompt is very bad. You can't complain about a missing crown if you didn't prompt for it. That is how the model works by default, and to be honest, I am increasingly finding that desirable.

No, that is not how the model works by default, because it was trained with dropout and is thus supposed to add unprompted features automatically; that is the desired behavior. If you want a model that actually works like that by default, try Pony v7
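For context, the dropout in question is caption tag dropout. A minimal sketch of the idea (an illustrative assumption about the training pipeline, not the actual code; the keep probability is made up):

```python
import random

def drop_tags(tags, keep_prob=0.9, rng=None):
    """Caption tag dropout: randomly omit tags during training so the
    model learns to tie a character's signature features to the character
    tag itself, and adds them unprompted at inference."""
    rng = rng or random.Random()
    return [t for t in tags if rng.random() < keep_prob]

# e.g. sometimes "crown" and "horn" are dropped from the caption while
# the character tag stays, so the model learns they belong to her anyway:
caption = ["fleurdelys (wuthering waves)", "crown", "horn", "1girl"]
print(drop_tags(caption, rng=random.Random(0)))
```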

I have also trained several style LoRAs on Anima V1 and V2 on diffusion-pipe forked by Bluvoll.
The criteria I adopt for judging forgetting are mainly:
(1) isekaijoucho's several versions of hair flower, especially the side of the stamen that resembles a bird's beak (just because she's pretty)
(2) The ability to adjust the tones and orientations of lighting and shadows
(3) Breast size
(4) The expressive ability of architecture and correct perspective, especially when the training material is only a bust
(5) A completely opposite style expression, for example, requiring abstraction in a realistic LoRA

1. For V1, my settings are basically lr=2e-5, AdamW, rank 32 or 64, and bs 6-32 (depending on the resolution and training data conditions), 0.04-0.06 dropout.

For research purposes, I also trained a realism LoRA covering different perspectives across approximately 30 scenes, all 240 images featuring the same one model (girl). That model (girl) has small breasts or a nearly flat chest, but the model can still generate gigantic breasts. I evaluate that this LoRA is already usable at 13 epochs (13 epochs x 240 images x 3 shuffle, although Anima seemingly should not shuffle), but no CFG burn or forgetting occurred even at 100 epochs. (By the way, it seems that apart from Qwen Image, I haven't encountered CFG burning or line collapse during training on other DiT (or NextDiT, mmDiT) models: newbie, zimage turbo, lumina, etc.)
The one with 100 epochs seems to be better at maintaining dominance when mixing multiple LoRAs.

But things change when I try a pure illustration style with clear lines (but not ligne claire style): small accessories always appear with messy texture. Perhaps it can be regarded as forgetting of character traits? The LoRA is generally usable, so I didn't look into it in detail at that time.

2. (1) For V2, I found that the settings from V1 are completely unable to learn style, until I set lr=8e-5 (to learn 'nyte_tyde').
(2) Moreover, when the style is properly fitted, a certain degree of forgetting is observed even when rolling back a few epochs.
(The validation loss indicated overfitting much earlier, but in reality it didn't seem to be properly fitted.)
(3) A LoRA trained purely on bust images struggles to pull the camera back.
(4) Additionally, for a character with multiple hairstyles, when there is only one hairstyle in the material, it is very difficult to change to other hairstyles when the lora is trained at high lr.
In my case, there are two images of ninomae ina'nis in the training materials (75 images), with long hair but no tentacle hair. I tagged them, on purpose, as only 'ninomae ina'nis (loungewear)' (which actually should have short hair), but all versions still can't generate tentacle hair for ninomae ina'nis.

Therefore, I can't believe the claimed lr=2e-5, especially for V2.

for this, batch_size and gradient_accumulation_steps can affect how many effective images 1 step contains

It should be identical to what I recall the linked config being: 1 batch size, 1 gradient accumulation step, 1MP images. Also no TE/LLM adapter training, and using sd_scripts. Though I normally train with 32/32 dim/alpha and 0.000122.

can you elaborate on what you mean validation exactly, what are the x and y axis supposed to represent and how the values are calculated?

I wrote myself a small custom node in Comfy that takes an image, model, and sigmas and calculates MSE loss, sampled over 8 seeds and averaged. The Y axis is the loss, the X axis is the steps: one lora (point) every 300 steps, finishing at 4000/5000. 0.4 sigmas should be 400 timesteps if Comfy doesn't distort things, but even if it does, the big jump in forgetting is always there for 0.85, 0.6, 0.4, 0.3, whatever. I have plots for other sigmas if you want.

Note that this increase in validation loss (regardless of how you measure it) on general images also happened with older anime models too and you will see it even with doing validation on random images with your preferred trainer.
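A per-sigma validation loss like the one described above can be sketched in plain PyTorch. This is a minimal mock-up, assuming a rectified-flow parameterization (the real model's objective may differ), with a dummy oracle standing in for a real model:

```python
import torch

def val_loss(model, x0, sigma, seeds):
    """Average MSE between the model's prediction and the flow-matching
    target at one noise level, over several fixed seeds.

    Assumes x_t = (1 - sigma) * x0 + sigma * noise and
    target = noise - x0; the real parameterization may differ.
    """
    losses = []
    for seed in seeds:
        g = torch.Generator().manual_seed(seed)
        noise = torch.randn(x0.shape, generator=g)
        x_t = (1 - sigma) * x0 + sigma * noise
        pred = model(x_t, sigma)
        losses.append(torch.mean((pred - (noise - x0)) ** 2).item())
    return sum(losses) / len(losses)

# Sanity check with an oracle: when x0 = 0 the target is x_t / sigma,
# so a model that returns exactly that should score ~zero loss.
x0 = torch.zeros(1, 4, 8, 8)
oracle = lambda x_t, sigma: x_t / sigma
assert val_loss(oracle, x0, 0.4, seeds=range(8)) < 1e-10
```

Averaging over fixed seeds, as done here, is what makes the per-checkpoint points comparable across a training run.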

I think the point isn't that this is a high quality prompt that one should use when actually genning, but a non-specific demonstration where a trait that should be highly associated with a character's design (in this case, crown / horn + fleurdelys) is "forgotten," which may have bad implications on other knowledge that should be associated together? if you have deeper insights on this though please share them

It is my experience in general with Anima (no lora) that when I don't prompt for things, I get them missing/broken/differently interpreted far more than with SDXL finetunes. Though, sorry, I assumed crown meant the pseudo-halo thing on top rather than the vertical thing. In either case, IMO, far better indicators are the style bleed that's also very visible in Bluvoll's example, the smaller boobs, the geometry issues RicemanT mentions, the consistently broken text I had with preview 1, or in my examples the very strong style bleed again.

All that being said though, IIRC the forgetting was much worse with preview 1 than 2, and it depends a lot on how you use the model. With longer prompts, with artist styles explicitly mentioned, it is harder to see forgetting (the masterpiece and year tags feel particularly weak). If anything, I'm more interested in why masterpiece and year tags seem so weak in general. That is why tdrussell's prompt works better: it is longer, descriptive, and names a specific artist.

Omit lots = model generates slop close to what it was most recently trained on. Include lots = least problems.

I decided to go back and test Noobai Vpred 1.0 w/ the EQ VAE (the "Experimental" one, not the supposedly better one the person doing the adaptation released later):

Pic Original, ~3000 steps, ~5000 steps, and character I was training for reference.

bluvoll_grid2

I might have omitted "facial mark" in some of the training data, but I don't think I did for blunt bangs, and, well, the only images with grey hair I had in my training data were greyscale/monochrome/spot color images, which were absolutely tagged as such. The character I trained is blonde.
Once again: 1 batch size, 1 gradient accum, sd-scripts. This time 12 dim, 12 alpha, 0.0002, and again no TE training.

Forgetting in this manner happened there too. Short prompts are more prone to forgetting/bleeding/degradation/whatever even when there is no LLM adapter attached to the model.
I was skeptical of the LLM adapter, that it might not be ideal or was at least a problem with preview 1. I wanted some comment from tdrussell on whether the issue is disastrous or will go away, and I got that and I'm happy. Now, especially after seeing Bluvoll's example and going back to Noob, I've changed my mind and I think the issue is overblown.

I'd much rather people complained about the lack of e621 or possibly wished for an easily finetunable very omni controlnet+IPAdapter to work as a not-edit model adapter...

Ok, 1 more reply.

No, that is not how the model works by default because it was trained with dropout and is thus supposed to

supposed to
I judge the model on how it actually works. I am pretty aware of the dropout. Base SDXL is supposed to do bigger text (i.e. when the VAE doesn't kill it).

And as I mentioned, short prompts bad in general.

I decided to go back and test Noobai Vpred 1.0 w/ the EQ VAE (the "Experimental" one, not the supposedly better one the person doing the adaptation released later):

That one doesn't have knowledge of that character, the dataset is older. Your sample images will not work.

I'd much rather people complained about the lack of e621

Ah, ok. I can ignore you whole post. You should've put this 1st line, I would've lost less time.

Ah, ok. I can ignore you whole post. You should've put this 1st line, I would've lost less time.

I don't have so much compute to use a lot of regularization each time, why is the architecture so flawed I have to double or more my training time

Initially I had assumed 1:1, but 22% (~3.5:1) of unrelated random gel images also worked

You didn't read the first one, no need to tell me you didn't want to read the second.

Is it too late to recaption named characters with a discriminator like "#name"? Was using one for characters ever considered, like the @ for artist names, and if so, what reasoning led to the decision not to? Would they become too rigid then and make it more troublesome to prompt alternative traits like different hair color, etc.?

Also, I think Bluvoll's prompt is very bad. You can't complain about a missing crown if you didn't prompt for it. That is how the model works by default, and to be honest, I am increasingly finding that desirable.

I won't deny it's a bad prompt
imagen

But breast size being reduced to a fraction of its prompted size because the data in the lora doesn't have it is, well, concerning. If that style doesn't have such proportions, then what?
I do think that having to prompt Fleur's features is a bad idea; her 'trigger word' or token should leak by nature. Some of her features are present in 90% of her art, like her crown and horn; even if dropout wasn't used, the model would link that horn and crown to "fleurdelys (wuthering waves)"

Also, I think Bluvoll's prompt is very bad. You can't complain about a missing crown if you didn't prompt for it. That is how the model works by default, and to be honest, I am increasingly finding that desirable.

I won't deny it's a bad prompt

Sorry, I assumed the crown was referring to the halo-esque thing above; if it's the vertical thing then I can definitely see that. Otherwise yes, very much agreed, the rest is pretty degraded. I just think your and tdrussell's prompts are polar opposites.

If you want the model to memorize it, just train a single input 1000 times. What else do you want?

This comment has been hidden (marked as Off-Topic)

@tdrussell Just throwing in my feedback. I trained a couple of loras using your diffusion-pipe and did not notice any forgetting, though I did it with a rather diverse synthetic dataset of ~200 images for one and a minimalistic one for the other. I do have one headscratcher: the lora defaults to vampires with large breasts, which is not even in the dataset (though it is totally possible this was picked up from the imagery itself; upon further inspection most characters there have prominent teeth, including fangs, so maybe it catches on that without a prompt needed).
I do back up OP's claim that longer prompts generally drift the style toward what I consider lower quality. But my current conclusion is that this model is really dataset-biased. For example, preview1 has issues with backgrounds, preview2 fixes that, and preview3 has the same issues as preview1: it defaults to a blurry background with a rather flat character (anime coloring).
But there is a real flaw in all models released so far: landscape images. They suffer for whatever reason. Good luck on your journey to the best anime model; don't listen to naysayers.
