# QwenImageTransformer2DModel

The model can be loaded with the following code snippet.

```python
import torch

from diffusers import QwenImageTransformer2DModel

transformer = QwenImageTransformer2DModel.from_pretrained(
    "Qwen/QwenImage-20B", subfolder="transformer", torch_dtype=torch.bfloat16
)
```
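
To generate images, the transformer is typically used inside a pipeline such as `QwenImagePipeline`. A minimal sketch, reusing the repository id from the snippet above and assuming a CUDA device is available:

```python
import torch

from diffusers import QwenImagePipeline, QwenImageTransformer2DModel

transformer = QwenImageTransformer2DModel.from_pretrained(
    "Qwen/QwenImage-20B", subfolder="transformer", torch_dtype=torch.bfloat16
)

# Pass the standalone transformer to the text-to-image pipeline so it replaces
# the one bundled with the checkpoint.
pipe = QwenImagePipeline.from_pretrained(
    "Qwen/QwenImage-20B", transformer=transformer, torch_dtype=torch.bfloat16
)
pipe.to("cuda")

image = pipe("a cup of coffee on a wooden table", num_inference_steps=30).images[0]
image.save("qwen_image.png")
```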

## QwenImageTransformer2DModel[[diffusers.QwenImageTransformer2DModel]]

#### diffusers.QwenImageTransformer2DModel[[diffusers.QwenImageTransformer2DModel]]

[Source](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/transformers/transformer_qwenimage.py#L750)

The Transformer model introduced in Qwen-Image.

#### forward[[diffusers.QwenImageTransformer2DModel.forward]]

[Source](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/transformers/transformer_qwenimage.py#L850)

`forward(hidden_states: torch.Tensor, encoder_hidden_states: torch.Tensor = None, encoder_hidden_states_mask: torch.Tensor = None, timestep: torch.LongTensor = None, img_shapes: list[tuple[int, int, int]] | None = None, txt_seq_lens: list[int] | None = None, guidance: torch.Tensor = None, attention_kwargs: dict[str, Any] | None = None, controlnet_block_samples=None, additional_t_cond=None, return_dict: bool = True)`

**Parameters:**

- **hidden_states** (`torch.Tensor` of shape `(batch_size, image_sequence_length, in_channels)`) --
  Input `hidden_states`.
- **encoder_hidden_states** (`torch.Tensor` of shape `(batch_size, text_sequence_length, joint_attention_dim)`) --
  Conditional embeddings (embeddings computed from the input conditions such as prompts) to use.
- **encoder_hidden_states_mask** (`torch.Tensor` of shape `(batch_size, text_sequence_length)`, *optional*) --
  Mask for the encoder hidden states. Expected to have 1.0 for valid tokens and 0.0 for padding tokens.
  Used in the attention processor to prevent attending to padding tokens. The mask can have any pattern
  (not just contiguous valid tokens followed by padding) since it's applied element-wise in attention.
- **timestep** (`torch.LongTensor`) --
  Used to indicate denoising step.
- **img_shapes** (`list[tuple[int, int, int]]`, *optional*) --
  Image shapes for RoPE computation.
- **txt_seq_lens** (`list[int]`, *optional*, **Deprecated**) --
  Deprecated parameter. Use `encoder_hidden_states_mask` instead. If provided, the maximum value will be
  used to compute RoPE sequence length.
- **guidance** (`torch.Tensor`, *optional*) --
  Guidance tensor for conditional generation.
- **attention_kwargs** (`dict`, *optional*) --
  A kwargs dictionary that if specified is passed along to the `AttentionProcessor` as defined under
  `self.processor` in
  [diffusers.models.attention_processor](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/attention_processor.py).
- **controlnet_block_samples** (*optional*) --
  ControlNet block samples to add to the transformer blocks.
- **return_dict** (`bool`, *optional*, defaults to `True`) --
  Whether or not to return a `~models.transformer_2d.Transformer2DModelOutput` instead of a plain
  tuple.

**Returns:**

If `return_dict` is True, a `~models.transformer_2d.Transformer2DModelOutput` is returned, otherwise a
`tuple` where the first element is the sample tensor.

The `QwenImageTransformer2DModel` forward method.
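
Below is a minimal smoke-test sketch of calling `forward` directly with a deliberately tiny, hypothetical configuration (not the released checkpoint's configuration), just to illustrate the expected tensor shapes. `axes_dims_rope` is assumed to need to sum to `attention_head_dim`, as the documented defaults do (16 + 56 + 56 = 128).

```python
import torch

from diffusers import QwenImageTransformer2DModel

# Hypothetical tiny configuration for a quick shape check on CPU.
model = QwenImageTransformer2DModel(
    patch_size=2,
    in_channels=16,
    num_layers=2,
    attention_head_dim=32,
    num_attention_heads=4,
    joint_attention_dim=64,
    axes_dims_rope=(8, 12, 12),  # assumed to sum to attention_head_dim
)

batch_size, image_len, text_len = 1, 256, 8      # 256 tokens = a 16x16 grid of latent tokens
hidden_states = torch.randn(batch_size, image_len, 16)
encoder_hidden_states = torch.randn(batch_size, text_len, 64)
encoder_hidden_states_mask = torch.ones(batch_size, text_len)  # 1.0 for valid tokens
timestep = torch.tensor([500])                   # a pipeline would supply the scheduler's timesteps

output = model(
    hidden_states=hidden_states,
    encoder_hidden_states=encoder_hidden_states,
    encoder_hidden_states_mask=encoder_hidden_states_mask,
    timestep=timestep,
    img_shapes=[(1, 16, 16)],                    # one sample: (frames, token grid height, token grid width)
    return_dict=True,
)
print(output.sample.shape)                       # (batch_size, image_len, output channels), still in sequence form
```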

**Parameters:**

patch_size (`int`, defaults to `2`) : Patch size to turn the input data into small patches.

in_channels (`int`, defaults to `64`) : The number of channels in the input.

out_channels (`int`, *optional*, defaults to `None`) : The number of channels in the output. If not specified, it defaults to `in_channels`.

num_layers (`int`, defaults to `60`) : The number of layers of dual stream DiT blocks to use.

attention_head_dim (`int`, defaults to `128`) : The number of dimensions to use for each attention head.

num_attention_heads (`int`, defaults to `24`) : The number of attention heads to use.

joint_attention_dim (`int`, defaults to `3584`) : The number of dimensions to use for the joint attention (embedding/channel dimension of `encoder_hidden_states`).

guidance_embeds (`bool`, defaults to `False`) : Whether to use guidance embeddings for guidance-distilled variant of the model.

axes_dims_rope (`tuple[int]`, defaults to `(16, 56, 56)`) : The dimensions to use for the rotary positional embeddings.

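To check which values of these constructor arguments a given checkpoint uses without downloading its weights, the stored configuration can be read directly. A small sketch, reusing the repository id from the loading snippet at the top of this page:

```python
from diffusers import QwenImageTransformer2DModel

# Fetch only the config entry of the repository; the keys mirror the parameters listed above.
config = QwenImageTransformer2DModel.load_config("Qwen/QwenImage-20B", subfolder="transformer")
print(config["num_layers"], config["num_attention_heads"], config["attention_head_dim"])
```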

## Transformer2DModelOutput[[diffusers.models.modeling_outputs.Transformer2DModelOutput]]

#### diffusers.models.modeling_outputs.Transformer2DModelOutput[[diffusers.models.modeling_outputs.Transformer2DModelOutput]]

[Source](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/modeling_outputs.py#L21)

The output of [Transformer2DModel](/docs/diffusers/main/en/api/models/transformer2d#diffusers.Transformer2DModel).

**Parameters:**

sample (`torch.Tensor` of shape `(batch_size, num_channels, height, width)` or `(batch_size, num_vector_embeds - 1, num_latent_pixels)` if [Transformer2DModel](/docs/diffusers/main/en/api/models/transformer2d#diffusers.Transformer2DModel) is discrete) : The hidden states output conditioned on the `encoder_hidden_states` input. If discrete, returns probability distributions for the unnoised latent pixels.
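
As a quick illustration, the output class is a small dataclass that wraps the predicted tensor in its `sample` attribute (the shape below is illustrative only):

```python
import torch

from diffusers.models.modeling_outputs import Transformer2DModelOutput

# Wrap a dummy prediction; calling the transformer with return_dict=False would instead
# return the same tensor as the first element of a plain tuple.
out = Transformer2DModelOutput(sample=torch.zeros(1, 256, 64))
print(out.sample.shape)  # torch.Size([1, 256, 64])
```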

