LoRA for Neuron
A LoRA (Low-Rank Adaptation) implementation optimized for distributed training on AWS Trainium devices. This module provides parameter-efficient fine-tuning with support for tensor parallelism and sequence parallelism.
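For orientation, the sketch below shows the standard LoRA update that the adapters implement; the dimensions are illustrative and not tied to any Neuron API.

```python
import torch

# Standard LoRA update for a frozen weight W of shape (out_features, in_features):
# train a low-rank pair A (r x in) and B (out x r) and add scale * (B @ A) to W.
out_features, in_features, r, lora_alpha = 4096, 4096, 16, 32
A = torch.randn(r, in_features)
B = torch.zeros(out_features, r)        # B starts at zero, so the initial delta is zero
delta_w = (lora_alpha / r) * (B @ A)    # same shape as W

full_params = out_features * in_features
lora_params = r * (in_features + out_features)
print(f"trainable params: {lora_params} vs {full_params} "
      f"({lora_params / full_params:.2%} of full fine-tuning)")
```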
PEFT Model Classes
NeuronPeftModel
class optimum.neuron.peft.NeuronPeftModel
( model: PreTrainedModel, peft_config: PeftConfig, adapter_name: str = 'default', autocast_adapter_dtype: bool = True, **kwargs: Any )
NeuronPeftModelForCausalLM
class optimum.neuron.peft.NeuronPeftModelForCausalLM
( model: PreTrainedModel, peft_config: PeftConfig, adapter_name: str = 'default', autocast_adapter_dtype: bool = True, **kwargs: Any )
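As a minimal sketch (the checkpoint name and target modules are placeholders), the class can be instantiated directly with the signature above, although get_peft_model documented below is the usual entry point:

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig
from optimum.neuron.peft import NeuronPeftModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("my-org/my-causal-lm")  # placeholder checkpoint
lora_config = LoraConfig(r=8, lora_alpha=16, target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")

# Mirrors the constructor signature documented above.
peft_model = NeuronPeftModelForCausalLM(model, lora_config, adapter_name="default")
```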
LoRA Layer Implementations
Base LoRA Layer
class optimum.neuron.peft.tuners.lora.layer.NeuronLoraLayer
( base_layer: Module, ephemeral_gpu_offload: bool = False, **kwargs )
Parallel Linear LoRA
class optimum.neuron.peft.tuners.lora.layer.ParallelLinear
( base_layer, adapter_name: str, r: int = 0, lora_alpha: int = 1, lora_dropout: float = 0.0, fan_in_fan_out: bool = False, is_target_conv_1d_layer: bool = False, init_lora_weights: bool | str = True, use_rslora: bool = False, use_dora: bool = False, lora_bias: bool = False, **kwargs )
merge
( safe_merge: bool = False, adapter_names: list[str] | None = None )
Merge the active adapter weights into the base weights.
This works with distributed parallel linear layers (RowParallelLinear, ColumnParallelLinear). The merge happens on the sharded weights: each rank merges its own shard.
unmerge
Unmerge all merged adapter layers from the base weights.
This works with distributed parallel linear layers (RowParallelLinear, ColumnParallelLinear). The unmerge happens on the sharded weights: each rank unmerges its own shard.
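As a sketch (assuming a peft_model built as in the earlier example), merging and unmerging can be driven per layer:

```python
from optimum.neuron.peft.tuners.lora.layer import ParallelLinear

# Merge each sharded LoRA layer in place, run without adapter overhead,
# then restore the separate adapter weights.
for module in peft_model.modules():
    if isinstance(module, ParallelLinear):
        module.merge(safe_merge=True)   # each rank merges its own weight shard
# ... run evaluation or export with merged weights ...
for module in peft_model.modules():
    if isinstance(module, ParallelLinear):
        module.unmerge()                # each rank restores its own weight shard
```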
GQA QKV Column Parallel LoRA
class optimum.neuron.peft.tuners.lora.layer.GQAQKVColumnParallelLinear
( base_layer, adapter_name: str, r: int = 0, lora_alpha: int = 1, lora_dropout: float = 0.0, fan_in_fan_out: bool = False, is_target_conv_1d_layer: bool = False, init_lora_weights: bool | str = True, use_rslora: bool = False, use_dora: bool = False, lora_bias: bool = False, **kwargs )
get_delta_weight
( adapter: str )
Compute the delta weights for Q, K, V for the given adapter.
Returns a dict with keys “q”, “k”, “v” (or “qkv” if fused) containing the delta tensors.
merge
( safe_merge: bool = False, adapter_names: list[str] | None = None )
Merge the active adapter weights into the base Q, K, V weights.
This works with GQAQKVColumnParallelLinear layers. The merge happens on the sharded weights: each rank merges its own shard.
unmerge
Unmerge all merged adapter layers from the base Q, K, V weights.
This works with GQAQKVColumnParallelLinear layers. The unmerge happens on the sharded weights: each rank unmerges its own shard.
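For example, the per-adapter deltas of a fused GQA projection can be inspected as follows (a sketch assuming a peft_model with an adapter named "default"):

```python
from optimum.neuron.peft.tuners.lora.layer import GQAQKVColumnParallelLinear

for name, module in peft_model.named_modules():
    if isinstance(module, GQAQKVColumnParallelLinear):
        deltas = module.get_delta_weight("default")
        # Keys are "q", "k", "v" (or "qkv" when the projection is fused).
        for key, tensor in deltas.items():
            print(name, key, tuple(tensor.shape))
```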
Parallel Embedding LoRA
class optimum.neuron.peft.tuners.lora.layer.ParallelEmbedding
( base_layer: Module, adapter_name: str, r: int = 0, lora_alpha: int = 1, lora_dropout: float = 0.0, fan_in_fan_out: bool = False, init_lora_weights: bool | str = True, use_rslora: bool = False, use_dora: bool = False, lora_bias: bool = False, **kwargs )
merge
( safe_merge: bool = False, adapter_names: list[str] | None = None )
Merge the active adapter weights into the base embedding weights.
This works with ParallelEmbedding layers. The merge happens on the sharded weights: each rank merges its own shard.
unmerge
Unmerge all merged adapter layers from the base embedding weights.
This works with ParallelEmbedding layers. The unmerge happens on the sharded weights: each rank unmerges its own shard.
LoRA Model
NeuronLoraModel
class optimum.neuron.peft.tuners.NeuronLoraModel
( model, config, adapter_name, low_cpu_mem_usage: bool = False )
Utility Functions
get_peft_model
optimum.neuron.peft.get_peft_model
( model: PreTrainedModel, peft_config: PeftConfig, adapter_name: str = 'default', mixed: bool = False, autocast_adapter_dtype: bool = True, revision: str | None = None, low_cpu_mem_usage: bool = False )
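Typical usage mirrors the upstream PEFT API; the checkpoint name and target modules below are illustrative:

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig
from optimum.neuron.peft import get_peft_model

# In a real Trainium run the base model is usually built through the Optimum Neuron
# training APIs so that tensor/sequence parallelism is applied before wrapping.
model = AutoModelForCausalLM.from_pretrained("my-org/my-causal-lm")  # placeholder checkpoint

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # model-specific names
    task_type="CAUSAL_LM",
)

peft_model = get_peft_model(model, lora_config)
peft_model.print_trainable_parameters()
```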
Architecture Support
The Neuron LoRA implementation supports the following parallel layer types:
- ColumnParallelLinear: For layers that split weights along the output dimension
- RowParallelLinear: For layers that split weights along the input dimension
- ParallelEmbedding: For embedding layers distributed across ranks
- GQAQKVColumnParallelLinear: For fused query/key/value projections in grouped-query attention models, where the key/value heads require special handling under tensor parallelism
Each layer type has a corresponding LoRA implementation that maintains the parallelization strategy while adding low-rank adaptation capabilities.
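As a rough check (assuming the standard PEFT get_base_layer() accessor is available on these layers), you can list which parallel base layer each LoRA wrapper holds:

```python
from optimum.neuron.peft.tuners.lora.layer import NeuronLoraLayer

# Each targeted parallel layer is wrapped by a Neuron LoRA layer; the sharded
# base layer is kept inside the wrapper unchanged.
for name, module in peft_model.named_modules():
    if isinstance(module, NeuronLoraLayer):
        base = module.get_base_layer()  # assumed: standard PEFT BaseTunerLayer accessor
        print(name, type(module).__name__, "wraps", type(base).__name__)
```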
Key Features
- Distributed Training: Full support for tensor parallelism and sequence parallelism
- Checkpoint Consolidation: Automatic conversion between sharded and consolidated checkpoints
- Weight Transformation: Seamless integration with model weight transformation specs
- Compatibility: Works with all supported custom modeling architectures in Optimum Neuron