Request for fine-tuning script/parameters for parakeet-tdt-0.6b-v3-fr-tv-media

by madoss - opened Apr 22

Apr 22

Thank you for sharing this model! I’ve been testing it on French media content, and the performance is impressive.

I am interested in fine-tuning the model further on a niche dataset. Would you be willing to share the training script, configuration files, or the specific hyperparameters (learning rate, scheduler, freezing layers, etc.) used for this version?

Any insights into the preprocessing pipeline for the fr-tv-media data would also be incredibly helpful.

Thanks again for your contribution to the community!

Best regards,
Mahamadi

Archime

Owner Apr 23

Hi @madoss,

Thank you again for your feedback!
As for the training scripts, I used the official NeMo framework. You can find everything you need in the official repository: https://github.com/NVIDIA-NeMo/NeMo

To help you get started with your niche dataset, here is the exact configuration I used for the parakeet-tdt-0.6b-v3-fr-tv-media fine-tuning:
I kept the encoder frozen (freeze_encoder: True) to retain the base acoustic representations.

name: "parakeet-tdt-0.6b-v3-finetune-french-tv-media"
manifest_dir: ??

init_from_pretrained_model: "nvidia/parakeet-tdt-0.6b-v3"
freeze_encoder: True
freeze_decoder: False
model:
  sample_rate: 16000
  compute_eval_loss: false # Saves VRAM during validation
  log_prediction: true     # Logs sample transcriptions to the console
  skip_nan_grad: false     # Kept false so NaN gradients surface when debugging mixed-precision anomalies

  train_ds:
    manifest_filepath: 
      - ${manifest_dir}/sdp-ft-media-info/splits/train.json
      - ${manifest_dir}/sdp-ft-media-societe/splits/train.json
      - ${manifest_dir}/sdp-ft-media-divertissements/splits/train.json
      - ${manifest_dir}/sdp-ft-media-documentaires/splits/train.json
      - ${manifest_dir}/sdp-ft-media-sports/splits/train.json

    sample_rate: ${model.sample_rate}
    batch_size: 2
    shuffle: true
    num_workers: 2 # Can be increased (e.g., to 8) to feed GPUs more efficiently
    pin_memory: true
    max_duration: 20.0
    min_duration: 1.0
    is_tarred: false
    bucketing_strategy: "fully_randomized"

  validation_ds:
    manifest_filepath:  # To be generated by the dataset split script
      - ${manifest_dir}/sdp-ft-media-info/splits/validation.json
      - ${manifest_dir}/sdp-ft-media-sports/splits/validation.json
      - ${manifest_dir}/sdp-ft-media-societe/splits/validation.json
      - ${manifest_dir}/sdp-ft-media-divertissements/splits/validation.json
      - ${manifest_dir}/sdp-ft-media-documentaires/splits/validation.json

    sample_rate: ${model.sample_rate}
    batch_size: 1 # Kept as small as the training batch size to prevent OOM errors
    shuffle: false
    num_workers: 2
    pin_memory: true
    max_duration: 20.0

  test_ds:
    manifest_filepath: 
      - ${manifest_dir}/sdp-ft-media-info/splits/test.json
      - ${manifest_dir}/sdp-ft-media-societe/splits/test.json
      - ${manifest_dir}/sdp-ft-media-divertissements/splits/test.json
      - ${manifest_dir}/sdp-ft-media-documentaires/splits/test.json
      - ${manifest_dir}/sdp-ft-media-sports/splits/test.json
    sample_rate: ${model.sample_rate}
    batch_size: 1
    shuffle: false
    num_workers: 2
    pin_memory: true
    max_duration: 20.0

  # Retaining the original tokenizer since the target language remains French
  tokenizer:
    update_tokenizer: false
    # dir: null
    # type: bpe

  loss:
    loss_name: "tdt"
    ctc_loss_weight: 0.3    # Explicit 0.3 weight 
    tdt_kwargs:
      fastemit_lambda: 0.0
      clamp: -1.0
      durations:  # Matches model_defaults

  # Data Augmentation (Spectrogram masking)
  # Crucial for preventing overfitting on a 55-hour dataset
  spec_augment:
    _target_: nemo.collections.asr.modules.SpectrogramAugmentation
    freq_masks: 2
    time_masks: 5 # Slightly reduced to avoid overly degrading the audio signal
    freq_width: 27
    time_width: 0.05

  optim:
    name: adamw
    lr: 1e-4 
    betas: [0.9, 0.98]
    weight_decay: 1e-3

    sched:
      name: CosineAnnealing
      warmup_steps: 2000 # ~10% of estimated total steps (adjust based on batch size)
      min_lr: 5e-6

trainer:
  devices: -1 # -1 automatically uses all available GPUs (e.g., 2x RTX 3090)
  num_nodes: 1
  max_epochs: 40 # 20-30 epochs are often enough to converge without overfitting on 55h of data
  val_check_interval: 1.0
  accelerator: gpu
  strategy:
    _target_: lightning.pytorch.strategies.DDPStrategy
    find_unused_parameters: false # Optimization for DDP (Multi-GPU)
  
  # Gradient Accumulation
  # Effective Batch Size = batch_size (2) * devices (2) * accumulate_grad_batches (16) = 64
  accumulate_grad_batches: 16 
  
  gradient_clip_val: 1.0 # Prevents gradient explosion
  precision: 16-mixed # FP16 mixed precision (bf16-mixed would select BF16). Mixed precision is essential for FastConformer Large on <40GB VRAM GPUs
  log_every_n_steps: 10
  enable_progress_bar: True
  
  # Must be False so exp_manager handles checkpoints exclusively.
  # Otherwise, a CheckpointMisconfigurationError occurs due to conflicting saving mechanisms.
  enable_checkpointing: False
  
  # Must be False as exp_manager already handles the logger.
  logger: False 

exp_manager:
  exp_dir: "nemo_parakeet-v3"
  name: ${name}
  create_tensorboard_logger: true
  create_checkpoint_callback: true
  checkpoint_callback_params:
    monitor: "val_wer" # Monitors the Word Error Rate (WER)
    mode: "min"
    save_top_k: 3 # Retains the 3 best checkpoints
    always_save_nemo: True
  resume_if_exists: true
  resume_ignore_no_checkpoint: true
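
As a rough check on the effective batch size and warmup comments in the config above, the arithmetic can be sketched in Python. The utterance count (30,000) is a hypothetical placeholder, since the thread does not state how many clips the training set has; the other values are taken from the config, with 2 GPUs as in the comment on `devices`. The step count is only an approximation of what the trainer will report.

```python
import math


def schedule_summary(num_utterances, batch_size=2, devices=2,
                     accumulate_grad_batches=16, max_epochs=40):
    """Estimate optimizer steps for a DDP run with gradient accumulation."""
    # Samples consumed per optimizer step across all GPUs.
    effective_batch = batch_size * devices * accumulate_grad_batches
    # Under DDP each GPU sees roughly an equal share of the data.
    samples_per_device = math.ceil(num_utterances / devices)
    micro_batches = math.ceil(samples_per_device / batch_size)
    # One optimizer step every `accumulate_grad_batches` micro-batches.
    steps_per_epoch = math.ceil(micro_batches / accumulate_grad_batches)
    total_steps = steps_per_epoch * max_epochs
    return effective_batch, steps_per_epoch, total_steps


# Hypothetical utterance count: not given in the thread.
effective_batch, steps_per_epoch, total_steps = schedule_summary(30_000)
warmup_share = 2000 / total_steps  # warmup_steps from the config

print(f"effective batch size: {effective_batch}")
print(f"steps per epoch:      {steps_per_epoch}")
print(f"total steps:          {total_steps}")
print(f"warmup share:         {warmup_share:.1%}")
```

If you change `batch_size`, `devices`, or `accumulate_grad_batches` for your own dataset, rerunning this with your utterance count shows whether `warmup_steps: 2000` still lands near 10% of the total.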

Let me know if you need any clarifications on the pipeline. Good luck with your fine-tuning!
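
For the data side, here is a minimal sketch of what the `min_duration: 1.0` and `max_duration: 20.0` bounds mean for a manifest, assuming the usual NeMo layout of one JSON object per line with `audio_filepath`, `duration`, and `text` fields. The sample entries and the `filter_manifest` helper are made up for illustration, the inclusive boundaries are an assumption, and this is not the preprocessing used for the fr-tv-media data.

```python
import json


def filter_manifest(lines, min_duration=1.0, max_duration=20.0):
    """Keep entries whose duration lies within [min_duration, max_duration]."""
    kept = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        entry = json.loads(line)
        if min_duration <= entry["duration"] <= max_duration:
            kept.append(entry)
    return kept


# Made-up manifest lines, for illustration only.
sample_lines = [
    '{"audio_filepath": "clip_0001.wav", "duration": 0.6, "text": "oui"}',
    '{"audio_filepath": "clip_0002.wav", "duration": 7.4, "text": "bonsoir"}',
    '{"audio_filepath": "clip_0003.wav", "duration": 23.1, "text": "merci"}',
]

kept = filter_manifest(sample_lines)
kept_hours = sum(entry["duration"] for entry in kept) / 3600

print(f"kept {len(kept)} of {len(sample_lines)} entries")
print(f"kept audio: {kept_hours:.4f} h")
```

Running the same function over the lines of your own `train.json` gives a quick estimate of how many hours remain after the duration bounds are applied.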

Best regards,
Archime

madoss

Apr 24

Thank you.
I will start by fine-tuning on your dataset.
