Finetune practice #57

SWivid · 2024-10-14T01:49:59Z

SWivid
Oct 14, 2024
Maintainer

Full fine-tuning is currently supported; LoRA and adapter-based fine-tuning are not yet.

Set checkpoint_path to the pretrained model directory in test_train.py; model/trainer.py will load from there to resume. Reuse the vocab.txt under data/Emilia_ZH_EN_pinyin (the directory name comes from tokenizer = "pinyin" and dataset_name = "Emilia_ZH_EN" in the test_train.py settings).
To prepare fine-tuning data, see model/dataset.py. Each sample only needs the audio path, the text (tokenized; use the convert_char_to_pinyin function in model/utils.py, see script/prepare_xxxx.py), and the audio duration in seconds.

def __getitem__(self, index):
    row = self.data[index]
    audio_path = row["audio_path"]
    text = row["text"]
    duration = row["duration"]
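
As a rough illustration of what one prepared row could look like, the sketch below builds those three fields for a PCM WAV file using only the standard library. The helper names wav_duration_seconds and make_row are made up for this example and are not part of the repo; the text is assumed to be tokenized already.

```python
import wave
from pathlib import Path


def wav_duration_seconds(path):
    """Return the duration of a PCM WAV file in seconds (frames / sample rate)."""
    with wave.open(str(path), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def make_row(audio_path, tokenized_text):
    """Build one sample with the three fields read in __getitem__ (hypothetical helper)."""
    return {
        "audio_path": Path(audio_path).as_posix(),
        "text": tokenized_text,
        "duration": wav_duration_seconds(audio_path),
    }
```

The standard-library wave module only reads uncompressed WAV; other formats would need an audio library instead.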

Set a smaller batch size according to your GPU memory. grad_accumulation_steps can be used to simulate a larger batch size. Also adjust other settings, e.g. fewer warmup steps, a 1e-4 learning rate, etc.
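
To make the accumulation arithmetic concrete, here is a small sketch. Both function names are hypothetical, and it assumes the effective batch is simply per-GPU batch size × accumulation steps × number of GPUs.

```python
import math


def effective_batch_size(batch_size_per_gpu, grad_accumulation_steps, num_gpus=1):
    """Batch size seen by each optimizer update under the assumption above."""
    return batch_size_per_gpu * grad_accumulation_steps * num_gpus


def accumulation_steps_for(target_batch_size, batch_size_per_gpu, num_gpus=1):
    """Smallest number of accumulation steps that reaches the target batch size."""
    per_step = batch_size_per_gpu * num_gpus
    return max(1, math.ceil(target_batch_size / per_step))
```

For example, a single GPU that fits 4800 frames per batch would need 8 accumulation steps to match a 38400-frame batch.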

We didn't specifically experiment with fine-tuning, so if you get positive results, you're welcome to share them :)

Some helpful issues: #16, #27

Feel free to share your successful fine-tuning results, and maybe also start a new tutorial doc to help others get started.
Many thanks!

acul3 · 2024-10-14T08:03:05Z

acul3
Oct 14, 2024

Hello @SWivid, do you think it is possible to fine-tune the pretrained model on a new language?

I'm planning to add another language: Italian + English (to avoid catastrophic forgetting).

51 replies

JarodMica Dec 4, 2024
Collaborator

Yeah, the model should only see hiragana, whether that be at inference or training just because you'd probably need a LOT of data to generalize all those tokens in kanji (other readings etc, also make it suboptimal IMO).

You might find this useful if you format data in the same way the GUI does to convert train text files

https://github.com/JarodMica/tortoise_dataset_tools/blob/master/japanese_tools/hiragana_train_file.py

yiwei0730 Dec 4, 2024

def is_japanese(c):
    return (
        "\u3040" <= c <= "\u309f"  # Hiragana
        or "\u30a0" <= c <= "\u30ff"  # Katakana
        or "\uff66" <= c <= "\uff9f"  # Half-width Katakana
    )

# ... excerpt: the mixed-text branch inside the pinyin conversion loop
            else:  # if mixed chinese characters, alphabets and symbols
                for c in seg:
                    if ord(c) < 256:
                        char_list.extend(c)
                    elif is_japanese(c):  # pass kana through unchanged
                        char_list.append(c)
                    else:
                        if c not in "。，、；：？！《》【】—…":
                            if not char_list or not is_japanese(char_list[-1]):
                                char_list.append(" ")
                            char_list.extend(lazy_pinyin(c, style=Style.TONE3, tone_sandhi=True))
                        else:  # if is zh punc
                            char_list.append(c)

Thanks for your repo, I see. So first I need to use pykakasi to preprocess my dataset and add this conversion code, then I can use the original vocab for fine-tuning. Is that correct? If I'm missing something, please tell me. Thanks!

yiwei0730 Dec 4, 2024

Thanks, this helps me a lot!

JarodMica Dec 4, 2024
Collaborator

About right. One small clarification: you do not need that conversion code for training, only for inference.

The steps would be:

Convert the kanji in your dataset to all hiragana with pykakasi.
Train/fine-tune with the original code.

Then, after training, add the is_japanese function to the pinyin conversion you use at inference.
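
The preprocessing step could be sketched roughly as follows. This is an illustration only, not the GUI's code: to_hiragana and non_kana_chars are hypothetical helpers, and it assumes pykakasi is installed and that its kakasi().convert() returns a list of dicts with a "hira" key.

```python
def is_japanese(c):
    """True for hiragana, katakana, and half-width katakana characters."""
    return (
        "\u3040" <= c <= "\u309f"
        or "\u30a0" <= c <= "\u30ff"
        or "\uff66" <= c <= "\uff9f"
    )


def to_hiragana(text, convert=None):
    """Convert Japanese text to hiragana.

    `convert` is any callable returning a list of dicts with a "hira" key.
    When omitted, pykakasi is imported lazily (assumption: it is installed).
    """
    if convert is None:
        import pykakasi
        convert = pykakasi.kakasi().convert
    return "".join(item["hira"] for item in convert(text))


def non_kana_chars(text):
    """Return non-Latin characters that are not kana (leftover kanji, CJK punctuation)."""
    return sorted({c for c in text if ord(c) >= 256 and not is_japanese(c)})
```

Running non_kana_chars over the converted transcripts is a quick way to spot kanji that slipped through before training starts.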

yiwei0730 Dec 4, 2024

Yeah, got it!
Just preprocess the Japanese dataset yourself and run the fine-tune, then use the is_japanese function at inference.

kunibald413 · 2024-10-16T11:52:34Z

kunibald413
Oct 16, 2024

Create a dataset:

Audio files of maybe 3-12 s duration (I'm not sure what's ideal), plus their transcripts.

/your_dataset
|-- metadata.csv
|-- wavs/
|   |-- audio_0001.wav
|   |-- audio_0002.wav
|   `-- ...

metadata.csv contents:

<relative_path_to_wav>|<transcript>

audio_file|text
wavs/audio_0001.wav|Yo! Hello? Hello?
wavs/audio_0002.wav|Hi, how are you doing today? I want to go shopping and buy me some lemons.
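
If the transcripts already live in Python, a file in this layout could be written with the standard csv module. This is only a sketch; write_metadata is a hypothetical helper, not part of the repo.

```python
import csv
from pathlib import Path


def write_metadata(dataset_dir, pairs):
    """Write metadata.csv with an 'audio_file|text' header and one row per clip.

    `pairs` is an iterable of (relative_wav_path, transcript) tuples.
    """
    out_path = Path(dataset_dir) / "metadata.csv"
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f, delimiter="|")
        writer.writerow(["audio_file", "text"])  # header row
        for rel_wav_path, transcript in pairs:
            writer.writerow([rel_wav_path, transcript])
    return out_path
```

Using csv.writer rather than manual string joins means a transcript that happens to contain "|" is quoted and can still be read back with csv.reader.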

Call the script to prepare the dataset.
It doesn't handle other tokenizers: it always assumes an English dataset and the pinyin tokenizer, but you can adjust it to your liking.

python scripts/prepare_csv_wavs.py <path_to_your_dataset> <F5-TTS_repo_data_path>/<dataset_name>_pinyin

example:

python scripts/prepare_csv_wavs.py /my_pc/your_dataset /my_pc/F5-TTS/data/your_dataset_pinyin

Adjust the hyperparameters in train.py.

Set dataset_name to the name of your dataset in the F5-TTS data folder:

dataset_name = "your_dataset"

Play around with these parameters and see what gives the best results.

Set max_samples to 2, or whatever you see fit:

max_samples = 2

Also play around with the learning rate; I don't know which value is best:

learning_rate = 5e-06

Change epochs and warmup to whatever you see fit for your dataset.
Maybe for 100 audio files, 10 epochs and 20 warmup steps are fine; I have no clue.

epochs = 10  # use linear decay, thus epochs control the slope
num_warmup_updates = 20  # warmup steps

Adjust this to your dataset size, e.g. for 100 audio files and max_samples = 2, maybe 500.
Alternatively, add code to the trainer to save a final checkpoint when training is done.

last_per_steps = 500  # save last checkpoint per steps
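
One way to arrive at a number like 500 is to estimate the total step count. The sketch below uses a hypothetical helper and assumes every batch is filled to max_samples; frame-based batching can produce smaller batches, so treat the result as a lower bound.

```python
import math


def estimate_total_steps(num_clips, max_samples, epochs):
    """Lower-bound estimate of training steps: batches per epoch times epochs."""
    batches_per_epoch = math.ceil(num_clips / max_samples)
    return batches_per_epoch * epochs
```

With 100 clips, max_samples = 2 and 10 epochs this gives 50 batches per epoch and 500 steps in total.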

python train.py

Hopefully we can find good hyperparameters for good fine-tuning results.

This could go into the scripts folder as prepare_csv_wavs.py, @SWivid.

Again, it doesn't handle other tokenizers and always assumes an English dataset with the pinyin tokenizer; adjust it to your liking.

import sys, os
sys.path.append(os.getcwd())

from pathlib import Path
import json
import shutil
import argparse

from tqdm import tqdm
from datasets.arrow_writer import ArrowWriter

from model.utils import (
    convert_char_to_pinyin,
)

PRETRAINED_VOCAB_PATH = Path(__file__).parent.parent / "data/Emilia_ZH_EN_pinyin/vocab.txt"

def is_csv_wavs_format(input_dataset_dir):
    fpath = Path(input_dataset_dir)
    metadata = fpath / "metadata.csv"
    wavs = fpath / 'wavs'
    return metadata.exists() and metadata.is_file() and wavs.exists() and wavs.is_dir()


def prepare_csv_wavs_dir(input_dir):
    assert is_csv_wavs_format(input_dir), f"not csv_wavs format: {input_dir}"
    input_dir = Path(input_dir)
    metadata_path = input_dir / "metadata.csv"
    audio_path_text_pairs = read_audio_text_pairs(metadata_path.as_posix())

    sub_result, durations = [], []
    vocab_set = set()
    polyphone = True
    for audio_path, text in audio_path_text_pairs:
        if not Path(audio_path).exists():
            print(f"audio {audio_path} not found, skipping")
            continue
        audio_duration = get_audio_duration(audio_path)
        # assume tokenizer = "pinyin"  ("pinyin" | "char")
        text = convert_char_to_pinyin([text], polyphone=polyphone)[0]
        sub_result.append({"audio_path": audio_path, "text": text, "duration": audio_duration})
        durations.append(audio_duration)
        vocab_set.update(list(text))

    return sub_result, durations, vocab_set

def get_audio_duration(audio_path):
    import torchaudio
    # torchaudio.load returns a (channels, frames) tensor, so the duration is
    # frames / sample_rate regardless of the channel count
    audio, sample_rate = torchaudio.load(audio_path)
    return audio.shape[1] / sample_rate

def read_audio_text_pairs(csv_file_path):
    import csv
    audio_text_pairs = []

    parent = Path(csv_file_path).parent
    with open(csv_file_path, mode='r', newline='', encoding='utf-8') as csvfile:
        reader = csv.reader(csvfile, delimiter='|')
        next(reader)  # Skip the header row
        for row in reader:
            if len(row) >= 2:
                audio_file = row[0].strip()  # First column: audio file path
                text = row[1].strip()          # Second column: text
                audio_file_path = parent / audio_file
                audio_text_pairs.append((audio_file_path.as_posix(), text))

    return audio_text_pairs


def save_prepped_dataset(out_dir, result, duration_list, text_vocab_set, is_finetune):
    out_dir = Path(out_dir)
    # save preprocessed dataset to disk
    out_dir.mkdir(exist_ok=True, parents=True)
    print(f"\nSaving to {out_dir} ...")

    # dataset = Dataset.from_dict({"audio_path": audio_path_list, "text": text_list, "duration": duration_list})  # oom
    # dataset.save_to_disk(f"data/{dataset_name}/raw", max_shard_size="2GB")
    raw_arrow_path = out_dir / "raw.arrow"
    with ArrowWriter(path=raw_arrow_path.as_posix(), writer_batch_size=1) as writer:
        for line in tqdm(result, desc=f"Writing to raw.arrow ..."):
            writer.write(line)

    # dup a json separately saving duration in case for DynamicBatchSampler ease
    dur_json_path = out_dir / "duration.json"
    with open(dur_json_path.as_posix(), 'w', encoding='utf-8') as f:
        json.dump({"duration": duration_list}, f, ensure_ascii=False)

    # vocab map, i.e. tokenizer
    # add alphabets and symbols (optional, if plan to ft on de/fr etc.)
    # if tokenizer == "pinyin":
    #     text_vocab_set.update([chr(i) for i in range(32, 127)] + [chr(i) for i in range(192, 256)])
    voca_out_path = out_dir / "vocab.txt"
    if is_finetune:
        # fine-tuning reuses the pretrained vocab instead of the dataset's own
        file_vocab_finetune = PRETRAINED_VOCAB_PATH.as_posix()
        shutil.copy2(file_vocab_finetune, voca_out_path)
    else:
        with open(voca_out_path, "w", encoding="utf-8") as f:
            for vocab in sorted(text_vocab_set):
                f.write(vocab + "\n")

    dataset_name = out_dir.stem
    print(f"\nFor {dataset_name}, sample count: {len(result)}")
    print(f"For {dataset_name}, vocab size is: {len(text_vocab_set)}")
    print(f"For {dataset_name}, total {sum(duration_list)/3600:.2f} hours")


def prepare_and_save_set(inp_dir, out_dir, is_finetune: bool = True):
    if is_finetune:
        assert PRETRAINED_VOCAB_PATH.exists(), f"pretrained vocab.txt not found: {PRETRAINED_VOCAB_PATH}"
    sub_result, durations, vocab_set = prepare_csv_wavs_dir(inp_dir)
    save_prepped_dataset(out_dir, sub_result, durations, vocab_set, is_finetune)


def cli():
    # finetune: python script.py /path/to/input_dir /path/to/output_dir
    # pretrain: python script.py /path/to/input_dir /path/to/output_dir --pretrain
    parser = argparse.ArgumentParser(description="Prepare and save dataset.")
    parser.add_argument('inp_dir', type=str, help="Input directory containing the data.")
    parser.add_argument('out_dir', type=str, help="Output directory to save the prepared data.")
    parser.add_argument('--pretrain', action='store_true', help="Enable for new pretrain, otherwise is a fine-tune")

    args = parser.parse_args()

    prepare_and_save_set(args.inp_dir, args.out_dir, is_finetune=not args.pretrain)

if __name__ == "__main__":
    cli()

8 replies

TEJASAMA-TECH Nov 5, 2024

F5-TTS/src/f5_tts/api.py

Line 66 in 6a104b4

self.ema_model = load_model(model_cls, model_cfg, ckpt_file, vocab_file, ode_method, use_ema, self.device)

at time of writing set ckpt_file to the path to your model @TEJASAMA-TECH
self.ema_model = load_model(model_cls, model_cfg, ckpt_file, vocab_file, ode_method, use_ema, self.device)

thanks a lot @kunibald413

kostum123 Nov 5, 2024

prepare_csv_wavs.py

While using the prepare_csv_wavs.py file, if our language characters are contained inside PRETRAINED_VOCAB_PATH = files("f5_tts").joinpath("../../data/Emilia_ZH_EN_pinyin/vocab.txt"), do we need to edit the prepare_csv_wavs code to make it work with languages other than English and Latin-based languages? Should we change the tokenizer to a character-based one, or is it acceptable to keep it as is and run python src/f5_tts/train/datasets/prepare_csv_wavs.py?
@SWivid @kunibald413

kunibald413 Nov 5, 2024

the csv_wavs script is mostly copy paste from prepare_emilia.py

this is how they deal with different tokenizers:

F5-TTS/src/f5_tts/train/datasets/prepare_emilia.py

Line 140 in 4a69e6b

if tokenizer == "pinyin":

i assume at least this line would need to change, it always assumes pinyin tokenizer:

F5-TTS/src/f5_tts/train/datasets/prepare_csv_wavs.py

Line 47 in 4a69e6b

# assume tokenizer = "pinyin"  ("pinyin" | "char")
text = convert_char_to_pinyin([text], polyphone=polyphone)[0]

I'm not sure about the implications, and maybe there's more that needs to change.
Currently I don't have much time to look into it; preferably SWivid fixes it up if he has time.

If you know what's going on there, feel free to adjust it accordingly and open a merge request @kostum123

TEJASAMA-TECH Nov 8, 2024

Hi @kunibald413, I'm currently finetuning F5-TTS with gradio interface, it's really amazing. But while I'm trying to finetune it from the previous checkpoint, I'm facing the below issue:

finetune_cli.py: error: unrecognized arguments: --file_checkpoint_train /home/teja/voice-cloning/F5-TTS/ckpts/test_en/model_last.pt

Can you please help me with this. thank you.

HuuHuy227 Nov 8, 2024

Try changing --file_checkpoint_train to --pretrain in finetune_cli.py (Line 454). The argument names are inconsistent between finetune_cli.py and finetune_gradio.py.

lpscr · 2024-10-16T18:56:17Z

lpscr
Oct 16, 2024

@kunibald413 Thank you for the script. I already created something similar here two days ago: #62 (comment)

Could you update this part? I think it would be nice to have it like this:

def format_seconds_to_hms(seconds):
    hours = int(seconds / 3600)
    minutes = int((seconds % 3600) / 60)
    seconds = seconds % 60
    return "{:02d}:{:02d}:{:02d}".format(hours, minutes, int(seconds))

# at the end of save_prepped_dataset:
    print(f"\nFor {dataset_name}, sample count: {len(result)}")
    print(f"For {dataset_name}, vocab size is: {len(text_vocab_set)}")
    print(f"For {dataset_name}, total {format_seconds_to_hms(sum(duration_list))}")
    print(f"For {dataset_name}, min {min(duration_list)} sec")
    print(f"For {dataset_name}, max {max(duration_list)} sec")

before

For , sample count: 242
For , vocab size is: 53
For , total 0.20 hours

after

For , sample count: 242
For , vocab size is: 53
For , total 00:12:17 
For , min 1.519 sec
For , max 8.294 sec

1 reply

kunibald413 Oct 16, 2024

it's merged into main repo, feel free to adjust to your liking

mhenrichsen · 2024-10-16T20:52:16Z

mhenrichsen
Oct 16, 2024

The code @kunibald413 has provided works.

However, when training it seems to initialize from a model with random weights. Can we initialize from the trained model weights instead?

1 reply

thunn Oct 16, 2024

By default, the training script will look for an existing model in ckpts/<exp_name>/model_last.pt (exp_name set in the script)

If you place a model at that path, it will be loaded in as the base model
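
Placing the pretrained checkpoint could be scripted along these lines. This is a sketch under the path layout described above; place_base_checkpoint and the ckpts_root default are assumptions for illustration, not part of the training script.

```python
import shutil
from pathlib import Path


def place_base_checkpoint(pretrained_ckpt, exp_name, ckpts_root="ckpts"):
    """Copy a pretrained checkpoint to <ckpts_root>/<exp_name>/model_last.pt.

    An existing model_last.pt is left untouched so a run that is already in
    progress is not overwritten.
    """
    target = Path(ckpts_root) / exp_name / "model_last.pt"
    if target.exists():
        return target
    target.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(pretrained_ckpt, target)
    return target
```

Checking that the file exists before launching training is a cheap guard against silently starting from random weights.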

jpgallegoar · 2024-10-16T22:44:44Z

jpgallegoar
Oct 16, 2024
Collaborator

Just started my Spanish fine-tune on the Facebook LibriSpeech dataset. It's a single 4090, so it will take a while.

4 replies

MithrilMan Oct 17, 2024

can you report back the time it takes?
I'm interested in an italian version and I've a 3090ti

jpgallegoar Oct 17, 2024
Collaborator

Unfortunately I did it wrong and have to start over. But I ran it for 1 epoch overnight, which took 6h 50m 53s with a batch size of 4,000 and 93,000 batches, reaching a loss of around 0.75. I will start over and report back when I have something to show for it.

jpgallegoar Oct 18, 2024
Collaborator

Small update on the spanish finetune.

original.wav (in training data):
https://vocaroo.com/19oXO8sJm0WH

finetuned.wav (same input voice):
https://voca.ro/1aKKzX7pBhf3

Still much to go

anarucu Nov 19, 2024

Hello @jpgallegoar , thank you for the model in Spanish. I tried cloning a voice with a Cuban accent, and it’s not bad at all, even though it’s an accent you didn’t use during fine-tuning. I wonder if it’s possible to do fine-tuning starting from the model you trained... I only have 12 hours of Spanish with a Cuban accent.

bensonbs · 2024-10-17T03:38:55Z

bensonbs
Oct 17, 2024

I am using a Chinese dataset (about 33hr) to fine-tune my model. The loss is continuously decreasing, and the generated voice tone is getting closer to the target. However, as the training steps increase, the pronunciation of words is becoming increasingly unclear.
It's like the following audio file:

model_12620.pt
https://voca.ro/11Ny6egSZ7zf
model_126200.pt
https://voca.ro/1cXdiNNM0zRt

Parameters:

exp_name = "F5TTS_Base"
learning_rate = 7.5e-5
batch_size_per_gpu = 38400/8
batch_size_type = "frame"
max_samples = 64
grad_accumulation_steps = 1
max_grad_norm = 1.

14 replies

charleypeng Oct 20, 2024

I used a 200-hour Chinese dataset. Successfully fine-tuned the model with params:

learning_rate = 1e-5
batch_size_per_gpu = 38400/8
batch_size_type = "frame"
max_samples = 64
grad_accumulation_steps = 1
max_grad_norm = 1.

epochs = 11
num_warmup_updates = 500

Hi, can you share the dataset?

bensonbs Oct 21, 2024

@jpgallegoar There are different accents in Chinese, and fine-tuning can also make the model's voice closer to the dataset.

bensonbs Oct 21, 2024

@charleypeng

I'm sorry, but my training dataset is private and cannot be provided.

jpgallegoar Oct 21, 2024
Collaborator

Ah yes, thank you for the explanation

yc930401 Dec 2, 2024

Hello friend, I also want to fine-tune a Chinese model, because the voice generated by the current model is sometimes not very similar to the target voice. I want to train a generic model that, given a 5-10 s reference clip from any person, can mimic that person's voice. May I ask whether your fine-tuned model gives better results at mimicking the reference voice? And how many speakers did you use, and how much audio for each? Thanks a lot!

lpscr · 2024-10-17T12:09:02Z

lpscr
Oct 17, 2024

Hi, I just created a Gradio interface to make this easy, user-friendly, and accessible for beginners. You can see it here:

#143

Features

Transcription Tab: Easily transcribe audio files to create a dataset.
Dataset Preparation Tab: Prepare your dataset for training.
Training Tab:
    Select fine-tuning options.
    Automatically calculate settings, with the option to manually adjust them.
Reduction Tab: Convert your model from 5GB to 1.3GB.
Check Vocab: Check if it is possible to fine-tune in another language

0 replies

acul3 · 2024-10-17T18:02:11Z

acul3
Oct 17, 2024

I can also confirm that training works.
After almost 3 days I finally got it to work.

I trained on 3 languages: Indonesian, Italian, and English.

eng: https://vocaroo.com/1mGEFlRNgouY
(you are likely overfitting. Impossible to know without evaluating. Bigger dataset of same quality is always better.)

italian: https://voca.ro/1l6SYplhnSxz

(Quattro imperdibili appuntamenti con l’Orchestra da Camera di Caserta e solisti internazionali.)

indonesia:
https://vocaroo.com/11e5OQQucQDY
(Joan Laporta mengumumkan Barcelona kini akhirnya meraih laba positif, sebesar dua belas juta euro.)

it even can do code switching (eng-indonesia):
https://vocaroo.com/1iZkXBo6vII5
(Sebenarnya sih gak juga, There's always something there which is a little bit different.)

using same config as train

19 replies

paulovasconcellos-hotmart Oct 28, 2024

Just to confirm: you used a dataset composed of English, Italian, and Indonesian audio to fine-tune this model, right? Can you share how many hours you used for each language, and how long the clips are in seconds?

luterz Nov 4, 2024

Very good. When will Italian be integrated into the main F5-TTS?

MithrilMan Nov 4, 2024

@leoiania finetuning after 3 days,

@leoiania i cannot release the weight because i use company data and hardware, but now i training from scratch using data and rent some gpu

will share the weight once complete

@acul3
did you have any news on the training?

sw-els Nov 11, 2024

@acul3 can you tell me how many epochs you used, especially for indo voice?

SyamsQ Dec 15, 2024

May I ask for the fine-tuned model? @acul3

lpscr · 2024-10-17T23:35:52Z

lpscr
Oct 17, 2024

Hi, I was just wondering why you don't try training on a small dataset first instead of starting with a large one. I trained on only 40 hours of Greek plus 20 hours of English (LibriTTS-R), and it works fine and speaks very well. It took about half a day on a 4090, and after about 100k to 150k steps the model can speak both Greek and English very well and has great zero-shot performance.

Try it and see if this works for you. I hope this helps.

3 replies

jpgallegoar Oct 17, 2024
Collaborator

Can you please elaborate? Why did you train in English again?

lpscr Oct 17, 2024

Because I want it to speak both English and Greek; for example, I give it an English reference and it speaks Greek. As far as I can see, this works fine with this method. If I train only on Greek, it does not speak English well.

ppc2017 Oct 19, 2024

Can you share the greek model?

lpscr · 2024-10-17T23:55:29Z

lpscr
Oct 17, 2024

Here are the settings I use:

learning_rate = 1e-5

batch_size_per_gpu = 1618# 8 GPUs, 8 * 38400 = 307200
batch_size_type = "frame"  # "frame" or "sample"
max_samples = 64  # max sequences per batch if use frame-wise batch_size. we set 32 for small models, 64 for base models
grad_accumulation_steps = 1  # note: updates = steps / grad_accumulation_steps
max_grad_norm = 1.

epochs = 11  # use linear decay, thus epochs control the slope
num_warmup_updates = 500 # warmup steps
save_per_updates = 10000 # save checkpoint per steps
last_per_steps = 20000  # save last checkpoint per steps
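
For a feel of what a frame-based batch size means in audio time, frames can be converted to seconds. The sketch below assumes mel frames with a hop length of 256 samples at a 24 kHz sample rate; these values are assumptions for illustration, so check them against your model config. frames_to_seconds is a hypothetical helper.

```python
def frames_to_seconds(num_frames, hop_length=256, sample_rate=24000):
    """Audio duration covered by `num_frames` mel frames (assumed hop and rate)."""
    return num_frames * hop_length / sample_rate
```

Under those assumptions, batch_size_per_gpu = 1618 covers roughly 17 seconds of audio per batch, while 38400 frames cover about 410 seconds.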

3 replies

lpscr Oct 18, 2024

Here is a sample in Greek and English. It first says in
Greek: Καλώς ήρθατε όλοι στην μεγαλύτερη πόλη ψυχαγωγίας του κόσμου!
and then says the same in English: welcome one and all to the world's greatest entertainment city!
Thank you @SWivid for supporting Greek symbols!

https://voca.ro/1kn7SqhCQjis

You can see my method works great ;)

BTW: I'm working on fine-tuning and plan to support easy fine-tuning for other languages. I have some ideas and tips to add; see #143, which is already merged into the main repo.

justinjohn0306 Oct 18, 2024

@lpscr How did you load the model for inference? Since you changed the language from Chinese to Greek, the vocab.txt must have changed, so with the default config it would error out, right?

lpscr Oct 18, 2024

I have created my own API script class to make it easy to load the model and more.

But if you want something simple right now, just change this in inference-cli.py:

def load_model(repo_name, exp_name, model_cls, model_cfg, ckpt_step):
    ckpt_path = f"ckpts/{exp_name}/model_{ckpt_step}.pt" # .pt | .safetensors

Set ckpt_path to the .pt file of the model you fine-tuned; otherwise it takes the pretrained one!

lpscr · 2024-10-18T08:27:46Z

lpscr
Oct 18, 2024

Hi all, this is very important and might be confusing for some. You need to copy the original model
F5TTS_Base/model_1200000.pt into the folder where you are training for fine-tuning.

If you start training without copying this model, it will train from scratch!

I’ve created a script called finetune-cli.py that can automate this process. However, before running the script, you need to update all the settings accordingly.

Please make sure to do this before you start.

Or you can simply run:

https://github.com/SWivid/F5-TTS/blob/182b0f08e4cde7280996c4018575b4a80425754b/finetune-cli.py#L41C1-L59C86

To run it simply, change only the dataset name my_speak. On a 3090 with a dataset of about 60-80 hours this works well:
accelerate launch finetune-cli.py --exp_name F5TTS_Base --learning_rate 0.00001 --batch_size_per_gpu 1618 --batch_size_type frame --max_samples 64 --grad_accumulation_steps 1 --max_grad_norm 1 --epochs 10 --num_warmup_updates 500 --save_per_updates 10000 --last_per_steps 20000 --dataset_name my_speak --finetune True

For a 4090, as @JarodMica says, this works very well, also with a very big dataset:
batch_size_per_gpu = 4000
grad_accumulation_steps = 78

About the vocab: I don't replace anything, because it already supports all the symbols in the language I train.

Make sure it supports all the symbols in the language you want to train; if symbols are missing, it will not work correctly.
I think you can replace unused symbols with the missing ones. Is this correct, @SWivid? Which symbols are unused and safe to replace?

Another idea, in case symbols are missing: you can simply convert all symbols to English (Latin) characters.

Here is how to check the vocab in finetune_gradio.py:

Make sure that data/project_name/ contains a metadata.csv with all the text.
You also need to enter the same project name in Gradio.
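
Outside the Gradio UI, a similar check could be sketched as below. This is not the code from finetune_gradio.py: load_vocab and missing_symbols are hypothetical helpers, and the sketch assumes vocab.txt holds one token per line and that your text is tokenized character by character.

```python
def load_vocab(vocab_path):
    """Read vocab.txt into a set, one token per line (a lone space is kept as a token)."""
    with open(vocab_path, encoding="utf-8") as f:
        return {line.rstrip("\n") for line in f}


def missing_symbols(texts, vocab):
    """Return the characters used in `texts` that have no entry in `vocab`."""
    used = set()
    for text in texts:
        used.update(text)
    return sorted(used - vocab)
```

An empty result suggests the pretrained vocab already covers your transcripts; anything it returns is a symbol the model has no token for.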

That's why I made finetune_gradio.py, so that beginners don't get confused.
Also, as I said, I plan to make all of this automatic soon.

I hope this helps.

23 replies

JarodMica Oct 18, 2024
Collaborator

@jpgallegoar Ah! My bad, yes, I'm pretty much topped out on RAM with 32 GB, but Chrome takes up 2 GB 💀 and VS Code another 2 GB in this screenshot.

However, I just stopped training to check and I'm idling at 20 GB used... I don't know what's taking it up exactly, but possibly some data didn't get cleared.

Anyway, you can try to reduce the num_workers being used; the default is 16.

Looking at lpscr's code, it's not set here, so you can pass it:

trainer.train(train_dataset,
                  resumable_with_seed=666,  # seed for shuffling dataset
                  num_workers=1
                  )

This should help lower RAM usage, I think, as it reduces how much data is prepared beforehand.

One more note: you don't want too few workers, or else your GPU won't saturate and training will be a little slower, so it's a balancing game.

HuuHuy227 Oct 21, 2024

In my case, I encountered a problem with missing symbols. As you suggested, maybe changing out unused symbols could help, so does this mean I should add these missing symbols to the unused symbols (and what are they in the vocab.txt file)? I have my own tokenizer for my language, and if I use it, does that mean I need to train from scratch rather than fine-tune?

jonnytracker Oct 23, 2024

youtube tutorial

lpscr Oct 27, 2024

@jonnytracker Check the Gradio fine-tuning UI; I also have a video tutorial: #143

atlonxp Nov 12, 2024

@JarodMica I have the same out-of-RAM issue. Have you found the cause, and is there any solution for fixing it?

My case is on a supercomputer with 500 GB of RAM; it still runs out of memory every time.

lpscr · 2024-10-18T14:13:53Z

lpscr
Oct 18, 2024

@jpgallegoar I’m trying to train in Spanish as an experiment. Let's see; this will take some hours. I just hope the dataset I’m using is okay, since I don’t speak Spanish. I’ll let you know soon.

16 replies

lpscr Oct 18, 2024

I just left it at 60k; I don't want to burn my GPU any more... I also need to run other tests. I only tried training Spanish to test things. I see you say it is now working for you too, that's great.

jpgallegoar Oct 18, 2024
Collaborator

Can you send me your model? Perhaps I can keep training it

lpscr Oct 18, 2024

Lol, you have 1k of audio and I have only 20 hours. There is no point in comparing or training...

jpgallegoar Oct 18, 2024
Collaborator

yeah but end result is what matters no?

cristianosoy Oct 28, 2024

Hello, I am a native Spanish speaker, I would be honored to help you if you need it.

henriklied · 2024-10-18T18:57:15Z

henriklied
Oct 18, 2024

Given a large dataset, how important is it that the transcription is 1-1 with the source audio? The reason I ask, most of my datasets are built using a Whisper model, and they often do some text compression and correct misspoken words or stutter. Is this TTS-architecture forgiving for those kinds of variations or inconsistencies in transcription, or should I consider using a more verbose Whisper model for creating this dataset?

13 replies

jpgallegoar Oct 18, 2024
Collaborator

83C 388W

lpscr Oct 18, 2024

Same for me, 83C here. @jpgallegoar, do you have a 4090?
I just wonder if it is safe to train like this for days... For example, in the image @JarodMica posted I see 78C; is this the max you get on a 4090?

BTW:
There is an app called MSI Afterburner that lets you control the temperature. It makes training slower because it changes the power of your GPU, but keeping the temperature around 75C is healthier for the card. I haven't tested it yet; has anyone tested this?

jpgallegoar Oct 18, 2024
Collaborator

I read up to 90 is safe, so 83 should be fine for a few days. I will not do this forever but keep in mind some people mine crypto 24/7 for YEARS and it doesn't break. Same thing

lpscr Oct 18, 2024

Maybe I'll put something like this in Gradio, so you can see your GPU's memory and temperature. Right now the GPU is idle, that's why you see 46C.

jpgallegoar Oct 18, 2024
Collaborator

oh yes I didnt answer I have 4090

@lpscr That looks cool, PR!

justinjohn0306 · 2024-10-18T22:11:19Z

justinjohn0306
Oct 18, 2024

Has anyone here tried finetuning the base model on a single speaker dataset? I tried finetuning with a 6 hr English dataset, but I don’t hear any difference after the training.

12 replies

jpgallegoar Oct 18, 2024
Collaborator

Change the line I told you to, you're not using the correct model yet

Oh, I thought you meant 6 hours of different speakers. If you train on 6 hours of a single person, that's only useful for generating audio of that specific person, and make it sound closer to them.

Yes but the issue is, it doesn't make any difference after the training which I find really odd.

justinjohn0306 Oct 18, 2024

Actually it did use my finetuned model and yeah...I don't hear much difference:

Here's the audio generated using the base F5-TTS model: https://voca.ro/1hfgaoFTKcIi

Here's the audio generated using my finetuned model: https://voca.ro/1gT2MaNzZXW0

The reference audio: https://voca.ro/15snOJ8WmHF5

jpgallegoar Oct 18, 2024
Collaborator

The audio should be closer to 15 seconds, at least 10-12. The first mhm part does not get transcribed so for the model, it's a long silence. You're giving a 6 second audio which is 20% silence that's why it's so unnatural. You should use longer audio and splice together the sentences so there's not much silences in between.

Either way, the finetuned audio does sound close to the original voice to me, even if the silences didn't disappear (that can be fixed with a better input audio)
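
To check a reference clip against the advice above before using it, one can estimate how much of it is near-silent. This is a minimal sketch, assuming the audio is already decoded to float samples in [-1, 1]; `silence_fraction`, the 20 ms window and the 0.01 RMS threshold are illustrative choices, not anything taken from F5-TTS.

```python
import math

def silence_fraction(samples, sample_rate, window_s=0.02, rms_threshold=0.01):
    """Return the fraction of fixed-size windows whose RMS is below the threshold."""
    window = max(1, int(sample_rate * window_s))
    total = 0
    silent = 0
    for start in range(0, len(samples), window):
        chunk = samples[start:start + window]
        # Root-mean-square amplitude of this window.
        rms = math.sqrt(sum(x * x for x in chunk) / len(chunk))
        total += 1
        if rms < rms_threshold:
            silent += 1
    return silent / total if total else 0.0
```

A clip that comes back with a value around 0.2 would match the "20% silence" case described above, so it may be worth trimming or splicing before use.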

GUUser91 Oct 19, 2024

@jpgallegoar
Thanks for the tip. I've been trying to finetune a cartoon character with a Scottish accent.

The prompt is:
Baby coughing on a bus right as a needed tae cough so a nearly exploded hawdin it in cos a didny wanty look like the guy who copies babies.

Output files are from the finetune model

Old reference input audio(6 Seconds)
https://vocaroo.com/1iEAymADCOma
Output:
https://vocaroo.com/1jz9TcKrXHUH

Multiple Reference Input Audio Files Merged (13 Seconds)
https://vocaroo.com/16wuhY0yGHxg
Output
https://vocaroo.com/19yKnWO2byNu

S-T-K Feb 7, 2025

@justinjohn0306

I'm facing the same issue at the moment. Finetuning makes no discernable difference. Did you eventually find a way to make it work? I mean, the quality is fine as is, but I was hoping finetuning would make it even better.

jpgallegoar · 2024-10-18T22:24:49Z

jpgallegoar
Oct 18, 2024
Collaborator

After much testing, I'm gonna have to give up on the Spanish finetune for now.
280k samples in and, although I have gotten a decent result which says 85% of the words correctly, it's unusable, since you need much more for an acceptable result. I'm attributing the failure in part to the model's poor transcription quality (I expect more from Facebook) and in part to my own lack of skill in this regard. I am eager to give it another try if a better method is found, after careful revision of the dataset. Another mistake is that now the model has lost the capability to speak English (and I assume Chinese too).

Anyway, if anyone wants it, here is the model: Link

4 replies

zephirusgit Oct 30, 2024

Thanks for sharing, I'll see how it works. I was just looking into whether it would be possible to do a finetune in Spanish. I was wondering whether audio generated by Bark, for example, could be used for that, since one can "create" it. In any case, I still don't really understand the finetuning process, so I'd be fumbling in the dark, and I don't know whether my 12 GB RTX 2060 is enough for it. If you make new progress, I'd be happy to test it, and if there is anything I can do to help make one, that too. Regards.

zephirusgit Oct 30, 2024

jpgallegoar, a question from absolute ignorance: how do you use the model_last.pt file? I see that in Pinokio the models are sent to the hub cache with an unreadable name.

jpgallegoar Oct 30, 2024
Collaborator

For now you have to hardcode it directly in load_model() in utils_infer.py. If we have several decent finetunes in the future, we can integrate them into the application.

jpgallegoar Oct 30, 2024
Collaborator

For now I think the biggest problem is the data. It would be possible to train the model with audio generated by Bark, but keep in mind that the variability of that data would not be very high, so the finetune would not be flexible.

Finetuning needs more VRAM, but you can contribute by collecting high-quality datasets, for example.

ABDe3N · 2024-12-22T16:56:07Z

ABDe3N
Dec 22, 2024

I ran 50k steps on 10 hours of professional Arabic audio recordings segmented into 2 to 17 second chunks, with very good transcriptions, but the results are no good. It feels like the Chinese is overpowering it. Any suggestions? Could it be the vocab.txt file?

4 replies

Alykasym Dec 22, 2024

Try setting Epochs very high, something like 1000000. The learning rate curve depends on the epoch count, so a low number like 100 makes the learning rate change very quickly, before the model can learn the new language properly.
There is a very small chance that the 8-bit Adam optimizer causes slight data loss, which might be critical, especially for small datasets with very sensitive data, like Arabic in your case, which is harder than Chinese.

So, try training with very high epoch count and turning off 8-bit Adam optimizer.
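
The effect of the epoch count on the learning rate can be sketched as follows. This assumes the schedule is a linear warmup followed by a linear decay that reaches roughly zero at the last planned update; that shape is an assumption for illustration, not a copy of the trainer code, and `lr_at` and its parameters are made-up names.

```python
def lr_at(step, peak_lr, warmup_steps, total_steps):
    """Learning rate at a given update for linear warmup + linear decay (assumed shape)."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    decay_steps = max(1, total_steps - warmup_steps)
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / decay_steps

# Same update (step 5,000), two different planned run lengths.
short_run = lr_at(5_000, peak_lr=1e-5, warmup_steps=1_000, total_steps=10_000)
long_run = lr_at(5_000, peak_lr=1e-5, warmup_steps=1_000, total_steps=10_000_000)
print(short_run, long_run)
```

With a short planned run the rate has already dropped to roughly half of its peak by step 5,000, while with a very long planned run it is still almost at the peak, which is the behaviour the advice above relies on.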

ABDe3N Dec 22, 2024

OK, I'll try it and come back with feedback.

ABDe3N Dec 24, 2024

After 200k steps the results are no good. It feels like the 90k steps version was kind of better, but still no good. Any ideas?
Thanks

Alykasym Dec 25, 2024

Hi, sorry for late reply. Maybe you can look at this one Issue 464. This thread might be helpful.

jpgallegoar · 2024-12-23T11:37:27Z

jpgallegoar
Dec 23, 2024
Collaborator

Hello, has anyone experimented with creating a finetune on another language while adding english audios in the dataset to prevent catastrophic forgetting?

What was the ratio of new language and english audios?
Does it work well? Is the new model able to switch from the new language to english and back in a smart and accurate manner?
How many hours of audio of new language and english?

Thanks!

2 replies

MilanaShhanukova Dec 25, 2024

Hi, have you also checked the ability of the model to generate the speech in a language different from the source? For instance, your source is german, while the target text you generate is english.

jpgallegoar Dec 26, 2024
Collaborator

It just reads the text in my language (spanish), as it would be read if it were spanish text

Baytro · 2024-12-24T02:39:20Z

Baytro
Dec 24, 2024

Hello everyone. I am trying to finetune a model that was previously already finetuned on German. The model which I want to finetune sounds good, but as soon as I try to finetune it on a small single speaker dataset (around 10 minutes) the output is basically just noise. It almost sounds like it is training from scratch and not finetuning the model. Has anybody had similar issues, or am I doing something wrong? Or is a dataset of 10 minutes too small? For others this was plenty.

3 replies

jpgallegoar Dec 24, 2024
Collaborator

Hello, this happened to me before too. First of all make extra sure you're finetuning the model you want, you can change the code to enforce this. Second of all, try reducing the learning rate and testing, and don't train for too long. To be honest I haven't been successful when trying 10m of audio but around 1 hour was better.

Baytro Dec 25, 2024

So you think it's because of the length of only 10 minutes? I just thought it needs very little since I stay in the same language. I'll try to use a bigger dataset and try to make sure it's fine-tuning the right model. I only tried it with Gradio, where I ticked the fine-tuning box, although I don't know exactly what it did in code. I'll probably try that tomorrow or the day after.

ZeaMays14142 Dec 30, 2024

10min seems like too little data

ZeaMays14142 · 2024-12-30T17:53:48Z

ZeaMays14142
Dec 30, 2024

Hello. I am trying to do single speaker finetuning, English.
I have 6.5 hrs of audio in clips from 5 to 20seconds.

Is training for only 100 epochs too little?
Or might there be something wrong with my config params?

Thanks

6 replies

ZeaMays14142 Jan 4, 2025

ok thank you, will try

ZeaMays14142 Jan 4, 2025

100 epochs seems too much because:

The base model already supports English language, so it doesn't need to learn the language but the voice, which doesn't need much training.

6.5 hrs is a small dataset and training it for more than 20-30 epochs will easily overfit the model.

For starters, maybe you can just use Auto Settings from the WebUI, only changing the batch size to match your VRAM, and set the epoch count to something like 1000000, but manually stop around 15-20 epochs during training.

would it be better to train for fewer epochs on a bigger dataset? what size of dataset and number of epochs do you suggest?
I could maybe reach 10 hrs, but more would be increasingly hard

Alykasym Jan 4, 2025

Yeah, if it is a single speaker big dataset, then it is better to train for fewer epochs.
I think even just a 1 hour dataset would work for your case. Because, you are training for single speaker, and english language. The base model already capable of doing it, so it doesn't need much data to adapt to it.
The only case you would need ~10 hrs single speaker dataset for english model is that if you want it to learn very specific English accent. For example some english dialects from UK, or non-native accents like Mexicans or Russians speaking English with the accent and etc.

Alykasym Jan 4, 2025

Btw, when I said "6.5 hrs is a small dataset and training it for more than 20-30 epochs will easily overfit the model.", I didn't mean that it is small for your goal, I meant it is small to train for 100 epochs.

You can easily use that 6.5 hrs single speaker english dataset, and make the model speak with that speaker's voice by training for just 20 epochs.
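
To turn epoch counts like these into a number of updates, a rough estimate can be computed from the dataset size. This is a sketch under stated assumptions: the batch size is counted in mel frames per GPU, and audio is 24 kHz with a hop length of 256 (about 93.75 frames per second). Padding, bucketing and dropped samples are ignored, and `updates_per_epoch` is an illustrative name.

```python
def updates_per_epoch(dataset_hours, batch_frames_per_gpu, num_gpus=1,
                      grad_accum=1, frames_per_second=24_000 / 256):
    """Rough number of optimizer updates needed to see the dataset once."""
    total_frames = dataset_hours * 3600 * frames_per_second
    frames_per_update = batch_frames_per_gpu * num_gpus * grad_accum
    return total_frames / frames_per_update

# 6.5 hours of audio with an assumed batch of 3200 frames on one GPU.
per_epoch = updates_per_epoch(6.5, batch_frames_per_gpu=3200)
print(round(per_epoch), round(20 * per_epoch), round(100 * per_epoch))
```

Under these assumptions, the difference between stopping at 20 epochs and running all 100 is a factor of five in updates, which is where the overfitting concern comes from.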

ZeaMays14142 Jan 6, 2025

understood, will try, thank you for the clarification

RobertAgee · 2025-01-04T18:47:59Z

RobertAgee
Jan 4, 2025

Alternatively, if overtraining is a big problem, lower the learning rate slightly so the model doesn't converge as quickly.

1 reply

ZeaMays14142 Jan 4, 2025

ok thanks

ABDe3N · 2025-01-06T22:40:30Z

ABDe3N
Jan 6, 2025

Has anyone had any luck training Arabic?
I could never get it to speak natural Arabic.
I trained for 200k steps on a high-quality 10-hour single-speaker dataset,
but the results are all over the place.
The checkpoint at 105k is better than 150k but worse than 180k.

any ideas? i have more audio for the same speaker. more than 100 hours. if that would help

2 replies

Alykasym Jan 7, 2025

Maybe the issue is with the vocab? Did you check whether the vocab contains all the characters?
Also, how does the tokenized version of the text look? Are there anomalies in the Arabic text and characters?
Did you use the "char" tokenizer or "pinyin"? For Arabic, the char tokenizer will be suitable.

Also can you share your config parameters? Like batch size, learning rate, warmup updates and etc...
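
The vocab check suggested above can be sketched like this. It assumes vocab.txt holds one token per line and that the char tokenizer maps each character of the transcript to one token; `load_vocab` and `missing_chars` are hypothetical helper names, not functions from the repo.

```python
from collections import Counter

def load_vocab(path):
    """Read one token per line, keeping a line that is a single space."""
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]

def missing_chars(transcripts, vocab_tokens):
    """Count characters in the transcripts that have no entry in the vocab."""
    vocab = set(vocab_tokens)
    missing = Counter()
    for text in transcripts:
        for ch in text:
            if ch not in vocab:
                missing[ch] += 1
    return missing

# Example with an in-memory vocab instead of a file.
print(missing_chars(["abca", "xyz"], ["a", "b", "x", "y"]))
```

Any character that shows up in the result would have no token of its own, so it is worth inspecting the result before starting a long run.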

SyamsQ Jan 7, 2025

How to train the model? Do you have a video tutorial for this?

jpgallegoar · 2025-01-08T09:35:17Z

jpgallegoar
Jan 8, 2025
Collaborator

Have you guys found a good solution for splitting long audio files into shorter ones?

9 replies

jpgallegoar Jan 8, 2025
Collaborator

@isolveit-aps The one you shared is much better and works extremely fast. Thank you very much

ukemamaster Jan 10, 2025

@jpgallegoar try https://github.com/feldberlin/timething

sarpba Jan 21, 2025

I wrote a couple of scripts, the basis of which is whisperx, of course together with an aligning (wav2vec) model. So far, this solution has given me the most accurate results, albeit in Hungarian. After cutting, I scan the pieces again with whisperx, I no longer align them here. I compare the second reading with the pieces based on their text, and pronounce the pieces recognized as different languages. I just wrote the cutting script, you can change the distribution curve, the target length and the standard deviation. I'll make a small github repo if you're interested, because my previous database creation repo is out of date.

jpgallegoar Jan 21, 2025
Collaborator

@sarpba Thank you, that sounds interesting for everyone here!

isolveit-aps Jan 28, 2025

@ukemamaster Have you been able to make timething actually run? I've now tried it with various versions of python and dependencies, but I am unable to get it running. But the way it's described, it should do exactly what I'm after, so I would love to get it working.

ukemamaster · 2025-01-10T10:52:06Z

ukemamaster
Jan 10, 2025

Hi @jpgallegoar, Thanks for your Spanish model, it works great.
The only thing is the accent: I would like to have more of the European Spanish (es-ES) accent instead of the South American/Mexican Spanish (es-MX) one. To achieve this, I want to train the model with es-ES accent data. My data is 3 speakers, 370 hours.

Could you please share your experience?

Should I start fine-tuning from the base model, from your Spanish model, or from scratch?
Was the batch size you used 38400 (as in train.py) or 3200 (as you mentioned on Hugging Face)?
Should I use your training configuration, or should I take into account the single speaker issue discussed above?
Any other tips?

Thanks

4 replies

jpgallegoar Jan 10, 2025
Collaborator

Hello, I think you should use more speakers if you want a good generalized model; 3 speakers will probably overfit it if trained with 370 hours. I would start from my model and use fewer hours, but if you manage to get more speakers (>100h of >50 distinct speakers) you can start from scratch. If you decide to start from my model, I would use around 20h of those speakers if you want it to maintain some generalization, and lower the learning rate to 5e-6 or 7.5e-6. I am also in the process of training a peninsular Spanish model. Best of luck and keep me updated :D

ukemamaster Jan 10, 2025

@jpgallegoar Yeah sure, I'll keep you updated.
By the way, which data are you using to train your peninsular Spanish model? How many speakers do you have? Can you share your data?
Also, I would like to know whether I should follow your F5-Spanish repo for fine-tuning or the original one. I mean, did you make any changes to the training scripts?

ukemamaster Jan 10, 2025

@jpgallegoar And what do you mean by "generalization" here? You mean generalization in accent? To have both accents? Or generalization in cloning abilities in zero shot cloning?

jpgallegoar Jan 16, 2025
Collaborator

I mean voice cloning generalization. If your dataset only has 1 person, you can only clone that person. If it has more people, the model starts to learn how to create every voice.

emircanerkul · 2025-01-16T09:32:07Z

emircanerkul
Jan 16, 2025

I've found F5 for Turkish https://huggingface.co/marduk-ra/F5-TTS-Turkish but it has problems with some Turkish characters like ı, ü, ö, ç

Unfortunately, I only have a 6800 XT. What do you suggest? What could be the problem? I'm considering renting a GPU and training in the cloud.

3 replies

jpgallegoar Jan 16, 2025
Collaborator

You can train on Runpod, for example.

emircanerkul Jan 16, 2025

@jpgallegoar Since I want to use the output for commercial purposes, I cannot build on the previous training, which is based on https://huggingface.co/datasets/amphion/Emilia-Dataset

How much time is needed for this, for example? Also, I do not have any experience in this field, except that I'm a full-stack web dev and know Linux. There is data, there is a script, so I just need to run it (this is my dream, but the facts probably differ ^^)

jpgallegoar Jan 16, 2025
Collaborator

Around 100-150 hours for new language should be good enough, with many different speakers (try max 2h per speaker) so the model learns how to voice clone any voice
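
The per-speaker limit can be applied to a dataset manifest before training. This is a minimal sketch, assuming each row is a dict with a `duration` in seconds and a `speaker` label; the `speaker` field is hypothetical, and `cap_per_speaker` is an illustrative name.

```python
from collections import defaultdict

def cap_per_speaker(rows, max_hours_per_speaker=2.0):
    """Keep clips in order until each speaker reaches the duration cap."""
    budget = max_hours_per_speaker * 3600
    used = defaultdict(float)
    kept = []
    for row in rows:
        speaker = row["speaker"]
        if used[speaker] + row["duration"] <= budget:
            used[speaker] += row["duration"]
            kept.append(row)
    return kept

rows = [
    {"speaker": "a", "duration": 3600.0},
    {"speaker": "a", "duration": 3600.0},
    {"speaker": "a", "duration": 10.0},   # over the 2 h budget for "a"
    {"speaker": "b", "duration": 10.0},
]
print(len(cap_per_speaker(rows)))
```

Clips are taken in their existing order here; shuffling the rows first would give a less biased selection per speaker.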

jpgallegoar · 2025-01-16T10:25:19Z

jpgallegoar
Jan 16, 2025
Collaborator

Has anyone tested training on fp32 vs fp16 vs bf16? Is there a noticeable quality dropoff? Which is the best?

8 replies

jpgallegoar Jan 16, 2025
Collaborator

I have only trained with fp32 with 100h dataset.

I'm planning to increase my dataset and train on fp32. May I know how was the quality? For example:

1. Does it clone reference audio well?

2. Is the pronunciation accurate?

3. Does it skip words?

The truth is that 2 and 3 depend only on the quality of your transcriptions. If your transcriptions are 100% accurate and your reference audio, text and gen text are in the domain of your dataset (not training on normal voice and then using cartoon voice with fast speech or things like that), the generated audio will be perfect. On 100h, 1.2 million steps, it's really really good.

And number 1 depends on the variety of speech patterns of your dataset. If you have only 5 speakers, it won't be able to clone the voices very well, but if you have 100+ speakers with different voices or something like that, it will learn to generalize and clone any voice. (Again, if you train only on normal voices, don't expect the model to clone a very high pitched cartoon voice, for example)

Alykasym Jan 22, 2025

Hi! Just wanted to share some findings from my recent experiments.

I tested the impact of different batch sizes on the quality of speech synthesis. The dataset I used is a 35-hour, high-quality, multi-speaker dataset with accurate annotations. I fine-tuned two models with identical configurations, only changing the batch size: one with a batch size of 3000 and the other with 500. Both were trained for 50 epochs using a learning rate of 1e-5.

Surprisingly, the model trained with a batch size of 500 produced more accurate speech compared to the one with a batch size of 3000. The 500 batch size run left more than half of my VRAM unused during fine-tuning and took slightly longer, but the improved results made it worth the trade-off.

I’m not entirely sure why this happened, as everything else about the models was identical. I plan to dig into the code and experiment further when I have more time, but for now, I thought I’d share these results in case anyone finds them useful.

jpgallegoar Jan 22, 2025
Collaborator

I just think the higher batch size one needs longer to learn. I am absolutely certain the ceiling of higher batch size is higher than lower batch size (spent hundreds renting H100 / H200 and locally 4090). If you give it enough time, it will improve. For reference, my 50h dataset was trained for 1300 epochs at 12000 batch size. I'm pretty sure it's 99-100% accurate on a new language.

hcsolakoglu Jan 30, 2025

@Alykasym smaller batch size provides more optimization opportunities because the number of steps increases. This is partly why you may see better results compared to a larger batch size.

goranskular Feb 5, 2025

Surprisingly, the model trained with a batch size of 500 produced more accurate speech compared to the one with a batch size of 3000.... Both were trained for 50 epochs using a learning rate of 1e-5....

When increasing batch size, you should typically increase the learning rate. A common rule is linear scaling: multiply the learning rate by the ratio of the batch sizes. It compensates for the reduced gradient noise in larger batches. If you don't adjust it, training might slow down or become unstable. That could be why...
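
The linear scaling rule mentioned above is simple arithmetic. Here is a sketch applied to the two batch sizes from the experiment; it is a heuristic, nobody in this thread has verified it for F5-TTS, and `scaled_lr` is an illustrative name.

```python
def scaled_lr(base_lr, base_batch, new_batch):
    """Linear scaling heuristic: learning rate grows with the batch size ratio."""
    return base_lr * new_batch / base_batch

# The experiment used 1e-5 for both batch sizes; the rule would suggest
# a 6x higher rate for the run with batch size 3000.
print(scaled_lr(1e-5, base_batch=500, new_batch=3000))
```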

holycowdude · 2025-01-17T12:18:50Z

holycowdude
Jan 17, 2025

I'm finetuning models with F5-TTS via Pinokio but i'm struggling to identify how to use the models i've trained
Please can someone help?

Would a kind person possibly update the Gradio UI for Pinokio and add an ability to automatically find and be able to select any of the finetuned models that have been trained / created to make it easy please? 😊

1 reply

sarpba Jan 22, 2025

@holycowdude The easiest way is to use ComfyUI with this custom node https://github.com/niknah/ComfyUI-F5-TTS
or my batch script from here https://github.com/sarpba/F5-TTS_scripts

firstpixel · 2025-01-19T15:08:00Z

firstpixel
Jan 19, 2025

I'm training on 200 hrs of pt-br data, reaching 1M steps, using Google Colab, half with an A100 and half with a T4. It's still not perfect: inference mostly works, but there are some mispronunciations, and numbers just don't work.
It also seems to be worse if the sample is longer than 6 s. If the sample is longer than 10 s, it becomes a mess.
The numbers issue is easy: I can just use a Python script to convert numbers to words. But for the mispronunciations, I think it needs a finetune.

Is it possible to finetune it with a new dataset containing only numbers and the mispronounced words? Will it destroy the previous training?
Has anyone tried finetuning to fix issues like these?
Or should I keep training longer on the same dataset, just adding more samples for the corner cases?
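
The number-to-words preprocessing mentioned above could look roughly like this. It is only a sketch: `spell_digits` is a placeholder that reads a number digit by digit in Portuguese, and a real setup would pass in a proper converter for full numerals. Both function names are illustrative.

```python
import re

DIGITS_PT = ["zero", "um", "dois", "três", "quatro",
             "cinco", "seis", "sete", "oito", "nove"]

def spell_digits(number_text):
    """Placeholder converter: reads a number digit by digit."""
    return " ".join(DIGITS_PT[int(d)] for d in number_text)

def normalize_numbers(text, convert=spell_digits):
    """Replace every run of ASCII digits with its spoken form."""
    return re.sub(r"[0-9]+", lambda m: convert(m.group(0)), text)

print(normalize_numbers("Sala 42, andar 3"))  # Sala quatro dois, andar três
```

The same normalization would need to be applied both to the training transcripts and to the text given at inference, so the model only ever sees spelled-out numbers.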

12 replies

lumpidu Jan 19, 2025

No need to regenerate the data. You could just concatenate your audios if spoken by the same speaker. It doesn't matter, if the audio contains 3 spoken sentences or 1 spoken long sentence. Just make the silence padding consistent between sentences when concatenating. You could also aim for a normal distribution of samples between 3-30 secs.
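
Concatenating clips with consistent silence padding can be sketched as follows. This assumes the clips are already decoded to lists of samples at the same sample rate; the 0.3 s gap is an arbitrary illustrative value, and both function names are made up.

```python
def concat_with_silence(clips, sample_rate, gap_s=0.3):
    """Join sample lists, inserting the same amount of silence between clips."""
    gap = [0.0] * int(round(sample_rate * gap_s))
    out = []
    for i, clip in enumerate(clips):
        if i > 0:
            out.extend(gap)
        out.extend(clip)
    return out

def concat_texts(texts):
    """Join the matching transcripts in the same order."""
    return " ".join(t.strip() for t in texts)

audio = concat_with_silence([[1.0, 1.0], [2.0]], sample_rate=10, gap_s=0.3)
text = concat_texts(["First sentence. ", "Second sentence."])
print(len(audio), text)
```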

jpgallegoar Jan 19, 2025
Collaborator

Yes but if he has the long audios, and the transcriptions were for the long audios, it's better to resplit them to avoid unnatural timings and artifacts when rejoining them.

lumpidu Jan 20, 2025

The single splits should be "logical units", i.e. independent utterances that make sense standalone. But it's okay to have 3 such utterances concatenated together, like:

"The weather seemed fine"
"Current stocks went down by 15%".
"We just put the blame on that region where bad stuff happens"

firstpixel Jan 30, 2025

Another question, if I want to make it pt-br + en, can I use the same dataset, with both languages? will it be able to speak both? or for multi language is different? pt-br I used the same vocab.txt from original, I want to add english to it as many words in portuguese are english words, specially on tech industry.

lumpidu Feb 7, 2025

Yes, but add a good amount (e.g. LJSpeech). It will have a Portuguese accent, though.

sarpba · 2025-01-21T19:41:07Z

sarpba
Jan 21, 2025

Hello, I would like to finetune the Hungarian model again from the original. I collected about 2600 hours of ultra clear audio from about 50 speakers. Unfortunately, the average audio length is around 5 seconds for me as well. Is there an ideal curve for the distribution of clip lengths? The amount of data is abundant, so I can select a data set corresponding to an ideal distribution curve.

edit: I think i need to reroll my dataset, it's worse than I thought.

19 replies

sarpba Jan 22, 2025

btw 1,200,000 steps with a 12000 batch size is equivalent to 4,500,000 steps with a 3200 batch size. Maybe this is why the quality improved?

Even 400k steps on 12000 was better than 1.2 million steps on 3200. I think it can learn better this way, regardless of further training.

That's probably true. The model generalizes better at larger batch values. The question is whether a higher batch size achieved with gradient accumulation is equivalent to a large batch_size achieved in 1 step with more VRAM.
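
The step equivalence quoted above can be checked with a small sketch. It compares runs only by the amount of data processed, which says nothing about resulting quality; `equivalent_steps` is an illustrative name.

```python
def equivalent_steps(steps, batch_size, other_batch_size):
    """Updates needed at another batch size to process the same amount of data."""
    return steps * batch_size / other_batch_size

print(equivalent_steps(1_200_000, 12_000, 3_200))  # 4500000.0
```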

jpgallegoar Jan 22, 2025
Collaborator

The question is whether a higher batch size achieved with gradient accumulation is equivalent to a large batch_size achieved in 1 step with more VRAM.

Unfortunately I don't think anyone has made that test yet.

sarpba Jan 22, 2025

@jpgallegoar @sch0ngut here are the scripts: https://github.com/sarpba/ADCS I don't have more time right now. They work, but are not very polished (GPT translated), and the last scripts are missing (trash_dropout, numbers_drouout, csv_maker). I'll continue Friday night. Under Windows you need to use WSL.

I ran a quick test. I processed 20 hours of audio in 35 minutes. it turned out to be about 15 hour usable data. (2x3090)

sch0ngut Jan 23, 2025

Awesome, thank you!

campar Feb 8, 2025

@sarpba What software/library did you use to create that kind of graph for distribution of audio durations?

kdcyberdude · 2025-01-24T19:32:03Z

kdcyberdude
Jan 24, 2025

Has anyone been able to train this on a multi-4090 GPU setup (2 or more)?

I am getting this - #728 (comment)

1 reply

jpgallegoar Feb 3, 2025
Collaborator

I was able to do it, but it was from Replicate

jpgallegoar · 2025-02-03T09:41:15Z

jpgallegoar
Feb 3, 2025
Collaborator

Has anyone tried parallelized training with multi GPUs? I mean getting parallel performance, not only more VRAM. Is it even possible?

10 replies

jpgallegoar Feb 3, 2025
Collaborator

I'm sorry, I'm not sure.
What I'm able to do: 4x4090 setup gives me 4x the VRAM but 1x the speed
What I want: 4x4090 setup gives me 4x the VRAM and 4x the speed

hcsolakoglu Feb 4, 2025

Since NVIDIA's consumer GPUs, such as the 4090 and 3090, do not support P2P communication, they may not provide significant speedup in multi-GPU training. Instead, you could try training with NVIDIA's data center GPUs. Rather than using 4x 4090, you might consider renting a single H100 or A100. @jpgallegoar

jpgallegoar Feb 4, 2025
Collaborator

@hcsolakoglu Thank you for your answer, that is in fact what I ended up doing. I even rented H200 because I calculated it was the most efficient.

I want to spend the same amount of money (or a little bit more) and train in less time via parallelization.

Do you think it would work with these server GPUs?

sarpba Feb 4, 2025

@hcsolakoglu @jpgallegoar
I don't understand why you are expecting a speedup.

Theoretically:
1xGPU, 3200 batch size -> 5 updates/s
4xGPU, 3200 batch size per GPU, overall batch size 12800 -> 5 updates/s, but with 4x the batch
If you want a speedup, then use
4xGPU, 800 batch size per GPU, overall batch size 3200 -> around 20 updates/s

So there is an increase in speed, just in a different way.
But I noticed that at low batch sizes it throws away most of the training data, leaving barely anything.

My training run with 3200 batch size per GPU on 2x3090 (NVLink connected), 6400 overall batch size -> I get 7-8 updates/s (power-limited to 280 W)
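
The three theoretical setups above can be compared by data processed per second rather than updates per second. This is a sketch using the numbers from the post, which are themselves theoretical, not measurements; `throughput` is an illustrative name.

```python
def throughput(batch_per_gpu, num_gpus, updates_per_second):
    """Return (overall batch size, batch units processed per second)."""
    overall = batch_per_gpu * num_gpus
    return overall, overall * updates_per_second

setups = {
    "1 GPU, 3200 per GPU": (3200, 1, 5),
    "4 GPUs, 3200 per GPU": (3200, 4, 5),
    "4 GPUs, 800 per GPU": (800, 4, 20),
}
for name, args in setups.items():
    print(name, throughput(*args))
```

Under these numbers both 4-GPU setups process four times as much data per second as the single GPU; they differ only in whether that shows up as a larger batch or as more updates.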

jpgallegoar Feb 4, 2025
Collaborator

Thanks for the answer, the speedup would be nice for commercial purposes. I did 9600 batch size and the results were very good indeed. Perhaps I am confused and it was speeding up, I will have to test again.
