A month ago, I prepared a dataset for training the QA-MDT music generation model. I added about 12 hours of audio, roughly 0.1% of the total material used for the original model, taken from multitracks of songs I am familiar with: from The Beatles and Dire Straits to Queen, Peter Gabriel and Cerrone. I found some files on the net, for research purposes only.

But while inference worked well, training didn't. The method the QA-MDT authors suggest, packing the files into an LMDB database, which I had tried earlier under GPT's supervision, did not work. Training started, the video card's RAM filled up, and the process froze. Oh.

I wrote to the author of the code about it, and he confirmed my doubts. He said that if it didn't work, I should throw out his code and put in another: take the code from AudioLDM, he wrote, only the loader, and read the files without the database, so it will be easier.

I tried AudioLDM. Training worked well there. But the model architecture there is different, and all I really needed was a file loader that reads new files directly from the file system, not from a database.

The first solution that came to mind was to upload files from both projects to GPT and ask 4o to transfer some logic from one to the other. They were very similar. I assumed that the 4o model would do a better job because you can upload files to it and it can go online.

I was so naive: GPT-4o suggested changing small individual parts of the code, constantly forgetting the rest of the context. The code didn't work, although I did make some progress: old bugs were replaced by new ones. GPT started going in circles, and Google knew nothing of the sort. All of this was costing me 40 cents per hour of server time. Not much, but I didn't like it anyway.

50 dollars later I was starting to lose hope. After thinking for a couple more days, I decided to ask the newer o1 model to simply rewrite the old code to read from the file system, removing every mention of LMDB.
This, with some caveats, worked better than directly borrowing code from another file: training ran, but now it complained about the file size. More precisely, the length of the files in the samples did not match the expected length. GPT confidently advised rewriting everything again (it loves that!), but I didn't believe it anymore.

And then it turned out that the problem, as usually happens, was not in the code but in the user: I had simply forgotten to trim the frequencies and bring each file to exactly 10.24 seconds. So I did that, of course.

Modification of dataloader: files + JSON instead of LMDB

So, what to modify?

- Replaced LMDB with loading from JSON in hhhh.py.
- Removed all references to lmdb_path, mos_path, and filter_all.lst.
- Read JSON directly, where each record has fields like {"wav": "…", "caption": "…", "mos": 4.5}.
- Made the text field in the dataset a regular string, not a list/tuple.
  - It used to be data = , now data = caption. This removed the error TextEncodeInput must be Union.
- Fixed mos so that it is never -1 (which broke mos_embed(…) in PixArt.py).
  - For the test, just set mos = 4.5 so as not to trigger an error in Embedding (or assign an integer such as 3 if Embedding expects an index).
- In the latent_diffusion.py file:
  - Reduced num_workers from 8 to 2 (in DataLoader) to reduce potential memory issues.
  - You could also set num_sanity_val_steps=0 to disable the sanity check and go straight to training.
  - You can reduce batch_size in the config (or directly in code) to 1 if GPU memory is low.

Files & scripts

Preparing a dataset. Here are various scripts for my needs.

Python script for auto audio slicing, trimming & normalization
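
The script itself is not included in this excerpt. As a rough illustration only, here is a minimal sketch of what slicing to exactly 10.24 seconds and normalizing could look like, using just the Python standard library. It assumes mono 16-bit PCM WAV input and does not resample or filter frequencies. All function names, the 0.9 peak level and the output file naming are hypothetical choices made for this sketch, not taken from the original script.

```python
import struct
import wave
from pathlib import Path

SEGMENT_SECONDS = 10.24  # clip length mentioned in the text


def segment_frames(sample_rate, seconds=SEGMENT_SECONDS):
    """Frames per clip; round() guards against float error in rate * seconds."""
    return round(sample_rate * seconds)


def split_fixed(samples, n, keep_tail=False):
    """Cut samples into chunks of exactly n items.

    The short final chunk is dropped, or zero-padded when keep_tail is True.
    """
    chunks = [list(samples[i:i + n]) for i in range(0, len(samples), n)]
    if chunks and len(chunks[-1]) < n:
        if keep_tail:
            chunks[-1] += [0] * (n - len(chunks[-1]))
        else:
            chunks.pop()
    return chunks


def peak_normalize(samples, peak=0.9, full_scale=32767):
    """Scale 16-bit samples so the loudest one sits at peak * full_scale."""
    loudest = max((abs(s) for s in samples), default=0)
    if loudest == 0:
        return list(samples)  # silence: nothing to scale
    gain = peak * full_scale / loudest
    return [round(s * gain) for s in samples]


def slice_wav(src, out_dir, keep_tail=False):
    """Split a mono 16-bit WAV into normalized fixed-length clips on disk."""
    with wave.open(str(src), "rb") as reader:
        if reader.getnchannels() != 1 or reader.getsampwidth() != 2:
            raise ValueError("this sketch only handles mono 16-bit PCM")
        rate = reader.getframerate()
        raw = reader.readframes(reader.getnframes())
    samples = struct.unpack(f"<{len(raw) // 2}h", raw)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for index, chunk in enumerate(split_fixed(samples, segment_frames(rate), keep_tail)):
        clip = peak_normalize(chunk)  # normalized per clip, not per song
        target = out_dir / f"{Path(src).stem}_{index:04d}.wav"
        with wave.open(str(target), "wb") as writer:
            writer.setnchannels(1)
            writer.setsampwidth(2)
            writer.setframerate(rate)
            writer.writeframes(struct.pack(f"<{len(clip)}h", *clip))
        written.append(target)
    return written
```

Calling `slice_wav("song.wav", "clips")` would write `clips/song_0000.wav`, `clips/song_0001.wav` and so on, each exactly `segment_frames(rate)` frames long. Normalizing each clip separately changes the relative loudness between sections of a song; normalizing the whole file before slicing is an equally reasonable choice.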

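Going back to the dataloader changes listed earlier (files + JSON instead of LMDB), the core idea can be sketched as a small map-style dataset that reads the JSON once and hands out one record per index. This is a hedged sketch, not the code from hhhh.py: the class name, the `audio_root` parameter and the returned keys `wav_path` and `text` are assumptions made for illustration, and audio decoding is left out. Only the record fields `wav`, `caption` and `mos`, and the 4.5 placeholder, come from the text.

```python
import json
from pathlib import Path

DEFAULT_MOS = 4.5  # placeholder from the text, used when mos is missing or -1


class JsonAudioCaptionDataset:
    """Map-style dataset (len + indexing) over a JSON list of records.

    Each record is expected to look like
    {"wav": "...", "caption": "...", "mos": 4.5}.
    """

    def __init__(self, json_path, audio_root=None):
        with open(json_path, encoding="utf-8") as handle:
            self.records = json.load(handle)
        # Hypothetical convention: relative wav paths resolve next to the JSON.
        self.audio_root = Path(audio_root) if audio_root else Path(json_path).parent

    def __len__(self):
        return len(self.records)

    def __getitem__(self, index):
        record = self.records[index]

        # The text must be a plain string, not a list/tuple wrapping one.
        caption = record.get("caption", "")
        if isinstance(caption, (list, tuple)):
            caption = caption[0] if caption else ""

        # A negative or missing score is replaced so an embedding lookup
        # downstream never sees -1.
        mos = record.get("mos")
        if mos is None or mos < 0:
            mos = DEFAULT_MOS

        return {
            "wav_path": str(self.audio_root / record["wav"]),
            "text": str(caption),
            "mos": float(mos),
        }
```

Because the class only implements `__len__` and `__getitem__`, it can be passed to a PyTorch `DataLoader` as a map-style dataset, where settings such as `num_workers=2` and `batch_size=1` from the list above would apply.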







