
A month ago, I prepared a dataset for QA_MDT music generation training. I added about 12 hours of audio, roughly 0.1% of the total material in the original model, using multitracks of songs I am familiar with: from The Beatles and Dire Straits to Queen, Peter Gabriel and Cerrone. I found some files on the net, for research purposes only.

But while inference worked well, training didn't. The method suggested by the QA_MDT authors, training the model from files packed into an LMDB database, which I had tried earlier under GPT's supervision, did not work. Training started, the video card RAM filled up, and the process froze. Oh.

I wrote to the author of the code about it, and he confirmed my doubts. He said that if it didn't work, I should throw out his code and insert another one. Take the code from AudioLDM, he wrote, only the loader, and read the files without the database, so it will be easier.

I tried AudioLDM. Training worked well there. But the model architecture there is different, and all I really needed was a file loader that reads new files directly from the file system, not from the database.

The first solution that came to mind was to upload files from both projects to GPT and ask 4o to transfer some logic from one to the other. They were very similar. I assumed that the 4o model would do a better job because you can upload files to it and it goes online.

I was so naive: GPT-4o suggested changing individual small parts of the code, constantly forgetting about the rest of the context. The code didn't work, although I did make some progress: old bugs were replaced by new ones. GPT started going in circles, and Google didn't know anything of the sort. All of this cost me 40 cents/hour of server time. Not much, but I don't like that anyway.

50 dollars later I was starting to lose hope. After thinking for a couple more days, I decided to ask the newer o1 model to simply rewrite the old code to read from the file system, removing all mention of LMDB. This, with some caveats, worked better than directly borrowing the code from another file: the training ran, but was now complaining about the file size. More precisely, the length of the files in the samples did not match the expected length. GPT confidently advised rewriting everything again (he loves it!), but I didn't believe him anymore.

And then it turned out that the problem, as usually happens, was not in the code but in the user: I had simply forgotten to trim frequencies and bring the length to exactly 10.24 seconds. So I did that, of course.

Modification of dataloader: files + json instead of LMDB

So, what to modify?

- Replaced LMDB with loading from JSON in hhhh.py.
- Removed all references to lmdb_path, mos_path, and filter_all.lst.
- Read JSON directly, where each record has fields like {"wav": "…", "caption": "…", "mos": 4.5}.
- Made the text field in the dataset a regular string, not a list/tuple.
  - Previously the field held the caption wrapped in a list; now it holds the caption string itself. This removed the error "TextEncodeInput must be Union[TextInputSequence, Tuple[InputSequence, InputSequence]]".
- Fixed mos so that it wasn't -1 (which broke mos_embed(…) in PixArt.py).
  - For the test, just set mos = 4.5 so as not to cause an error with Embedding (or you could assign an integer 3 if Embedding expects an index).
- In the latent_diffusion.py file:
  - Reduced num_workers from 8 to 2 (in DataLoader) to reduce potential memory issues.
  - You could also set num_sanity_val_steps=0 to disable the sanity check and go straight to training.
  - You can reduce batch_size in the config (or directly in the code) to 1 if GPU memory is low.

Files & scripts

Preparing a dataset. Here are various scripts for my needs.

Python script for auto audio slicing, trimming & normalization
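The loader changes described above can be sketched roughly like this. It is a minimal illustration, not the code from the QA_MDT repository: the names `Record`, `load_manifest`, `target_num_samples` and `fix_length` are hypothetical, the manifest is assumed to be a JSON list of records, and the 16 kHz sample rate is an assumption that should be replaced by the rate from your own config. Only the 10.24-second clip length, the record fields and the 4.5 test value for mos come from the post.

```python
import json
from dataclasses import dataclass
from pathlib import Path
from typing import List, Sequence

CLIP_SECONDS = 10.24   # clip length the training expects (from the post)
SAMPLE_RATE = 16_000   # assumption: replace with the rate in your config
DEFAULT_MOS = 4.5      # test value used in the post when mos is invalid


@dataclass
class Record:
    wav: str
    caption: str
    mos: float


def load_manifest(path: str) -> List[Record]:
    """Read a JSON list of {"wav", "caption", "mos"} records (no LMDB)."""
    raw = json.loads(Path(path).read_text(encoding="utf-8"))
    records = []
    for item in raw:
        caption = item["caption"]
        # The text encoder wants a plain string, so unwrap a list/tuple.
        if isinstance(caption, (list, tuple)):
            caption = caption[0]
        mos = item.get("mos", DEFAULT_MOS)
        # A missing or negative score (e.g. -1) falls back to the default.
        if mos is None or mos < 0:
            mos = DEFAULT_MOS
        records.append(Record(item["wav"], str(caption), float(mos)))
    return records


def target_num_samples(sample_rate: int = SAMPLE_RATE,
                       seconds: float = CLIP_SECONDS) -> int:
    """Number of samples in one clip of the expected duration."""
    return round(sample_rate * seconds)


def fix_length(samples: Sequence[float], length: int) -> List[float]:
    """Trim or zero-pad a mono signal to exactly `length` samples."""
    trimmed = list(samples[:length])
    return trimmed + [0.0] * (length - len(trimmed))
```

Under the 16 kHz assumption, `target_num_samples()` gives 163840 samples per clip. Any file whose length differs from the target would be passed through `fix_length` (or an equivalent step in the slicing script) before training.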

#ai #audio #audioldm #generativeai #music #qa_mdt #training

Improving new AI music models qa_mdt and AudioSR for a small project I have in mind.

For the first time in a long time I'm writing code, including revising other people's sources. A year ago, working with code in GPT was like working with a schoolboy: I wrote some prototypes quickly, but everything had to be double-checked and given to a real programmer for revision and rewriting.

Today GPT-4o is already a full-fledged assistant, similar to a normal developer: it explains details, understands poorly formulated tasks, keeps changes in its head, and writes code normally. Problems arise when the available information is incomplete, but they are solved by uploading and analyzing specific files.

Having a little experience with picture generators (in fact, I only know how to read Python code and roughly understand how the different functions and components of neural networks work, samplers for example), in a week I finished the small features I needed for sound generation. Then I sliced and normalized audio files, configured the software on the RunPod server, assigned tokens to the samples and uploaded them to the database for further training of the base model (the authors, by the way, write that it can be complex and they don't recommend it). Well, not me, but GPT, of course: I just wrote "I want to do all this, help me please". And it works.

For example, I found that generation in QA_MDT always uses a fixed seed. I went into the files and made the generation random, as is usually done in Stable Diffusion (yes, it's technically SD-based generation of sound frequency pictures, paper).
Then I put control of the seed and other parameters into the interface for easy management and model testing.

Modifications

Example modification of infer_mos5.py

```python
import re

def sanitize_filename(name: str) -> str:
    # The character class is an assumption: characters unsafe in filenames.
    return re.sub(r'[\\/*?:"<>|]', '_', name).replace(" ", "_")

def infer(dataset_key, configs, config_yaml_path, exp_group_name, exp_name,
          seed=0, output_filename=None, prompt=None):
    # If pipeline.py already passed a name, use it
    if output_filename is None:
        sanitized_prompt = sanitize_filename(prompt)
        output_filename = f"{sanitized_prompt}_{seed}.wav"
    print(f"Infer: Will save to {output_filename}")
    # val_loader, guidance_scale, etc. come from the unchanged parts of infer()
    latent_diffusion.generate_sample(
        val_loader,
        unconditional_guidance_scale=guidance_scale,
        ddim_steps=ddim_sampling_steps,
        n_gen=n_candidates_per_samples,
        name=output_filename
    )
    ...
```

Change ddpm.py to save output files with different names instead of 'awesome.wav'

```python
# Inside generate_sample(): fall back to the old name only if none is given
if name is None:
    name = "awesome.wav"
self.save_waveform(waveform, savepath="./", name=name)
```

```python
# At the call site: pass the generated filename through
output_filename = f"{sanitized_prompt}_{seed}.wav"
latent_diffusion.generate_sample(
    batchs=batchs,
    ddim_steps=ddim_steps,
    unconditional_guidance_scale=guidance_scale,
    name=output_filename,
    # ... other arguments unchanged
)
```

Example modification of pipeline.py for random seed and other parameters

```python
import random

def __call__(self, prompt: str, seed: int = None, cfg: float = None, steps: int = None):
    # If no seed, use random
    if seed is None:
        seed = random.randint(0, 999999)
    print(f"Using seed = {seed}")
    # Make filename from parameters (sanitize_filename comes from infer_mos5.py)
    sanitized_prompt = sanitize_filename(prompt)
    cfg_str = f"cfg{cfg}" if cfg is not None else "cfg?"
    steps_str = f"steps{steps}" if steps is not None else "steps?"
    filename = f"{sanitized_prompt}_{cfg_str}_{steps_str}_{seed}.wav"
    # Add cfg / steps to self.configs.
    # The key path is an assumption (AudioLDM-style config); adjust to your YAML.
    eval_params = self.configs["model"]["params"]["evaluation_params"]
    if cfg is not None:
        eval_params["unconditional_guidance_scale"] = cfg
    if steps is not None:
        eval_params["ddim_sampling_steps"] = steps
    dataset_key = self.build_dataset_json_from_prompt(prompt)
    infer(
        dataset_key=dataset_key,
        configs=self.configs,
        config_yaml_path=self.config_yaml,
        exp_group_name="qa_mdt",
        exp_name="mos_as_token",
        seed=seed,
        output_filename=filename,  # Set filename
        prompt=prompt
    )
    return filename

pipe = MOSDiffusionPipeline()
result = pipe("A modern synthesizer creating futuristic soundscapes.", seed=1234, cfg=10.0, steps=100)
print(f"Generated file: {result}")
```

Example of inference with parameters

```python
from qa_mdt.pipeline import MOSDiffusionPipeline

pipe = MOSDiffusionPipeline()
filename = pipe(prompt="smoke_on_water", seed=42, cfg=10.0, steps=100)
print("Generation done. File:", filename)
```

Which results in something like this:

```
Infer: Will save to smoke_on_water_cfg10.0_steps100_42.wav
Waveform saved at -> smoke_on_water_cfg10.0_steps100_42.wav
```

I also started to implement negative prompts and weight changes from https://huggingface.co/blog/audioldm2.

How to run QA_MDT (OpenMusic) on Runpod

Minimum requirements: 36-48 GB video RAM, 30 GB system and 100 GB files SSD.

Installation Instructions

```bash
# 0. Docker commands
bash -c "apt update;apt install -y wget;DEBIAN_FRONTEND=noninteractive apt-get install openssh-server -y;apt-get install -y magic-wormhole;apt-get install -y nano;apt-get install -y curl;apt-get install -y git;apt-get install -y git-lfs;apt-get install -y ffmpeg;apt-get install -y unzip;cd home;curl -O https://repo.anaconda.com/archive/Anaconda3-2024.10-1-Linux-x86_64.sh;chmod 777 Anaconda3-2024.10-1-Linux-x86_64.sh;cd ..;mkdir -p ~/.ssh;cd $_;chmod 700 ~/.ssh;echo YOUR_PUBLIC_KEY > authorized_keys;chmod 700 authorized_keys;service ssh start;sleep infinity"

# 1. Conda
cd /home
bash Anaconda3-2024.10-1-Linux-x86_64.sh;
source ~/.bashrc
conda create -n oml python=3.11 -y && conda activate oml

# 2. QA_MDT install
git clone https://huggingface.co/jadechoghari/openmusic qa_mdt
ls -ltr
pip install diffusers
pip install matplotlib
pip install pandas
pip install einops
pip install h5py
pip install gdown
pip install xformers==0.0.26.post1
pip install torchlibrosa==0.0.9 librosa==0.9.2
pip install -q pytorch_lightning==2.1.3 torchlibrosa==0.0.9 librosa==0.9.2 ftfy==6.1.1 braceexpand
pip install torch==2.3.0+cu121 torchvision==0.18.0+cu121 torchaudio==2.3.0 --index-url https://download.pytorch.org/whl/cu121
pip install -r qa_mdt/requirements.txt

# 3. Now create a file with the inference command
nano new.py
# from qa_mdt.pipeline import MOSDiffusionPipeline
#
# pipe = MOSDiffusionPipeline()
# pipe("A modern synthesizer creating futuristic soundscapes.")

# 4. Run
python new.py
```

Result

AudioSR Restoration

I also ran AudioSR locally on my MacBook, solving minor technical problems (as usual in open source). Comparing the restored files with the original ones, I noticed more emphasis on high frequencies. To be fair, it is enough to restore drums and voice, and to make low-quality and generative samples a bit better.

Installation instructions

```bash
apt-get update
apt-get install magic-wormhole
apt-get install nano
apt-get install curl
apt-get install git
apt-get install ffmpeg
curl -O https://repo.anaconda.com/archive/Anaconda3-2024.10-1-Linux-x86_64.sh
chmod 777 Anaconda3-2024.10-1-Linux-x86_64.sh
bash Anaconda3-2024.10-1-Linux-x86_64.sh
source ~/.bashrc
conda -V
conda create -n audiosr python=3.9; conda activate audiosr
git clone https://github.com/haoheliu/versatile_audio_super_resolution/
pip3 install audiosr==0.0.7
audiosr -i example/music.wav
audiosr -i INPUT_AUDIO_FILE
```

Result before/after

The progress over the year is awesome. But I wanted to train my own model, and so I did.
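The seed and file-naming logic from the pipeline.py modification earlier in this post can also be tried on its own, without loading the model. The sketch below is a standalone illustration under stated assumptions: `resolve_seed` and `build_filename` are hypothetical helper names, and the set of characters treated as unsafe in `sanitize_filename` is my guess rather than the exact pattern used in the project.

```python
import random
import re
from typing import Optional

# Assumption: characters treated as unsafe in filenames.
_UNSAFE = re.compile(r'[\\/*?:"<>|]')


def sanitize_filename(name: str) -> str:
    """Replace unsafe characters and spaces with underscores."""
    return _UNSAFE.sub("_", name).replace(" ", "_")


def resolve_seed(seed: Optional[int] = None) -> int:
    """Return the given seed, or draw a random one when none is supplied."""
    return random.randint(0, 999999) if seed is None else seed


def build_filename(prompt: str, seed: int,
                   cfg: Optional[float] = None,
                   steps: Optional[int] = None) -> str:
    """Encode prompt, guidance scale, step count and seed in the file name."""
    cfg_str = f"cfg{cfg}" if cfg is not None else "cfgdefault"
    steps_str = f"steps{steps}" if steps is not None else "stepsdefault"
    return f"{sanitize_filename(prompt)}_{cfg_str}_{steps_str}_{seed}.wav"


if __name__ == "__main__":
    seed = resolve_seed()  # random because no seed was passed
    print(build_filename("A modern synthesizer", seed, cfg=10.0, steps=100))
```

Unlike the "cfg?" and "steps?" placeholders in the post, this sketch uses "cfgdefault" and "stepsdefault" for missing values, since "?" is not allowed in Windows file names. Keeping the seed in the name makes it possible to regenerate a sample later with the same parameters.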

#ai #audioldm #audiosr #generativeai #gpt #music
