LLMs can see and hear without any training
- Official implementation of the paper 'LLMs can see and hear without any training'.
- Create the conda environment with `conda env create -f environment.yml` and activate it with `conda activate MILS`.
- Download datasets: MS-COCO, Clotho, and MSR-VTT, along with annotations and checkpoints.
- Update the variables in `paths.py` to point to your dataset directories and output folder (see the configuration sketch after this list).
- MILS is an inference-only method and can run on a single A100 GPU, although the paper's experiments were run on eight A100 GPUs (see the loop sketch after this list).
- Generate captions for images, audio, and videos with the provided scripts, and evaluate them with the corresponding evaluation scripts.
- Generate high-quality images using `main_image_generation_enhancement.py`.
- Perform style transfer by placing style and content images in the `images/` folder and running `main_style_transfer.py`.
- Combine image and audio captions into a single prompt for cross-modal image generation (see the prompt sketch after this list).
- MILS is available under a CC BY-NC 4.0 license; third-party content is subject to its own licenses.
- Cite the work using the provided BibTeX entry.
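
A minimal sketch of what the `paths.py` configuration might look like. The variable names below (`COCO_DIR`, `CLOTHO_DIR`, `MSRVTT_DIR`, `OUTPUT_DIR`) are illustrative assumptions and may differ from the actual file; check `paths.py` in the repository for the real names.

```python
# paths.py -- illustrative sketch; the real variable names may differ.
# Point each entry at the directory where the dataset was extracted,
# and OUTPUT_DIR at a writable folder for generated captions/images.

COCO_DIR = "/data/datasets/mscoco"      # MS-COCO images and annotations
CLOTHO_DIR = "/data/datasets/clotho"    # Clotho audio clips and captions
MSRVTT_DIR = "/data/datasets/msrvtt"    # MSR-VTT videos and annotations
OUTPUT_DIR = "/data/mils_outputs"       # where results are written
```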
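To illustrate what "inference-only" means here, below is a minimal sketch of the kind of generator-scorer loop the paper describes: an LLM proposes candidate captions, a frozen pretrained multimodal model scores them against the input, and the best candidates are fed back as context for the next round. The function names, signatures, and defaults are placeholder assumptions, not the repository's actual API.

```python
from typing import Callable, List, Tuple

def mils_caption(
    generate: Callable[[List[str]], List[str]],  # LLM: best captions so far -> new candidates
    score: Callable[[str], float],               # e.g. CLIP-style similarity of a caption to the input
    steps: int = 10,
    keep: int = 5,
) -> str:
    """Illustrative generator-scorer loop (not the repository's actual code).

    At each step the LLM proposes candidate captions conditioned on the best
    candidates found so far; a frozen multimodal scorer ranks them against the
    input. No weights are updated, which is why the method is inference-only.
    """
    best: List[Tuple[float, str]] = []
    for _ in range(steps):
        feedback = [caption for _, caption in best]
        candidates = generate(feedback)
        scored = [(score(caption), caption) for caption in candidates]
        best = sorted(best + scored, reverse=True)[:keep]
    return best[0][1] if best else ""
```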
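A minimal sketch of combining an image caption and an audio caption into one text-to-image prompt. The helper name and prompt template are illustrative assumptions; the repository's actual prompt format may differ.

```python
def combine_captions(image_caption: str, audio_caption: str) -> str:
    """Merge per-modality captions into one prompt for a text-to-image model.

    Illustrative only: the real prompt template used by MILS may differ.
    """
    return (
        f"{image_caption.strip().rstrip('.')}, "
        f"with the sound of {audio_caption.strip().rstrip('.').lower()}"
    )

# Example: a COCO-style image caption plus a Clotho-style audio caption.
prompt = combine_captions(
    "A dog running on a beach at sunset.",
    "Waves crashing against the shore.",
)
print(prompt)
# -> "A dog running on a beach at sunset, with the sound of waves crashing against the shore"
```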