LLMs can see and hear without any training
- Official implementation of the paper 'LLMs can see and hear without any training'.
- Create the conda environment with `conda env create -f environment.yml` and activate it with `conda activate MILS`.
- Download datasets: MS-COCO, Clotho, and MSR-VTT, along with annotations and checkpoints.
- Update the variables in `paths.py` to point to your dataset directories and output folder (see the configuration sketch after this list).
- MILS is an inference-only method and can run on a single A100 GPU, although the paper's experiments were run on eight A100 GPUs (see the loop sketch after this list).
- Generate captions for images, audio, and videos with the provided scripts, and evaluate them with the corresponding evaluation scripts.
- Generate high-quality images using `main_image_generation_enhancement.py`.
- Perform style transfer by placing style and content images in the `images/` folder and running `main_style_transfer.py`.
- Combine image and audio captions into a single prompt for cross-modal image generation (see the prompt sketch after this list).
- MILS is available under a CC BY-NC 4.0 license; third-party content is subject to its own licenses.
- Cite the work using the provided BibTeX entry.
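
A minimal sketch of what the `paths.py` configuration might look like. The variable names below (`COCO_DIR`, `CLOTHO_DIR`, `MSRVTT_DIR`, `OUTPUT_DIR`) are illustrative assumptions and may differ from the actual file; check `paths.py` in the repository for the real names.

```python
# paths.py -- illustrative sketch; the real variable names may differ.
# Point each entry at the directory where the dataset was extracted,
# and OUTPUT_DIR at a writable folder for generated captions/images.

COCO_DIR = "/data/datasets/mscoco"      # MS-COCO images and annotations
CLOTHO_DIR = "/data/datasets/clotho"    # Clotho audio clips and captions
MSRVTT_DIR = "/data/datasets/msrvtt"    # MSR-VTT videos and annotations
OUTPUT_DIR = "/data/mils_outputs"       # where results are written
```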
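To illustrate what "inference-only" means here, below is a minimal sketch of the kind of generator-scorer loop the paper describes: an LLM proposes candidate captions, a frozen pretrained multimodal model scores them against the input, and the best candidates are fed back as context for the next round. The function names, signatures, and defaults are placeholder assumptions, not the repository's actual API.

```python
from typing import Callable, List, Tuple

def mils_caption(
    generate: Callable[[List[str]], List[str]],  # LLM: best captions so far -> new candidates
    score: Callable[[str], float],               # e.g. CLIP-style similarity of a caption to the input
    steps: int = 10,
    keep: int = 5,
) -> str:
    """Illustrative generator-scorer loop (not the repository's actual code).

    At each step the LLM proposes candidate captions conditioned on the best
    candidates found so far; a frozen multimodal scorer ranks them against the
    input. No weights are updated, which is why the method is inference-only.
    """
    best: List[Tuple[float, str]] = []
    for _ in range(steps):
        feedback = [caption for _, caption in best]
        candidates = generate(feedback)
        scored = [(score(caption), caption) for caption in candidates]
        best = sorted(best + scored, reverse=True)[:keep]
    return best[0][1] if best else ""
```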
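A minimal sketch of combining an image caption and an audio caption into one text-to-image prompt. The helper name and prompt template are illustrative assumptions; the repository's actual prompt format may differ.

```python
def combine_captions(image_caption: str, audio_caption: str) -> str:
    """Merge per-modality captions into one prompt for a text-to-image model.

    Illustrative only: the real prompt template used by MILS may differ.
    """
    return (
        f"{image_caption.strip().rstrip('.')}, "
        f"with the sound of {audio_caption.strip().rstrip('.').lower()}"
    )

# Example: a COCO-style image caption plus a Clotho-style audio caption.
prompt = combine_captions(
    "A dog running on a beach at sunset.",
    "Waves crashing against the shore.",
)
print(prompt)
# -> "A dog running on a beach at sunset, with the sound of waves crashing against the shore"
```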