Hasty Briefs

LLMs can see and hear without any training

  • #LLM
  • #captioning
  • #multimodal
  • Official implementation of the paper 'LLMs can see and hear without any training'.
  • Install the conda environment using `conda env create -f environment.yml` and activate it with `conda activate MILS`.
  • Download datasets: MS-COCO, Clotho, and MSR-VTT, along with annotations and checkpoints.
  • Update variables in `paths.py` to set dataset directory and output folder.
  • MILS is an inference-only method that runs on a single A100 GPU, though the paper's experiments used eight A100 GPUs.
  • Generate captions for images, audio, and videos using the provided scripts, then score them with the corresponding evaluation scripts.
  • Generate high-quality images using `main_image_generation_enhancement.py`.
  • Perform style transfer by placing style and content images in the `images/` folder and running `main_style_transfer.py`.
  • Combine captions from image and audio to create prompts for image generation.
  • MILS is available under a CC BY-NC 4.0 license; third-party content remains subject to its own licenses.
  • Cite the work using the provided BibTeX entry.
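The setup step above asks you to update variables in `paths.py`. As a rough sketch, such a file might look like the following; the variable names here are assumptions for illustration, not the repo's actual identifiers.

```python
import os

# Hypothetical sketch of a paths.py configuration -- the real variable
# names in the MILS repo may differ.

# Root directory containing the downloaded datasets
DATASET_DIR = os.path.expanduser("~/data/mils_datasets")

# Directory where generated captions and images are written
OUTPUT_DIR = os.path.expanduser("~/data/mils_outputs")

# Per-dataset subdirectories (MS-COCO, Clotho, MSR-VTT)
MSCOCO_DIR = os.path.join(DATASET_DIR, "mscoco")
CLOTHO_DIR = os.path.join(DATASET_DIR, "clotho")
MSRVTT_DIR = os.path.join(DATASET_DIR, "msrvtt")
```

Pointing `DATASET_DIR` and `OUTPUT_DIR` at existing directories is typically all the generation and evaluation scripts need.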
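The cross-modal step above merges an image caption and an audio caption into one text-to-image prompt. A minimal sketch of that idea follows; the `combine_captions` helper and its prompt template are assumptions for illustration, not the repo's actual prompt format.

```python
def combine_captions(image_caption: str, audio_caption: str) -> str:
    """Merge captions from two modalities into a single generation prompt.

    Hypothetical template: the visual caption leads, and the audio
    caption is folded in as a sound description.
    """
    visual = image_caption.strip().rstrip(".")
    sound = audio_caption.strip().rstrip(".").lower()
    return f"{visual}, with the sound of {sound}"

prompt = combine_captions(
    "A busy city street at night.",
    "Rain falling on pavement.",
)
# -> "A busy city street at night, with the sound of rain falling on pavement"
```

The combined prompt can then be fed to the image-generation script in place of a single-modality caption.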