This guy makes crazy AI music videos, but they also take a lot of manual video editing skill:
https://www.youtube.com/watch?v=m8dMy5_Lox4
I think he has a workflow to create in low resolution and then upscale the video at the end.
Couldn't tell you what models to use. Usually you need to create images (first frame / last frame), e.g. with Nano Banana, and then use a video model to generate the sequence between them.
It will definitely be costly and a lot of work.
I also like lyric videos. This one is amazing:
https://www.youtube.com/watch?v=bXS2MF5XW60
Or you just film some real shit...
https://www.youtube.com/watch?v=WyRL4edCM1Q