Rethinking the Evaluation of Compositional Reasoning for Modern VLMs