Compare commit messages and label the better one
Compare model outputs pairwise and label them
B-Norm metric