HOTR is a novel framework which directly predicts a set of {human, object, interaction} triplets from an image using a transformer-based encoder-decoder. Through the set-level prediction, our method ...