This is a small, barebones demo of the SNMN (Stack Neural Module Network) from this paper, using my own reproduction of the paper from this repository. To use it, choose an image from the list, then choose a question, or type in your own. You can also choose between a model trained with ground-truth layouts and one trained without them. Check out my blog post for more details!
If the prediction takes a long time to run, just reload the page and try again (requests sometimes time out due to slow servers).

Below is the executed program, i.e. the modules used by the network in the order they are applied. For each module you see its name and description, a visualisation of the attention over the question words at that timestep (darker blue means more attention), the module's effect on the stack (i.e. how many inputs it pops and how many outputs it pushes), and a visualisation of the attention over the image currently at the top of the stack. For No-op modules these details are omitted, since No-op does nothing.
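To make the pop/push bookkeeping concrete, here is a minimal sketch of the stack discipline the demo visualises. The module names and pop/push counts are illustrative stand-ins, and the attention maps are just lists of floats rather than real spatial attention tensors:

```python
# Hypothetical sketch of the SNMN-style stack: each module pops some
# attention maps from the stack and pushes its outputs back on.

def apply_module(stack, n_pop, outputs):
    """Pop n_pop attention maps, push the module's outputs, return the popped inputs."""
    inputs = [stack.pop() for _ in range(n_pop)]
    stack.extend(outputs)
    return inputs

stack = []
# A "find"-style module: pops 0 inputs, pushes 1 attention map.
apply_module(stack, 0, [[0.1, 0.7, 0.2]])
# A "transform"-style module: pops 1 input, pushes 1 output.
apply_module(stack, 1, [[0.05, 0.15, 0.8]])
# A No-op: pops 0, pushes 0 -- the stack is unchanged.
apply_module(stack, 0, [])

print(len(stack))   # one map remains: the attention shown at the top of the stack
print(stack[-1])    # the current top-of-stack attention map
```

In the real model the stack operations are soft (differentiable), but the pop/push counts shown in the demo correspond to this same discipline.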