I think this description is wrong: the encoder-decoder attention creates its VALUE matrix from the layer below it, which is the decoder self-attention, and takes its QUERY and KEY matrices from the output of the encoder stack.

If you check the formula, softmax(QK^T / sqrt(d_k)) · V, then if V comes from the encoder, the output that the encoder-decoder attention passes down to the later decoder layers also comes from the encoder, which defeats the purpose of the decoder.
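To make the shape argument concrete, here is a minimal NumPy sketch of scaled dot-product attention. The variable names and shapes are my own illustrative assumptions, and which sequence supplies Q, K, and V is exactly the point under discussion; the key fact the formula shows is that the output always has as many rows as Q, and each output row is a weighted average of the rows of V, so whichever matrix plays the V role is the one whose content flows onward.

```python
import numpy as np

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V -- the formula quoted above."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # (len_Q, len_K) similarity scores
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    # Each output row is a weighted average of the rows of V,
    # arranged according to Q's positions.
    return weights @ V

# Illustrative shapes only: 4 decoder positions, 6 encoder positions, d_model = 8.
dec_out = np.random.randn(4, 8)   # output of the decoder self-attention sub-layer
enc_out = np.random.randn(6, 8)   # output of the encoder stack

# One possible wiring, shown purely to illustrate the shapes:
# the output length follows Q, and the output content is a mixture of V's rows.
mix = attention(Q=dec_out, K=enc_out, V=enc_out)
print(mix.shape)   # (4, 8)
```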