I think it would be better instead of “…correspond to two points 16 pixels apart…” using “1 pixel…

Source: Deep Learning on Medium

I think it would be better instead of “…correspond to two points 16 pixels apart…” using “1 pixel in the output have 16×16 pixel receptive field in the input image”. CNN process is not a simple sampling, there are many convolution operations; therefore, using the term “receptive field” seems better.