Intuition on Deep Residual Network
I was reading the Deep Residual Network paper, and there is a concept in it that I cannot fully understand.
Question:
What does "hope the 2 weight layers fit f(x)" mean?
Here f(x) is the result of processing x with two weight layers (plus a ReLU non-linearity), so is the desired mapping h(x) = f(x)? Where is the residual?
What does "hope the 2 weight layers fit f(x)" mean?
The residual unit shown obtains f(x) by processing x with two weight layers. It then adds x to f(x) to obtain h(x). Now, assume that h(x) is the ideal predicted output, the one that matches the ground truth. Since h(x) = f(x) + x, obtaining the desired h(x) depends on getting the right f(x), namely f(x) = h(x) - x. That means the two weight layers in the residual unit have to be able to produce this desired f(x); once they do, getting the ideal h(x) is guaranteed.
Here f(x) is the result of processing x with two weight layers (plus a ReLU non-linearity), so is the desired mapping h(x) = f(x)? Where is the residual?
The first part is correct. f(x) is obtained from x as follows:
    x -> weight_1 -> relu -> weight_2
h(x) is obtained from f(x) as follows:
    f(x) + x -> relu
So I don't understand the second part of the question. The residual is f(x).
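The two flows above can be written out as a minimal NumPy sketch. It assumes plain matrix multiplications without biases for the weight layers, and the names `residual_unit`, `w1` and `w2` are made up for illustration; the paper itself uses convolutional layers.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def residual_unit(x, w1, w2):
    """Return (f, h) for one residual unit; w1 and w2 are illustrative weight matrices."""
    f = w2 @ relu(w1 @ x)   # x -> weight_1 -> relu -> weight_2
    h = relu(f + x)         # f(x) + x -> relu
    return f, h

rng = np.random.default_rng(0)
x = rng.normal(size=4)
w1 = rng.normal(size=(4, 4))
w2 = rng.normal(size=(4, 4))

f, h = residual_unit(x, w1, w2)
print("f(x) =", f)
print("h(x) =", h)
```

The weight layers only ever compute f(x); the shortcut adds x back afterwards, which is why f(x) is called the residual.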
The authors hypothesize that the residual mapping (i.e. f(x)) may be easier to optimize than h(x). To illustrate with a simple example, assume that the ideal h(x) = x. With a direct mapping, it would be difficult to learn the identity, because there is a stack of non-linear layers in the way:
    x -> weight_1 -> relu -> weight_2 -> relu -> ... -> x
Approximating the identity mapping with all these weights and ReLUs in the middle is difficult.
Now, if we define the desired mapping as h(x) = f(x) + x, then we only need f(x) = 0:
    x -> weight_1 -> relu -> weight_2 -> relu -> ... -> 0   # note the 0 at the end
Achieving this is easy. Just set the weights to zero and the output is zero. Add x back and you have the desired mapping.
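A small NumPy check of this argument, as a sketch only: it assumes bias-free matrix layers and leaves out the ReLU after the addition so that the output is exactly h(x) = f(x) + x as in the text. The function names are hypothetical.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def plain_stack(x, w1, w2):
    # x -> weight_1 -> relu -> weight_2 (no shortcut)
    return w2 @ relu(w1 @ x)

def residual_stack(x, w1, w2):
    # same layers, plus the shortcut that adds x back
    return plain_stack(x, w1, w2) + x

x = np.array([1.5, -2.0, 0.3])
w1 = np.zeros((3, 3))
w2 = np.zeros((3, 3))

f_out = plain_stack(x, w1, w2)
h_out = residual_stack(x, w1, w2)
print(f_out)  # all zeros: f(x) = 0
print(h_out)  # equals x:  h(x) = f(x) + x = x
```

With zero weights the plain stack can only output zero, while the residual version reproduces x exactly. The plain stack would instead have to find weights for which the whole chain of layers and ReLUs composes to the identity.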
Another factor in the success of residual networks is the uninterrupted gradient flow from the first layer to the last layer. That is out of scope for this question. You can read the paper "Identity Mappings in Deep Residual Networks" for more information on this.
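As a rough numerical illustration of the gradient-flow remark (not a reproduction of the analysis in that paper): the Jacobian of f(x) + x with respect to x is the Jacobian of f plus the identity matrix, so the shortcut still passes gradient through when the weight layers contribute almost nothing. The sketch assumes bias-free matrix layers, ignores the ReLU after the addition, and uses a hypothetical finite-difference helper `jacobian`.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def f(x, w1, w2):
    # residual branch: x -> weight_1 -> relu -> weight_2
    return w2 @ relu(w1 @ x)

def jacobian(fn, x, eps=1e-6):
    # central finite differences, one input coordinate at a time
    n = x.size
    cols = []
    for i in range(n):
        e = np.zeros(n)
        e[i] = eps
        cols.append((fn(x + e) - fn(x - e)) / (2 * eps))
    return np.stack(cols, axis=1)

rng = np.random.default_rng(0)
x = rng.normal(size=3)
w1 = 1e-3 * rng.normal(size=(3, 3))  # deliberately tiny weights
w2 = 1e-3 * rng.normal(size=(3, 3))

j_plain = jacobian(lambda v: f(v, w1, w2), x)     # weight layers only
j_res = jacobian(lambda v: f(v, w1, w2) + v, x)   # weight layers plus shortcut

print(np.abs(j_plain).max())                       # close to zero
print(np.allclose(j_res, j_plain + np.eye(3), atol=1e-6))
```

The plain Jacobian nearly vanishes with such small weights, whereas the residual one stays close to the identity.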
