typo

#90
by jvelja - opened

In the Activation Memory section, the calculation for $$m_{act}$$ is the following:

$$m_{act} = L \cdot seq \cdot bs \cdot h \cdot \left(34 + \frac{5 \cdot n_{heads} \cdot seq}{h}\right)$$

And you guys state that this scales linearly with $$seq$$ and $$bs$$. It is linear in $$bs$$, but it actually scales quadratically with $$seq$$, as expanding the product shows:

$$
\begin{aligned}
m_{act} &= L \cdot seq \cdot bs \cdot h \cdot 34 + L \cdot seq \cdot bs \cdot h \cdot \frac{5 \cdot n_{heads} \cdot seq}{h} \\
m_{act} &= 34 \cdot L \cdot seq \cdot bs \cdot h + 5 \cdot L \cdot seq \cdot bs \cdot n_{heads} \cdot seq \\
m_{act} &= 34 \cdot L \cdot seq \cdot bs \cdot h + 5 \cdot L \cdot bs \cdot n_{heads} \cdot seq^2
\end{aligned}
$$

  • The first term, $$34 \cdot L \cdot seq \cdot bs \cdot h$$, scales linearly with $$seq$$.
  • The second term, $$5 \cdot L \cdot bs \cdot n_{heads} \cdot seq^2$$, scales quadratically with $$seq$$.
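
For anyone who wants to check the two terms numerically, here is a minimal sketch. The function names and the model dimensions are hypothetical, picked only for illustration; they are not taken from the playbook. It evaluates each term separately and compares what happens when $$seq$$ doubles.

```python
def activation_memory_terms(L, seq, bs, h, n_heads):
    """Return the (linear, quadratic) terms of the expanded m_act formula."""
    linear = 34 * L * seq * bs * h                 # grows with seq
    quadratic = 5 * L * bs * n_heads * seq ** 2    # grows with seq^2
    return linear, quadratic


def activation_memory(L, seq, bs, h, n_heads):
    """Total m_act as the sum of both terms."""
    return sum(activation_memory_terms(L, seq, bs, h, n_heads))


def crossover_seq(h, n_heads):
    """Sequence length above which the quadratic term exceeds the linear one.

    Obtained by solving 5 * n_heads * seq > 34 * h for seq.
    """
    return 34 * h / (5 * n_heads)


if __name__ == "__main__":
    # Hypothetical dimensions, chosen only to illustrate the scaling.
    cfg = dict(L=32, bs=1, h=4096, n_heads=32)

    lin_1, quad_1 = activation_memory_terms(seq=2048, **cfg)
    lin_2, quad_2 = activation_memory_terms(seq=4096, **cfg)

    print("linear term ratio when seq doubles:   ", lin_2 / lin_1)
    print("quadratic term ratio when seq doubles:", quad_2 / quad_1)
    print("crossover seq:", crossover_seq(cfg["h"], cfg["n_heads"]))
```

Doubling $$seq$$ doubles the first term and quadruples the second, so for long enough sequences the $$seq^2$$ term dominates the total.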
Nanotron Research org

fixed here #91 thanks! :)

eliebak changed discussion status to closed
