nanotron/ultrascale-playbook

6 days ago

In the Activation Memory section, the calculation for $$m_{act}$$ is the following:

$m_{act} = L \cdot seq \cdot bs \cdot h \cdot \left(34 + \frac{5 \cdot n_{heads} \cdot seq}{h}\right)$

You state that this scales linearly with $$seq$$ and $$bs$$. It is linear in $$bs$$, but it actually scales quadratically with $$seq$$, as expanding the product shows:

$\begin{aligned} m_{act} &= L \cdot seq \cdot bs \cdot h \cdot 34 + L \cdot seq \cdot bs \cdot h \cdot \frac{5 \cdot n_{heads} \cdot seq}{h} \\ &= 34 \cdot L \cdot seq \cdot bs \cdot h + 5 \cdot L \cdot seq \cdot bs \cdot n_{heads} \cdot seq \\ &= 34 \cdot L \cdot seq \cdot bs \cdot h + 5 \cdot L \cdot bs \cdot n_{heads} \cdot seq^2 \end{aligned}$

The first term, $$34 \cdot L \cdot seq \cdot bs \cdot h$$, scales linearly with $$seq$$.
The second term, $$5 \cdot L \cdot bs \cdot n_{heads} \cdot seq^2$$, scales quadratically with $$seq$$.
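
For anyone who wants to check this numerically, here is a small sketch that evaluates the two terms separately. The function names and the example model shape (`L=32`, `h=4096`, `n_heads=32`) are my own illustrative assumptions, not values taken from the playbook; only the formula itself comes from the section quoted above.

```python
def activation_memory_terms(L, seq, bs, h, n_heads):
    """Split m_act into its linear and quadratic parts (hypothetical helper)."""
    linear = 34 * L * seq * bs * h               # grows with seq
    quadratic = 5 * L * bs * n_heads * seq ** 2  # grows with seq^2
    return linear, quadratic


def activation_memory(L, seq, bs, h, n_heads):
    """Total m_act, i.e. the sum of the two terms above."""
    return sum(activation_memory_terms(L, seq, bs, h, n_heads))


def crossover_seq(h, n_heads):
    """Sequence length at which both terms are equal: 34*h = 5*n_heads*seq."""
    return 34 * h / (5 * n_heads)


if __name__ == "__main__":
    # Assumed example shape, chosen only for illustration.
    L, bs, h, n_heads = 32, 1, 4096, 32
    for seq in (1024, 2048, 4096):
        lin, quad = activation_memory_terms(L, seq, bs, h, n_heads)
        print(f"seq={seq}: linear={lin}, quadratic={quad}, ratio={quad / lin:.2f}")
    print("terms are equal at seq =", crossover_seq(h, n_heads))
```

Doubling `seq` doubles the first term and quadruples the second, so beyond the crossover point the $$seq^2$$ term dominates the total.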

eliebak

Nanotron Research org 6 days ago

fixed here #91 thanks! :)

eliebak changed discussion status to closed 6 days ago
