Why does falcon-7b have 71 attention heads?
#100
by alpindale
This makes tensor parallelism impossible in practice, since it requires the number of attention heads to be evenly divisible by the tensor-parallel degree. 71 is a prime number, so its only divisors are 1 and 71 — meaning the heads can't be split across the usual 2, 4, or 8 GPUs. This design choice doesn't make any sense to me.
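To illustrate the constraint, here's a minimal sketch (variable names are illustrative, not from any particular framework): Megatron-style tensor parallelism assigns each GPU the same whole number of heads, so only TP degrees that divide the head count are usable.

```python
# Minimal sketch of the tensor-parallel head-count constraint.
# Names are illustrative; real TP frameworks enforce the same
# divisibility check internally.

num_heads = 71  # falcon-7b's attention head count

# A TP degree is valid only if it divides the head count evenly,
# since each GPU must own an identical whole number of heads.
valid_tp_sizes = [tp for tp in range(1, num_heads + 1) if num_heads % tp == 0]

print(valid_tp_sizes)  # -> [1, 71]: prime head count, so no 2/4/8-way splits
```

With a prime head count, the check above leaves only the degenerate options of no parallelism (tp=1) or one head per GPU across 71 devices.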
Would be interested to know as well.