A Taxonomy of Tensor Operations
Tensor operations are the building blocks of deep learning frameworks, but not all ops are created equal. In this post, I’ll give you a simple framework for understanding them, and show why reshape is nearly free, while matrix multiplication burns GPU cycles.
View vs. Data Operations
Tensor operations fall into two main categories: view operations, which only change how a tensor is indexed without moving any data, and data operations, which change the underlying values. View operations are typically very fast since they just update indexing instructions, while data operations do the actual computational work.
View Operations
View operations take a tensor and change how the tensor is indexed. These are not very mathematically interesting, but can make tensor programs more conceptually simple or computationally efficient.
Permute
The simplest example of a view operation is permute.
For example:
a b c -> b a c
If you try to access the element at [i, j, k] in the output tensor, it is equivalent to accessing the element at [j, i, k] in the original tensor.
For a concrete example: a tensor representing a batch of RGB images might have shape [batch, height, width, channels]. Permuting to [batch, channels, height, width] rearranges the dimensions without copying any pixel data.
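As a quick illustration (my own sketch, not from the original post), NumPy's transpose does exactly this and returns a view that shares memory with the original array:

```python
import numpy as np

# A fake batch of 2 RGB images: [batch, height, width, channels]
images = np.zeros((2, 4, 4, 3), dtype=np.float32)

# Permute to [batch, channels, height, width]: no pixel data is copied
permuted = images.transpose(0, 3, 1, 2)

print(permuted.shape)                      # (2, 3, 4, 4)
print(np.shares_memory(images, permuted))  # True: it is just a view
```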
Reshape
Before we get started with reshape, let's first review a few key concepts.
Raveling and Unraveling
Think of flattening a 2D spreadsheet into a single column. You go row by row, placing each cell sequentially. This is essentially what happens when any multi-dimensional tensor gets stored in your computer's linear memory. It needs to be "flattened" into a 1D representation.
Similar to my last post, I am using 1-based indexing into tensors because I like the symmetry of counting from either the beginning or the end of the list. However, the actual values of the index vector and linear_index are 0-based, for reasons that will become apparent shortly.
For a multi-dimensional index (index) into a tensor of a given shape (shape), the canonical way that this mapping occurs is:
linear_index = index[-1] + index[-2]*shape[-1] + index[-3]*shape[-1]*shape[-2] + ....
For example, we can think of a grayscale image as a 2D tensor of shape [h, w].
You can imagine this linear index going row by row down the image.
Where i indexes height (h) and j indexes width (w):
linear_index = j + i*w = i*w + j
So each i jumps a row, and each j selects a position within the row.
This is called raveling.
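As a sanity check, here is the same formula next to NumPy's built-in ravel_multi_index (a small sketch of mine):

```python
import numpy as np

h, w = 4, 5                 # grayscale image shape [h, w]
i, j = 3, 0                 # 0-based multi-dimensional index

linear_index = i * w + j    # manual raveling
print(linear_index)         # 15

# NumPy computes the same mapping for arbitrary shapes
print(np.ravel_multi_index((i, j), (h, w)))   # 15
```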
If we want to get the index back from the linear_index, assuming we have the original shape, it is a bit trickier. If you're curious, it's worth working through this yourself first.
Let’s clarify with our grayscale image example, where:
linear_index = i*w + j
If we wanted to get j back from this, remember that the modulo operator (%) gives the remainder after division. Since j is between [0, w):
j = linear_index % w
Now we just need i, which is:
i = (linear_index - j) / w = (linear_index - linear_index % w) / w = linear_index // w
Where // is floor division, i.e., division rounded down to the nearest integer.
For higher-dimensional indices, the pattern is as follows:
index[-1] = linear_index % shape[-1]
This ensures that anything multiplied by shape[-1] in our linear_index goes to 0.
So:
= linear_index % shape[-1]
= (index[-1] + index[-2]*shape[-1] + index[-3]*shape[-1]*shape[-2] + ...) % shape[-1]
= index[-1] % shape[-1] + index[-2]*shape[-1] % shape[-1] + index[-3]*shape[-1]*shape[-2] % shape[-1] + ...
= index[-1] % shape[-1] + 0 + 0 + ...
= index[-1]
Then, for the next one, we strip off the previously found index using the floor division (//) from earlier:
linear_index // shape[-1]
= (index[-1] + index[-2]*shape[-1] + index[-3]*shape[-1]*shape[-2] + ...) // shape[-1]
= index[-1] // shape[-1] + index[-2]*shape[-1] // shape[-1] + index[-3]*shape[-1]*shape[-2] // shape[-1] + ...
= 0 + index[-2] + index[-3]*shape[-2] + ...
= index[-2] + index[-3]*shape[-2] + ...
But since we have more dimensions, we need to take the modulo again for the next dimension:
index[-2] = (linear_index // shape[-1]) % shape[-2]
index[-3] = (linear_index // (shape[-1]*shape[-2])) % shape[-3]
index[-4] = (linear_index // (shape[-1]*shape[-2]*shape[-3])) % shape[-4]
Returning to our grayscale image example:
If we have linear_index = 15 for a 4x5 image, we get j = 15 % 5 = 0 and i = 15 // 5 = 3, so we're at position [3, 0], the first pixel of the fourth row.
This clean math is only possible for 0-based indexing.
This is called unraveling.
To me, the naming feels a little counterintuitive, because linearizing a tensor feels like unraveling a string from a fabric, but whatever.
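Here is the 4x5 example from above, done by hand and then with NumPy's unravel_index (again just a sketch):

```python
import numpy as np

h, w = 4, 5
linear_index = 15

j = linear_index % w        # 0
i = linear_index // w       # 3
print(i, j)                 # 3 0 -> position [3, 0]

# NumPy's built-in agrees
i2, j2 = np.unravel_index(linear_index, (h, w))
print(int(i2), int(j2))     # 3 0
```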
Merge and Split
Now that that is out of the way, there are two fundamental reshape operations: Merge and Split.
Merge
The simpler of the two, Merge takes a group of adjacent dimensions and converts their multi-dimensional indexing into a single linear index.
For example:
b h w c -> b (h w) c
Indexing into the output tensor at [i, j, k] is equivalent to indexing into the original tensor at the unraveled index [i, j // w, j % w, k].
Merge doesn't require any additional information beyond which adjacent dimensions to merge.
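As a sketch of how this looks in code, einops' rearrange (assuming the package is available) performs the merge, and the indexing rule above can be checked directly:

```python
import numpy as np
from einops import rearrange   # assuming the einops package is installed

x = np.arange(2 * 3 * 4 * 5).reshape(2, 3, 4, 5)   # b h w c, so h=3, w=4

merged = rearrange(x, 'b h w c -> b (h w) c')
print(merged.shape)            # (2, 12, 5)

# Output index [i, j, k] matches original index [i, j // w, j % w, k]
i, j, k, w = 1, 7, 2, 4
assert merged[i, j, k] == x[i, j // w, j % w, k]
```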
Split
Split is a bit more challenging because you must know all but one of the resulting dimensions.
b (h w) c -> b h w c
In this case, you must know either the size of h or the size of w.
Most tensor frameworks require you to specify target dimensions explicitly when splitting, because the target shape isn't fully inferable from the input.
Indexing into the output tensor at [i, j, k, l] is equivalent to indexing into the original tensor at the raveled index [i, j*w + k, l].
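Again as a sketch with einops, note that you have to pass one of the sizes yourself:

```python
import numpy as np
from einops import rearrange   # assuming einops is installed

x = np.arange(2 * 12 * 5).reshape(2, 12, 5)        # b (h w) c, with h*w = 12

# The target shape is ambiguous, so we must supply h or w explicitly
split = rearrange(x, 'b (h w) c -> b h w c', w=4)
print(split.shape)             # (2, 3, 4, 5)

# Output index [i, j, k, l] matches original raveled index [i, j*w + k, l]
i, j, k, l, w = 1, 2, 3, 4, 4
assert split[i, j, k, l] == x[i, j * w + k, l]
```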
Notice how easy it is to add singleton dimensions by doing:
(1 1 1 1 1 1 1 1 1 b) h c -> 1 1 1 1 1 1 1 1 1 b h c
You can see that the linear indices remain unaffected:
linear_index = index[-1] + index[-2]*shape[-1] + index[-3]*shape[-1]*shape[-2] + index[-4]*shape[-1]*shape[-2]*shape[-3] + ...
linear_index = k + j*c + i*c*h + 0*c*h*b + ...
linear_index = k + j*c + i*c*h
Window and Expand
Window
Window takes a tensor and outputs a sub-tensor.
For example:
b h w c -> b h[10:-10] w[10:-10] c
In this example, we are "cutting out" a subtensor where the h and w dimensions start 10 indices later and end 10 indices earlier, shortening both the height and width dimensions by 20.
Expand
Expand takes a tensor and outputs a tensor with extra padding or edge extension.
For example:
b h[10:-10] w[10:-10] c -> b h w c
This adds padding so that the h and w dimensions start 10 indices earlier and end 10 indices later, lengthening both the height and width dimensions by 20.
Now this is a lot trickier because you need extra information on how to fill in the new indices.
The most common padding modes are:
- Zero padding: Fill new positions with zeros (most common)
- Edge padding: Repeat the nearest edge value
- Wrap padding: Tile the tensor by repeating from the opposite edge
- Reflection padding: Mirror the tensor at the boundaries
Getting the padding mode wrong is a common source of subtle bugs, especially in image processing where edge artifacts can propagate through your model.
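To make the modes above concrete, here is a small NumPy sketch padding the same 1D tensor under each of them:

```python
import numpy as np

x = np.array([1, 2, 3])

print(np.pad(x, 2, mode='constant'))   # [0 0 1 2 3 0 0]  zero padding
print(np.pad(x, 2, mode='edge'))       # [1 1 1 2 3 3 3]  edge padding
print(np.pad(x, 2, mode='wrap'))       # [2 3 1 2 3 1 2]  wrap padding
print(np.pad(x, 2, mode='reflect'))    # [3 2 1 2 3 2 1]  reflection padding
```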
Data Operations
Data operations are a different beast entirely; the actual values of the tensor are modified. This is usually where most of the computational work is being done.
Map
The map operation takes a tensor and performs an operation on each element independently. In machine learning frameworks, think of it as applying element-wise functions like ReLU, tanh or sigmoid to a tensor.
It is a function that takes in a tensor of some shape and outputs a tensor of the same shape.
For example:
b h w c -func> b h w c
or
func(Tensor<shape>) -> Tensor<shape>
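For instance, ReLU as a map in NumPy (a sketch of mine):

```python
import numpy as np

x = np.array([[-1.0, 2.0], [3.0, -4.0]])   # any shape works

relu = np.maximum(x, 0.0)    # the function is applied to every element independently
print(relu.shape)            # (2, 2): same shape as the input
print(relu)                  # negatives become 0, everything else is unchanged
```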
Reduce
The reduce operation takes a tensor and combines all the subtensors over specified axes using a (usually associative) function. In machine learning frameworks, you can think of this as a sum or product function.
Essentially, it takes a list of equally-shaped subtensors (one per index along the reduced axes) and outputs a single tensor of that shape.
For example, reducing over the h dimension:
b h w c -func> b w c
or
func(List<Tensor<shape>>) -> Tensor<shape>
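For instance, summing out the h dimension in NumPy (a sketch), which is the same as combining the list of h subtensors with +:

```python
import numpy as np

x = np.ones((2, 3, 4, 5))          # b h w c

reduced = x.sum(axis=1)            # reduce over h
print(reduced.shape)               # (2, 4, 5) -> b w c

# Equivalent view: combine the list of h subtensors with +
subtensors = [x[:, i] for i in range(x.shape[1])]   # each has shape b w c
assert np.array_equal(reduced, sum(subtensors))
```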
Transform
The transform operation takes a tensor and performs an operation independently on all the subtensors over specified axes. In machine learning frameworks, you can think of this as applying softmax to the last dimension of a tensor, or applying a filter to each image in a batch independently.
It is a function that takes a list of tensors of the same shape and outputs a list of tensors of that shape.
For example:
b h w c / h w c -func> b h w c
or
func(List<Tensor<shape>>) -> List<Tensor<shape>>
Transform is the version of map generalized to arbitrary subtensors, so you can reproduce map via b h w c / b h w c -func> b h w c.
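For example, a softmax applied independently along the last axis, sketched in NumPy with a hand-rolled softmax helper:

```python
import numpy as np

def softmax(v):
    """Softmax over a 1D vector."""
    z = np.exp(v - v.max())
    return z / z.sum()

x = np.random.rand(2, 3, 4, 5)     # b h w c

# b h w c / c -softmax> b h w c : apply the function to each c-slice independently
out = np.apply_along_axis(softmax, -1, x)

print(out.shape)                          # (2, 3, 4, 5): shape is unchanged
print(np.allclose(out.sum(axis=-1), 1.0)) # True: every c-slice now sums to 1
```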
Amplect
The Amplect operation takes multiple tensors and applies a scalar function (usually associative) elementwise across the given corresponding dimensions. In machine learning frameworks, this is equivalent to broadcast ops like multiplication or addition.
Amplect (from Latin amplector, 'to embrace') is a non-standard name I'm using here to differentiate it from broadcast-<name>.
For example:
b h w c, c -func> b h w c
Or
associative_func(Tensor<[]>, Tensor<[]>) -> Tensor<[]>
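In NumPy this is plain broadcasting; a quick sketch:

```python
import numpy as np

images = np.random.rand(2, 4, 4, 3)           # b h w c
channel_scale = np.array([0.1, 1.0, 10.0])    # c

# b h w c, c -multiply> b h w c : the c axes are lined up and * is applied elementwise
scaled = images * channel_scale
print(scaled.shape)                           # (2, 4, 4, 3)
```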
Difference Between View and Data Ops
As you can see, view ops can be chained one after another while just keeping track of instructions for how to index into the tensor.
In the Candle framework, after performing view operations you sometimes need to call the contiguous function. Because reindexing doesn't change the memory layout, it is really cheap, and the actual reordering of data is only done lazily when contiguous is called.
You typically need to call contiguous() when you're about to perform data operations that require the tensor to be laid out sequentially in memory, like matrix multiplications or passing data to external libraries that expect contiguous arrays.
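Candle is a Rust framework, but PyTorch exposes the same contiguous concept, so here is a small Python sketch of the pattern:

```python
import torch

x = torch.arange(12).reshape(3, 4)

y = x.permute(1, 0)             # a view: no data moves, only the strides change
print(y.is_contiguous())        # False

z = y.contiguous()              # here the data actually gets copied into order
print(z.is_contiguous())        # True
print(torch.equal(y, z))        # True: same values, different memory layout
```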
A surprisingly high percentage of nodes in any given computation graph are just these cheap view operations.
Deriving Attention
These aren't all the operations you'll ever need, but they are most of them. To show you how powerful these operations are, let's build attention out of them.
I'll add names to the tensors using $ to help you see the flow.
Step 1
We are starting with our input tensor, $Input{b s d}, with shape b for batch size, s for number of tokens, and d for token dimension size, and $W_k{d e}, which is our K weight matrix with shape d for token dimension size and e for hidden size.
$Input{b s d}, $W_k{d e} -multiply> $K_p{b s d e} -sum> $K{b s e}
In this step, we can see the first operation is an amplect op followed by a reduce, which together perform a matrix multiplication to get $K{b s e}.
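To convince yourself that an amplect (broadcast multiply) followed by a reduce (sum) really is a matrix multiplication, here is a small NumPy check with made-up sizes (my own sketch):

```python
import numpy as np

b, s, d, e = 2, 3, 4, 5                      # made-up sizes
Input = np.random.rand(b, s, d)              # $Input{b s d}
W_k = np.random.rand(d, e)                   # $W_k{d e}

# Amplect: line up the d axes and multiply elementwise -> $K_p{b s d e}
K_p = Input[:, :, :, None] * W_k[None, None, :, :]

# Reduce: sum out the d axis -> $K{b s e}
K = K_p.sum(axis=2)

assert np.allclose(K, Input @ W_k)           # identical to a matrix multiply
```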
This is followed by:
$K{b s e} -> $K_T{b e s}
a permute to get $K_T{b e s}, where the e and s dimensions have swapped.
We follow a similar procedure with our value weight matrix $W_v{d e} as input:
$Input{b s d}, $W_v{d e} -multiply> $V_p{b s d e} -sum> $V{b s e}
Unlike last time, we do not permute $V{b s e}.
And similarly, for $W_q{d e} we get $Q{b sq e}:
$Input{b s d}, $W_q{d e} -multiply> $Q_p{b s d e} -sum> $Q{b s e}
Step 2
I use sq and sk instead of just s to distinguish the query and key sequence dimensions in the einops notation, even though they're the same length.
First, we perform matrix multiplication between our Q and K^T matrices to get our pre-softmax attention matrix $A_p{b sq sk}:
$Q{b sq e}, $K_T{b e sk} -multiply> $A_pp{b sq e sk} -sum> $A_p{b sq sk}
Then we apply softmax across the sk dimension to compute attention weights:
$A_p{b sq sk}/{sk} -softmax> $A{b sq sk}
Then we perform a matrix multiplication between our final attention matrix and our values to get the output:
$A{b sq sk}, $V{b sk e} -multiply> $O_p{b sq sk e} -sum> $O{b sq e}
Step 3
Using our last input, $W_o{e d}, we perform matrix multiplication to project the output dimension back into the token space:
$O{b sq e}, $W_o{e d} -multiply> $F_p{b sq e d} -sum> $F{b sq d}
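Putting the whole derivation together, here is a runnable NumPy sketch with made-up sizes, where each einsum stands in for one amplect-plus-reduce pair:

```python
import numpy as np

def softmax(x, axis=-1):
    z = np.exp(x - x.max(axis=axis, keepdims=True))
    return z / z.sum(axis=axis, keepdims=True)

b, s, d, e = 2, 6, 8, 4                      # made-up sizes
Input = np.random.rand(b, s, d)
W_k, W_v, W_q = [np.random.rand(d, e) for _ in range(3)]
W_o = np.random.rand(e, d)

# Step 1: each einsum is an amplect (multiply) followed by a reduce (sum)
K = np.einsum('bsd,de->bse', Input, W_k)     # $K{b s e}
V = np.einsum('bsd,de->bse', Input, W_v)     # $V{b s e}
Q = np.einsum('bsd,de->bse', Input, W_q)     # $Q{b s e}
K_T = K.transpose(0, 2, 1)                   # permute: $K_T{b e s}

# Step 2: attention scores, softmax over sk, then weight the values
A = softmax(np.einsum('bqe,bek->bqk', Q, K_T), axis=-1)   # $A{b sq sk}
O = np.einsum('bqk,bke->bqe', A, V)                       # $O{b sq e}

# Step 3: project back into token space
F = np.einsum('bqe,ed->bqd', O, W_o)                      # $F{b sq d}
print(F.shape)                               # (2, 6, 8) -> b s d
```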
Conclusion
I hope this was a helpful introduction to the taxonomy of tensor operations, and maybe even sparked some joy for how straightforward tensor operations are.
Next time you see a reshape followed by a matrix multiplication, you'll know the reshape is almost free, only changing indexing instructions, while the matrix multiplication is where your GPU actually does the heavy lifting.
This taxonomy isn't just academic. It helps you spot optimization opportunities, understand why some operations bottleneck while others barely register in your profiler, and see how complex algorithms like attention are really just compositions of these fundamental primitives.
The next time you're debugging performance or designing a new model architecture, you'll be able to think in terms of cheap view operations versus expensive data operations. That mental model will sharpen both your intuition and the efficiency of your tensor programs.
| Operation Type | Name | Einops Notation Example | Description |
|---|---|---|---|
| View | Permute | b h w c -> b c h w | Reorders dimensions |
| View | Merge | b h w c -> b (h w) c | Changes shape via merging dimensions |
| View | Split | b (h w) c -> b h w c | Changes shape via splitting dimensions |
| View | Window | b h w c -> b h[10:-10] w[10:-10] c | Selects a subregion of the tensor |
| View | Expand | b h[10:-10] w[10:-10] c -> b h w c | Adds padding to dimensions |
| Data | Map | b h w c -> b h w c | Applies function element-wise |
| Data | Reduce | b h w c -> b w c | Aggregates along axes (e.g. sum, mean) |
| Data | Transform | b h w c / h w c -> b h w c | Applies function to subtensors along axes (e.g. softmax) |
| Data | Amplect | b h w c, c -> b h w c | Element-wise function over multiple tensors (e.g. broadcast add/multiply) |