A Taxonomy of Tensor Operations

Published June 12, 2025

Tensor operations are the building blocks of deep learning frameworks, but not all ops are created equal. In this post, I’ll give you a simple framework for understanding them, and show why reshape is nearly free, while matrix multiplication burns GPU cycles.

View vs. Data Operations

The two main categories for tensor operations are view operations, which only change how a tensor is indexed, without moving data, and data operations, which change the underlying values. View operations are typically very fast since they just change indexing instructions, while data operations do the actual computational work.

View Operations

View operations take a tensor and change how the tensor is indexed. These are not very mathematically interesting, but can make tensor programs more conceptually simple or computationally efficient.

Permute

The simplest example of a view operation is permute.

For example:

a b c -> b a c

If you try to access the element at [i, j, k] in the output tensor, it is equivalent to accessing the element at [j, i, k] in the original tensor.

For a concrete example:

A tensor representing a batch of RGB images might have shape [batch, height, width, channels]. Permuting to [batch, channels, height, width] rearranges the dimensions without copying any pixel data.
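
As a minimal sketch, here is what that looks like in NumPy (the array contents are made up for illustration). The point is that the permute produces a view that shares memory with the original:

```python
import numpy as np

# A hypothetical batch of 2 RGB images, 4x5 pixels, channels-last.
images = np.zeros((2, 4, 5, 3))

# Permute [batch, height, width, channels] -> [batch, channels, height, width].
permuted = np.transpose(images, (0, 3, 1, 2))

print(permuted.shape)                              # (2, 3, 4, 5)
print(np.shares_memory(images, permuted))          # True: no pixel data was copied
print(images[1, 2, 3, 0] == permuted[1, 0, 2, 3])  # True: [i, j, k, l] <-> [i, l, j, k]
```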

Reshape

Before we get started with reshape, let's first review a few key concepts.

Raveling and Unraveling

Think of flattening a 2D spreadsheet into a single column. You go row by row, placing each cell sequentially. This is essentially what happens when any multi-dimensional tensor gets stored in your computer's linear memory. It needs to be "flattened" into a 1D representation.

Similar to my last post, I am using 1-based positions for dimensions (so -1 is the last dimension) because I like the symmetry of counting from either the beginning or the end. However, the actual values of the index vector and linear_index are 0-based, for reasons that will become apparent shortly.

For a multi-dimensional index into a tensor with a given shape, the canonical mapping is:

linear_index = index[-1] + index[-2]*shape[-1] + index[-3]*shape[-1]*shape[-2] + ....

For example, we can think of a grayscale image as a 2D tensor of shape [h, w]. You can imagine this linear index going row by row down the image.

Where i indexes the height (h) and j indexes the width (w):

linear_index = j + i*w = i*w + j

So each i jumps a row, and each j selects within the row.

This is called raveling.
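
As a quick sanity check, here is a small NumPy sketch of the formula, compared against np.ravel_multi_index (the shape and index values are arbitrary):

```python
import numpy as np

shape = (4, 5)   # a 4x5 grayscale image: [h, w]
index = (3, 2)   # row i = 3, column j = 2

# Manual raveling: linear_index = i*w + j
i, j = index
h, w = shape
linear_index = i * w + j

print(linear_index)                        # 17
print(np.ravel_multi_index(index, shape))  # 17, same result
```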

If we want to get the index back from the linear_index, assuming we have the original shape, it is a bit trickier. If you're curious, it’s worth working through this yourself first.

Let’s clarify with our grayscale image example, where:

linear_index = i*w + j

If we wanted to get j back from this, recall that the modulo operator % gives the remainder after division. Since j is in the range [0, w):

j = linear_index % w

Now we just need i, which is:

i = (linear_index - j) / w = (linear_index - linear_index % w) / w = linear_index // w

Where // is floor division, i.e., division rounded down to the nearest integer.

For higher-dimensional indices, the pattern is as follows:

index[-1] = linear_index % shape[-1]

This ensures that anything multiplied by shape[-1] in our linear_index goes to 0.

So:

  • linear_index % shape[-1]
  • = (index[-1] + index[-2]*shape[-1] + index[-3]*shape[-1]*shape[-2] + ...) % shape[-1]
  • = index[-1] % shape[-1] + index[-2]*shape[-1] % shape[-1] + index[-3]*shape[-1]*shape[-2] % shape[-1] + ...
  • = index[-1] % shape[-1] + 0 + 0 + ...
  • = index[-1]

Then for the next one, we remove the index[-1] term using the floor division // from earlier:

  • linear_index // shape[-1]
  • = (index[-1] + index[-2]*shape[-1] + index[-3]*shape[-1]*shape[-2] + ...) // shape[-1]
  • = index[-1] // shape[-1] + index[-2]*shape[-1] // shape[-1] + index[-3]*shape[-1]*shape[-2] // shape[-1] + ...
  • = 0 + index[-2] + index[-3]*shape[-2] + ...
  • = index[-2] + index[-3]*shape[-2] + ...

But since we have more dimensions, we need to take the modulo again for the next dimension:

  • index[-2] = (linear_index // shape[-1]) % shape[-2]
  • index[-3] = (linear_index // (shape[-1]*shape[-2])) % shape[-3]
  • index[-4] = (linear_index // (shape[-1]*shape[-2]*shape[-3])) % shape[-4]

Returning to our grayscale image example:

If we have linear_index = 15 for a 4x5 image, we get j = 15 % 5 = 0 and i = 15 // 5 = 3, so we're at position [3, 0], the first pixel of the fourth row.

This clean math is only possible for 0-based indexing.

This is called unraveling.
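
And the reverse direction, recovering the multi-dimensional index with the div/mod pattern above, checked against np.unravel_index (again with arbitrary values):

```python
import numpy as np

shape = (4, 5)   # [h, w]
linear_index = 15

# Manual unraveling.
h, w = shape
j = linear_index % w    # 0
i = linear_index // w   # 3

print((i, j))                                 # (3, 0): first pixel of the fourth row
print(np.unravel_index(linear_index, shape))  # matches the manual result
```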

To me, the naming feels a little counterintuitive, because linearizing a tensor feels like unraveling a string from a fabric, but whatever.

Merge and Split

Now that that is out of the way, there are 2 fundamental reshape operations: Merge and Split.

Merge

The simpler of the two, Merge, takes adjacent dimensions of a tensor and converts their multi-dimensional indexing into a single linear index.

For example:

b h w c -> b (h w) c

Indexing into the output tensor [i, j, k] is equivalent to indexing into the original tensor at the unraveled index [i, j // w, j % w, k].

Merge doesn't require additional information besides the adjacent dimensions to be merged.
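
A small sketch of merge using NumPy's reshape, showing the index equivalence described above (the sizes are made up for the example):

```python
import numpy as np

b, h, w, c = 2, 4, 5, 3
x = np.arange(b * h * w * c).reshape(b, h, w, c)

# Merge: b h w c -> b (h w) c
merged = x.reshape(b, h * w, c)

# [i, j, k] in the output == [i, j // w, j % w, k] in the original.
i, j, k = 1, 13, 2
print(merged[i, j, k] == x[i, j // w, j % w, k])  # True

# When the input is contiguous, reshape is a view: no data is copied.
print(np.shares_memory(x, merged))                # True
```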

Split

Split is a bit more challenging because you must know the sizes of all but one of the resulting dimensions.

b (h w) c -> b h w c

In this case, you must either know the size of h or the size of w.

Most tensor frameworks require you to specify target dimensions explicitly when splitting, because the target shape isn't fully inferable from the input.

Indexing into the output tensor [i, j, k, l] is equivalent to indexing into the original tensor at the raveled index [i, j*w + k, l].
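
And the reverse split, where the target sizes must be supplied explicitly (again with made-up sizes):

```python
import numpy as np

b, h, w, c = 2, 4, 5, 3
x = np.arange(b * h * w * c).reshape(b, h * w, c)  # b (h w) c

# Split: b (h w) c -> b h w c. We must supply h (or w); it cannot be inferred.
split = x.reshape(b, h, w, c)

# [i, j, k, l] in the output == [i, j*w + k, l] in the original.
i, j, k, l = 1, 2, 3, 1
print(split[i, j, k, l] == x[i, j * w + k, l])     # True
```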

Notice how easy it is to add singleton dimensions by doing:

(1 1 1 1 1 1 1 1 1 b) h c -> 1 1 1 1 1 1 1 1 1 b h c

You can see that the linear indices remain unaffected:

  • linear_index = index[-1] + index[-2]*shape[-1] + index[-3]*shape[-1]*shape[-2] + index[-4]*shape[-1]*shape[-2]*shape[-3] + ...
  • linear_index = k + j*c + i*c*h + 0*c*h*b + ...
  • linear_index = k + j*c + i*c*h

Window and Expand

Window

Window takes a tensor and outputs a sub-tensor.

For example:

b h w c -> b h[10:-10] w[10:-10] c

In this example, we are "cutting out" a subtensor where the h and w dimensions start 10 indices later and end 10 indices earlier, shortening the size of both the height and width dimensions by 20.

Expand

Expand takes a tensor and outputs a tensor with extra padding or edge extension.

For example:

b h[10:-10] w[10:-10] c -> b h w c

This adds padding so the h and w dimensions start 10 indices earlier and end 10 indices later, lengthening the size of both the height and width dimensions by 20.

Now this is a lot trickier because you need extra information on how to fill in the new indices.

The most common padding modes are:

  • Zero padding: Fill new positions with zeros (most common)
  • Edge padding: Repeat the nearest edge value
  • Wrap padding: Tile the tensor by repeating from the opposite edge
  • Reflection padding: Mirror the tensor at the boundaries

Getting the padding mode wrong is a common source of subtle bugs, especially in image processing where edge artifacts can propagate through your model.
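
Here is a sketch of window (plain slicing) and expand (np.pad) with a couple of the modes listed above; the shapes and pad widths are just illustrative:

```python
import numpy as np

x = np.arange(2 * 40 * 50 * 3).reshape(2, 40, 50, 3).astype(np.float32)

# Window: b h w c -> b h[10:-10] w[10:-10] c (a view, no copy).
window = x[:, 10:-10, 10:-10, :]
print(window.shape)       # (2, 20, 30, 3)

# Expand: pad h and w back out by 10 on each side (this does copy data).
pad_width = ((0, 0), (10, 10), (10, 10), (0, 0))
zero_padded = np.pad(window, pad_width, mode="constant")   # zero padding
edge_padded = np.pad(window, pad_width, mode="edge")       # edge padding
reflect_padded = np.pad(window, pad_width, mode="reflect")  # reflection padding

print(zero_padded.shape)  # (2, 40, 50, 3)
```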

Data Operations

Data operations are a different beast entirely; the actual values of the tensor are modified. This is usually where most of the computational work is being done.

Map

The map operation takes a tensor and performs an operation on each element independently. In machine learning frameworks, think of it as applying element-wise functions like ReLU, tanh or sigmoid to a tensor.

It is a function that takes in a tensor of a given shape and outputs a tensor of the same shape.

For example:

b h w c -func> b h w c

or

func(Tensor<shape>) -> Tensor<shape>
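
For instance, an elementwise ReLU in NumPy (the shape is arbitrary):

```python
import numpy as np

x = np.random.randn(2, 4, 5, 3)  # b h w c

# Map: apply ReLU to every element independently; the shape is unchanged.
relu = np.maximum(x, 0.0)

print(relu.shape == x.shape)     # True
```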

Reduce

The reduce operation takes a tensor and combines all the subtensors over specified axes using a (usually associative) function. In machine learning frameworks, you can think of this as a sum or product function.

Essentially, it takes a list of tensors of the same shape and outputs a single tensor of that shape.

For example, reducing over the h dimension:

b h w c -func> b w c

or

func(List<Tensor<shape>>) -> Tensor<shape>
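
For example, summing out the h dimension in NumPy:

```python
import numpy as np

x = np.random.randn(2, 4, 5, 3)  # b h w c

# Reduce: b h w c -sum> b w c, combining the h=4 slices of shape (b, w, c).
reduced = x.sum(axis=1)

print(reduced.shape)             # (2, 5, 3)
print(np.allclose(reduced, sum(x[:, i] for i in range(x.shape[1]))))  # True
```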

Transform

The transform operation takes a tensor and performs an operation independently on all the subtensors over specified axes. In machine learning frameworks, you can think of this as applying softmax to the last dimension of a tensor, or applying a filter to each image in a batch independently.

It is a function that takes a list of tensors of a given shape and outputs a list of tensors of the same shape.

For example:

b h w c / h w c -func> b h w c

or

func(List<Tensor<shape>>) -> List<Tensor<shape>>

Transform is the version of map generalized to arbitrary tensors. So you can reproduce map via b h w c / b h w c -func> b h w c.
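
A sketch of a softmax transform over the last dimension, written in plain NumPy (no framework softmax assumed):

```python
import numpy as np

x = np.random.randn(2, 4, 5, 3)  # b h w c

# Transform: b h w c / c -softmax> b h w c
# Apply softmax independently to each length-c subtensor along the last axis.
shifted = x - x.max(axis=-1, keepdims=True)      # shift for numerical stability
exp = np.exp(shifted)
softmaxed = exp / exp.sum(axis=-1, keepdims=True)

print(softmaxed.shape)                           # (2, 4, 5, 3)
print(np.allclose(softmaxed.sum(axis=-1), 1.0))  # each subtensor sums to 1
```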

Amplect

The Amplect operation takes multiple tensors and applies a scalar function (usually associative) elementwise over the corresponding dimensions. In machine learning frameworks, this is equivalent to broadcast ops like multiplication or addition.

Amplect (from Latin amplector, ‘to embrace’) is a non-standard name I’m using here to differentiate it from broadcast-<name>.

For example:

b h w c, c -func> b h w c

Or

associative_func(Tensor<[]>, Tensor<[]>) -> Tensor<[]>
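
In NumPy this is just broadcasting; a sketch with a per-channel scale (the shapes are made up):

```python
import numpy as np

x = np.random.randn(2, 4, 5, 3)    # b h w c
scale = np.array([0.5, 1.0, 2.0])  # c

# Amplect: b h w c, c -multiply> b h w c
# The length-c vector is broadcast across the b, h and w dimensions.
y = x * scale

print(y.shape)                                  # (2, 4, 5, 3)
print(np.allclose(y[..., 2], x[..., 2] * 2.0))  # True
```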

Difference Between View and Data Ops

As you can see, view ops can be chained one after another while just keeping track of how to index into the underlying data.

In the Candle framework, after performing view operations you sometimes need to call the contiguous function. Because reindexing doesn't change the memory layout, it is really cheap, and the actual data movement is deferred until contiguous is called. You typically need to call contiguous() when you're about to perform data operations that require the tensor to be laid out sequentially in memory, like matrix multiplications or passing data to external libraries that expect contiguous arrays.
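
The same pattern is easy to see in PyTorch (shown here as a rough analogue rather than Candle itself): a permute only rewrites strides, and contiguous() is what actually copies the data into a sequential layout.

```python
import torch

x = torch.randn(2, 4, 5, 3)  # b h w c

# The permute only changes the indexing metadata; no data moves yet.
y = x.permute(0, 3, 1, 2)    # b c h w
print(y.is_contiguous())     # False

# contiguous() materializes the new layout by copying the data.
z = y.contiguous()
print(z.is_contiguous())     # True

# view() is one example of an op that needs a contiguous layout.
flat = z.view(2, -1)
```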

A surprisingly high percentage of nodes in any given graph are just these cheap view operations.

Deriving Attention

These aren't all the operations you need, but it is most of them. To show you how powerful these operations are, let's create attention with them. I'll add names to the tensors using $ to help see the flow.

Step 1

We are starting with our input tensor, $Input{b s d}, with shape b for batch size, s for number of tokens, and d for token dimension size, and $W_k{d e} which is our K weight matrix with shape d for token dimension size and e for hidden size.

$Input{b s d}, $W_k{d e} -multiply> $K_p{b s d e} -sum> $K{b s e}

In this step, we can see the first operation is an amplect op followed by a reduce, which together perform a matrix multiplication to get $K{b s e}.

This is followed by:

$K{b s e} -> $K_T{b e s}

a permute to get $K_T{b e s} where the e and s dimensions have swapped.

We follow a similar procedure for our value weight matrix $W_v{d e} as input:

$Input{b s d}, $W_v{d e} -multiply> $V_p{b s d e} -sum> $V{b s e}

And unlike last time, we do not permute $V{b s e}.

And similarly for $W_q{d e} we get $Q{b s e} (referred to as $Q{b sq e} in the next step):

$Input{b s d}, $W_q{d e} -multiply> $Q_p{b s d e} -sum> $Q{b s e}

Step 2

I use sq and sk instead of just s to distinguish the query and key sequence dimensions in the einops notation, even though they're the same length.

First, we perform matrix multiplication between our Q and K^T matrices to get our attention matrix $A_p{b sq sk}:

$Q{b sq e}, $K_T{b e sk} -multiply> $A_pp{b sq e sk} -sum> $A_p{b sq sk}

Then we apply softmax across the sk dimension to compute attention weights:

$A_p{b sq sk}/{sk} -softmax> $A{b sq sk}

Then we perform a matrix multiplication between our final attention matrix and our values to get the output:

$A{b sq sk}, $V{b sk e} -multiply> $O_p{b sq sk e} -sum> $O{b sq e}

Step 3

Using our last input $W_o{e d}, we perform matrix multiplication to project the output dimension back into the token space:

$O{b sq e}, $W_o{e d} -multiply> $F_p{b sq e d} -sum> $F{b sq d}
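
To make the derivation concrete, here is a sketch of these exact steps in NumPy, where each einsum is an amplect (multiply) followed by a reduce (sum). All shapes and weights are made up, and no masking or 1/sqrt(d_k) scaling is included, to stay faithful to the steps above:

```python
import numpy as np

def attention(inp, w_k, w_v, w_q, w_o):
    # Step 1: K, V, Q projections (amplect + reduce = matrix multiplication).
    K = np.einsum("bsd,de->bse", inp, w_k)   # $K{b s e}
    V = np.einsum("bsd,de->bse", inp, w_v)   # $V{b s e}
    Q = np.einsum("bsd,de->bse", inp, w_q)   # $Q{b s e}
    K_T = K.transpose(0, 2, 1)               # permute: $K_T{b e s}

    # Step 2: attention scores, softmax over sk, then weight the values.
    A_p = np.einsum("bqe,bek->bqk", Q, K_T)  # $A_p{b sq sk}
    A_p = A_p - A_p.max(axis=-1, keepdims=True)          # stability shift
    A = np.exp(A_p) / np.exp(A_p).sum(axis=-1, keepdims=True)  # softmax transform
    O = np.einsum("bqk,bke->bqe", A, V)      # $O{b sq e}

    # Step 3: project back into token space.
    return np.einsum("bqe,ed->bqd", O, w_o)  # $F{b sq d}

b, s, d, e = 2, 6, 8, 4
rng = np.random.default_rng(0)
out = attention(rng.normal(size=(b, s, d)),
                rng.normal(size=(d, e)), rng.normal(size=(d, e)),
                rng.normal(size=(d, e)), rng.normal(size=(e, d)))
print(out.shape)  # (2, 6, 8)
```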

Conclusion

I hope this was a helpful introduction to the taxonomy of tensor operations, and maybe even sparked some joy for how straightforward tensor operations are.

Next time you see a reshape followed by a matrix multiplication, you'll know the reshape is almost free, only changing indexing instructions, while the matrix multiplication is where your GPU actually does the heavy lifting.

This taxonomy isn't just academic. It helps you spot optimization opportunities, understand why some operations bottleneck while others barely register in your profiler, and see how complex algorithms like attention are really just compositions of these fundamental primitives.

The next time you're debugging performance or designing a new model architecture, you'll be able to think in terms of cheap view operations versus expensive data operations. That mental model will sharpen your intuition and tensor programming efficiency.

| Operation Type | Name | Einops Notation Example | Description |
|---|---|---|---|
| View | Permute | b h w c -> b c h w | Reorders dimensions |
| View | Merge | b h w c -> b (h w) c | Changes shape via merging dimensions |
| View | Split | b (h w) c -> b h w c | Changes shape via splitting dimensions |
| View | Window | b h w c -> b h[10:-10] w[10:-10] c | Selects a subregion of the tensor |
| View | Expand | b h[10:-10] w[10:-10] c -> b h w c | Adds padding to dimensions |
| Data | Map | b h w c -> b h w c | Applies a function element-wise |
| Data | Reduce | b h w c -> b w c | Aggregates along axes (e.g. sum, mean) |
| Data | Transform | b h w c / h w c -> b h w c | Applies a function to subtensors along axes (e.g. softmax) |
| Data | Amplect | b h w c, c -> b h w c | Element-wise function over multiple tensors (e.g. broadcast add/multiply) |
