Building Tensors from Scratch in Rust (Part 1): Core Structure and Indexing

Published June 12, 2025

In this series, we’ll build a tensor library from scratch in Rust, similar to PyTorch or NumPy, focusing on understanding the fundamental concepts behind tensor operations. In this first part, we'll create the basic tensor structure and implement indexing operations.

Tensor Struct

A tensor might seem like a simple multi-dimensional array, but the design choices we make here will ripple through every operation we implement. The question is: how do we store multi-dimensional data efficiently while keeping operations like slicing, reshaping, and device transfers fast?

To make our implementation cleaner down the road, we’ll split our tensor struct into two parts: a shape component and a storage component.

We do this for a couple of reasons:

First, we want to be flexible about where data is stored, whether in RAM, VRAM, or on a network drive somewhere.

Second, there are two different categories of operations:

  • view operations, which reinterpret the layout of the data without copying it (like slicing or reshaping)
  • data operations, which actually operate on the underlying values (like addition or multiplication)

This means view operations don’t need access to the actual data, saving us from unnecessary data retrieval.

So we want to split the tensor into 2 parts:

  • TensorShape, which stores the shape.
  • TensorStorage<T>, which stores the raw values of type T

Let's combine them together in the Tensor<T> type.

#[derive(Debug, Clone, PartialEq)]
struct TensorShape {
    shape: Vec<usize>,
}

impl TensorShape {
    fn size(&self) -> usize {
        self.shape.iter().product()
    }
}

#[derive(Debug, Clone, PartialEq)]
struct TensorStorage<T> {
    data: Vec<T>,
}

#[derive(Debug, Clone, PartialEq)]
struct Tensor<T> {
    shape: TensorShape,
    storage: TensorStorage<T>,
}

You might wonder why we choose to store our data in a Vec this way, instead of in a list of lists or a different data structure:

  • It is simpler to transfer the data between devices and save the data to disk
  • We can guarantee that the raw data is contiguous for better performance

Since we're flattening multi-dimensional data into a single buffer, we need to pick an ordering convention. Row-major (aka "C-style") is the most popular: moving one step in the last dimension corresponds to one step forward in the data buffer.

So for a tensor with shape [2, 2, 2]:

0 -> [0, 0, 0]
1 -> [0, 0, 1]
2 -> [0, 1, 0]
...
6 -> [1, 1, 0]
7 -> [1, 1, 1]

In contrast, column-major order (used in Fortran and MATLAB) stores elements by progressing along the first dimension first:

0 -> [0, 0, 0]
1 -> [1, 0, 0]
2 -> [0, 1, 0]
...
6 -> [0, 1, 1]
7 -> [1, 1, 1]
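If you want to see these mappings for yourself, here's a small standalone snippet (not part of our library code) that prints the row-major ordering; nesting the loops in the opposite order produces the column-major one:

fn main() {
    // Row-major: the innermost loop walks the last dimension.
    let mut linear = 0;
    for i in 0..2 {
        for j in 0..2 {
            for k in 0..2 {
                println!("{linear} -> [{i}, {j}, {k}]");
                linear += 1;
            }
        }
    }
}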

In Production: Candle

Before moving on, I think it'd be helpful to quickly take a look at how Candle, an existing tensor framework written in Rust, handles its tensors.

First let's look at the Tensor_ struct:

pub struct Tensor_ {
  storage: Arc<RwLock<Storage>>,
  layout: Layout,
  // Other fields omitted for brevity
}

You can see that there are separate structs for the shape and the storage, similar to how we structured our own Tensor type.

The Layout struct holds the fields that describe how to index a given tensor:

pub struct Layout {
    shape: Shape,
    // Other fields omitted for brevity
}

It includes a Shape struct that is essentially the same as our TensorShape above:

pub struct Shape(Vec<usize>);

The Storage enum looks like:

pub enum Storage {
    Cpu(CpuStorage),
    Cuda(CudaStorage),
    Metal(MetalStorage),
}

It has a variant for data stored on each supported device. If we look at the CpuStorage enum:

pub enum CpuStorage {
    U8(Vec<u8>),
    U32(Vec<u32>),
    I64(Vec<i64>),
    BF16(Vec<bf16>),
    F16(Vec<f16>),
    F32(Vec<f32>),
    F64(Vec<f64>),
}

We can see that it is conceptually similar to our TensorStorage.

Tensor Initialization

First, we want to be able to create tensors. The simplest tensor you can create is the zero tensor.

In most frameworks there's a function called zeros. Let’s create an associated function on Tensor: Tensor::zeros.

This should only work when the inner type of the tensor is something that can be initialized to zero. We could define a trait like Zeroable that describes how to produce a zero value for each supported type:

trait Zeroable {
    fn zero() -> Self;
}

impl Zeroable for f32 {
    fn zero() -> f32 {
        0.0
    }
}

Or we can pull in the num-traits crate, which already includes a Zero trait and many other useful numeric traits.

This simplifies things significantly: it saves us from manually implementing zero-initialization for every supported type.

use num_traits::Zero;

impl<T: Zero + Clone> Tensor<T> {
    fn zeros(shape: Vec<usize>) -> Self {
        let shape = TensorShape { shape };
        let storage = TensorStorage::<T>::zeros(shape.size());
        Tensor { shape, storage }
    }
}

impl<T: Zero + Clone> TensorStorage<T> {
    fn zeros(size: usize) -> Self {
        TensorStorage {
            data: vec![T::zero(); size],
        }
    }
}
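As a quick sanity check, here's a small test we might write (assuming it lives in the same module as the structs, so it can access the private fields):

#[test]
fn zeros_creates_the_right_number_of_elements() {
    let t: Tensor<f32> = Tensor::zeros(vec![2, 3]);
    // 2 * 3 = 6 elements, all initialized to zero.
    assert_eq!(t.shape.size(), 6);
    assert_eq!(t.storage.data, vec![0.0; 6]);
}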

Indexing

The first thing we need is the ability to index into our tensor so we can view or change individual elements.

To index into storage, we are going to implement raveling and unraveling.

Raveling

Think of flattening a 2D spreadsheet into a single column. You iterate row by row, placing each element sequentially into a flat buffer. This is essentially what happens when any multi-dimensional tensor gets stored in your computer's linear memory. It needs to be "flattened" into a 1D representation.

For a multi-dimensional index into a tensor with a given shape, the canonical mapping is:

linear_index = index[-1] + index[-2]*shape[-1] + index[-3]*shape[-1]*shape[-2] + ....

You can also think of this as taking the dot product between the vector of indices and a vector of "strides", where:

strides = [..., shape[-2]*shape[-1], shape[-1], 1]

These strides can be precomputed once when a shape is set to save some time later.

For a tensor with shape [2, 3, 4], the strides would be [12, 4, 1]:

  • Moving one step in dimension 0 jumps 12 positions (one whole 3x4 matrix)
  • Moving one step in dimension 1 jumps 4 positions (one row of 4 elements)
  • Moving one step in dimension 2 jumps 1 position (one element)
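As a quick check, dotting the index [1, 2, 3] with these strides gives:

linear_index = 1*12 + 2*4 + 3*1 = 23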

For example, we can think of a grayscale image as a 2D tensor of shape [h, w]. You can imagine this linear index going row by row down the image.

Where i indexes height (h) and j indexes width (w):

linear_index = j + i*w = i*w + j

So each i jumps a row, and each j selects within the row.

In impl TensorShape we can add the ravel_index function:

fn ravel_index(&self, indices: &[usize]) -> usize {
    if indices.len() != self.shape.len() {
        panic!("Indices length does not match tensor shape dimensions.");
    }
    
    indices.iter().zip(self.shape.iter())
        .rev()
        .scan(1, |stride, (&idx, &dim_size)| {
            let result = idx * *stride;
            *stride *= dim_size;
            Some(result)
        })
        .sum()
}

scan is Rust's equivalent of a "stateful map": it works like map, but carries an accumulator (here stride) with an initial state (here 1).
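Here's a small test of ravel_index against the [2, 3, 4] example from earlier (again assuming it sits in the same module as TensorShape):

#[test]
fn ravel_index_matches_the_stride_formula() {
    let shape = TensorShape { shape: vec![2, 3, 4] };
    // Strides are [12, 4, 1], so [1, 2, 3] -> 1*12 + 2*4 + 3*1 = 23.
    assert_eq!(shape.ravel_index(&[0, 0, 0]), 0);
    assert_eq!(shape.ravel_index(&[1, 2, 3]), 23);
}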

Unraveling

If we want to get the index back from the linear_index, assuming we have the original shape, it is a bit trickier. If you're curious, it's worth working through this yourself first.

Let's clarify with our grayscale image example, where:

linear_index = i*w + j

If we want to get j back from this, recall that the modulo operator (%) gives the remainder of a division by w. Since j is in [0, w):

j = linear_index % w

Now we just need i, which is:

i = (linear_index - j) / w = (linear_index - linear_index % w) / w = linear_index // w

Where // represents floor division: division rounded down to the nearest integer (which is what / already does for Rust's unsigned integer types).

For higher-dimensional indices, the pattern is as follows:

index[-1] = linear_index % shape[-1]

This works because every term in linear_index that is multiplied by shape[-1] vanishes under the modulo.

So:

  • = linear_index % shape[-1]
  • = (index[-1] + index[-2]*shape[-1] + index[-3]*shape[-1]*shape[-2] + ...) % shape[-1]
  • = index[-1] % shape[-1] + index[-2]*shape[-1] % shape[-1] + index[-3]*shape[-1]*shape[-2] % shape[-1] + ...
  • = index[-1] % shape[-1] + 0 + 0 + ...
  • = index[-1]

Then for the next index, we first strip off the last dimension using the floor division // from earlier:

  • linear_index // shape[-1]
  • = (index[-1] + index[-2]*shape[-1] + index[-3]*shape[-1]*shape[-2] + ...) // shape[-1]
  • = index[-1] // shape[-1] + index[-2]*shape[-1] // shape[-1] + index[-3]*shape[-1]*shape[-2] // shape[-1] + ...
  • = 0 + index[-2] + index[-3]*shape[-2] + ...
  • = index[-2] + index[-3]*shape[-2] + ...

But since we have more dimensions, we need to take the modulo again for the next dimension:

  • index[-2] = (linear_index // shape[-1]) % shape[-2]
  • index[-3] = (linear_index // (shape[-1]*shape[-2])) % shape[-3]
  • index[-4] = (linear_index // (shape[-1]*shape[-2]*shape[-3])) % shape[-4]

Let’s return to our grayscale image example.

If we have linear_index = 15 for a 4x5 image, we get j = 15 % 5 = 0 and i = 15 // 5 = 3, so we're at position [3, 0], the first pixel of the fourth row.

In impl TensorShape we can add the unravel_index function:

fn unravel_index(&self, index: usize) -> Vec<usize> {
    if self.shape.is_empty() {
        return vec![];
    }

    let mut indices = vec![0; self.shape.len()];
    let mut remaining_index = index;

    for (i, &dim_size) in self.shape.iter().enumerate().rev() {
        indices[i] = remaining_index % dim_size;
        remaining_index /= dim_size;
    }

    indices
}
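And a matching test for unravel_index, including a round trip through both functions (same same-module assumption as before):

#[test]
fn unravel_index_inverts_ravel_index() {
    let shape = TensorShape { shape: vec![2, 3, 4] };
    assert_eq!(shape.unravel_index(23), vec![1, 2, 3]);
    // Unraveling then raveling should give back the original linear index.
    for linear in 0..shape.size() {
        assert_eq!(shape.ravel_index(&shape.unravel_index(linear)), linear);
    }
}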

Index Final Bits

For our TensorStorage, let's implement the Index and IndexMut traits that just linearly index our storage.

use std::ops::{Index, IndexMut};

impl<T> Index<usize> for TensorStorage<T> {
    type Output = T;

    fn index(&self, index: usize) -> &Self::Output {
        &self.data[index]
    }
}

impl<T> IndexMut<usize> for TensorStorage<T> {
    fn index_mut(&mut self, index: usize) -> &mut Self::Output {
        &mut self.data[index]
    }
}

For our Tensor, let's implement the Index and IndexMut traits that ravel a multidimensional index into a linear_index and get the value.

impl<T> Index<&[usize]> for Tensor<T> {
    type Output = T;

    fn index(&self, indices: &[usize]) -> &Self::Output {
        &self.storage[self.shape.ravel_index(indices)]
    }
}

impl<T> IndexMut<&[usize]> for Tensor<T> {
    fn index_mut(&mut self, indices: &[usize]) -> &mut Self::Output {
        &mut self.storage[self.shape.ravel_index(indices)]
    }
}
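Putting it all together, a test for Tensor indexing might look like this (note that the impls take a &[usize], so we index with a slice of indices):

#[test]
fn tensor_indexing_reads_and_writes_elements() {
    let mut t: Tensor<f32> = Tensor::zeros(vec![2, 3]);
    // The Index/IndexMut impls take &[usize], so we pass a slice.
    t[&[1usize, 2][..]] = 5.0;
    assert_eq!(t[&[1usize, 2][..]], 5.0);
    assert_eq!(t[&[0usize, 0][..]], 0.0);
}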

In Production: Candle

Now that we understand the fundamentals of raveling and unraveling, let's see how Candle optimizes indexing operations in practice. Our ravel_index function calculates the linear index from scratch each time.

Candle takes a different approach: it pre-computes the strides (the multipliers for each dimension). The Layout struct includes both shape and stride information:

pub struct Layout {
    shape: Shape,
    stride: Vec<usize>,  // Pre-computed strides
    start_offset: usize, // Pre-computed start_offset
}

Let's look at how Candle implements AvgPool2D.

Here is a summary:

let (b_sz, c, h, w) = layout.shape().dims4()?;
let stride = layout.stride();
let src_index = layout.start_offset();
for b_idx in 0..b_sz {
    let src_index = src_index + b_idx * stride[0]; // Add batch offset
    for c_idx in 0..c {
        let src_index = src_index + c_idx * stride[1]; // Add channel offset
        for m in 0..kernel_h {
            for n in 0..kernel_w {
                let final_index = src_index + m * stride[2] + n * stride[3];
            }
        }
    }
}

This is the raveling operation, just spread across the loop nest so the batch and channel offsets are computed once per outer-loop iteration instead of for every element.

Contrast this with our current approach:

let (b_sz, c, h, w) = layout.shape().dims4()?;
for b_idx in 0..b_sz {
    for c_idx in 0..c {
        for m in 0..kernel_h {
            for n in 0..kernel_w {
                let final_index = layout.start_offset() + layout.ravel_index(&[b_idx, c_idx, m, n]); 
            }
        }
    }
}

Candle's strided approach is more efficient: the outer-dimension offsets are computed once per loop iteration rather than redoing the full dot product for every element, and the resulting memory access pattern is predictable, which plays well with CPU caching and compiler optimizations.
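We could eventually do the same in our library. As a rough sketch (this strides helper is not part of the code we wrote above), precomputing row-major strides for our TensorShape might look like:

impl TensorShape {
    // Hypothetical helper: precompute row-major strides for this shape.
    fn strides(&self) -> Vec<usize> {
        let mut strides = vec![1; self.shape.len()];
        for i in (0..self.shape.len().saturating_sub(1)).rev() {
            strides[i] = strides[i + 1] * self.shape[i + 1];
        }
        strides
    }
}

Raveling an index then reduces to a dot product with this precomputed vector, which is exactly the quantity Candle carries around in its Layout.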

Code

To run the code from this blog post:

First, clone the repo:

git clone git@github.com:greenrazer/easytensor.git

Then check out the tagged commit for this part:

git checkout part-1

Then run cargo test:

cargo test

Finally, take a look at the code in src/.

Next Steps

We've built the foundation of our tensor library with core indexing operations that mirror how production frameworks like Candle work internally. In the next part, we'll explore how to make view operations efficient.

Try experimenting with the code, create some tensors, index into them, and see how the raveling/unraveling works with different shapes. Understanding these fundamentals will make the more advanced operations we'll cover much clearer.
