Building Tensors from Scratch in Rust (Part 1): Core Structure and Indexing
In this series, we’ll build a tensor library from scratch in Rust, similar to PyTorch or NumPy, focusing on understanding the fundamental concepts behind tensor operations. In this first part, we'll create the basic tensor structure and implement indexing operations.
Tensor Struct
A tensor might seem like a simple multi-dimensional array, but the design choices we make here will ripple through every operation we implement. The question is: how do we store multi-dimensional data efficiently while keeping operations like slicing, reshaping, and device transfers fast?
To make our implementation cleaner down the road, we’ll split our tensor struct into two parts: a shape component and a storage component.
We do this for a couple of reasons:
First, we want to be flexible about where data is stored, whether in RAM, VRAM, or on a network drive somewhere.
Second, there are two different categories of operations:
- View operations, which reinterpret or reshape the data without copying it (like slicing or reshaping)
- Data operations, which actually modify or transform the data (like addition or multiplication)
This means view operations don’t need access to the actual data, saving us from unnecessary data retrieval.
So we want to split the tensor into two parts:
- TensorShape, which stores the shape
- TensorStorage<T>, which stores the raw values of type T
Let's combine them together in the Tensor<T> type.
#[derive(Debug, Clone, PartialEq)]
struct TensorShape {
shape: Vec<usize>,
}
impl TensorShape {
fn size(&self) -> usize {
self.shape.iter().product()
}
}
#[derive(Debug, Clone, PartialEq)]
struct TensorStorage<T> {
data: Vec<T>,
}
#[derive(Debug, Clone, PartialEq)]
struct Tensor<T> {
shape: TensorShape,
storage: TensorStorage<T>,
}
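To make the split concrete, here is a small sanity check of the pieces we just defined. This is only a sketch, and it assumes the structs live in the same module so their fields are accessible:
// Build a [2, 3] tensor by hand: the shape describes the layout,
// the storage holds the 6 raw values in a flat Vec.
let t = Tensor {
    shape: TensorShape { shape: vec![2, 3] },
    storage: TensorStorage { data: vec![0.0_f32; 6] },
};
assert_eq!(t.shape.size(), 6);
assert_eq!(t.storage.data.len(), 6);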
You might wonder why we choose to store our data in a Vec this way, instead of in a list of lists or a different data structure:
- It is simpler to transfer the data between devices and save the data to disk
- We can guarantee that the raw data is contiguous for better performance
Row-major (aka "C-style") is the most popular ordering. In row-major order, moving one step in the last dimension corresponds to one step forward in the data buffer.
So for a tensor with shape [2,2,2]:
0 -> [0,0,0]
1 -> [0,0,1]
2 -> [0,1,0]
...
6 -> [1,1,0]
7 -> [1,1,1]
In contrast, column-major order (used in Fortran and MATLAB) stores elements by progressing along the first dimension first:
0 -> [0,0,0]
1 -> [1,0,0]
2 -> [0,1,0]
...
6 -> [0,1,1]
7 -> [1,1,1]
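To make the two orderings concrete, here is a small sketch using hypothetical helper functions with the strides for the [2,2,2] shape above hard-coded:
// Row-major strides for [2,2,2] are [4, 2, 1]; column-major strides are [1, 2, 4].
// These helpers exist only to illustrate the two orderings.
fn row_major(i: [usize; 3]) -> usize {
    i[0] * 4 + i[1] * 2 + i[2]
}
fn col_major(i: [usize; 3]) -> usize {
    i[0] + i[1] * 2 + i[2] * 4
}
assert_eq!(row_major([1, 1, 0]), 6);
assert_eq!(col_major([0, 1, 1]), 6);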
In Production: Candle
Before moving on, I think it'd be helpful to take a quick look at how Candle, an existing tensor framework written in Rust, handles its tensors.
First, let's look at the Tensor_ struct:
pub struct Tensor_ {
storage: Arc<RwLock<Storage>>,
layout: Layout,
// Other fields omitted for brevity
}
You can see that there are separate structs for the shape and storage, similar to how we structured our own Tensor type.
The Layout struct has fields describing how to index a given tensor:
pub struct Layout {
shape: Shape,
// Other fields omitted for brevity
}
This includes a Shape struct that is essentially the same as our TensorShape above.
pub struct Shape(Vec<usize>);
The Storage enum looks like:
pub enum Storage {
Cpu(CpuStorage),
Cuda(CudaStorage),
Metal(MetalStorage),
}
It has many different entries for data storage on different devices, but if we look at the CpuStorage enum:
pub enum CpuStorage {
U8(Vec<u8>),
U32(Vec<u32>),
I64(Vec<i64>),
BF16(Vec<bf16>),
F16(Vec<f16>),
F32(Vec<f32>),
F64(Vec<f64>),
}
We can see that it is conceptually similar to our TensorStorage.
Tensor Initialization
First, we want to be able to create tensors. The simplest tensor you can create is the zero tensor. In most frameworks there's a function called zeros for this, so let's create an associated function on Tensor: Tensor::zeros.
This should only work when the inner type of the tensor is something that can be initialized to zero. We could create a trait like Zeroable for all our types that describes how to produce a zero value, like:
trait Zeroable {
    fn zero() -> Self;
}

impl Zeroable for f32 {
    fn zero() -> f32 {
        0.0
    }
}
Or we can pull in the num-traits crate, which already includes a Zero trait and many other useful numeric traits. This saves us from having to manually implement zero-initialization for every supported type.
use num_traits::Zero;
impl<T: Zero + Clone> Tensor<T> {
fn zeros(shape: Vec<usize>) -> Self {
let shape = TensorShape { shape };
let storage = TensorStorage::<T>::zeros(shape.size());
Tensor { shape, storage }
}
}
impl<T: Zero + Clone> TensorStorage<T> {
fn zeros(size: usize) -> Self {
TensorStorage {
data: vec![T::zero(); size],
}
}
}
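A quick usage check, assuming the code above is in the same module so the fields are visible:
// A [2, 3] tensor has 6 elements, all initialized to zero.
let t: Tensor<f32> = Tensor::zeros(vec![2, 3]);
assert_eq!(t.shape.size(), 6);
assert!(t.storage.data.iter().all(|&x| x == 0.0));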
Indexing
The first thing we need is the ability to index into our storage to view or change individual elements.
To index into storage, we are going to implement raveling and unraveling.
Raveling
Think of flattening a 2D spreadsheet into a single column. You iterate row by row, placing each element sequentially into a flat buffer. This is essentially what happens when any multi-dimensional tensor gets stored in your computer's linear memory. It needs to be "flattened" into a 1D representation.
For a multi-dimensional index into a tensor with a given shape, the canonical way this mapping occurs is:
linear_index = index[-1] + index[-2]*shape[-1] + index[-3]*shape[-1]*shape[-2] + ....
You can also think of this as taking the dot product between the vector of indices and a vector of "strides", where:
strides = [..., shape[-2]*shape[-1], shape[-1], 1]
These strides can be precomputed once when a shape is set to save some time later.
For a tensor with shape [2, 3, 4], the strides would be [12, 4, 1]:
- Moving one step in dimension 0 jumps 12 positions (one whole 3x4 matrix)
- Moving one step in dimension 1 jumps 4 positions (one row)
- Moving one step in dimension 2 jumps 1 position (one element)
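As a sketch of how that precomputation could look, here is a standalone helper (it is not part of TensorShape yet; it only illustrates the stride formula above):
// Hypothetical helper: compute row-major strides from a shape.
// The last dimension always has stride 1; each earlier stride is the
// product of all later dimension sizes.
fn compute_strides(shape: &[usize]) -> Vec<usize> {
    let mut strides = vec![1; shape.len()];
    for i in (0..shape.len().saturating_sub(1)).rev() {
        strides[i] = strides[i + 1] * shape[i + 1];
    }
    strides
}
assert_eq!(compute_strides(&[2, 3, 4]), vec![12, 4, 1]);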
For example, we can think of a grayscale image as a 2D tensor of shape [h, w]. You can imagine this linear index going row by row down the image. Where i indexes height (h) and j indexes width (w):
linear_index = j + i*w = i*w + j
So each i jumps a row, and each j selects within the row.
In impl TensorShape we can add the ravel_index function:
fn ravel_index(&self, indices: &[usize]) -> usize {
    if indices.len() != self.shape.len() {
        panic!("Indices length does not match tensor shape dimensions.");
    }
    // Walk the dimensions from last to first, multiplying each index by the
    // running stride and accumulating the stride as we go.
    indices
        .iter()
        .zip(self.shape.iter())
        .rev()
        .scan(1, |stride, (&idx, &dim_size)| {
            let result = idx * *stride;
            *stride *= dim_size;
            Some(result)
        })
        .sum()
}
Scan is Rust's equivalent of a "stateful map": it's like map, but with an accumulator (in this case stride) and an initial state (in this case 1).
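A quick check of ravel_index against the [2, 3, 4] example from earlier (strides [12, 4, 1]); this assumes the TensorShape above is in scope:
let shape = TensorShape { shape: vec![2, 3, 4] };
assert_eq!(shape.ravel_index(&[0, 0, 0]), 0);
// [1, 2, 3] maps to 1*12 + 2*4 + 3 = 23
assert_eq!(shape.ravel_index(&[1, 2, 3]), 23);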
Unraveling
If we want to get the index back from the linear_index, assuming we have the original shape, it is a bit trickier. If you're curious, it's worth working through this yourself first.
Let's clarify with our grayscale image example, where:
linear_index = i*w + j
If we wanted to get j back from this, we should remember how we would get the remainder of a division by w using the modulo (%). Since j is between [0, w):
j = linear_index % w
Now we just need i. Since linear_index - j = i*w, we have:
i = (linear_index - j) / w = linear_index // w
Where // represents floor division, which is division rounded down to the nearest integer.
For higher-dimensional indices, the pattern is as follows:
index[-1] = linear_index % shape[-1]
This works because anything multiplied by shape[-1] in our linear_index goes to 0 under the modulo.
So:
= linear_index % shape[-1]
= (index[-1] + index[-2]*shape[-1] + index[-3]*shape[-1]*shape[-2] + ...) % shape[-1]
= index[-1] % shape[-1] + index[-2]*shape[-1] % shape[-1] + index[-3]*shape[-1]*shape[-2] % shape[-1] + ...
= index[-1] % shape[-1] + 0 + 0 + ...
= index[-1]
Then, for the next dimension, we divide away the index we just found using the floor division (//) from earlier:
linear_index // shape[-1]
= (index[-1] + index[-2]*shape[-1] + index[-3]*shape[-1]*shape[-2] + ...) // shape[-1]
= index[-1] // shape[-1] + index[-2]*shape[-1] // shape[-1] + index[-3]*shape[-1]*shape[-2] // shape[-1] + ...
= 0 + index[-2] + index[-3]*shape[-2] + ...
= index[-2] + index[-3]*shape[-2] + ...
But since we have more dimensions, we need to take the modulo again for the next dimension:
index[-2] = (linear_index // shape[-1]) % shape[-2]
index[-3] = (linear_index // (shape[-1]*shape[-2])) % shape[-3]
index[-4] = (linear_index // (shape[-1]*shape[-2]*shape[-3])) % shape[-4]
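To make the pattern concrete, take the shape [2, 3, 4] from earlier (strides [12, 4, 1]) and linear_index = 23:
index[-1] = 23 % 4 = 3
index[-2] = (23 // 4) % 3 = 5 % 3 = 2
index[-3] = (23 // (4*3)) % 2 = 1 % 2 = 1
which recovers the index [1, 2, 3], matching 1*12 + 2*4 + 3 = 23.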
Let’s return to our grayscale image example.
If we have linear_index = 15
for a 4x5
image, we get j = 15 % 5 = 0
and i = 15 // 5 = 3
, so we're at position [3, 0]
, the first pixel of the
fourth row.
In impl TensorShape we can add the unravel_index function:
fn unravel_index(&self, index: usize) -> Vec<usize> {
    if self.shape.is_empty() {
        return vec![];
    }
    let mut indices = vec![0; self.shape.len()];
    let mut remaining_index = index;
    // Peel off one dimension at a time, starting from the last:
    // modulo recovers the index, floor division removes its contribution.
    for (i, &dim_size) in self.shape.iter().enumerate().rev() {
        indices[i] = remaining_index % dim_size;
        remaining_index /= dim_size;
    }
    indices
}
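A quick sanity check, reusing the 4x5 image example and confirming that raveling and unraveling are inverses (assuming the TensorShape above is in scope):
let shape = TensorShape { shape: vec![4, 5] };
// linear_index 15 is the [3, 0] pixel from the example above.
assert_eq!(shape.unravel_index(15), vec![3, 0]);
// Unraveling and then raveling gets us back where we started.
assert_eq!(shape.ravel_index(&shape.unravel_index(17)), 17);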
Index Final Bits
For our TensorStorage, let's implement the Index and IndexMut traits, which just index linearly into our storage.
use std::ops::{Index, IndexMut};
impl<T> Index<usize> for TensorStorage<T> {
type Output = T;
fn index(&self, index: usize) -> &Self::Output {
&self.data[index]
}
}
impl<T> IndexMut<usize> for TensorStorage<T> {
fn index_mut(&mut self, index: usize) -> &mut Self::Output {
&mut self.data[index]
}
}
For our Tensor, let's implement the Index and IndexMut traits, which ravel a multi-dimensional index into a linear_index and get the value.
impl<T> Index<&[usize]> for Tensor<T> {
type Output = T;
fn index(&self, indices: &[usize]) -> &Self::Output {
&self.storage[self.shape.ravel_index(indices)]
}
}
impl<T> IndexMut<&[usize]> for Tensor<T> {
fn index_mut(&mut self, indices: &[usize]) -> &mut Self::Output {
&mut self.storage[self.shape.ravel_index(indices)]
}
}
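Putting it all together, here is a minimal sketch of reading and writing through these impls, assuming everything above is in scope:
let mut t: Tensor<f32> = Tensor::zeros(vec![2, 3]);
let idx: &[usize] = &[1, 2];
t[idx] = 5.0;            // write through IndexMut
assert_eq!(t[idx], 5.0); // read back through Index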
In Production: Candle
Now that we understand the fundamentals of raveling and unraveling,
let's see how Candle optimizes indexing operations in practice.
Our ravel_index function calculates the linear index from scratch each time. Candle takes a different approach by pre-computing strides (the multipliers for each dimension). The Layout struct includes both shape and stride information:
pub struct Layout {
shape: Shape,
stride: Vec<usize>, // Pre-computed strides
start_offset: usize, // Pre-computed start_offset
}
Let's look at how Candle implements AvgPool2D.
Here is a summary:
let (b_sz, c, h, w) = layout.shape().dims4()?;
let stride = layout.stride();
let src_index = layout.start_offset();
for b_idx in 0..b_sz {
    let src_index = src_index + b_idx * stride[0]; // add batch offset
    for c_idx in 0..c {
        let src_index = src_index + c_idx * stride[1]; // add channel offset
        for m in 0..kernel_h {
            for n in 0..kernel_w {
                let final_index = src_index + m * stride[2] + n * stride[3];
            }
        }
    }
}
This is performing the raveling operation.
Contrast this with our current approach:
let (b_sz, c, h, w) = layout.shape().dims4()?;
for b_idx in 0..b_sz {
for c_idx in 0..c {
for m in 0..kernel_h {
for n in 0..kernel_w {
let final_index = layout.start_offset() + layout.ravel_index(&[b_idx, c_idx, m, n]);
}
}
}
}
Candle’s strided approach is more efficient: the strides are computed once up front, the inner loops only add precomputed offsets, and the resulting predictable memory access pattern is friendly to CPU caches and compiler optimizations.
Code
To run the code from the blogpost:
First, clone the repo:
git clone [email protected]:greenrazer/easytensor.git
Then check out the tagged commit for this part:
git checkout part-1
Then run cargo test:
cargo test
Look at the code in src/
Next Steps
We've built the foundation of our tensor library with core indexing operations that mirror how production frameworks like Candle work internally. In the next part, we'll explore how to make view operations efficient.
Try experimenting with the code, create some tensors, index into them, and see how the raveling/unraveling works with different shapes. Understanding these fundamentals will make the more advanced operations we'll cover much clearer.