What Is a Tensor, Really?
A tensor is not just a multidimensional array — it's a thin metadata wrapper around a flat block of memory. Understanding this distinction is the key to understanding why some operations are free (just change the metadata) while others require copying potentially gigabytes of data.
A tensor has two parts:
- Storage: a contiguous, one-dimensional block of memory holding the actual numbers.
- Metadata: shape, strides, dtype, offset — everything the library needs to interpret that flat memory as a multidimensional structure.
Let's see this concretely. We'll create a 2×3 matrix and inspect its internal structure — the shape, strides, data type, and the flat memory layout underneath.
import numpy as np
# Create a 2×3 matrix
x = np.array([[1, 2, 3],
              [4, 5, 6]], dtype=np.float32)
print(f"Shape: {x.shape}") # (2, 3)
print(f"Strides: {x.strides}") # (12, 4) — bytes, not elements
print(f"Dtype: {x.dtype}") # float32 (4 bytes per element)
print(f"Size: {x.nbytes} bytes") # 24 bytes total (6 elements × 4 bytes)
# The underlying memory is a flat sequence of bytes
flat = x.tobytes()
print(f"\nRaw memory ({len(flat)} bytes):")
print(f" {[x.flat[i] for i in range(x.size)]}")
print(f"\nThe 2×3 'shape' is just our interpretation of these 6 numbers.")
In PyTorch, the equivalent looks nearly identical — but with one important difference we'll explore in a moment:
import torch
x = torch.tensor([[1, 2, 3],
                  [4, 5, 6]], dtype=torch.float32)
print(x.shape) # torch.Size([2, 3])
print(x.stride()) # (3, 1) — in elements, not bytes
print(x.dtype) # torch.float32
print(x.device) # cpu (or cuda:0)
print(x.storage()) # flat storage: [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
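To make the "thin metadata wrapper" idea tangible, here's a small numpy sketch (not part of the original example): the same 24 raw bytes can be read back under two different shapes, because the shape lives entirely in the metadata.

```python
import numpy as np

# Six float32 values as raw bytes — no shape information at all
buf = np.arange(1, 7, dtype=np.float32).tobytes()

# The same 24 bytes, reinterpreted under two different shapes
a = np.frombuffer(buf, dtype=np.float32).reshape(2, 3)
b = np.frombuffer(buf, dtype=np.float32).reshape(3, 2)

print(a.shape, b.shape)  # (2, 3) (3, 2)

# Identical storage, different metadata: shape is pure interpretation
print((a.ravel() == b.ravel()).all())  # True
```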
Strides: How Shape Maps to Memory
Strides are the number of elements to skip in memory to move one step along each dimension. They're how the library translates a multidimensional index like `[i, j]` into a flat memory offset:

$$\text{offset} = \sum_k \text{index}[k] \times \text{stride}[k]$$

Here, `index[k]` is the position along dimension $k$, `stride[k]` is how many elements to skip per step in that dimension, and the sum gives the position in the flat storage. For a 2D tensor with strides `(3, 1)`, element `[1, 2]` is at offset $1 \times 3 + 2 \times 1 = 5$.
Let's verify this with a concrete example — we'll compute offsets manually and check that they match the actual array values.
import numpy as np
x = np.array([[10, 20, 30, 40],
              [50, 60, 70, 80],
              [90, 100, 110, 120]], dtype=np.int32)
print(f"Shape: {x.shape}") # (3, 4)
print(f"Strides: {x.strides}") # (16, 4) — 16 bytes = 4 elements × 4 bytes
# Manual offset calculation (in elements)
strides_elem = (4, 1) # strides in elements for (3, 4) row-major
for i in range(3):
    for j in range(4):
        offset = i * strides_elem[0] + j * strides_elem[1]
        flat_val = x.flat[offset]
        assert flat_val == x[i, j], "Mismatch!"
print("All offsets verified!")
print()
# Show the mapping — collect rows for aligned output
indices = [(0,0), (0,3), (1,0), (1,2), (2,3)]
rows = []
for i, j in indices:
    offset = i * strides_elem[0] + j * strides_elem[1]
    rows.append((i, j, offset, x[i, j]))
w_off = max(len(str(r[2])) for r in rows)
w_val = max(len(str(r[3])) for r in rows)
print("Index → Offset → Value")
for i, j, offset, val in rows:
print(f" [{i},{j}] → offset {offset:>{w_off}} → {val:>{w_val}}")
This is why `reshape` and `transpose` can sometimes be "free" — they just change the strides without touching the data. A 3×4 matrix and a 4×3 matrix can share the same 12-element storage block; only the strides differ.
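We can verify that a transpose is metadata-only with a quick numpy sketch: the strides swap, the storage doesn't move.

```python
import numpy as np

x = np.arange(12, dtype=np.float32).reshape(3, 4)
t = x.T  # (4, 3)

# The strides are simply swapped; no new storage is allocated
print(x.strides)  # (16, 4)
print(t.strides)  # (4, 16)
print(np.shares_memory(x, t))  # True

# Writing through the transpose writes into the original buffer
t[0, 0] = 99.0
print(x[0, 0])  # 99.0
```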
The -1 and None Conventions
PyTorch and numpy overload `-1` and `None` to mean completely different things depending on context. This trips up almost everyone at some point, so let's break each usage down explicitly.
import numpy as np
x = np.arange(12).reshape(3, 4)
print(f"Original: shape {x.shape}")
print(x)
print()
# -1 in reshape: "infer this dimension"
# Total elements = 12. If one dim is 4, the other must be 12/4 = 3
a = x.reshape(-1, 4) # → (3, 4)
b = x.reshape(3, -1) # → (3, 4)
c = x.reshape(-1) # → (12,) — flatten
d = x.reshape(2, -1) # → (2, 6) — 12/2 = 6
print(f"reshape(-1, 4): {a.shape}")
print(f"reshape(3, -1): {b.shape}")
print(f"reshape(-1): {c.shape}")
print(f"reshape(2, -1): {d.shape}")
print()
# -1 in indexing: "last element"
print(f"x[-1] (last row): {x[-1]}")
print(f"x[0, -1] (last col): {x[0, -1]}")
print()
# None in indexing: "add a dimension" (like unsqueeze)
e = x[:, :, None] # (3, 4) → (3, 4, 1)
f = x[None, :, :] # (3, 4) → (1, 3, 4)
g = x[:, None, :] # (3, 4) → (3, 1, 4)
print(f"x[:, :, None]: {e.shape} (added dim at end)")
print(f"x[None, :, :]: {f.shape} (added dim at start)")
print(f"x[:, None, :]: {g.shape} (added dim in middle)")
In PyTorch, these conventions carry over exactly, with the addition of `.unsqueeze()` as an explicit alternative to `None` indexing:
# In PyTorch:
x = torch.arange(12).reshape(3, 4)
# -1 in reshape/view: same as numpy
x.reshape(-1, 4) # (3, 4)
x.view(2, -1) # (2, 6)
# None in indexing: same as numpy
x[:, :, None] # (3, 4, 1)
# .unsqueeze() is the explicit version:
x.unsqueeze(-1) # (3, 4, 1) — same as x[:, :, None]
x.unsqueeze(0) # (1, 3, 4) — same as x[None, :, :]
view vs reshape
Both `.view()` and `.reshape()` change a tensor's shape, but they differ in one critical way.

`.view()` reinterprets the strides. It requires the tensor to be *contiguous* in memory — the elements must be laid out in the exact order implied by the current strides. If they aren't (e.g., after a transpose), `view` raises an error.

`.reshape()` tries to do a view first (free, no copy). If the tensor isn't contiguous, it falls back to copying the data into a fresh contiguous block and then viewing that. So `reshape` is strictly more permissive — it always works, but might silently copy.
We can observe this distinction in numpy, which has the same contiguity concept:
import numpy as np
x = np.arange(12).reshape(3, 4)
print(f"Original: shape {x.shape}, strides {x.strides}")
print(f"Contiguous (C-order): {x.flags['C_CONTIGUOUS']}")
print()
# Transpose changes strides but NOT the data
t = x.T # (4, 3)
print(f"Transposed: shape {t.shape}, strides {t.strides}")
print(f"Contiguous (C-order): {t.flags['C_CONTIGUOUS']}")
print()
# reshape on non-contiguous → creates a copy
r = t.reshape(12)
print(f"reshape(12) on transposed: {r}")
print(f" (this required a copy because the data wasn't contiguous)")
print()
# To make contiguous explicitly:
t_contig = np.ascontiguousarray(t)
print(f"After ascontiguousarray: strides {t_contig.strides}")
print(f"Contiguous: {t_contig.flags['C_CONTIGUOUS']}")
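To confirm whether a copy actually happened, `np.shares_memory` gives a direct check (a small addition to the example above):

```python
import numpy as np

x = np.arange(12).reshape(3, 4)
t = x.T  # (4, 3), non-contiguous in C order

r = t.reshape(12)  # must copy: no single set of strides fits this order
v = x.reshape(12)  # contiguous source: a free view

print(np.shares_memory(t, r))  # False — new storage was allocated
print(np.shares_memory(x, v))  # True  — same storage, new metadata
```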
In PyTorch, the distinction is made explicit: `.view()` will raise a `RuntimeError` if the tensor is not contiguous, while `.reshape()` silently copies when needed:
# In PyTorch:
x = torch.arange(12).reshape(3, 4)
t = x.T # shape (4, 3), but NOT contiguous
# This works — reshape copies if needed:
t.reshape(12) # OK
# This fails — view requires contiguity:
t.view(12) # RuntimeError: view size is not compatible with
# input tensor's size and stride
# Fix: make contiguous first
t.contiguous().view(12) # OK
expand vs repeat
Both create a larger tensor by repeating data along a dimension, but they work fundamentally differently.

`.expand()` uses the *stride-zero trick*: it sets the stride to 0 along the expanded dimension, so every index maps to the same memory. No data is copied — the tensor just "pretends" to be larger. The expanded dimension must currently have size 1.

`.repeat()` physically copies the data, creating a new, larger storage block. It always allocates new memory proportional to the repeated size.
import numpy as np
x = np.array([[1, 2, 3]]) # shape (1, 3)
print(f"Original: {x}, shape {x.shape}")
print()
# broadcast_to is numpy's equivalent of torch.expand
expanded = np.broadcast_to(x, (4, 3))
print(f"Expanded (broadcast_to): shape {expanded.shape}")
print(expanded)
print(f" Strides: {expanded.strides}")
print(f" Note: stride[0] = 0 (every row points to same memory!)")
print(f" Shares memory with original: {np.shares_memory(x, expanded)}")
print()
# np.tile is numpy's equivalent of torch.repeat
tiled = np.tile(x, (4, 1))
print(f"Tiled (np.tile): shape {tiled.shape}")
print(tiled)
print(f" Strides: {tiled.strides}")
print(f" Shares memory with original: {np.shares_memory(x, tiled)}")
print()
print("Key difference:")
print("  Expanded: no new storage (nbytes reports the virtual size, but it's a stride-0 view)")
print(f"  Tiled: {tiled.nbytes} bytes of real, newly allocated storage")
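One practical consequence worth knowing (not shown above): because every row of the expanded array aliases the same memory, numpy marks `broadcast_to` results read-only, since writing through them would modify all rows at once.

```python
import numpy as np

x = np.array([[1, 2, 3]])
expanded = np.broadcast_to(x, (4, 3))

print(expanded.flags.writeable)  # False

try:
    expanded[0, 0] = 99  # every "row" is the same memory
except ValueError as err:
    print(f"Write rejected: {err}")

# np.tile, by contrast, returns ordinary writable storage
tiled = np.tile(x, (4, 1))
tiled[0, 0] = 99
print(tiled[1, 0])  # 1 — the rows are independent copies
```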
The PyTorch equivalents map directly:
x = torch.tensor([[1, 2, 3]]) # (1, 3)
# expand: stride-zero trick, no copy
e = x.expand(4, 3) # (4, 3), stride (0, 1)
e = x.expand(-1, -1) # -1 means "keep this dimension"
# repeat: physical copy
r = x.repeat(4, 1) # (4, 3), new storage
gather and scatter: Index-Based Access
When you need to pick specific elements from specific positions — not slicing a contiguous block, but cherry-picking individual entries — you use `.gather()` (read) and `.scatter()` (write). These operations are essential in many ML workflows: selecting top-k predictions, building custom loss functions, or routing tokens to experts in mixture-of-experts architectures.
Let's see `gather` in action by selecting top-k predictions from a simulated logits matrix:
import numpy as np
# Simulated logits: 4 samples, 5 classes
np.random.seed(42)
logits = np.round(np.random.randn(4, 5), 2)
print("Logits (4 samples × 5 classes):")
print(logits)
print()
# Top-2 class indices per sample
top_k = 2
indices = np.argsort(logits, axis=1)[:, -top_k:][:, ::-1]
print(f"Top-{top_k} indices per sample:")
print(indices)
print()
# gather: pick the logit values at those indices
# np.take_along_axis is numpy's equivalent of torch.gather
gathered = np.take_along_axis(logits, indices, axis=1)
print(f"Gathered top-{top_k} logit values:")
print(gathered)
print()
# Verify manually for sample 0
sample_0 = logits[0]
idx_0 = indices[0]
print(f"Sample 0 logits: {sample_0}")
print(f"Top-2 indices: {idx_0}")
print(f"Gathered values: {[sample_0[i] for i in idx_0]}")
In PyTorch, `torch.gather` provides the same operation, and its inverse `scatter_` writes values back to specific positions:
# torch.gather(input, dim, index)
gathered = torch.gather(logits, dim=1, index=indices)
# Tensor.scatter_(dim, index, src) — write values to specific positions
output = torch.zeros(4, 5)
output.scatter_(dim=1, index=indices, src=gathered)
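`np.put_along_axis` is numpy's counterpart to `scatter_`. As a sketch (the variable names here are mine, not from the example above), it can build a one-hot-style matrix from per-row argmax indices:

```python
import numpy as np

logits = np.array([[0.1, 0.9, 0.3],
                   [0.8, 0.2, 0.5]])

# Index of the winning class per row, kept 2D for take/put_along_axis
idx = np.argmax(logits, axis=1)[:, None]  # shape (2, 1)

# scatter: write 1.0 at each row's argmax position
one_hot = np.zeros_like(logits)
np.put_along_axis(one_hot, idx, 1.0, axis=1)
print(one_hot)
# [[0. 1. 0.]
#  [1. 0. 0.]]

# gather reads the same positions back out
print(np.take_along_axis(logits, idx, axis=1).ravel())  # [0.9 0.8]
```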
Broadcasting Rules
Broadcasting lets you do operations between tensors of different shapes without explicitly copying data. It follows three rules, applied from the rightmost dimension:
- Rule 1: If two dimensions differ in size and one of them is 1, the size-1 dimension is stretched to match the other.
- Rule 2: If one tensor has fewer dimensions, it's padded with 1s on the left.
- Rule 3: Dimensions must either match or be 1 — otherwise it's an error.
These rules are identical in numpy and PyTorch. Let's see each one in action:
import numpy as np
# Rule 1: dimension of size 1 is stretched
a = np.array([[1], [2], [3]]) # (3, 1)
b = np.array([[10, 20, 30, 40]]) # (1, 4)
result = a + b # (3, 4)
print("(3,1) + (1,4) → (3,4):")
print(f" a = {a.T[0]}")
print(f" b = {b[0]}")
print(result)
print()
# Rule 2: fewer dimensions → pad with 1s on left
c = np.array([10, 20, 30, 40]) # (4,) → treated as (1, 4)
d = np.arange(12).reshape(3, 4) # (3, 4)
result2 = c + d # (3, 4)
print("(4,) + (3,4) → (3,4):")
print(f" (4,) is treated as (1,4), then stretched to (3,4)")
print(result2)
print()
# Rule 3: mismatch → error
e = np.array([1, 2, 3]) # (3,)
f = np.arange(12).reshape(3, 4) # (3, 4)
try:
    result3 = e + f  # tries (3,) as (1,3) + (3,4) → 3≠4, error!
except ValueError as err:
    print(f"(3,) + (3,4) → ERROR:")
    print(f"  {err}")
    print(f"  (3,) becomes (1,3). Right dims: 3 vs 4 — no match!")
    print(f"  Fix: reshape to (3,1) first: e.reshape(3,1) + f works")
Broadcasting is not merely a convenience — it's deeply connected to the stride-zero trick we saw with `.expand()`. When numpy or PyTorch broadcasts a dimension of size 1, it effectively sets the stride to 0 along that dimension. The data isn't copied; the library simply reads the same element repeatedly. This is why broadcasting is essentially free in terms of memory — it's an expand under the hood.
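We can verify the stride-zero claim directly: `np.broadcast_arrays` exposes the views the library uses internally (a quick sketch using the same shapes as Rule 1 above):

```python
import numpy as np

a = np.array([[1.0], [2.0], [3.0]])       # (3, 1), float64
b = np.array([[10.0, 20.0, 30.0, 40.0]])  # (1, 4), float64

# Both broadcast views pretend to be (3, 4)...
a_b, b_b = np.broadcast_arrays(a, b)
print(a_b.shape, b_b.shape)  # (3, 4) (3, 4)

# ...but the stretched dimensions have stride 0: no data was copied
print(a_b.strides)  # (8, 0) — each column rereads the same float64
print(b_b.strides)  # (0, 8) — each row rereads the same 4 values
print(np.shares_memory(a, a_b), np.shares_memory(b, b_b))  # True True
```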
Quiz
Test your understanding of tensor internals — memory layout, strides, and the operations that manipulate them.
What are the two components of a tensor?
In the stride formula offset = Σ index[k] × stride[k], what does stride[k] represent?
What happens when you call .view() on a non-contiguous tensor in PyTorch?
How does .expand() avoid copying data?
Why does (3,) + (3,4) fail under broadcasting, while (3,1) + (3,4) succeeds?