What Is a Tensor, Really?
A tensor is not just a multidimensional array — it's a thin metadata wrapper around a flat block of memory. Understanding this distinction is the key to understanding why some operations are free (just change the metadata) while others require copying potentially gigabytes of data.
A tensor has two parts:
- Storage: a contiguous, one-dimensional block of memory holding the actual numbers.
- Metadata: shape, strides, dtype, offset — everything the library needs to interpret that flat memory as a multidimensional structure.
Let's see this concretely. We'll create a 2×3 matrix and inspect its internal structure — the shape, strides, data type, and the flat memory layout underneath.
import numpy as np
# Create a 2×3 matrix
x = np.array([[1, 2, 3],
              [4, 5, 6]], dtype=np.float32)
print(f"Shape: {x.shape}") # (2, 3)
print(f"Strides: {x.strides}") # (12, 4) — bytes, not elements
print(f"Dtype: {x.dtype}") # float32 (4 bytes per element)
print(f"Size: {x.nbytes} bytes") # 24 bytes total (6 elements × 4 bytes)
# The underlying memory is a flat sequence of bytes
flat = x.tobytes()
print(f"\nRaw memory ({len(flat)} bytes):")
print(f" {[x.flat[i] for i in range(x.size)]}")
print(f"\nThe 2×3 'shape' is just our interpretation of these 6 numbers.")
In PyTorch, the equivalent looks nearly identical — but with one important difference we'll explore in a moment:
import torch
x = torch.tensor([[1, 2, 3],
                  [4, 5, 6]], dtype=torch.float32)
print(x.shape) # torch.Size([2, 3])
print(x.stride()) # (3, 1) — in elements, not bytes
print(x.dtype) # torch.float32
print(x.device) # cpu (or cuda:0)
print(x.storage()) # flat storage: [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
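To make the "thin metadata wrapper" idea tangible, here's a small numpy sketch (not part of the original example): the same 24 raw bytes can be read back under two different shapes, because the shape lives entirely in the metadata.

```python
import numpy as np

# Six float32 values as raw bytes — no shape information at all
buf = np.arange(1, 7, dtype=np.float32).tobytes()

# The same 24 bytes, reinterpreted under two different shapes
a = np.frombuffer(buf, dtype=np.float32).reshape(2, 3)
b = np.frombuffer(buf, dtype=np.float32).reshape(3, 2)

print(a.shape, b.shape)  # (2, 3) (3, 2)

# Identical storage, different metadata: shape is pure interpretation
print((a.ravel() == b.ravel()).all())  # True
```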
Strides: How Shape Maps to Memory
Strides are the number of elements to skip in memory to move one step along each dimension. They're how the library translates a multidimensional index like `[i, j]` into a flat memory offset:

$$\text{offset} = \sum_k \text{index}[k] \times \text{stride}[k]$$

Here, `index[k]` is the position along dimension $k$, `stride[k]` is how many elements to skip per step in that dimension, and the sum gives the position in the flat storage. For a 2D tensor with strides `(3, 1)`, element `[1, 2]` is at offset $1 \times 3 + 2 \times 1 = 5$.
Let's verify this with a concrete example — we'll compute offsets manually and check that they match the actual array values.
import numpy as np
x = np.array([[10, 20, 30, 40],
              [50, 60, 70, 80],
              [90, 100, 110, 120]], dtype=np.int32)
print(f"Shape: {x.shape}") # (3, 4)
print(f"Strides: {x.strides}") # (16, 4) — 16 bytes = 4 elements × 4 bytes
# Manual offset calculation (in elements)
strides_elem = (4, 1) # strides in elements for (3, 4) row-major
for i in range(3):
    for j in range(4):
        offset = i * strides_elem[0] + j * strides_elem[1]
        flat_val = x.flat[offset]
        assert flat_val == x[i, j], "Mismatch!"
print("All offsets verified!")
print()
# Show the mapping — collect rows for aligned output
indices = [(0,0), (0,3), (1,0), (1,2), (2,3)]
rows = []
for i, j in indices:
    offset = i * strides_elem[0] + j * strides_elem[1]
    rows.append((i, j, offset, x[i, j]))
w_off = max(len(str(r[2])) for r in rows)
w_val = max(len(str(r[3])) for r in rows)
print("Index → Offset → Value")
for i, j, offset, val in rows:
print(f" [{i},{j}] → offset {offset:>{w_off}} → {val:>{w_val}}")
This is why `reshape` and `transpose` can sometimes be "free" — they just change the strides without touching the data. A 3×4 matrix and a 4×3 matrix can share the same 12-element storage block; only the strides differ.
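We can verify that a transpose is metadata-only with a quick numpy sketch: the strides swap, the storage doesn't move.

```python
import numpy as np

x = np.arange(12, dtype=np.float32).reshape(3, 4)
t = x.T  # (4, 3)

# The strides are simply swapped; no new storage is allocated
print(x.strides)  # (16, 4)
print(t.strides)  # (4, 16)
print(np.shares_memory(x, t))  # True

# Writing through the transpose writes into the original buffer
t[0, 0] = 99.0
print(x[0, 0])  # 99.0
```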
The -1 and None Conventions
PyTorch and numpy overload `-1` and `None` to mean completely different things depending on context. This trips up almost everyone at some point, so let's break each usage down explicitly.
import numpy as np
x = np.arange(12).reshape(3, 4)
print(f"Original: shape {x.shape}")
print(x)
print()
# -1 in reshape: "infer this dimension"
# Total elements = 12. If one dim is 4, the other must be 12/4 = 3
a = x.reshape(-1, 4) # → (3, 4)
b = x.reshape(3, -1) # → (3, 4)
c = x.reshape(-1) # → (12,) — flatten
d = x.reshape(2, -1) # → (2, 6) — 12/2 = 6
print(f"reshape(-1, 4): {a.shape}")
print(f"reshape(3, -1): {b.shape}")
print(f"reshape(-1): {c.shape}")
print(f"reshape(2, -1): {d.shape}")
print()
# -1 in indexing: "last element"
print(f"x[-1] (last row): {x[-1]}")
print(f"x[0, -1] (last col): {x[0, -1]}")
print()
# None in indexing: "add a dimension" (like unsqueeze)
e = x[:, :, None] # (3, 4) → (3, 4, 1)
f = x[None, :, :] # (3, 4) → (1, 3, 4)
g = x[:, None, :] # (3, 4) → (3, 1, 4)
print(f"x[:, :, None]: {e.shape} (added dim at end)")
print(f"x[None, :, :]: {f.shape} (added dim at start)")
print(f"x[:, None, :]: {g.shape} (added dim in middle)")
In PyTorch, these conventions carry over exactly, with the addition of `.unsqueeze()` as an explicit alternative to `None` indexing:
# In PyTorch:
x = torch.arange(12).reshape(3, 4)
# -1 in reshape/view: same as numpy
x.reshape(-1, 4) # (3, 4)
x.view(2, -1) # (2, 6)
# None in indexing: same as numpy
x[:, :, None] # (3, 4, 1)
# .unsqueeze() is the explicit version:
x.unsqueeze(-1) # (3, 4, 1) — same as x[:, :, None]
x.unsqueeze(0) # (1, 3, 4) — same as x[None, :, :]
view vs reshape
Both `.view()` and `.reshape()` change a tensor's shape, but they differ in one critical way.

`.view()` reinterprets the strides. It requires the tensor to be *contiguous* in memory — the elements must be laid out in the exact order implied by the current strides. If they aren't (e.g., after a transpose), `view` raises an error.

`.reshape()` tries to do a view first (free, no copy). If the tensor isn't contiguous, it falls back to copying the data into a fresh contiguous block and then viewing that. So `reshape` is strictly more permissive — it always works, but might silently copy.
We can observe this distinction in numpy, which has the same contiguity concept:
import numpy as np
x = np.arange(12).reshape(3, 4)
print(f"Original: shape {x.shape}, strides {x.strides}")
print(f"Contiguous (C-order): {x.flags['C_CONTIGUOUS']}")
print()
# Transpose changes strides but NOT the data
t = x.T # (4, 3)
print(f"Transposed: shape {t.shape}, strides {t.strides}")
print(f"Contiguous (C-order): {t.flags['C_CONTIGUOUS']}")
print()
# reshape on non-contiguous → creates a copy
r = t.reshape(12)
print(f"reshape(12) on transposed: {r}")
print(f" (this required a copy because the data wasn't contiguous)")
print()
# To make contiguous explicitly:
t_contig = np.ascontiguousarray(t)
print(f"After ascontiguousarray: strides {t_contig.strides}")
print(f"Contiguous: {t_contig.flags['C_CONTIGUOUS']}")
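To confirm whether a copy actually happened, `np.shares_memory` gives a direct check (a small addition to the example above):

```python
import numpy as np

x = np.arange(12).reshape(3, 4)
t = x.T  # (4, 3), non-contiguous in C order

r = t.reshape(12)  # must copy: no single set of strides fits this order
v = x.reshape(12)  # contiguous source: a free view

print(np.shares_memory(t, r))  # False — new storage was allocated
print(np.shares_memory(x, v))  # True  — same storage, new metadata
```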
In PyTorch, the distinction is made explicit: `.view()` will raise a `RuntimeError` if the tensor is not contiguous, while `.reshape()` silently copies when needed:
# In PyTorch:
x = torch.arange(12).reshape(3, 4)
t = x.T # shape (4, 3), but NOT contiguous
# This works — reshape copies if needed:
t.reshape(12) # OK
# This fails — view requires contiguity:
t.view(12) # RuntimeError: view size is not compatible with
# input tensor's size and stride
# Fix: make contiguous first
t.contiguous().view(12) # OK
expand vs repeat
Both create a larger tensor by repeating data along a dimension, but they work fundamentally differently.

`.expand()` uses the *stride-zero trick*: it sets the stride to 0 along the expanded dimension, so every index maps to the same memory. No data is copied — the tensor just "pretends" to be larger. The expanded dimension must currently have size 1.

`.repeat()` physically copies the data, creating a new, larger storage block. It always allocates new memory proportional to the repeated size.
import numpy as np
x = np.array([[1, 2, 3]]) # shape (1, 3)
print(f"Original: {x}, shape {x.shape}")
print()
# broadcast_to is numpy's equivalent of torch.expand
expanded = np.broadcast_to(x, (4, 3))
print(f"Expanded (broadcast_to): shape {expanded.shape}")
print(expanded)
print(f" Strides: {expanded.strides}")
print(f" Note: stride[0] = 0 (every row points to same memory!)")
print(f" Shares memory with original: {np.shares_memory(x, expanded)}")
print()
# np.tile is numpy's equivalent of torch.repeat
tiled = np.tile(x, (4, 1))
print(f"Tiled (np.tile): shape {tiled.shape}")
print(tiled)
print(f" Strides: {tiled.strides}")
print(f" Shares memory with original: {np.shares_memory(x, tiled)}")
print()
print("Key difference:")
print("  Expanded: no new storage (nbytes reports the virtual size, but it's a stride-0 view)")
print(f"  Tiled: {tiled.nbytes} bytes of real, newly allocated storage")
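One practical consequence worth knowing (not shown above): because every row of the expanded array aliases the same memory, numpy marks `broadcast_to` results read-only, since writing through them would modify all rows at once.

```python
import numpy as np

x = np.array([[1, 2, 3]])
expanded = np.broadcast_to(x, (4, 3))

print(expanded.flags.writeable)  # False

try:
    expanded[0, 0] = 99  # every "row" is the same memory
except ValueError as err:
    print(f"Write rejected: {err}")

# np.tile, by contrast, returns ordinary writable storage
tiled = np.tile(x, (4, 1))
tiled[0, 0] = 99
print(tiled[1, 0])  # 1 — the rows are independent copies
```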
The PyTorch equivalents map directly:
x = torch.tensor([[1, 2, 3]]) # (1, 3)
# expand: stride-zero trick, no copy
e = x.expand(4, 3) # (4, 3), stride (0, 1)
e = x.expand(-1, -1) # -1 means "keep this dimension"
# repeat: physical copy
r = x.repeat(4, 1) # (4, 3), new storage
gather and scatter: Index-Based Access
When you need to pick specific elements from specific positions — not slicing a contiguous block, but cherry-picking individual entries — you use `.gather()` (read) and `.scatter()` (write). These operations are essential in many ML workflows: selecting top-k predictions, building custom loss functions, or routing tokens to experts in mixture-of-experts architectures.
Let's see `gather` in action by selecting top-k predictions from a simulated logits matrix:
import numpy as np
# Simulated logits: 4 samples, 5 classes
np.random.seed(42)
logits = np.round(np.random.randn(4, 5), 2)
print("Logits (4 samples × 5 classes):")
print(logits)
print()
# Top-2 class indices per sample
top_k = 2
indices = np.argsort(logits, axis=1)[:, -top_k:][:, ::-1]
print(f"Top-{top_k} indices per sample:")
print(indices)
print()
# gather: pick the logit values at those indices
# np.take_along_axis is numpy's equivalent of torch.gather
gathered = np.take_along_axis(logits, indices, axis=1)
print(f"Gathered top-{top_k} logit values:")
print(gathered)
print()
# Verify manually for sample 0
sample_0 = logits[0]
idx_0 = indices[0]
print(f"Sample 0 logits: {sample_0}")
print(f"Top-2 indices: {idx_0}")
print(f"Gathered values: {[sample_0[i] for i in idx_0]}")
In PyTorch, `torch.gather` provides the same operation, and its inverse `scatter_` writes values back to specific positions:
# torch.gather(input, dim, index)
gathered = torch.gather(logits, dim=1, index=indices)
# Tensor.scatter_(dim, index, src) — write values to specific positions
output = torch.zeros(4, 5)
output.scatter_(dim=1, index=indices, src=gathered)
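`np.put_along_axis` is numpy's counterpart to `scatter_`. As a sketch (the variable names here are mine, not from the example above), it can build a one-hot-style matrix from per-row argmax indices:

```python
import numpy as np

logits = np.array([[0.1, 0.9, 0.3],
                   [0.8, 0.2, 0.5]])

# Index of the winning class per row, kept 2D for take/put_along_axis
idx = np.argmax(logits, axis=1)[:, None]  # shape (2, 1)

# scatter: write 1.0 at each row's argmax position
one_hot = np.zeros_like(logits)
np.put_along_axis(one_hot, idx, 1.0, axis=1)
print(one_hot)
# [[0. 1. 0.]
#  [1. 0. 0.]]

# gather reads the same positions back out
print(np.take_along_axis(logits, idx, axis=1).ravel())  # [0.9 0.8]
```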
Broadcasting Rules
Broadcasting lets you do operations between tensors of different shapes without explicitly copying data. It follows three rules, applied from the rightmost dimension:
- Rule 1: If two dimensions differ in size and one of them is 1, the size-1 dimension is stretched to match the other.
- Rule 2: If one tensor has fewer dimensions, it's padded with 1s on the left.
- Rule 3: Dimensions must either match or be 1 — otherwise it's an error.
These rules are identical in numpy and PyTorch. Let's see each one in action:
import numpy as np
# Rule 1: dimension of size 1 is stretched
a = np.array([[1], [2], [3]]) # (3, 1)
b = np.array([[10, 20, 30, 40]]) # (1, 4)
result = a + b # (3, 4)
print("(3,1) + (1,4) → (3,4):")
print(f" a = {a.T[0]}")
print(f" b = {b[0]}")
print(result)
print()
# Rule 2: fewer dimensions → pad with 1s on left
c = np.array([10, 20, 30, 40]) # (4,) → treated as (1, 4)
d = np.arange(12).reshape(3, 4) # (3, 4)
result2 = c + d # (3, 4)
print("(4,) + (3,4) → (3,4):")
print(f" (4,) is treated as (1,4), then stretched to (3,4)")
print(result2)
print()
# Rule 3: mismatch → error
e = np.array([1, 2, 3]) # (3,)
f = np.arange(12).reshape(3, 4) # (3, 4)
try:
    result3 = e + f  # tries (3,) as (1,3) + (3,4) → 3≠4, error!
except ValueError as err:
    print(f"(3,) + (3,4) → ERROR:")
    print(f"  {err}")
    print(f"  (3,) becomes (1,3). Right dims: 3 vs 4 — no match!")
    print(f"  Fix: reshape to (3,1) first: e.reshape(3,1) + f works")
Broadcasting is not merely a convenience — it's deeply connected to the stride-zero trick we saw with `.expand()`. When numpy or PyTorch broadcasts a dimension of size 1, it effectively sets the stride to 0 along that dimension. The data isn't copied; the library simply reads the same element repeatedly. This is why broadcasting is essentially free in terms of memory — it's an expand under the hood.
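We can verify the stride-zero claim directly: `np.broadcast_arrays` exposes the views the library uses internally (a quick sketch using the same shapes as Rule 1 above):

```python
import numpy as np

a = np.array([[1.0], [2.0], [3.0]])       # (3, 1), float64
b = np.array([[10.0, 20.0, 30.0, 40.0]])  # (1, 4), float64

# Both broadcast views pretend to be (3, 4)...
a_b, b_b = np.broadcast_arrays(a, b)
print(a_b.shape, b_b.shape)  # (3, 4) (3, 4)

# ...but the stretched dimensions have stride 0: no data was copied
print(a_b.strides)  # (8, 0) — each column rereads the same float64
print(b_b.strides)  # (0, 8) — each row rereads the same 4 values
print(np.shares_memory(a, a_b), np.shares_memory(b, b_b))  # True True
```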
Quiz
Test your understanding of tensor internals — memory layout, strides, and the operations that manipulate them.
What are the two components of a tensor?
In the stride formula offset = Σ index[k] × stride[k], what does stride[k] represent?
What happens when you call .view() on a non-contiguous tensor in PyTorch?
How does .expand() avoid copying data?
Why does (3,) + (3,4) fail under broadcasting, while (3,1) + (3,4) succeeds?