Find Median from Data Stream: why two heaps beat a sorted array
The median of a static array is easy: sort it once, index the middle. The interesting constraint here is that numbers arrive one at a time and you have to answer findMedian correctly after every insertion — you never get to see all the numbers upfront. That is what makes this a design problem. The question is not "how do I compute a median" but "what data structure lets me maintain the median cheaply across an unbounded sequence of inserts?"
The constraints narrow the search space before you write a line. num is bounded by -10^5 to 10^5 — no overflow concerns. At most 5 * 10^4 calls total. That last number is the real signal: if you have up to fifty thousand operations and findMedian is called frequently, a solution that sorts the whole array on every query will burn O(n log n) per call and degrade badly. You need something where the cost-per-operation stays bounded as n grows.
The problem
Design a class MedianFinder that supports two operations: addNum(num) adds an integer from a data stream, and findMedian() returns the median of all integers added so far — where the median is the middle value for odd counts, or the average of the two middle values for even counts. Full statement: LeetCode 295.
Constraints:
- -10^5 <= num <= 10^5
- There will be at least one element in the data structure before calling findMedian.
- At most 5 * 10^4 calls will be made to addNum and findMedian.Approach 1: sort on every query
The simplest possible design: collect all numbers in a plain list, sort the list inside findMedian, and read off the middle. No invariant to maintain between calls.
class MedianFinder:
def __init__(self):
self.nums = []
def addNum(self, num: int) -> None:
self.nums.append(num)
def findMedian(self) -> float:
self.nums.sort()
n = len(self.nums)
if n % 2 == 1:
return float(self.nums[n // 2])
return (self.nums[n // 2 - 1] + self.nums[n // 2]) / 2.0public class MedianFinder {
private List<int> nums;
public MedianFinder() {
nums = new List<int>();
}
public void AddNum(int num) {
nums.Add(num);
}
public double FindMedian() {
nums.Sort();
int n = nums.Count;
if (n % 2 == 1) return (double)nums[n / 2];
return (nums[n / 2 - 1] + nums[n / 2]) / 2.0;
}
}addNum is O(1). findMedian is O(n log n) — Python's Timsort, C#'s introsort, both O(n log n) worst case. Space is O(n).
The problem with this is that every call to findMedian re-sorts work already done. You gain nothing from the previous sort. If findMedian is called after every insertion, you end up spending O(n log n) per call and O(n² log n) total for n insertions. With n = 5 * 10^4, that is around 25 billion operations in the worst case — well beyond what will pass in time.
Approach 2: maintain a sorted list, insert in position
The obvious fix: keep the list sorted at all times. Then findMedian is O(1) — just index the middle — and the sort cost moves entirely to addNum.
Binary search finds the correct insertion index in O(log n). The actual array shift after insertion costs O(n) because everything to the right must move. You trade an O(n log n) findMedian for an O(n) addNum, which is better when queries outnumber insertions but still linear per insert.
import bisect
class MedianFinder:
def __init__(self):
self.nums = []
def addNum(self, num: int) -> None:
bisect.insort(self.nums, num)
def findMedian(self) -> float:
n = len(self.nums)
if n % 2 == 1:
return float(self.nums[n // 2])
return (self.nums[n // 2 - 1] + self.nums[n // 2]) / 2.0public class MedianFinder {
private List<int> nums;
public MedianFinder() {
nums = new List<int>();
}
public void AddNum(int num) {
// Binary search for insertion point, then insert to maintain sorted order
int lo = 0, hi = nums.Count;
while (lo < hi) {
int mid = lo + (hi - lo) / 2;
if (nums[mid] < num) lo = mid + 1;
else hi = mid;
}
nums.Insert(lo, num);
}
public double FindMedian() {
int n = nums.Count;
if (n % 2 == 1) return (double)nums[n / 2];
return (nums[n / 2 - 1] + nums[n / 2]) / 2.0;
}
}addNum: O(n) — O(log n) search, O(n) shift. findMedian: O(1). Space: O(n).
This is better than Approach 1 when queries dominate, but still linear on insert. With 5 * 10^4 insertions all happening before any query, worst case is still O(n²) total. The shift cost in C#'s List.Insert and Python's bisect.insort cannot be avoided for array-backed lists — inserting in the middle means moving memory.
The two-heap split
The key insight is this: you do not need the numbers in a fully sorted order to find the median. You only need to know what is at the boundary between the lower half and the upper half.
Split the numbers into two groups: everything less than or equal to the median goes into a max-heap (so the largest of the small numbers is always at the top), and everything greater goes into a min-heap (so the smallest of the large numbers is always at the top). Keep the heaps balanced so they differ by at most one element.
The median is then either:
- The top of the max-heap, if the max-heap has one more element than the min-heap (odd total count)
- The average of both tops, if the heaps are equal in size (even total count)
Every addNum does at most two heap operations: one push and optionally one pop+push to rebalance. Both are O(log n). findMedian just peeks at one or two heap tops — O(1).
import heapq
class MedianFinder:
def __init__(self):
# max_heap stores the lower half; Python only has min-heap,
# so negate values to simulate max behavior
self.max_heap = [] # lower half
self.min_heap = [] # upper half
def addNum(self, num: int) -> None:
# Route to the correct half
if not self.max_heap or num <= -self.max_heap[0]:
heapq.heappush(self.max_heap, -num)
else:
heapq.heappush(self.min_heap, num)
# Rebalance: max_heap can hold at most one more than min_heap
if len(self.max_heap) > len(self.min_heap) + 1:
val = -heapq.heappop(self.max_heap)
heapq.heappush(self.min_heap, val)
elif len(self.min_heap) > len(self.max_heap):
val = heapq.heappop(self.min_heap)
heapq.heappush(self.max_heap, -val)
def findMedian(self) -> float:
if len(self.max_heap) == len(self.min_heap):
return (-self.max_heap[0] + self.min_heap[0]) / 2.0
return float(-self.max_heap[0])using System.Collections.Generic;
public class MedianFinder {
// C# PriorityQueue<TElement, TPriority> is a min-heap by default.
// For the max-heap side, we use negative priority to invert ordering.
private PriorityQueue<int, int> maxHeap; // lower half, max at top via neg priority
private PriorityQueue<int, int> minHeap; // upper half, min at top
public MedianFinder() {
maxHeap = new PriorityQueue<int, int>();
minHeap = new PriorityQueue<int, int>();
}
public void AddNum(int num) {
if (maxHeap.Count == 0 || num <= maxHeap.Peek()) {
maxHeap.Enqueue(num, -num);
} else {
minHeap.Enqueue(num, num);
}
// Rebalance
if (maxHeap.Count > minHeap.Count + 1) {
int val = maxHeap.Dequeue();
minHeap.Enqueue(val, val);
} else if (minHeap.Count > maxHeap.Count) {
int val = minHeap.Dequeue();
maxHeap.Enqueue(val, -val);
}
}
public double FindMedian() {
if (maxHeap.Count == minHeap.Count) {
return (maxHeap.Peek() + minHeap.Peek()) / 2.0;
}
return (double)maxHeap.Peek();
}
}addNum: O(log n). findMedian: O(1). Space: O(n).
Stepping through the example by hand
The problem's example: addNum(1), addNum(2), findMedian(), addNum(3), findMedian().
Start:
max_heap = [] (lower half)
min_heap = [] (upper half)
addNum(1):
max_heap is empty -> push 1 into max_heap
max_heap = [1]
min_heap = []
Balance check: sizes are 1 vs 0, within the allowed gap of 1. No rebalance.
addNum(2):
2 > top of max_heap (1) -> push 2 into min_heap
max_heap = [1]
min_heap = [2]
Balance check: sizes are 1 vs 1. Equal. No rebalance.
findMedian():
Equal sizes -> (max_heap.top + min_heap.top) / 2 = (1 + 2) / 2 = 1.5 ✓
addNum(3):
3 > top of max_heap (1) -> push 3 into min_heap
max_heap = [1]
min_heap = [2, 3]
Balance check: sizes are 1 vs 2. min_heap has one too many.
Pop 2 from min_heap, push into max_heap.
max_heap = [2, 1] (2 is the new max)
min_heap = [3]
Sizes: 2 vs 1. max_heap holds one more. Within limit.
findMedian():
max_heap has one more element -> return max_heap.top = 2 ✓
The rebalance step after addNum(3) is the key move. A number went into the wrong half — it was larger than the current boundary — and the rebalance corrected that by evicting the min of the upper half back into the lower half.
Edge cases and why they are non-issues
Single element in the stream: after addNum(x), max_heap has one element, min_heap is empty. findMedian returns float(-max_heap[0]) directly. The even/odd branch handles this without special casing.
All elements equal (e.g., addNum(5) three times): each new 5 satisfies num <= max_heap.top, so all three go to max_heap initially. Rebalancing moves elements to min_heap as needed. The tops of both heaps end up being 5, so whether you take the max_heap top alone or average both tops, you get 5.0. Correct.
Negative numbers: the routing condition num <= -max_heap[0] in Python (where stored values are negated) and num <= maxHeap.Peek() in C# work identically for negative values. No special handling required — the invariant is preserved independent of sign.
Interleaved add and query: the rebalance runs on every addNum, so the invariant holds at the moment any findMedian is called, regardless of operation order.
Overflow when averaging: in C#, maxHeap.Peek() + minHeap.Peek() with both values near 10^5 can overflow int when cast result is used. The division is by 2.0 which promotes to double, but the addition itself happens in int arithmetic. At 10^5, max sum is 2 * 10^5 = 200000 which is well within int range (max int is about 2.1 billion). No overflow risk in this problem.
Comparison across the three approaches
| Approach | addNum | findMedian | Space | When it makes sense |
|---|---|---|---|---|
| Sort on query | O(1) | O(n log n) | O(n) | Rare queries, many inserts |
| Sorted insert | O(n) | O(1) | O(n) | Rare inserts, many queries |
| Two heaps | O(log n) | O(1) | O(n) | Mixed workloads; always safe |
The sorted-insert approach wins over two heaps if inserts are extremely rare and queries dominate, because O(1) findMedian comes for free. But in a data stream problem, the whole premise is frequent inserts — which is exactly where sorted insert degrades. Two heaps are the right default.
What I would actually ship
Two heaps, without hesitation. The O(log n) / O(1) profile is ideal for streaming workloads, and the invariant (max_heap.size equals min_heap.size or max_heap.size equals min_heap.size + 1) is easy to maintain and verify in a code review.
The one thing I would double-check in a real codebase: Python's max-heap simulation via negation. It is idiomatic Python, but every place you interact with max_heap you have to remember to negate. In C# the PriorityQueue<TElement, TPriority> API makes this slightly cleaner — you pass the negative of the value as the priority for the max side, and dequeue always gives you back the original (non-negated) element. Less mental overhead.
The follow-up in the problem hints at a bucket sort optimization when all numbers are in [0, 100]. You maintain a frequency array of 101 entries and track the total count; findMedian scans at most 101 buckets to find the middle. That makes addNum O(1) and findMedian O(1) too, at the cost of range-bound values. For the constrained follow-up it is better; for the general case (values in [-10^5, 10^5]) the bucket array grows to 200001 entries which is still fixed-size but wastes memory if the stream is sparse.
The underlying pattern and where it transfers
The two-heap approach belongs to a broader family of "balanced partition" structures: you split data into two halves maintained by complementary heap invariants (max on the left, min on the right) so that the boundary element is always accessible in O(1). The same idea appears in:
- Sliding Window Median (LC 480) — same two-heap structure, but numbers leave the window as well as enter. You need lazy deletion or an ordered set to handle the removal. The heap invariant and the rebalance logic are identical; the complexity is the removal bookkeeping.
- IPO (LC 502) — uses a max-heap of profits and a min-heap of capital requirements to always pick the highest-profit project you can currently afford. Two heaps again, but this time they serve a greedy selection rather than a partition.
- Kth Largest Element in a Stream (LC 703) — a single min-heap of size k whose top is always the kth largest. A simplified version of one half of this problem.
- Design Twitter (LC 355) — uses a heap for the merge step of multiple sorted tweet streams. The "maintain a boundary" intuition is the same; the heap here merges rather than partitions.
The mental model to carry forward: whenever a problem asks you to maintain a running order statistic (median, kth element, top-k) over a dynamic set, and both insertions and queries need to be fast, a heap or two gives you logarithmic insert and constant-time access to the extreme — which is almost always enough.
This problem sits at position 7 in the heap / priority queue series, after Last Stone Weight, Kth Largest Element, K Closest Points to Origin, and Task Scheduler. Those problems all use a single heap. This is the first in the series that pairs two heaps against each other — the structural step up from "pick the extreme" to "maintain the middle."
Part of the series: Heap / Priority Queue
Occasional notes on what I'm building
Get an email when I publish a new post — engineering write-ups, no spam. Unsubscribe anytime.
Comments
No comments yet. Be the first.
