Ultra-optimized heap operations for Python,
written in C.
A single-file C extension delivering SIMD acceleration, GIL-releasing computation, and a multi-tier dispatch system that selects the optimal algorithm at runtime.
```python
import heapx

# Min-heap — Floyd's bottom-up O(n) construction
data = [5, 2, 8, 1, 9, 3, 7]
heapx.heapify(data)

# Max-heap — native support, no negation hacks
heapx.heapify(data, max_heap=True)

# Push, pop, remove, replace, merge
heapx.push(data, 6, max_heap=True)
largest = heapx.pop(data, max_heap=True)
top3 = heapx.pop(data, n=3, max_heap=True)

# GIL-releasing for multi-threaded workloads
floats = [float(x) for x in range(1_000_000)]
heapx.heapify(floats, nogil=True)

# Custom keys — computed once in O(n), not O(n log n)
tasks = [{"pri": 10}, {"pri": 1}, {"pri": 5}]
heapx.heapify(tasks, cmp=lambda x: x["pri"])
```
Drop-In heapq Replacement
Six functions (heapify, push, pop, remove, replace, merge) with native max-heap support, bulk operations, and custom key functions. No negation hacks. No wrappers.
Floyd's Bottom-Up Heapify
Wegener's 1993 variant descends to a leaf comparing only the children at each level, then bubbles the displaced element back up, reducing comparisons by ~25% versus standard sift-down. O(n) construction.
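A pure-Python sketch of this two-phase sift (illustrative only; the extension implements it in C):

```python
def _sift_down_bottom_up(heap, start, end):
    """Wegener's variant: descend to a leaf along the path of smaller
    children (one comparison per level), then bubble the displaced
    item back up to its final position."""
    item = heap[start]
    pos = start
    child = 2 * pos + 1
    # Phase 1: walk down, comparing only the two children at each level.
    while child < end:
        right = child + 1
        if right < end and heap[right] < heap[child]:
            child = right
        heap[pos] = heap[child]
        pos = child
        child = 2 * pos + 1
    # Phase 2: the hole is now at a leaf; bubble the original item up.
    heap[pos] = item
    while pos > start:
        parent = (pos - 1) >> 1
        if heap[pos] < heap[parent]:
            heap[pos], heap[parent] = heap[parent], heap[pos]
            pos = parent
        else:
            break

def floyd_heapify(heap):
    """O(n) bottom-up construction: sift down every internal node,
    last internal node first."""
    n = len(heap)
    for i in range(n // 2 - 1, -1, -1):
        _sift_down_bottom_up(heap, i, n)
```

The standard sift-down spends two comparisons per level (child vs. child, then child vs. item); phase 1 here spends only one, and the item usually ends up near the leaves anyway, so phase 2 is short.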
Runtime Algorithm Selection
An 11-priority dispatch table selects the optimal code path at runtime based on data type, heap size, arity, key function presence, and element homogeneity.
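The idea can be sketched as a first-match predicate table; the tier names below are hypothetical and far fewer than the real 11 entries:

```python
def select_path(data, *, max_heap=False, cmp=None, nogil=False):
    """First-match dispatch sketch: each predicate is checked in
    priority order and the first hit selects the code path.
    (Tier names are illustrative, not heapx's actual table.)"""
    tiers = [
        ("nogil_double",    lambda: nogil and all(type(x) is float for x in data)),
        ("keyed",           lambda: cmp is not None),
        ("specialized_int", lambda: all(type(x) is int for x in data)),
        ("generic",         lambda: True),  # always matches: safe fallback
    ]
    return next(name for name, pred in tiers if pred())
```

The real table additionally keys on heap size, arity, and homogeneity of other types, but the shape is the same: cheap predicates ordered from most to least specialized, terminating in a generic path that always applies.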
SIMD Acceleration
AVX2, SSE2, and ARM NEON intrinsics for type pointer scanning (4 ptrs/cycle) and child selection in quaternary/octonary heaps. Scalar fallbacks on all platforms.
Fast Comparison Paths
Type-specialized C comparisons for int, float, bytes, str, bool, and tuple, bypassing PyObject_RichCompareBool entirely. Recursive fast-compare for nested tuples.
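A Python sketch of the dispatch idea: exact-type check first, recursive element-wise compare for tuples, generic rich-compare fallback. The real paths are C, and `fast_lt` is an illustrative name:

```python
def fast_lt(a, b):
    """Type-specialized less-than sketch: when both operands share an
    exact simple type, compare directly (a single C-level compare in
    the extension); recurse for tuples; otherwise fall back to the
    generic protocol (PyObject_RichCompareBool in C)."""
    ta, tb = type(a), type(b)
    if ta is tb and ta in (int, float, str, bytes, bool):
        return a < b
    if ta is tb is tuple:
        for x, y in zip(a, b):
            if fast_lt(x, y):
                return True
            if fast_lt(y, x):
                return False
        return len(a) < len(b)  # equal prefix: shorter tuple is smaller
    return a < b
```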
GIL-Releasing Computation
For homogeneous int/float arrays: extracts raw C values, releases the GIL, performs pure-C heapify, then rearranges Python objects via cycle-following permutation.
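The final rearrangement step can be sketched in Python. `apply_permutation_in_place` is a hypothetical helper name; it shows how cycle-following moves every object exactly once instead of building a second list:

```python
def apply_permutation_in_place(items, perm):
    """Rearrange `items` so that items_new[i] = items_old[perm[i]],
    walking each cycle of the permutation and moving every element
    exactly once. `perm` is consumed as scratch space (entries are
    overwritten with -1 once visited)."""
    for start in range(len(items)):
        if perm[start] < 0:          # already placed by an earlier cycle
            continue
        carried = items[start]       # value displaced at the cycle's head
        current = start
        while True:
            nxt = perm[current]
            perm[current] = -1       # mark visited
            if nxt == start:         # cycle closes: drop the carried value
                items[current] = carried
                break
            items[current] = items[nxt]
            current = nxt
```

In the extension, the permutation comes from the pure-C heapify over the extracted raw values; this pass then fixes up the Python object array without any comparisons, so it never needs the GIL-protected compare machinery.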
Multi-Arity Heaps
Binary, ternary, quaternary, and octonary heaps with bit-shift parent/child calculations. Arity 1–64 supported. Cache-friendly access patterns for large n.
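For power-of-two arities the index arithmetic reduces to shifts: `first_child(i) = (i << shift) + 1` and `parent(i) = (i - 1) >> shift` with `d = 1 << shift`. An illustrative Python version (not the C code):

```python
def sift_down_dary(heap, pos, shift=2):
    """Min-heap sift-down for arity d = 1 << shift (shift=2 is a
    quaternary heap, shift=3 octonary)."""
    n = len(heap)
    d = 1 << shift
    while True:
        first = (pos << shift) + 1          # first child via shift
        if first >= n:
            break
        # Select the smallest of up to d children (the step the
        # extension accelerates with SIMD).
        best = min(range(first, min(first + d, n)), key=heap.__getitem__)
        if heap[best] < heap[pos]:
            heap[pos], heap[best] = heap[best], heap[pos]
            pos = best
        else:
            break

def heapify_dary(heap, shift=2):
    """O(n) bottom-up construction of a d-ary heap, d = 1 << shift."""
    if len(heap) > 1:
        for i in range((len(heap) - 2) >> shift, -1, -1):
            sift_down_dary(heap, i, shift)
```

Higher arity makes the tree shallower (log_d n levels) and keeps children contiguous in memory, which is why wide heaps pay off for large n despite more comparisons per level.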
Precomputed Key Caching
Keys computed in a single O(n) pass via Python 3.8+ vectorcall protocol, eliminating one tuple allocation per element. Stack-buffered for n ≤ 128.
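The decorate-once strategy can be approximated in pure Python with the standard heapq module; the extension instead stores keys in a C array (stack-buffered for small n) and invokes the key function via vectorcall:

```python
import heapq

def heapify_with_key(items, key):
    """Call key(x) exactly once per element (O(n) calls), heapify the
    cached (key, index) pairs, then rearrange the originals. Without
    caching, O(n log n) comparisons would mean O(n log n) key calls."""
    decorated = [(key(x), i) for i, x in enumerate(items)]  # one call each
    heapq.heapify(decorated)        # compares cached keys only; the index
                                    # tiebreaker keeps originals uncompared
    items[:] = [items[i] for _, i in decorated]
```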
Zero-Overhead Fast Paths
push(heap, item) and pop(heap) bypass all argument parsing for the common case, saving ~20ns per call. Bulk push delegates to O(n+k) heapify when k ≥ n.
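The k ≥ n crossover can be sketched with heapq (`bulk_push` is a hypothetical helper name):

```python
import heapq

def bulk_push(heap, items):
    """Push k items into a valid heap of size n: k individual pushes
    cost O(k log n), but once k >= n a single concatenate-and-heapify
    is O(n + k) and wins."""
    if len(items) >= len(heap):
        heap.extend(items)
        heapq.heapify(heap)         # Floyd rebuild over all n + k elements
    else:
        for item in items:
            heapq.heappush(heap, item)
```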
Memory Safety
Py_INCREF/DECREF on all objects across comparisons. List-size validation after every Python callback. NaN-aware ordering across all float paths. No undefined behavior.
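NaN-aware ordering matters because `nan < x` and `x < nan` are both false, which silently corrupts comparison-based heap invariants. One common fix is a total order that sorts NaN last; a sketch follows (illustrative only: the placement heapx actually chooses may differ):

```python
import math

def nan_aware_less(a, b):
    """Total order on floats that places NaN after every number, so
    sift operations never rely on NaN's always-False comparisons."""
    a_nan, b_nan = math.isnan(a), math.isnan(b)
    if a_nan or b_nan:
        return b_nan and not a_nan   # only "number < NaN" is true
    return a < b
```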
Stack-First Allocation
Key arrays ≤128 and value arrays ≤2,048 elements use stack buffers with architecture-aligned hints, avoiding malloc/free overhead for the common case.
Universal Compatibility
Python 3.9–3.14. macOS, Linux, Windows. x86-64, ARM64, RISC-V, POWER. GCC, Clang, MSVC. Works on any mutable sequence, not just lists. Zero dependencies.
heapify (O(n)): Transforms a mutable sequence into a valid heap in-place using Floyd's bottom-up algorithm. Two-phase dispatch: homogeneous type-specialized paths first, then generic arity-selected paths.
push (O(log n) single · O(n+k) bulk): Inserts one or more items. Lists trigger batch mode. Zero-overhead fast path for single binary min-heap push saves ~20ns per call. Bulk push delegates to heapify when k ≥ n.
pop (O(log n) single · O(k log n) bulk): Removes and returns top element(s). n=1 returns a single item, n>1 returns a list. Type-specialized sift-down for bulk pop, with a GIL-releasing path for homogeneous arrays.
remove (O(log n) single · O(k + n) batch): Removes items by index, object identity, or predicate. Single-index hot path achieves O(log n) via inline sift-up/sift-down without full re-heapify.
replace (O(log n) single · O(k + n) batch): Replaces items by index, identity, or predicate while maintaining the heap property. Single-index path writes the value and performs an O(log n) sift.
merge (O(N) total): Merges two or more sequences into a single valid heap. O(N) concatenation followed by Floyd's heapify with full dispatch including SIMD/nogil paths.
Build flags: -O3 -march=native -flto. Conda builds use portable baselines.