heapify(), merge(), nlargest(), nsmallest()

heapify(), merge()

此函數在線性時間內將 list x 轉換為 heap。

PYTHON

heapify(x)

List in-place 重新排列，這意味著不使用額外的記憶體。複雜度為 O(n)，其中 n 是 x 中的項目數。

當你有一個未排序的 list，但需要按排序順序重複提取元素 (尤其是最小的元素) 時，此函數特別有用。

將多個排序輸入 *iterables，合併為一個排序輸出 (返回 iterator)。

類似於 sorted(itertools.chain(*iterables))，但返回一個 iterable 物件，不會一次將資料全部拉入內存，並假設每個輸入 streams 都已排序 (從小到大)。

在您有多個排序序列 (例如文件或 lists)，並且需要將它們有效且按排序順序合併為單個 sequence 的情況下很有用。

PYTHON

merge(*iterables，key=None，reverse=False)

參數

說明

key

指定一個參數的 key function，用於從每個輸入元素中提取比較 key。預設值為 None (直接比較元素)。

reverse

反轉。要實作類似 sorted(itertools.chain(*iterables), reverse=True) 的行為，所有可迭代物件都必須從最大到最小排序。

範例 – Basic Usage

Example: 在未排序的 list 中，尋找 3 個最小的元素。

PYTHON

import heapq

data = [9, 1, 5, 6, 3, 7, 8, 2, 4]
heapq.heapify(data)
print(data)  # Output: [1, 2, 5, 4, 3, 7, 8, 6, 9]

# Extracting the 3 smallest elements
smallest_elements = [heapq.heappop(data) for _ in range(3)]
print(smallest_elements)  # Output: [1, 2, 3]

如果您只需要尋找一次最小或最大元素，請避免使用 heapq.heapify。在這種情況下，min() 或 max() 效率更高。

Example: 合併多個排序 lists。

PYTHON

import heapq

data1 = [1, 3, 5]
data2 = [2, 4, 6]
data3 = [0, 7, 8]

merged = heapq.merge(data1, data2, data3)
print(list(merged))  # Output: [0, 1, 2, 3, 4, 5, 6, 7, 8]

data1 = [5, 3, 1]
data2 = [6, 4, 2]
data3 = [8, 7, 0]

merged = heapq.merge(data1, data2, data3, reverse=True)
print(list(merged))  # Output: [8, 7, 6, 5, 4, 3, 2, 1, 0]

data1 = [1, 3, 5]
data2 = [2, 4, 6]
data3 = [0, 7, 8]

# Merging with a custom key (negative of the actual numbers)
merged = heapq.merge(data1, data2, data3, key=lambda x: -x, reverse=True)
print(list(merged))  # Output: [0, 1, 2, 3, 4, 5, 6, 7, 8]

nlargest(), nsmallest()

從 iterable 找出 n 個最大元素。

PYTHON

nlargest(n, iterable, key=None)

參數

說明

要返回的元素數量。

iterable

從中尋找元素的資料。

key

指定一個參數的 key function，用於從每個輸入元素中提取比較 key。預設值為 None (直接比較元素)。

從 iterable 找出 n 個最小元素。

PYTHON

nsmallest(n, iterable, key=None)

參數

說明

要返回的元素數量。

iterable

從中尋找元素的資料。

key

指定一個參數的 key function，用於從每個輸入元素中提取比較 key。預設值為 None (直接比較元素)。

heapq.nlargest 和 heapq.nsmallest 對於只需要幾個元素的大型資料集特別有效。它們的性能比對整個 list 進行排序，然後對其進行 slicing 要好，尤其是當 n 遠小於輸入的大小時。

範例 – Basic Usage

Example 1: Basic Usage

PYTHON

import heapq

numbers = [10, 4, 7, 1, 12, 8, 3, 2]
print("3 Largest:", heapq.nlargest(3, numbers))    # Output: 3 Largest: [12, 10, 8]
print("3 Smallest:", heapq.nsmallest(3, numbers))  # Output: 3 Smallest: [1, 2, 3]

Example 2: 使用 key function

PYTHON

import heapq

data = [{'name': 'John', 'age': 45},
        {'name': 'Diana', 'age': 30},
        {'name': 'Alice', 'age': 25}]

oldest = heapq.nlargest(2, data, key=lambda x: x['age'])
print(oldest)

youngest = heapq.nsmallest(2, data, key=lambda x: x['age'])
print(youngest)

執行結果:

[{'name': 'John', 'age': 45}, 
 {'name': 'Diana', 'age': 30}]
 
[{'name': 'Alice', 'age': 25}, 
 {'name': 'Diana', 'age': 30}]

範例 – Maintained Heap

如果您需要頻繁更新集合並檢索 n 個最大或最小元素，則 Maintained heaps 或 Balanced Binary Search Tree 等資料結構可能會更有效。

在這種情況下，每次使用 heapq.nlargest 和 heapq.nsmallest 可能會效率很低，因為這些函數需要在每次呼叫時，處理整個資料集。

相反，maintaining a heap 或使用 balanced binary search tree (如, AVL Tree, Red-Black Tree) 可能會更有效。這些資料結構允許更快的插入和刪除操作，同時還提供存取最大或最小元素的有效方法。

Example 1: 假設您有一個不斷更新的 stream，並且您需要追蹤 3 個最大的元素。使用 min-heap 可能是有效的。

PYTHON

import heapq

class DynamicLargest:
    def __init__(self, n):
        self.n = n
        self.heap = []

    def add(self, item):
        if len(self.heap) < self.n:
            heapq.heappush(self.heap, item)
        elif item > self.heap[0]:
            heapq.heapreplace(self.heap, item)

    def get_largest(self):
        return sorted(self.heap, reverse=True)

# Usage
dl = DynamicLargest(3)
stream = [5, 2, 8, 3, 9, 1, 7, 10, 4, 6]

for number in stream:
    dl.add(number)
    print("Current 3 largest:", dl.get_largest())

執行結果:

Current 3 largest: [5]
Current 3 largest: [5, 2]
Current 3 largest: [8, 5, 2]
Current 3 largest: [8, 5, 3]
Current 3 largest: [9, 8, 5]
Current 3 largest: [9, 8, 5]
Current 3 largest: [9, 8, 7]
Current 3 largest: [10, 9, 8]
Current 3 largest: [10, 9, 8]
Current 3 largest: [10, 9, 8]

Balanced Binary Search Tree (BST) 對於動態資料集來說甚至更有效率，因為它維護有序結構，允許快速插入、刪除和檢索。但是，Python 沒有內建的 balanced BST，因此您可以使用 third-party library 或實作一個。

Key Takeaways

Heapq 方法: 非常適合靜態資料集或更新不頻繁的情況。
Maintained heaps: 更適合頻繁添加，但相對不頻繁刪除的場景。
Balanced BSTs: 非常適合頻繁新增和刪除的資料集，提供高效率的插入、刪除和檢索操作。

Challenges

排序穩定性和 Tuple 比較問題

排序穩定性：在優先權佇列中，您希望具有相同優先權的任務按照新增的順序進行處理。這稱為 "sort stability"。
Tuple 比較問題：如果您只是使用像 (priority, task) 這樣的 tuples 作為 queue items。當優先權相等時，Python 將比較元組的兩個元素。如果 task 物件無法相互比較，就會出現問題。

提供的解決方案是使用 3 元素的 list or tuple，即 (priority、entry_count、task)。這就是為什麼它有效的原因：

Entry Count：這是每個任務的唯一識別碼 (如時間戳記)。當兩個任務具有相同優先權時，heapq module 將比較它們的條目計數。由於這些都是唯一的，並且對於每個新任務都是遞增的，因此它確保了排序穩定性。
Tuple 比較：透過條目計數，Python's tuple 比較不需要直接比較任務，從而避免了不可比較任務的問題。

PYTHON

import heapq

class PriorityQueue:
    def __init__(self):
        self.heap = []
        self.entry_count = 0

    def add_task(self, task, priority=0):
        entry = (priority, self.entry_count, task)
        heapq.heappush(self.heap, entry)
        self.entry_count += 1

    def pop_task(self):
        if self.heap:
            _, _, task = heapq.heappop(self.heap)
            return task
        raise KeyError('pop from an empty priority queue')

# Usage
pq = PriorityQueue()
pq.add_task("Task 1", priority=2)
pq.add_task("Task 2", priority=1)
pq.add_task("Task 3", priority=2)

print(pq.pop_task())  # Output: Task 2 (since it has the highest priority)
print(pq.pop_task())  # Output: Task 1
print(pq.pop_task())  # Output: Task 3

解決不可比較任務問題的另一個解決方案是建立一個 wrapper class，該 class 忽略 task item 並且僅比較優先順序 field。

PYTHON

from dataclasses import dataclass, field
from typing import Any
import heapq

@dataclass(order=True)
class PrioritizedItem:
    priority: int
    item: Any = field(compare=False)

# Initialize an empty priority queue
pq = []

# Add tasks to the queue
heapq.heappush(pq, PrioritizedItem(1, "Task with lower priority"))
heapq.heappush(pq, PrioritizedItem(5, "Task with higher priority"))
heapq.heappush(pq, PrioritizedItem(3, "Task with medium priority"))

# Pop tasks from the queue
while pq:
    task = heapq.heappop(pq)
    print(task.item)  # This will print tasks in the order of their priorities

執行結果:

Task with lower priority
Task with medium priority
Task with higher priority

當使用 heappop() 擷取項目時，它們會依優先順序遞增的順序返回 (數字越小表示優先順序越高)。每個PrioritizedItem 實例的 item 屬性儲存實際任務，這是不可比較的，但這不會影響佇列操作，因為僅使用 priority 進行比較。

當您的任務無法相互比較，但需要確定優先順序時，此方法特別有用。 dataclasses module 可以方便地建立具有所需比較行為的結構化物件。

更改優先權和刪除任務

更改優先權：如何處理任務新增到 queue 後，優先權發生變更的情況。
刪除任務：如何從 queue 中尋找並刪除特定任務，尤其是待處理的任務。

刪除條目或更改其優先順序更加困難，因為這會破壞 heap 結構不變量。因此，一個可能的解決方案是將條目標記為已刪除，並新增具有修改優先順序的新條目：

PYTHON

import itertools
from heapq import heappush, heappop

pq = []                         # list of entries arranged in a heap
entry_finder = {}               # mapping of tasks to entries
REMOVED = '<removed-task>'      # placeholder for a removed task
counter = itertools.count()     # unique sequence count

def add_task(task, priority=0):
    'Add a new task or update the priority of an existing task'
    if task in entry_finder:
        remove_task(task)
    count = next(counter)
    entry = [priority, count, task]
    entry_finder[task] = entry
    heappush(pq, entry)

def remove_task(task):
    'Mark an existing task as REMOVED.  Raise KeyError if not found.'
    entry = entry_finder.pop(task)
    entry[-1] = REMOVED

def pop_task():
    'Remove and return the lowest priority task. Raise KeyError if empty.'
    while pq:
        priority, count, task = heappop(pq)
        if task is not REMOVED:
            del entry_finder[task]
            return task
    raise KeyError('pop from an empty priority queue')


def change_task_priority(task, new_priority):
    'Change the priority of an existing task'
    if task not in entry_finder or entry_finder[task][-1] is REMOVED:
        raise KeyError('Task not found')
    remove_task(task)
    add_task(task, new_priority)

# Example usage
add_task("Task 1", 2)
add_task("Task 2", 1)
add_task("Task 3", 3)
print(pq)           # Output: [[1, 1, 'Task 2'], [2, 0, 'Task 1'], [3, 2, 'Task 3']]
print(entry_finder) # Output: {'Task 1': [2, 0, 'Task 1'], 'Task 2': [1, 1, 'Task 2'], 'Task 3': [3, 2, 'Task 3']}

print("\nInitial queue: [priority, count, task]")
for task in pq:
    if task[-1] != REMOVED:
        print(task)

change_task_priority("Task 1", 4)
print(pq)           # Output: [[1, 1, 'Task 2'], [2, 0, '<removed-task>'], [3, 2, 'Task 3'], [4, 3, 'Task 1']]
print(entry_finder) # Output: {'Task 2': [1, 1, 'Task 2'], 'Task 3': [3, 2, 'Task 3'], 'Task 1': [4, 3, 'Task 1']}

print("\nQueue after changing priority of Task 1:")
for task in pq:
    if task[-1] != REMOVED:
        print(task)

# Pop tasks
print("\nPopped tasks:")
while pq:
    try:
        print(pop_task())
    except KeyError:
        break
print(pq)           # Output: []
print(entry_finder) # Output: {}

執行結果:

[[1, 1, 'Task 2'], [2, 0, 'Task 1'], [3, 2, 'Task 3']]

Initial queue: [priority, count, task]
[1, 1, 'Task 2']
[2, 0, 'Task 1']
[3, 2, 'Task 3']

Queue after changing priority of Task 1:
[1, 1, 'Task 2']
[3, 2, 'Task 3']
[4, 3, 'Task 1']

Popped tasks:
Task 2
Task 3
Task 1

參考資料

heapq — Heap queue algorithm — Python 3.12.0b4 documentation

Binary Heap Basics. In my last article I outlined the tree… | by Michael Verdi | Medium

Previousheapq 【堆積佇列】Nexttime 【時間】

Last updated 1 year ago

Was this helpful?