Implementing Bucket Sort Algorithm with Python

Bucket Sort is a non-comparison-based sorting algorithm that leverages the divide-and-conquer principle. It distributes data into multiple “buckets”, sorts each bucket individually, and then merges the results to achieve overall order. The core idea is “divide and conquer”, making it suitable for scenarios where data is uniformly distributed and has a limited range.

Basic Idea of Bucket Sort¶

Bucketing: Divide the data into several buckets based on distribution characteristics. For example, if data is uniformly distributed between 0 and 1, split the interval [0,1] into n equal subintervals (each subinterval corresponds to a bucket, where n is the number of data points).
Sorting Within Buckets: Since each bucket contains a small number of elements, use simple sorting algorithms (e.g., insertion sort or built-in sort functions) to sort elements inside each bucket.
Merging Buckets: Extract elements from buckets in order and combine them into the final sorted array.

Applicable Scenarios¶

Uniform Data Distribution: When data is evenly distributed within a continuous interval (e.g., between 0 and 1), bucket sort has a time complexity close to linear (O(n)).
Limited Range: If the data range is clear (e.g., integers from 0 to 1000), it is easy to set the number and size of buckets.
Not Applicable: If data is unevenly distributed (e.g., most elements concentrate in a few buckets), the algorithm may degenerate to O(n²), becoming less efficient than quicksort.

Python Implementation Steps (for Floating-Point Numbers in [0,1])¶

Determine Bucket Count: Set the number of buckets equal to the data length n, where each bucket covers an interval of length 1/n.
Create Buckets: Initialize n empty buckets (a list of lists).
Distribute Data to Buckets: Traverse the data, compute the bucket index using int(num * n), and place elements into the corresponding bucket.
Sort Within Buckets: Use the built-in sort function (e.g., sort()) for each bucket.
Merge Buckets: Traverse all buckets and concatenate elements to form the sorted array.

Code Implementation¶

def bucket_sort(arr):
    # 1. Handle empty array
    if not arr:
        return arr

    n = len(arr)
    # 2. Create n empty buckets
    buckets = [[] for _ in range(n)]

    # 3. Distribute data into buckets (assuming data is in [0,1])
    for num in arr:
        # Calculate bucket index: floor(num * n)
        bucket_index = int(num * n)
        buckets[bucket_index].append(num)

    # 4. Sort each bucket and merge results
    sorted_arr = []
    for bucket in buckets:
        bucket.sort()  # Efficient built-in sorting for small buckets
        sorted_arr.extend(bucket)  # Merge elements from each bucket

    return sorted_arr

# Test
if __name__ == "__main__":
    test_data = [0.42, 0.32, 0.33, 0.52, 0.37, 0.47, 0.51]
    sorted_data = bucket_sort(test_data)
    print("Original data:", test_data)
    print("Sorted data:  ", sorted_data)

Code Explanation¶

Bucket Count: n = len(arr) ensures each bucket covers an interval of 1/n, preventing overflow.
Bucket Index Calculation: int(num * n) maps values in [0,1] to indices in [0, n-1]. For example, num=0.3 with n=7 gives 0.3*7=2.1 → index 2.
Sorting Within Buckets: bucket.sort() uses Python’s efficient built-in Timsort for small datasets.
Merging: sorted_arr.extend(bucket) appends elements from each bucket in order, resulting in the final sorted array.

Output¶

Original data: [0.42, 0.32, 0.33, 0.52, 0.37, 0.47, 0.51]
Sorted data:   [0.32, 0.33, 0.37, 0.42, 0.47, 0.51, 0.52]

Notes¶

Data Range Adjustment: For data outside [0,1], adjust the index calculation (e.g., int((num - min_val) / bucket_size)).
Dynamic Bucket Size: Adjust the number of buckets based on data distribution to avoid overloaded buckets (e.g., increase bucket count for extreme values).

Summary¶

Bucket sort efficiently sorts data through “bucketing → sorting → merging”. It excels in uniformly distributed data scenarios by leveraging divide-and-conquer to reduce complexity. While simple to implement, it requires careful attention to data distribution to avoid performance degradation.