Set Deduplication: Creation and Common Operations of Python Sets

In Python, we often encounter situations where we need to remove duplicate elements, such as processing a list of data records with duplicates. In such cases, the set is a very practical tool. Its most notable features are element uniqueness and unordered structure. Using sets, we can easily achieve data deduplication and also perform set operations like intersection and union.

I. Why Use Sets for Deduplication?

Suppose we have a list [1, 2, 2, 3, 3, 3]. Removing duplicates with lists alone means writing a loop that checks each element against those already collected. A set does the same job in one expression, set([1, 2, 2, 3, 3, 3]), with clearer logic, because a set can never contain the same element twice.
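For comparison, here is a minimal sketch of the manual loop next to the set-based version. The variable names (data, unique_loop, unique_set) are illustrative only.

```python
data = [1, 2, 2, 3, 3, 3]

# Manual approach: keep an element only if it has not been collected yet
unique_loop = []
for item in data:
    if item not in unique_loop:
        unique_loop.append(item)

# Set-based approach: duplicates are discarded when the set is built
unique_set = set(data)

print(unique_loop)              # [1, 2, 3]
print(unique_set == {1, 2, 3})  # True
```

The loop also scans unique_loop on every membership check, so it tends to slow down on long lists, while set membership checks are typically constant time.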

II. Creating Sets

1. Directly with Curly Braces {}

Sets are defined using curly braces {}, with elements separated by commas. Note: Duplicate elements are automatically ignored.

# Define a set with duplicate elements
my_set = {1, 2, 3, 2, 3}
print(my_set)  # Output: {1, 2, 3} (duplicates are automatically removed)

2. Using the set() Function

To convert other iterable objects (e.g., lists, tuples) to sets, the set() function is more convenient.

# Create a set from a list (automatically removes duplicates)
my_list = [1, 2, 2, 3, 4]
my_set = set(my_list)
print(my_set)  # Output: {1, 2, 3, 4}
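The same call accepts any iterable, not only lists. A small sketch with a tuple, a string, and a range (the variable names are made up for illustration):

```python
from_tuple = set((1, 1, 2))   # tuple with a duplicate
from_string = set("hello")    # a string is iterated character by character
from_range = set(range(3))    # 0, 1, 2

print(from_tuple == {1, 2})                 # True
print(from_string == {"h", "e", "l", "o"})  # True (the repeated "l" is kept once)
print(from_range == {0, 1, 2})              # True
```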

Important: Creating Empty Sets

  • {} cannot represent an empty set (it is an empty dictionary, which we will cover later).
  • Empty sets must be created with set():
  empty_set = set()
  print(empty_set)  # Output: set()
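A quick way to confirm the difference is to check the type of each object. This is only a sketch, and empty_dict is an illustrative name:

```python
empty_dict = {}
empty_set = set()

print(type(empty_dict).__name__)  # dict
print(type(empty_set).__name__)   # set
print(len(empty_set))             # 0
```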

III. Common Set Operations

1. Basic Operations: Add and Remove Elements

  • Add Elements: Use the add() method to add a single element.
  my_set = {1, 2, 3}
  my_set.add(4)  # Add element 4
  print(my_set)  # Output: {1, 2, 3, 4}
  • Remove Elements: Use remove() or discard().
    • remove(x): Raises a KeyError if x does not exist.
    • discard(x): Does nothing if x does not exist (safer).
  my_set = {1, 2, 3}
  my_set.remove(2)  # Remove element 2
  print(my_set)  # Output: {1, 3}

  my_set.discard(5)  # Remove non-existent element 5 (no error)
  print(my_set)  # Output: {1, 3}
  • Arbitrary Removal: Use pop(). Because sets are unordered, it removes and returns an arbitrary element; which one you get is not guaranteed.
  my_set = {1, 2, 3}
  removed = my_set.pop()  # Remove an arbitrary element and return it
  print(removed)  # Output: one of 1, 2, 3 (for example, 1)
  print(my_set)   # Output: the two remaining elements (for example, {2, 3})
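The error raised by remove() on a missing element is a KeyError, and pop() raises the same error on an empty set. The sketch below shows one way to handle both. try_remove is a hypothetical helper written for this example, not a built-in.

```python
def try_remove(items, value):
    """Hypothetical helper: remove value if present and report whether it was there."""
    try:
        items.remove(value)  # raises KeyError when value is missing
        return True
    except KeyError:
        return False

my_set = {1, 3}
print(try_remove(my_set, 3))  # True
print(try_remove(my_set, 5))  # False
print(my_set)                 # {1}

# pop() on an empty set also raises KeyError
empty = set()
try:
    empty.pop()
    pop_failed = False
except KeyError:
    pop_failed = True
print(pop_failed)  # True
```

When the outcome does not matter, discard() is usually the simpler choice, since it needs no error handling at all.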

2. Set Operations (Data Processing After Deduplication)

A core feature of sets is handling data relationships, such as intersection, union, and difference. These operations can be implemented with symbols or methods, as shown below:

  • Intersection: Elements present in both sets.
    • Operator: & or method intersection()
  a = {1, 2, 3, 4}
  b = {3, 4, 5, 6}
  # Intersection: {3, 4}
  print(a & b)          # Output: {3, 4}
  print(a.intersection(b))  # Output: {3, 4}
  • Union: Combine all elements from both sets (removes duplicates).
    • Operator: | or method union()
  a = {1, 2, 3, 4}
  b = {3, 4, 5, 6}
  # Union: {1, 2, 3, 4, 5, 6}
  print(a | b)          # Output: {1, 2, 3, 4, 5, 6}
  print(a.union(b))     # Output: {1, 2, 3, 4, 5, 6}
  • Difference: Elements in the first set but not in the second.
    • Operator: - or method difference()
  a = {1, 2, 3, 4}
  b = {3, 4, 5, 6}
  # Difference: {1, 2} (elements in a but not in b)
  print(a - b)          # Output: {1, 2}
  print(a.difference(b))  # Output: {1, 2}
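To tie these operations back to deduplicated data, here is a small sketch using two made-up visitor lists. All names and values are hypothetical. It also shows one practical difference between the two spellings: the operators need a set on both sides, while the methods accept any iterable.

```python
# Hypothetical records with duplicates
monday_visitors = ["ann", "bob", "bob", "cat"]
tuesday_visitors = ["bob", "dan", "dan"]

monday = set(monday_visitors)    # {"ann", "bob", "cat"}
tuesday = set(tuesday_visitors)  # {"bob", "dan"}

both_days = monday & tuesday     # visited on both days
any_day = monday | tuesday       # visited at least once
monday_only = monday - tuesday   # visited Monday but not Tuesday

# Methods accept any iterable, so the raw list works without converting it first
also_both = monday.intersection(tuesday_visitors)

print(sorted(both_days))       # ['bob']
print(sorted(any_day))         # ['ann', 'bob', 'cat', 'dan']
print(sorted(monday_only))     # ['ann', 'cat']
print(also_both == both_days)  # True
```

sorted() is used only to make the printed order predictable; the sets themselves have no defined order.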

IV. Set Characteristics (Must-Know for Beginners)

  1. Unordered: Sets do not preserve element insertion order, so elements cannot be accessed by index (e.g., my_set[0] raises a TypeError).
   my_set = {1, 2, 3}
   print(my_set[0])  # Error: TypeError: 'set' object is not subscriptable
  2. Hashable Elements: Set elements must be hashable, which in practice means immutable types (e.g., numbers, strings, and tuples whose own items are hashable). Sets cannot contain mutable types like lists or dictionaries.
   invalid_set = {[1, 2], 3}  # Error: TypeError: unhashable type: 'list'
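Both characteristics have simple workarounds. The sketch below (with illustrative names and data) stores tuples instead of lists, and sorts a set into a list when a position is needed:

```python
# Tuples of hashable items are valid set elements
points = {(0, 0), (1, 2), (0, 0)}
print(len(points))  # 2

# Lists are unhashable, so convert each row to a tuple before deduplicating
rows = [[1, 2], [3, 4], [1, 2]]
unique_rows = {tuple(row) for row in rows}
print(sorted(unique_rows))  # [(1, 2), (3, 4)]

# No indexing: iterate instead, or sort into a list when a position is needed
numbers = {3, 1, 2}
smallest = sorted(numbers)[0]
print(smallest)  # 1
```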

V. Practical Case: List Deduplication

Deduplicating a list with sets is simple: convert the list to a set and back to a list.

# Original list (with duplicates)
my_list = [1, 2, 2, 3, 3, 3, 4]

# Deduplicate with set and convert back to list
unique_list = list(set(my_list))
print(unique_list)  # Output: [1, 2, 3, 4] (order is not guaranteed because sets are unordered)

To preserve the original order, combine a set with a list comprehension. The set only tracks which elements have already been seen, so this works in any Python 3 version:

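my_list = [1, 2, 2, 3, 3, 3, 4]
seen = set()  # Elements encountered so far
# seen.add(x) returns None, so the condition is true only the first time x appears
unique_list = [x for x in my_list if not (x in seen or seen.add(x))]
print(unique_list)  # Output: [1, 2, 3, 4] (order preserved)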
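An alternative that some readers find easier to follow is dict.fromkeys(). It relies on dictionaries keeping insertion order, which the language guarantees from Python 3.7 onward. A minimal sketch, with input values chosen to make the preserved order visible:

```python
my_list = [3, 1, 3, 2, 1]

# Dictionary keys are unique and keep first-seen order (Python 3.7+)
unique_list = list(dict.fromkeys(my_list))
print(unique_list)  # [3, 1, 2]
```

Like a set, this requires every element to be hashable.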

VI. Summary

Sets are powerful for handling unordered, unique data in Python, especially for deduplication and set operations. Key takeaways:
- Creation: Use {} or set(), with empty sets only possible via set().
- Deduplication: One-line solution with list(set(duplicate_list)).
- Operations: Add (add()), remove (remove()/discard()), and set operations (intersection &, union |, difference -).
- Characteristics: Unordered (no indexing); elements must be hashable (no lists/dictionaries).

With practice, you’ll master using sets for deduplication and data processing efficiently!

Xiaoye