How do I remove duplicates from a list, while preserving order? Using a set to remove duplicates destroys the original order.
Is there a built-in or a Pythonic idiom?
Here you have some alternatives: http://www.peterbe.com/plog/uniqifiers-benchmark
```python
def f7(seq):
    seen = set()
    seen_add = seen.add
    return [x for x in seq if not (x in seen or seen_add(x))]
```
Why assign seen.add to seen_add instead of just calling seen.add? Python is a dynamic language, and resolving seen.add each iteration is more costly than resolving a local variable. seen.add could have changed between iterations, and the runtime isn't smart enough to rule that out. To play it safe, it has to check the object each time.
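The effect is easy to check yourself. A minimal sketch comparing the two spellings (the function names here are illustrative, not part of any library):

```python
import timeit

def dedupe_attr(seq):
    # Resolves seen.add afresh on every iteration (attribute lookup each time)
    seen = set()
    return [x for x in seq if not (x in seen or seen.add(x))]

def dedupe_local(seq):
    # Binds the bound method once, so each iteration resolves a cheap local name
    seen = set()
    seen_add = seen.add
    return [x for x in seq if not (x in seen or seen_add(x))]

data = list(range(1000)) * 3

# Both produce identical results; only the lookup cost differs
assert dedupe_attr(data) == dedupe_local(data) == list(range(1000))

# Exact numbers vary by machine and interpreter; on CPython the
# local-alias version is typically somewhat faster
t_attr = timeit.timeit(lambda: dedupe_attr(data), number=100)
t_local = timeit.timeit(lambda: dedupe_local(data), number=100)
print(f"attribute lookup: {t_attr:.4f}s  local alias: {t_local:.4f}s")
```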
If you plan on using this function a lot on the same dataset, perhaps you would be better off with an ordered set: http://code.activestate.com/recipes/528878/
O(1) insertion, deletion and member-check per operation.
(Small additional note: seen.add() always returns None, so the or above is there only as a way to attempt a set update, and not as an integral part of the logical test.)
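A tiny demonstration of that point:

```python
seen = set()

# set.add mutates the set in place...
result = seen.add(1)
# ...and always returns None, which is falsey
assert result is None
assert 1 in seen

# So for a new element x, `x in seen or seen.add(x)` evaluates falsey,
# but adding x to the set happens as a side effect of the `or`:
assert not (3 in seen or seen.add(3))
assert 3 in seen
```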
The best solution varies by Python version and environment constraints:
Python 3.7+ (and most interpreters supporting 3.6, as an implementation detail)
First introduced in PyPy 2.5.0, and adopted in CPython 3.6 as an implementation detail, before being made a language guarantee in Python 3.7, plain dict is insertion-ordered, and even more efficient than the (also C-implemented as of CPython 3.5) collections.OrderedDict. So the fastest solution, by far, is also the simplest:
```python
>>> items = [1, 2, 0, 1, 3, 2]
>>> list(dict.fromkeys(items))  # Or [*dict.fromkeys(items)] if you prefer
[1, 2, 0, 3]
```
Like list(set(items)), this pushes all the work to the C layer (on CPython), but since dicts are insertion-ordered, dict.fromkeys doesn't lose ordering. It's slower than list(set(items)) (typically takes 50-100% longer), but much faster than any other order-preserving solution (takes about half the time of hacks involving use of sets in a listcomp).
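If you want to see that trade-off on your own machine, a rough timing sketch (the numbers printed will vary by interpreter, machine, and input):

```python
import timeit

items = list(range(10_000)) * 2  # plenty of duplicates

# Fastest, but order is not preserved: set iteration order is arbitrary
t_set = timeit.timeit(lambda: list(set(items)), number=200)
# Order preserved: dict keys keep first-insertion order (guaranteed in 3.7+)
t_dict = timeit.timeit(lambda: list(dict.fromkeys(items)), number=200)

print(f"list(set(items)):           {t_set:.3f}s")
print(f"list(dict.fromkeys(items)): {t_dict:.3f}s")

assert list(dict.fromkeys(items)) == list(range(10_000))
```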
Important note: The unique_everseen solution from more_itertools (see below) has some unique advantages in terms of laziness and support for non-hashable input items; if you need those features, it's the only solution that will work.
Python 3.5 (and all older versions if performance isn’t critical)
As Raymond pointed out, in CPython 3.5, where OrderedDict is implemented in C, ugly list comprehension hacks are slower than OrderedDict.fromkeys (unless you actually need the list at the end, and even then, only if the input is very short). So on both performance and readability, the best solution for CPython 3.5 is the OrderedDict equivalent of the 3.6+ use of plain dict:
```python
>>> from collections import OrderedDict
>>> items = [1, 2, 0, 1, 3, 2]
>>> list(OrderedDict.fromkeys(items))
[1, 2, 0, 3]
```
On CPython 3.4 and earlier, this will be slower than some other solutions, so if profiling shows you need a better solution, keep reading.
Python 3.4 and earlier, if performance is critical and third-party modules are acceptable
As @abarnert notes, the more_itertools library (pip install more_itertools) contains a unique_everseen function that is built to solve this problem without any unreadable (not seen.add) mutations in list comprehensions. This is the fastest solution too:
```python
>>> from more_itertools import unique_everseen
>>> items = [1, 2, 0, 1, 3, 2]
>>> list(unique_everseen(items))
[1, 2, 0, 3]
```
Just one simple library import and no hacks.
The module adapts the itertools recipe unique_everseen, which looks like:

```python
from itertools import filterfalse

def unique_everseen(iterable, key=None):
    "List unique elements, preserving order. Remember all elements ever seen."
    # unique_everseen('AAAABBBCCDAABBB') --> A B C D
    # unique_everseen('ABBCcAD', str.lower) --> A B C D
    seen = set()
    seen_add = seen.add
    if key is None:
        for element in filterfalse(seen.__contains__, iterable):
            seen_add(element)
            yield element
    else:
        for element in iterable:
            k = key(element)
            if k not in seen:
                seen_add(k)
                yield element
```
but unlike the itertools recipe, it supports non-hashable items (at a performance cost: if all elements in iterable are non-hashable, the algorithm becomes O(n²), versus O(n) if they're all hashable).
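That fallback behavior can be sketched roughly like this (a simplified illustration of the idea, not more_itertools' actual implementation):

```python
def unique_everseen_sketch(iterable, key=None):
    # Hashable keys go in a set (O(1) membership checks); unhashable keys
    # fall back to a list (O(n) linear scans), hence O(n^2) overall when
    # everything is unhashable.
    seen_set = set()
    seen_list = []
    for element in iterable:
        k = element if key is None else key(element)
        try:
            if k not in seen_set:
                seen_set.add(k)
                yield element
        except TypeError:  # unhashable key
            if k not in seen_list:
                seen_list.append(k)
                yield element

assert list(unique_everseen_sketch([1, 2, 0, 1, 3, 2])) == [1, 2, 0, 3]
# Non-hashable elements (here, lists) still work, just more slowly:
assert list(unique_everseen_sketch([[1], [2], [1]])) == [[1], [2]]
# key= deduplicates case-insensitively, as in the recipe's docstring:
assert list(unique_everseen_sketch('ABBCcAD', str.lower)) == list('ABCD')
```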
Important note: Unlike all the other solutions here, unique_everseen can be used lazily; the peak memory usage will be the same (eventually, the underlying set grows to the same size), but if you don't listify the result and just iterate it, you'll be able to process unique items as they're found, rather than waiting until the entire input has been deduplicated before processing the first unique item.
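The laziness matters most when the input is a stream. A minimal generator version (no key support, just for illustration) can pull unique items from an endless iterable, which no list-building approach could do:

```python
from itertools import count, islice

def unique_everseen_lazy(iterable):
    # Minimal lazy dedupe: yields each element the first time it appears
    seen = set()
    for element in iterable:
        if element not in seen:
            seen.add(element)
            yield element

# An infinite stream: 0 1 2 3 4 0 1 2 3 4 0 ...
stream = (n % 5 for n in count())

# Lazy consumption: we get the first few unique values without ever
# exhausting (or even fully buffering) the input
first_three = list(islice(unique_everseen_lazy(stream), 3))
assert first_three == [0, 1, 2]
```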
Python 3.4 and earlier, if performance is critical and third-party modules are unavailable
You have two options:
1. Copy and paste the unique_everseen recipe into your code and use it per the more_itertools example above
2. Use ugly hacks to allow a single listcomp to both check and update a set to track what's been seen:

```python
seen = set()
[x for x in seq if x not in seen and not seen.add(x)]
```

at the expense of relying on the ugly hack:

```python
not seen.add(x)
```

which relies on the fact that set.add is an in-place method that always returns None, so not None evaluates to True.
Note that all of the solutions above are O(n) (save calling unique_everseen on an iterable of non-hashable items, which is O(n²), while the others would fail immediately with a TypeError), so all solutions are performant enough when they're not the hottest code path. Which one to use depends on which versions of the language spec/interpreter/third-party modules you can rely on, whether or not performance is critical (don't assume it is; it usually isn't), and most importantly, readability (because if the person who maintains this code later ends up in a murderous mood, your clever micro-optimization probably wasn't worth it).
In CPython 3.6+ (and all other Python implementations starting with Python 3.7+), dictionaries are ordered, so the way to remove duplicates from an iterable while keeping it in the original order is:
```python
>>> list(dict.fromkeys('abracadabra'))
['a', 'b', 'r', 'c', 'd']
```
In Python 3.5 and below (including Python 2.7), use the OrderedDict. My timings show that this is now both the fastest and shortest of the various approaches for Python 3.5 (when it gained a C implementation; prior to 3.5 it's still the clearest solution, though not the fastest).
```python
>>> from collections import OrderedDict
>>> list(OrderedDict.fromkeys('abracadabra'))
['a', 'b', 'r', 'c', 'd']
```