Tags: duplicates, list, python, unique

How do I remove duplicates from a list, while preserving order?

949

How do I remove duplicates from a list, while preserving order? Using a set to remove duplicates destroys the original order.
Is there a built-in or a Pythonic idiom?

Related question: In Python, what is the fastest algorithm for removing duplicates from a list so that all elements are unique while preserving order?

1

  • 7

You may want to consider the 2020 edit to this answer stackoverflow.com/a/17016257/1219006, which now seems to be the best solution for Python 3.6 (as a CPython implementation detail) and 3.7+ (all Python implementations): list(dict.fromkeys(items))

    – jamylak

    Dec 14, 2020 at 5:08


871

Here you have some alternatives: http://www.peterbe.com/plog/uniqifiers-benchmark

Fastest one:

def f7(seq):
    seen = set()
    seen_add = seen.add  # bind the method to a local name once, up front
    return [x for x in seq if not (x in seen or seen_add(x))]

Why assign seen.add to seen_add instead of just calling seen.add? Python is a dynamic language, and resolving seen.add each iteration is more costly than resolving a local variable. seen.add could have changed between iterations, and the runtime isn’t smart enough to rule that out. To play it safe, it has to check the object each time.
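If you want to verify this yourself, the dis module makes the difference visible. A minimal, CPython-specific sketch (the helper function names are just for illustration; the exact opcodes vary by Python version):

import dis

def without_alias(seq):
    seen = set()
    return [x for x in seq if not (x in seen or seen.add(x))]

def with_alias(seq):
    seen = set()
    seen_add = seen.add
    return [x for x in seq if not (x in seen or seen_add(x))]

dis.dis(without_alias)  # the 'add' attribute is resolved on the set object every iteration
dis.dis(with_alias)     # the aliased version just loads an already-bound name instead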

If you plan on using this function a lot on the same dataset, perhaps you would be better off with an ordered set: http://code.activestate.com/recipes/528878/

O(1) insertion, deletion and member-check per operation.

(Small additional note: seen.add() always returns None, so the or above is there only as a way to attempt a set update, and not as an integral part of the logical test.)
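A quick interactive illustration of why the or works:

>>> seen = set()
>>> print(seen.add(1))                  # add mutates the set in place and returns None
None
>>> not (2 in seen or seen.add(2))      # unseen: add() runs, returns None, so the test is True
True
>>> not (2 in seen or seen.add(2))      # already seen: the membership test short-circuits to True
False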

23

  • 24

@JesseDhillon seen.add could have changed between iterations, and the runtime isn’t smart enough to rule that out. To play it safe, it has to check the object each time. — If you look at the bytecode with dis.dis(f), you can see that it executes LOAD_ATTR for the add member on each iteration. ideone.com/tz1Tll

    Mar 22, 2013 at 17:24

  • 6

When I try this on a list of lists I get: TypeError: unhashable type: 'list'

    Mar 11, 2014 at 14:28

  • 7

Your solution is not the fastest one. In Python 3 (I did not test 2) this is faster (300k-entry list: 0.045s yours vs 0.035s this one): seen = set(); return [x for x in lines if x not in seen and not seen.add(x)]. I could not measure any speed benefit from the seen_add line you used.

    Oct 24, 2014 at 16:39


  • 3

@user136036 Please link to your tests. How many times did you run them? seen_add is an improvement, but timings can be affected by system resources at the time. Would be interested to see full timings

    – jamylak

    May 20, 2015 at 6:22

  • 10

    To anyone who is writing Python code, you really should think twice before sacrificing readability and commonly-agreed Python conventions just to squeeze out a few more nanoseconds per loop. Testing with and without seen_add = seen.add yields only a 1% increase in speed. It’s hardly significant.

    – sleblanc

    May 14, 2019 at 3:25


583

The best solution varies by Python version and environment constraints:

Python 3.7+ (and most interpreters supporting 3.6, as an implementation detail):

Insertion-ordered dicts were first introduced in PyPy 2.5.0, adopted in CPython 3.6 as an implementation detail, and made a language guarantee in Python 3.7. So plain dict is insertion-ordered, and it's even more efficient than collections.OrderedDict (which is also implemented in C as of CPython 3.5). The fastest solution, by far, is also the simplest:

>>> items = [1, 2, 0, 1, 3, 2]
>>> list(dict.fromkeys(items))  # Or [*dict.fromkeys(items)] if you prefer
[1, 2, 0, 3]

Like list(set(items)) this pushes all the work to the C layer (on CPython), but since dicts are insertion ordered, dict.fromkeys doesn’t lose ordering. It’s slower than list(set(items)) (takes 50-100% longer typically), but much faster than any other order-preserving solution (takes about half the time of hacks involving use of sets in a listcomp).
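If you want to check the relative costs on your own data, here is a rough timeit sketch (the data size and repeat count are arbitrary; the ratios vary by Python version and machine):

from timeit import timeit

items = list(range(10_000)) * 2  # arbitrary test data with duplicates

def via_set():       # baseline; does not preserve order
    return list(set(items))

def via_dict():      # order-preserving on CPython 3.6+ / any Python 3.7+
    return list(dict.fromkeys(items))

def via_listcomp():  # order-preserving set-in-a-listcomp hack
    seen = set()
    return [x for x in items if not (x in seen or seen.add(x))]

for fn in (via_set, via_dict, via_listcomp):
    print(fn.__name__, timeit(fn, number=200))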

Important note: The unique_everseen solution from more_itertools (see below) has some unique advantages in terms of laziness and support for non-hashable input items; if you need these features, it’s the only solution that will work.

Python 3.5 (and all older versions if performance isn’t critical)

As Raymond pointed out, in CPython 3.5 where OrderedDict is implemented in C, ugly list comprehension hacks are slower than OrderedDict.fromkeys (unless you actually need the list at the end – and even then, only if the input is very short). So on both performance and readability the best solution for CPython 3.5 is the OrderedDict equivalent of the 3.6+ use of plain dict:

>>> from collections import OrderedDict
>>> items = [1, 2, 0, 1, 3, 2]
>>> list(OrderedDict.fromkeys(items))
[1, 2, 0, 3]

On CPython 3.4 and earlier, this will be slower than some other solutions, so if profiling shows you need a better solution, keep reading.

Python 3.4 and earlier, if performance is critical and third-party modules are acceptable

As @abarnert notes, the more_itertools library (pip install more_itertools) contains a unique_everseen function that is built to solve this problem without any unreadable (not seen.add) mutations in list comprehensions. This is the fastest solution too:

>>> from more_itertools import unique_everseen
>>> items = [1, 2, 0, 1, 3, 2]
>>> list(unique_everseen(items))
[1, 2, 0, 3]

Just one simple library import and no hacks.

The module adapts the itertools unique_everseen recipe, which looks like:

from itertools import filterfalse  # itertools.ifilterfalse on Python 2

def unique_everseen(iterable, key=None):
    "List unique elements, preserving order. Remember all elements ever seen."
    # unique_everseen('AAAABBBCCDAABBB') --> A B C D
    # unique_everseen('ABBCcAD', str.lower) --> A B C D
    seen = set()
    seen_add = seen.add
    if key is None:
        for element in filterfalse(seen.__contains__, iterable):
            seen_add(element)
            yield element
    else:
        for element in iterable:
            k = key(element)
            if k not in seen:
                seen_add(k)
                yield element

but unlike the itertools recipe, it supports non-hashable items (at a performance cost; if all elements in iterable are non-hashable, the algorithm becomes O(n²), vs. O(n) if they’re all hashable).
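For example, both of these work with the more_itertools version (the inputs are purely illustrative):

>>> from more_itertools import unique_everseen
>>> list(unique_everseen([[1, 2], [3], [1, 2]]))     # unhashable items fall back to a slower path
[[1, 2], [3]]
>>> list(unique_everseen('ABBCcAD', key=str.lower))  # dedupe by a key function
['A', 'B', 'C', 'D']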

Important note: Unlike all the other solutions here, unique_everseen can be used lazily; the peak memory usage will be the same (eventually, the underlying set grows to the same size), but if you don't listify the result and just iterate it instead, you'll be able to process unique items as they're found, rather than waiting until the entire input has been deduplicated before processing the first unique item.
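A small sketch of that lazy use (the data source here is hypothetical; the point is that nothing is materialized up front):

from more_itertools import unique_everseen

def noisy_source():
    # stand-in for a large or slow stream of items
    yield from [1, 2, 1, 3, 2, 4]

for item in unique_everseen(noisy_source()):
    print(item)  # each unique item is handled as soon as it is first seen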

Python 3.4 and earlier, if performance is critical and third-party modules are unavailable

You have two options:

  1. Copy and paste the unique_everseen recipe into your code and use it per the more_itertools example above

  2. Use ugly hacks to allow a single listcomp to both check and update a set to track what’s been seen:

    seen = set()
    [x for x in seq if x not in seen and not seen.add(x)]
    

    at the expense of relying on the ugly hack:

     not seen.add(x)
    

    which relies on the fact that set.add is an in-place method that always returns None so not None evaluates to True.

Note that all of the solutions above are O(n) (save calling unique_everseen on an iterable of non-hashable items, which is O(n²), while the others would fail immediately with a TypeError), so all solutions are performant enough when they’re not the hottest code path. Which one to use depends on which versions of the language spec/interpreter/third-party modules you can rely on, whether or not performance is critical (don’t assume it is; it usually isn’t), and most importantly, readability (because if the person who maintains this code later ends up in a murderous mood, your clever micro-optimization probably wasn’t worth it).

16

  • 5

    Converting to some custom kind of dict just to take keys? Just another crutch.

    – Nakilon

    Jun 14, 2013 at 13:40

  • 7

@Nakilon I don't really see how it's a crutch. It doesn't expose any mutable state, so it's very clean in that sense. Internally, Python sets are implemented with dict() (stackoverflow.com/questions/3949310/…), so basically you're just doing what the interpreter would've done anyway.

    – Imran

    Jun 18, 2013 at 6:58


  • 1

    @EMS That doesn’t preserve order. You could just as well do seen = set(seq).

    Sep 10, 2013 at 0:59

  • 1

This solution is much slower than the mentioned "hack"; for my list of 300k entries it's over 50x slower.

    Oct 24, 2014 at 16:23

  • 1

    @CommuSoft I agree, although practically it’s almost always O(n) due to the super highly unlikely worst case

    – jamylak

    May 20, 2015 at 5:22

149

In CPython 3.6+ (and all other Python implementations starting with Python 3.7+), dictionaries are ordered, so the way to remove duplicates from an iterable while keeping it in the original order is:

>>> list(dict.fromkeys('abracadabra'))
['a', 'b', 'r', 'c', 'd']

In Python 3.5 and below (including Python 2.7), use OrderedDict. My timings show that on Python 3.5, where OrderedDict gained a C implementation, this is both the fastest and shortest of the various approaches; prior to 3.5 it's still the clearest solution, though not the fastest.

>>> from collections import OrderedDict
>>> list(OrderedDict.fromkeys('abracadabra'))
['a', 'b', 'r', 'c', 'd']

5

  • 8

The only gotcha is that the iterable's elements must be hashable – it would be nice to have an equivalent for iterables with arbitrary elements (such as a list of lists)

    May 31, 2018 at 12:37

  • Insertion order iteration over a dict provides functionality that services more use-cases than removing duplicates. For example, scientific analyses rely on reproducible computations which non-deterministic dict iteration doesn’t support. Reproducibility is a major current objective in computational scientific modeling, so we welcome this new feature. Although I know it’s trivial to build with a deterministic dict, a high-performance, deterministic set() would help more naive users develop reproducible codes.

    – Arthur

    Jan 9, 2019 at 23:01


  • What about using [*dict.fromkeys('abracadabra')] (unpacking) instead of calling the function list(...)? In my tests this is faster, although only very small differences can be detected. So I’m not sure if this is just a coincidence.

    – colidyre

    Jun 24, 2020 at 12:57

  • 1

@colidyre Yes, that would work. The small speed difference is likely due to the unpacking operator not having to look up a builtin function. There is a matter of clarity to be considered as well.

    Jun 24, 2020 at 21:50

  • @RaymondHettinger: The lookup cost was small (got smaller with 3.8’s LOAD_GLOBAL); main advantage was avoiding constructor code paths (requiring construction of a tuple for args and passing NULL pointer as kwargs dict, then calling both the mostly empty __new__ and the __init__ separately, the latter of which then has to go through generalized argument parsing code, all to pass 0-1 positional arguments). As of 3.9 though, list() bypasses most of that via the vectorcall protocol, reducing the incremental benefit from 60-70 ns (3.8.5) to 20-30 ns (3.10.0) on my machine.

    Dec 27, 2021 at 18:23