
How to elegantly one hot encode a series of lists in pandas

So I have the following data:

>>> test = pd.Series([['a', 'b', 'e'], ['c', 'a'], ['d'], ['d'], ['e']])
>>> test
0    [a, b, e]
1       [c, a]
2          [d]
3          [d]
4          [e]

I am trying to one-hot encode the values in these lists into the columns of a DataFrame, so that the result looks like this:

>>> pd.DataFrame([[1, 1, 0, 0, 1], [1, 0, 1, 0, 0],
...               [0, 0, 0, 1, 0], [0, 0, 0, 1, 0],
...               [0, 0, 0, 0, 1]],
...              columns=['a', 'b', 'c', 'd', 'e'])
   a  b  c  d  e
0  1  1  0  0  1
1  1  0  1  0  0
2  0  0  0  1  0
3  0  0  0  1  0
4  0  0  0  0  1

I have tried researching and I’ve found similar problems but none like this. I have attempted:

test.apply(pd.Series)

But that doesn’t accomplish the one-hot aspect. It simply unpacks each list into positional columns (0, 1, 2), padding the shorter lists with NaN. I’m sure I could figure out a lengthy solution, but I’d be glad to hear if there’s a more elegant way to do this.

Thanks!

EDIT: I am aware that I can iterate through my test series, create a column for each unique value found, and then iterate through test again, flagging those columns row by row. But that doesn’t seem very pandorable to me, and I’m sure there’s a more elegant way to do this.
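
For reference, the two-pass approach described in the edit could be sketched roughly as follows. This is only an illustration, assuming the labels are hashable and sortable; the names labels and manual are made up for the example.

```python
import pandas as pd

test = pd.Series([['a', 'b', 'e'], ['c', 'a'], ['d'], ['d'], ['e']])

# Pass 1: collect every distinct label that appears in any list.
labels = sorted({label for row in test for label in row})

# Pass 2: start from an all-zero frame and flag the labels present in each row.
manual = pd.DataFrame(0, index=test.index, columns=labels)
for idx, row in test.items():
    manual.loc[idx, row] = 1
```

It produces the desired frame, but the explicit Python loop is exactly the part a more idiomatic solution should avoid.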

MultiLabelBinarizer from scikit-learn is built for this kind of multi-label data and is typically more efficient than apply with pd.Series, so it should be preferred. Here’s a demo:

import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

test = pd.Series([['a', 'b', 'e'], ['c', 'a'], ['d'], ['d'], ['e']])

# fit_transform returns a binary indicator array; classes_ holds the sorted labels
mlb = MultiLabelBinarizer()
res = pd.DataFrame(mlb.fit_transform(test),
                   columns=mlb.classes_,
                   index=test.index)

Result

   a  b  c  d  e
0  1  1  0  0  1
1  1  0  1  0  0
2  0  0  0  1  0
3  0  0  0  1  0
4  0  0  0  0  1
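
If adding scikit-learn as a dependency is not an option, a pandas-only variant is possible with the string accessor. The sketch below is not part of the demo above; it assumes every label is a string and that no label contains the "|" separator, and the name dummies is illustrative.

```python
import pandas as pd

test = pd.Series([['a', 'b', 'e'], ['c', 'a'], ['d'], ['d'], ['e']])

# Join each list into one delimited string, then split that string
# back out into one indicator column per distinct label.
# Assumption: no label contains the "|" character.
dummies = test.str.join('|').str.get_dummies(sep='|')
```

This keeps the original index and orders the columns alphabetically, but the round trip through strings makes it a convenience rather than a performance choice.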