Categories
database database-design database-normalization

Is storing a delimited list in a database column really that bad?

402

Imagine a web form with a set of check boxes (any or all of them can be selected). I chose to save them as a comma-separated list of values stored in one column of the database table.

Now, I know that the correct solution would be to create a second table and properly normalize the database. It was quicker to implement the easy solution, and I wanted a proof of concept of the application quickly, without spending too much time on it.

I thought the saved time and simpler code were worth it in my situation. Is this a defensible design choice, or should I have normalized it from the start?

Some more context: this is a small internal application that essentially replaces an Excel file that was stored on a shared folder. I’m also asking because I’m thinking about cleaning up the program and making it more maintainable. There are some things in there I’m not entirely happy with, and one of them is the topic of this question.

3

  • 31

    In that case, why bother with a database? Saving to a file will do.

    – thavan

    Feb 12, 2013 at 9:31

  • 8

    Agreed with @thavan. Why even save the data for a proof of concept? Once you have the proof complete, then add a database correctly. You’re fine going lightweight for a proof of concept; just don’t make things you have to unmake later.

    Aug 9, 2013 at 14:10

  • 2

    In Postgres, an array column should be preferred over a comma separated list. That at least ensures the proper data type, has no problems with distinguishing the delimiter from the actual data and it can be efficiently indexed.

    Aug 30, 2019 at 6:32
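
(A minimal sketch of the array approach suggested in the comment above, assuming PostgreSQL and hypothetical table and column names; the normalized design discussed in the answers below remains the more general fix.)

    -- A typed array column instead of a delimited string (PostgreSQL only):
    CREATE TABLE survey_response (
        id      serial PRIMARY KEY,
        choices integer[] NOT NULL DEFAULT '{}'   -- real integers: no 'banana', no delimiter problems
    );

    -- A GIN index lets containment searches use the index instead of a full scan:
    CREATE INDEX idx_survey_response_choices
        ON survey_response USING gin (choices);

    -- Every response that includes choice 2:
    SELECT id FROM survey_response WHERE choices @> ARRAY[2];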

631

In addition to violating First Normal Form because of the repeating group of values stored in a single column, comma-separated lists have many other, more practical problems:

  • Can’t ensure that each value is the right data type: no way to prevent 1,2,3,banana,5
  • Can’t use foreign key constraints to link values to a lookup table; no way to enforce referential integrity.
  • Can’t enforce uniqueness: no way to prevent 1,2,3,3,3,5
  • Can’t delete a value from the list without fetching the whole list.
  • Can’t store a list longer than what fits in the string column.
  • Hard to search for all entities with a given value in the list; you have to use an inefficient table-scan. May have to resort to regular expressions, for example in MySQL:
    idlist REGEXP '[[:<:]]2[[:>:]]' or in MySQL 8.0: idlist REGEXP '\\b2\\b'
  • Hard to count elements in the list, or do other aggregate queries.
  • Hard to join the values to the lookup table they reference.
  • Hard to fetch the list in sorted order.
  • Hard to choose a separator that is guaranteed not to appear in the values.

To solve these problems, you have to write tons of application code, reinventing functionality that the RDBMS already provides much more efficiently.
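
To make that concrete, here is a minimal sketch of the normalized alternative, using hypothetical table and column names (entity, checkbox_option, entity_option); adapt them to your own schema:

    -- "entity" stands in for your existing main table,
    -- "checkbox_option" is the lookup table of checkbox choices, and
    -- "entity_option" replaces the comma-separated column.
    CREATE TABLE entity (
        entity_id INT PRIMARY KEY
    );

    CREATE TABLE checkbox_option (
        option_id INT PRIMARY KEY,
        label     VARCHAR(50) NOT NULL
    );

    CREATE TABLE entity_option (
        entity_id INT NOT NULL,
        option_id INT NOT NULL,
        PRIMARY KEY (entity_id, option_id),                             -- uniqueness: no 1,2,3,3,3,5
        FOREIGN KEY (entity_id) REFERENCES entity (entity_id),
        FOREIGN KEY (option_id) REFERENCES checkbox_option (option_id)  -- referential integrity: no 'banana'
    );

    -- "All entities with value 2" becomes an indexed lookup instead of a REGEXP scan:
    SELECT entity_id FROM entity_option WHERE option_id = 2;

    -- Fetch one entity's values with their labels, in sorted order:
    SELECT o.option_id, o.label
    FROM entity_option AS eo
    JOIN checkbox_option AS o ON o.option_id = eo.option_id
    WHERE eo.entity_id = 42
    ORDER BY o.option_id;

The composite primary key enforces uniqueness, the foreign keys enforce referential integrity and the correct data type, and the searches, counts, joins, and sorting all become ordinary indexed queries.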

Comma-separated lists are wrong enough that I made this the first chapter in my book: SQL Antipatterns, Volume 1: Avoiding the Pitfalls of Database Programming.

There are times when you need to employ denormalization, but as @OMG Ponies mentions, these are exceptional cases. Any non-relational “optimization” benefits one type of query at the expense of other uses of the data, so be sure you know which of your queries need to be treated so specially that they deserve denormalization.

17

  • 9

    An ARRAY (of any datatype) can fix the exception; just check PostgreSQL: postgresql.org/docs/current/static/arrays.html (@Bill: great book, a must-read for any developer or DBA)

    Nov 24, 2011 at 21:18


  • 1

    For PostgreSQL-specific discussion see dba.stackexchange.com/q/55871/7788. Comma-separated is just as awful, but an array field can be an acceptable performance optimisation under some circumstances if applied carefully and with consideration of the consequences.

    Dec 31, 2013 at 8:54

  • 1

    @CraigRinger, yes, it’s a type of denormalization. When used carefully, denormalization can be just the right thing to do for a certain query you are trying to optimize, but it must be done with full understanding that it harms other queries. If those other queries aren’t important to your application, then the pain is less.

    Sep 25, 2014 at 5:32

  • 3

    I know it’s not recommended, but playing devil’s advocate: most of these objections fall away if the UI handles uniqueness and data types (otherwise it would error or misbehave), the UI drops and recreates the value anyway, there is a driver table the values come from to keep them unique, a search like ‘%P%’ can be used when the values are P, R, S, T, counting doesn’t matter, and sorting doesn’t matter. Depending on the UI, the values can be split, e.g. to check checkboxes in a list from the driver table, without having to go to another table to get them.

    – jmcclure

    Sep 23, 2015 at 1:09

  • 3

    @PrabhuNandanKumar, I would store 174 rows in a second table that references your first table. Do not store 174 columns with similar data.

    Feb 28, 2018 at 16:33


48

“One reason was laziness”.

This rings alarm bells. The only reason you should do something like this is that you know how to do it “the right way” but you have come to the conclusion that there is a tangible reason not to do it that way.

Having said this: if the data you are choosing to store this way is data that you will never need to query by, then there may be a case for storing it in the way you have chosen.

(Some users would dispute the statement in my previous paragraph, saying that “you can never know what requirements will be added in the future”. These users are either misguided or stating a religious conviction. Sometimes it is advantageous to work to the requirements you have before you.)

1

  • 1

    I always hear some people saying that “my design is more flexible than yours” when I confront them about things like not setting up foreign key constraints, or storing lists in a single field. To me, flexibility (in such cases) == no discipline == laziness.

    Jan 30, 2015 at 3:28


43

There are numerous questions on SO asking:

  • how to get a count of specific values from the comma-separated list
  • how to get records that contain only the same 2/3/etc. specific values from that comma-separated list

Another problem with the comma-separated list is ensuring the values are consistent: storing text means the possibility of typos…

These are all symptoms of denormalized data, and highlight why you should always model for normalized data. Denormalization can be a query optimization, to be applied when the need actually presents itself.
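
For contrast, a minimal sketch using a hypothetical junction table entity_option(entity_id, option_id): once each value is its own row, those questions become short, ordinary queries.

    -- How many entities selected each value?
    SELECT option_id, COUNT(*) AS how_many
    FROM entity_option
    GROUP BY option_id;

    -- Which entities selected exactly the values 2 and 3, and nothing else?
    SELECT entity_id
    FROM entity_option
    GROUP BY entity_id
    HAVING COUNT(*) = 2
       AND SUM(CASE WHEN option_id IN (2, 3) THEN 1 ELSE 0 END) = 2;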