Categories
.net c# character-encoding string

How do I get a consistent byte representation of strings in C# without manually specifying an encoding?

2325

How do I convert a string to a byte[] in .NET (C#) without manually specifying a specific encoding?

I’m going to encrypt the string. I can encrypt it without converting, but I’d still like to know why encoding comes into play here.

Also, why should encoding even be taken into consideration? Can’t I simply get what bytes the string has been stored in? Why is there a dependency on character encodings?

31

  • 27

    Every string is stored as an array of bytes, right? Why can’t I simply have those bytes?

    Jan 23, 2009 at 14:05

  • 147

    The encoding is what maps the characters to the bytes. For example, in ASCII, the letter ‘A’ maps to the number 65. In a different encoding, it might not be the same. The high-level approach to strings taken in the .NET framework makes this largely irrelevant, though (except in this case).

    Apr 13, 2009 at 14:13

  • 22

    To play devil’s advocate: If you wanted to get the bytes of an in-memory string (as .NET uses them) and manipulate them somehow (e.g. CRC32), and NEVER EVER wanted to decode it back into the original string… it isn’t straightforward why you’d care about encodings or how you choose which one to use.

    – Greg

    Dec 1, 2009 at 19:47

  • 88

    Surprised no-one has given this link yet: joelonsoftware.com/articles/Unicode.html

    – Bevan

    Jun 29, 2010 at 2:57

  • 34

    A char is not a byte and a byte is not a char. A char is both a key into a font table and a lexical tradition. A string is a sequence of chars. (Words, paragraphs, sentences, and titles also have their own lexical traditions that justify their own type definitions, but I digress.) Like integers, floating point numbers, and everything else, chars are encoded into bytes. There was a time when the encoding was a simple one-to-one mapping: ASCII. However, to accommodate all of human symbology, the 256 possible values of a byte were insufficient, and encodings were devised to selectively use more bytes.

    – George

    Aug 28, 2014 at 15:43


1922

Contrary to the answers here, you DON’T need to worry about encoding if the bytes don’t need to be interpreted!

Like you mentioned, your goal is, simply, to “get what bytes the string has been stored in”.
(And, of course, to be able to re-construct the string from the bytes.)

For those goals, I honestly do not understand why people keep telling you that you need the encodings. You certainly do NOT need to worry about encodings for this.

Just do this instead:

static byte[] GetBytes(string str)
{
    byte[] bytes = new byte[str.Length * sizeof(char)];
    System.Buffer.BlockCopy(str.ToCharArray(), 0, bytes, 0, bytes.Length);
    return bytes;
}

// Do NOT use on arbitrary bytes; only use on GetBytes's output on the SAME system
static string GetString(byte[] bytes)
{
    char[] chars = new char[bytes.Length / sizeof(char)];
    System.Buffer.BlockCopy(bytes, 0, chars, 0, bytes.Length);
    return new string(chars);
}

As long as your program (or other programs) don’t try to interpret the bytes somehow, which you obviously didn’t mention you intend to do, then there is nothing wrong with this approach! Worrying about encodings just makes your life more complicated for no real reason.

Additional benefit to this approach: It doesn’t matter if the string contains invalid characters, because you can still get the data and reconstruct the original string anyway!

It will be encoded and decoded just the same, because you are just looking at the bytes.

If you used a specific encoding, though, it would’ve given you trouble with encoding/decoding invalid characters.
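To illustrate the point about invalid characters, here is a minimal sketch, assuming C# 9 or later so that top-level statements compile. The string with a lone surrogate is a made-up example, and the raw copy is the same BlockCopy technique as GetBytes/GetString above, inlined:

```csharp
using System;
using System.Text;

// A lone high surrogate: allowed in a .NET string, but not well-formed UTF-16 text.
string original = "a\uD800z";

// Raw copy of the UTF-16 code units, then the reverse copy.
byte[] raw = new byte[original.Length * sizeof(char)];
Buffer.BlockCopy(original.ToCharArray(), 0, raw, 0, raw.Length);
char[] chars = new char[raw.Length / sizeof(char)];
Buffer.BlockCopy(raw, 0, chars, 0, raw.Length);
string viaBlockCopy = new string(chars);

// Round trip through UTF-8; the default encoder swaps the lone surrogate for U+FFFD.
string viaUtf8 = Encoding.UTF8.GetString(Encoding.UTF8.GetBytes(original));

Console.WriteLine(viaBlockCopy == original); // True
Console.WriteLine(viaUtf8 == original);      // False
Console.WriteLine(viaUtf8 == "a\uFFFDz");    // True
```

The raw copy preserves the string exactly, while the encoding round trip silently alters it.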

122

  • 270

    What’s ugly about this one is that GetString and GetBytes need to be executed on systems with the same endianness to work. So you can’t use this to get bytes you want to turn into a string elsewhere, and I have a hard time coming up with a situation where I’d want to use this.

    May 13, 2012 at 11:14

  • 72

    @CodeInChaos: Like I said, the whole point of this is if you want to use it on the same kind of system, with the same set of functions. If not, then you shouldn’t use it.

    May 13, 2012 at 18:00


  • 213

    -1 I guarantee that someone (who doesn’t understand bytes vs characters) is going to want to convert their string into a byte array, they will google it and read this answer, and they will do the wrong thing, because in almost all cases, the encoding IS relevant.

    Jun 15, 2012 at 11:07

  • 426

    @artbristol: If they can’t be bothered to read the answer (or the other answers…), then I’m sorry, then there’s no better way for me to communicate with them. I generally opt for answering the OP rather than trying to guess what others might do with my answer — the OP has the right to know, and just because someone might abuse a knife doesn’t mean we need to hide all knives in the world for ourselves. Though if you disagree that’s fine too.

    Jun 15, 2012 at 14:04


  • 202

    This answer is wrong on so many levels, but foremost because of its declaration “you DON’T need to worry about encoding!”. The 2 methods, GetBytes and GetString, are superfluous inasmuch as they are merely re-implementations of what Encoding.Unicode.GetBytes() and Encoding.Unicode.GetString() already do. The statement “As long as your program (or other programs) don’t try to interpret the bytes” is also fundamentally flawed, as implicitly they mean the bytes should be interpreted as Unicode.

    – David

    Jul 11, 2012 at 12:36
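The endianness and Encoding.Unicode points raised in these comments can be checked with a small sketch. It assumes C# 9 or later (top-level statements), uses a made-up sample string, and only covers well-formed text:

```csharp
using System;
using System.Linq;
using System.Text;

string text = "A\u00e9"; // 'A' followed by 'é'

// Raw copy of the in-memory UTF-16 code units (the BlockCopy approach).
byte[] raw = new byte[text.Length * sizeof(char)];
Buffer.BlockCopy(text.ToCharArray(), 0, raw, 0, raw.Length);

byte[] utf16le = Encoding.Unicode.GetBytes(text);          // always little-endian
byte[] utf16be = Encoding.BigEndianUnicode.GetBytes(text); // always big-endian

Console.WriteLine(BitConverter.ToString(utf16le)); // 41-00-E9-00
Console.WriteLine(BitConverter.ToString(utf16be)); // 00-41-00-E9

// The raw copy follows the byte order of the machine it runs on.
byte[] expected = BitConverter.IsLittleEndian ? utf16le : utf16be;
Console.WriteLine(raw.SequenceEqual(expected)); // True
```

For well-formed strings on a little-endian machine, the raw copy and Encoding.Unicode.GetBytes agree; the explicit encoding gives the same bytes on every machine.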

1136

It depends on which encoding you want the bytes in (ASCII, UTF-8, …).

For example:

byte[] b1 = System.Text.Encoding.UTF8.GetBytes (myString);
byte[] b2 = System.Text.Encoding.ASCII.GetBytes (myString);

A small sample why encoding matters:

string pi = "\u03a0";
byte[] ascii = System.Text.Encoding.ASCII.GetBytes (pi);
byte[] utf8 = System.Text.Encoding.UTF8.GetBytes (pi);

Console.WriteLine (ascii.Length); //Will print 1
Console.WriteLine (utf8.Length); //Will print 2
Console.WriteLine (System.Text.Encoding.ASCII.GetString (ascii)); //Will print '?'

ASCII defines only 128 characters, so it cannot represent Π, and the encoder substitutes ‘?’ instead.

Internally, the .NET framework uses UTF-16 to represent strings, so if you simply want to get the exact bytes that .NET uses, use System.Text.Encoding.Unicode.GetBytes (...).

See Character Encoding in the .NET Framework (MSDN) for more information.
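As a rough illustration of how the same character becomes different bytes, the sketch below prints the encoded form of Π under three encodings. It assumes C# 9 or later (top-level statements); the variable names are only for this example:

```csharp
using System;
using System.Text;

string pi = "\u03a0"; // GREEK CAPITAL LETTER PI

string ascii = BitConverter.ToString(Encoding.ASCII.GetBytes(pi));
string utf8 = BitConverter.ToString(Encoding.UTF8.GetBytes(pi));
string utf16 = BitConverter.ToString(Encoding.Unicode.GetBytes(pi));

Console.WriteLine(ascii); // 3F     (the '?' replacement character)
Console.WriteLine(utf8);  // CE-A0  (two-byte UTF-8 sequence)
Console.WriteLine(utf16); // A0-03  (one UTF-16 code unit, little-endian)
```

Only the UTF-8 and UTF-16 byte sequences can be decoded back to the original character.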

18

  • 15

    But, why should encoding be taken into consideration? Why can’t I simply get the bytes without having to see what encoding is being used? Even if it were required, shouldn’t the String object itself know what encoding is being used and simply dump what is in memory?

    Jan 23, 2009 at 13:48

  • 65

    .NET strings are always encoded as Unicode. So use System.Text.Encoding.Unicode.GetBytes(); to get the set of bytes that .NET would use to represent the characters. However, why would you want that? I recommend UTF-8, especially when most characters are in the Western Latin set.

    Jan 23, 2009 at 14:33

  • 8

    Also: the exact bytes used internally in the string don’t matter if the system that retrieves them doesn’t handle that encoding or handles it as the wrong encoding. If it’s all within .NET, why convert to an array of bytes at all? Otherwise, it’s better to be explicit with your encoding.

    Jan 23, 2009 at 15:42

  • 12

    @Joel, be careful with System.Text.Encoding.Default, as it could be different on each machine it is run on. That’s why it’s recommended to always specify an encoding, such as UTF-8.

    – Ash

    Jan 28, 2010 at 9:01

  • 26

    You don’t need the encodings unless you (or someone else) actually intend(s) to interpret the data, instead of treating it as a generic “block of bytes”. For things like compression, encryption, etc., worrying about the encoding is meaningless. See my answer for a way to do this without worrying about the encoding. (I might have given a -1 for saying you need to worry about encodings when you don’t, but I’m not feeling particularly mean today. :P)

    Apr 30, 2012 at 7:55


302

The accepted answer is very, very complicated. Use the included .NET classes for this:

const string data = "A string with international characters: Norwegian: ÆØÅæøå, Chinese: 喂 谢谢";
var bytes = System.Text.Encoding.UTF8.GetBytes(data);
var decoded = System.Text.Encoding.UTF8.GetString(bytes);

Don’t reinvent the wheel if you don’t have to…

10

  • 16

    In case the accepted answer gets changed, for record purposes, it is Mehrdad’s answer at this current time and date. Hopefully the OP will revisit this and accept a better solution.

    Sep 27, 2013 at 18:20

  • 9

    Good in principle, but the encoding should be System.Text.Encoding.Unicode to be equivalent to Mehrdad’s answer.

    – Jodrell

    Nov 25, 2014 at 9:08

  • 7

    The question has been edited umpteen times since the original answer, so maybe my answer is a bit outdated. I never intended to give an exact equivalent to Mehrdad’s answer, but rather a sensible way of doing it. But you might be right. However, the phrase “get what bytes the string has been stored in” in the original question is very imprecise. Stored where? In memory? On disk? If in memory, System.Text.Encoding.Unicode.GetBytes would probably be more precise.

    Nov 26, 2014 at 11:36

  • 8

    @AMissico, your suggestion is buggy unless you are sure your string is compatible with your system default encoding (a string containing only ASCII chars in your system default legacy charset). But nowhere does the OP state that.

    Apr 6, 2016 at 20:53

  • 6

    @AMissico It can cause the program to give different results on different systems though. That’s never a good thing. Even if it’s for making a hash or something (I assume that’s what OP means with ‘encrypt’), the same string should still always give the same hash.

    – Nyerguds

    Apr 22, 2016 at 10:33
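The point about hashes in the last comment can be sketched as follows. This is only an illustration, assuming C# 9 or later (top-level statements) and SHA-256 as a stand-in for whatever hash or cipher is actually used; the sample string is made up:

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

string input = "\u00c6\u00d8\u00c5"; // "ÆØÅ"

using SHA256 sha = SHA256.Create();

// Hash the same text through two explicitly chosen encodings.
string hashUtf8 = BitConverter.ToString(sha.ComputeHash(Encoding.UTF8.GetBytes(input)));
string hashUtf16 = BitConverter.ToString(sha.ComputeHash(Encoding.Unicode.GetBytes(input)));

// Hashing again with the same explicit encoding reproduces the same digest.
string hashUtf8Again = BitConverter.ToString(sha.ComputeHash(Encoding.UTF8.GetBytes(input)));

Console.WriteLine(hashUtf8 == hashUtf8Again); // True
Console.WriteLine(hashUtf8 == hashUtf16);     // False
```

The hash depends on the bytes, not the characters, so both sides need to agree on one explicit encoding; a machine-dependent choice such as Encoding.Default would not guarantee that.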