Eric Bergman-Terrell's Blog

.NET Programming Tip: How to Determine the Encoding of a Unicode File
October 4, 2010

The StreamReader class allows you to read in Unicode text from a file without having to worry about the precise encoding:

...
StreamReader SR = new StreamReader(FileName, true);

String Contents = SR.ReadToEnd();

SR.Close();
...

For example, the above code works for Unicode files having the following Encodings: Encoding.BigEndianUnicode, Encoding.Unicode, and Encoding.UTF8. It also works if the file is encoded in Encoding.ASCII format.

The file's encoding is automatically detected because the StreamReader constructor's second argument (detectEncodingFromByteOrderMarks) is true.
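As a quick check of the detection described above, the sketch below writes the same string in three BOM-carrying encodings and reads each file back through the same constructor call. The class name, method name, and temp-file handling are made up for illustration and are not part of the original tip.

```csharp
using System;
using System.IO;
using System.Text;

public static class DetectionSketch
{
    // Writes text with the given encoding (BOM included), then reads it back,
    // letting StreamReader detect the encoding from the byte order mark.
    public static string RoundTrip(string text, Encoding encoding)
    {
        string path = Path.GetTempFileName();
        try
        {
            File.WriteAllText(path, text, encoding);

            using (StreamReader sr = new StreamReader(path, true))
            {
                return sr.ReadToEnd();
            }
        }
        finally
        {
            File.Delete(path);
        }
    }

    public static void Main()
    {
        string text = "caf\u00e9 \u03c0";
        Encoding[] encodings = { Encoding.BigEndianUnicode, Encoding.Unicode, Encoding.UTF8 };

        foreach (Encoding e in encodings)
        {
            // Reports whether the text survived the round trip unchanged.
            Console.WriteLine("{0}: {1}", e.EncodingName, RoundTrip(text, e) == text);
        }
    }
}
```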

There's no problem reading in Unicode text using the StreamReader. The problem is writing updated text back to the file with the original Encoding intact. For example, if your program reads text in Encoding.BigEndianUnicode format, it should write it back in the same format.

Unfortunately the StreamReader object doesn't make the original Encoding easy to get at. Be careful with the CurrentEncoding member: encoding detection doesn't happen until the first read, so before then it reports Encoding.UTF8 regardless of the text file's actual Encoding. It was always Encoding.UTF8 when I experimented with it.
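What CurrentEncoding reports depends on when you ask. The StreamReader documentation notes that detection is deferred until the first read call, so a small sketch like this one can show what your runtime reports before and after reading. The names here are hypothetical, and code pages are used only because they make the comparison easy.

```csharp
using System;
using System.IO;
using System.Text;

public static class CurrentEncodingSketch
{
    // Returns the code pages CurrentEncoding reports before and after the first read.
    public static int[] CodePagesBeforeAndAfterRead(string path)
    {
        using (StreamReader sr = new StreamReader(path, true))
        {
            int before = sr.CurrentEncoding.CodePage;
            sr.ReadToEnd();
            int after = sr.CurrentEncoding.CodePage;

            return new[] { before, after };
        }
    }

    public static void Main()
    {
        string path = Path.GetTempFileName();
        try
        {
            // Encoding.BigEndianUnicode writes a UTF-16 big-endian BOM.
            File.WriteAllText(path, "hello", Encoding.BigEndianUnicode);

            int[] pages = CodePagesBeforeAndAfterRead(path);
            Console.WriteLine("before read: {0}, after read: {1}", pages[0], pages[1]);
        }
        finally
        {
            File.Delete(path);
        }
    }
}
```

Either way, the value is only meaningful once something has been read, which is easy to get wrong. A separate helper that inspects the file directly avoids that ordering problem.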

So how can you use a StreamWriter to write back text read from a StreamReader, with the original encoding intact? Use the following code to determine the file's original encoding, and specify that encoding in the StreamWriter's constructor.

Unicode files typically start with a short prefix called a BOM (Byte Order Mark) that identifies the exact Encoding of the file: two bytes for the UTF-16 encodings (Encoding.Unicode and Encoding.BigEndianUnicode) and three bytes for Encoding.UTF8. GetFileEncoding() iterates through various Unicode Encoding values and compares the first bytes of the file with the current Encoding's BOM (returned by the GetPreamble() member). When a match is found, the corresponding Encoding value is returned. If no matches are found, the Encoding.Default value is returned.
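To make the BOM comparison concrete, this small sketch prints the preamble bytes for each Encoding that GetFileEncoding() checks. BomSketch and BomHex are names invented for the example.

```csharp
using System;
using System.Text;

public static class BomSketch
{
    // Formats an encoding's preamble (its BOM) as hyphen-separated hex bytes.
    public static string BomHex(Encoding encoding)
    {
        return BitConverter.ToString(encoding.GetPreamble());
    }

    public static void Main()
    {
        Console.WriteLine(BomHex(Encoding.BigEndianUnicode)); // FE-FF
        Console.WriteLine(BomHex(Encoding.Unicode));          // FF-FE
        Console.WriteLine(BomHex(Encoding.UTF8));             // EF-BB-BF
    }
}
```

Encoding.ASCII has an empty preamble, which is why an ASCII file never matches any of the three. If the list were ever extended with Encoding.UTF32, it would need to be checked before Encoding.Unicode, since its BOM (FF-FE-00-00) begins with the same two bytes.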

// Requires: using System.IO; using System.Text;

// Return the Encoding of a text file.  Return Encoding.Default if no Unicode
// BOM (byte order mark) is found.
public static Encoding GetFileEncoding(String FileName)
{
    Encoding Result = null;

    FileInfo FI = new FileInfo(FileName);

    FileStream FS = null;

    try
    {
        FS = FI.OpenRead();

        Encoding[] UnicodeEncodings = { Encoding.BigEndianUnicode, Encoding.Unicode, Encoding.UTF8 };

        for (int i = 0; Result == null && i < UnicodeEncodings.Length; i++)
        {
            FS.Position = 0;

            byte[] Preamble = UnicodeEncodings[i].GetPreamble();

            bool PreamblesAreEqual = true;

            for (int j = 0; PreamblesAreEqual && j < Preamble.Length; j++)
            {
                PreamblesAreEqual = Preamble[j] == FS.ReadByte();
            }

            if (PreamblesAreEqual)
            {
                Result = UnicodeEncodings[i];
            }
        }
    }
    catch (System.IO.IOException)
    {
        // The file couldn't be read; fall through and return Encoding.Default.
    }
    finally
    {
        if (FS != null)
        {
            FS.Close();
        }
    }

    if (Result == null)
    {
        Result = Encoding.Default;
    }

    return Result;
}
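
Putting the pieces together, here is a minimal sketch of the read, modify, write-back cycle that keeps the original encoding. To compile on its own it uses a condensed stand-in, DetectEncoding(), that follows the same BOM-matching idea as GetFileEncoding() above; in real code you would call GetFileEncoding() instead. The class name, method names, and sample text are assumptions made for illustration.

```csharp
using System;
using System.IO;
using System.Text;

public static class WriteBackSketch
{
    // Condensed stand-in for GetFileEncoding(): compare the file's first bytes
    // with each candidate's BOM and fall back to Encoding.Default.
    public static Encoding DetectEncoding(string fileName)
    {
        byte[] head = new byte[3];
        int read = 0;

        using (FileStream fs = File.OpenRead(fileName))
        {
            int n;
            while (read < head.Length && (n = fs.Read(head, read, head.Length - read)) > 0)
            {
                read += n;
            }
        }

        Encoding[] candidates = { Encoding.BigEndianUnicode, Encoding.Unicode, Encoding.UTF8 };

        foreach (Encoding candidate in candidates)
        {
            byte[] bom = candidate.GetPreamble();
            bool match = read >= bom.Length;

            for (int i = 0; match && i < bom.Length; i++)
            {
                match = head[i] == bom[i];
            }

            if (match)
            {
                return candidate;
            }
        }

        return Encoding.Default;
    }

    // Appends a line by rewriting the whole file in its original encoding.
    public static void AppendLine(string fileName, string line)
    {
        // Determine the encoding before the file is overwritten.
        Encoding original = DetectEncoding(fileName);

        string contents;
        using (StreamReader sr = new StreamReader(fileName, true))
        {
            contents = sr.ReadToEnd();
        }

        // Second argument false = overwrite rather than append.
        using (StreamWriter sw = new StreamWriter(fileName, false, original))
        {
            sw.Write(contents + line + Environment.NewLine);
        }
    }

    public static void Main()
    {
        string path = Path.GetTempFileName();
        try
        {
            File.WriteAllText(path, "first line" + Environment.NewLine, Encoding.BigEndianUnicode);
            AppendLine(path, "second line");

            // The rewritten file should still carry the UTF-16 big-endian BOM.
            Console.WriteLine(DetectEncoding(path).CodePage == Encoding.BigEndianUnicode.CodePage);
        }
        finally
        {
            File.Delete(path);
        }
    }
}
```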

Keywords: Unicode, Encoding, StreamReader, StreamWriter, BOM, Byte Order Mark, BigEndianUnicode, Encoding.Default, Encoding.ASCII, Encoding.Unicode, GetPreamble
