How to convert file encoding

In this post, I will introduce two ways to convert a file's encoding.

Method 1: use the Linux command iconv

iconv -f sjis -t utf-8 -o <output file> <input file>

The file <input file> is read as sjis and re-written to <output file> as utf-8.
Here is what each flag means:

  • -f: from
  • -t: to
  • -o: output
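As a quick way to try the command without touching real files, here is a minimal sketch. It assumes a POSIX shell with iconv, od and mktemp available; the file names input_sjis.txt and output_utf8.txt are made up for illustration, and output redirection is used in place of -o because not every iconv implementation provides that flag.

```shell
#!/bin/sh
# Work in a throwaway directory so no real files are touched.
workdir=$(mktemp -d)

# Create a sample sjis file holding the single character あ (bytes 82 a0).
# Octal escapes are used because POSIX printf does not guarantee \x escapes.
printf '\202\240' > "$workdir/input_sjis.txt"

# Read the file as sjis and write it back out as utf-8.
iconv -f sjis -t utf-8 "$workdir/input_sjis.txt" > "$workdir/output_utf8.txt"

# Dump the converted bytes as a compact hex string.
utf8_hex=$(od -An -tx1 "$workdir/output_utf8.txt" | tr -d ' \n')
echo "$utf8_hex"    # prints e38182, the utf-8 bytes of あ

rm -r "$workdir"
```

Checking the bytes with od is handy because a terminal may silently display the converted text wrongly even when the conversion itself succeeded.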

To see the list of supported encodings:

iconv -l

where -l means list.
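The list is long, so it can help to search it for one name. The sketch below is only an illustration: supports_encoding is a hypothetical helper, not part of iconv, and it assumes the names in the list are separated by spaces, commas, slashes or newlines, which differs between iconv implementations.

```shell
#!/bin/sh
# supports_encoding is a hypothetical helper, not something iconv provides.
# It splits the output of `iconv -l` into one name per line, then looks for
# a case-insensitive, whole-line match of the requested encoding name.
supports_encoding() {
    iconv -l | tr ' ,/' '\n\n\n' | grep -i -x -F -q -- "$1"
}

if supports_encoding sjis; then
    echo "sjis is supported"
else
    echo "sjis is not supported"
fi
```

The exit status of the helper can also be used directly in scripts, for example to abort early before attempting a conversion.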

This method requires you to remember the command name and its flags, and obviously, the iconv command has to be installed.

Method 2: use vim

This method has the following merits:

  • Easy to do even on Windows, once vim is installed
  • You can check the encoding interactively at each step
  • Actually, vim uses iconv internally

To read a file in a specific encoding (e.g. sjis):

vim <file name>
:e ++enc=sjis

To write a file in a specific encoding (e.g. utf-8), regardless of which encoding was used when reading it:

:w ++enc=utf-8

or

:set fenc=utf-8
:w

To see the list of supported encodings in vim:

:h encoding-values

A deeper explanation of the fenc (fileencoding) option in vim

Firstly, do not confuse fenc with the enc (encoding) option. The enc option is used internally and does not control the encoding in which vim reads or writes a file. Moreover, in neovim the enc option can no longer be changed: its value is fixed to utf-8.
++enc is an argument of the :e and :w commands. It has nothing to do with the enc option; the two merely happen to share a name.

fenc determines which encoding vim uses to convert between the bytes stored on disk and the text shown in the buffer. A text file is stored on disk as a sequence of bytes, so unless you want to work with the raw bytes, vim needs an option (with a default value) that tells it how to interpret those bytes when displaying them to you.

Let's see an example:

There is a file a.txt whose content, in hex, is e3 81 82.

When the file is opened with vim a.txt, vim uses utf-8 as the default fileencoding. It interprets (decodes) e3 81 82 as utf-8 and displays あ.

When the fenc option is changed to sjis, vim converts the content so that, when the new bytes are decoded as sjis, the displayed character あ does not change. The bytes to be written become 82 a0, and the buffer is marked as modified, which means you need to :write to store the converted content.

If you re-open the converted file (82 a0), by default it is decoded as utf-8, and these bytes are not a valid utf-8 sequence. Thus, the text is garbled and displayed as <82>. To read the text correctly, you must specify the ++enc option via :e ++enc=sjis.
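The byte values in this example can be checked outside vim. The following is a minimal sketch using iconv and od rather than vim itself; hex_of is a hypothetical helper introduced here, and the sketch assumes an iconv that accepts the name sjis and exits with a non-zero status on invalid input.

```shell
#!/bin/sh
# hex_of is a hypothetical helper that prints stdin as a compact hex string.
hex_of() {
    od -An -tx1 | tr -d ' \n'
}

# Take the utf-8 bytes of a.txt (e3 81 82, written in octal for portability)
# and convert them to sjis, the same conversion a write with fenc=sjis needs.
sjis_hex=$(printf '\343\201\202' | iconv -f utf-8 -t sjis | hex_of)
echo "as sjis: $sjis_hex"    # prints "as sjis: 82a0"

# Try to decode the sjis bytes (82 a0) as utf-8. The byte 82 cannot start a
# utf-8 sequence, so iconv is expected to report an error here.
if printf '\202\240' | iconv -f utf-8 -t utf-8 >/dev/null 2>&1; then
    decode_as_utf8=ok
else
    decode_as_utf8=failed
fi
echo "82 a0 decoded as utf-8: $decode_as_utf8"
```

The failed decode is the command-line counterpart of the garbled display in vim: the same bytes only make sense when the reader is told they are sjis.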

Note: when you try the above example, vim usually appends a 0a (newline) byte at the end of the file. This behavior can be disabled by executing :set binary and :set noeol.
