How to convert file encoding
In this post, I will introduce two ways to convert a file's encoding.
Method 1: use the Linux command iconv
iconv -f sjis -t utf-8 -o <output file> <input file>
The file <input file> is read in sjis encoding and rewritten to <output file> in utf-8 encoding.
Here is the explanation of the flags:
-f: from
-t: to
-o: output
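As a rough illustration of what this conversion does to the bytes, here is a minimal Python sketch. It is not how iconv is implemented. It assumes that Python's shift_jis codec corresponds to iconv's sjis; the helper name convert_encoding and the file names are hypothetical.

```python
import tempfile
from pathlib import Path


def convert_encoding(src, dst, from_enc="shift_jis", to_enc="utf-8"):
    """Decode src with from_enc and write it to dst encoded as to_enc."""
    # Read raw bytes so no implicit decoding or newline translation happens.
    raw = Path(src).read_bytes()
    text = raw.decode(from_enc)  # raises UnicodeDecodeError on invalid input
    Path(dst).write_bytes(text.encode(to_enc))


if __name__ == "__main__":
    # Hypothetical file names inside a throwaway directory.
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "input_sjis.txt"
        dst = Path(tmp) / "output_utf8.txt"
        src.write_bytes(b"\x82\xa0\n")  # "あ" plus a newline in Shift_JIS
        convert_encoding(src, dst)
        print(dst.read_bytes().hex(" "))  # prints: e3 81 82 0a
```

The from_enc and to_enc parameters play the roles of -f and -t, and dst plays the role of -o.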
To see the list of supported encodings:
iconv -l
where -l means list.
This method requires you to remember the flags and the command name, and, obviously, the iconv command has to be installed.
Method 2: use vim
This method has the following merits:
- Easily accomplished even on Windows, once vim is installed
- You can check the encoding interactively at each step
- Actually, vim uses iconv internally
To read a file in a specific encoding, e.g. sjis:
vim <file name>
:e ++enc=sjis
To write a file in a specific encoding (e.g. utf-8), regardless of which encoding was used when reading:
:w ++enc=utf-8
or
:set fenc=utf-8
:w
To check the list of supported encodings in vim:
:h encoding-values
Deeper explanation of the fenc (fileencoding) option in vim
Firstly, do not confuse it with the enc (encoding) option. The enc option is used internally and does not relate to how vim reads, writes, or interprets a file or buffer. Moreover, in neovim the enc option is effectively removed: its value is fixed to utf-8.
++enc is an option for the :e and :w commands. It has nothing to do with the enc option, except that their names happen to be identical.
fenc decides how vim interprets the file's content in order to display it in the terminal. A text file is stored on disk as a sequence of bytes, so unless you want to work with the raw bytes, vim needs an option (with a default value) that controls how it decodes those bytes for display.
Let's see an example:
There is a file a.txt whose content in hex is e3 81 82.
When it is opened with vim a.txt, vim uses utf-8 as the default fileencoding. It interprets (decodes) e3 81 82 as utf-8 and displays あ.
When the fenc option is changed to sjis, vim converts the content so that, once the new bytes are decoded as sjis, the displayed character あ does not change. The content to be written becomes 82 a0, and the buffer is marked as modified, which means you need to :write to store the converted content.
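The byte values in this example can be checked outside vim. The following is a minimal Python sketch, assuming that Python's shift_jis codec matches the sjis encoding used here; it only mirrors the decoding and re-encoding, not vim's behavior.

```python
# The three bytes stored in a.txt.
utf8_bytes = bytes.fromhex("e3 81 82")

# Decoding them as utf-8 yields the single character that vim displays (あ).
text = utf8_bytes.decode("utf-8")

# Re-encoding the same character as Shift_JIS gives the converted bytes
# (ignoring any trailing newline).
sjis_bytes = text.encode("shift_jis")

# The code point is printed instead of the character itself so that the
# output does not depend on the terminal's encoding.
print(f"U+{ord(text):04X}", sjis_bytes.hex(" "))  # prints: U+3042 82 a0
```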
If you re-open the converted file (82 a0), by default it is decoded as utf-8, in which these bytes are not a valid sequence. Thus, the text is garbled and displayed as <82>. To read the text correctly, you must specify the ++enc option via :e ++enc=sjis.
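Why the default decoding fails can be sketched in Python, again assuming that the shift_jis codec stands in for sjis. This shows only the decoding step; how vim renders the invalid bytes is not reproduced.

```python
sjis_bytes = bytes.fromhex("82 a0")  # "あ" encoded as Shift_JIS

# 0x82 is a continuation byte in utf-8 and cannot start a character,
# so a strict utf-8 decode rejects these bytes.
try:
    sjis_bytes.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False

# Decoding with the encoding the file was written in recovers the text,
# which is what :e ++enc=sjis asks vim to do.
recovered = sjis_bytes.decode("shift_jis")

print(valid_utf8, f"U+{ord(recovered):04X}")  # prints: False U+3042
```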
Note: when you try the above example, vim usually appends a 0a (newline) byte at the end of the file. This can be disabled by executing :set binary and :set noeol.