UTF8 Documentation

The UTF8 program is a tool for converting between UTF-8 byte sequences and Unicode code points. The term "byte sequence" here refers to a sequence of bytes intended to represent something, which in this case is a Unicode character. The byte sequence can also be described as the binary representation of the character, even though the byte sequence is considered text. The Unicode code point is the Unicode designation that uniquely identifies that character. The code point is not specific to UTF-8; the same code point applies across different encodings, such as UTF-16.

The idea behind the UTF8 program is to answer the question of what some special UTF-8 character is, or to provide a way to create the UTF-8 character from a given Unicode code point.
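The two directions described above can be illustrated with a minimal Python sketch. This is not the UTF8 program's implementation; the function names `to_codepoints` and `from_codepoint` are hypothetical, and the `U+XXXX` formatting is assumed from the examples in this document.

```python
def to_codepoints(data):
    """Decode UTF-8 bytes and label each character as U+XXXX."""
    return [f"U+{ord(ch):04X}" for ch in data.decode("utf-8")]


def from_codepoint(label):
    """Turn a 'U+XXXX' label into the UTF-8 byte sequence for that character."""
    return chr(int(label[2:], 16)).encode("utf-8")


euro = from_codepoint("U+20AC")
print(euro.hex())                            # e282ac (three bytes in UTF-8)
print(to_codepoints(euro))                   # ['U+20AC']
# The code point stays the same in UTF-16; only the bytes differ.
print("\u20ac".encode("utf-16-be").hex())    # 20ac
```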

This tool is intended to be scriptable, should handle both piped data and files, and can convert entire files.

This tool can be used to validate a given byte sequence or can be used to get the character width of some byte sequence or code point.

This tool can be used to store binary data in a text-friendly format and then restore the binary data.

Standard Parameters

Short	Long	Description
`-h`	`--help`	Print the help message.
`+d`	`++dark`	Output using colors that show up better on dark backgrounds.
`+l`	`++light`	Output using colors that show up better on light backgrounds.
`+n`	`++no_color`	Do not print using color.
`+Q`	`++quiet`	Decrease verbosity, silencing most output.
`+E`	`++error`	Decrease verbosity, using only error output.
`+N`	`++normal`	Set verbosity to normal.
`+V`	`++verbose`	Increase verbosity beyond normal output.
`+D`	`++debug`	Enable debugging, significantly increasing verbosity beyond normal output.
`+v`	`++version`	Print only the version number.

The +Q/++quiet parameter silences all output that is not the intent and purpose of the program. For example, the purpose of the utf8 program is to print the Unicode code point or the UTF-8 byte sequence, so +Q/++quiet does not suppress this output. The new line normally printed at the end of the program is, however, not printed. The +Q/++quiet parameter is ideal for scripting because it helps guarantee more consistent and controlled output.
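As a rough illustration of scripted use, the following Python sketch builds a command line with +Q and runs it only if the program is installed. The helper names `utf8_command` and `run_utf8` are hypothetical, and both the argument order and the assumption that the program is on the search path as `utf8` are guesses rather than documented behavior.

```python
import shutil
import subprocess


def utf8_command(text, quiet=True):
    """Build an argument list for the utf8 program (layout is an assumption)."""
    args = ["utf8"]
    if quiet:
        args.append("+Q")   # keep only the converted output, no trailing new line
    args.append(text)       # default mode: byte sequence in, code points out
    return args


def run_utf8(text):
    """Run utf8 if it is installed and return its raw output, else None."""
    if shutil.which("utf8") is None:
        return None
    done = subprocess.run(utf8_command(text), capture_output=True, text=True)
    return done.stdout
```

Because +Q omits the trailing new line, a script can in principle use the captured output without trimming it.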

The +n/++no_color parameter simplifies the output by omitting the special color character codes. These codes tend to take up a lot of extra space and may slow down printing.

Program Parameters

Short	Long	Description
`-b`	`--from_bytesequence`	The expected input format is byte sequence (character data).
`-c`	`--from_codepoint`	The expected input format is code point (such as U+0000).
`-f`	`--from_file`	Use the given file as the input source.
`-B`	`--to_bytesequence`	The output format is byte sequence (character data).
`-C`	`--to_codepoint`	The output format is code point (such as U+0000).
`-O`	`--to_combining`	The output format is to print whether or not a character is combining.
`-F`	`--to_file`	Use the given file as the output destination.
`-W`	`--to_width`	The output format is to print the width of a character (either 0, 1, or 2).
`-H`	`--headers`	Print headers for each section (pipe, file, or parameter).
`-S`	`--separate`	Separate characters by new lines (implied when printing headers).
`-s`	`--strip_invalid`	Strip invalid Unicode characters (do not print invalid sequences).
`-v`	`--verify`	Only perform verification of valid sequences.

This program establishes a pattern for some of the parameters. The parameters that represent a "from" use lower case short characters and the parameters that represent a "to" use upper case short characters. Short parameters that have both a "from" and a "to" form use the same character, differing only in case.

The default behavior is to assume the input is byte sequences given on the command line, which are output to the screen as code points.

Multiple input sources are allowed but only a single output destination is allowed.

When using the --verify parameter, no data is printed; the program returns 0 if the input is valid or 1 if it is invalid.
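The verification convention above (0 for valid, 1 for invalid), along with the idea behind --strip_invalid, can be sketched in Python. This relies on Python's own strict UTF-8 decoder, which may not match the UTF8 program's rules in every edge case; the names `verify` and `strip_invalid` are hypothetical.

```python
def verify(data):
    """Return 0 when data is valid UTF-8 and 1 otherwise; print nothing."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return 1
    return 0


def strip_invalid(data):
    """Decode data, silently dropping bytes that are not valid UTF-8."""
    return data.decode("utf-8", errors="ignore")


print(verify(b"\xe2\x82\xac"))     # complete three-byte sequence
print(verify(b"\xe2\x82"))         # truncated sequence
print(strip_invalid(b"a\xffb"))    # the stray 0xFF byte is dropped
```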

When using the --to_combining parameter together with the --to_width parameter, the 'C' character is printed to represent a combining character and digits are printed to represent widths. A combining character should be considered 1-width by itself or 0-width when combined.
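The combined 'C'-or-digit output can be approximated with Python's `unicodedata` module. This is only a sketch: the UTF8 program's own combining and width tables may classify characters differently, treating format characters as 0-width is a simplification, and the names `width_label` and `labels` are hypothetical.

```python
import unicodedata


def width_label(ch):
    """Return 'C' for a combining mark, otherwise '0', '1', or '2'."""
    category = unicodedata.category(ch)
    if category.startswith("M"):
        return "C"      # mark categories (Mn, Mc, Me) treated as combining
    if category == "Cf":
        return "0"      # format characters such as the zero-width space
    if unicodedata.east_asian_width(ch) in ("W", "F"):
        return "2"      # wide and fullwidth characters
    return "1"


def labels(text):
    """Label every character in text, one symbol per character."""
    return "".join(width_label(ch) for ch in text)


# 'e', a combining acute accent, then a wide CJK character.
print(labels("e\u0301\u6f22"))
```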