This class represents a simple stateless converter to and from UCS-4 for a single code point.
#include <util.hpp>
Public Member Functions | |
virtual | ~base_converter () |
virtual int | max_len () const |
Returns the maximal length that one Unicode code point can be converted to; for example, it is 4 for UTF-8, 2 for Shift-JIS, and 1 for ISO-8859-1.
virtual bool | is_thread_safe () const |
Returns true if calling the functions from_unicode, to_unicode, and max_len is thread safe.
virtual base_converter * | clone () const |
Creates a polymorphic copy of this object; usually called only if is_thread_safe() returns false.
virtual uint32_t | to_unicode (char const *&begin, char const *end) |
Converts a single character starting at begin and ending at most at end to a Unicode code point.
virtual uint32_t | from_unicode (uint32_t u, char *begin, char const *end) |
Converts a single code point u into the target encoding and stores it in the range [begin, end).
Static Public Attributes | |
static const uint32_t | illegal =utf::illegal |
This value should be returned when an illegal input sequence or code point is observed, for example if a UTF-32 code point is in the range reserved for UTF-16 surrogates or an invalid UTF-8 sequence is found.
static const uint32_t | incomplete =utf::incomplete |
This value is returned in the following cases: an incomplete input sequence was found, or the output buffer provided was too small for the complete output to be written.
This class represents a simple stateless converter to and from UCS-4 for a single code point.
This class is used to create a std::codecvt facet that converts between UTF-16/UTF-32 and the encoding supported by this converter.
Please note that this converter should be fully stateless. Fully stateless means it should never assume that it is called in any specific order on the text. Even if the encoding itself seems to be stateless, like windows-1255 or Shift-JIS, some encoders (most notably iconv) can actually compose several code points into one, or decompose them, when composite characters are found. So be very careful when implementing these converters for certain character sets.
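As an illustration of the statelessness requirement, here is a minimal sketch of an ISO-8859-1 converter. The `base_converter_sketch` type is a hypothetical stand-in that only mirrors the interface documented on this page so that the example compiles on its own; it is not the library class, and the numeric sentinel values are assumptions of this sketch.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in mirroring the documented interface (not the real class).
struct base_converter_sketch {
    static constexpr std::uint32_t illegal = 0xFFFFFFFFu;    // assumed value
    static constexpr std::uint32_t incomplete = 0xFFFFFFFEu; // assumed value
    virtual ~base_converter_sketch() {}
    virtual int max_len() const = 0;
    virtual bool is_thread_safe() const = 0;
    virtual std::uint32_t to_unicode(char const *&begin, char const *end) = 0;
    virtual std::uint32_t from_unicode(std::uint32_t u, char *begin, char const *end) = 0;
};

// ISO-8859-1: every byte maps to the code point with the same value.
// No member variables, so each call depends only on its arguments.
struct latin1_converter : base_converter_sketch {
    int max_len() const override { return 1; }
    bool is_thread_safe() const override { return true; }

    std::uint32_t to_unicode(char const *&begin, char const *end) override {
        if (begin == end) return incomplete;          // nothing to read
        return static_cast<unsigned char>(*begin++);  // consume one byte
    }

    std::uint32_t from_unicode(std::uint32_t u, char *begin, char const *end) override {
        if (u > 0xFF) return illegal;                 // not representable
        if (begin == end) return incomplete;          // no room for output
        *begin = static_cast<char>(static_cast<unsigned char>(u));
        return 1;                                     // one byte written
    }
};
```

Because the converter keeps no state between calls, it can be invoked on any character of the text in any order and still give the same answer.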
virtual | ~base_converter () | inline, virtual
virtual base_converter * | clone () const | inline, virtual
Creates a polymorphic copy of this object; usually called only if is_thread_safe() returns false.
References BOOST_ASSERT.
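virtual uint32_t | from_unicode (uint32_t u, char *begin, char const *end) | inline, virtual
Converts a single code point u into the target encoding and stores it in the range [begin, end).
If u is an invalid Unicode code point, or it cannot be mapped correctly to the represented character set, illegal should be returned.
If u can be converted to a sequence of bytes c1, ..., cN (1 <= N <= max_len()), then: if end - begin >= N, the bytes c1, ..., cN are written starting at begin and N is returned; if end - begin < N, incomplete is returned.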
References illegal, and incomplete.
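The contract above can be sketched for UTF-8 as follows. `utf8_from_unicode` is a hypothetical free function written for this example, not part of the library, and the sentinel values are assumed. It also assumes that an output range that is too small yields incomplete, which is what the description of incomplete suggests.

```cpp
#include <cassert>
#include <cstdint>

constexpr std::uint32_t illegal = 0xFFFFFFFFu;    // assumed sentinel value
constexpr std::uint32_t incomplete = 0xFFFFFFFEu; // assumed sentinel value

// Hypothetical helper following the from_unicode contract for UTF-8.
std::uint32_t utf8_from_unicode(std::uint32_t u, char *begin, char const *end) {
    // Surrogates and values beyond U+10FFFF are not valid code points.
    if (u > 0x10FFFF || (u >= 0xD800 && u <= 0xDFFF)) return illegal;

    // Number of bytes N needed for this code point (1 <= N <= 4).
    int n = u < 0x80 ? 1 : u < 0x800 ? 2 : u < 0x10000 ? 3 : 4;
    if (end - begin < n) return incomplete; // output range too small

    switch (n) {
    case 1:
        begin[0] = static_cast<char>(u);
        break;
    case 2:
        begin[0] = static_cast<char>(0xC0 | (u >> 6));
        begin[1] = static_cast<char>(0x80 | (u & 0x3F));
        break;
    case 3:
        begin[0] = static_cast<char>(0xE0 | (u >> 12));
        begin[1] = static_cast<char>(0x80 | ((u >> 6) & 0x3F));
        begin[2] = static_cast<char>(0x80 | (u & 0x3F));
        break;
    default:
        begin[0] = static_cast<char>(0xF0 | (u >> 18));
        begin[1] = static_cast<char>(0x80 | ((u >> 12) & 0x3F));
        begin[2] = static_cast<char>(0x80 | ((u >> 6) & 0x3F));
        begin[3] = static_cast<char>(0x80 | (u & 0x3F));
    }
    return static_cast<std::uint32_t>(n); // number of bytes written
}
```

For example, U+20AC with a four-byte buffer returns 3, while the same code point with a two-byte buffer returns incomplete.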
virtual bool | is_thread_safe () const | inline, virtual
Returns true if calling the functions from_unicode, to_unicode, and max_len is thread safe.
Rule of thumb: return true if this class's implementation uses only tables that never change, or is purely algorithmic like UTF-8, so that independent to_unicode and from_unicode calls share no mutable state. Otherwise, for example if you use an iconv_t descriptor or a UConverter as the conversion object, return false, and this object will be cloned for each use.
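The interplay between is_thread_safe() and clone() might look roughly like the sketch below. The types `converter_sketch`, `shared_table` and `handle_based`, and the helper `for_this_use`, are all hypothetical names invented for this example; how the library itself decides when to clone is not shown on this page.

```cpp
#include <cassert>
#include <memory>

// Hypothetical stand-in exposing only the two members discussed here.
struct converter_sketch {
    virtual ~converter_sketch() {}
    virtual bool is_thread_safe() const = 0;
    virtual converter_sketch *clone() const = 0;
};

// Uses only immutable lookup tables: safe to share between threads.
struct shared_table : converter_sketch {
    bool is_thread_safe() const override { return true; }
    shared_table *clone() const override { return new shared_table(*this); }
};

// Imagine this wraps a mutable handle such as an iconv_t descriptor.
struct handle_based : converter_sketch {
    bool is_thread_safe() const override { return false; }
    handle_based *clone() const override { return new handle_based(*this); }
};

// Hypothetical helper: share the prototype when that is safe,
// otherwise hand out a private polymorphic copy.
std::shared_ptr<converter_sketch>
for_this_use(std::shared_ptr<converter_sketch> const &proto) {
    if (proto->is_thread_safe()) return proto;
    return std::shared_ptr<converter_sketch>(proto->clone());
}
```

Note that each derived class overrides clone() so that the copy keeps its dynamic type.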
virtual int | max_len () const | inline, virtual
Returns the maximal length that one Unicode code point can be converted to; for example, it is 4 for UTF-8, 2 for Shift-JIS, and 1 for ISO-8859-1.
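One plausible use of this value is sizing an output buffer for the worst case before converting. The helper below is a hypothetical illustration, not a library function; it takes the result of max_len() as a plain integer.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: upper bound on the bytes needed to encode
// `code_points` code points when each needs at most `max_len` bytes.
std::size_t worst_case_bytes(std::size_t code_points, int max_len) {
    return code_points * static_cast<std::size_t>(max_len);
}
```

With the figures quoted above, ten code points need at most 40 bytes in UTF-8, 20 in Shift-JIS and 10 in ISO-8859-1.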
virtual uint32_t | to_unicode (char const *&begin, char const *end) | inline, virtual
Converts a single character starting at begin and ending at most at end to a Unicode code point.
If a valid input sequence is found in [begin, code_point_end) such that begin < code_point_end && code_point_end <= end, it is converted to its Unicode code point equivalent and begin is set to code_point_end.
If an incomplete input sequence is found in [begin, end), i.e. there may be a code_point_end such that code_point_end > end and [begin, code_point_end) would be a valid input sequence, then incomplete is returned and begin stays unchanged. For example, for UTF-8 conversion, *begin = 0xc2 with begin + 1 = end is such a situation.
If an invalid input sequence is found, i.e. there is a sequence [begin, code_point_end) with code_point_end <= end that is illegal for this encoding, illegal is returned and begin stays unchanged. For example, if *begin = 0xFF and begin < end for UTF-8, then illegal is returned.
References illegal, and incomplete.
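The three outcomes described for to_unicode can be sketched for UTF-8 as follows. `utf8_to_unicode` is a hypothetical free function for this example and the sentinel values are assumed. The function works on a local copy of the pointer and writes back to begin only on success, which is one simple way to keep begin unchanged when incomplete or illegal is returned.

```cpp
#include <cassert>
#include <cstdint>

constexpr std::uint32_t illegal = 0xFFFFFFFFu;    // assumed sentinel value
constexpr std::uint32_t incomplete = 0xFFFFFFFEu; // assumed sentinel value

// Hypothetical helper following the to_unicode contract for UTF-8.
std::uint32_t utf8_to_unicode(char const *&begin, char const *end) {
    if (begin == end) return incomplete;
    char const *p = begin; // local cursor; begin is untouched until success
    unsigned char lead = static_cast<unsigned char>(*p++);

    std::uint32_t cp;
    int trail; // number of continuation bytes expected
    if (lead < 0x80)                       { cp = lead;        trail = 0; }
    else if (lead >= 0xC2 && lead <= 0xDF) { cp = lead & 0x1F; trail = 1; }
    else if (lead >= 0xE0 && lead <= 0xEF) { cp = lead & 0x0F; trail = 2; }
    else if (lead >= 0xF0 && lead <= 0xF4) { cp = lead & 0x07; trail = 3; }
    else return illegal; // stray continuation byte, 0xC0, 0xC1, or 0xF5..0xFF

    for (int i = 0; i < trail; ++i) {
        if (p == end) return incomplete; // sequence runs past end
        unsigned char c = static_cast<unsigned char>(*p++);
        if ((c & 0xC0) != 0x80) return illegal; // not a continuation byte
        cp = (cp << 6) | (c & 0x3F);
    }

    // Reject overlong encodings, surrogates, and values beyond U+10FFFF.
    static const std::uint32_t min_cp[] = {0x0, 0x80, 0x800, 0x10000};
    if (cp < min_cp[trail] || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return illegal;

    begin = p; // success: advance past the consumed sequence
    return cp;
}
```

Feeding it the single byte 0xC2 returns incomplete, and the single byte 0xFF returns illegal, matching the two examples in the text.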
static const uint32_t | illegal | static
This value should be returned when an illegal input sequence or code point is observed, for example if a UTF-32 code point is in the range reserved for UTF-16 surrogates or an invalid UTF-8 sequence is found.
Referenced by from_unicode(), and to_unicode().
static const uint32_t | incomplete | static
This value is returned in the following cases: an incomplete input sequence was found, or the output buffer provided was too small for the complete output to be written.
Referenced by from_unicode(), and to_unicode().
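A caller has to distinguish the two sentinel values from ordinary code points. The sketch below shows one possible decoding loop. The ASCII-only decoder `ascii_to_unicode`, the driver `decode_all`, and the policy of replacing illegal bytes with U+FFFD are all choices made for this example, and the sentinel values are assumed.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

constexpr std::uint32_t illegal = 0xFFFFFFFFu;    // assumed sentinel value
constexpr std::uint32_t incomplete = 0xFFFFFFFEu; // assumed sentinel value

// Hypothetical ASCII-only decoder following the to_unicode contract.
std::uint32_t ascii_to_unicode(char const *&begin, char const *end) {
    if (begin == end) return incomplete;
    unsigned char c = static_cast<unsigned char>(*begin);
    if (c > 0x7F) return illegal; // begin stays unchanged
    ++begin;
    return c;
}

// Hypothetical driver: decode everything, substituting U+FFFD for
// illegal bytes and stopping early if the input ends mid-character.
std::vector<std::uint32_t> decode_all(std::string const &in) {
    std::vector<std::uint32_t> out;
    char const *p = in.data();
    char const *end = p + in.size();
    while (p != end) {
        std::uint32_t cp = ascii_to_unicode(p, end);
        if (cp == incomplete) break; // need more input before continuing
        if (cp == illegal) {
            cp = 0xFFFD;             // replacement character
            ++p;                     // skip the offending byte ourselves
        }
        out.push_back(cp);
    }
    return out;
}
```

Because begin is left unchanged on illegal, the caller must advance past the bad byte itself; otherwise the loop would never terminate.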