Doc/technotes/string_representation.txt

Title: String Representation in Thuban
Author: Bernhard Herzog <[email protected]>
Last-Modified: $Date$
Version: $Revision$


Introduction

    Thuban originally assumed that text is represented by byte-strings
    encoded in ISO-8859-1 (latin-1).  This is problematic when the
    default encoding in the user's locale is not in fact latin-1, but
    e.g. UTF-8.  The solution is to use a more flexible representation
    that will also allow the switch to Unicode as the internal string
    representation at some point.


Internal String Representation

    Thuban has an internal string representation.  All textual data read
    by Thuban has to be converted to the internal representation.  All
    data written by Thuban has to be converted into whatever form is
    used by the output device.

    Thuban provides functions to convert between the internal
    representation and other representations.  For example,
    internal_from_unicode converts from Unicode and should be used when
    reading XML files, and unicode_from_internal converts from the
    internal representation to Unicode.
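
    The two functions named above could look roughly like the following
    sketch.  It is not Thuban's actual implementation: the module-level
    variable, the "replace" error handling and the function bodies are
    assumptions made for illustration.  The sketch is written in Python 3
    terms, where bytes plays the role of the byte string and str the
    role of the Unicode object.

```python
# Sketch only: the function names come from the text above; the
# variable name, the default and the error handling are assumptions.
_internal_encoding = "latin-1"  # assumed default for this example


def internal_from_unicode(text):
    """Convert a Unicode string to the internal byte-string form."""
    # "replace" substitutes "?" for characters the encoding lacks,
    # instead of raising UnicodeEncodeError.
    return text.encode(_internal_encoding, "replace")


def unicode_from_internal(data):
    """Convert an internal byte string back to a Unicode string."""
    return data.decode(_internal_encoding, "replace")


# Example: text as delivered by an XML parser, which yields Unicode.
title = internal_from_unicode("Stra\u00dfe")
```

    Using "replace" here follows the idea that slightly wrong text is
    preferable to an exception; a stricter policy would be equally easy
    to plug in.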

    The ultimate goal is to use Unicode objects as the internal string
    representation.  It will take a lot of work to get there because we
    will have to find all the places where we need to make the
    conversions.  Therefore, for now, the internal representation will
    be byte strings in the user's default encoding.

    With byte strings, and especially with encodings like latin-1, we
    can get by without doing all the conversions correctly because
    every byte sequence is a valid latin-1 string, even if the data was
    actually written in a different encoding.  In those cases the text
    may look strange, but exceptions will be rare.  With Unicode
    objects, exceptions are much more likely.  And in the end it's
    better to see some incorrect characters than no data at all.
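
    The difference can be seen in a few lines of Python.  This is only
    an illustration of the argument above, written for Python 3; the
    sample name is made up.

```python
# Bytes holding "Mueller" with a u-umlaut, encoded as latin-1 (0xFC).
raw = b"M\xfcller"

# Every byte value maps to a latin-1 character, so this never raises.
as_latin1 = raw.decode("latin-1")

# The same bytes are not valid UTF-8, so a strict decode raises.
try:
    raw.decode("utf-8")
    utf8_failed = False
except UnicodeDecodeError:
    utf8_failed = True

# The reverse mistake: UTF-8 bytes read as latin-1 give odd-looking
# text (two characters instead of one), but no exception.
mojibake = "M\u00fcller".encode("utf-8").decode("latin-1")
```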

    All this boils down to the following steps:

    1. Byte-Strings as Internal Representation

    The internal representation is byte strings in the user's default
    encoding as determined by the locale.  The encoding is chosen so
    that such byte strings can be passed to wxPython without problems.
    This even works with Unicode builds if we take care to convert the
    translated strings (wxGetTranslation returns Unicode objects in a
    Unicode build).

    If no suitable encoding can be determined, use latin-1.  It might be
    better to use ASCII instead, but latin-1 offers somewhat better
    backwards compatibility with older Thuban versions.
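
    One way to pick the encoding with the latin-1 fallback is sketched
    below.  The function name is hypothetical and this is not
    necessarily how Thuban does it; the sketch relies only on the
    standard locale and codecs modules.

```python
import codecs
import locale


def guess_internal_encoding(fallback="latin-1"):
    """Return the locale's preferred encoding, or the fallback."""
    try:
        # False: report the current locale's encoding without
        # changing the process locale as a side effect.
        name = locale.getpreferredencoding(False)
        # codecs.lookup raises LookupError for names Python does not
        # know; .name is Python's normalized name for the codec.
        return codecs.lookup(name).name
    except (LookupError, TypeError, ValueError):
        return fallback


encoding = guess_internal_encoding()
```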

    Start implementing the conversion functions and use them wherever
    we have hard-coded conversions to latin-1.  It's not necessary to
    find all places where conversion has to be done at this point.
    Since we're using byte strings in the user's default encoding, most
    byte strings that are read by Thuban are already in the right form,
    and in most cases it's also the right form for output.


    2. Implement the conversion wherever necessary

    Start working toward Unicode as the internal representation.  In
    this phase, we need to find all places where conversion has to be
    done.  To help with this, there will be a command line option that
    sets the internal representation to Unicode so that it's easy to
    test.

    The most difficult areas for this are probably the various data
    sources.  Some of them -- dbf files for instance -- don't provide
    any information about the encodings used.
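
    For sources that do not declare an encoding, one possible approach
    is to try a list of candidate encodings and fall back to latin-1,
    which cannot fail.  The sketch below is a heuristic only: the
    function name and the candidate list are assumptions, and a
    successful decode does not prove that the guess was right.

```python
def decode_field(raw, candidates=("utf-8",), fallback="latin-1"):
    """Decode bytes from a source that does not declare its encoding."""
    for encoding in candidates:
        try:
            # Strict decoding: invalid input raises instead of guessing.
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    # latin-1 maps every byte to a character, so this always succeeds.
    return raw.decode(fallback)
```

    With the defaults, valid UTF-8 input is decoded as UTF-8 and
    anything else is treated as latin-1.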


    3. Switch to Unicode

    Finally, switch to Unicode as the internal string representation.
    For this step it might be best to wait until Unicode builds of
    wxPython are the default on the common platforms.
