Title: String Representation in Thuban
Author: Bernhard Herzog <[email protected]>
Last-Modified: $Date$
Version: $Revision$


Introduction

Thuban originally assumed that text is represented by byte strings
encoded in ISO-8859-1 (latin-1). This is problematic when the
default encoding in the user's locale is not in fact latin-1 but,
e.g., UTF-8. The solution is to use a more flexible representation
that will also allow the switch to Unicode as the internal string
representation at some point.


Internal String Representation

Thuban has an internal string representation. All textual data read
by Thuban has to be converted to the internal representation. All
data written by Thuban has to be converted into whatever form is
used by the output device.

Thuban provides functions to convert between the internal
representation and other representations. For example,
internal_from_unicode converts from Unicode and should be used when
reading XML files, for instance, and unicode_from_internal handles
the conversion to Unicode.

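A minimal sketch of such a pair of conversion functions is shown
here. Only the two function names come from this document; the
signatures, the INTERNAL_ENCODING constant and the "replace" error
handling are assumptions made for illustration. The sketch is written
for Python 3, where bytes plays the role of the byte string and str
the role of the Unicode object.

```python
# Assumption: the internal representation is a byte string in one fixed
# encoding. latin-1 is only a placeholder value for this sketch.
INTERNAL_ENCODING = "latin-1"


def internal_from_unicode(text):
    """Convert a Unicode string to the internal byte-string form."""
    # "replace" substitutes "?" for characters the encoding cannot
    # represent instead of raising UnicodeEncodeError.
    return text.encode(INTERNAL_ENCODING, "replace")


def unicode_from_internal(data):
    """Convert an internal byte string back to a Unicode string."""
    return data.decode(INTERNAL_ENCODING, "replace")


# Example: text as it might arrive from an XML parser.
title = "Stra\u00dfe"
internal = internal_from_unicode(title)
assert unicode_from_internal(internal) == title
```

Keeping every conversion behind these two functions means that a
later change of the internal representation only has to touch their
bodies, not their callers.
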
The ultimate goal is to use Unicode objects as the internal string
representation. It will take a lot of work to get there because we
will have to find all the places where we need to make the
conversions. Therefore, for now, the internal representation will be
byte strings in the user's default encoding.

With byte strings and especially encodings like latin-1 we can get
by without doing all the conversions correctly because basically all
byte strings are valid latin-1 strings, even if they have the wrong
encoding. In those cases, the text may look strange, but there
won't be exceptions in most cases. With Unicode objects, exceptions
are much more likely. And in the end it's better to see some
incorrect characters than no data at all.

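The difference can be seen in a few lines of Python 3. This is only
an illustration of the argument above, not Thuban code; the sample
text is arbitrary.

```python
# The same word encoded in two different ways.
utf8_bytes = "Caf\u00e9".encode("utf-8")      # b'Caf\xc3\xa9'
latin1_bytes = "Caf\u00e9".encode("latin-1")  # b'Caf\xe9'

# Every byte value maps to a latin-1 character, so reading UTF-8 data
# as latin-1 cannot fail. The result is wrong but still usable.
misread = utf8_bytes.decode("latin-1")
assert misread == "Caf\u00c3\u00a9"

# The opposite mistake, reading latin-1 data with a strict UTF-8
# decoder, raises an exception and yields no text at all.
try:
    latin1_bytes.decode("utf-8")
except UnicodeDecodeError:
    strict_decoding_failed = True
else:
    strict_decoding_failed = False

assert strict_decoding_failed
```
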
All this boils down to the following steps:

1. Byte-Strings as Internal Representation

The internal representation is byte strings in the user's default
encoding as determined by the locale. The encoding is chosen so
that such byte strings can be passed to wxPython without problems.
This even works with Unicode builds if we take care to convert the
translated strings (wxGetTranslation returns Unicode objects in a
Unicode build).

If no suitable encoding can be determined, use latin-1. It might be
better to use ASCII instead, but latin-1 offers somewhat better
backwards compatibility with older Thuban versions.

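One way to pick the encoding with the latin-1 fallback described
above is sketched here for Python 3. The function names and the
FALLBACK_ENCODING constant are hypothetical; the calls into the
standard locale and codecs modules are real, but whether Thuban
determines the encoding this way is an assumption.

```python
import codecs
import locale

# Fallback named in the text above.
FALLBACK_ENCODING = "latin-1"


def usable_encoding(name, fallback=FALLBACK_ENCODING):
    """Return a codec name Python can use, or the fallback."""
    if not name:
        return fallback
    try:
        # codecs.lookup raises LookupError for unknown encodings and
        # returns the codec's canonical name otherwise.
        return codecs.lookup(name).name
    except LookupError:
        return fallback


def determine_internal_encoding():
    """Pick the internal encoding from the user's locale."""
    # False: report the preferred encoding without calling setlocale.
    return usable_encoding(locale.getpreferredencoding(False))


encoding = determine_internal_encoding()
```

Validating the name with codecs.lookup keeps an unusual locale
setting from causing a LookupError much later, at the first
conversion.
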
Start implementing the conversion functions and use them wherever
we have hard-coded conversions to latin-1. It's not necessary to
find all places where conversion has to be done at this point.
Since we're using byte strings in the user's default encoding, most
byte strings that are read by Thuban are already in the right form
and in most cases it's also the right form for output.


2. Implement the conversion wherever necessary

Start working toward Unicode as the internal representation. In
this phase, we need to find all places where conversion has to be
done. To help with this, there will be a command line option that
sets the internal representation to Unicode so that it's easy to
test.

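Such a switch could look roughly like the following Python 3 sketch.
The option name --internal-unicode, the program name and the
make_converters helper are all hypothetical; this document does not
specify what the option is called or how it is wired up.

```python
import argparse


def make_converters(use_unicode, encoding="latin-1"):
    """Return (internal_from_unicode, unicode_from_internal)."""
    if use_unicode:
        # Unicode is the internal form, so both conversions are the
        # identity function.
        return (lambda text: text), (lambda text: text)
    # Byte strings are the internal form: encode and decode.
    return (lambda text: text.encode(encoding, "replace"),
            lambda data: data.decode(encoding, "replace"))


parser = argparse.ArgumentParser(prog="thuban")
parser.add_argument("--internal-unicode", action="store_true",
                    help="use Unicode as the internal representation")

# Simulate starting the program with the test option enabled.
options = parser.parse_args(["--internal-unicode"])
internal_from_unicode, unicode_from_internal = make_converters(
    options.internal_unicode)
```

With the option set, any code path that still assumes byte strings
receives Unicode objects instead, which makes the missing conversions
easier to find during testing.
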
The most difficult areas for this are probably the various data
sources. Some of them -- dbf files for instance -- don't provide
any information about the encodings used.


3. Switch to Unicode

Finally, switch to Unicode as the internal string representation.
For this step it might be best to wait until Unicode builds of
wxPython are the default on the common platforms.